Efficiently retrieving large data objects from a PostgreSQL database

I am currently trying to find a way to read large geometry objects from my PostgreSQL/PostGIS database more efficiently. I've analyzed the query plan, and if I only pull back the primary key the run time is perfectly acceptable, but when I retrieve the sometimes very large geometry objects, the return time for a single query can run into minutes.
I am wondering whether there is a more efficient way to read large objects from the database than a plain query (perhaps some kind of streaming, where I can process the input as I retrieve it, to cut down on the effective processing time). I've looked into cursors, but I'm not sure that's really what I'm looking for, since they seem to be mostly used from PL/pgSQL rather than from a Java application.
I am running PostgreSQL 9.5; the application is written in Java/Scala and uses JDBC. Any help is appreciated.
EDIT - I should also add that the database is local to the machine running the application.
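For reference, the PostgreSQL JDBC driver supports cursor-based fetching, which gives the kind of streaming described above: with autocommit off, a forward-only result set and a non-zero fetch size, rows come back from the server in batches and can be processed as they arrive. A minimal sketch, with hypothetical table and column names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class GeometryStream {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/gisdb")) {
                conn.setAutoCommit(false);                       // required for cursor-based fetching
                try (PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, ST_AsBinary(geom) FROM parcels",
                         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                    ps.setFetchSize(50);                         // pull 50 rows per round trip
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong(1);
                            byte[] wkb = rs.getBytes(2);         // process each geometry as it streams in
                        }
                    }
                }
                conn.commit();
            }
        }
    }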

You should check at what stage Postgres is spending the most time:
executing the query
retrieving data from the hardware
sending data over the network
If you retrieve a lot of data, then a lot of time can be spent sending it over the network.
I have never tried this, but maybe you can use PostgreSQL's sslcompression connection option:
If set to 1 (default), data sent over SSL connections will be compressed (this requires OpenSSL version 0.9.8 or later). If set to 0, compression will be disabled (this requires OpenSSL 1.0.0 or later). This parameter is ignored if a connection without SSL is made, or if the version of OpenSSL used does not support it.
Compression uses CPU time, but can improve throughput if the network is the bottleneck. Disabling compression can improve response time and throughput if CPU performance is the limiting factor.
[https://www.postgresql.org/docs/9.2/static/libpq-connect.html]
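A cheap way to separate execution time from transfer time is to run the same statement under EXPLAIN ANALYZE, which executes the query on the server but does not ship the result rows to the client; roughly, the difference between that and your normal wall-clock time is transfer plus client-side processing. A small sketch (connection string and table name are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainCheck {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/gisdb");
                 Statement st = conn.createStatement();
                 // EXPLAIN (ANALYZE, BUFFERS) runs the query server-side and reports timings
                 ResultSet rs = st.executeQuery(
                     "EXPLAIN (ANALYZE, BUFFERS) SELECT geom FROM parcels WHERE id = 42")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));         // print the plan with actual times
                }
            }
        }
    }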

Related

Best way to return huge dataset where data can be analyzed asynchronously

This is a general question on handling huge sets of data returned for analysis.
I am using Python, but I don't think the programming language or database server type matters much for this.
I currently have a query that returns a huge set of data and takes about 20 minutes. I can use the data for analysis as soon as it arrives by using multiple threads/processes; the problem is waiting for the data.
I believe I could use paging for the data, but that still requires me to wait for it, which is really the problem.
I could break the query into many separate calls, so I could have 10 calls going at once, each grabbing a different part of the table via a WHERE clause.
Is there a better way? Again, I need to get the data from the table(s) to the application as fast as possible.
Also - when I query a database for data, what really determines the speed at which the data returns?
Let's say the db server and my app are on the same machine. Is there a limiting speed due to a single connection/request to the db server?
If I am on an intranet - will the db connection use up as much bandwidth as possible when sending?
Do you know of any intro tutorials on the performance of a database query (no WHERE clause - just return all rows)? Will the query use the maximum connection speed and bandwidth?
Thanks - yes, I am new to database considerations.
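The chunked approach described above is language-agnostic; the original poster uses Python, but here is one way it could look as a JDBC sketch with a thread pool (the connection string, table name and id range are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFetch {
        public static void main(String[] args) throws Exception {
            final String url = "jdbc:postgresql://localhost/mydb";   // hypothetical connection string
            final long maxId = 10_000_000L;                          // assumed id range
            final int chunks = 10;
            ExecutorService pool = Executors.newFixedThreadPool(chunks);
            List<Future<Integer>> results = new ArrayList<>();
            long step = maxId / chunks;
            for (int i = 0; i < chunks; i++) {
                final long lo = i * step;
                final long hi = (i == chunks - 1) ? maxId : lo + step;
                results.add(pool.submit(() -> {
                    int rows = 0;
                    // each chunk gets its own connection and its own WHERE range
                    try (Connection c = DriverManager.getConnection(url);
                         PreparedStatement ps = c.prepareStatement(
                             "SELECT * FROM measurements WHERE id >= ? AND id < ?")) {
                        ps.setLong(1, lo);
                        ps.setLong(2, hi);
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                rows++;                              // hand each row to the analysis workers here
                            }
                        }
                    }
                    return rows;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            pool.shutdown();
            System.out.println("Fetched " + total + " rows");
        }
    }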

Indexing about 300,000 triples in Sesame using Camel

I have a Camel context configured to do some manipulation of input data in order to build RDF triples.
There's a final route with a processor that, using Sesame Client API, talks to a separate Sesame instance (running on Tomcat with 3GB of RAM) and sends add commands (each command contains about 5 - 10 statements).
The processor runs as a singleton and the corresponding "from" endpoint has 10 concurrentConsumers (I tried with 1, then 5, then 10 - more or less the same behaviour).
I'm using an HTTPRepository from my processor for sending add commands and, while running, I observe a (rapid and) progressive degradation of indexing performance. The overall process starts indexing triples very quickly, but after a little while the number of committed statements grows very slowly.
On the Sesame side I used both a MemoryStore and a NativeStore, but the (performance) behaviour seems more or less the same.
The questions:
Which kind of store is recommended if I want to speed up the indexing phase?
Does Repository.getConnection do some kind of connection pooling? In other words, can I open and close a connection each time the "add" processor does its work?
Given that I first need to create a store with all those triples, is it preferable to create a "local" Sail store instead of having it managed by a remote Sesame server (so I wouldn't use an HTTPRepository)?
I am assuming that you're adding using transactions of 4 or 5 statements for good reason, but if you have a way to do larger transactions, that will significantly boost speed. Ideal (and quickest) would be to just send all 300,000 triples to the store in a single transaction.
Your questions, in order:
If you're only storing 300,000 statements the choice of store is not that important, as both the native and memory store can easily handle this kind of scale at good speed. I would expect the memory store to be slightly more performant, especially if you have configured it to use a non-zero sync delay for persistence, but the native store has a lower memory footprint and is of course more robust.
HTTPRepository.getConnection does not pool the actual RepositoryConnection itself, but it internally pools resources (so the actual HttpConnections that Sesame uses internally are pooled). So getConnection is relatively cheap, and opening and closing multiple connections is fine - though you might consider reusing the same connection for multiple adds, so that you can batch them in a single transaction.
Whether to store locally or on a remote server really depends on you. Obviously a local store will be quicker because you eliminate network latency as well as the cost of (de)serializing, but the downside is that a local store is not easily made available outside your own application.
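Coming back to the transaction-batching point above, a rough illustration of one large transaction with the Sesame 2.7+ API (the source of the statements is a placeholder for whatever the Camel route produces):

    import org.openrdf.model.Statement;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;

    public class BatchAdder {

        // statements: the triples produced by the Camel routes, accumulated into one batch
        // instead of being sent in groups of 5-10
        static void addBatch(Repository repo, Iterable<Statement> statements) throws Exception {
            RepositoryConnection con = repo.getConnection();
            try {
                con.begin();                   // one explicit transaction for the whole batch
                for (Statement st : statements) {
                    con.add(st);
                }
                con.commit();                  // a single commit instead of one per small add
            } catch (Exception e) {
                con.rollback();
                throw e;
            } finally {
                con.close();
            }
        }
    }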

Which approach and database to use in performance-critical solution

I have the following scenario:
Around 70 million pieces of equipment send a signal every 3-5 minutes to the server, reporting their id, status (online or offline), IP, location (latitude and longitude), parent node and some other information.
The other information might not be in a standard format (so no schema for me), but I still need to query it.
Equipment might disappear for some time (or forever), not sending signals in the process, so I need a way to "forget" equipment that has not sent a signal in the last X days. New equipment might also come online at any time.
I need to query all this data, e.g. knowing how many pieces of equipment are offline in a specific region or over an IP range. There won't be many queries running at the same time.
Some of the queries need to run fast (less than 3 minutes per query) while the database is being updated, so I need indexes on the main attributes (id, status, IP, location and parent node). The query results do not need to be 100% accurate; eventual consistency is fine as long as it doesn't take too long (more than 20 minutes on average) for changes to appear in the query results.
I don't need persistence at all; if the power goes out it's okay to lose everything.
Given all this, I thought of using a NoSQL approach, maybe MongoDB or CouchDB, since I have experience with MapReduce and JavaScript, but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or whether they are suited at all to handle this massive workload. I don't even know if I actually need a "traditional" database, since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problems I see are the following:
I need to insert/update lots of tuples really fast, and I don't know beforehand whether the signal I receive is already in the database. Almost all of the signals will be in the same state as last time, so maybe query by id and check whether the tuple changed; if not, do nothing, and if it did, update it?
Forgetting offline equipment. A batch job that runs during the night removing expired tuples would solve this problem.
There won't be many queries running at the same time, but they need to run fast. So I guess I need a cluster that performs a single query on multiple nodes (does CouchDB MapReduce split the workload across multiple nodes of the cluster?). I'm not entirely sure I need a cluster, though; could a single, more expensive machine handle all the load?
I have never used a NoSQL system before, but I have theoretical knowledge of the subject.
Does this make sense?
Apache Flume for collecting the signals.
It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Easy to configure and scale. Store the data in HDFS as files using Flume.
Hive for batch queries.
Map the data files in HDFS as external tables in the Hive warehouse. Write SQL-like queries using HiveQL whenever you need offline batch processing.
HBase for random real-time reads/writes.
Since HDFS, being a filesystem, lacks random read/write capability, you need a database to serve that purpose. Looking at your use case, HBase seems like a good fit to me. I would not suggest MongoDB or CouchDB, as you are not dealing with documents here, and both are document-oriented databases.
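For a sense of what the HBase write path could look like, here is a minimal sketch with the standard HBase Java client; the table name, column family and row-key scheme are assumptions, and since a Put simply overwrites the stored cell values, the "insert or update" question goes away:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SignalWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("equipment_signals"))) {  // hypothetical table
                Put put = new Put(Bytes.toBytes("equipment-12345"));                     // row key = equipment id
                put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("status"), Bytes.toBytes("online"));
                put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("ip"), Bytes.toBytes("10.0.0.7"));
                put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("last_seen"),
                              Bytes.toBytes(Long.toString(System.currentTimeMillis())));
                table.put(put);                                                          // overwrite = upsert
            }
        }
    }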
Impala for fast, interactive queries.
Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive it does not use MapReduce; instead it leverages the power of MPP, so it's good for near-real-time work. It's also easy to use, since it shares the same metadata, SQL syntax (HiveQL), ODBC driver, etc. with Hive.
HTH
Depending on the type of analysis, CouchDB, HBase or Flume may all be good choices. For strictly numeric "write-once" metrics data, Graphite is a very popular open-source solution.

Managing high-volume writes to SQL Server database

I have a web service that is used to manage files on a filesystem that are also tracked in a Microsoft SQL Server database. We have a .NET system service that watches for files that are added using the FileSystemWatcher class. When a file-added callback comes from FileSystemWatcher, metadata about the file is added to our database, and it works fairly well.
I've now come to a bit of a scalability problem. I'm adding large quantities of files to the filesystem in rapid succession, which ends up hammering the database with file adds and locks up my web front-end.
I have yet to work on database scalability issues, so I'm trying to come up with mitigation tactics. I was thinking of perhaps caching file adds and only writing them off to the database every five minutes or so, but I'm not sure how practical that is. This is data that needs to find its way into our database at some point anyway, so the database is going to get hammered eventually. Maybe I could limit the number of file DB entries written per second to a certain amount, but then I risk that amount being less than the rate at which files are added. How can I best tackle this?
Have you thought about using something like SQL Server Service Broker? That way you could push through tons of entries in a burst and it would level out the inserts into your database.
Basically you'd be pushing messages onto a queue which would then be consumed by a receiver stored procedure that would perform the insert for you. You could limit the maximum number of receivers executing to help with the responsiveness issues in your web interface.
There's a nice intro paper here. Although it's for 2005, not much has changed between 2005 and the newer versions of SQL Server.
You have a performance problem and you should approach it with a performance investigation methodology like Waits and Queues. Once you identify the actual problem, we can discuss solutions.
This is just a guess but, assuming the notification 'update metadata' code is a straightforward insert, the likely problem is that you're generating one transaction per notification. This results in commit flush waits; see Diagnosing Transaction Log Performance. Batch commit (aggregating multiple notifications before committing) is the canonical solution.
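To make the batch-commit idea concrete (the original service is .NET, but the pattern is the same; shown here as a JDBC sketch with a hypothetical FileMetadata table):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class MetadataBatchWriter {

        // Writes a whole batch of file notifications in one transaction, so the log is
        // flushed once per batch rather than once per file. Table/column names are invented.
        static void insertBatch(Connection conn, List<String> filePaths) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO FileMetadata (FilePath, AddedAt) VALUES (?, GETUTCDATE())")) {
                for (String path : filePaths) {
                    ps.setString(1, path);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();                 // one commit (and one log flush) for the whole batch
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }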
Another option is to use caching to absorb the high-volume writes, or to use a cluster to analyze the high-volume data.

How can I minimize the data in a SQL replication

I want to replicate data from a boat offshore to an onshore site. The connection is sometimes via a satellite link and can be slow and have a high latency.
Latency in our application is important, the people on-shore should have the data as soon as possible.
There is one table being replicated, consisting of an id, datetime and some binary data that may vary in length, usually < 50 bytes.
An application off-shore pushes data (hardware measurements) into the table constantly and we want these data on-shore as fast as possible.
Are there any tricks in MS SQL Server 2008 that can help decrease the bandwidth usage and the latency? Initial testing uses a bandwidth of 100 kB/s.
Our alternative is to roll our own data transfer and initial prototyping here uses a bandwidth of 10 kB/s (while transferring the same data in the same timespan). This is without any reliability and integrity checks so this number is artificially low.
You can try out different replication profiles or create your own. Different profiles are optimized for different network/bandwidth scenarios.
MSDN talks about replication profiles here.
Have you considered getting a WAN accelerator appliance? I'm too new here to post a link, but there are several available.
Essentially, the appliance on the sending end compresses the outgoing data, and the receiving end decompresses it, all on the fly and completely invisibly. This has the benefit of increasing the apparent speed of the traffic and not requiring you to change your server configurations. It should be entirely transparent.
I'd suggest on-the-fly compression/decompression outside of SQL Server. That is, SQL replicates the data normally, but something in the network stack compresses it so it's much smaller and more bandwidth-efficient.
I don't know of anything but I'm sure these exist.
Don't mess around with the SQL files directly. That's madness if not impossible.
Do you expect it to always be only one table that is replicated? Are there many updates, or just inserts? The replication is implemented by calling an insert/update sproc on the destination for each changed row. One cheap optimization is to force the sproc name to be small. By default it is composed from the table name, but IIRC you can force a different sproc name for the article. Given an insert of around 58 bytes for a row, saving 5 or 10 characters in the sproc name is significant.
I would guess that if you update the binary field it is typically a whole replacement? If that is incorrect and you might change only a small portion, you could roll your own diff-patching mechanism, perhaps a second table that contains a time series of byte changes to the originals. It sounds like a pain, but it could yield huge bandwidth savings depending on your workload.
Are the inserts generally done in logical batches? If so, you could store a batch of inserts as one customized blob in a replicated table, and have a secondary process that unpacks them into the final table you want to work with. This would reduce the overhead of these small rows flowing through replication.
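As a sketch of that packing idea (the row layout of id, timestamp and a short binary payload is assumed from the question; an unpacking job on the onshore side would mirror this with DataInputStream/GZIPInputStream):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    public class BatchPacker {

        public static class Row {
            final long id;
            final long timestampMillis;
            final byte[] payload;              // the <50-byte binary measurement
            Row(long id, long timestampMillis, byte[] payload) {
                this.id = id;
                this.timestampMillis = timestampMillis;
                this.payload = payload;
            }
        }

        // Packs a batch of small rows into one compressed blob, so replication moves a
        // single wide row instead of many ~58-byte rows.
        public static byte[] pack(List<Row> batch) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf))) {
                out.writeInt(batch.size());
                for (Row r : batch) {
                    out.writeLong(r.id);
                    out.writeLong(r.timestampMillis);
                    out.writeShort(r.payload.length);   // payloads are well under 32 KB
                    out.write(r.payload);
                }
            }
            return buf.toByteArray();          // store this as one varbinary column in the replicated table
        }
    }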
