Should I use the DataSet class as a cache to speed up SQLite loading? (WinForms)

In my WinForms app I am loading a large amount of data from binary files into an SQLite DB. I do lookups on the DB to determine whether each new piece of data should be added, e.g.:
If LookUpResult Then
AddNewData
Else
DiscardNewData
End If
This is slow; the INSERTs seem to take a long time. Should I use the DataTable or DataSet classes to load the data into RAM and then write to the SQLite DB as a background task? Or would inserting many rows at a time from a DataTable object be less costly?
I was previously loading all the data into a custom class, and that was fast but likely to hit memory limits, hence the move to the DB. I would have thought SQLite would cache the INSERTs in RAM anyway, but that doesn't seem to be the case.
Thanks for any advice.

To ensure crash consistency, changed data is synchronized with the disk at the end of every transaction.
If you don't use an explicit transaction, each command gets its own automatic transaction, which means you pay that synchronization overhead for every single command.
Use one transaction for all your inserts.
(Your own uncommitted changes will still be visible to your lookups, as long as you go through the same DB connection.)
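For example, a minimal C# sketch of that advice using the System.Data.SQLite provider (the table, the Sample record type, and the connection string are made up; adapt them to your schema):

using System.Collections.Generic;
using System.Data;
using System.Data.SQLite; // assumed provider; Microsoft.Data.Sqlite is very similar

class Sample { public long Id; public byte[] Payload; } // placeholder record type

static void InsertAll(IEnumerable<Sample> records)
{
    using (var conn = new SQLiteConnection("Data Source=app.db"))
    {
        conn.Open();
        // One explicit transaction: the disk is synchronized once at Commit()
        // instead of once per INSERT.
        using (var tx = conn.BeginTransaction())
        using (var cmd = new SQLiteCommand(
            "INSERT INTO samples (id, payload) VALUES (@id, @payload)", conn, tx))
        {
            var id = cmd.Parameters.Add("@id", DbType.Int64);
            var payload = cmd.Parameters.Add("@payload", DbType.Binary);
            foreach (var r in records) // only the records that passed your lookup
            {
                id.Value = r.Id;
                payload.Value = r.Payload;
                cmd.ExecuteNonQuery(); // cheap inside the open transaction
            }
            tx.Commit(); // single synchronization for the whole batch
        }
    }
}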

Related

Possible bottlenecks when inserting and updating BYTEA rows?

We are required to store binary data in a PostgreSQL database (a project requirement). For that purpose we made a table with the following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data) and afterwards updated (concatenating additional binary data to the data field).
Simple tests have shown that this yields better performance.
Depending on your input, we will have the client use concurrent threads to insert/update data (with different DB connections), or a single thread with only one DB connection.
We don't have much experience with PostgreSQL, so could you help us with some pointers concerning possible bottlenecks, and whether using multiple threads to insert data is better than using a single thread?
Thank you :)
Edit 1:
More detailed information:
there will be only one client accessing the database, using only one Linux process
the database and client are on the same high-performance server, but this shouldn't matter; the client must be fast on any machine, without additional client configuration
we will get a new stream of data every 10 seconds; each stream provides 16,000 new bytes every 0.5 seconds (constant bitrate, but we can buffer and do inserts only every 4 seconds at most)
stream will last anywhere between 10 seconds and 5 minutes
It makes very little sense that you would get better performance by inserting a row and then appending to it if you are using bytea.
PostgreSQL's MVCC design means that an UPDATE is logically equivalent to a DELETE plus an INSERT. When you insert a row and then update it, the original tuple you inserted is marked as deleted, and a new tuple is written that contains the concatenation of the old and added data.
I question your testing methodology: can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
Beyond that, I think this question is too broad as written to say much of use. You've given no details or numbers: no estimates of binary data size, row counts, client counts, etc.
bytea insert performance is no different from any other insert performance tuning in PostgreSQL. All the same advice applies: batch work into transactions, use multiple concurrent sessions to insert data (but not too many; the rule of thumb is number_of_cpus + number_of_hard_drives), avoid having transactions use each other's data so you don't need UPDATE locks, and use async commit and/or a commit_delay if you don't have a disk subsystem with a safe write-back cache like a battery-backed RAID controller.
Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it'd require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.
With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
I haven't benchmarked different scenarios for this, so I can only speak in general terms. If designing this myself, I'd be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can so you should be OK.
Here's the first approach I'd test, benchmark and load-test, as it's in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay.
INSERT each 16 KB record into a staging table as it arrives (if possible UNLOGGED, or TEMPORARY if you can afford to lose incomplete records) and let Pg synchronize and group up the commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.
For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it's likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:
CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
INSERT INTO final_table SELECT stream_id, bytea_agg(data_block) FROM temp_stream_table;
You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you'd need to make sure it was being quite aggressively VACUUMed.
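The question's client is C++, but to illustrate the staging flow, here is a hedged C# sketch using Npgsql; the stream_blocks and final_table tables are made up, and it assumes the bytea_agg aggregate defined above has been created (ORDER BY inside an aggregate call needs PostgreSQL 9.0+):

using System.Collections.Generic;
using Npgsql;
using NpgsqlTypes;

static void ConsumeStream(int streamId, IEnumerable<byte[]> blocks)
{
    using (var conn = new NpgsqlConnection("Host=localhost;Database=capture"))
    {
        conn.Open();
        using (var insert = new NpgsqlCommand(
            "INSERT INTO stream_blocks (stream_id, seq, data_block) VALUES (@s, @n, @d)", conn))
        {
            insert.Parameters.AddWithValue("@s", streamId);
            var seq = insert.Parameters.Add("@n", NpgsqlDbType.Integer);
            var data = insert.Parameters.Add("@d", NpgsqlDbType.Bytea);
            int n = 0;
            foreach (var block in blocks) // one 16 KB block every 0.5 s
            {
                seq.Value = n++;
                data.Value = block;
                insert.ExecuteNonQuery(); // grouped up by commit_delay on the server
            }
        }
        // Stream ended: concatenate the blocks, in order, into the final row.
        // A real implementation would also clear the staging rows afterwards.
        using (var finish = new NpgsqlCommand(
            "INSERT INTO final_table (id, data) " +
            "SELECT stream_id, bytea_agg(data_block ORDER BY seq) " +
            "FROM stream_blocks WHERE stream_id = @s GROUP BY stream_id", conn))
        {
            finish.Parameters.AddWithValue("@s", streamId);
            finish.ExecuteNonQuery();
        }
    }
}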
See also:
What's the fastest way to do a bulk insert into Postgres?
How to speed up insertion performance in PostgreSQL

DB insert operations are slower with Parallel.ForEach than with a regular foreach

I have a function that loops through my list of objects, performs some checks on them, and saves them to a SQL Server DB.
The number of objects to save can be in the tens of thousands, so I decided to use the parallel framework to make it faster. I use Parallel.ForEach to do that.
This iteration runs in a BackgroundWorker thread.
It works, but I noticed it is slower than a 'normal' foreach. E.g. in my last test a list of my objects was processed in 2:41.35 minutes; a regular foreach does it in 2:14.92. The difference is nearly half a minute.
Is this common behavior for DB insert operations in the parallel framework? Is it advisable to use Parallel.ForEach for DB insert operations?
Thx!
VS2010/.Net4/c#
If you have a single database connection, trying to save objects in parallel is probably just adding synchronization overhead on the client side (the connection object). See the potential pitfalls of parallelism. Without knowing anything about your particular code, I would guess that a better approach would be to try to parallelize the checking only, and then do the saving (using the single db connection) sequentially. Of course, if your objects are mutable you need to make sure they can't change between checking and saving, but that's still true in the current approach.
BTW, are you saving all of these objects to the db in a single transaction or do you just have autocommit on? It depends on what kind of integrity constraints apply to your insertions, of course, but 2-3 minutes seems like a long time even for tens of thousands of rows. Using transactions wisely, if you aren't currently, is likely to get you a significant increase in performance.
Edited to add: Using SQLite on my desktop machine, I can insert 100 1-field rows in 5 seconds using autocommit. I can insert 10,000 1-field rows in 13 seconds if they're in a single transaction.
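To make the suggested split concrete, here is a hedged C# sketch: run the checks with Parallel.ForEach, then save sequentially over the single connection inside one transaction. MyObject, PassesChecks, and the table are placeholders.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

class MyObject { public string Name; } // placeholder entity

static bool PassesChecks(MyObject o) { return o != null && o.Name != null; } // placeholder checks

static void SaveAll(IEnumerable<MyObject> items, string connString)
{
    var toSave = new ConcurrentBag<MyObject>(); // note: does not preserve order
    Parallel.ForEach(items, item =>
    {
        if (PassesChecks(item)) toSave.Add(item); // checks run in parallel
    });

    using (var conn = new SqlConnection(connString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction())
        using (var cmd = new SqlCommand(
            "INSERT INTO MyTable (Name) VALUES (@name)", conn, tx))
        {
            var name = cmd.Parameters.Add("@name", SqlDbType.NVarChar, 256);
            foreach (var item in toSave) // sequential: one connection, one transaction
            {
                name.Value = item.Name;
                cmd.ExecuteNonQuery();
            }
            tx.Commit(); // one log flush instead of one per row
        }
    }
}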

What does "bulk load" mean?

Jumping from article to article, I see the expression "bulk loading" everywhere.
What does it really (technically) mean?
What does it imply?
Explanation based on use-cases is welcome.
Indexes are usually optimized for inserting rows one at a time. When you are adding a great deal of data at once, inserting rows one at a time may be inefficient. For instance, with a B-Tree, the optimal way to insert a single key is a very poor way of adding a bunch of data to an empty index.
Instead you pursue a different strategy with B-Trees: presort all of the data and group it into blocks, then build a new B-Tree by transforming the blocks into tree nodes. Although both techniques have the same asymptotic performance, O(n log n), the bulk-load operation has a much smaller constant factor.
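As a toy illustration of the bottom-up idea, here is a sketch that bulk-builds a plain binary search tree from presorted keys (a stand-in for a real B-Tree, which would pack many keys per node):

using System;

class Node // a real B-Tree node would hold many keys
{
    public int Key;
    public Node Left, Right;
}

static Node Build(int[] sorted, int lo, int hi)
{
    // The median of each range becomes a node, so the tree comes out
    // balanced in O(n) once the keys are sorted.
    if (lo > hi) return null;
    int mid = (lo + hi) / 2;
    return new Node
    {
        Key = sorted[mid],
        Left = Build(sorted, lo, mid - 1),
        Right = Build(sorted, mid + 1, hi)
    };
}

static void Demo()
{
    int[] keys = { 42, 7, 19, 3, 88, 55 };
    Array.Sort(keys);                            // presort: O(n log n)
    Node root = Build(keys, 0, keys.Length - 1); // build: O(n), no per-key descent
    Console.WriteLine(root.Key);                 // 19, the median of the sorted keys
}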
Bulk loading is a way to load data (typically into a database) in 'large chunks'. Where you might enter a customer or a purchase order or information about items in inventory one at a time into your system, bulk loading takes a file of this same sort of information and loads hundreds/thousands/millions of records in a short period of time.
If you convert from one kind of DBMS to another, you would hope not to enter all the information into the new DB from the old DB. Instead, you would dump the information from the old DB to a file in a format that can be easily read by the new DB and then import that data into the new DB.
That's what bulk loading entails (at the 35,000-foot level, anyway).
Bulk loading is used to import/export large amounts of data. Bulk operations are usually not logged, and transactional integrity might not work as expected. Bulk operations often bypass triggers and integrity checks like constraints. For large amounts of data, this improves performance quite significantly.
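For instance, in .NET a bulk load into SQL Server is typically done with SqlBulkCopy rather than row-by-row INSERT statements; a minimal sketch (the destination table is made up):

using System.Data;
using System.Data.SqlClient;

static void BulkLoad(DataTable rows, string connString)
{
    using (var bulk = new SqlBulkCopy(connString))
    {
        bulk.DestinationTableName = "dbo.Inventory";
        bulk.BatchSize = 10000;   // rows streamed to the server per batch
        bulk.WriteToServer(rows); // can be minimally logged under the right conditions
    }
}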
One thing to remember is that bulk loading implies that the data content in the target matches the source, but this is only true if the source system is quiesced. For any data source, and especially for large ones, the source data can change after it has been read, while the transfer is still happening. Traditionally, online systems either have to go offline or suspend updates if an exact point-in-time capture that matches the source is required.

ADO.NET asynchronous reader (queue processing)

I have a large table, 1B+ records, that I need to pull down and run an algorithm on every record. How can I use ADO.NET to execute a "select * from table" asynchronously and start reading the rows one by one while ADO.NET is receiving the data?
I also need to dispose of the records after I read them to save memory. So I am looking for a way to pull a table down record by record and basically shove each record into a queue for processing.
My data sources are Oracle and MSSQL. I have to do this for several data sources.
You should use SSIS for this.
You need a bit of background on how the ADO.NET data providers work to understand what you can and can't do. Let's take the SqlClient provider as an example. It is true that it is possible to execute queries asynchronously with BeginExecuteReader, but this asynchronous execution only lasts until the query starts returning results. At the wire level the SQL text is sent to the server, the server starts churning on the query execution, and eventually starts pushing result rows back to the client. As soon as the first packet comes back to the client, the asynchronous execution is done and the completion callback is executed. After that the client uses the SqlDataReader.Read() method to advance through the result set. There are no asynchronous methods in SqlDataReader.
This pattern works wonders for complex queries that return few results after some serious processing: while the server is busy producing the result, the client sits idle with no threads blocked. However, things are completely different for simple queries that produce large result sets (as seems to be the case for you): the server will immediately produce results and will continue to push them back to the client. The asynchronous callback will fire almost instantaneously, and the bulk of the time will be spent by the client iterating over the SqlDataReader.
You say you're thinking of placing the records into an in-memory queue first. What is the purpose of the queue? If your algorithm's processing is slower than the throughput of the DataReader iteration, this queue will start to build up. It will consume live memory and eventually exhaust the memory on the client. To prevent this you would have to build in a flow-control mechanism, i.e. if the queue size is bigger than N, don't put any more records into it. But to achieve this you would have to suspend the data reader iteration, and if you do that you push flow control back to the server, which will suspend the query until the communication pipe is available again (i.e. until you start reading from the reader again). Ultimately the flow control has to be propagated all the way to the server, which is always the case in any producer-consumer relation: the producer has to stop, otherwise intermediate queues fill up. Your in-memory queue serves no purpose at all, other than complicating things. You can simply process items from the reader one by one, and if your rate of processing is too slow, the data reader will cause flow control to be applied to the query running on the server. This happens automatically, simply because you don't call the DataReader.Read method.
To summarise: for large result sets you cannot do asynchronous processing, and there is no need for a queue.
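In code, the no-queue approach is just the plain reader loop. A sketch, with the query and ProcessRecord as placeholders:

using System.Data.SqlClient;

static void ProcessRecord(long id, object payload) { /* your algorithm here */ }

static void ProcessAll(string connString)
{
    using (var conn = new SqlConnection(connString))
    using (var cmd = new SqlCommand("SELECT id, payload FROM big_table", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            // Not calling Read() any faster than this is itself the back-pressure:
            // the server is throttled automatically when the client falls behind.
            while (reader.Read())
            {
                ProcessRecord(reader.GetInt64(0), reader.GetValue(1));
            }
        }
    }
}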
Now the difficult part.
Is your processing doing any sort of update back in the database? If yes, then you have much bigger problems:
You cannot use the same connection to write back the results, because it is busy with the data reader. SqlClient for SQL Server supports MARS, but that only solves the problem on SQL Server 2005/2008.
If you're going to enlist the read and the update in a transaction, and your updates occur on a different connection (see above), then this means using distributed transactions (even when the two connections involved point back to the same server). Distributed transactions are slow.
You will need to split the processing into several batches, because it is very bad to process 1B+ records in a single transaction. This also means that you must be able to resume processing of an aborted batch, which means you must be able to identify records that were already processed (unless processing is idempotent).
A combination of a DataReader and an iterator block (a.k.a. generator) should be a good fit for this problem. The default DataReaders provided by Microsoft pull data one record at a time from a datasource.
Here's an example in C#:
static IEnumerable<User> RetrieveUsers(DbDataReader reader)
{
    // Read() advances to the next row; NextResult() would skip to the
    // next result set, not the next row.
    while (reader.Read())
    {
        User user = new User
        {
            Name = reader.GetString(0),
            Surname = reader.GetString(1)
        };
        yield return user;
    }
}
A good approach to this would be to pull back the data in blocks: iterate through a block, add the rows to your queue, then call for the next block. This is going to be better than hitting the DB for each row. If you are pulling them back via a numeric PK this will be easy; if you need to order by something else you can use ROW_NUMBER() to do it.
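A rough sketch of that block-wise pattern keyed on a numeric PK (table and column names are made up):

using System.Data.SqlClient;

static void ReadInBlocks(string connString)
{
    long lastId = 0;
    while (true)
    {
        int count = 0;
        using (var conn = new SqlConnection(connString))
        using (var cmd = new SqlCommand(
            "SELECT TOP (10000) id, payload FROM big_table " +
            "WHERE id > @last ORDER BY id", conn))
        {
            cmd.Parameters.AddWithValue("@last", lastId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    lastId = reader.GetInt64(0); // each block resumes after the last key
                    // enqueue reader.GetValue(1) for processing here
                    count++;
                }
            }
        }
        if (count == 0) break; // no more rows
    }
}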
Just use the DbDataReader (just like Richard Nienaber said). It is a forward-only way of scrolling through the retrieved data. You don't have to dispose of your data because a DbDataReader is forward only.
When you use the DbDataReader it seems that the records are retrieved one by one from the database.
It is however slightly more complicated:
Oracle (and probably MySQL) will fetch a few hundred rows at a time to decrease the number of round trips to the database. You can configure the fetch size of the DataReader. Most of the time it will not matter whether you fetch 100 rows or 1000 rows per round trip. However, a very low value like 1 or 2 rows slows things down, because retrieving the data will then require too many round trips.
You probably don't have to set the fetch size manually, the default will be just fine.
Edit 1: See here for an Oracle example: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html
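If memory serves, the ODP.NET pattern from that article looks roughly like this (connection string and query are placeholders; RowSize is only populated after the statement executes):

using Oracle.DataAccess.Client; // ODP.NET; Oracle.ManagedDataAccess.Client is similar

static void ReadWithLargerFetch(string connString)
{
    using (var conn = new OracleConnection(connString))
    using (var cmd = new OracleCommand("SELECT id, payload FROM big_table", conn))
    {
        conn.Open();
        using (OracleDataReader reader = cmd.ExecuteReader())
        {
            // Fetch roughly 1000 rows per round trip instead of the default buffer.
            reader.FetchSize = cmd.RowSize * 1000;
            while (reader.Read())
            {
                // process one row at a time
            }
        }
    }
}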

Using a Cache Table in SQL Server, am I crazy?

I have an interesting dilemma. I have a very expensive query that involves several full table scans and expensive joins, as well as a call to a scalar UDF that calculates some geospatial data.
The end result is a resultset containing data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and return just the specified page, and I also need to take the entire original resultset and apply GROUP BYs, joins, etc. to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
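For illustration, the calling side might look roughly like this (the sproc name and its parameters are hypothetical): the first call passes NULL and receives a fresh cache id; subsequent calls for other pages or aggregates pass that id back in.

using System;
using System.Data;
using System.Data.SqlClient;

static DataTable GetPage(string connString, ref Guid? cacheId, int pageNumber)
{
    using (var conn = new SqlConnection(connString))
    using (var cmd = new SqlCommand("dbo.GetExpensiveResults", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@CacheId", (object)cacheId ?? DBNull.Value); // NULL => compute and cache
        cmd.Parameters.AddWithValue("@PageNumber", pageNumber);
        var outId = cmd.Parameters.Add("@NewCacheId", SqlDbType.UniqueIdentifier);
        outId.Direction = ParameterDirection.Output;

        var table = new DataTable();
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            table.Load(reader); // output params are populated once the reader closes
        }
        if (outId.Value != DBNull.Value)
            cacheId = (Guid)outId.Value; // reuse on subsequent calls
        return table;
    }
}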
This works great, and really speeds things up under zero-load testing. However, I am concerned that this technique may cause issues under load, with massive numbers of reads and writes against the cache table.
So, long story short: am I crazy? Or is this a good idea?
Obviously I need to be worried about lock contention and index fragmentation, but is there anything else to be concerned about?
I have done this before, especially when I did not have the luxury of editing the application. I think it's a valid approach sometimes, but in general having a cache/distributed cache in the application is preferred, because it reduces the load on the DB and scales better.
The tricky thing with the naive "just do it in the application" solution is that many times you have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution would be to do the paging in a cheaper manner, so you don't need to churn through ALL the data just to get page N. But sometimes that's not possible. Keep in mind that streaming data out of the DB can be cheaper than streaming it out of the DB and back into the same DB. You could introduce a new service that is responsible for executing these long queries, and have your main application talk to the DB via that service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than to try caching the table for every user.
