I'm a Node.js newbie and was wondering which way is better for inserting a huge number of rows into a DB. On the surface, inserting things one at a time looks like the way to go because I can free the event loop quickly and serve other requests. But the code is harder to understand that way. For bulk inserts, I'd have to prepare the data beforehand, which would mean using loops for sure. That would cause fewer requests to be served during that period because the event loop is busy with the loop.
So, what's the preferred way? Is my analysis correct?
There's no right answer here. It depends on the details: why are you inserting a huge number of rows? How often? Is this just a one-time bootstrap or does your app do this every 10 seconds? It also matters what compute/IO resources are available. Is your app the only thing using the database or is blasting it with requests going to be a denial of service for other users?
Without the details, my rule of thumb would be a bulk insert with a small concurrency limit: fire off up to 10 inserts, then wait until one of them finishes before sending another insert command to the database. This follows the model of async.eachLimit. It is also how browsers handle concurrent requests to a given web site, and it has proven to be a reasonable default policy.
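To make that concrete, here is a minimal sketch of the limited-concurrency pattern. It is written in Python with asyncio purely for illustration, and insert_row and the row data are hypothetical placeholders, but the shape is the same as async.eachLimit: at most 10 inserts are in flight at any moment.

import asyncio

CONCURRENCY = 10  # at most 10 insert commands in flight at once

async def insert_row(row):
    # hypothetical placeholder for the real driver call (an INSERT through
    # whatever async database client you use)
    await asyncio.sleep(0)

async def insert_all(rows):
    sem = asyncio.Semaphore(CONCURRENCY)

    async def insert_one(row):
        async with sem:              # wait for a free slot, like async.eachLimit
            await insert_row(row)

    await asyncio.gather(*(insert_one(r) for r in rows))

asyncio.run(insert_all([{"id": i} for i in range(1000)]))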
In general, loops on in-memory objects should be fast, very fast.
I know you're worried about blocking the CPU, but you should be considering the total amount of work to be done. Sending items one at a time carries a lot of overhead. Each query to the DB has its own sequence of inner for loops that probably makes your "batching" for loop look pretty small.
If you need to dump 1000 things in the DB, the minimum amount of work you can do is to run this all at once. If you make it 10 batches of 100 "things", you have to do all of the same work + you have to generate and track all of these requests.
So how often are you doing these bulk inserts? If this is a regular occurrence, you probably want to minimize the total amount of work and bulk insert everything at once.
The trade-off here is logging and retries. It's usually not enough to just perform some type of bulk insert and forget about it. The bulk insert is eventually going to fail (fully or partially) and you will need some type of logic for retries or consolidation.
If that's a concern, you probably want to manage the size of the bulk insert so that you can retry blocks intelligently.
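As a rough illustration of that, the sketch below (Python) splits the rows into fixed-size chunks and retries each chunk a couple of times before recording it as failed; bulk_insert is a stand-in for whatever multi-row insert call your driver provides, and the chunk size and retry count are arbitrary.

def insert_in_chunks(rows, bulk_insert, chunk_size=500, retries=2):
    failed = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        for attempt in range(retries + 1):
            try:
                bulk_insert(chunk)               # one multi-row INSERT per chunk
                break
            except Exception as exc:
                if attempt == retries:
                    failed.append((start, exc))  # log it and move on
    return failed  # chunks that still need retries or consolidation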
While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and import this DBSession instance in all the requests (all the insert/update operations) that follow.
When I do this, my DB operations have the following structure:
with transaction manager:    # pseudocode: the Zope transaction manager scope
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # more create, read, update and delete operations
I do not commit, flush, or roll back anywhere, assuming my Zope transaction manager takes care of it for me (it commits at the end of the transaction or rolls back if it fails).
The second way, and the one most frequently mentioned on the web, is:
create a DBSession factory once, like
DBSession = sessionmaker(bind=engine)
and then create a session instance of this per transaction:
session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # more create, read, update and delete operations
            session.flush()
            session.commit()
        except:
            session.rollback()
session.close()
I do not understand which is better (in terms of memory usage, performance, and robustness) and why.
In the first method, I accumulate all the objects in the session and the commit happens at the end. For a bulky insert operation, does adding objects to the session put them in memory (RAM) or elsewhere? Where do they get stored and how much memory is consumed?
Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes the same time to execute. 100K rows of selects, inserts, and updates take about 25-30 minutes. Is there any way to reduce this?
Please point me in the right direction. Thanks in advance.
Here is a very generic answer, with the warning that I don't know much about Zope; just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate a Session when a database conversation begins, and you surely have several points in the application where different conversations begin. (I'm not sure from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries, on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per row is also not a good idea when processing a bulk file
It means you are asking the database to persist changes every time, for every row. Certainly too much. Try an intermediate number: commit every few hundred rows. But then it gets more complicated; one commit at the end of the file assures you that the file is either fully processed or not at all, while intermediate commits force you to take into account, when something fails, that your file is half processed, so you would have to reposition.
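To make the intermediate-commit idea concrete, here is a minimal sketch that reuses the DBSession factory and the file iterator from your second example; the batch size and the bare re-raise are assumptions you would adapt to your own error handling and repositioning logic.

BATCH_SIZE = 500              # an assumption; tune for your file and database

session = DBSession()
try:
    for i, item in enumerate(huge_file_of_million_rows, start=1):
        session.add(item)
        if i % BATCH_SIZE == 0:
            session.commit()  # persist a block; remember i so you can reposition on failure
    session.commit()          # commit the final partial batch
except Exception:
    session.rollback()
    raise
finally:
    session.close()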
As for the times you mention, it is very difficult to judge with the information you provide, without knowing your database and machine. Anyway, the order of magnitude of your numbers, a select+insert+update every 15 ms or so, probably plus commit, sounds pretty slow but is more or less in the expected range (again, it depends on the queries, the database, and the machine). If you have to insert so many rows frequently, you could consider other database solutions; it will depend on your scenario, and probably on dialect-specific features that may not be exposed by an ORM like SQLAlchemy.
The project requires storing binary data in a PostgreSQL database (a project requirement). For that purpose we made a table with the following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data) and after that updated (concatenating additional binary data to the data field).
Simple tests have shown that this yields better performance.
Depending on your input, we will make the client use concurrent threads to insert/update data (with different DB connections), or a single thread with only one DB connection.
We don't have much experience with PostgreSQL, so could you give us some pointers concerning possible bottlenecks, and whether using multiple threads to insert data is better than using a single thread?
Thank you :)
Edit 1:
More detailed information:
there will be only one client accessing the database, using only one Linux process
the database and the client are on the same high-performance server, but this shouldn't matter; the client must be fast no matter the machine, without additional client configuration
we will get a new stream of data every 10 seconds; each stream will provide 16,000 new bytes every 0.5 seconds (constant bit rate, but we can buffer and only do inserts every 4 seconds at most)
a stream will last anywhere between 10 seconds and 5 minutes
It makes very little sense that you would get better performance by inserting a row and then appending to it, if you are using bytea.
PostgreSQL's MVCC design means that an UPDATE is logically equivalent to a DELETE plus an INSERT. When you insert the row and then update it, what happens is that the original tuple you inserted is marked as deleted and a new tuple is written that contains the concatenation of the old and added data.
I question your testing methodology - can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
Beyond that, I think this question is too broad as written to really say much of use. You've given no details or numbers; no estimates of binary data size, rowcount estimates, client count estimates, etc.
bytea insert performance is no different from any other insert performance tuning in PostgreSQL. All the same advice applies: batch work into transactions; use multiple concurrent sessions (but not too many; a rule of thumb is number_of_cpus + number_of_hard_drives) to insert data; avoid having transactions use each other's data so you don't need UPDATE locks; and use async commit and/or a commit_delay if you don't have a disk subsystem with a safe write-back cache like a battery-backed RAID controller, etc.
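As a hedged example of the first two points (batching into transactions and asynchronous commit), here is a small Python/psycopg2 sketch; the connection string and the table name are made up, and commit_delay itself is a server setting you would adjust in postgresql.conf rather than from the client.

import psycopg2
from psycopg2.extras import execute_values

def bulk_insert(rows, batch_size=1000):
    # rows is assumed to be a list of (id, bytes) tuples; table name is made up
    conn = psycopg2.connect("dbname=example")      # hypothetical connection string
    cur = conn.cursor()
    cur.execute("SET synchronous_commit TO OFF")   # asynchronous commit, this session only
    for start in range(0, len(rows), batch_size):
        execute_values(cur,
                       "INSERT INTO blobs (id, data) VALUES %s",
                       rows[start:start + batch_size])
        conn.commit()                              # one transaction per batch, not per row
    cur.close()
    conn.close()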
Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it'd require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.
With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
I haven't benchmarked different scenarios for this, so I can only speak in general terms. If designing this myself, I'd be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can so you should be OK.
Here's the first approach I'd test, benchmark and load-test, as it's in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay.
INSERT each 16 kB record into a staging table as it comes in (if possible UNLOGGED, or TEMPORARY if you can afford to lose incomplete records) and let Pg synchronize and group up commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.
For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it's likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:
CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
INSERT INTO final_table SELECT stream_id, bytea_agg(data_block) FROM temp_stream_table;
You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you'd need to make sure it was being quite aggressively VACUUMed.
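If it helps, here is a rough Python/psycopg2 sketch of that staging-table flow; the connection string, the extra seq column for ordering, and the choice to concatenate in the client rather than with bytea_agg are all assumptions.

import psycopg2

conn = psycopg2.connect("dbname=example")        # hypothetical connection string
cur = conn.cursor()

def ingest_block(stream_id, seq, block):
    # one small row per incoming 16 kB block; temp_stream_table is the staging
    # table from the SQL above, with a seq column assumed for ordering
    cur.execute("INSERT INTO temp_stream_table (stream_id, seq, data_block) "
                "VALUES (%s, %s, %s)", (stream_id, seq, psycopg2.Binary(block)))
    conn.commit()

def finalize_stream(stream_id):
    # read the blocks back in order, concatenate in the client, write one row
    cur.execute("SELECT data_block FROM temp_stream_table "
                "WHERE stream_id = %s ORDER BY seq", (stream_id,))
    payload = b"".join(bytes(r[0]) for r in cur.fetchall())
    cur.execute("INSERT INTO final_table (stream_id, data) VALUES (%s, %s)",
                (stream_id, psycopg2.Binary(payload)))
    cur.execute("DELETE FROM temp_stream_table WHERE stream_id = %s", (stream_id,))
    conn.commit()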
See also:
What's the fastest way to do a bulk insert into Postgres?
How to speed up insertion performance in PostgreSQL
I know NHibernate isn't meant to do batch inserts, because it's about 5x slower than SqlBulkCopy, but I decided to use it for code simplicity.
However, my code's not 5x slower. It's 2400x slower. I'm inserting about 2500 records. I've turned off log4net logging. I'm running it in release mode. I'm not using an id generator (I'm specifying it in the code via an integer counter). I'm using a stateless session. I've set a batch size of 100 (I could go higher, but it doesn't seem to help). I tried adding the generator back in, but setting its class to "assigned".
I'm not inserting any child elements. I've confirmed the batch inserts are occurring.
Is it still calling SELECT SCOPE_IDENTITY()? But even if it is, that's still a ridiculous amount of time.
I don't do too many batch operations, so I can continue to use SqlBulkCopy for this process, but I'm concerned that my entire application could be running faster.
I don't have a license for NHProf, but I'm wondering if now is the time to download the trial.
I'm using NHibernate 3.3 GA with Syscache2 -- but again, I'm using a stateless session.
Any HBMs, configuration, or code you want to see? Suggestions?
Thanks
because it's about 5x slower than SqlBulkCopy
You must be joking.
NHibernate does individual inserts. Using handwritten batched inserts (i.e. more than one insert statement per command), something I do not think NHibernate does, I got around 400 inserts per second in a specific project.
Using SqlBulkCopy I got 75,000.
That is NOT a factor of 5, that is a factor of 187.
However, my code's not 5x slower. It's 2400x slower
I'm not an NHibernate specialist. Log the connection - I would assume NHibernate sends one insert per batch, which means a LOT of slow processing and is a LOT slower than what I did (see the beginning of my answer).
Where the heck did you get the 5x factor from? That is a false premise to start with.
I don't do too many batch operations, so I can continue to use SqlBulkCopy for this process, but I'm concerned that my entire application could be running faster.
Here is a reality check for you: you do not use an ORM when you need extreme select or insert speed. ORMs are there for business-rule-heavy objects - business objects. When you end up doing bulk inserts or reads, you DO NOT USE A FULL ORM. Simple as that.
When you think SqlBulkCopy is fast, check this:
* Multiple SqlBulkCopy instances running on multiple threads...
* ...inserting into temporary tables, and then
* ...using one INSERT INTO ... SELECT statement to copy the data to the final table.
Why? Because SqlBulkCopy has some bad locking behavior when used from multiple threads. This is how I got the numbers that high.
AND: 2,500 rows is a low count for SqlBulkCopy - the setup overhead is significant (i.e. before the first row goes in), so you will see less gain. I use 50k-row batches.
What is NHibernate doing on the wire level?
I've confirmed the batch inserts are occurring.
How? What do you consider a batch insert?
Are there triggers on the table?
Inserting 10,000 objects with a couple of long and string properties into a local MySQL database on my dev machine takes:
StatelessSession: 4.6 seconds
Session: 5.7 seconds
I have found that if you are doing many inserts, the first-level cache will clog up and soon slow things down to a crawl. You can try using a stateless session, or just open and close the Session periodically, e.g. every 5 inserts. You will of course lose things like transactions by closing the session.
But ultimately, I tend to use SqlBulkCopy if I have more than about 100 rows to insert, many times quicker.
I have a function that loops through my list of objects, makes some checks on them, and saves them to a SQL Server DB.
The number of objects I need to save can be tens of thousands, so I decided to use the parallel framework to make it faster. I use Parallel.ForEach to do that.
This iteration runs in a BackgroundWorker thread.
It works, but I noticed it is slower than a 'normal' foreach. E.g. in my last test a list of my objects was processed in 2:41.35 minutes, while the regular foreach did it in 2:14.92. The difference is nearly half a minute.
Is this common behavior for DB insert operations in the parallel framework? Is it advisable to use Parallel.ForEach for DB insert operations?
Thx!
VS2010/.Net4/c#
If you have a single database connection, trying to save objects in parallel is probably just adding synchronization overhead on the client side (the connection object). See the potential pitfalls of parallelism. Without knowing anything about your particular code, I would guess that a better approach would be to try to parallelize the checking only, and then do the saving (using the single db connection) sequentially. Of course, if your objects are mutable you need to make sure they can't change between checking and saving, but that's still true in the current approach.
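For what it's worth, the shape of that split looks like this; the example is Python rather than C# purely for illustration, check is a hypothetical predicate, conn is the single DB-API connection, and the table and column names are made up. (For pure-CPU checks in Python you would use a ProcessPoolExecutor instead.)

from concurrent.futures import ThreadPoolExecutor

def save_all(objects, conn, check):
    # run the checks concurrently...
    with ThreadPoolExecutor(max_workers=8) as pool:
        keep = [obj for obj, ok in zip(objects, pool.map(check, objects)) if ok]
    # ...then save sequentially over the single connection: no contention on conn
    cur = conn.cursor()
    for obj in keep:
        cur.execute("INSERT INTO items (payload) VALUES (?)", (obj,))
    conn.commit()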
BTW, are you saving all of these objects to the db in a single transaction or do you just have autocommit on? It depends on what kind of integrity constraints apply to your insertions, of course, but 2-3 minutes seems like a long time even for tens of thousands of rows. Using transactions wisely, if you aren't currently, is likely to get you a significant increase in performance.
Edited to add: Using SQLite on my desktop machine, I can insert 100 1-field rows in 5 seconds using autocommit. I can insert 10,000 1-field rows in 13 seconds if they're in a single transaction.
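That difference is easy to reproduce with Python's built-in sqlite3 module; the row counts mirror the ones above and the timings will of course vary by machine.

import os, sqlite3, tempfile, time

def timed_insert(rows, one_transaction):
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None puts the connection in autocommit mode (commit per INSERT)
    conn = sqlite3.connect(path, isolation_level="" if one_transaction else None)
    conn.execute("CREATE TABLE t (v INTEGER)")
    start = time.perf_counter()
    if one_transaction:
        with conn:                                   # one transaction around all inserts
            for v in range(rows):
                conn.execute("INSERT INTO t (v) VALUES (?)", (v,))
    else:
        for v in range(rows):                        # each INSERT commits on its own
            conn.execute("INSERT INTO t (v) VALUES (?)", (v,))
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print("autocommit, 100 rows:          ", timed_insert(100, one_transaction=False))
print("single transaction, 10000 rows:", timed_insert(10000, one_transaction=True))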
I've got a large conversion job: 299 GB of JPEG images, already in the database, to convert into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread safe SQLCLR function to do the business of re-sampling the images, lovely job.
Problem is, when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), this executes linearly to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat raid setup this database machine has? Is it to use a subquery in the FROM clause of the update statement? Is this all that is required to enable parallelism on this kind of operation?
Anyway, the operation is split into batches of around 4,000 images each (in a windowed query over about 391k images); this machine has plenty of resources to burn.
Please check the Maximum Degree of Parallelism (MAXDOP) configuration setting on your SQL Server. You can also adjust the MAXDOP value if needed.
This link might be useful to you http://www.mssqltips.com/tip.asp?tip=1047
cheers
Could you not split the query into batches, and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it, and although you can stop it, or even encourage it (a little) by changing the 'cost threshold for parallelism' option to 0, I think it's pretty hit and miss.
One thing that's worth noting is that it will only decide whether or not to use parallelism at the time the query is compiled. Also, if the query is compiled at a time when the CPU load is higher, SQL Server is less likely to consider parallelism.
I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... I think they don't play well together.
What I've done in the past on similar tasks is to set up a table listing each batch of work to be done. Each connection you fire up goes to this table, gets the next batch, marks it as being processed, processes it, updates it as done, and repeats. This lets you gauge performance, manage scaling, allows stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
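Here is a rough sketch of that work-queue loop, using Python and pyodbc purely for illustration; the work_queue table, its columns, the photos table, and the dbo.MakeThumb SQLCLR function name are all hypothetical stand-ins for your own schema.

import pyodbc

conn = pyodbc.connect("DSN=mydb", autocommit=True)   # hypothetical DSN
cur = conn.cursor()

def claim_next_batch():
    # atomically grab one pending batch and mark it as being processed
    cur.execute("""
        UPDATE TOP (1) work_queue
        SET status = 'processing'
        OUTPUT inserted.batch_id, inserted.first_id, inserted.last_id
        WHERE status = 'pending'
    """)
    return cur.fetchone()        # None when there is no work left

def process(first_id, last_id):
    # the real work: run the thumbnail UPDATE over one id range
    cur.execute("UPDATE photos SET ThumbData = dbo.MakeThumb(PhotoData) "
                "WHERE id BETWEEN ? AND ?", first_id, last_id)

while True:
    batch = claim_next_batch()
    if batch is None:
        break
    process(batch.first_id, batch.last_id)
    cur.execute("UPDATE work_queue SET status = 'done' WHERE batch_id = ?",
                batch.batch_id)

Run the same loop from as many connections (or processes) as you want in parallel; each one pulls its own batch, so the work spreads across cores without the single-UPDATE serialization.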
Find some criteria to break the set into distinct sub-sets of rows (1-100, 101-200, whatever) and then call your update statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.