While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and importing this DBSession instance in all the insert/update operations that follow.
When I do this, my DB operations have the following structure:
with transaction.manager:
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # more create, read, update and delete operations
I do not commit, flush, or roll back anywhere, assuming the Zope transaction manager takes care of it for me (it commits at the end of the transaction, or rolls back if it fails).
The second way, and the one most frequently mentioned on the web, was to create a DBSession factory once, like
DBSession = sessionmaker(bind=engine)
and then create a session instance from it per transaction:
session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # more create, read, update and delete operations
            session.flush()
            session.commit()
        except:
            session.rollback()
session.close()
I do not understand which is BETTER (in terms of memory usage, performance, and overall health), and why.
In the first method, I accumulate all the objects in the session and the commit happens at the end. For a bulky insert operation, does adding objects to the session keep them in memory (RAM) or elsewhere? Where do they get stored, and how much memory is consumed?
Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes about the same time to execute: 100K rows of selects, inserts, and updates take about 25-30 minutes. Is there any way to reduce this?
Please point me in the right direction. Thanks in advance.
Here is a very generic answer, with the warning that I don't know much about Zope. Just some simple database heuristics; I hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate a Session when a database conversation begins, and you surely have several points in the application at which different conversations begin. (I'm not sure from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your Python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries and on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources, everything will slow down (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes for every single row. Certainly too much. Try an intermediate number: commit every few hundred rows, as in the sketch below. But then it gets more complicated; one commit at the end of the file assures you that the file is either fully processed or not at all, while intermediate commits force you to take into account, when something fails, that your file is half processed - you have to reposition and resume.
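A minimal sketch of that intermediate-commit pattern, assuming a SQLAlchemy Session factory bound to your engine and a rows iterable read from the file (both names are placeholders):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite:///example.db")  # placeholder URL
Session = sessionmaker(bind=engine)

BATCH_SIZE = 500  # tune this; a few hundred to a few thousand is typical

session = Session()
try:
    for i, item in enumerate(rows, start=1):  # `rows` comes from your file
        session.add(item)
        if i % BATCH_SIZE == 0:
            session.commit()  # persist this batch, release transaction resources
    session.commit()          # commit the final, partial batch
except Exception:
    session.rollback()        # record how far you got so you can reposition
    raise
finally:
    session.close()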
As for the times you mention, it is very difficult to judge with only the information you provide about your queries, your database, and your machine. Anyway, the order of magnitude of your numbers (a select+insert+update per 15 ms, probably plus commit) sounds pretty high, but more or less in the expected range (again, it depends on queries + database + machine)... If you have to insert so many records frequently, you could consider other database solutions; it will depend on your scenario, and probably on dialect-specific features that may not be exposed by an ORM like SQLAlchemy.
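One dialect-level thing worth trying before switching databases is a Core-level executemany insert. This is only a hedged sketch: engine, the things table, and chunk_of_rows are placeholders, and your mileage will vary by database and driver.

from sqlalchemy import Table, Column, Integer, String, MetaData, create_engine

engine = create_engine("sqlite:///example.db")  # placeholder URL
metadata = MetaData()
things = Table("things", metadata,              # placeholder table definition
               Column("id", Integer, primary_key=True),
               Column("col_a", String),
               Column("col_b", String))
metadata.create_all(engine)

with engine.begin() as conn:  # one transaction, committed on success
    conn.execute(
        things.insert(),
        [{"col_a": a, "col_b": b} for a, b in chunk_of_rows],  # placeholder rows
    )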
Related
I'm a Node.js newbie and was wondering which way is better to insert a huge number of rows into a DB. On the surface, inserting rows one at a time looks like the way to go, because I can free the event loop quickly and serve other requests. But the code is hard to understand that way. For bulk inserts, I'd have to prepare the data beforehand, which would mean using loops for sure. That would cause fewer requests to be served during that period, as the event loop is busy with the loop.
So, what's the preferred way ? Is my analysis correct ?
There's no right answer here. It depends on the details: why are you inserting a huge number of rows? How often? Is this just a one-time bootstrap or does your app do this every 10 seconds? It also matters what compute/IO resources are available. Is your app the only thing using the database or is blasting it with requests going to be a denial of service for other users?
Without the details, my rule of thumb would be bulk insert with a small concurrency limit, like fire off up to 10 inserts, and then wait until one of them finishes before sending another insert command to the database. This follows the model of async.eachLimit. This is how browsers handle concurrent requests to a given web site, and it has proven to be a reasonable default policy.
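The same limited-concurrency pattern, sketched here in Python with asyncio for consistency with the rest of this thread (async.eachLimit above is the Node equivalent); insert_row is a hypothetical coroutine that performs one insert:

import asyncio

async def insert_all(rows, limit=10):
    sem = asyncio.Semaphore(limit)

    async def insert_one(row):
        async with sem:            # at most `limit` inserts in flight at once
            await insert_row(row)  # hypothetical single-row insert coroutine

    await asyncio.gather(*(insert_one(r) for r in rows))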
In general, loops on in-memory objects should be fast, very fast.
I know you're worried about blocking the CPU, but you should be considering the total amount of work to be done. Sending items one at a time carries a lot of overhead. Each query to the DB has its own sequence of inner for loops that probably make your "batching" for loop look pretty small.
If you need to dump 1000 things in the DB, the minimum amount of work you can do is to run this all at once. If you make it 10 batches of 100 "things", you have to do all of the same work plus generate and track all of those extra requests.
So how often are you doing these bulk inserts? If this is a regular occurrence, you probably want to minimize the total amount of work and bulk insert everything at once.
The trade-off here is logging and retries. It's usually not enough to just perform some type of bulk insert and forget about it. The bulk insert is eventually going to fail (fully or partially) and you will need some type of logic for retries or consolidation.
If that's a concern, you probably want to manage the size of the bulk insert so that you can retry blocks intelligently.
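A sketch of that block-size management, assuming a hypothetical bulk_insert(block) that inserts one block and raises on failure, and a hypothetical log_failed_block hook for later consolidation:

def insert_in_blocks(rows, block_size=1000, max_retries=3):
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        for attempt in range(max_retries):
            try:
                bulk_insert(block)  # hypothetical: inserts one block atomically
                break
            except Exception:
                if attempt == max_retries - 1:
                    log_failed_block(start, block)  # hypothetical consolidation hook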
Our DBA requires us to return all tabular data from stored procedures in a set of associative arrays rather than using a ref cursor, which is what I see in most examples on the web. He says this is because it is much faster for Oracle to do things this way, but it seems counterintuitive to me because the data needs to be looped over twice: once in the stored procedure and then again in the application when it is processed. Also, values often need to be cast from their native types to VARCHAR so they can be stored in the array, and then cast back on the application side. Using this method also makes it difficult to use ORM tools, because they seem to want ref cursors in most cases.
An example of a stored procedure is the following:
PROCEDURE sample_procedure (
    p_One   OUT varchar_array_type,
    p_Two   OUT varchar_array_type,
    p_Three OUT varchar_array_type,
    p_Four  OUT varchar_array_type
)
IS
    p_title_procedure_name VARCHAR2(100) := 'sample_procedure';
    v_start_time DATE := SYSDATE;
    CURSOR cur
    IS
        SELECT e.one, e.two, e.three, e.four
        FROM package.table e  -- alias added so the column references compile
        WHERE filter = 'something';
    v_counter PLS_INTEGER := 0;
BEGIN
    FOR rec IN cur LOOP
        v_counter := v_counter + 1;
        p_One(v_counter)   := rec.one;
        p_Two(v_counter)   := rec.two;
        p_Three(v_counter) := rec.three;
        p_Four(v_counter)  := rec.four;
    END LOOP;
END;
The cursor is used to populate one array for each column returned. I have tried to find information supporting his claim that this method is faster, but have been unable to do so. Can anyone fill me in on why he might want us (the .NET developers) to write stored procedures in this way?
The DBA's request does not make sense.
What the DBA is almost certainly thinking is that he wants to minimize the number of SQL to PL/SQL engine context shifts that go on when you're fetching data from a cursor. But the solution that is being suggested is poorly targeted at this particular problem and introduces other, much more serious performance problems in most systems.
In Oracle, a SQL to PL/SQL context shift occurs when the PL/SQL VM asks the SQL VM for more data, the SQL VM responds by executing the statement further to get the data which it then packages up and hands back to the PL/SQL VM. If the PL/SQL engine is asking for rows one at a time and you're fetching a lot of rows, it is possible that these context shifts can be a significant fraction of your overall runtime. To combat that problem, Oracle introduced the concept of bulk operations back at least in the 8i days. This allowed the PL/SQL VM to request multiple rows at a time from the SQL VM. If the PL/SQL VM requests 100 rows at a time, you've eliminated 99% of the context shifts and your code potentially runs much faster.
Once bulk operations were introduced, there was a lot of code that could be refactored in order to be more efficient by explicitly using BULK COLLECT operations rather than fetching row-by-row and then using FORALL loops to process the data in those collections. By the 10.2 days, however, Oracle had integrated bulk operations into implicit FOR loops so an implicit FOR loop now automatically bulk collects in batches of 100 rather than fetching row-by-row.
In your case, however, since you're returning the data to a client application, the use of bulk operations is much less significant. Any decent client-side API is going to have functionality that lets the client specify how many rows need to be fetched from the cursor in each network round-trip, and those fetch requests are going to go directly to the SQL VM, not through the PL/SQL VM, so there are no SQL to PL/SQL context shifts to worry about. Your application has to worry about fetching an appropriate number of rows in each round-trip: enough that the application doesn't become too chatty and bottleneck on the network, but not so many that you have to wait too long for the results to be returned or have to store too much data in memory.
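For example, here is a hedged sketch of that client-side knob in Python with cx_Oracle (the connection string and the pkg.get_rows procedure are placeholders); arraysize is exactly the per-round-trip fetch count described above:

import cx_Oracle

conn = cx_Oracle.connect("user/password@host/service")  # placeholder credentials
cur = conn.cursor()

ref = cur.var(cx_Oracle.CURSOR)      # bind variable for the REF CURSOR
cur.callproc("pkg.get_rows", [ref])  # hypothetical procedure returning a ref cursor
rc = ref.getvalue()

rc.arraysize = 500                   # rows fetched per network round-trip
for row in rc:                       # fetches lazily, 500 rows at a time
    process(row)                     # placeholder row handler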
Returning PL/SQL collections rather than a REF CURSOR to a client application isn't going to reduce the number of context shifts that take place. But it is going to have a bunch of other downsides, not the least of which is memory usage. A PL/SQL collection has to be stored entirely in the program global area (PGA) (assuming dedicated server connections) on the database server. This is a chunk of memory that has to be allocated from the server's RAM. That means that the server is going to have to allocate memory in which to fetch every last row that every client requests. That, in turn, is going to dramatically limit the scalability of your application and, depending on the database configuration, may steal RAM away from other parts of the Oracle database that would be very useful in improving application performance. And if you run out of PGA space, your sessions will start to get memory-related errors. Even in purely PL/SQL based applications, you would never want to fetch all the data into collections; you'd always want to fetch it in smaller batches, in order to minimize the amount of PGA you're using.
In addition, fetching all the data into memory is going to make the application feel much slower. Almost any framework is going to allow you to fetch data as you need it so, for example, if you have a report that you are displaying in pages of 25 rows each, your application would only need to fetch the first 25 rows before painting the first screen. And it would never have to fetch the next 25 rows unless the user happened to request the next page of results. If you're fetching the data into arrays like your DBA proposes, however, you're going to have to fetch all the rows before your application can start displaying the first row even if the user never wants to see more than the first handful of rows. That's going to mean a lot more I/O on the database server to fetch all the rows, more PGA on the server, more RAM on the application server to buffer the result, and longer waits for the network.
I believe that Oracle will begin sending results from a system like this as it scans the database, rather than retrieving them all and then sending them back. This means that results are sent as they are found, speeding the system up. (Actually, if I remember correctly, it returns results to the loop in batches.) This is mostly from memory of some training.
HOWEVER, the real question is: why not ask him his reasoning directly? He may be referring to a trick Oracle can utilize, and if you understand the specifics you can use that speed trick to its full potential. Generally, ultimatums of the form "Always do this, as this is faster" are suspicious and deserve a closer look to fully understand their intentions. There may be situations where this really is not applicable (small query results, for example), where all the readability issues and overhead are not helping performance.
That said, it may be done to keep the code consistent and more quickly recognizable. Communication about his reasoning is the most important tool with concerns like this, as chances are good that he knows a trade secret that he's not fully articulating.
In one place of code I do something like this:
FormModel(.. some data here..).put()
And a couple lines below I select from the database:
FormModel.all().filter(..).fetch(100)
The problem I noticed is that sometimes the fetch doesn't see the data I just added.
My theory is that this happens because I'm using the High Replication Datastore and I don't give it enough time to replicate the data. But how can I avoid this problem?
Unless the data is in the same entity group, there is no way to guarantee that the data will be the most up to date (if I understand this section correctly).
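If your entities can share an entity group, an ancestor query is strongly consistent on the HRD. A hedged sketch, where parent_key is a placeholder for the key of the group's root entity:

parent_key = ...  # placeholder: key of the entity group's root entity

form = FormModel(parent=parent_key)  # "some data here", as in the question
form.put()

# An ancestor query is strongly consistent, so it sees the put() above.
results = FormModel.all().ancestor(parent_key).fetch(100)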
Shay is right: there's no way to know when the datastore will be ready to return the data you just entered.
However, you are guaranteed that the data will be entered eventually, once the call to put completes successfully. That's a lot of information, and you can use it to work around this problem. When you get the data back from fetch, just append/insert the new entities that you know will be in there eventually! In most cases it will be good enough to do this on a per-request basis, I think, but you could do something more powerful that uses memcache to cover all requests (except cases where memcache fails).
The hard part, of course, is figuring out when you should append/insert which entities. It's obnoxious to have to do this workaround, but a relatively low price to pay for something as astonishingly complex as the HRD.
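A hedged sketch of that per-request workaround (the question's filter criteria are elided here, as in the original):

new_entity = FormModel()  # "some data here", as in the question
new_entity.put()

results = FormModel.all().fetch(100)  # plus the elided filter(s)

# The fresh write may not be visible yet under eventual consistency,
# so patch it in by key if it should satisfy the query.
if new_entity.key() not in set(e.key() for e in results):
    results.append(new_entity)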
From https://developers.google.com/appengine/docs/java/datastore/transactions#Java_Isolation_and_consistency
This consistent snapshot view also extends to reads after writes inside transactions. Unlike with most databases, queries and gets inside a Datastore transaction do not see the results of previous writes inside that transaction. Specifically, if an entity is modified or deleted within a transaction, a query or get returns the original version of the entity as of the beginning of the transaction, or nothing if the entity did not exist then.
I saw this sentence not only in one place:
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
What does this really mean?
It puzzles me now because I want to use transactions for my app, which in normal use will deal with inserting hundreds of rows from many clients, concurrently.
For example, I have a service which exposes a method: AddObjects(List<Objects>), and of course these objects contain other, nested objects.
I was thinking of starting a transaction for each call from the client, performing the appropriate actions (a bunch of inserts/updates/deletes for each object with its nested objects). EDIT1: I meant a transaction for the entire "AddObjects" call, in order to prevent undefined states/behaviour.
Am I going in the wrong direction? If yes, how would you do that and what are your recommendations?
EDIT2: Also, I understood that transactions are fast for bulk operations, but that somehow contradicts the quoted sentence. What is the conclusion?
Thanks in advance!
A transaction has to cover a business-specific unit of work. It has nothing to do with generic "objects"; it must always be expressed in domain-specific terms: "the debit of account X and the credit of account Y must be in a transaction", "subtracting an inventory item and recording its sale must be in a transaction", etc. Everything that must either succeed together or fail together must be in a transaction. If you are down an abstract path of "adding objects to a list is a transaction", then yes, you are on the wrong path. The fact that all inserts/updates/deletes triggered by an object save are in a transaction is not a purpose, but a side effect. The correct semantics should be "the update of object X and the update of object Y must be in a transaction". Even the degenerate case of a single "object" being updated should still be regarded in domain-specific terms.
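To make the debit/credit example concrete, a minimal sketch with Python's sqlite3 (the accounts table and amounts are hypothetical); both updates succeed or fail together:

import sqlite3

conn = sqlite3.connect("bank.db")  # placeholder database
try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (100, 1))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (100, 2))
finally:
    conn.close()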
That recommendation is best understood as Do not allow user interaction in a transaction. If you need to ask the user during a transaction, roll back, ask and run again.
Other than that, do use transaction whenever you need to ensure atomicity.
It is not the transactions' fault that they may cause "concurrency issues"; rather, the database might need some more thought, a better set of indices, or a more standardized data access order.
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
The longer a transaction is kept open the more likely it will lock resources that are needed by other transactions. This blocking will cause other concurrent transactions to wait for the resources (or fail depending on the design).
SQL Server is usually set up in auto-commit mode. This means that every SQL statement is a distinct transaction. Often you want a multi-statement transaction so you can commit or roll back multiple updates together. The longer the updates take, the more likely it is that other transactions will conflict.
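A hedged sketch of such a multi-statement transaction against SQL Server with pyodbc (conn_str and the orders/shipments tables are placeholders); with autocommit disabled, both statements commit or roll back together:

import pyodbc

conn = pyodbc.connect(conn_str, autocommit=False)  # placeholder connection string
cur = conn.cursor()
try:
    cur.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", 42)
    cur.execute("INSERT INTO shipments (order_id) VALUES (?)", 42)
    conn.commit()    # both statements become visible atomically
except Exception:
    conn.rollback()  # neither statement takes effect
    raise
finally:
    conn.close()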
I'm not sure I 100% understand what the database does. If I just have some misconception, please point it out.
Let's say I have a function that wants to create 100 new entries in a database that already has 100,000 entries.
It seems a lot faster when those 100 entries are created and the commit is made after the last entry is created.
Now, if those 100 entries are created by different users, is there an easy way to commit only after 100 entries are created?
Edit:
Should I maybe write some sort of buffer?
Databases are optimized for set-based operations, so yes, it would be faster to insert 100 records as a set than one at a time. However, when you are talking about users entering records one at a time, you would not want to group them together under any circumstances that I can think of. Why?
First, if there were one bad record, the others would fail. This would make for 99 cranky users out of 100 (actually 100, but one would not really have reason to be cranky because he did the bad data entry to begin with).
Second, users would not see the records immediately after entering them. They would also not be able to do anything further with those records, such as entering data into related tables, until they are committed. Having a delay like this would make users cranky. If users are entering data from customers over the phone, they will be especially cranky at the wait (I worked at a call center with a horribly slow commercial product, and believe me, I know how upset the users used to get!).
Third, users will have gone on to something else and would not realize that their data was rejected for bad information, which is not a good thing at all.
How long are you going to wait to get your set number of records? Five seconds, ten minutes?
What happens if the network connection is lost during that time? Wouldn't the users lose the data they entered?
You might be able to hack something like that together, but you really shouldn't, because it wrecks your data integrity, which is the whole point of using transactions.
In your proposed solution, a problem with any insert in the batch would cause all the other (possibly totally valid) inserts from completely different users to fail. Also, users wouldn't be able to see the data they just tried to insert because the system was waiting to do the insert until the batch was full.
P.S. Here's a quick intro to transaction processing.
I think you do have a misconception. It sounds like you're looking at the database as something that is only for some sort of "long-term" memory. This is a bad concept; the database is the only memory your application has. Even when this isn't true, it's best to pretend that it is.
To go a little deeper, your application has:
scoped memory: variables that you define within view functions, for example. These all get destroyed when flow leaves the function.
globals: variables that are defined in the outermost part of your code. It is really important not to use these for any sort of state except perhaps configuration constants. In particular, you should not rely on them for any sort of dynamic behavior; otherwise you will have to battle concurrency and forked processes (depending on the server gateway) that aren't aware of each other. Just don't do it.
a caching scheme, if you choose to implement one. This is entirely optional in Django, and there are many ways to do it. However, one typically uses some scheme to ensure that even if the cache crashes, the database still reflects the current state of the data accurately (a minimal write-through sketch appears at the end of this answer).
your local filesystem. From a design point of view, most ways of taking advantage of this will either resemble a caching system (above) or be clumsy and fragile. From a performance point of view, it might be about as slow as a database.
your database.
So you see that there's not much place for you to put your data besides the database.
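A minimal write-through sketch of that caching idea, using Django's cache API and a hypothetical Thing model; the database remains authoritative and the cache is only an accelerator:

from django.core.cache import cache
from myapp.models import Thing  # hypothetical model

def save_thing(thing):
    thing.save()  # write the database first: it stays the source of truth
    cache.set("thing:%s" % thing.pk, thing, timeout=300)

def get_thing(pk):
    obj = cache.get("thing:%s" % pk)
    if obj is None:
        obj = Thing.objects.get(pk=pk)  # fall back to the database
        cache.set("thing:%s" % pk, obj, timeout=300)
    return obj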