I have a large table, 1B+ records, that I need to pull down and run an algorithm on every record. How can I use ADO.NET to execute a "select * from table" asynchronously and start reading the rows one by one while ADO.NET is receiving the data?
I also need to dispose of the records after I read them to save on memory. So I am looking for a way to pull a table down record by record and basically shove each record into a queue for processing.
My data sources are Oracle and MSSQL. I have to do this for several data sources.
You should use SSIS for this.
You need a bit of background detail on how the ADO.NET data providers work to understand what you can and can't do. Let's take the SqlClient provider as an example. It is true that it is possible to execute queries asynchronously with BeginExecuteReader, but the asynchronous execution only lasts until the query starts returning results. At the wire level the SQL text is sent to the server, the server starts churning through the query execution, and eventually starts pushing result rows back to the client. As soon as the first packet comes back to the client, the asynchronous execution is done and the completion callback is executed. After that the client uses the SqlDataReader.Read() method to advance the result set; there are no asynchronous methods in the SqlDataReader. This pattern works wonders for complex queries that return few results after some serious processing: while the server is busy producing the result, the client is idle with no threads blocked. However, things are completely different for simple queries that produce large result sets (as seems to be the case for you): the server will immediately produce results and will continue to push them back to the client. The asynchronous callback will fire almost instantaneously, and the bulk of the time will be spent by the client iterating over the SqlDataReader.
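To make that concrete, here is a minimal sketch of the Begin/End pattern (the connection string, table name, and ProcessRecord are placeholders; on .NET versions before 4.5 the connection string must include Asynchronous Processing=true):

using System;
using System.Data.SqlClient;

var connString = "Data Source=.;Initial Catalog=MyDb;Integrated Security=true;Asynchronous Processing=true";
using (var conn = new SqlConnection(connString))
using (var cmd = new SqlCommand("SELECT * FROM BigTable", conn))
{
    conn.Open();
    // EndExecuteReader completes as soon as the FIRST result packet arrives,
    // not when the whole result set has been transferred.
    IAsyncResult ar = cmd.BeginExecuteReader();
    using (SqlDataReader reader = cmd.EndExecuteReader(ar))
    {
        // From here on, iteration is synchronous; each Read() may block
        // waiting for the server to push the next packet.
        while (reader.Read())
        {
            ProcessRecord(reader);  // your algorithm, one record at a time
        }
    }
}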
You say you're thinking of placing the records into an in-memory queue first. What is the purpose of the queue? If your algorithm's processing is slower than the throughput of the DataReader result set iteration, this queue will build up. It will consume live memory and eventually exhaust the memory on the client. To prevent this you would have to build a flow control mechanism, i.e. if the queue size is bigger than N, don't put any more records into it. But to achieve this you would have to suspend the data reader iteration, and if you do that you push flow control back to the server, which will suspend the query until the communication pipe is available again (i.e. until you start reading from the reader again). Ultimately the flow control has to be propagated all the way to the server, which is always the case in any producer-consumer relation: the producer has to stop, otherwise intermediate queues fill up. Your in-memory queue serves no purpose at all, other than complicating things. You can simply process items from the reader one by one, and if your rate of processing is too slow, the data reader will cause flow control to be applied to the query running on the server. This happens automatically, simply because you don't call the DataReader.Read method any faster than you can process the results.
To summarise: for large result set processing you cannot do asynchronous processing, and there is no need for a queue.
Now the difficult part.
Is your processing doing any sort of update back in the database? If yes, then you have much bigger problems:
You cannot use the same connection to write back the results, because it is busy with the data reader. SqlClient for SQL Server supports MARS, but that only solves the problem on SQL Server 2005/2008 (see the connection-string example after this list).
If you're going to enlist the read and the update in a transaction, and your updates occur on a different connection (see above), then this means using distributed transactions (even when the two connections involved point back to the same server). Distributed transactions are slow.
You will need to split the processing into several batches, because it is very bad to process 1B+ records in a single transaction. This also means that you have to be able to resume processing of an aborted batch, which means you must be able to identify records that were already processed (unless processing is idempotent).
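On the MARS point in the first item: enabling it is a single connection-string flag on SQL Server 2005 and later (the server and database names here are placeholders):

Data Source=.;Initial Catalog=MyDb;Integrated Security=true;MultipleActiveResultSets=true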
A combination of a DataReader and an iterator block (a.k.a. generator) should be a good fit for this problem. The default DataReaders provided by Microsoft pull data one record at a time from a datasource.
Here's an example in C#:
static IEnumerable<User> RetrieveUsers(DbDataReader reader)
{
    // Read() advances to the next row; the original NextResult() would
    // have skipped to the next result set instead.
    while (reader.Read())
    {
        User user = new User
        {
            Name = reader.GetString(0),
            Surname = reader.GetString(1)
        };
        yield return user;
    }
}
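Calling it could look something like this (connString, the Users query, and Process are illustrative, assuming a User class with Name and Surname properties):

using (var conn = new SqlConnection(connString))
using (var cmd = new SqlCommand("SELECT Name, Surname FROM Users", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        foreach (User user in RetrieveUsers(reader))
        {
            Process(user);  // rows are materialized one at a time, so memory stays flat
        }
    }
}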
A good approach to this would be to pull back the data in blocks, iterate through them adding to your queue, then call again. This is going to be better than hitting the DB for each row. If you are pulling them back via a numeric PK then this will be easy; if you need to order by something else you can use ROW_NUMBER() to do it.
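A minimal sketch of the numeric-PK variant in T-SQL (table, columns, batch size, and @LastIdSeen are placeholders):

SELECT TOP (10000) Id, Col1, Col2
FROM dbo.MyTable
WHERE Id > @LastIdSeen      -- pass in the highest Id from the previous block
ORDER BY Id;

Each block picks up where the previous one left off, so you never rescan rows you have already processed.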
Just use the DbDataReader (just like Richard Nienaber said). It is a forward-only way of scrolling through the retrieved data. You don't have to dispose of your data because a DbDataReader is forward only.
When you use the DbDataReader it seems that the records are retrieved one by one from the database.
It is however slightly more complicated:
Oracle (and probably MySQL) will fetch a few hundred rows at a time to decrease the number of round trips to the database. You can configure the fetch size of the DataReader. Most of the time it will not matter whether you fetch 100 rows or 1000 rows per round trip; however, a very low value like 1 or 2 rows slows things down, because with a low value retrieving the data requires too many round trips.
You probably don't have to set the fetch size manually, the default will be just fine.
edit1: See here for an Oracle example: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html
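The linked article boils down to something like this (a sketch against the ODP.NET provider; the table name and the 1000-row sizing are illustrative):

using Oracle.DataAccess.Client;

using (var conn = new OracleConnection(connString))
using (var cmd = new OracleCommand("SELECT * FROM big_table", conn))
{
    conn.Open();
    using (OracleDataReader reader = cmd.ExecuteReader())
    {
        // RowSize is populated once the statement has executed;
        // buffer roughly 1000 rows per round trip.
        reader.FetchSize = cmd.RowSize * 1000;
        while (reader.Read())
        {
            // process the row
        }
    }
}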
Related
Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is around how best to safely perform this MERGE without missing any records? At the moment, we are performing a single data extraction job per-day so there is normally a point where the Snowpipe queue is empty which we use as an indicator that it is safe to proceed, however we are looking to move to more frequent extractions where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work but it is not clear to me that this is an advised way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link) however we're talking about pausing it for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is if the MERGE and DELETE could be contained in a transaction it may provide a safe way to process the incoming data throughout the day but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At this point the MERGE and TRUNCATE would work off the processing table while the landing table continues to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions to your Snowpipe landing table. You can then MERGE off the stream into your target table, which resets the stream's offset. Use a task to run your MERGE statement. Also, given it is Snowpipe (insert-only), when creating your stream it is probably best to use an append-only stream.
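In outline (the table, warehouse, and column names are placeholders, and the schedule is just an example):

CREATE OR REPLACE STREAM landing_stream
  ON TABLE landing
  APPEND_ONLY = TRUE;  -- Snowpipe only ever inserts, so skip update/delete tracking

CREATE OR REPLACE TASK merge_landing
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('LANDING_STREAM')
AS
  MERGE INTO target t
  USING landing_stream s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.payload = s.payload
  WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);

ALTER TASK merge_landing RESUME;  -- tasks are created suspended

Because the MERGE consumes the stream, the offset advances when the transaction commits, so the next run only sees rows that arrived since.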
However, I had a question here where in some circumstances we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason. I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
I am trying to run a lot of update statements from code, and we have a requirement to summarize what changed for every operation for an audit log.
The update basically persists an entire graph consisting of dozens of tables to SQL Server. Right now, before we begin, we collect the data from all the tables, assemble the graph(s) as a "before" picture, apply the updates, then re-collect the data from all the tables, re-assemble the graph(s) for the "after", serialize the before and after graph(s) to JSON, then post a message to an ESB queue for an off-process consumer to crunch through the graphs, identify the deltas, and update the audit log. All the SQL operations occur in a single transaction.
Needless to say, this is an expensive and time-consuming process.
I've been playing with the OUTPUT clause in T-SQL. I like the idea of getting the results of the operation in the same command as the update, but it seems to have some limitations. For example, ideally it'd be great if I could get the INSERTED and DELETED result sets back at the same time, but there doesn't seem to be a concept of UNION between the two table sets, so that gets unwieldy very quickly. Also, because the updates don't actually modify every column, I can't take the changes I made and compare them to DELETED, since we'd show deltas for columns we didn't change.
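For illustration, here is the kind of single-table statement I mean (table and column names are hypothetical); OUTPUT can project the old and new values side by side in one result set:

UPDATE dbo.Customer
SET    Name = @NewName
OUTPUT deleted.CustomerId,
       deleted.Name AS OldName,
       inserted.Name AS NewName
WHERE  CustomerId = @CustomerId;

But repeating that across dozens of tables with wide column lists gets unwieldy fast, and it still reports columns the statement touched but didn't actually change.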
...but maybe I'm missing some syntax with the OUTPUT clause, or I'm not using it correctly, so I figured I'd ask the SO community.
What is the most efficient way to collect the deltas of an update operation in SQL Server? The goal is to minimize the calls to SQL Server, and collect the minimum necessary amount of information for writing an accurate audit log, without writing a bunch of custom code for every single operation.
While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and importing this DBSession instance in all my requests (all my insert/update operations) that follow.
When I do this, my DB operations have the following structure:
import transaction

with transaction.manager:
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # more create, read, update and delete operations
I do not commit, flush, or rollback anywhere, assuming my Zope transaction manager takes care of it for me (it commits at the end of the transaction, or rolls back if it fails).
The second way, and the one most frequently mentioned on the web, was:
create a session factory once, like

DBSession = sessionmaker(bind=engine)

and then create a session instance from it per transaction:

session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # more create, read, update and delete operations
            session.flush()
            session.commit()
        except Exception:
            session.rollback()
session.close()
I do not understand which is BETTER (in terms of memory usage, performance, and health) and why.

In the first method, I accumulate all the objects in the session and the commit happens at the end. For a bulky insert operation, does adding objects to the session keep them in memory (RAM) or somewhere else? Where do they get stored, and how much memory is consumed?

Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes about the same time to execute: 100K rows of selects, inserts and updates takes about 25-30 minutes. Is there any way to reduce this?

Please point me in the right direction. Thanks in advance.
Here you have a very generic answer, with the warning that I don't know much about Zope; just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate a Session when a database conversation begins, but you surely have several points in the application at which different conversations begin. (I'm not sure from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your Python program, and surely in the database transaction. How much space? That is difficult to say with the information you provide; it will depend on the queries and on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes every time, for every row. That is certainly too much. Try an intermediate number: commit every few hundred rows. But then it gets more complicated; one commit at the end of the file assures you that the file is either fully processed or not at all, while intermediate commits force you to take into account, when something fails, that your file is half processed and you have to reposition.
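A minimal sketch of the intermediate-commit idea, reusing the names from the question (BATCH_SIZE is illustrative):

BATCH_SIZE = 1000  # tune it; a few hundred to a few thousand is typical

session = DBSession()
try:
    for i, item in enumerate(huge_file_of_million_rows, start=1):
        session.add(item)
        if i % BATCH_SIZE == 0:
            session.commit()  # persist the batch and let the session release memory
    session.commit()          # final partial batch
except Exception:
    session.rollback()
    raise  # on restart you must skip the rows that were already committed
finally:
    session.close()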
As for the times you mention, it is very difficult to judge with only the information you provide; it depends on your database and machine. Anyway, the order of magnitude of your numbers, one select+insert+update per 15 ms (probably plus commit), sounds pretty high but is more or less in the expected range (again, it depends on the queries, the database, and the machine)... If you have to frequently insert so many records you could consider other database solutions; it will depend on your scenario (and probably on dialects) and may not be possible through an ORM like SQLAlchemy.
Our DBA requires us to return all tabular data from stored procedures in a set of associative arrays rather than using a ref cursor, which is what I see in most examples on the web. He says this is because it is much faster for Oracle to do things this way, but it seems counterintuitive to me because the data needs to be looped over twice: once in the stored procedure and then again in the application when it is processed. Also, values often need to be cast from their native types to VARCHAR so they can be stored in the array, and then cast back on the application side. Using this method also makes it difficult to use ORM tools, because they seem to want ref cursors in most cases.
An example of a stored procedure is the following:
PROCEDURE sample_procedure (
    p_One   OUT varchar_array_type,
    p_Two   OUT varchar_array_type,
    p_Three OUT varchar_array_type,
    p_Four  OUT varchar_array_type
)
IS
    p_title_procedure_name VARCHAR2(100) := 'sample_procedure';
    v_start_time DATE := SYSDATE;

    CURSOR cur IS
        SELECT e.one, e.two, e.three, e.four
        FROM package.table e          -- alias added so the column references compile
        WHERE filter = 'something';

    v_counter PLS_INTEGER := 0;
BEGIN
    FOR rec IN cur LOOP
        v_counter := v_counter + 1;
        p_One(v_counter)   := rec.one;
        p_Two(v_counter)   := rec.two;
        p_Three(v_counter) := rec.three;
        p_Four(v_counter)  := rec.four;
    END LOOP;
END;
The cursor is used to populate one array for each column returned. I have tried to find information supporting his claim that this method is faster, but have been unable to do so. Can anyone fill me in on why he might want us (the .NET developers) to write stored procedures in this way?
The DBA's request does not make sense.
What the DBA is almost certainly thinking is that he wants to minimize the number of SQL-to-PL/SQL engine context shifts that happen when you're fetching data from a cursor. But the suggested solution is poorly targeted at this particular problem and introduces other, much more serious performance problems in most systems.
In Oracle, a SQL to PL/SQL context shift occurs when the PL/SQL VM asks the SQL VM for more data, the SQL VM responds by executing the statement further to get the data which it then packages up and hands back to the PL/SQL VM. If the PL/SQL engine is asking for rows one at a time and you're fetching a lot of rows, it is possible that these context shifts can be a significant fraction of your overall runtime. To combat that problem, Oracle introduced the concept of bulk operations back at least in the 8i days. This allowed the PL/SQL VM to request multiple rows at a time from the SQL VM. If the PL/SQL VM requests 100 rows at a time, you've eliminated 99% of the context shifts and your code potentially runs much faster.
Once bulk operations were introduced, there was a lot of code that could be refactored in order to be more efficient by explicitly using BULK COLLECT operations rather than fetching row-by-row and then using FORALL loops to process the data in those collections. By the 10.2 days, however, Oracle had integrated bulk operations into implicit FOR loops so an implicit FOR loop now automatically bulk collects in batches of 100 rather than fetching row-by-row.
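For reference, a minimal sketch of an explicit BULK COLLECT fetch loop (big_table and the 100-row limit are illustrative):

DECLARE
    TYPE t_id_tab IS TABLE OF big_table.id%TYPE;
    l_ids t_id_tab;
    CURSOR cur IS SELECT id FROM big_table;
BEGIN
    OPEN cur;
    LOOP
        FETCH cur BULK COLLECT INTO l_ids LIMIT 100;  -- one context shift per 100 rows
        EXIT WHEN l_ids.COUNT = 0;
        FOR i IN 1 .. l_ids.COUNT LOOP
            NULL;  -- process l_ids(i)
        END LOOP;
    END LOOP;
    CLOSE cur;
END;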
In your case, however, since you're returning the data to a client application, the use of bulk operations is much less significant. Any decent client-side API is going to have functionality that lets the client specify how many rows need to be fetched from the cursor in each network round-trip and those fetch requests are going to go directly to the SQL VM, not through the PL/SQL VM, so there are no SQL to PL/SQL context shifts to worry about. Your application has to worry about fetching an appropriate number of rows in each round-trip-- enough that the application doesn't become too chatty and bottleneck on the network but not so many that you have to wait too long for the results to be returned or to store too much data in memory.
Returning PL/SQL collections rather than a REF CURSOR to a client application isn't going to reduce the number of context shifts that take place. But it is going to have a bunch of other downsides, not the least of which is memory usage. A PL/SQL collection has to be stored entirely in the program global area (PGA) (assuming dedicated server connections) on the database server. This is a chunk of memory that has to be allocated from the server's RAM. That means that the server is going to have to allocate memory in which to fetch every last row that every client requests. That, in turn, is going to dramatically limit the scalability of your application and, depending on the database configuration, may steal RAM away from other parts of the Oracle database that would be very useful in improving application performance. And if you run out of PGA space, your sessions will start to get memory-related errors. Even in purely PL/SQL-based applications, you would never want to fetch all the data into collections; you'd always want to fetch it in smaller batches, to minimize the amount of PGA you're using.
In addition, fetching all the data into memory is going to make the application feel much slower. Almost any framework is going to allow you to fetch data as you need it so, for example, if you have a report that you are displaying in pages of 25 rows each, your application would only need to fetch the first 25 rows before painting the first screen. And it would never have to fetch the next 25 rows unless the user happened to request the next page of results. If you're fetching the data into arrays like your DBA proposes, however, you're going to have to fetch all the rows before your application can start displaying the first row even if the user never wants to see more than the first handful of rows. That's going to mean a lot more I/O on the database server to fetch all the rows, more PGA on the server, more RAM on the application server to buffer the result, and longer waits for the network.
I believe that Oracle will begin sending results from a query like this as it scans the database, rather than retrieving them all and then sending them back. This means that results are sent as they are found, speeding the system up. (Actually, if I remember correctly, it returns results to the loop in batches.) This is mostly from memory from some training.
HOWEVER, the real question is why not ask him his reasoning directly. He may be referring to a trick Oracle can utilize, and if you understand the specifics you can use that trick to its full potential. Generally, ultimatums of the form "always do this, as this is faster" are suspicious and deserve a closer look to fully understand their intentions. There may be situations where this is really not applicable (small query results, for example), where all the readability issues and overhead are not helping performance.
That said, it may be done to keep the code consistent and more quickly recognizable. Communication about his reasoning is the most important tool with concerns like this, as chances are good that he knows a trade secret that he's not fully articulating.
I've got a large conversion job: 299GB of JPEG images, already in the database, to convert into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread safe SQLCLR function to do the business of re-sampling the images, lovely job.
Problem is, when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), it executes linearly to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat RAID setup this database machine has? Is it to use a subquery in the FROM clause of the UPDATE statement? Is that all that's required to enable parallelism on this kind of operation?
Anyway, the operation is split into batches of around 4000 images each (in a windowed query of about 391k images), and this machine has plenty of resources to burn.
Please check the configuration setting for Maximum Degree of Parallelism (MAXDOP) on your SQL Server. You can also change the value of MAXDOP.
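For example, via sp_configure (a sketch; changing it needs elevated permissions, and 12 here just matches the 12 cores mentioned in the question):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';      -- view the current value
EXEC sp_configure 'max degree of parallelism', 12;  -- allow up to 12 cores per query
RECONFIGURE;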
This link might be useful to you http://www.mssqltips.com/tip.asp?tip=1047
cheers
Could you not split the query into batches, and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it, and although you can stop it, or even encourage it (a little) by changing the cost threshold for parallelism option to 0, I think it's pretty hit and miss.
One thing that's worth noting is that SQL Server only decides whether or not to use parallelism at the time the query is compiled. Also, if the query is compiled at a time when the CPU load is high, SQL Server is less likely to consider parallelism.
I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... I think they don't play well together.
What I've done in the past on similar tasks is to set up a table listing each batch of work to be done. Each connection you fire up goes to this table, gets the next batch, marks it as being processed, processes it, updates it as done, and repeats. This lets you gauge performance, manage scaling, allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
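A minimal sketch of that batch table, plus an atomic "claim the next batch" statement (all names are illustrative):

CREATE TABLE dbo.WorkBatch (
    BatchId int IDENTITY PRIMARY KEY,
    FirstId int NOT NULL,   -- inclusive ID range covered by this batch
    LastId  int NOT NULL,
    Status  varchar(10) NOT NULL DEFAULT 'Pending'  -- Pending / Processing / Done
);

-- Each worker claims one batch and learns its range in a single atomic statement;
-- READPAST lets workers skip rows another worker has locked:
UPDATE TOP (1) dbo.WorkBatch WITH (READPAST)
SET    Status = 'Processing'
OUTPUT inserted.BatchId, inserted.FirstId, inserted.LastId
WHERE  Status = 'Pending';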
Find some criterion to break the set into distinct subsets of rows (1-100, 101-200, whatever) and then call your UPDATE statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.
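A sketch of the client side (the ranges, the Photos table, and the dbo.MakeThumbnail SQLCLR function are hypothetical stand-ins for your setup):

using System.Data.SqlClient;
using System.Threading.Tasks;

var ranges = new[] { (1, 100000), (100001, 200000), (200001, 300000) };
Parallel.ForEach(ranges, range =>
{
    using (var conn = new SqlConnection(connString))
    using (var cmd = new SqlCommand(
        "UPDATE dbo.Photos SET ThumbData = dbo.MakeThumbnail(PhotoData) " +
        "WHERE PhotoId BETWEEN @lo AND @hi", conn))
    {
        cmd.Parameters.AddWithValue("@lo", range.Item1);
        cmd.Parameters.AddWithValue("@hi", range.Item2);
        cmd.CommandTimeout = 0;  // each batch can run for a long time
        conn.Open();
        cmd.ExecuteNonQuery();   // each connection resamples its own ID range
    }
});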