I'm trying to insert data into a database, but first I check if each row exists using a lookup, similar to the method suggested here:
How to prevent SSIS from importing data from a file that already exist in database?
SELECT DISTINCT VALUES // OleDb Source
|
LOOKUP // If exists
| // No Match Output
OLE DB DESTINATION // Insert new records
I'm using RetainSameConnection=True to enable transactions on my workflow. With a default buffer of around 10,000 rows, as rows get passed to the OLE DB Destination, the destination INSERT ends up blocking against the lookup SELECT.
I've tried SET READ_COMMITTED_SNAPSHOT ON, which does work, but the lookup is now incredibly slow. I believe that's due to the RetainSameConnection property, and I can't even tell that SSIS is using the READ COMMITTED SNAPSHOT isolation level. I thought about ignoring failures on the destination, but I've read that this causes bulk inserts to fail as a whole rather than row by row. I've also considered using NOLOCK on all the reads, but that would mean turning all my lookups into SQL queries.
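For reference, this is the database-level change I tried (the database name is a placeholder), and, as far as I can tell, this is how to confirm the setting actually took effect:

ALTER DATABASE MySourceDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- confirm the database-level setting
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'MySourceDb';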
The source DB may read millions of rows. Is there a better way to accomplish this?
This question is getting into an area where the answers are going to be based more on preference and experience than on there being one easily defined way of doing this. There will likely be a good number of answers showing different methodologies that might work for you, but the basics of any of them come down to a couple of things:
Reducing the number of rows that are saved in your lookup during pre-caching of the data flow task
Generally speaking, the values for all lookups are populated before the data flow begins to execute and read data from the source based on the configuration of the lookup. If you have your SSIS configured this way, then there should not be any contention between the lookup and the insertion of rows into your table regardless of isolation level.
Based on what you said above, I'm thinking that perhaps the performance of the lookup is not changing drastically because of your configurations but more because the amount of data being cached into the lookup is increasing with each execution.
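One concrete way to shrink what gets cached is to point the lookup at a query that returns only the key column it needs, and only the rows that could plausibly collide with the current load, instead of the whole destination table. A rough sketch (the table, column, and 30-day window are made up for illustration):

SELECT CustomerKey
FROM dbo.DestinationTable
WHERE LoadDate >= DATEADD(DAY, -30, GETDATE());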
Changing the Lookup to use a different pattern
The most basic implementation of the lookup is usually fine for most cases. However, if performance is a heavy concern for package execution, there may be more suitable ways of accomplishing the same objective. One of these, which I talk about on my blog, uses a merge join as an alternative to the standard lookup. It may not be ideal for your situation, as I designed that particular strategy to cover a corner case involving large data sets, but hopefully it gives you some ideas.
I have a fairly detailed walkthrough of that pattern at the link below
https://web.archive.org/web/20140819083150/http://bigdatabigdave.info/archive/2013/02/15/alternate-ssis-lookup-pattern-merge-join/
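If you do end up pushing the existence check down into SQL rather than an SSIS component, the set-based equivalent of "insert only the rows that don't already exist" looks roughly like this (names are illustrative, not from the post):

INSERT INTO dbo.DestinationTable (CustomerKey, Amount)
SELECT s.CustomerKey, s.Amount
FROM dbo.StagingTable AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.DestinationTable AS d
    WHERE d.CustomerKey = s.CustomerKey
);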
At one place I remember, they wrote all of the rows to be inserted to a raw file. A second data flow (run after the first data flow finished) then inserted the raw file's records into the destination table.
Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage, as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is how best to safely perform this MERGE without missing any records. At the moment we perform a single data extraction job per day, so there is normally a point where the Snowpipe queue is empty, which we use as an indicator that it is safe to proceed. However, we are looking to move to more frequent extractions, where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work, but it is not clear to me that this is an advisable way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link), however we're talking about pausing it for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is if the MERGE and DELETE could be contained in a transaction it may provide a safe way to process the incoming data throughout the day but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At that point the MERGE and TRUNCATE would work off the processing table, and the landing table would continue to receive the incoming records.
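For what it's worth, the swap itself would be a single statement (table names are placeholders); as far as I can tell, the conditional record-count checks would have to live in whatever procedure the scheduled task calls:

ALTER TABLE landing_table SWAP WITH processing_table;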
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions to your Snowpipe table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your merge statement. Also, given it is Snowpipe, when creating your stream it is probably best to use an append-only stream.
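A minimal sketch of that setup, with placeholder names for the tables, key/payload columns, and warehouse:

CREATE OR REPLACE STREAM landing_stream
ON TABLE landing_table
APPEND_ONLY = TRUE;

CREATE OR REPLACE TASK merge_landing_task
WAREHOUSE = my_wh
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('LANDING_STREAM')
AS
MERGE INTO final_table t
USING landing_stream s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.payload = s.payload
WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);

ALTER TASK merge_landing_task RESUME;

Consuming the stream inside the MERGE is what advances its offset once the statement commits.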
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason, but I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a select on the stream before the merge, seems to have solved the issue, i.e. no more missing rows.
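Roughly what that looks like in Snowflake Scripting (a sketch of the pattern rather than our exact code; object names are placeholders):

CREATE OR REPLACE PROCEDURE merge_landing()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  staged_rows INTEGER;
BEGIN
  BEGIN TRANSACTION;

  -- the "select on the stream before the merge" mentioned above
  SELECT COUNT(*) INTO :staged_rows FROM landing_stream;

  MERGE INTO final_table t
  USING landing_stream s
    ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.payload = s.payload
  WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);

  COMMIT;
  RETURN 'Merged ' || staged_rows::VARCHAR || ' staged rows';
END;
$$;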
I'm pretty proficient with VBA, but I know almost nothing about Access! I'm running a complex simulation using arrays in VBA, and I want to store the results somewhere. Since the results of the simulation will be quite large (~1GB in memory), I'd like to store them in Access rather than Excel.
I currently have a large number of Arrays populated with my data, but I'm not sure how to write these to a database, or even how to create one with VBA. Here's what I need to do, in a nutshell, with VBA:
Create a new Access Database
Create a new Access Table (the db will be only a single table)
Create ~1200 fields programmatically
Copy the results from my arrays to the new Access table.
I've looked at a number of answers on here, but none of them seem to answer my question fully. For instance, Adding field to MS Access Table using VBA talks about adding fields to a database, but I don't see doubles listed there. Most of my arrays are doubles. Will this be a problem?
EDIT:
Here are a few more details about the project:
I am running a network design simulation. Thus, I start by generating ~150,000 unique networks. Then, I run a lot of calculations (no, these can't be simplified to queries unfortunately!) of characteristics for the network. There end up being ~1200 for each possible network (unique record). Thus, I would like to store these in an Access database. Each record will be a unique network, and each field will be a specific characteristic associated with that network.
Virtually all of the fields (arrays at this point!) are doubles.
You (almost?) never want a database with one table. You might as well store it in a text file. One of the main benefits of databases is relating data in different tables, and with one table you don't need it.
Fortunately for you, you need more than one table and a database might be the way to go. You (almost) never need to create permanent tables in code (temp tables, sure, but not permanent ones). If your field names are variable, you need to change your design. When data is variable, it goes in the data part of a database. When it's fixed, it can be a table or a field. Based on what you've said, I think you need this:
In Access create a table called tblNetworks with these fields:
NetworkID AutoNumber
NetworkName Short Text
Then create another table called tblCalculations with these fields:
CalcID Autonumber
NetworkID Long (Relates to tblNetworks, one to many)
CalcDesc Short Text
Result Number (Double)
What you were going to name your fields in your Access table will be the CalcDesc data. You'll use ADODB to execute INSERT INTO sql statements that put the data in the tables.
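Roughly, the SQL strings you'd pass to those ADODB Execute calls would look like this (Access/Jet SQL, sketched to match the tables above; note that Result is a Double, so doubles are not a problem):

CREATE TABLE tblNetworks (
    NetworkID AUTOINCREMENT PRIMARY KEY,
    NetworkName TEXT(50)
);

CREATE TABLE tblCalculations (
    CalcID AUTOINCREMENT PRIMARY KEY,
    NetworkID LONG CONSTRAINT fkNetwork REFERENCES tblNetworks (NetworkID),
    CalcDesc TEXT(50),
    Result DOUBLE
);

INSERT INTO tblCalculations (NetworkID, CalcDesc, Result)
VALUES (1, 'Characteristic001', 3.14159);

You'd execute each statement separately, since Access won't run a batch of statements in a single Execute call.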
You'll end up with tblNetworks with 150k records and tblCalculations with 1,200 x 150k records or so. When your tables grow longer rather than wider as things change, that's a good indication you designed it right.
If you're really unfamiliar with Access, I recommend learning how to create Tables, setting up Relationships, and Referential Integrity. If you don't know SQL, search for INSERT INTO. And if you haven't used ADO before in Excel, search for ADODB Connections and the Execute method.
Update
You can definitely get away with a CSV for this. Like you said, it's pretty low overhead. Whether a text file or a database is the right answer probably depends more on how you're going to use the data and how often.
If you're going to pull this into Excel a small number of times, do a few sorts or filters, maybe a pivot table, then any performance hit you get from a CSV isn't going to be that bad. And if you only need to deal with a subset of the data at a time, you can use ADO to read a text file and only pull in the data you want at that time, further mitigating the slowness of sorting and filtering 150k rows. Not to mention if you have a few gigs of RAM, 150k x 1,200 probably won't be bad at all.
If you find that the performance of a CSV stinks because your hardware isn't up to the task, you have to access it often, or you're doing a ton of different queries against the data, it may be to your benefit to use the database. If your fields are structured as you say, you may benefit from even more tables. You'd still have the network table and the calc table, but you'd also have Market, Slot, and Characteristic tables. Then your Calc table would look like:
CalcID
CalcDesc
NetworkID
MarketID
SlotID
CharacteristicID
Result
If you're looking up data a lot of times and you need it quickly, you're not going to do better than a bunch of INNER JOINs on those tables and a WHERE clause that limits what you want.
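For example (Access SQL, with the lookup tables and column names invented to match the structure above):

SELECT n.NetworkName, m.MarketName, s.SlotName, ch.CharacteristicName, c.Result
FROM (((tblCalculations AS c
INNER JOIN tblNetworks AS n ON c.NetworkID = n.NetworkID)
INNER JOIN tblMarkets AS m ON c.MarketID = m.MarketID)
INNER JOIN tblSlots AS s ON c.SlotID = s.SlotID)
INNER JOIN tblCharacteristics AS ch ON c.CharacteristicID = ch.CharacteristicID
WHERE n.NetworkName = 'Network00042'
AND ch.CharacteristicName = 'Latency';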
But only you can decide if it's worth all the setup and overhead of using a database. And because of that, I would start down the CSV path until the reason to change presented itself. I would design my code in a way that switching from CSV to database only touched a few procedures (like by using class modules) so that the change didn't affect any already-tested business logic.
I am parsing one flat file that results in a hierarchy of related records 4 levels deep.
I'd like to calculate the next Identity value from each table (using IDENT_CURRENT and IDENT_INCR functions) and then parse the files in memory, assigning and incrementing IDs as I process the file. Lastly I'd just BCP (or other task if I decide to do this in SSIS) the file(s) in, starting with the top of the hierarchy of course.
This will be done during off-hours, and I would be able to lock the tables to ensure no inserts could be performed in the meantime.
Aside from a lengthy transaction, I don't see any issues with this approach... It does seem a bit too easy though - am I missing something?
I don't think so. SSIS is quick with these sorts of transformations, too, should you choose to use that as your tool.
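For reference, the identity arithmetic the question describes is just the sum of the two functions, and holding an exclusive table lock for the length of the transaction is what keeps anyone else from grabbing those values. This is only a sketch of the idea, with a placeholder table name:

BEGIN TRANSACTION;

-- take an exclusive table lock for the duration of the transaction
SELECT COUNT(*) FROM dbo.ParentTable WITH (TABLOCKX, HOLDLOCK);

-- next identity value = current identity + increment
SELECT IDENT_CURRENT('dbo.ParentTable') + IDENT_INCR('dbo.ParentTable') AS NextId;

-- ... assign IDs in memory and BCP the files in here ...

COMMIT TRANSACTION;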
Jumping from article to article, I can see everywhere the expression "bulk loading".
What does it really (technically) mean?
What does it imply?
Explanation based on use-cases is welcome.
Indexes are usually optimized for inserting rows one at a time. When you are adding a great deal of data at once, inserting rows one at a time can be inefficient. For instance, with a B-Tree, the optimal way to insert a single key is a very poor way of adding a bunch of data to an empty index.
Instead, you pursue a different strategy with B-Trees. You presort all of the data and group it into blocks. You can then build a new B-Tree by transforming the blocks into tree nodes. Although both techniques have the same asymptotic performance, O(n log n), the bulk-load operation has a much smaller constant factor.
Bulk loading is a way to load data (typically into a database) in 'large chunks'. Where you might enter a customer or a purchase order or information about items in inventory one at a time into your system, bulk loading takes a file of this same sort of information and loads hundreds/thousands/millions of records in a short period of time.
If you convert from one kind of DBMS to another, you would hope not to enter all the information into the new DB from the old DB. Instead, you would dump the information from the old DB to a file in a format that can be easily read by the new DB and then import that data into the new DB.
That's what bulk loading entails (at the 35,000-foot level, anyway).
Bulk loading is used to import/export large amounts of data. Usually bulk operations are not logged and transactional integrity might not work as expected. Often bulk operations bypass triggers and integrity checks like constraints. This improves performance, for large amounts of data, quite significantly.
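As a concrete example, in SQL Server a bulk load from a flat file might look like this (the path and table are hypothetical); the TABLOCK hint is one of the prerequisites for a minimally logged load:

BULK INSERT dbo.Customers
FROM 'C:\loads\customers.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2,
    TABLOCK
);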
One thing to remember is that bulk loading implies the data content in the target matches the source, but this is only true if the source system is quiesced. For any data source, and especially for large data, the source data can change after it has been read, while the data transfer is happening. Traditionally, online systems either have to go offline or suspend updates if an exact point-in-time capture that matches the source is required.
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification - When stored in a simple table / array, access to entries is O(1) - just a memory read (or a read from disk). As I understand it, all databases store their nodes in trees, so they cannot achieve identical performance - access to an average node will take a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work with databases all day long that have millions of rows. They are efficient enough.
I don't know what your definition of "achieve similar efficiency using a database" means. In a database (from my experience), what exactly you are trying to do matters for performance.
If you simply need a single record based on a primary key, the database should naturally be efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the aforementioned. There are also many other options, I'm sure.
Update: Based on your explanation in the comments, I'd say no - you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and data reads/writes. In your description, the data in an in-memory array is for a single program to access, and that program owns the memory. It's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: The absolute closest thing you could get to this, in SQL Server specifically, is using a table variable. A table variable (in theory) is held in memory only. I've heard people refer to table variables as SQL Server's "array". Any regular table write or create statement prompts the RDBMS to write to disk (I think, first to the log and then to the data files). And large data reads can also cause the DB to write to private temp tables to store data for later, or what have you.
There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indexes are stored separately, or whether the data is stored in one index tree (a clustered index, as Brian described).
But in your case this does not matter much, because:
All databases use caching heavily. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
If you are reading a single record at a time, the main overhead you will see is in getting the data across the database protocol. The process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes SQL (parse SQL, checks if SQL command is previously compiled, compiles command if it is first time issued, ...)
database executes SQL. After few executions data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store data in an array, there would be no visible gain.
If you've got a decent amount of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed record length flat files. And yes, they are super-efficient compared to databases, but like structure/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a DB with an ID column and a bit column. Use a clustered index for the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in order or it will be slow). This is kind of inefficient in terms of space (you're using n·log(n) space instead of n space).
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as a counter in most DB systems, in which case you can just insert the 1,000,000 items and it will do the counting for you. (I am not sure if such a DB avoids explicitly storing the counter's value, but if it does then you'd only end up using n space.)
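A rough T-SQL sketch of that table (names are placeholders):

CREATE TABLE dbo.DenseValues (
    Id INT NOT NULL,
    Flag BIT NOT NULL,
    CONSTRAINT PK_DenseValues PRIMARY KEY CLUSTERED (Id)
);

Declaring Id as INT IDENTITY(0,1) instead would give you the counter behaviour mentioned above.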
When you have your primary key as an integer sequence, it would be a good idea to have a reverse index. This ensures that contiguous values are spread apart in the index tree.
However, there is a catch - with reverse indexes you will not be able to do range searching.
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: index-organized table (this might be what you are looking for; see the sketch at the end of this answer)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
For the actual question, precisely as asked: write all rows into a single blob (the table contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on blobs. Even if it works, I don't think this approach would be useful in any realistic scenario.
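For the index-organized table idea above, the Oracle DDL is a one-liner (names are placeholders); the table's rows are stored directly in the primary key's B-tree:

CREATE TABLE dense_values (
    id NUMBER(7) PRIMARY KEY,
    val NUMBER
) ORGANIZATION INDEX;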