I'm currently working on a project that involves a third-party database and application. So far we have been able to test successfully and to interface data between our databases. However, we run into trouble when we extract a large set of data (e.g. 100,000 rows with 10 columns per row) and the process stops in the middle of the transaction for whatever reason (e.g. a blackout or a forced exit). In that scenario, data ends up missing or duplicated.
Can you please give us some suggestions for handling these types of scenarios? Thank you!
Here's our current interface structure:
OurDB -> InterfaceDB -> 3rdPartyDB
OurDB: we extract records from OurDB (those with the bit column set to false) into InterfaceDB
InterfaceDB: after inserting the records from OurDB, we update the OurDB bit column to true
3rdPartyDB: they extract and delete all records from InterfaceDB (they assume that all records are for extraction)
Well, you definitely need an ETL tool then, preferably SSIS. First, it will drastically improve your transfer rates while also providing robust error handling. Additionally, you will have to use Lookup transforms to ensure duplicates do not enter the system. I would suggest using the Cache Connection Manager to perform the lookups.
In terms of design, if your source system (OurDB) has a primary key, say recId, then add a column, say source_rec_id, to your InterfaceDB table. Say your first run transferred 100 rows. In your second run, you would then pick up from the 101st record and move on through the following rows. This gives you a tracking mechanism and a one-to-one correlation between the source and destination systems, so you can tell how many records have been transferred, how many are left, and so on.
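One way to sketch this, so that an interruption cannot leave the flag and the copied rows out of step, is to flag and copy each batch inside a single transaction. This is only a sketch under assumptions: the names SourceTable, Staging, payload and is_sent are hypothetical (only recId and source_rec_id come from the text above), both databases are assumed to sit on the same SQL Server instance (2005 or later), and the batch size of 5000 is arbitrary.

```sql
-- Hypothetical schema: OurDB.dbo.SourceTable(recId, payload, is_sent)
--                      InterfaceDB.dbo.Staging(source_rec_id, payload)
SET XACT_ABORT ON;  -- a runtime error rolls back the whole transaction

DECLARE @batch TABLE (recId INT PRIMARY KEY);

BEGIN TRANSACTION;

-- Flag the next batch and record exactly which rows were flagged.
UPDATE TOP (5000) s
SET    s.is_sent = 1
OUTPUT inserted.recId INTO @batch (recId)
FROM   OurDB.dbo.SourceTable AS s
WHERE  s.is_sent = 0;

-- Copy only those rows; the NOT EXISTS guard skips rows that are
-- still sitting in the staging table from an earlier run.
INSERT INTO InterfaceDB.dbo.Staging (source_rec_id, payload)
SELECT s.recId, s.payload
FROM   OurDB.dbo.SourceTable AS s
JOIN   @batch AS b ON b.recId = s.recId
WHERE  NOT EXISTS (SELECT 1
                   FROM   InterfaceDB.dbo.Staging AS t
                   WHERE  t.source_rec_id = s.recId);

COMMIT TRANSACTION;
```

Running the batch repeatedly until the UPDATE affects no rows drains the backlog. If the process dies mid-batch, the rollback leaves those rows unflagged, so the next run picks them up again. If the two databases are on different servers, a distributed transaction or a different design would be needed.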
For a good introduction to SSIS, see the SSIS material on Channel 9 (MSDN). It is a very helpful resource.
Related
I need to perform a task on a table that has 19 columns of the text data type. I want to remove these columns from the source table and move them to a new table as varchar(max). The source table currently has 30k rows (with text data), and this will grow as the client uses the database for record storage. To transfer the existing data I tried an "insert into..select.." query, but it takes around 25-30 minutes to transfer that many rows (30k). The same is true of a "Select from..insert.." query. I have also tried an SSIS data flow task with OLE DB as both source and destination, but it takes the same amount of time. I'm really confused, as every post on the internet suggests that SSIS is the fastest way to transfer data. Can you please suggest a better way to improve the performance of this data transfer, using any technique?
Thanks
SSIS probably isn't faster if the source and the destination are in the same database and the SSIS process is on the same box.
One approach might be to figure out where you are spending the time and optimise that. If you set Management Studio to "discard results after execution" and run just the select part of your query, how long does that take? If this is a substantial part of the 25-30 minutes then work on optimising that.
If the select statement turns out to be really fast, then all the time is being spent on the insert and you need to look at improving that part of the process instead. There are a couple of things you can try here before you go hardware shopping: are there any indexes or constraints (or triggers!) on the target table that you can drop for the duration of the insert and put back again at the end? Can you put the database in the simple recovery model?
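The suggestions above might look roughly like this in T-SQL. Treat it as a sketch: MyDb, dbo.TargetTable, dbo.SourceTable, IX_TargetTable_Col1 and the column names are placeholders, and switching the recovery model affects log backups, so that should be checked before trying it on a production database.

```sql
-- Placeholder names throughout; adapt to the real schema.
ALTER DATABASE MyDb SET RECOVERY SIMPLE;

-- Disable a nonclustered index on the target for the duration of the load.
-- (Do not disable the clustered index: that makes the table inaccessible.)
ALTER INDEX IX_TargetTable_Col1 ON dbo.TargetTable DISABLE;

-- TABLOCK takes a table-level lock; under some conditions this also
-- allows the insert to be minimally logged.
INSERT INTO dbo.TargetTable WITH (TABLOCK) (id, col1)
SELECT id, CAST(col1 AS VARCHAR(MAX))
FROM   dbo.SourceTable;

-- Rebuilding the index re-enables it.
ALTER INDEX IX_TargetTable_Col1 ON dbo.TargetTable REBUILD;
```

Timing the SELECT on its own first, as described above, shows whether any of this is worth doing.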
We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive. (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the select, are you? What about rollbacks? So you end up with some manager that first runs the select and then writes the tracking information. And what is the point of storing the tracking information together with the entity data and sending it back to the DB? Save it to a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table: the id of the record, datetime, userid (maybe IP address, browser version, etc.), and just about anything else I could capture that might possibly be relevant. (For example, six months from now your manager may decide that s/he wants to know not only which records were used the most, but also which users are using the most records, or at what time of day that usage occurs.)
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never wish you didn't have it, and it would be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure, that also issues the logging command, so that the client is not doing two roundtrips (one for the select, one for the update/insert).
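A minimal sketch of such a stored procedure is shown below. All object names (dbo.Records, dbo.RecordAccessLog, dbo.GetRecord, col1, col2) are hypothetical, and the logged columns are just the ones suggested above. GO is the batch separator used by Management Studio and sqlcmd, not a T-SQL statement.

```sql
-- Hypothetical log table: one row per access.
CREATE TABLE dbo.RecordAccessLog (
    record_id   INT      NOT NULL,
    accessed_at DATETIME NOT NULL,
    user_id     INT      NULL
);
GO

CREATE PROCEDURE dbo.GetRecord
    @recordId INT,
    @userId   INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Log the access (only if the record actually exists).
    INSERT INTO dbo.RecordAccessLog (record_id, accessed_at, user_id)
    SELECT id, GETDATE(), @userId
    FROM   dbo.Records
    WHERE  id = @recordId;

    -- Return the data to the caller in the same round trip.
    SELECT id, col1, col2
    FROM   dbo.Records
    WHERE  id = @recordId;
END;
GO
```

The client would then call something like `EXEC dbo.GetRecord @recordId = 42, @userId = 7;` instead of issuing the SELECT directly.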
Alternatively, if this is a web application, you could use an async ajax call to issue the logging action, which wouldn't slow down the user's experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and performance is one of the most critical concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for "database auditing for SELECT statements".
Use another table as a key/value pair with two columns(e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query in the counting table. E.g. as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY syntax is off the top of my head and is MySQL-specific. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
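For SQL Server 2008 or later, the MERGE equivalent might look like the sketch below. It reuses the myTable and countTable names from the example above and assumes id is unique in myTable, since MERGE raises an error if the source matches the same target row more than once. Older versions have no MERGE; there, an UPDATE followed by an INSERT ... WHERE NOT EXISTS would be needed instead.

```sql
-- Upsert the counter for every id returned by the original SELECT.
MERGE countTable AS target
USING (SELECT id
       FROM   myTable
       WHERE  stuff1 = 'somevalue') AS source (id_selected)
ON target.id_selected = source.id_selected
WHEN MATCHED THEN
    -- Row already counted before: bump the counter.
    UPDATE SET target.times = target.times + 1
WHEN NOT MATCHED BY TARGET THEN
    -- First time this id is selected: start the counter at 1.
    INSERT (id_selected, times)
    VALUES (source.id_selected, 1);
```

Note that a MERGE statement must be terminated with a semicolon.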
I'm a newbie in SQL Server and have the following dilemma:
I have two tables with the same structure. Call it runningTbl and finalTbl.
runningTbl contains about 600 000 to 1 million rows every 15 minutes.
After doing some data cleanup in runningTbl I want to move all the records to finalTbl.
finalTbl currently has about 38 million rows.
The above process needs to be repeated every 15-20 minutes.
The problem is that the moving of data from runningTbl to finalTbl is taking way longer than 20 minutes at times..
Initially when the tables were small it took anything from 10 seconds to 2 minutes to copy.
Now it just takes too long.
Can anyone assist with this? SQL query to follow.
Thanks
There are a number of things that you will need to do in order to get the most efficient method of copying the data. So far you are on the right track but you have a long way to go. I would suggest you first look at your indexes. There may be optimizations there that can help. Next, make sure you don't have triggers on this table that could cause a slowdown. Next, change the logging level (if that is permitted).
There is a bunch more help here (from Microsoft):
http://msdn.microsoft.com/en-us/library/ms190421(v=SQL.90).aspx
Basically you are on the right track using BCP. This is actually Microsoft's recommendation:
To bulk-copy data from one instance of SQL Server to another, use bcp to export the table data into a data file. Then use one of the bulk import methods to import the data from the file to a table. Perform both the bulk export and bulk import operations using either native or Unicode native format.
When you do this though, you need to also consider the possibility of dropping your indexes if there is too much data being brought in (based upon the type of index you use). If you use a clustered index, it may also be a good idea to order your data before import. Here is more information (including the source of the above quote):
http://msdn.microsoft.com/en-US/library/ms177445(v=SQL.90).aspx
For starters: one of the things I've learned over the years is that MSSQL does a great job of optimizing all kinds of operations, but to do so it relies heavily on the statistics for all tables involved. Hence, I would suggest running "UPDATE STATISTICS processed_logs" and "UPDATE STATISTICS unprocessed_logs" before running the actual inserts; even on a large table these don't take all that long.
Apart from that, based on the query above, a lot depends on the indexes of the target table. I'm assuming the target table has its clustered index (or PRIMARY KEY) on (at least) UnixTime, if not you'll create major data-fragmentation when you squeeze more and more data in-between the already existing records. To work around this you could try defragmenting the target table once in a while (can be done online, but takes a long time), but making the clustered index (or PK) so that data is always appended to the end of the table would be the better approach; well, at least in my opinion.
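As a rough sketch of the maintenance steps described above, assuming processed_logs is the target table (the answer does not say which of the two is the target) and SQL Server 2005 or later:

```sql
-- Refresh statistics before the inserts, as suggested above.
UPDATE STATISTICS processed_logs;
UPDATE STATISTICS unprocessed_logs;

-- Check how fragmented the target table's indexes currently are.
DECLARE @db INT, @obj INT;
SET @db  = DB_ID();
SET @obj = OBJECT_ID('processed_logs');

SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(@db, @obj, NULL, NULL, 'LIMITED') AS ps
JOIN   sys.indexes AS i
       ON  i.object_id = ps.object_id
       AND i.index_id  = ps.index_id;

-- Defragment in place; REORGANIZE is always an online operation.
ALTER INDEX ALL ON processed_logs REORGANIZE;
```

If the fragmentation figure climbs back quickly after each load, that would support changing the clustered index so that new rows are appended at the end.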
I suggest that you use a Windows service with a timer and a boolean variable. Once your request is sent to the server, set the boolean to true; the timer event should not execute its code again until the boolean is back to false.
Are there best practices out there for loading data into a database, to be used with a new installation of an application? For example, for application foo to run, it needs some basic data before it can even be started. I've used a couple options in the past:
TSQL for every row that needs to be preloaded:
IF NOT EXISTS (SELECT * FROM [Master].[Site] WHERE Name = @SiteName)
INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
VALUES (@EnterpriseId, @SiteName, GETDATE(), @LastModifiedUser)
Another option is a spreadsheet. Each tab represents a table, and data is entered into the spreadsheet as we realize we need it. Then, a program can read this spreadsheet and populate the DB.
There are complicating factors, including the relationships between tables. So, it's not as simple as loading tables by themselves. For example, if we create Security.Member rows, then we want to add those members to Security.Role, we need a way of maintaining that relationship.
Another factor is that not all databases will be missing this data. Some locations will already have most of the data, and others (that may be new locations around the world), will start from scratch.
Any ideas are appreciated.
If it's not a lot of data (just the bare initialization of configuration data), we typically script it along with any database creation/modification.
With scripts you have a lot of control, so you can insert only missing rows, remove rows which are known to be obsolete, not override certain columns which have been customized, etc.
If it's a lot of data, then you probably want to have an external file(s) - I would avoid a spreadsheet, and use a plain text file(s) instead (BULK INSERT). You could load this into a staging area and still use techniques like you might use in a script to ensure you don't clobber any special customization in the destination. And because it's under script control, you've got control of the order of operations to ensure referential integrity.
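A sketch of the staging approach for the [Master].[Site] table from the question might look like this. The file path, the tab-delimited layout, the column types and the 'seed-script' user value are all assumptions made for illustration.

```sql
-- Staging area for the raw file contents (column types are assumed).
CREATE TABLE #SiteStaging (
    EnterpriseID INT           NOT NULL,
    Name         NVARCHAR(100) NOT NULL
);

-- Load the plain text file; path and delimiters are hypothetical.
BULK INSERT #SiteStaging
FROM 'C:\seed\sites.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');

-- Insert only the rows the destination does not already have,
-- so existing (possibly customized) rows are left untouched.
INSERT INTO [Master].[Site]
       ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
SELECT s.EnterpriseID, s.Name, GETDATE(), 'seed-script'
FROM   #SiteStaging AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   [Master].[Site] AS t
                   WHERE  t.Name = s.Name);

DROP TABLE #SiteStaging;
```

Because the script is safe to re-run, the same file can be applied both to new installations and to locations that already have most of the data. Loading parent tables before child tables in the same script keeps referential integrity intact.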
I'd recommend a combination of the 2 approaches indicated by Cade's answer.
Step 1. Load all the needed data into temp tables (on Sybase, for example, load data for table "db1..table1" into "temp..db1_table1"). In order to be able to handle large datasets, use bulk copy mechanism (whichever one your DB server supports) without writing to transaction log.
Step 2. Run a script which, as its main step, iterates over each table to be loaded: if needed, create indexes on the newly created temp table, compare the data in the temp table to the main table, and insert/update/delete the differences. Then, as needed, the script can do auxiliary tasks like the security role setup you mentioned.
I've inherited an SSIS package which loads 500K rows (about 30 columns) into a staging table.
It's been cooking now for about 120 minutes and it's not done --- this suggests it's running at less than 70 rows per second. I know that everybody's environment is different but I think this is a couple orders of magnitude off from "typical".
Oddly enough the staging table has a PK constraint on an INT (identity) column -- and now I'm thinking that it may be hampering the load performance. There are no other constraints, indexes, or triggers on the staging table.
Any suggestions?
---- Additional information ------
The source is a tab delimited file which connects to two separate Data Flow Components that add some static data (the run date, and batch ID) to the stream, which then connects to an OLE DB Destination Adapter
Access mode is OpenRowset using FastLoad
FastLoadOptions are TABLOCK,CHECK_CONSTRAINTS
Maximum insert commit size: 0
I’m not sure about the etiquette of answering my own question -- so sorry in advance if this is better suited for a comment.
The issue was the datatype of the input columns from the text file: They were all declared as “text stream [DT_TEXT]” and when I changed that to “String [DT_STR]” 2 million rows loaded in 58 seconds which is now in the realm of “typical” -- I'm not sure what the Text file source is doing when columns are declared that way, but it's behind me now!
I'd say there is a problem of some sort. I bulk insert into a staging table from a file with 20 million records, more columns, and an identity field in far less time than that, and SSIS is supposed to be faster than SQL Server 2000 bulk insert.
Have you checked for blocking issues?
If it is running in one big transaction, that may explain things. Make sure that a commit is done every now and then.
You may also want to check processor load, memory and IO to rule out resource issues.
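The blocking check mentioned above can be done while the package is running with a query along these lines (a sketch for SQL Server 2005 or later, using the sys.dm_exec_requests dynamic management view):

```sql
-- List requests that are currently waiting on another session.
SELECT session_id,
       blocking_session_id,   -- the session holding the lock
       wait_type,
       wait_time,             -- milliseconds spent waiting so far
       command
FROM   sys.dm_exec_requests
WHERE  blocking_session_id <> 0;
```

If the load's session shows up here, the blocking_session_id column points at the session to investigate.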
This is hard to say.
If there were a complex ETL, I would check the maximum number of threads allowed in the data flows and see if some things can run in parallel.
But it sounds like it's a simple transfer.
With 500,000 rows, batching is an option, but I wouldn't think it necessary for that few rows.
The PK identity should not be an issue. Do you have any complex constraints or persisted calculated columns on the destination?
Is this pulling or pushing over a slow network link? Is it pulling or pushing from a complex SP or view? What is the data source?