I have a pretty large file, 10 GB in size, and I need to load its records into the DB.
I want to add two additional columns:
LoadId, which is a constant (this indicates the file's unique load number)
ChunkNumber, which indicates which chunk (of the batch size) the row belongs to.
So if I have a batch size of 10,000 records, I want
LoadId = {GUID}
ChunkNumber = 1
and for the next 10,000 records I want
LoadId = {GUID}
ChunkNumber = 2
Is this possible in SSIS? I suppose I could write a custom component for this, but there should be an inbuilt ID I could use, since SSIS is already running the inserts in batches of 10,000.
Can someone help me figure out whether this parameter exists and whether it can be used?
OK, a little more detail on the background of what and why.
Once we get the data into slices of 10,000 records, we can start calling the stored procedures to enrich the data in chunks. All I am trying to find out is whether SSIS can help here by stamping each chunk with a chunk number and a GUID.
This helps the stored proc move the data in chunks. Although I could do this after the fact with a row number, the SELECT would have to travel through the whole set again to update the chunk numbers, which is double the effort.
A GUID will represent the complete dataset, and the individual chunks are related to it.
Some more insight: there is a working table we import this large file into, and if we start enriching all the data at once the transaction log gets used up. It is more manageable if we can get the data into chunks, so that the transaction log does not blow up and we can also parallelize the enrichment process.
The data moves from a de-normalized format to a normalized format from here. A stored procedure is more maintainable in terms of releases and day-to-day management, so any help is appreciated.
Or is there another, better way of dealing with this?
For the LoadID, you could use the SSIS variable
System::ExecutionInstanceGUID
which is generated by SSIS when the package runs.
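For the ChunkNumber, as far as I know SSIS does not expose a per-batch counter directly, so one option (a hedged sketch, not from the original answer) is to give the working table an IDENTITY column and derive the chunk number arithmetically, which avoids a second UPDATE pass. All object names below (dbo.WorkingTable, RowId, dbo.EnrichChunk) are placeholders.

-- Hedged sketch: derive the chunk from an IDENTITY column (RowId) at 10,000 rows per chunk,
-- then call the enrichment proc once per chunk. All object names are assumptions.
DECLARE @LoadId UNIQUEIDENTIFIER = NEWID();   -- in practice, pass in System::ExecutionInstanceGUID
DECLARE @ChunkSize INT = 10000;
DECLARE @Chunk INT = 1, @MaxChunk INT, @MinRowId BIGINT;

SELECT @MinRowId = MIN(RowId),
       @MaxChunk = CEILING(COUNT(*) / (@ChunkSize * 1.0))
FROM dbo.WorkingTable
WHERE LoadId = @LoadId;

WHILE @Chunk <= @MaxChunk
BEGIN
    -- The proc can select its slice with: ((RowId - @MinRowId) / @ChunkSize) + 1 = @Chunk
    EXEC dbo.EnrichChunk @LoadId = @LoadId, @ChunkNumber = @Chunk,
                         @MinRowId = @MinRowId, @ChunkSize = @ChunkSize;   -- placeholder proc
    SET @Chunk += 1;
END;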
Related
I am encountering this weird problem and any help will be really appreciated.
I have a single container in which I have 2 data flow tasks in my SSIS package, and the data transfer is very large. Breakdown of the problem:
The first data flow task transfers around 130 million rows from Oracle to SQL Server; it ran just fine and moved the rows in about 40 to 60 minutes, which is very much acceptable.
Now comes the second part: another data flow task transfers around 86 million rows from SQL Server to SQL Server (one table only). The transfer flies along until about 60-70 million rows, and after that it just dies out or crawls; the next 10 million rows took 15 hours. I am not able to understand why this is happening.
The table gets truncated and then loaded. I have tried increasing the data buffer properties etc., but to no avail.
Thanks in advance for any help.
You are creating a single transaction and the transaction log is filling up. You can get 10-100x faster speeds if you commit 10,000 rows at a time. You may also try setting Maximum Insert Commit Size to 0, or try 5,000 and work your way up, to see the impact on performance. This is on the OLE DB Destination component. In my experience 10,000 rows is the current magic number that seems to be the sweet spot, but of course it is very dependent on how large the rows are, the version of SQL Server, and the hardware setup.
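If you want to confirm that the log (or a wait) is the bottleneck while the second data flow is running, a quick diagnostic sketch using standard SQL Server commands:

-- Watch log usage and current waits during the load.
DBCC SQLPERF(LOGSPACE);                 -- log space used (%) per database

SELECT session_id, command, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE session_id > 50;                  -- filters out most system sessions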
You should also check whether there are indexes on the target table; you can try dropping the indexes, loading the table, and then recreating the indexes.
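A hedged sketch of that idea, using disable/rebuild so the index definitions are kept; the table and index names are placeholders:

-- Disable nonclustered indexes before the load, rebuild them afterwards.
ALTER INDEX IX_Target_SomeColumn ON dbo.TargetTable DISABLE;

-- ... run the SSIS data flow here ...

ALTER INDEX IX_Target_SomeColumn ON dbo.TargetTable REBUILD;

(Do not disable the clustered index this way, as that makes the table inaccessible.)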
What is your destination recovery model? Full/Simple, etc...
Are there any transformations between the source and destination? Try sending the source to a Row Count transformation to determine the maximum speed your source can deliver data. You may be seeing a slowdown on the source side as well.
Is there any difference in content of the rows once you notice the slow down? For example, maybe the more recent rows have lots of text in a varchar(max) column that the early rows did not make use of.
Is your destination running on a VM? If yes, have you pre-allocated the CPU and RAM? SSIS is multi-threaded, but it won't necessarily use 100% of each core. VM hosts may share the resources with other VMs because the SSIS VM is not reporting full usage of all of the resources.
I'm currently working on a project that involves a third-party database and application. So far we are able to successfully test and interface data between our databases. However, we are having trouble when we extract a large set of data (e.g. 100,000 rows with 10 columns per row) and the process suddenly stops in the middle of the transaction for whatever reason (e.g. blackouts, forced exit, etc.); missing or duplicated data happens in this type of scenario.
Can you please give us suggestions to handle these types of scenarios? Thank you!
Here's our current interface structure
OurDB -> Interface DB -> 3rdParty DB
OurDB: we extract records from OurDB (with a bit column set to false) into the InterfaceDB
InterfaceDB: after inserting the records from OurDB, we update the OurDB bit column to true
3rdPartyDB: they extract and then delete all records from InterfaceDB (they assume that all records are there for extraction)
Well, you definitely need an ETL tool then, and preferably SSIS. First, it will drastically improve your transfer rates while also providing robust error handling. Additionally, you will have to use Lookup transforms to ensure duplicates do not enter the system. I would suggest using the Cache Connection Manager to perform the lookups.
In terms of design, if your source system (OurDB) has a primary key, say recId, then have a column, say source_rec_id, in your InterfaceDB table. Say your first run has transferred 100 rows; in your second run, you would then need to pick up from the 101st record onwards. This way you have a tracking mechanism and a one-to-one correlation between the source and destination systems, so you can tell how many records have been transferred, how many are left, and so on.
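A minimal T-SQL sketch of that tracking idea, assuming the last transferred recId is persisted in a small control table; every database, table, and column name here is a placeholder rather than part of the original setup, and it assumes both databases sit on the same instance:

DECLARE @LastRecId INT, @NewLastRecId INT;

SELECT @LastRecId = ISNULL(MAX(last_rec_id), 0) FROM InterfaceDB.dbo.TransferControl;
SELECT @NewLastRecId = ISNULL(MAX(recId), @LastRecId) FROM OurDB.dbo.SourceTable;

BEGIN TRANSACTION;

-- Move only the rows the interface has not seen yet.
INSERT INTO InterfaceDB.dbo.InterfaceTable (source_rec_id, col1, col2)
SELECT s.recId, s.col1, s.col2
FROM OurDB.dbo.SourceTable AS s
WHERE s.recId > @LastRecId
  AND s.recId <= @NewLastRecId;

-- Advance the watermark in the same transaction so an interruption
-- cannot leave the transfer and the tracking out of step.
UPDATE InterfaceDB.dbo.TransferControl
SET last_rec_id = @NewLastRecId;

COMMIT TRANSACTION;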
For best understanding of SSIS go to Channel 9 - msdn - SSIS. Very helpful resource.
Edited on Oct 21st
Background
Raw datasets (changed every day) are stored on an MS-SQL based server: Sourcelib.RAW. There is also a LOCAL Excel file (which remains unchanged).
I need to refresh a dataset WANT located in Targetlib. Currently I have SQL code that performs this task in 2-3 minutes, but I want to know whether SAS can do the same thing without the processing time increasing much.
work.IMP is around 6M records at around 50 bytes per record.
The ideal method should be very efficient, because as time goes by the raw datasets on the server will become incredibly large.
The target file CANNOT be established once and then have new data appended to it every day, because there may be (even if very unlikely) changes in previous data.
As per #Joe, I should allow the target file to just be updated, using proc compare, or update in a data step. Here's a related question I posted: How to use proc compare to update dataset.
There is still more than 10 GB of free space on the server, which is enough. Available memory on my PC is around 3.5 GB (not sure if it matters).
Due to the architecture of the server, it's very efficient to do this in MS-SQL. But I REALLY want to know whether SAS can deal with this (when the server is not so "compatible").
Process
First I import the data from the Excel file and then subset and transpose it to make work.IMP. For some reasons, this file can only be created this way every day. It CANNOT be stored on the server.
Then I perform an outer join of work.IMP and one raw dataset, Sourcelib.RAW1, to get work.HAVE1. Please note that work.IMP is sorted but Sourcelib.RAW1 is unsorted. The outer join is only used (with some criteria) to determine each data record,
i.e. case when a.COL1 is '' then b.COL1 else a.COL1 end as COL1
You can consider this process to be adjusting Sourcelib.RAW1 by using work.IMP.
PS1: #sparc_spread suggests doing the import procedure directly on the server. But it has no benefit over doing it LOCALLY, and a hash object doesn't help here either.
Then I subset another raw dataset, Sourcelib.RAW2, to work.temp, and then sort it to make work.HAVE2. (The data in Sourcelib.RAW2 is mostly not in order.)
I concatenate work.HAVE1 and work.HAVE2 using proc append (because both tables are huge) to make work.HAVE.
PS2: The sorting in step 3 is to avoid sorting at the end of step 4. Actually the data in Targetlib.WANT doesn't have to be in order, but it's better if it is.
At the very end, I copy work.HAVE to the server as Targetlib.HAVE.
I do most of this in WORK, which only takes a few minutes. But step 5 can take half an hour to finish the copy.
As per #Joe, this may be mainly due to something related to network transit, i.e. I should minimize the network transit.
Question
Is there any way to improve step 5? Or is there any modification of the whole process that would improve the performance?
Couple of thoughts.
First off, assuming this is a SAS dataset and not a SQL database or something else, options compress=binary; is a good idea assuming this is mostly numeric (and options compress=character; if not). Either will reduce the physical size of the dataset significantly in most cases.
Second, 300MB is not very much in the scheme of things. My network would write that in less than a minute. What the conditions of your network are may drive some of the other choices you make; if the single slow point is simply copying data across it, for example, then you need to figure out how to reduce that at the cost of anything else you do.
Assuming you don't change anything else, I would recommend writing have1 directly to the network as have, and then appending have2 to it. That is, whatever step creates have1, have it write directly to the network library. This includes sort steps, note: so if you create it and then sort it, create it locally and sort it with out= pointing to the network library. This reduces the total amount of writing done (as you don't write a useless copy of have1 to your local drive). It helps if writing locally is a relevant cost in your total process, but won't help if the problem is almost entirely network congestion.
Copying files with the OS's copy is almost always superior to any other method of copying, so if network congestion is the only factor you care about, you may want to make it locally (in WORK or in a local but constant directory, like C:\temp\ or similar) and then have the last step of your process be executing copy c:\temp\have.sas7bdat \\networklocation\whereitgoes\. This will usually outperform SAS methods for same, as it can take advantage of efficient techniques.
PROC COPY is another option to get around network congestion; it's probably faster than PROC APPEND (if the local write-out is negligible, as it would be for me for <1 GB data), and has the advantage that it's a bit safer in case something happens during the transport. (Append should be fine also, but with COPY you know for sure nothing was changed from yesterday's file.)
Finally, you may want to figure out some way to allow the target file to just be updated. This isn't all that hard to do in most cases. One example would be to keep a copy of yesterday's file, do a PROC COMPARE to today's file, and then include in the update file every record that is changed (regardless of what the change is). Then delete any matching old records from the target file and append the new records. This is very fast comparatively in terms of total records sent over the network, so saves a lot of time overall if network congestion is the main issue (but takes more CPU time to do the PROC COMPARE).
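For illustration only (the answer above describes this with SAS steps such as PROC COMPARE), the "delete the matching old records, then append the changed ones" update can also be sketched in SQL terms, assuming the changed rows have already been landed in a small staging table on the server; every object name here is a placeholder:

-- Apply only the changed rows to the target table.
BEGIN TRANSACTION;

DELETE t
FROM dbo.WANT AS t
INNER JOIN dbo.WANT_CHANGED AS c
    ON c.[key] = t.[key];          -- remove the old version of any changed key

INSERT INTO dbo.WANT ([key], val)
SELECT [key], val
FROM dbo.WANT_CHANGED;             -- append the new version

COMMIT TRANSACTION;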
You can use PROC APPEND to efficiently create a new dataset, not just append to an existing one; thus you can use it to basically combine steps 3 and 4 into this:
/* To fulfill requirement 4, delete the existing Targetlib.HAVE */
PROC DELETE DATA=Targetlib.HAVE;
RUN;
/* Targetlib.HAVE does not yet exist, so the first APPEND creates it */
PROC APPEND BASE=Targetlib.HAVE DATA=work.HAVE1;
RUN;
PROC APPEND BASE=Targetlib.HAVE DATA=work.HAVE2;
RUN;
That should save at least some time, but that still doesn't solve all your problems... I have put some additional questions in comments on the question and will update this answer as much as I can based on them.
Update 1
Here is a way to do the left join and the concatenation in one step and write the results immediately to targetlib. I cannot guarantee this will be faster, but it is worth a try. I used key and val as field names; replace them as you see fit.
PROC SQL _METHOD;
CREATE TABLE targetlib.HAVE
AS
SELECT
a.key ,
CASE WHEN MISSING (a.val) THEN b.val ELSE a.val END AS val
FROM
Sourcelib.RAW1 AS a
LEFT JOIN
IMP AS b
ON
a.key = b.key
UNION
SELECT
c.*
FROM
Sourcelib.RAW2 AS c
ORDER BY
key
;
QUIT;
The _METHOD is a sparsely documented SAS feature that will print the query plan, see this link. This may give you more insight. Also, I am assuming that IMP was already imported from the Excel, and that it is in WORK. Experiment to see whether importing it to targetlib and replacing IMP as b with targetlib.IMP as b is any faster here.
Since you are on Windows, experiment with the data option SGIO=YES after dataset names: e.g. Sourcelib.RAW1 AS a becomes Sourcelib.RAW1 (SGIO=YES) AS a. For more info on Windows SGIO and SAS, see this link and this older but more comprehensive one.
An approach that might be more efficient would be to avoid the join and use a hash object instead: good documentation for hash object can be found here and a good tip sheet is here. It's not clear whether this would be faster - imp has 6m records, but at 50 bytes per record, that's about 300 MB, which does fit into your RAM. The performance of a hash table with that many entries would depend a lot on SAS's hash algorithm. Anyway, here is the code using a hash object. In it, we assume that in the IMP dataset, the val field has been renamed to val2.
DATA targetlib.HAVE (DROP = rc val2);
    LENGTH val2 8;                           /* numeric lookup value filled from the hash */
    IF (_N_ = 1) THEN DO;
        DECLARE HASH h (DATASET: "IMP");     /* load IMP into memory once */
        h.DEFINEKEY ('key');
        h.DEFINEDATA ('val2');
        h.DEFINEDONE ();
    END;
    SET
        sourcelib.RAW1
        sourcelib.RAW2
    ;
    IF MISSING (val) THEN DO;
        rc = h.find();                       /* look up the current key in IMP */
        IF (rc = 0) THEN DO;
            val = val2;                      /* found: fill val from IMP */
        END;
    END;
RUN;
PROC SORT DATA = targetlib.HAVE ; BY KEY ; RUN ;
Try it and see if it is any faster. Once again, experiment with the location of IMP, using SGIO, etc. That PROC SORT at the end could be expensive; if the only reason you were sorting before was because of joining, then skip it.
In general, your approach to SAS should be to do as little I/O as possible, and find ways of combining multiple actions into a single write from a PROC or DATA step.
One of our sites has around 10,000 nodes. Each node has a simple CCK text/integer field. This integer changes daily, so it needs to be updated every day. The integer ranges from 1 to 20,000,000. The CCK field spans all content types, so it has its own table in the database. We don't use revisions. I chose to have it read a CSV file because this table is very simple, with 3 fields, all integers; I didn't need all the flexibility of doing a PHP array-type import.
I created a cron job to execute a PHP script every day which runs something similar to:
LOAD DATA LOCAL INFILE 'file.csv'
REPLACE INTO TABLE content_field_mycckfield
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(vid, nid, field_mycckfield_value);
At the end of the script, it counts how many records were imported and reports successes and errors.
The file is below public, and all that jazz.
Are there any other steps I am missing? Anything I should be aware of or cautious about?
Should I have it optimize or defragment this table after every run? Or after every (x) runs?
Should I have it imported first into a temp_ table to normalize the data, and then copied/moved into content_field_mycckfield?
10,000 records is big but not massive in MySQL terms, and the table is simple enough that I don't think you need any optimisation. If the data in the table is reliable and your .csv is always well formed, then there's not a lot to go wrong.
The separate issue is whether your import process is throwing errors. If there is even the remotest chance that the .csv could contain incorrect column references, missing commas, etc., then your idea to test everything in a temp table is certainly a good one.
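A hedged sketch of that temp-table approach, reusing the table and column names from the question; the validation step in the middle is a placeholder:

-- Load into a scratch table first, sanity-check it, then apply to the live table.
CREATE TABLE temp_mycckfield LIKE content_field_mycckfield;

LOAD DATA LOCAL INFILE 'file.csv'
REPLACE INTO TABLE temp_mycckfield
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(vid, nid, field_mycckfield_value);

-- Validate here (row counts, NULL checks, value ranges) before touching the live table.

REPLACE INTO content_field_mycckfield (vid, nid, field_mycckfield_value)
SELECT vid, nid, field_mycckfield_value
FROM temp_mycckfield;

DROP TABLE temp_mycckfield;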
The only other things I can think of are (in order of neuroticism):
Perform this operation overnight or whenever your site is unused
Have the PHP script catch errors and email you the results of each run
Have the script back up the table, run the .csv import, check for errors, and if there are errors, email you and restore the backup
Hope any of that helps!
I've inherited an SSIS package which loads 500K rows (about 30 columns) into a staging table.
It's been cooking now for about 120 minutes and it's not done --- this suggests it's running at less than 70 rows per second. I know that everybody's environment is different, but I think this is a couple of orders of magnitude off from "typical".
Oddly enough the staging table has a PK constraint on an INT (identity) column -- and now I'm thinking that it may be hampering the load performance. There are no other constraints, indexes, or triggers on the staging table.
Any suggestions?
---- Additional information ------
The source is a tab-delimited file which connects to two separate Data Flow components that add some static data (the run date and batch ID) to the stream, which then connects to an OLE DB Destination adapter.
Access mode is OpenRowset using FastLoad
FastLoadOptions are TABLOCK,CHECK_CONSTRAINTS
Maximum insert commit size: 0
I'm not sure about the etiquette of answering my own question, so sorry in advance if this is better suited to a comment.
The issue was the datatype of the input columns from the text file: they were all declared as "text stream [DT_TEXT]", and when I changed that to "string [DT_STR]", 2 million rows loaded in 58 seconds, which is now in the realm of "typical". I'm not sure what the text file source is doing when columns are declared that way, but it's behind me now!
I'd say there is a problem of some sort. I bulk insert into a staging table from a file with 20 million records, more columns, and an identity field in far less time than that, and SSIS is supposed to be faster than SQL Server 2000 bulk insert.
Have you checked for blocking issues?
If it is running in one big transaction, that may explain things. Make sure that a commit is done every now and then.
You may also want to check processor load, memory and IO to rule out resource issues.
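For the blocking check mentioned above, a minimal sketch against the standard sys.dm_exec_requests DMV:

-- List requests that are currently blocked and which session is blocking them.
SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;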
This is hard to say.
If there were complex ETL, I would check the maximum number of threads allowed in the data flows and see whether some things can run in parallel.
But it sounds like it's a simple transfer.
With 500,000 rows, batching is an option, but I wouldn't think it necessary for that few rows.
The PK identity should not be an issue. Do you have any complex constraints or persisted calculated columns on the destination?
Is this pulling or pushing over a slow network link? Is it pulling or pushing from a complex SP or view? What is the data source?