The fastest way to copy a huge dataset to server in SAS - sql-server

Edited on Oct 21st
Background
Raw datasets (which change every day) are stored on an MS-SQL based server as Sourcelib.RAW, and there is a LOCAL Excel file (which remains unchanged).
I need to refresh a dataset WANT located in Targetlib. Currently I have SQL code that performs this task in 2-3 minutes, but I want to know whether SAS can do the same thing without the processing time increasing much.
work.IMP is around 6M records and around 50 bytes per record.
The ideal method should be very efficient because, as time goes by, the raw datasets on the server will grow incredibly large.
The target file CANNOT be established once and then have new data appended to it every day, because there may be (even if very unlikely) changes to previous data.
As per @Joe, I should allow the target file to just be updated, using proc compare or update in a data step. Here's a related question I posted: How to use proc compare to update dataset
There is still more than 10GB of free space on the server, which is enough. Available memory on my PC is around 3.5GB (not sure if it matters).
Due to the architecture of the server, it's very efficient to do this in MS-SQL. But I REALLY want to know whether SAS can handle this (when the server is not so "compatible").
Process
Step 1: First I import the data from the Excel file and then subset & transpose it into work.IMP. For some reasons this file can only be created this way every day; it CANNOT be stored on the server.
Step 2: Then I perform an outer join of work.IMP with one raw dataset, Sourcelib.RAW1, to get work.HAVE1. Note that work.IMP is sorted but Sourcelib.RAW1 is not. The outer join is only used (with some criteria) to determine each data record, e.g.
case when a.COL1 = '' then b.COL1 else a.COL1 end as COL1
You can think of this process as adjusting Sourcelib.RAW1 using work.IMP.
PS1: @sparc_spread suggests doing the import directly on the server. But it wouldn't be any faster than doing it LOCALLY, and a hash object doesn't help here either.
Step 3: Then I subset another raw dataset, Sourcelib.RAW2, into work.temp, and sort it to get work.HAVE2. (The data in Sourcelib.RAW2 is mostly not in order.)
Step 4: I concatenate work.HAVE1 and work.HAVE2 using proc append (because both tables are huge) to get work.HAVE.
PS2: The sorting in step 3 is to avoid sorting at the end of step 4. Actually the data in Targetlib.WANT doesn't have to be in order, but it's better if it is.
Step 5: At the very end, I copy work.HAVE to the server as Targetlib.HAVE.
I do most of the work in WORK, which only takes a few minutes. But step 5 can take half an hour to finish the copy.
As per @Joe, this may be mainly due to network transit, i.e. I should minimize the network transit.
Question
Is there any way to improve step 5? Or would any modification of the whole process improve the performance?

Couple of thoughts.
First off, assuming this is a SAS dataset and not a SQL database or something else, options compress=binary; is a good idea assuming this is mostly numeric (and options compress=character; if not). Either will reduce the physical size of the dataset significantly in most cases.
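For reference, a minimal sketch of what that might look like (dataset names taken from the question; whether binary or character compression wins depends on your data):
/* Session-wide default for new datasets */
options compress=binary;   /* or compress=character for mostly-character data */
/* Or per-dataset, just for the table written to the server */
data Targetlib.HAVE (compress=binary);
    set work.HAVE;
run;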
Second, 300MB is not very much in the scheme of things. My network would write that in less than a minute. The condition of your network may drive some of the other choices you make; if the single slow point is simply copying data across it, for example, then you need to figure out how to reduce that at the cost of anything else you do.
Assuming you don't change anything else, I would recommend writing have1 directly to the network as have, and then appending have2 to it. That is, whatever step creates have1 should write directly to the network. Note that this includes sort steps: if you create it and then sort it, create it locally and sort it with out= pointing to the network library. This reduces the total amount of writing done (you don't write a useless copy of have1 to your local drive). It helps if writing locally is a relevant cost in your total process, but won't help if the cost is almost entirely network congestion.
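A rough illustration of the sort-with-OUT= idea (the BY variable key is assumed from the question's examples):
/* Build and sort HAVE1 locally, but land the sorted result directly in the
   network library so it is only written across the network once. */
proc sort data=work.HAVE1 out=Targetlib.HAVE;
    by key;
run;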
Copying files with the OS's copy command is almost always superior to any other method of copying, so if network congestion is the only factor you care about, you may want to create the file locally (in WORK or in a local but constant directory, like C:\temp\ or similar) and then have the last step of your process execute copy c:\temp\have.sas7bdat \\networklocation\whereitgoes\. This will usually outperform SAS methods for the same task, as it can take advantage of efficient copying techniques.
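If you want that OS-level copy to stay inside the SAS program, one hedged way to issue it on Windows is the X statement (the paths are the placeholder ones above):
/* Run the copy synchronously and without leaving a command window open. */
options noxwait xsync;
x 'copy "C:\temp\have.sas7bdat" "\\networklocation\whereitgoes\" /Y';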
PROC COPY is another option to get around network congestion; it's probably faster than PROC APPEND (if the local write-out is negligible, as it would be for me for <1 GB data), and has the advantage that it's a bit safer in case something happens during the transport. (Append should be fine also, but with COPY you know for sure nothing was changed from yesterday's file.)
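A minimal PROC COPY sketch for that last step (library names from the question):
/* Copy the finished table from WORK to the server library in one step. */
proc copy in=work out=Targetlib memtype=data;
    select HAVE;
run;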
Finally, you may want to figure out some way to allow the target file to just be updated. This isn't all that hard to do in most cases. One example would be to keep a copy of yesterday's file, do a PROC COMPARE to today's file, and then include in the update file every record that is changed (regardless of what the change is). Then delete any matching old records from the target file and append the new records. This is very fast comparatively in terms of total records sent over the network, so saves a lot of time overall if network congestion is the main issue (but takes more CPU time to do the PROC COMPARE).
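A rough sketch of that update-only idea, assuming a local library (here called local) holds yesterday's copy and key uniquely identifies a record; note that brand-new keys not present yesterday are not captured by this comparison and would need to be appended separately:
/* Keep only records from today's file that differ from yesterday's copy. */
proc compare base=local.yesterday compare=work.HAVE
             out=work.changed outnoequal outcomp noprint;
    id key;
run;
/* Replace the changed records in the target on the server. */
proc sql;
    delete from Targetlib.HAVE
    where key in (select key from work.changed);
quit;
proc append base=Targetlib.HAVE data=work.changed (drop=_type_ _obs_);
run;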

You can use PROC APPEND to efficiently create a new dataset, not just append to an existing one - so you can use it to basically combine steps 3 and 4 into this:
/* To fulfill requirement 4, delete any existing Targetlib.HAVE */
PROC DELETE DATA=Targetlib.HAVE;
RUN;
/* Targetlib.HAVE does not yet exist, so the first APPEND will create it */
PROC APPEND BASE=Targetlib.HAVE DATA=work.HAVE1;
RUN;
PROC APPEND BASE=Targetlib.HAVE DATA=work.HAVE2;
RUN;
That should save at least some time, but it still doesn't solve all your problems. I have put some additional questions in comments on the question and will update this answer as much as I can based on them.
Update 1
Here is a way to do the left join and the concatenation in one step and write the results immediately to targetlib. I cannot guarantee this will be faster, but it is worth a try. I used key and val as field names; replace them as you see fit.
PROC SQL _METHOD;
    CREATE TABLE targetlib.HAVE AS
    SELECT
        a.key,
        CASE WHEN MISSING(a.val) THEN b.val ELSE a.val END AS val
    FROM
        Sourcelib.RAW1 AS a
    LEFT JOIN
        IMP AS b
    ON
        a.key = b.key
    UNION
    SELECT
        c.*
    FROM
        Sourcelib.RAW2 AS c
    ORDER BY
        key
    ;
QUIT;
The _METHOD option is a sparsely documented SAS feature that will print the query plan; see this link. This may give you more insight. Also, I am assuming that IMP was already imported from the Excel file and that it is in WORK. Experiment to see whether importing it into targetlib and replacing IMP AS b with targetlib.IMP AS b is any faster here.
Since you are on Windows, experiment with the data option SGIO=YES after dataset names: e.g. Sourcelib.RAW1 AS a becomes Sourcelib.RAW1 (SGIO=YES) AS a. For more info on Windows SGIO and SAS, see this link and this older but more comprehensive one.
An approach that might be more efficient would be to avoid the join and use a hash object instead: good documentation for the hash object can be found here, and a good tip sheet is here. It's not clear whether this would be faster - IMP has 6M records, but at 50 bytes per record that's about 300 MB, which does fit into your RAM. The performance of a hash table with that many entries would depend a lot on SAS's hash algorithm. Anyway, here is the code using a hash object. In it, we assume that in the IMP dataset, the val field has been renamed to val2.
DATA targetlib.HAVE (DROP = rc val2);
    LENGTH val2 8;
    IF (_N_ = 1) THEN DO;
        DECLARE HASH h (DATASET: "IMP");
        h.DEFINEKEY ('key');
        h.DEFINEDATA ('val2');
        h.DEFINEDONE ();
    END;
    SET
        sourcelib.RAW1
        sourcelib.RAW2
    ;
    IF MISSING (val) THEN DO;
        rc = h.find();
        IF (rc = 0) THEN DO;
            val = val2;
        END;
    END;
RUN;
PROC SORT DATA = targetlib.HAVE ; BY KEY ; RUN ;
Try it and see if it is any faster. Once again, experiment with the location of IMP, using SGIO, etc. That PROC SORT at the end could be expensive; if the only reason you were sorting before was the join, then skip it.
In general, your approach to SAS should be to do as little I/O as possible, and find ways of combining multiple actions into a single write from a PROC or DATA step.

Related

Large file, SSIS to Small chunks, parallel enrichment

I have a pretty large file, 10GB in size, and need to load the records into the DB.
I want to have two additional columns:
LoadId, which is a constant (this indicates the file's unique load number)
ChunkNumber, which would indicate the chunk of the batch size.
So if I have a batch size of 10,000 records I want
LoadId = {GUID}
ChunkNumber = 1
For the next 10,000 records I want
LoadId = {GUID}
ChunkNumber = 2
Is this possible in SSIS? I suppose I could write a custom component for this, but there should be a built-in ID I could use, since SSIS is already running stuff in batches of 10,000.
Can someone help me figure out whether this parameter exists and whether it can be used?
OK, a little bit more detail on the background of what and why.
Once we get the data into slices of 10,000 records, we can start calling the stored procedures to enrich the data in chunks. All I am trying to do is see whether SSIS can help here by adding a chunk number and a GUID.
This helps the stored proc move the data in chunks; although I could do this after the fact with a row number, the SELECT would have to travel through the whole set again to update the chunk numbers. It's double the effort.
A GUID will represent the complete dataset, and individual chunks are related to it.
Some more insight: there is a WorkingTable we import this large file into, and if we start enriching all the data at once the transaction log gets used up. It is more manageable if we can get the data into chunks, so that the transaction log doesn't blow up and we can also parallelize the enrichment process.
The data moves from a de-normalized format to a normalized format from here. The SP is more maintainable in terms of release and day-to-day management, so any help is appreciated.
Or is there another, better way of dealing with this?
For the LoadID, you could use the SSIS variable
System::ExecutionInstanceGUID
which is generated by SSIS when the package runs.
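In the data flow, the GUID itself can be added with something like a Derived Column using the expression (DT_WSTR, 38)@[System::ExecutionInstanceGUID]. If SSIS can't stamp the ChunkNumber in the pipeline, a hedged T-SQL sketch of the after-the-fact pass (the extra pass the asker would rather avoid; WorkingTable, RecordId, and @LoadId are assumed names) might look like this:
UPDATE w
SET    w.ChunkNumber = x.chunk
FROM   dbo.WorkingTable AS w
JOIN  (SELECT RecordId,
              ((ROW_NUMBER() OVER (ORDER BY RecordId) - 1) / 10000) + 1 AS chunk
       FROM   dbo.WorkingTable
       WHERE  LoadId = @LoadId) AS x
  ON   x.RecordId = w.RecordId;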

Bulk Copy from small table to larger one in SQL Server 2005

I'm a newbie in SQL Server and have the following dilemma:
I have two tables with the same structure. Call it runningTbl and finalTbl.
runningTbl contains about 600 000 to 1 million rows every 15 minutes.
After doing some data cleanup in runningTbl I want to move all the records to finalTbl.
finalTbl currently has about 38 million rows.
The above process needs to be repeated every 15-20 minutes.
The problem is that moving the data from runningTbl to finalTbl sometimes takes way longer than 20 minutes.
Initially when the tables were small it took anything from 10 seconds to 2 minutes to copy.
Now it just takes too long.
Anyone that can assist with this? SQL query to follow.
Thanks
There are a number of things that you will need to do in order to get the most efficient method of copying the data. So far you are on the right track, but you have a long way to go. I would suggest you first look at your indexes; there may be optimizations there that can help. Next, make sure you don't have triggers on this table that could cause a slowdown. Finally, change the logging level (if that is permitted).
There is a bunch more help here (from Microsoft):
http://msdn.microsoft.com/en-us/library/ms190421(v=SQL.90).aspx
Basically you are on the right track using BCP. This is actually Microsoft's recommendation:
To bulk-copy data from one instance of SQL Server to another, use bcp to export the table data into a data file. Then use one of the bulk import methods to import the data from the file to a table. Perform both the bulk export and bulk import operations using either native or Unicode native format.
When you do this though, you need to also consider the possibility of dropping your indexes if there is too much data being brought in (based upon the type of index you use). If you use a clustered index, it may also be a good idea to order your data before import. Here is more information (including the source of the above quote):
http://msdn.microsoft.com/en-US/library/ms177445(v=SQL.90).aspx
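For illustration, a hedged command-line sketch of that bcp round trip (server, database, and path names are placeholders; the ORDER hint assumes the target's clustered-index column):
bcp MyDb.dbo.runningTbl out C:\temp\running.dat -n -S MyServer -T
bcp MyDb.dbo.finalTbl in C:\temp\running.dat -n -S MyServer -T -b 10000 -h "TABLOCK, ORDER(UnixTime ASC)"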
For starters: one of the things I've learned over the years is that MSSQL does a great job at optimizing all kinds of operations, but to do so it relies heavily on the statistics for all tables involved. Hence, I would suggest running "UPDATE STATISTICS processed_logs" and "UPDATE STATISTICS unprocessed_logs" before running the actual inserts; even on a large table these don't take all that long.
Apart from that, based on the query above, a lot depends on the indexes of the target table. I'm assuming the target table has its clustered index (or PRIMARY KEY) on (at least) UnixTime; if not, you'll create major data fragmentation when you squeeze more and more data in between the already existing records. To work around this you could try defragmenting the target table once in a while (it can be done online, but takes a long time), but making the clustered index (or PK) such that data is always appended to the end of the table would be the better approach; well, at least in my opinion.
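A small sketch of those two suggestions (table and column names are assumed from the discussion):
-- Refresh statistics on both tables before the periodic insert.
UPDATE STATISTICS dbo.runningTbl;
UPDATE STATISTICS dbo.finalTbl;
-- Cluster the target on an ever-increasing key so new rows are appended
-- at the end of the table instead of fragmenting the existing pages.
CREATE CLUSTERED INDEX IX_finalTbl_UnixTime ON dbo.finalTbl (UnixTime);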
I suggest that you have a Windows service and use a timer and a boolean flag. Once your request is sent to the server, set the flag high, and have the timer event skip its work until the flag is low again.

'tail -f' a database table

Is it possible to effectively tail a database table such that when a new row is added an application is immediately notified with the new row? Any database can be used.
Use an ON INSERT trigger.
You will need to check the specifics of how to call external applications with the values contained in the inserted record, or you can write your 'application' as a SQL procedure and have it run inside the database.
It sounds like you will want to brush up on databases in general before you paint yourself into a corner with your command-line approaches.
Yes, if the database is a flat text file and appends are done at the end.
Yes, if the database supports this feature in some other way; check the relevant manual.
Otherwise, no. Databases tend to be binary files.
I am not sure, but this might work for primitive / flat-file databases; as far as I understand (and I could be wrong), modern database files are binary formats, hence reading a newly added row would not work with that command.
I would imagine most databases allow for write triggers, and you could have a script that triggers on write that tells you some of what happened. I don't know what information would be available, as it would depend on the individual database.
There are a few options here, some of which others have noted:
Periodically poll for new rows. With the way MVCC works though, it's possible to miss a row if there were two INSERTS in mid-transaction when you last queried.
Define a trigger function that will do some work for you on each insert. (In Postgres you can call a NOTIFY command that other processes can LISTEN to; a sketch follows below.) You could combine a trigger with writes to an unpublished_row_ids table to ensure that your tailing process doesn't miss anything. (The tailing process would then delete IDs from the unpublished_row_ids table as it processed them.)
Hook into the database's replication functionality, if it provides any. This should have a means of guaranteeing that rows aren't missed.
I've blogged in more detail about how to do all these options with Postgres at http://btubbs.com/streaming-updates-from-postgres.html.
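As a concrete illustration of the trigger + NOTIFY option above (PostgreSQL; table and column names are assumed):
-- Fire a notification carrying the new row's id on every insert.
CREATE FUNCTION notify_new_row() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('new_rows', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER watched_table_notify
AFTER INSERT ON watched_table
FOR EACH ROW EXECUTE PROCEDURE notify_new_row();

-- The "tailing" client simply runs:  LISTEN new_rows;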
tail on Linux appears to be using inotify to tell when a file changes - it probably uses similar filesystem notifications frameworks on other operating systems. Therefore it does detect file modifications.
That said, tail performs an fstat() call after each detected change and will not output anything unless the size of the file increases. Modern DB systems use random file access and reuse DB pages, so it's very possible that an inserted row will not cause the backing file size to change.
You're better off using inotify (or similar) directly, and even better off if you use DB triggers or whatever mechanism your DBMS offers to watch for DB updates, since not all file updates are necessarily row insertions.
I was just in the middle of posting the same exact response as glowcoder, plus another idea:
The low-tech way to do it is to have a timestamp field and have a program run a query every n minutes looking for records where the timestamp is greater than that of the last run. The same concept can be applied by storing the last key seen if you use a sequence, or even by adding a boolean "processed" field.
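A minimal sketch of that polling query (column names assumed):
-- Run every n minutes; @last_run holds the timestamp recorded on the previous pass.
SELECT *
FROM   watched_table
WHERE  created_at > @last_run
ORDER  BY created_at;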
With Oracle you can select a pseudo-column called 'rowid' that gives a unique identifier for the row in the table, and rowids are ordinal... new rows get assigned rowids that are greater than any existing rowids.
So, first: select max(rowid) from table_name
I assume that one cause for the question being raised is that there are many, many rows in the table... so this first step will tax the db a little and take some time.
Then: select * from table_name where rowid > 'whatever_that_rowid_string_was'
You still have to periodically run the query, but it is now just a quick and inexpensive query.

SQL Cursor w/Stored Procedure versus Query with UDF

I'm trying to optimize a stored procedure I'm maintaining, and am wondering if anyone can clue me in to the performance benefits/penalities of the options below. For my solution, I basically need to run a conversion program on an image stored in an IMAGE column in a table. The conversion process lives in an external .EXE file. Here are my options:
Pull the results of the target table into a temporary table, and then use a cursor to go over each row in the table and run a stored procedure on the IMAGE column. The stored proc calls out to the .EXE.
Create a UDF that calls the .EXE file, and run a SQL query similar to "select UDFNAME(Image_Col) from TargetTable".
I guess what I'm looking for is an idea of how much overhead would be added by the creation of the cursor, instead of doing it as a set?
Some additional info:
The size of the set in this case is max. 1000
As an answer mentions below, if done as a set with a UDF, will that mean that the external program is opened 1000 times all at once? Or are there optimizations in place for that? Obviously, on a multi-processor system, it may not be a bad thing to have multiple instances of the process running, but 1000 might be a bit much.
Define "set-based" in this context?
If you have 100 rows, will this open up the app 100 times in one shot? I would say test it. And even though you can call an extended proc from a UDF, I would still use a cursor for this, because set-based doesn't matter in this case since you are not manipulating data in the tables directly.
I did a little testing and experimenting, and when done in a UDF, it does indeed process one row at a time - SQL Server doesn't run 100 processes for the 100 rows (I didn't think it would).
However, I still believe that doing this as a UDF instead of as a cursor would be better, because my research tends to show that the extra overhead of having to pull the data out in the cursor would slow things down. It may not make a huge difference, but it might save time versus pulling all of the data out into a temporary table first.

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to check for existence in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .NET app that loads the source records, then runs through and searches with a LIKE (using a T-SQL function) to determine if there are any matching records. If yes, the source table is updated with a positive; if not, the record is left alone.
The app processes about 1000 records an hour. When I did this as a cursor sproc on SQL Server, I got pretty much the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE s SET some_field = 1
FROM source_table s
WHERE EXISTS
(
    SELECT 1 FROM target_table t
    WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First I would redesign if at all possible. It is better to add a column that contains the correct value and be able to join on it. If you still need the long one, you can use a trigger to extract the data into the new column at the time it is inserted.
If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using T-SQL, and I would attempt the regular-expression route. Not knowing how many different phrases there are and the structure of each, I cannot say if the regular-expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW, if you are working with tables that large, I would resolve never to write another cursor. They are extremely bad for performance, especially when you start talking about record sets of that size. Learn to think in sets, not record-by-record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query is cancelled) will be huge. This will drastically slow down your query. As such, it is probably better to process them in blocks of 1,000 rows or so.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards. (This assumes all the values follow that pattern.) This would simplify the query condition to (b.field like '%' + a.field).
Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the characters of both tables in reverse order. The index on your temporary table would then come into use.
It may seem very wasteful to spend that much time, but in this case a small benefit would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can get the index into play, you could reduce this a thousand-fold, thanks to indexes being tree searches, not just ordered tables...)
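A hedged sketch of those two ideas combined (column names assumed; the expression strips everything from the last '(' onwards and stores the remainder reversed so an index can be used):
-- Cleaned, reversed copy of the target values, plus an index on it.
SELECT t.TargetId,
       REVERSE(LEFT(t.TargetField,
                    LEN(t.TargetField) - CHARINDEX('(', REVERSE(t.TargetField))))
           AS CleanReversed
INTO   #target_clean
FROM   dbo.TargetTable AS t;

CREATE INDEX IX_target_clean ON #target_clean (CleanReversed);

-- Because the source value sits at the END of the cleaned target value,
-- reversing both turns the comparison into a prefix match the index can use.
SELECT s.SourceField
FROM   dbo.SourceTable AS s
JOIN   #target_clean  AS t
  ON   t.CleanReversed LIKE REVERSE(s.SourceField) + '%';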
Second Idea
Assuming you do copy the target table into a temp table, you may benefit further from processing them in blocks of 1000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table, such that after all 750,000 records have been checked, the target table is now [for example] half the size it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + @source_field + '%'
IF (@@ROWCOUNT = 0)
    [no matches]
ELSE
    [matches]
The continuous deleting makes the query faster on each loop, and you're only running one query over the data (instead of one to find the matches and a second to delete them).
Try this --
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable) t2
on charindex(t1.SourceField, t2.TargetField) > 0
The first thing is to make sure you have an index on that column in the searched table. Second, do the LIKE without a % sign on the left side. Check the execution plan to make sure you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know if this is a one-time operation, or a regular task that needs to be completed regularly.
Not knowing all the details of your problem, if it were me, and this was a one-time (or infrequent) operation, which it sounds like it is, I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them down to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into SQL Server and use as the basis of an update query (i.e. update source table where id in (select id from temp table)).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in SQL.
By the sounds of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.
Tell us more about the problem.
Holy smoke, what great responses!
The system is on a disconnected network, so I can't copy and paste, but here's the retype.
Current UDF:
Create function CountInTrim
(@caseno varchar(255))
returns int
as
Begin
declare @reccount int
select @reccount = count(recId) from targettable where title like '%' + @caseno + '%'
return @reccount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one-time process, determining which entries in a legacy record/case-management system migrated successfully into the new system, so I can't redesign anything. Of course, the developers of either system are no longer available, and while I have some SQL experience, I am by no means an expert.
I parsed the case numbers out of the crazy way the old system stored them to make the source table, and that's the only thing in common with the new system: the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is SQL 2005 or later, then this would be a classic use for a computed column using SQL CLR code with regular expressions - no need for a standalone app.
