rxImport potential issue in RevoScaleR

I have a SQL connection to a table on my SQL Server, which I defined with the following line:
master_table <- RxSqlServerData(etc...)
My goal is to import this table using rxImport and save it to an .xdf file, which I have defined as readTest <- 'read_test.xdf'
The table is quite large, so I have set this in my rxImport call:
rxImport(master_table, outFile=readTest, rowsPerRead=100000,reportProgress=1)
However, it has been running for 10 minutes now, and no progress on rows being read/imported has been printed to the screen. Did I do this correctly? I wanted output similar to the progress that is printed when an ML algorithm such as RxForest is run.
Thanks.

It's possible that the connection to your SQL Server database is relatively slow. reportProgress only shows progress when a batch of rows is complete, so if the rows are relatively large, you could see nothing returned on the terminal for quite some time.
For best performance with rxImport(), ensure that rowsPerRead is the largest size that your local machine's memory can handle. This will make progress reports less frequent, but it will give you a faster import time. The only case where this isn't true is when importing an XDF file.
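To illustrate why nothing is printed until a whole batch has arrived, here is a minimal Python sketch. It is not RevoScaleR's implementation: the function names and the message format are hypothetical, and the only assumption taken from the answer is that one progress message is emitted per completed batch.

```python
from typing import Iterable, Iterator, List


def read_in_batches(rows: Iterable[tuple], rows_per_read: int) -> Iterator[List[tuple]]:
    """Group incoming rows into batches of at most rows_per_read rows."""
    batch: List[tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) == rows_per_read:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller, batch.
        yield batch


def import_with_progress(rows: Iterable[tuple], rows_per_read: int) -> List[str]:
    """Return the progress messages a batch-wise import would emit.

    A message is produced only after a full batch has been read, so a slow
    source combined with a large rows_per_read means long silent periods.
    """
    messages: List[str] = []
    total = 0
    for batch in read_in_batches(rows, rows_per_read):
        total += len(batch)
        messages.append(f"Rows read: {len(batch)}, total rows processed: {total}")
    return messages
```

With 250 rows, a batch size of 100 yields three messages, while a batch size of 1000 yields a single message at the very end, which is the trade-off between progress frequency and batch size described above.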

Related

SSIS data transfer slows down after inserting few million rows

I am encountering a strange problem, and any help would be appreciated.
I have a single container with two data flow tasks in my SSIS package, and the volume of data transferred is very large. Here is a breakdown of the problem.
The first data flow task transfers around 130 million rows from Oracle to SQL Server. It runs fine and completes in about 40 to 60 minutes, which is acceptable.
The second data flow task transfers around 86 million rows from SQL Server to SQL Server (a single table). The transfer runs very fast up to 60-70 million rows, and after that it crawls: the next 10 million rows took 15 hours. I cannot understand why this is happening.
The table is truncated and then loaded. I have tried increasing the DataBuffer properties and so on, but to no avail.
Thanks in advance for any help.

You are creating a single transaction and the transaction log is filling up. You can get 10-100x faster speeds if you move 10000 rows at a time. You may also try setting Maximum Insert Commit Size to 0 or try 5000 and go up to see the impact on performance. This is on the OLE DB Destination component. In my experience 10000 rows is the current magic number that seems to be the sweet spot but of course it is very dependent on how large the rows are, version of SQL Server and the hardware setup.
You should also check whether there are indexes on the target table. If there are, you can try dropping the indexes, loading the table, and then recreating the indexes.
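The idea of committing in batches rather than in one huge transaction can be sketched outside SSIS as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, the table names source_rows and target_rows are hypothetical, and the source is assumed to have a non-negative, increasing integer id column.

```python
import sqlite3


def make_demo_db(n_rows: int) -> sqlite3.Connection:
    """Create an in-memory database with a filled source table (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_rows (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE target_rows (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO source_rows (id, payload) VALUES (?, ?)",
        ((i, f"row-{i}") for i in range(n_rows)),
    )
    conn.commit()
    return conn


def copy_in_batches(conn: sqlite3.Connection, batch_size: int = 10_000) -> int:
    """Copy source_rows into target_rows, committing after every batch.

    Returns the number of commits. Each commit closes a small transaction,
    so no single transaction has to cover the whole table.
    """
    last_id = -1  # assumes ids are non-negative
    commits = 0
    while True:
        # Keyset pagination: fetch the next batch after the last copied id.
        batch = conn.execute(
            "SELECT id, payload FROM source_rows WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            break
        conn.executemany(
            "INSERT INTO target_rows (id, payload) VALUES (?, ?)", batch
        )
        conn.commit()
        commits += 1
        last_id = batch[-1][0]
    return commits
```

For example, copy_in_batches(make_demo_db(25), 10) performs three commits. The batch size is the knob to experiment with, as the answer suggests.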

What is your destination recovery model? Full/Simple, etc...
Are there any transformations between the source and destination? Try sending the source to a RowCount to determine the maximum speed your source can send data. You may be seeing a slowdown on the source side as well.
Is there any difference in the content of the rows once you notice the slowdown? For example, maybe the more recent rows have lots of text in a varchar(max) column that the early rows did not make use of.
Is your destination running on a VM? If yes, have you pre-allocated the CPU and RAM? SSIS is multi-threaded, but it won't necessarily use 100% of each core. VM hosts may share the resources with other VMs because the SSIS VM is not reporting full usage of all of the resources.

Why adding another LOOKUP transformation slows down performance significantly SSIS

I have a simple SSIS package that transfers data between a source and a destination, from one server to another.
If a record is new, it is inserted; otherwise the package checks the HashByteValue column and, if it is different, updates the record.
The table contains approx. 1.5 million rows, and the update touches around 50 columns.
When I start debugging the package, nothing happens for around 2 minutes; I can't even see the green check mark. After that I can see data start flowing through, but sometimes it stops, then flows again, then stops again, and so on.
The whole package looks like this:
But if I do just the INSERT part (without the update), it works perfectly: 1 minute and all 1.5 million records are in the destination table.
So why does adding another LOOKUP transformation that updates records slow down performance so significantly?
Is it something to do with memory? I am using the FULL CACHE option in both lookups.
What would be the way to increase performance?
Could the cause be the Auto Growth file size:

Besides changing the AutoGrowth size to 100MB, note that your database log file is 29GB. That means you most likely are not doing transaction log backups.
If you're not, and you only do full backups nightly or periodically, change the Recovery Model of your database from Full to Simple:
Database Properties > Options > Recovery Model
Then shrink your log file down to 100MB using:
DBCC SHRINKFILE(Catalytic_Log, 100)

I don't think that your problem is in the lookup. The OLE DB Command is really slow in SSIS, and I don't think it is meant for a massive update of rows. Look at this answer on MSDN: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/4f1a62e2-50c7-4d22-9ce9-a9b3d12fd7ce/improve-data-load-perfomance-in-oledb-command?forum=sqlintegrationservices
To verify that the problem is not the lookup, try disabling the "OLE DB Command", rerun the process, and see how long it takes.
In my personal experience it is always better to create a Stored procedure to do the whole "dataflow" when you have to update or insert based on certain conditions. To do that you would need a Staging table and a Destination table (where you are going to load the transformed data).
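As a rough sketch of the staging-table approach, the following Python example uses the built-in sqlite3 module as a stand-in for SQL Server. The table names (staging, destination) and the columns (id, payload, row_hash) are hypothetical; in a real setup the two statements would live in a stored procedure, and row_hash would play the role of the HashByteValue column from the question.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with demo staging and destination tables."""
    conn = sqlite3.connect(":memory:")
    for table in ("staging", "destination"):
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, payload TEXT, row_hash TEXT)"
        )
    conn.executemany(
        "INSERT INTO destination VALUES (?, ?, ?)",
        [(1, "a", "h1"), (2, "b", "h2")],
    )
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?)",
        [(1, "a", "h1"), (2, "B", "h2-changed"), (3, "c", "h3")],
    )
    conn.commit()
    return conn


def merge_from_staging(conn: sqlite3.Connection) -> tuple:
    """Apply staging to destination with two set-based statements.

    Returns (rows_updated, rows_inserted). Both statements run in one
    transaction, which `with conn:` commits on success.
    """
    with conn:
        # Update only rows whose hash differs from the staged version.
        updated = conn.execute(
            """
            UPDATE destination
            SET payload  = (SELECT s.payload  FROM staging s WHERE s.id = destination.id),
                row_hash = (SELECT s.row_hash FROM staging s WHERE s.id = destination.id)
            WHERE EXISTS (
                SELECT 1 FROM staging s
                WHERE s.id = destination.id AND s.row_hash <> destination.row_hash
            )
            """
        ).rowcount
        # Insert rows that do not exist in the destination yet.
        inserted = conn.execute(
            """
            INSERT INTO destination (id, payload, row_hash)
            SELECT s.id, s.payload, s.row_hash
            FROM staging s
            WHERE NOT EXISTS (SELECT 1 FROM destination d WHERE d.id = s.id)
            """
        ).rowcount
    return updated, inserted
```

The point of the design is that the database processes all changed rows in one statement instead of issuing one UPDATE per row, which is what a row-by-row command component does.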
Hope it helps.

Creating Azure SQL V12 Full Text Index is very slow

I'm creating a full text index on 7 columns of a 30 million row table.
5 days on this query:
SELECT count(*) FROM sys.dm_fts_index_keywords( 5, OBJECT_ID('[dbo].[FormattedAddress]'))
Returns 500,000 rows and seems to be slowing.
This database is a standard S2.
Is there anything I can do to speed things up?

A few things you can do:
Scale up to a higher SKU, say P1, and let the index build complete
Then scale back down to S2
Also check the resource usage of the database to see if you are hitting any resource limits. You can't turn off the signing, as those config options are not available in SQL DB.

Let's try a few things. First, let the query run until it completes. Second, clean the buffer cache and then rerun the query. It might also be a problem that was reportedly fixed in 2008. Check to make sure that signature verification is turned off. One more thing to check is the memory allocation for FTS. The full-text service runs independently from the SQL service, so there's a chance it's starved for memory.

Bulk Copy from small table to larger one in SQL Server 2005

I'm a newbie in SQL Server and have the following dilemma:
I have two tables with the same structure. Call it runningTbl and finalTbl.
runningTbl contains about 600 000 to 1 million rows every 15 minutes.
After doing some data cleanup in runningTbl I want to move all the records to finalTbl.
finalTbl currently has about 38 million rows.
The above process needs to be repeated every 15-20 minutes.
The problem is that moving the data from runningTbl to finalTbl sometimes takes far longer than 20 minutes.
Initially, when the tables were small, it took anything from 10 seconds to 2 minutes to copy.
Now it just takes too long.
Can anyone assist with this? SQL query to follow.
Thanks

There are a number of things that you will need to do in order to get the most efficient method of copying the data. So far you are on the right track, but you have a long way to go. I would suggest you first look at your indexes. There may be optimizations there that can help. Next, make sure you don't have triggers on this table that could cause a slowdown. Next, change the logging level (if that is permissible).
There is a bunch more help here (from Microsoft):
http://msdn.microsoft.com/en-us/library/ms190421(v=SQL.90).aspx
Basically you are on the right track using BCP. This is actually Microsoft's recommendation:
To bulk-copy data from one instance of SQL Server to another, use bcp to export the table data into a data file. Then use one of the bulk import methods to import the data from the file to a table. Perform both the bulk export and bulk import operations using either native or Unicode native format.
When you do this though, you need to also consider the possibility of dropping your indexes if there is too much data being brought in (based upon the type of index you use). If you use a clustered index, it may also be a good idea to order your data before import. Here is more information (including the source of the above quote):
http://msdn.microsoft.com/en-US/library/ms177445(v=SQL.90).aspx

For starters: one of the things I've learned over the years is that MSSQL does a great job of optimizing all kinds of operations, but to do so it relies heavily on the statistics for all tables involved. Hence, I would suggest running "UPDATE STATISTICS processed_logs" and "UPDATE STATISTICS unprocessed_logs" before running the actual inserts; even on a large table these don't take all that long.
Apart from that, based on the query above, a lot depends on the indexes of the target table. I'm assuming the target table has its clustered index (or PRIMARY KEY) on (at least) UnixTime, if not you'll create major data-fragmentation when you squeeze more and more data in-between the already existing records. To work around this you could try defragmenting the target table once in a while (can be done online, but takes a long time), but making the clustered index (or PK) so that data is always appended to the end of the table would be the better approach; well, at least in my opinion.

I suggest that you use a Windows service with a timer and a boolean variable. Once your request is sent to the server, set the boolean to true, and have the timer event skip its code until the boolean is set back to false.
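The guard described here can be sketched in Python as follows. This is not a Windows service: the class and method names are hypothetical, and a non-blocking threading.Lock is swapped in for the plain boolean so that the check-and-set is safe when ticks arrive on different threads.

```python
import threading
from typing import Callable


class NonOverlappingJob:
    """Run `work` on timer ticks, skipping any tick that arrives mid-run."""

    def __init__(self, work: Callable[[], None]) -> None:
        self._work = work
        self._busy = threading.Lock()  # plays the role of the boolean flag
        self.completed = 0
        self.skipped = 0

    def on_timer_tick(self) -> bool:
        """Return True if the work ran, False if the tick was skipped."""
        # Non-blocking acquire: fails immediately if a run is in progress.
        if not self._busy.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self._work()
            self.completed += 1
            return True
        finally:
            # Clear the flag even if the work raised an exception.
            self._busy.release()
```

A timer (for example threading.Timer, rescheduled after each tick) would call on_timer_tick every 15 minutes; if the previous copy is still running, the tick is simply counted as skipped instead of starting a second, overlapping copy.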

SQL Cursor w/Stored Procedure versus Query with UDF

I'm trying to optimize a stored procedure I'm maintaining, and am wondering if anyone can clue me in to the performance benefits/penalties of the options below. For my solution, I basically need to run a conversion program on an image stored in an IMAGE column in a table. The conversion process lives in an external .EXE file. Here are my options:
Pull the results of the target table into a temporary table, and then use a cursor to go over each row in the table and run a stored procedure on the IMAGE column. The stored proc calls out to the .EXE.
Create a UDF that calls the .EXE file, and run a SQL query similar to "select UDFNAME(Image_Col) from TargetTable".
I guess what I'm looking for is an idea of how much overhead would be added by the creation of the cursor, instead of doing it as a set?
Some additional info:
The size of the set in this case is max. 1000
As an answer mentions below, if done as a set with a UDF, will that mean that the external program is opened 1000 times all at once? Or are there optimizations in place for that? Obviously, on a multi-processor system, it may not be a bad thing to have multiple instances of the process running, but 1000 might be a bit much.

Can you define "set-based" in this context?
If you have 100 rows, will this open the app 100 times in one shot? I would say test it. Even though you can call an extended proc from a UDF, I would still use a cursor for this, because set-based processing doesn't matter in this case: you are not manipulating data in the tables directly.

I did a little testing and experimenting, and when done in a UDF, it does indeed process each row at a time - SQL server doesn't run 100 processes for each of the 100 rows (I didn't think it would).
However, I still believe that doing this as a UDF instead of as a cursor would be better, because my research tends to show that the extra overhead of having to pull the data out in the cursor would slow things down. It may not make a huge difference, but it might save time versus pulling all of the data out into a temporary table first.