I am using the latest SQL Server JDBC driver available for official download (sqljdbc42). The source SQL Server databases range from 2008 to 2016.
I read through several existing threads about setFetchSize(...), and it seems that:
If there are 1000 rows and the fetch size is 100, the result set holds only 100 records in memory at a time and makes 10 network trips, fetching the next 100 records as rs.next() is called.
It's unclear whether the SQL Server JDBC driver honors the fetch size.
I am writing a custom library for our internal use. It selects all the data from specific tables, iterates over the results and writes them to a stream. I get an OutOfMemoryError when I run it for several tables (each with thousands to hundreds of thousands of rows), but not when I iterate over a single table with several thousand rows. I suspect that a large table somewhere is causing the issue.
While I continue to debug my code, I would like to know whether setFetchSize(...) really works with the latest SQL Server JDBC driver.
Note: The tables do NOT have any incrementing columns or surrogate keys on which I could paginate manually, so I need something out of the box.
*****Edit-1*****
As pointed out in the comment by @a_horse_with_no_name, I think 'responseBuffering=adaptive' should be specified explicitly, and the documentation describes some ways to do that. The challenge is:
I receive a custom DB connection pool instance as an argument and have to use it to get a Connection object, so there is no way I can specify 'responseBuffering=adaptive' in the DB connection URL or elsewhere. All I get is a Connection object to use.
The only relevant(?) method I can use is setClientInfo(...), but I am unsure whether that would help.
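For reference, here is a minimal sketch of the streaming pattern described above, using only standard java.sql calls on the Connection that is handed in. It assumes a forward-only, read-only statement, which is the case where the Microsoft driver's adaptive response buffering (documented as the default since driver version 2.0) can avoid holding the whole result in memory; setFetchSize is only a hint to the driver. The class name StreamingExport, the tab-separated output and the fetch size of 100 are illustrative assumptions, not part of the original library.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not taken from the original code.
final class StreamingExport {

    // Forward-only, read-only statement: rows are read once, in order,
    // so nothing requires earlier rows to be kept around.
    static Statement openStreamingStatement(Connection conn, int fetchSize) throws SQLException {
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(fetchSize); // a hint only; the driver may ignore it
        return stmt;
    }

    // Writes each row as one tab-separated line and returns the row count.
    // This code holds only the current row; what the driver buffers is up to the driver.
    static long copyRows(ResultSet rs, Writer out) throws SQLException, IOException {
        int columns = rs.getMetaData().getColumnCount();
        long rows = 0;
        while (rs.next()) {
            for (int i = 1; i <= columns; i++) {
                if (i > 1) {
                    out.write('\t');
                }
                String value = rs.getString(i);
                out.write(value == null ? "" : value);
            }
            out.write('\n');
            rows++;
        }
        return rows;
    }

    // tableName is assumed to come from a trusted, fixed list,
    // because it is concatenated into the SQL text.
    static long exportTable(Connection conn, String tableName, Writer out)
            throws SQLException, IOException {
        try (Statement stmt = openStreamingStatement(conn, 100);
             ResultSet rs = stmt.executeQuery("SELECT * FROM " + tableName)) {
            return copyRows(rs, out);
        }
    }
}
```

If the pool's Statement wrapper supports unwrap, it may also be possible to reach the driver's own SQLServerStatement and call setResponseBuffering("adaptive") on it; whether that works depends on the pool implementation, so treat it as something to try rather than a guarantee.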
We have an SSIS package that the development team describes as 'slow'. Since they do not have anyone with SSIS/ETL experience, I, as a DBA, tried digging into it. Below is the information I found:
SQL Server was upgraded in place from 2014 to 2017, so SSIS of both versions is installed.
They load a 200 GB SQL Server table into SSIS, write the data to a flat file and then zip it using a command-line zip utility.
The data flow task simply runs a select * from view; the view contains nothing but the table, with no fancy joins.
While troubleshooting I found that hardly any load reaches SQL Server, possibly because the select command runs single-threaded and does not use all of the server's cores.
When I run the same select * command myself (only for 5 seconds, since it is a 200 GB table), my command is single-threaded too.
The package has a configuration file that the SQL job shows (this is how the package runs) with some connection settings.
Opening the package in BIDS shows DefaultBufferMaxRows as only 10000 (possibly the default value). Since neither the configuration file nor any variable sets a custom value, I guess this is what the package uses at run time too.
Both SQL Server and SSIS are on the same server. SQL Server has been allocated max memory that leaves around 100 GB for SSIS and the OS.
Kindly share any ideas on how I can force SQL Server to run this select command using multiple threads so that the entire table gets into the SSIS buffers faster.
Edit: I am aware that bcp can read data and save it to a flat file faster than any other process, but at this point changes to the SSIS package have to be kept to a minimum, so I am exploring options that can be incorporated within the SSIS package.
Edit2: Parallelism works perfectly on my SQL Server, as I verified with a lot of other queries. The table in question is 200 GB. It is only SSIS that is not hammering my DB as hard as it should.
Edit3: I have made some progress: I adjusted the buffer size to 100 MB and max rows to 100000, and now the package seems to be doing better. When I run the package on the server directly using the dtexec utility, it generates a good load of 40-50 MB per second, but through the SQL job it never generates more than 10 MB per second, so I am trying to figure out this behavior.
Edit4: I found that when I log on to the server and run the package directly with the dtexec utility, it runs well: it generates a good load on the DB, and data I/O stays steady between 30 and 50 MB/sec.
The same package run from the SQL job never exceeds 10 MB/sec of I/O.
I even tried running the package from the Agent as a command-line (cmdline) operation, but nothing changed. Agent literally sucks here; any pointers on what could be wrong?
Final Try:
I am stumped by what I finally observed:
1) The same package runs 3x faster when run from a command prompt on the Windows node by invoking the dtexec utility.
2) The exact same package runs 3 times slower when invoked by SQL Agent, which has sysadmin permissions on Windows as well as on SQL Server.
In both cases I checked which version of DTEXEC is invoked, and both invoke the same version. So why one would be so much slower is beyond my understanding.
I don't think there is a general solution to this issue, since it is a specific case and you didn't provide much information. Since there are two components in your data flow task (OLE DB Source and Flat File Destination), I will try to give some suggestions related to each component.
Before giving suggestions for each component, it is good to mention the following:
If no transformations are applied within the data flow task, using this task is not recommended. It is preferable to use the bcp utility.
Check the TempDb and the database log size.
If a clustered index exists, try to rebuild it. If not, try to create a clustered index.
To find the component that is slowing the package down, open the package in Visual Studio, remove the Flat File Destination and replace it with a dummy Script Component (write any useless code, for example: string s = "";). Then run the package: if it is fast enough, the problem is caused by the Flat File Destination; otherwise you need to troubleshoot the OLE DB Source.
Try executing the query in SQL Server Management Studio and look at the execution plan.
Check the package TargetServerVersion property within the package configuration and make sure it is correct.
OLE DB Source
As you mentioned, you are using a Select * from view query where data is stored in a table that contains a considerable amount of data. The SQL Server query optimizer may find that reading data using Table Scan is more efficient than reading from indexes, especially if your table does not have a clustered index (row store or column store).
There are many things you may try to improve data load:
Try replacing the Select * from view with the original query used to create the view.
Try changing the data provider used in the OLE DB Connection Manager: SQL Server Native Client or the newer Microsoft OLE DB Driver for SQL Server (MSOLEDBSQL), not the old SQLOLEDB provider.
Try increasing the DefaultBufferMaxRows and DefaultBufferSize properties.
Try using the SQL Command data access mode with specific column names instead of selecting the view by name (the Table or View data access mode).
Try to load the data in chunks.
Flat File Destination
Check that the flat file directory is not located on the same drive where the SQL Server instance is installed.
Check that the flat file is not located on a busy drive.
Try to export the data into multiple smaller flat files instead of one huge file: as a single file grows, writing to it becomes slower, and the package slows down with it. (See the fifth suggestion above, loading data in chunks.)
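As a rough illustration of loading in chunks and writing one file per chunk, the sketch below splits a key range into non-overlapping slices and builds one source query per slice. It assumes the underlying table has an indexed integer key; the names dbo.BigTable and Id are placeholders, and ChunkPlanner is a hypothetical helper, not something SSIS provides. Each generated query could feed its own data flow and its own flat file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for planning chunked exports.
final class ChunkPlanner {

    // Splits the inclusive key range [minKey, maxKey] into at most `chunks`
    // half-open ranges {lo, hi}, meaning lo <= key < hi.
    // Assumes the keys are far from Long.MAX_VALUE (no overflow handling).
    static List<long[]> split(long minKey, long maxKey, int chunks) {
        if (chunks < 1 || maxKey < minKey) {
            throw new IllegalArgumentException("need chunks >= 1 and maxKey >= minKey");
        }
        long span = maxKey - minKey + 1;
        long step = (span + chunks - 1) / chunks; // ceiling division
        List<long[]> ranges = new ArrayList<>();
        for (long lo = minKey; lo <= maxKey; lo += step) {
            long hi = Math.min(lo + step, maxKey + 1);
            ranges.add(new long[] {lo, hi});
        }
        return ranges;
    }

    // Builds the source query for one chunk; table and column names are placeholders.
    static String queryFor(long[] range) {
        return "SELECT * FROM dbo.BigTable WHERE Id >= " + range[0] + " AND Id < " + range[1];
    }

    public static void main(String[] args) {
        for (long[] range : split(1, 10, 3)) {
            System.out.println(queryFor(range));
        }
    }
}
```

Half-open ranges are used so that neighbouring chunks never overlap and no key is exported twice.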
Any indexes on the table could slow loading. If there are any indexes, try dropping them before the load and then recreating them after. This would also update the index statistics, which would be skewed by the bulk insert.
Are you seeing SQL Server use multiple cores for other queries? If not, maybe someone changed the following settings.
Check these under the server configuration settings:
Maximum Degree of Parallelism
Cost Threshold for Parallelism
Whether processors are affinitized to specific CPUs (processor affinity)
Also, a MAXDOP query hint can cause this too, but you said there is no fancy stuff in the view.
Also, it seems you have enough memory on the server, so why not increase DefaultBufferMaxRows to an extremely large number so that SQL Server isn't slowed down waiting for the buffer to empty? Remember, SQL Server and SSIS are using the same disk and have to wait for each other, which causes extra wait time for both. It is better for SQL Server to read the data and put it into the buffer first, and for SSIS to then start processing it and writing it to disk.
DefaultBufferSize : default is 10MB, max possible 2^31-1 bytes
DefaultBufferMaxRows : default is 10000
You can set AutoAdjustBufferSize to true so that the buffer size is calculated automatically from DefaultBufferMaxRows.
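To see how these two properties interact, here is a simplified model of the sizing rule: the number of rows in a buffer is limited both by DefaultBufferMaxRows and by how many rows of the estimated width fit into DefaultBufferSize. This is a sketch only; the real engine also applies an internal minimum buffer size and behaves differently when AutoAdjustBufferSize is on. The class name BufferSizing and the 2,000-byte row width are assumptions for illustration.

```java
// Hypothetical calculator; a simplified model, not the SSIS engine's actual code.
final class BufferSizing {

    // Rows per buffer are capped by DefaultBufferMaxRows and by the number of
    // estimated-width rows that fit into DefaultBufferSize bytes.
    static long rowsPerBuffer(long defaultBufferSizeBytes, long defaultBufferMaxRows, long estimatedRowBytes) {
        if (estimatedRowBytes <= 0) {
            throw new IllegalArgumentException("estimatedRowBytes must be positive");
        }
        long rowsThatFit = defaultBufferSizeBytes / estimatedRowBytes;
        return Math.min(defaultBufferMaxRows, rowsThatFit);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The 2,000-byte row width is an assumed figure, not one from the question.
        System.out.println(rowsPerBuffer(10 * mb, 10_000, 2_000));   // default settings
        System.out.println(rowsPerBuffer(100 * mb, 100_000, 2_000)); // 100 MB and 100000 rows
    }
}
```

Under this model, raising DefaultBufferMaxRows alone changes nothing once DefaultBufferSize is the tighter limit, which is why the two are usually increased together.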
See other performance troubleshooting ideas here
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features?view=sql-server-ver15
Edit 1: Some other properties you can check out. These are explained in the above link as well
MaxConcurrentExecutables (package property): defines how many executables (tasks) the package can run concurrently.
EngineThreads (Data Flow property): how many threads the data flow engine can use
Also try running dtexec under the same proxy user that SQL Agent uses, to see whether you get a different result with that account than with your own. You can use runas /user:... cmd to open a command window as that user and then execute dtexec.
Try changing the proxy user used in SQL Agent to a new one and see if it will help. Or try giving elevated permissions in the directories it needs access to.
Try keeping the package in file-system and execute through dtexec from the SQL Agent directly instead of using catalog.start_execution.
Not your case, but for other readers: if you have an "Execute Package Task", make sure the child packages are set to run in-process via the ExecuteOutOfProcess property (set to False). This reduces the overhead of starting additional processes.
Not your case but for other readers: if you're testing in BIDS, it will run in debug mode by default and thus run slow. Use CTRL-F5 (start without debugging). The best is to use dtexec directly to test the performance
A data flow task may not be the best choice to move this data. SSIS Data Flow tasks are an ETL tool in which you can do transformations and lookups, redirect invalid rows, add derived columns and a lot more. If the data flow task is simple and only moves data with no manipulation or redirection of rows, then ditch the Data Flow task and use a simple Execute SQL Task with OPENROWSET to import the flat file that was generated from the command line and zipped up. Assuming the flat file is a .csv file, here is an example that queries a .csv and inserts the data into a table.
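You need the [Ad Hoc Distributed Queries] server option enabled (run_value set to 1).
-- Create dbo.Destination from the rows of the CSV file.
-- DefaultDir is the folder that contains the file; the file name goes in the query.
SELECT *
INTO dbo.Destination
FROM OPENROWSET('MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=D:\;Extensions=csv;',
    'SELECT * FROM YourCsv.csv') AS CsvFile;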
Here are some additional examples: https://sqlpowershell.blog/2015/02/09/t-sql-read-csv-files-using-openrowset/
There are suggestions in this MSDN article: MSDN DataFlow performance features
Key ones appear to be:
Check the EngineThreads property of the Data Flow task, which tells SSIS how many source and worker threads it should use.
If you use an OLE DB Source to select data from a view, use the "SQL Command" access mode and write SELECT * FROM View rather than using "Table or View".
Let us know how you get on
You may be facing an I/O bottleneck while writing the 200 GB to the flat file. I don't see any problem with the SQL query.
If possible, create multiple files and split the data across them (either by modifying the SSIS package or by changing the select query).
There appears to be no solution to continuing to use a "front-end" (form) in MS Access once the single table exceeds 2GB, regardless of where the data resides. Further, since there is no way to query-SPAN linked tables in multiple Access databases, there is no solution there either. Am I correct?
MS Access 2016 on Win10, & SQL Server 2014 on Server 2012, tons of storage & memory everywhere.
Because I have used OLE "pictures", the table has exceeded the 2GB limit in Access. I exported the table to SQL Server 2014 and linked to it, then changed the properties of the form (in Access) to attach to it. It seems to work fine except that the end of the table cannot be accessed using the end of table control, or a 'find' for data I know is in a specific field near the end of the table, so it appears that moving my data to SQL Server is pointless.
The error I get when jumping or finding is:
The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result.
There are vbscript blocks in use, but the failure comes when using Access-native controls.
If I'm doing something wrong, please advise.
As stated, I tried "spanning" table segments, but there doesn't seem to be a way to query ALL records from both tables, or a way to construct a query to mask the origin of fields, or very likely any benefit as it would likely produce the same "exceeds 2GB" error as SQL did above.
I would consider exporting the OLE "pictures" to files, and reading them back into the form, record by record when viewing, but there are no functions in MS Access that support this, or any of the other functions I'd need to do it going forward. (write file from form.OLE, etc.)
Any ideas?
Thanks!
I have an odd situation. We have an AS400 with a Prod side and a Dev side (same machine, 2 NICs). From a production MS SQL Server, we run a query through a linked server, which I'll call 'as400'. The query does 3 joins, and the execution plan looks roughly like [Remote Query] => Got Results. It does the joins on the remote server (the production AS400). This executes in no more than 0:00:02 (2 seconds). One of the joined tables has 391 MILLION rows. The query pulls 100 rows, joined to the other table.
Now, it gets weird. On the Dev side of the same machine, running on a different SQL Server, coming in the other NIC card, executing the same query with a different database (the dev one) the execution plan is quite different! It is:
[Query 1] hash match (inner join) with [Query2] Hash with [Query3] Hash with [Query4]
Expecting that each query returns 10,000 rows (I'm sure it is just using that as a guess as it doesn't know the actual rows as it is remote). What it appears to be doing is pulling 391 million rows back on query2 and it takes > 23 HOURS before I give up and kill the query. (Running in SSMS)
Clearly, the MS SQL Server is making the decision to not pass this off to the AS400 as a single query. Why?
I can 'solve' the problem by using OpenQuery(as400, cmd) instead, but that opens us up to SQL injection, rules out simple syntax checking of the query, and has other drawbacks I don't like. The query takes 6 seconds using OpenQuery and returns the correct data.
If we solve this by rewriting all our (working, fast) queries that we use in production so they can also run against dev - it involves a LOT of effort and there is down-side to it in actual production.
I tried using the 'remote' hint on the join, but that isn't supported by the AS400 :-(
Tried looking at the configuration/versions of the SQL Servers and that didn't seem to offer a clue either. (SQL Servers are nearly the same version/are same, 10.50.6000 for the one that works, and 10.50.6220 for one that fails (newer), and also 10.50.6000 for the other one that is failing.)
Any clues anyone? Would like to figure this out, we have had several people looking at this for a couple of weeks - including the Database Architect and the IBM AS400 guru, and me. (So far, my OpenQuery is the only thing that has worked)
One other point, the MS Servers seem to be opening connections 5 per second to the AS400 from the machines that are not working (while the query runs for 23 hours) - I don't understand that, and I'm not 100% sure it is related to this issue, but it was brought up by the AS400 guy.
I despise linked servers for this reason (among many others). I have always had good luck with openquery() and using sp_executesql to help prevent SQL injection.
There is mention of this here: including parameters in OPENQUERY
Without seeing the queries and execution plans it sounds like this is a problem with permissions when accessing statistics on the remote server. For the query engine to make use of all available statistics and build a plan properly, make sure the db user that is used to connect to the linked server is one of the following on the linked server:
The owner of the table(s).
A member of the sysadmin fixed SERVER role.
A member of the db_owner fixed DATABASE role.
A member of the db_ddladmin fixed DATABASE role.
To check what db user you're using to connect to the linked server use Object Explorer...
Expand the Server\Instance > Server Objects > Linked Servers > right click your linked server and select properties, then go to the Security page.
If you're not mapping logins in the top section (which I wouldn't suggest), then select the last radio button at the bottom to make connections using a specific security context, and enter the username and password for the appropriate db user. Rather than using 'sa', create a new db user on the linked server that matches the second or third option in the list above (a sysadmin or db_owner member) and use that. Then every time the linked server is used, it will connect with the necessary permissions.
Check the below article for more detail. Hope this helps!
http://www.benjaminnevarez.com/2011/05/optimizer-statistics-on-linked-servers/
I am using MS SQL Server 2014.
I have a "main" server that contains a database which contains a table with data. This table receives new data around every second. Per day it is currently around 500'000 new rows. As the data only stay for some days in this table, the table contains a maximum of 1'500'000 rows.
I have a "mandate" server that contains the same database (structure) which contains the same table (structure).
I want to write (distribute) the new data arrived from my "main" server table to my "mandate" server table (also around every second). Updates or deletes do not need to be distributed (only the new arrived rows of the last second(s)).
I want to keep the resources used on my "main" server for the distribution as low as possible. Currently, there is only one "mandate" server, however, in the future there might be more of them (with the same needs).
Is replication the best thing to use here, especially regarding resources (as low as possible on the "main" server, with the burden on the "mandate" server(s)) and frequency (every second)?
Alternatively, a self-made SQL job on the "mandate" server that pulls the data from the "main" server is also possible, but I think that multiple "mandate" servers frequently pulling data from the "main" server table might use too many resources.
What is the best way to distribute the data in my case here?
We're in the middle of doing a new data warehouse roll-out using SQL Server 2014. One of my data sources is Oracle, and unfortunately the recommended Attunity component for quick data access is not available for SSIS 2014 just yet.
I've tried avoiding using OLEDB, as that requires installation of specific Oracle client tools that have caused me a lot of frustration before, and with the Attunity stuff supposedly being in the works (MS promised they'd arrive in August already), I'm reluctant to go through the ordeal again.
Therefore, I'm using ADO.NET. All things considered, performance is acceptable for the time being, with the exception of 1 particular table.
This particular table in Oracle has a bunch of varchar columns, and I've come to the conclusion that it's because of the width of the selected row that this table performs particularly slowly. To prove that, rather than selecting all columns as they exist in Oracle (which is the original package I created), I truncated all widths to the maximum length of the values actually stored (CAST(column AS varchar(46))). This reduced the time to run the same package to 17 minutes (still way below what I'd call acceptable, and it's not something I'd put in production because it'll open up a world of future pain, but it proves the width of the columns is definitely a factor).
I increased the network packet size in SQL Server, but that did not seem to help much. I have not managed to figure out a good way to alter the packet size on the ADO.NET connector for Oracle (SQL Server does have that option). I tried adding Packet size=32000; to the connection string for the Oracle connector, but that just threw an error, indicating it simply won't be accepted. The same applies to FetchSize.
Eventually, I came up with a compromise where I split the load into three different parts, dividing the varchar columns between these parts, and using two MERGE JOIN objects to, well, merge the data back into a single combined dataset. Running that and doing some extrapolation leads me to think that method would have taken roughly 30 minutes to complete (but without the potential for data loss that comes with the CAST solution above). However, that's still not acceptable.
I'm currently in the process of trying some other options (not using MERGE JOIN but dumping into three different tables, and then merging those on the SQL Server itself, and splitting the package up into even more different loads in an attempt to further speed up the individual parts of the load), but surely there must be something easier.
Does anyone have experience with how to load data from Oracle through ADO.NET, where wide rows would cause delays? If so, are there any particular guidelines I should be aware of, or any additional tricks you might have come across that could help me reduce load time while the Attunity component is unavailable?
Thanks!
The updated Attunity drivers have just been released by Microsoft:
Hi all, I am pleased to inform you that the Oracle and TeraData connector V3.0 for SQL14
SSIS is now available for download!!!!!
Microsoft SSIS Connectors by Attunity Version 3.0 is a minor release.
It supports SQL Server 2014 Integration Services and includes bug
fixes and support for updated Oracle and Teradata product releases.
For details, please look at the download page.
http://www.microsoft.com/en-us/download/details.aspx?id=44582
Source: https://connect.microsoft.com/SQLServer/feedbackdetail/view/917247/when-will-attunity-ssis-connector-support-sql-server-2014