SQL Server - ETL approach

We get daily files that need to be loaded into our database. The files are delivered to a different server from the one hosting the database. Which of these two approaches is better for the ETL from a performance perspective?
Transfer files over from the delivery server to the database server. Do bulk load.
Open DB connection from delivery server and load
Edited to add: The servers are all on the same network.

It depends on whether the source servers are SQL Server or another technology; the driver used (if it's Oracle, the Microsoft driver will nerf your perf badly, and Oracle's own is better); the amount of database overhead you want to impose (while one server is feeding the other, they are probably both I/O bound); and the disk layout you have (i.e. reading from one RAID array and writing to another). Compressing the files and transferring them over a 1 Gbit or 100 Mbit link might be more efficient. Dumps usually compress nicely but, as Beth noted, test it.
With dumps you can exploit parallelism (multiple disk shares, and multiple processors for compression; use 7-Zip, period). Over Ethernet you probably won't be able to exploit as much parallelism. The same applies to the target server.
All in all, as usual with performance, test, quantify, test, quantify, repeat:)

The universal response: 'It depends'. In particular, it depends on which ETL technology you are using. If your ETL is tied to the database server for its processing power (SSIS and, to a lesser degree, BODI), then you need to get your files onto the database server as soon as possible. If you have a more file-based ETL package (Ab Initio, Informatica), then you are free to do your transformation on the delivery server and then move the 'ready-to-load' data onto the database server for bulk loading.

In all cases, and especially if the files are very large, you can compress the data files before transporting them over the network.
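To put a number on the "compress before transfer" idea, here is a minimal Python sketch. It is not from the answers above: the file name and the sample rows are made up, and gzip is just one convenient choice (the earlier answer prefers 7-Zip). It streams the file in chunks and reports the sizes so the saving can be measured rather than guessed.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path=None, level=6):
    """Compress src_path to dst_path (default: src_path + '.gz').

    Returns (original_bytes, compressed_bytes) so the saving can be quantified.
    """
    if dst_path is None:
        dst_path = src_path + ".gz"
    # Stream in 1 MiB chunks so a large daily file is never fully in memory.
    with open(src_path, "rb") as src, \
            gzip.open(dst_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    return os.path.getsize(src_path), os.path.getsize(dst_path)


if __name__ == "__main__":
    # Hypothetical sample data standing in for a delivered daily file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "daily_extract.csv")
        with open(sample, "w", encoding="utf-8") as f:
            for i in range(50_000):
                f.write(f"{i},2024-01-01,device-{i % 100},OK\n")
        original, compressed = gzip_file(sample)
        print(f"original={original} bytes, compressed={compressed} bytes, "
              f"ratio={compressed / original:.2%}")
```

Whether the CPU time spent compressing pays off depends on the link speed and the data, so it is worth timing both the compressed and uncompressed transfer on real files.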

Related

Oracle performance when doing bulk inserts from Delphi

We have a Delphi application which can connect to either Oracle or SQL Server. We use Devart components to connect to the databases, and everything is very generic when it comes to database access. i.e. we use the lowest common denominator. Ultimately we use the databases as data stores and do not use any of the more "advanced" features which maybe specific to the database.
However, we have a serious performance issue with Oracle. It has to do with inserting data. I know that inserting data by running a load of insert statements is not great for performance, but because some business logic needs to be applied to the raw data before it gets uploaded to the database, we are more or less restricted to multiple inserts. To give an idea of the performance difference, a recent test inserted 1000 items into our database and took 5 minutes in SQL Server (acceptable) but 44 minutes in Oracle.
Is there anything we could do to improve performance? The inserting of data needs to be done by the user and NOT an Oracle DBA, so absolutely no Oracle skills is one of the pre-requisites for any solution. Basically, the users need to press a button and everything is done.
Edit: Business logic happens before the insert (although there is a little going on during the actual insert), so a more realistic number would be 2 minutes for SQL Server and 40 or so minutes for Oracle. Bear in mind we are inserting a few large BLOBs per record, so perhaps that explains the slowish performance, but not why there is such a difference. The 1000 items are part of a transaction.

Oracle supports array DML, which can speed up inserts. Also, if BLOBs are involved, performance may depend on caching settings and on how the BLOBs are set up in the destination table. Tuning some DB client parameters may also help increase network throughput.
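As a rough illustration of the array DML calling pattern (many rows per statement submission instead of one), here is a small Python sketch. Note the assumptions: the question is about Delphi and Oracle, whereas this uses Python's built-in sqlite3 and its DB-API `executemany` purely as a stand-in; the `items` table and the batch size are invented. With an in-process database there are no network round trips to save, so this only shows the shape of the code, not the speed-up.

```python
import sqlite3


def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def insert_batched(conn, rows, batch_size=100):
    """Insert rows in batches inside a single transaction.

    Returns the number of executemany calls, i.e. how many times a
    statement was submitted (instead of once per row).
    """
    calls = 0
    with conn:  # commits on success, rolls back on error
        for batch in chunks(rows, batch_size):
            conn.executemany(
                "INSERT INTO items (id, payload) VALUES (?, ?)", batch
            )
            calls += 1
    return calls


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload BLOB)")
    # 1000 items as in the question; the 1 KiB payload is an assumption.
    rows = [(i, b"x" * 1024) for i in range(1000)]
    calls = insert_batched(conn, rows, batch_size=100)
    count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    print(f"{count} rows inserted with {calls} batched calls")
```

If the time is dominated by moving large BLOBs rather than by per-statement overhead, batching like this may change little, which is worth checking with a trace first.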
Anyway, without knowing which version of Oracle you're using, how it is configured, your table definition(s) (and their tablespaces), how large the BLOBs are, and the SQL actually used (did you trace it?), it's very difficult to diagnose the real problem.
Oracle has some powerful diagnostic tools to identify bottlenecks, but they may not be easy to use and require knowing enough about how Oracle works. From the Enterprise Manager Console you can access some of them in a more readable format. Did you check it?
Update: because I can't comment on other answers: Oracle supports different types of LOB storage:
LOBs stored in the database (under transaction management)
BFILEs, external file system LOBs still managed by Oracle (LOB data not under transaction)
SecureFiles (11g onwards, like BFILEs but with transaction support and other features)
Oracle is designed for and can manage large LOBs; it just needs to be configured properly. Parameters that affect LOB performance:
ENABLE/DISABLE STORAGE IN ROW
CACHE/NOCACHE/CACHE READS
LOGGING/NOLOGGING
CHUNK
PCTVERSION/RETENTION (especially for updates and deletes)
TABLESPACE (usually, a dedicated tablespace for lobs is advisable)
These parameters need to be set taking into account the average LOB size, how LOBs are accessed, and how often they are modified. There's no "one size fits all".
But there is also the client side: OCI can buffer LOBs client-side, so small read/write operations are cached, minimizing the number of network round trips and LOB versioning. That's up to the OCI wrapper you're using.

Array DML (only available with FireDAC, ODAC, DOA and our SynDBOracle unit, AFAIK) won't change much if your problem is about BLOB transfer.
The first idea is to compress the data before transmission.
Try several access libraries. Our open source SynDBOracle accesses the oci.dll client directly and may be slightly faster.
But the problem may be on the server side. Oracle does not like transactions with huge data, since they tend to overflow its write-ahead log (redo log) files. Try tuning the write-ahead log files.
IMHO an RDBMS is not the best option for storing huge BLOBs. Plain files, indexed via an RDBMS for metadata, are usually better. Or switch to NoSQL storage, such as a key/value store or the MongoDB blob API.
Remember that both Oracle and MSSQL charge in proportion to the data size...

SQL Server Table > MS Access Local Copy?

I'm looking for a little advice.
I have some SQL Server tables I need to move to local Access databases for some local production tasks - once per "job" setup, w/400 jobs this qtr, across a dozen users...
A little background:
I am currently using a DSN-less approach to avoid distribution issues
I can create temporary LINKS to the remote tables and run "make table" queries to populate the local tables, then drop the remote tables. Works as expected.
Performance here in US is decent - 10-15 seconds for ~40K records. Our India teams are seeing >5-10 minutes for the same datasets. Their internet connection is decent, not great and a variable I cannot control.
I am wondering if MS Access is adding some overhead here that can be avoided by a more direct approach, i.e. letting the server do all/most of the heavy lifting instead of Access?
I've tinkered with various combinations, with no clear improvement or success:
Parameterized stored procedures from Access
SQL Passthru queries from Access
ADO vs DAO
Any suggestions, or an overall approach to suggest? How about moving data as XML?
Note: I have Access 2007, 2010, and 2013 users.
Thanks!

It's not entirely clear but if the MSAccess database performing the dump is local and the SQL Server database is remote, across the internet, you are bound to bump into the physical limitations of the connection.
ODBC drivers are not meant to be used for data access beyond a LAN, there is too much latency.
When Access queries data, it doesn't open a stream. It fetches a block of data, waits for it to be downloaded, then requests another batch. This is OK on a LAN but quickly degrades over long distances. Communication between the US and India probably has around 200ms latency, and you can't do much about it; it adds up very quickly if the communication protocol is chatty. All this comes on top of the connection's bandwidth, which is very likely way below what you would get on a LAN.
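A back-of-the-envelope model makes the latency argument concrete. The Python sketch below is only an estimate under assumed numbers: the ~40K rows and ~200ms latency come from this thread, but the rows per fetch, bytes per row, and bandwidth figures are invented for illustration and are not how Access or ODBC actually size their batches.

```python
import math


def transfer_time_seconds(rows, rows_per_fetch, round_trip_s,
                          row_bytes, bandwidth_bytes_per_s):
    """Rough time for a block-by-block fetch.

    Total = (number of round trips * latency) + (bytes moved / bandwidth).
    """
    round_trips = math.ceil(rows / rows_per_fetch)
    latency_cost = round_trips * round_trip_s
    bandwidth_cost = rows * row_bytes / bandwidth_bytes_per_s
    return latency_cost + bandwidth_cost


if __name__ == "__main__":
    # All figures below are illustrative assumptions, not measurements.
    wan = transfer_time_seconds(40_000, 100, 0.200, 200, 1_000_000)
    lan = transfer_time_seconds(40_000, 100, 0.001, 200, 100_000_000)
    print(f"WAN estimate: {wan:.1f} s, LAN estimate: {lan:.2f} s")
```

With these assumed inputs, the WAN case needs 400 round trips at 0.2 s each, so 80 s is spent purely waiting on latency before any bandwidth limit is counted. A chattier protocol (fewer rows per fetch) scales that term up proportionally.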
The better solution would be to perform the dump locally and then transmit the resulting Access file after it has been compacted and maybe zipped (using 7z for instance for better compression). This would most likely result in very small files that would be easy to move around in a few seconds.
The process could easily be automated. The easiest option may be to perform this dump automatically every day and make it available on an FTP server or an internal website, ready for download.
You can also make it available on demand, maybe through an app running on a server and made available through RemoteApp using RDP services on a Windows 2008 server, or simply through a website or a shell.
You could also have a simple Windows service on your SQL Server that listens for requests from a remote client installed on the local machines everywhere; it would process the dump and send it to the client, which would then unpack it and replace the previously downloaded database.
Plenty of solutions for this, even though they would probably require some amount of work to automate reliably.
One final note: if you automate the data dump from SQL Server to Access, avoid using Access in an automated way. It's hard to debug and quite easy to break. Use an export tool instead that doesn't rely on having Access installed.

Renaud and all, thanks for taking the time to provide your responses. As you note, performance across the internet is the bottleneck. The fetching of blocks (vs a contiguous DL) of data is exactly what I was hoping to avoid via an alternate approach.
Our workflow is evolving to better leverage both sides of the clock, where User1 in the US completes their day's efforts in the local DB and then sends JUST their updates back to the server (based on timestamps). User2 in India, who also has a local copy of the same DB, grabs just the updated records off the server at the start of his day. So, pretty efficient for day-to-day stuff.
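For readers who want to see what a timestamp-based pull can look like, here is a minimal Python sketch. It is an illustration only, not my actual Access/SQL Server code: sqlite3 stands in for both the server and the local copy, and the `records` table, its columns, and the ISO-8601 text timestamps are all assumptions.

```python
import sqlite3

SCHEMA = ("CREATE TABLE records ("
          "id INTEGER PRIMARY KEY, value TEXT, modified_at TEXT)")


def pull_changes(server, local, last_sync):
    """Copy rows modified after last_sync from server to local.

    Returns the newest modified_at seen, to store as the next
    high-water mark (unchanged if nothing was newer).
    """
    changed = server.execute(
        "SELECT id, value, modified_at FROM records "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_sync,),
    ).fetchall()
    with local:  # one local transaction for the whole batch
        local.executemany(
            "INSERT OR REPLACE INTO records (id, value, modified_at) "
            "VALUES (?, ?, ?)",
            changed,
        )
    return changed[-1][2] if changed else last_sync


if __name__ == "__main__":
    server = sqlite3.connect(":memory:")
    local = sqlite3.connect(":memory:")
    server.execute(SCHEMA)
    local.execute(SCHEMA)
    server.executemany(
        "INSERT INTO records VALUES (?, ?, ?)",
        [(1, "a", "2024-01-01T10:00:00"), (2, "b", "2024-01-02T09:00:00")],
    )
    mark = pull_changes(server, local, "2024-01-01T12:00:00")
    print("new high-water mark:", mark)
```

Two caveats with this kind of scheme: rows deleted on the server are never seen by a `modified_at` filter, and it relies on all timestamps coming from one consistent clock.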
The primary issue is the initial DL of the local DB tables from the server (huge multi-year DB) for the current "job". It should happen just once at the start of the effort (a ~1 wk long process). This is the piece that takes 5-10 minutes for India to accomplish.
We currently do move the DB back and forth via FTP - DAILY. It is used as a SINGLE shared DB and is a bit LARGE due to temp tables. I was hoping my new timestamped-based push-pull of just the changes daily would have been an overall plus. Seems to be, but the initial DL hurdle remains.

Merge multiple Access database into one big database

I have multiple ~50MB Access 2000-2003 databases (MDB files) that only contain tables with data. The data-databases are located on a server in my enterprise that can take ~1-2 seconds to respond (and about 10 seconds to actually open a 50MB MDB file manually while browsing in the file explorer). I have other databases that only contain forms. Most of those forms-databases (still MDB files) are copied from the server to the client by a batch file before execution (after some testing, execution looks smoother that way). Most of those forms-databases use table links to fetch the data from the data-databases.
Now, my question is: is there any advantage/disadvantage to merging all the data-databases from my ~50MB databases into one big database (let's say 500MB)? Will it be slower? It would actually help to clean up my code if I didn't have to connect to all those different databases, and I don't think 500MB is a lot, but I don't pretend to be really used to Access by any means, and that's why I'm asking. If Access needs to read the whole MDB file to get the data from a specific table, then it would be slower. That wouldn't really be surprising from Microsoft, but I've been pleased so far with MS Access database performance.
There will never be more than ~50 people connected to the database at the same time (most likely, this number won't in fact be more than 10, but I prefer being a little bit conservative here just to be sure).

The db engine does not read the entire MDB file to get information from a specific table. It must read information from the system tables (hidden tables whose names start with MSys) to determine where the data you need is stored. Furthermore, if you're using a query to retrieve information from the table, and the db engine can use an index to determine which rows satisfy the query's WHERE clause, it may read only those rows from the table.
However, you have issues with your network's performance. When those lead to dropped connections, you risk corrupting the MDB. That is why Access is not well suited for use in wide area networks or with wireless connections. And even on a wired LAN, you can suffer such problems when the network is flaky.
So while reducing the amount of data you pull across the network is a good thing, it is not the best remedy for Access on a flaky network. Instead you should migrate the data to a client-server db so it can be kept safe in spite of dropped connections.

You are walking on thin ice here.
Access will handle your scenario, but is not really meant to allow so many concurrent connections.
Merging everything in a big database (500mb) is not a wise move.
Have you tried to open it from a network location?
My suggestion would be to use a SQL Server Express back end and merge all the tables into a single real client-server database.
The changes required by client mdb front-end should not be very pervasive.

Transactional Replication For Write Heavy Medium Sized Database

We have a decent sized, write-heavy database that is about 426 GB (including indexes) and about 300 million rows. We currently collect location data from devices that report to our server every couple of minutes, and we serve about 10,000 devices, so lots of writes every second. The location table that stores the location of each device has about 223 million rows. The data is currently archived by year.
Problems occur when users run large reports on this database, the whole database grinds down almost to a stop.
I understand I need a reporting database, but my question is if anyone has experience of using SQL Server Transactional Replication on a database of equivalent size, and their experience of using this technology?
My rough plan is to point all the reports in our application to the Reporting Database, use Transactional Replication to replicate the data over from the master to the slave (Reporting Database).
Anyone have any thoughts on this strategy and the problems I may encounter?
Many thanks!

Transactional replication should work well in this scenario (the only effect the size of the database will have is the time taken to generate the initial snapshot). However, it may not solve your problem.
I think the issue you'll have if you choose transactional replication is that the slave server is going to be under the same load as the master machine as changes are applied - it will still crawl when users run large reports (assuming it's of a similar spec).
Depending on the acceptable latency of reporting data to the live data, this may or may not be OK for your users.
If some latency is acceptable you may get better performance from log shipping, since changes are applied in batches.
Before acquiring a reporting server, another approach would be to investigate the queries that your users are running and look at modifying either their code or the indexing strategy to better match what they're trying to do.

Transactional Replication could work well for you. The things to consider:
The target database tables must be read-only.
The server containing the target database should be stout enough to handle the SELECT traffic from the reporting applications.
Depending on the INSERT/UPDATE traffic, you may need to have a third server act as the Distribution server.
You also have to consider the size of the Distribution database.
Based on what I read here, I'd use a pull subscription from the Reporting server to offload traffic from the OLTP server.
You can skip the torment of a snapshot by initializing the reporting database from a backup of the OLTP database. See https://msdn.microsoft.com/en-us/library/ms151705.aspx
There will be INSERT/UPDATE/DELETE traffic from the Replication into both the Distribution and the Subscriber databases. That requires consideration, but lock/block issues should be no worse (and probably better) than running those reports off of OLTP.
I am running multiple publications on a 2.6TB database with 2.5GB/day of growth, using both pure transactional to drive reports (to two reporting servers) and Peer-to-Peer Transactional to replicate data in a scale-out for a SaaS offering (to three more servers). Because of this, we have a separate distributor.
Hope this helps.
Thanks
John.

Integration transport choice (Oracle + SQL Server)

We have several systems with Oracle (A) and SQL Server (B) databases on backend. I have to consolidate data from those systems into the new SQL Server database.
Something like that:
(A) =>|---------------|
      | some software | => SQL Server
(B) =>|---------------|
where some software is:
transport (A and B systems located in the network)
processing business logic (custom .NET code)
Because of the first point, I need some queue software or something similar (like MSMQ or Service Broker). On the other hand, I could implement a web service instead of a queue.
(A) =>|---------------|-------------|
      | queue/service | custom code | => SQL Server
(B) =>|---------------|-------------|
The question is: which queue/transport framework should I use with Oracle and SQL Server databases?
It would be nice, if I can post messages to MSMQ in both Oracle and SQL Server stored procedures (can I?)
It would be nice, if I can call a web-service in both Oracle and SQL Server stored procedures (can I?)
It would be nice, if I can use something similar in both Oracle and SQL Server stored procedures (what exactly?)
What software should I prefer to my requirements?
UPD: some techspec
This would be a regular sync process. Once a day I think.
Latency is not critical (>0.5-1 hour is ok).
Amount of data: 1-50 MB per sync from each system.
Encryption is required while transfer.

I would suggest creating an SSIS package that, when invoked, transfers the new data from servers A and B to the new server. You would launch the SSIS package on a schedule, say every 30 min, from the new server.
If both A and B were SQL Server, then Service Broker would make sense in order to provide very low latency. But with one of them being Oracle, and with no real-time requirements, it loses its appeal. As a side note, you can see here an example of using Service Broker for High volume real time contiguous ETL.
Doing the transfer as an SSIS package makes for ease of maintenance (you can modify the package with relative ease), does not require invasive changes in the existing systems, is quite performant, and there is a great deal of SSIS know-how available online.
I would advise against using MSMQ for several reasons:
When transactional reliability is needed, you'll have to enlist all MSMQ-related operations in a distributed transaction (DTC between the MSMQ dequeue and the SQL Server insert/update on the new server), which will slow down the processing throughput significantly.
You'll need to come up with quite a few lines of code for marshaling/unmarshaling and shredding the deltagram messages into the target system (I know coding is fun, but SSIS is simply better at this kind of job, and easier to maintain).
The MSMQ limit of 2GB per queue is quite small in the real world (it fills up quickly if your traffic increases and you have a maintenance downtime).
The real problem I'd be worried about is how to detect changes on A and B: when the SSIS job runs every 30 minutes, how does it know what data is new? In particular, how does it detect deletes?
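One generic way to approach the change-detection question, deletes included, is to compare a snapshot of primary keys and row fingerprints from the previous run with the current one. The Python sketch below is only an illustration of that idea under assumptions: the helper names are made up, rows are plain tuples with the key in the first position, and this is not an SSIS feature or something the source systems are known to provide.

```python
import hashlib


def row_fingerprint(row):
    """Stable hash of a row's values; repr() keeps None distinct from ''."""
    joined = "\x1f".join(repr(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def snapshot(rows, key_index=0):
    """Map primary key -> fingerprint for the rows currently in the source."""
    return {row[key_index]: row_fingerprint(row) for row in rows}


def diff_snapshots(previous, current):
    """Return (inserted, updated, deleted) key lists between two snapshots."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return inserted, updated, deleted


if __name__ == "__main__":
    # Hypothetical rows as (id, value) tuples from two consecutive runs.
    before = snapshot([(1, "a"), (2, "b"), (3, "c")])
    after = snapshot([(1, "a"), (2, "B"), (4, "d")])
    inserted, updated, deleted = diff_snapshots(before, after)
    print("inserted:", inserted, "updated:", updated, "deleted:", deleted)
```

The trade-off is that every run has to read all keys from the source, which is fine for the 1-50 MB per sync mentioned in the question but gets expensive on large tables, where a source-side change log or soft-delete flag would scale better.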
