How can I efficiently update a database with external data sources? - sql-server

I'm trying to populate a user information table in an MS SQL database with data from multiple sources (i.e. LDAP and some other MS SQL databases). The process needs to run as a daily scheduled task to ensure that the user information table is updated frequently.
The initial attempt at this query/update script was written in VBScript: it queries each data source and then updates the user information table. Unfortunately, it takes a very long time to run.
I'm curious whether anyone has written anything similar and whether you recommend, or noticed, a performance improvement from writing the script in another language. Some have recommended Perl because of its multi-threading, but if anyone has other suggestions for improving the process or other approaches, could you share your tips or lessons learned?

It's good practice to use Data Transformation Services (DTS), or SSIS as it has become known, for repetitive DB tasks. Although this won't solve your problem on its own, it may give some pointers to what is going on, since you can log each stage of the process, wrap it in transactions, etc. It is especially well suited for bulk loading and updates, and it understands VBScript natively so there should be no problem there.
Other than that, I have to agree with Brian: find out what's making it slow and fix that. Changing languages is unlikely to fix it on its own, especially if you have an underlying issue. As a general point, my experience with LDAP, which is admittedly limited, is that it could be incredibly slow at reading bulk user details.

I can't tell you how to solve your particular problem, but whenever you run into this situation you want to find out why it is slow before you try to solve it. Where is the slowdown? Some major things to consider and investigate include:
getting the data
interacting with the network
querying the database
updating indices in the database
Get some timing and profiling information to figure out where to concentrate your efforts.
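If the database side turns out to be the slow part, SQL Server's built-in statistics output is a quick way to time individual statements. A minimal sketch; the table and column names below are placeholders, not from the original question:
```sql
-- Report compile/execution times and I/O for the statements that follow.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Placeholder statement standing in for one step of the update script.
UPDATE dbo.UserInfo
SET    DisplayName = 'test'
WHERE  UserId = 1;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```
The timings show up in the Messages output, which makes it easy to compare the database portion against the time spent querying LDAP or moving data over the network.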

Hmmm. Seems like you could cron a script that uses dump utils from the various sources, then seds the output into good form for the load util for the target database. The script could be in bash or Perl, whatever.
Edit: In terms of performance, I think the first thing you want to try is to make sure that you disable any autocommit at the beginning of the load process, then issue the commit after writing all the records. This can make a HUGE performance difference.
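In T-SQL terms that means wrapping the whole load in one explicit transaction rather than letting every insert commit on its own. A minimal sketch, with a placeholder table name:
```sql
-- Single explicit transaction around the entire load (dbo.UserInfo is a placeholder).
BEGIN TRANSACTION;

INSERT INTO dbo.UserInfo (UserId, UserName, Email)
VALUES (1, 'jdoe', 'jdoe@example.com');
-- ... many more inserts ...

COMMIT TRANSACTION;  -- one commit at the end instead of one per row
```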

As MrTelly said, use SSIS or DTS, then schedule the package to run. Just converting to this alone will probably fix your speed issue, as they have tasks that are optimized for bulk inserting. I would never do this in a scripting language rather than T-SQL anyway. Likely your script works row by row instead of on sets of data, but that is just a guess.
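To illustrate the row-by-row versus set-based point: once the external data has been bulk-loaded into a staging table (via SSIS, BULK INSERT, or similar), the whole update can be done in two set-based statements. A hedged sketch; all table and column names are invented for the example:
```sql
-- Assumes the external data has already been bulk-loaded into dbo.UserInfoStaging.
-- Update the rows that already exist...
UPDATE u
SET    u.UserName = s.UserName,
       u.Email    = s.Email
FROM   dbo.UserInfo u
JOIN   dbo.UserInfoStaging s ON s.UserId = u.UserId;

-- ...then insert the rows that are new.
INSERT INTO dbo.UserInfo (UserId, UserName, Email)
SELECT s.UserId, s.UserName, s.Email
FROM   dbo.UserInfoStaging s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.UserInfo u WHERE u.UserId = s.UserId);
```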

Related

Database Insert into script preparation for tests (logging data state?)

It is hard to formulate the question in the title, so I'll try to explain.
I need to write automated UI tests for an application which has already been written. This application has a big testing database loaded with a lot of data. Sometimes it is hard to understand the relationships between tables, as they are not trivial, and some foreign keys are missing (some of the logic is implemented in the Java application, some in stored procedures).
The problem is that I can run my tests only once: after they finish, some data has been moved and some has been deleted by the application. So I need to prepare scripts and run INSERT INTO statements before every test.
Is there a way to make preparing such scripts easier? Of course, a good solution would be to investigate the whole DB structure and its dependencies (or review the application logic in Java), but that would take a lot of time. I cannot save the database state and look for the changes after the tests finish, as exporting/importing the data in SQL Developer would take a lot of time. Maybe a DB administrator has other Oracle DB tools for doing this?
I suggest you obtain and use a good testing framework. They're usually free. For Java the standard is JUnit. For PL/SQL, I like utPlsql but there are several others available. Using the testing framework you can insert the needed data into the database at the start of the test, run the test, and make sure the end results are as expected, then clean up the data so you can run the test again cleanly. This requires a fair amount of coding to make it all work right, but it's much easier to do this coding once than to perform the same tasks manually many times over.
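If a full framework feels like too much to start with, the same idea can be sketched directly in SQL: a setup script inserts exactly the rows a test depends on, and a teardown script removes them afterwards so the test can be re-run. The table, columns and ids below are invented for illustration:
```sql
-- Setup: insert the rows this particular test depends on (orders is a placeholder table).
INSERT INTO orders (order_id, customer_id, status)
VALUES (900001, 42, 'NEW');
COMMIT;

-- ... run the UI test here ...

-- Teardown: remove everything the test created or touched, keyed by the known ids.
DELETE FROM orders WHERE order_id = 900001;
COMMIT;
```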
Share and enjoy.

Writing to PostgreSQL database format without using PostgreSQL

I am collecting lots of data from lots of machines. These machines cannot run PostgreSQL and they cannot connect to a PostgreSQL database. At the moment I save the data from these machines in CSV files and use the COPY FROM command to import the data into the PostgreSQL database. Even on high-end hardware this process takes hours. Therefore, I was thinking about writing the data in the PostgreSQL database file format directly. I would then simply copy these files into the /data directory and start the PostgreSQL server. The server would then find the database files and accept them as databases.
Is such a solution feasible?
Theoretically this might be possible if you studied the source code of PostgreSQL very closely.
But you essentially wind up (re)writing the core of PostgreSQL, which qualifies as "not feasible" from my point of view.
Edit:
You might want to have a look at pg_bulkload, which claims to be faster than COPY (I haven't used it, though).
Why can't they connect to the database server? If it is because of library-dependencies, I suggest that you set up some sort of client-server solution (web services perhaps) that could queue and submit data along the way.
Relying on batch operations will always give you a headache when dealing with large amounts of data, and if COPY FROM isn't fast enough for you, I don't think anything will be.
Yeah, you can't just write the files out in any reasonable way. In addition to the data page format, you'd need to replicate the commit logs, part of the write-ahead logs, some transaction visibility parts, any conversion code for types you use, and possibly the TOAST and varlena code. Oh, and the system catalog data, as already mentioned. Rough guess, you might get by with only needing to borrow 200K lines of code from the server. PostgreSQL is built from the ground up around being extensible; you can't even interpret what an integer means without looking up the type information around the integer type in the system catalog first.
There are some tips for speeding up the COPY process at Bulk Loading and Restores. Turning off synchronous_commit in particular may help. Another trick that may be useful: if you start a transaction, TRUNCATE a table, and then COPY into it, that COPY goes much faster. It doesn't bother with the usual write-ahead log protection. However, it's easy to discover COPY is actually bottlenecked on CPU performance, and there's nothing useful you can do about that. Some people split the incoming file into pieces and run multiple COPY operations at once to work around this.
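The TRUNCATE-then-COPY trick looks roughly like this; the table name and file path are placeholders, and the write-ahead-log savings depend on your server configuration:
```sql
-- TRUNCATE and COPY inside the same transaction so the COPY can skip normal WAL protection.
BEGIN;
TRUNCATE measurements;                                      -- placeholder table
COPY measurements FROM '/tmp/measurements.csv' WITH CSV;    -- placeholder server-side path
COMMIT;
```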
Realistically, pg_bulkload is probably your best bet, unless it too gets CPU bound--at which point a splitter outside the database and multiple parallel loading is really what you need.

Feasible way to do automated performance testing on various database technologies?

A lot of guys on this site state that "optimizing something for performance is the root of all evil". My problem is that I have a lot of complex SQL queries, many of them utilizing user-created functions in PL/pgSQL or PL/Python, and I do not have any performance profiling tool to show me which functions actually make the queries slow. My current method is to exclude the various functions one at a time and time the query for each. I know that I could use EXPLAIN ANALYZE as well, but I do not think it will provide me with information about user-created functions.
My current method is quite tedious, especially since there is no query-progress indication in PostgreSQL, so I sometimes have to wait 60 seconds for a query to finish if I chose to run it on too much data.
Therefore, I am wondering whether it would be a good idea to create a tool which automatically profiles SQL queries by modifying the query and measuring the actual processing time of various versions of it. Each version would be simplified, perhaps containing just a single user-created function. I know I am not describing how to do this very clearly, and I can think of a lot of complicating factors, but I can also see workarounds for many of them. I basically need your gut feeling on whether such a method is feasible.
Another, similar idea is to run the query with the server setting work_mem set to various values and show how this impacts performance.
Such a tool could be written using JDBC so it could be modified to work across all major databases. In this case it might be a viable commercial product.
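For reference, the manual version of both experiments is already possible from psql. A hedged sketch; the query and function names are invented, and track_functions requires PostgreSQL 8.4+ and usually superuser rights:
```sql
-- Compare the same query under different work_mem settings.
SET work_mem = '4MB';
EXPLAIN ANALYZE SELECT t.id, slow_plpgsql_func(t.id) FROM big_table t;

SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT t.id, slow_plpgsql_func(t.id) FROM big_table t;

-- Per-function call counts and timings, if your version and permissions allow it.
SET track_functions = 'all';
-- ... run the query ...
SELECT funcname, calls, total_time FROM pg_stat_user_functions;
```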
Apache JMeter can be used to load test and monitor the performance of SQL queries (using JDBC). It will, however, not modify your SQL.
Actually I don't think any tool out there could simplify and then re-run your SQL. How should that "simplifying" work?

Need help designing big database update process

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up a SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research CREATE ASSEMBLY, which would move your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.
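A rough sketch of what registering such an assembly involves; the file path, assembly, namespace, class and method names are all made up, and CLR integration has to be enabled first:
```sql
-- Enable CLR integration (instance-wide setting).
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

-- Register the client library on the server (placeholder path and name).
CREATE ASSEMBLY TransformLib
FROM 'C:\deploy\TransformLib.dll'
WITH PERMISSION_SET = SAFE;

-- Expose one of its static methods as a T-SQL stored procedure.
CREATE PROCEDURE dbo.TransformObject
    @ObjectId INT
AS EXTERNAL NAME TransformLib.[MyCompany.Transforms].TransformObject;
```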
Bad news: you have many options
use flatfile transformations: Extract all the data into flat files, manipulate them using grep, awk, sed, C or Perl into the required insert/update statements, and execute those against the target database.
PRO: fast. CON: extremely ugly ... a nightmare for maintenance; don't do this if you need it for longer than a week, or for more than a couple dozen executions.
use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within another, so one of the fastest ways to do this is to write it as a collection of insert/update/merge statements fed by select statements (see the sketch at the end of this answer).
PRO: fast, one technology only. CON: requires a direct connection between the databases, and you might hit the limits of SQL, or of the available SQL knowledge, pretty fast, depending on the kind of transformation.
use T-SQL, or whatever iterative language the database provides; everything else is similar to the pure SQL approach.
PRO: pretty fast, since you don't leave the database. CON: I don't know T-SQL, but if it is anything like PL/SQL it is not the nicest language for complex transformations.
use an ETL tool: There are specialized tools for extracting, transforming and loading data. They often support various databases and have many strategies readily available for deciding whether an update or an insert is in place.
PRO: sorry, you'll have to ask somebody else for that; so far I have nothing but bad experience with those tools.
CON: a highly specialized tool that you need to master. In my personal experience: slower in implementation and execution of the transformation than handwritten SQL, and a nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, version control, CI and testing you are stuck with whatever the tool provider gives you, if anything.
use a high-level language (Java, C#, VB ...): You would load your data into proper business objects, manipulate those, and store them in the target database. Pretty much what you seem to be doing right now, although it sounds like there are better ORMs available, e.g. NHibernate.
PRO: even complex manipulations can be implemented in a clean, maintainable way, and you can use all the fancy tools like good IDEs, testing frameworks and CI systems to support you while developing the transformation.
CON: it adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling the objects back into the target database). I'd go this way if it is a process that is going to be around for a long time.
Building on the last option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database or more than one target database. Or you could manually implement a multithreaded transformer in order to gain throughput. But I guess I am leaving the scope of your question.
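As mentioned under the pure SQL option, here is a hedged sketch of what that might look like on SQL Server; all database, table and column names are invented, MERGE requires SQL Server 2008 or later, and across servers you would use a linked server with four-part names instead:
```sql
-- One merge statement fed by a select from the source database, including a sample transform.
MERGE TargetDb.dbo.Products AS t
USING (SELECT ProductId, UPPER(Name) AS Name, Price * 1.1 AS Price
       FROM   SourceDb.dbo.Products) AS s
ON (t.ProductId = s.ProductId)
WHEN MATCHED THEN
    UPDATE SET t.Name = s.Name, t.Price = s.Price
WHEN NOT MATCHED THEN
    INSERT (ProductId, Name, Price)
    VALUES (s.ProductId, s.Name, s.Price);
```
On SQL Server 2005 the same effect would need a separate UPDATE plus an INSERT of the missing rows.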
I'm with John: SSIS is the way to go for any repeatable process that imports large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need a hybrid of set-based and looping code that runs in batches (of, say, 2000 records at a time) rather than locking up the table for the whole time a large insert would take.
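A hedged sketch of that batched approach; all names are invented, and it assumes ObjectId identifies rows not yet copied:
```sql
-- Copy rows in batches of ~2000 so locks and the transaction log stay small.
DECLARE @BatchSize INT;
SET @BatchSize = 2000;

WHILE 1 = 1
BEGIN
    INSERT INTO TargetDb.dbo.BusinessObject (ObjectId, Prop1, Prop2)
    SELECT TOP (@BatchSize) s.ObjectId, s.Prop1, s.Prop2
    FROM   SourceDb.dbo.BusinessObject s
    WHERE  NOT EXISTS (SELECT 1 FROM TargetDb.dbo.BusinessObject t
                       WHERE t.ObjectId = s.ObjectId)
    ORDER BY s.ObjectId;

    IF @@ROWCOUNT < @BatchSize BREAK;   -- last (possibly partial) batch has been copied
END
```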

Most compatible way to listen for database changes?

I have a process that reads raw data and writes this to a database every few seconds.
What is the best way to tell if the database has been written to? I know that Oracle and MS SQL can use triggers or something similar to communicate with other services, but I was hoping there would be a technique that works with more types of SQL databases (SQLite, MySQL, PostgreSQL).
Your question is lacking specifics needed for a good answer but I'll give it a try. Triggers are good for targeting tables but if you are interested in system-wide writes then you'll need a better method that is easier to maintain. For system-wide writes I'd investigate methods that detect changes in the transaction log. Unfortunately, each vendor implements this part differently, so one method that works for all vendors is not likely. That is, a method that works within the database server is unlikely. But there may be more elegant ways outside of the server at the OS level. For instance, if the transaction log is a file on disk then a simple script of some sort that detects changes in the file would indicate the DB was written to.
Keep in mind you have asked only to detect a db write. If you need to know what type of write it was then you'll need to get into the transaction log to see what is there. And that will definitely be vendor specific.
It depends on what you wish to do. If it is something external to the database that needs to be kicked off then a simple poll of the database would do the trick, otherwise a db specific trigger is probably best.
If you want to be database independent, polling can work. It's not very efficient or elegant, but it also works if you are cursed with a database that doesn't support triggers. A workaround that we've used in the past is a script, timed (say via cron), that does a SELECT MAX(primary_key_id) FROM saidTable. I am assuming that your primary key is a sequential integer and is indexed.
And then compare that to the value you obtained the last time you ran the script. If they match, tell the script to exit or sleep. If not, do your thing.
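A minimal sketch of that polling check in plain SQL, using a one-row bookkeeping table to remember the last id seen; the bookkeeping table name is invented, while saidTable and primary_key_id come from the description above:
```sql
-- One-row state table holding the highest id processed so far.
-- CREATE TABLE poll_state (last_seen_id BIGINT);
-- INSERT INTO poll_state VALUES (0);

-- Run on a schedule: fetch anything newer than the stored watermark...
SELECT *
FROM   saidTable
WHERE  primary_key_id > (SELECT last_seen_id FROM poll_state);

-- ...and, after processing, advance the watermark.
UPDATE poll_state
SET    last_seen_id = (SELECT MAX(primary_key_id) FROM saidTable);
```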
There are other issues with this approach (ie: backlogs if the script takes too long, or concurrency issues, etc.). And of course performance can become an issue too!
