How to continuously import data from external service into SQL Server


I have nearly a hundred services sending data to our message queue. This data is processed by a Java service and loaded into import tables in our SQL Server. After the data is loaded, a few procedures are executed that load it into the proper tables. Recently we had to add new instances of the service that reads and loads messages. It was suggested that we change the database isolation level to snapshot (I'm not very familiar with databases, so I simply did what was proposed). Unfortunately we had a lot of problems with it, so we had to duplicate the import tables and the aforementioned procedures. This of course resulted in a huge mess that I'm currently trying to clean up.
My current understanding is that snapshot isolation was suggested so that the services could work on the same table without problems, and that the errors we encountered stem from some misunderstanding or improper implementation on our (the developers') side.
My question is: is it possible, and if so how, to bulk load data into a single table, transform it and load it into a target table (everything in parallel, so let's say there are 3 or 4 services doing it) in a way that causes no deadlocks or data loss?
Our SQL Server is: Microsoft SQL Server 2014 (SP2-GDR) (KB4019093) - 12.0.5207.0 (X64)
I don't know much more, but I do know that, for example, we don't have support for partitioning or online index creation. Maybe this helps somehow.

I ended up modifying the services that load data, and the import tables, so that each record loaded has its own identifier. The services also no longer execute the import procedures; these are scheduled using SQL Agent and run once per minute. The solution is really simple, and while on average data reaches the destination tables 30 seconds after it is received by the services, this is something we can live with: we can load much, much, MUCH more data that way.

Related

SQL Server copy/replicate data from one table to another

I have 2 servers. I need to copy some columns from 4 different tables from server 1 into the corresponding (empty) tables in server 2.
So basically, it's about replicating data from one table to another. How is this done best (and easiest)? Also, how do I make sure that the copied/replicated data is updated at the same frequency as the source (which runs completely fine and automatically)?
I want to avoid using Linked Server.

How is this done best (and easiest)?
For a one-time copy, consider the SQL Server Import and Export Wizard. This approach can also be scheduled by saving the final package and scheduling it with SQL Agent.
Example: Simple way to import data into SQL Server
For continuous, low-latency data synchronization, use SQL Server Transactional Replication.
Further reading: Tutorial: Configure replication between two fully connected servers (transactional)
It is worth mentioning that transactional replication is not the easiest topic; however, it fits the requirement quite well.

Detect Table Changes In A Database Without Modifications

I have a database ("DatabaseA") that I cannot modify in any way, but I need to detect the addition of rows to a table in it and then add a log record to a table in a separate database ("DatabaseB") along with some info about the user who added the row to DatabaseA. (So it needs to be event-driven, not merely a periodic scan of the DatabaseA table.)
I know that normally, I could add a trigger to DatabaseA and run, say, a stored procedure to add log records to the DatabaseB table. But how can I do this without modifying DatabaseA?
I have free rein to do whatever I like in DatabaseB.
EDIT in response to questions/comments ...
Databases A and B are MS SQL 2008/R2 databases (as tagged), users are interacting with the DB via a proprietary Windows desktop application (not my own) and each user has a SQL login associated with their application session.
Any ideas?

Ok, so I have not put together a proof of concept, but this might work.
You can configure an extended events session on databaseB that watches for all the procedures on databaseA that can insert into the table or any sql statements that run against the table on databaseA (using a LIKE '%your table name here%').
This is a custom solution that writes the XE session to a table:
https://github.com/spaghettidba/XESmartTarget
You could probably mimic that functionality by writing the XE events to a custom user table every minute or so using a SQL Server Agent job.
Your session would monitor databaseA and write the XE output to databaseB. You would then write a trigger that, on each XE output write, compares the two tables and, if there are differences, writes them to your log table. This would be a continuously running process, but it is still a kind of periodic scan: XE only writes when the event happens, but the check still runs every couple of seconds.

I recommend you look at a data integration tool that can mine the transaction log for Change Data Capture events. We have recently been using StreamSets Data Collector for Oracle CDC, but it also has SQL Server CDC. There are many other competing technologies, including Oracle GoldenGate and Informatica PowerExchange (not PowerCenter). We like StreamSets because it is open source and is designed to build real-time data pipelines between databases at the schema level. Until now we have used batch ETL tools like Informatica PowerCenter and Pentaho Data Integration. I can copy all the tables in a schema in near real time in one StreamSets pipeline, provided I have already deployed the DDL in the target. I use this approach between Oracle and Vertica. You can add additional columns to the target and populate them as part of the pipeline.
The only catch might be identifying which user made the change. I don't know whether that is in the SQL Server transaction log. Seems probable but I am not a SQL Server DBA.

I looked at both solutions provided at the time of writing this answer (see Dan Flippo and dfundaka) but found that the first, using Change Data Capture, required modification to the database and the second, using Extended Events, wasn't really a complete answer, though it got me thinking of other options.
And the option that seems cleanest, and doesn't require any database modification, is to use SQL Server Dynamic Management Views. These are system views and functions, exposed under the sys schema, that report server state, including recently executed statements (in this case INSERTs and UPDATEs). Examples are sys.dm_exec_query_stats and sys.dm_exec_sql_text, which together return execution statistics and statement text for queries whose plans are still in the plan cache.
Though it's quite an involved process initially to extract the required information, the queries can be tuned and generalized to a degree.
There are restrictions on transaction history retention, etc., but for the purposes of this particular exercise this wasn't an issue.
I'm not going to select this answer as the correct one yet partly because it's a matter of preference as to how you approach the problem and also because I'm yet to provide a complete solution. Hopefully, I'll post back with that later. But if anyone cares to comment on this approach - good or bad - I'd be interested in your views.

SSIS 2014 ADO.NET connection to Oracle slow for 1 table (the rest is fine)

We're in the middle of doing a new data warehouse roll-out using SQL Server 2014. One of my data sources is Oracle, and unfortunately the recommended Attunity component for quick data access is not available for SSIS 2014 just yet.
I've tried avoiding using OLEDB, as that requires installation of specific Oracle client tools that have caused me a lot of frustration before, and with the Attunity stuff supposedly being in the works (MS promised they'd arrive in August already), I'm reluctant to go through the ordeal again.
Therefore, I'm using ADO.NET. All things considered, performance is acceptable for the time being, with the exception of 1 particular table.
This particular table in Oracle has a bunch of varchar columns, and I've come to the conclusion that it's because of the width of the selected row that this table performs particularly slowly. To prove that, rather than selecting all columns as they exist in Oracle (which is the original package I created), I truncated all widths to the maximum length of the values actually stored (CAST(column AS varchar(46))). This reduced the time to run the same package to 17 minutes (still way below what I'd call acceptable, and it's not something I'd put in production because it'll open up a world of future pain, but it proves the width of the columns is definitely a factor).
I increased the network packet size in SQL Server, but that did not seem to help much. I have not managed to figure out a good way to alter the packet size on the ADO.NET connector for Oracle (SQL Server does have that option). I tried adding Packet size=32000; to the connection string for the Oracle connector, but that just threw an error, indicating it simply won't be accepted. The same applies to FetchSize.
Eventually, I came up with a compromise where I split the load into three different parts, dividing the varchar columns between these parts, and using two MERGE JOIN objects to, well, merge the data back into a single combined dataset. Running that and doing some extrapolation leads me to think that method would have taken roughly 30 minutes to complete (but without the potential for data loss that comes with the CAST solution above). However, that's still not acceptable.
I'm currently in the process of trying some other options (not using MERGE JOIN but dumping into three different tables, and then merging those on the SQL Server itself, and splitting the package up into even more different loads in an attempt to further speed up the individual parts of the load), but surely there must be something easier.
Does anyone have experience with how to load data from Oracle through ADO.NET, where wide rows would cause delays? If so, are there any particular guidelines I should be aware of, or any additional tricks you might have come across that could help me reduce load time while the Attunity component is unavailable?
Thanks!

The updated Attunity drivers have just been released by Microsoft:
Hi all, I am pleased to inform you that the Oracle and TeraData connector V3.0 for SQL14 SSIS is now available for download!!!!!
Microsoft SSIS Connectors by Attunity Version 3.0 is a minor release. It supports SQL Server 2014 Integration Services and includes bug fixes and support for updated Oracle and Teradata product releases.
For details, please look at the download page.
http://www.microsoft.com/en-us/download/details.aspx?id=44582
Source: https://connect.microsoft.com/SQLServer/feedbackdetail/view/917247/when-will-attunity-ssis-connector-support-sql-server-2014

Client-side Replication for SQL Server?

I'd like to have some degree of fault tolerance / redundancy with my SQL Server Express database. I know that if I upgrade to a pricier version of SQL Server, I can get "Replication" built in. But I'm wondering if anyone has experience in managing replication on the client side. As in, from my application:
Every time I need to create, update or delete records from the database -- issue the statement to all n servers directly from the client side
Every time I need to read, I can do so from one representative server (other schemes seem possible here, too).
It seems like this logic could potentially be added directly to my Linq-To-SQL Data Context.
Any thoughts?

Every time I need to create, update or delete records from the database -- issue the statement to all n servers directly from the client side
Recipe for disaster.
Are you going to have a distributed transaction or just let some of the servers fail? If you have a distributed transaction, what do you do if a server goes offline for a while?
This type of thing can only work if you do it at a server-side data-portal layer where application servers take in your requests and are aware of your database farm. At that point, you're better off just using a higher grade of SQL Server.
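To make the failure mode above concrete, here is a small sketch of what happens when a client fans a write out to n servers without a distributed transaction. It is purely illustrative: in-memory SQLite databases stand in for the SQL Server Express instances, and the Server class, its online flag, and write_to_all are invented for this example.

```python
import sqlite3


class Server:
    """Hypothetical stand-in for one database server."""

    def __init__(self):
        self.online = True
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def execute(self, sql, params=()):
        if not self.online:
            raise ConnectionError("server offline")
        with self.conn:  # each server commits on its own
            return self.conn.execute(sql, params).fetchall()


def write_to_all(servers, sql, params):
    """Naive client-side fan-out: no distributed transaction."""
    failed = []
    for index, server in enumerate(servers):
        try:
            server.execute(sql, params)
        except ConnectionError:
            failed.append(index)  # the other servers have already committed
    return failed


servers = [Server() for _ in range(3)]
servers[1].online = False  # simulate an outage during the write
failed = write_to_all(
    servers, "INSERT INTO items (id, name) VALUES (?, ?)", (1, "widget")
)
servers[1].online = True  # the server comes back, but missed the insert
counts = [s.execute("SELECT COUNT(*) FROM items")[0][0] for s in servers]
```

After this runs, failed lists the server that missed the write and counts shows that the copies no longer agree. The client would then need its own catch-up logic for the returning server, which is the work a server-side replication feature otherwise does.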

I have managed replication from an in-house client. My database model worked on an insert-only mode for all transactions, and insert-update for lookup data. Deletes were not allowed.
I had a central table that everything was related to. I added a field to this table for a date-time stamp which defaulted to NULL. I took data from this table and all related tables into a staging area, did BCP out, cleaned up staging tables on the receiver side, did a BCP IN to staging tables, performed data validation and then inserted the data.
For some basic fault tolerance, you can schedule regular backups.
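The insert-only replication described in this answer could look something like the sketch below. It rests on several assumptions that the answer does not spell out: that a NULL date-time stamp means "not yet replicated", that the stamp is set once the receiver has the row, and that validation amounts to skipping rows the receiver already has. SQLite connections stand in for the two servers, the BCP out and BCP in steps are replaced by passing rows in memory, and the names (orders, staging_orders, replicated_at, replicate) are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def make_source():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, amount REAL NOT NULL, "
        "replicated_at TEXT DEFAULT NULL)"  # NULL = not yet replicated (assumed)
    )
    return conn


def make_receiver():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staging_orders (id INTEGER, amount REAL);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL);
        """
    )
    return conn


def replicate(source, receiver):
    """Copy unstamped rows to the receiver, then stamp them at the source."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE replicated_at IS NULL ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    with receiver:
        receiver.execute("DELETE FROM staging_orders")  # clean up staging
        receiver.executemany(
            "INSERT INTO staging_orders (id, amount) VALUES (?, ?)", rows
        )
        # Validation stand-in: skip rows already present, so a repeated run
        # after a failed stamping step does not insert duplicates.
        receiver.execute(
            "INSERT INTO orders (id, amount) "
            "SELECT id, amount FROM staging_orders "
            "WHERE id NOT IN (SELECT id FROM orders)"
        )
    stamp = datetime.now(timezone.utc).isoformat()
    with source:
        source.executemany(
            "UPDATE orders SET replicated_at = ? WHERE id = ?",
            [(stamp, row[0]) for row in rows],
        )
    return len(rows)
```

Stamping only after the receiver has committed means a crash in between leads to a resend rather than a lost row, which fits an insert-only model where the receiver can detect duplicates.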

Copying data from a local database to a remote one

I'm writing a system at the moment that needs to copy data from a client's locally hosted SQL database to a hosted server database. Most of the data in the local database is copied to the live one, though optimisations are made to reduce the amount of actual data required to be sent.
What is the best way of sending this data from one database to the other? At the moment I can see a few possible options, none of which yet stands out as the prime candidate.
Replication, though this is not ideal, and we cannot expect it to be supported in the version of SQL we use on the hosted environment.
Linked server, copying data direct - a slow and somewhat insecure method
Webservices to transmit the data
Exporting the data we require as XML and transferring to the server to be imported in bulk.
The data copied goes into copies of the tables, without identity fields, so data can be inserted/updated without any violations in that respect. This data transfer does not have to be done at the database level, it can be done from .net or other facilities.
More information
The frequency of the updates will depend entirely on how often records are updated. But the basic idea is that if a record is changed then the user can publish it to the live database. Alternatively we'll record the changes and send them across in a batch on a configurable frequency.
The number of records we're talking about is around 4000 rows per table for the core tables (product catalog) at the moment, but this is completely variable depending on the client we deploy this to, as each would have their own product catalog, ranging from hundreds to thousands of products. To clarify, each client is on a separate local/hosted database combination; they are not combined into one system.
As well as the individual publishing of items, we would also require a complete re-sync of data to be done on demand.
Another aspect of the system is that some of the data being copied from the local server is stored in a secondary database, so we're effectively merging the data from two databases into the one live database.

Well, I'm biased. I have to admit. I'd like to hypnotize you into shelling out for SQL Compare to do this. I've been faced with exactly this sort of problem in all its open-ended frightfulness. I got a copy of SQL Compare and never looked back. SQL Compare is actually a silly name for a piece of software that synchronizes databases. It will also do it from the command line once you have got a working project together with all the right knobs and buttons. Of course, you can only do this for reasonably small databases, but it really is a tool I wouldn't want to be seen in public without.
My only concern with your requirements is where you are collecting product catalogs from a number of clients. If they are all in separate tables, then all is fine, whereas if they are all in the same table, then this would make things more complicated.

How much data are you talking about? How many 'client' DBs are there? And how often does it need to happen? The answers to those questions will make a big difference to the path you should take.

There is an almost infinite number of solutions for this problem. In order to narrow it down, you'd have to tell us a bit about your requirements and priorities.
Bulk operations would probably cover a wide range of scenarios, and you should add that to the top of your list.

I would recommend using Data Transformation Services (DTS) for this. You could create a DTS package for appending and one for re-creating the data.
It is possible to invoke DTS package operations from your code so you may want to create a wrapper to control the packages that you can call from your application.

In the end I opted for a set of triggers to capture data modifications to a change log table. There is then an application that polls this table and generates XML files for submission to a webservice running at the remote location.
