I am confused as to what problems SSIS packages solve. I need to create an application to copy content from our local network to our live servers across a dedicated line that may be unreliable. From our live server the content then needs to be replicated across all other servers. The database also needs to be updated with a record of every file that arrived successfully, so the content can be made available to users.
I was told that SSIS can do this, but is it the right tool to use? SSIS is for data transformation, not for copying files from one network to another. Can SSIS really do this?
My rule of thumb is: if no transformation, no aggregation, no data mapping and no disparate sources then no SSIS.
You may want to explore Transactional Replication:
http://technet.microsoft.com/en-us/library/ms151176.aspx
and if you are on SQL Server 2012 you can also take a look at Availability Groups: http://technet.microsoft.com/en-us/library/ff877884.aspx
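If it turns out the job really is just moving files and recording what arrived (no transformation at all), a plain script may be all you need for the copy-and-record part. Below is a minimal Python sketch of that idea; the paths, connection string and ArrivedFiles table are hypothetical placeholders, not anything from your environment:

```python
# Minimal sketch (not SSIS): copy files across an unreliable link with retries,
# then record each successful arrival in the database. Paths, connection string
# and the ArrivedFiles table are hypothetical placeholders.
import shutil
import time
from pathlib import Path

import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

SOURCE_DIR = Path(r"\\local-share\content")   # placeholder
TARGET_DIR = Path(r"\\live-server\content")   # placeholder
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=live-sql;DATABASE=Content;Trusted_Connection=yes;")  # placeholder

def copy_with_retry(src: Path, dst: Path, attempts: int = 5, delay: float = 30.0) -> bool:
    """Retry the copy a few times in case the dedicated line drops."""
    for attempt in range(1, attempts + 1):
        try:
            shutil.copy2(src, dst)
            return True
        except OSError as exc:
            print(f"attempt {attempt} failed for {src.name}: {exc}")
            time.sleep(delay)
    return False

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    for src in SOURCE_DIR.glob("*"):
        if src.is_file() and copy_with_retry(src, TARGET_DIR / src.name):
            # Record the successful arrival so the content can be surfaced to users.
            cursor.execute(
                "INSERT INTO dbo.ArrivedFiles (FileName, ArrivedAt) "
                "VALUES (?, SYSUTCDATETIME())",
                src.name,
            )
            conn.commit()
```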
I would use SSIS for this scenario. It has built-in restart functionality ("checkpoints") which I would use to manage partial retries when your line fails. It is also easy to configure the Control Flow so tasks run in parallel, e.g. Site 2 isn't left waiting for data if Site 1 is slow.
I'm looking for a little advice.
I have some SQL Server tables I need to move to local Access databases for some local production tasks - once per "job" setup, with 400 jobs this quarter, across a dozen users...
A little background:
I am currently using a DSN-less approach to avoid distribution issues
I can create temporary LINKS to the remote tables and run "make table" queries to populate the local tables, then drop the links. Works as expected.
Performance here in the US is decent - 10-15 seconds for ~40K records. Our India teams are seeing 5-10 minutes or more for the same datasets. Their internet connection is decent, not great, and a variable I cannot control.
I am wondering if MS Access is adding some overhead here that can be avoided by a more direct approach: i.e., letting the server do all/most of the heavy lifting instead of Access?
I've tinkered with various combinations, with no clear improvement or success:
Parameterized stored procedures from Access
SQL Passthru queries from Access
ADO vs DAO
Any suggestions, or an overall approach to suggest? How about moving data as XML?
Note: I have users on Access 2007, 2010, and 2013.
Thanks!
It's not entirely clear, but if the MS Access database performing the dump is local and the SQL Server database is remote, across the internet, you are bound to bump into the physical limitations of the connection.
ODBC drivers are not meant to be used for data access beyond a LAN; there is too much latency.
When Access queries data, it doesn't open a stream: it fetches a block of rows, waits for that block to be downloaded, then requests the next batch. This is fine on a LAN but degrades quickly over long distances. Communication between the US and India probably carries around 200ms of latency, and you can't do much about that; with a chatty protocol it adds up very quickly - fetching ~40K records in batches of, say, 100 rows means roughly 400 round trips, or about 80 seconds lost to latency alone. All of this sits on top of a connection whose bandwidth is very likely far below what you would get on a LAN.
The better solution would be to perform the dump locally and then transmit the resulting Access file after it has been compacted and maybe zipped (using 7z for instance for better compression). This would most likely result in very small files that would be easy to move around in a few seconds.
The process could easily be automated. The easiest option is probably to perform this dump automatically every day and make it available on an FTP server or an internal website, ready for download.
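If you go the scheduled-dump route, the packaging and transfer part is easy to script. A rough Python sketch, where the file name, FTP host and credentials are all placeholders:

```python
# Rough sketch: compress the nightly dump and push it to an FTP server for the
# remote team to download. File name, host and credentials are placeholders.
import zipfile
from ftplib import FTP
from pathlib import Path

DUMP_FILE = Path(r"C:\dumps\jobdata.accdb")   # the compacted Access dump (placeholder)
ZIP_FILE = DUMP_FILE.with_suffix(".zip")

# Compress the dump; Access files usually shrink considerably.
with zipfile.ZipFile(ZIP_FILE, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write(DUMP_FILE, arcname=DUMP_FILE.name)

# Upload the archive so it is ready for download at the start of the remote team's day.
with FTP("ftp.example.com") as ftp:        # placeholder host
    ftp.login("dumpuser", "secret")        # placeholder credentials
    with open(ZIP_FILE, "rb") as fh:
        ftp.storbinary(f"STOR {ZIP_FILE.name}", fh)
```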
You can also make it available on demand, maybe through an app running on a server and published through RemoteApp using RDP services on a Windows 2008 server, or simply through a website or a shell.
You could also have a simple Windows service on your SQL Server machine that listens for requests from a small client installed on the local machines everywhere; the service would produce the dump and send it to the client, which would then unpack it and replace the previously downloaded database.
Plenty of solutions for this, even though they would probably require some amount of work to automate reliably.
One final note: if you automate the data dump from SQL Server to Access, avoid automating Access itself. It's hard to debug and quite easy to break. Use an export tool instead that doesn't rely on having Access installed.
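As an illustration of that last point, even a small script that talks to SQL Server directly over ODBC avoids Access automation entirely. A sketch that dumps tables to CSV (the server, database, driver name and table list are assumptions; loading the CSVs into an .accdb would be a separate step, e.g. via the ACE ODBC driver or an Access import):

```python
# Sketch of an export step that does not require Access to be installed:
# read the tables straight from SQL Server over ODBC and write CSV files.
# Server, database, driver name and table list are assumptions.
import csv

import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=prod-sql;DATABASE=Jobs;Trusted_Connection=yes;")
TABLES = ["dbo.JobHeader", "dbo.JobDetail"]   # placeholder table names

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    for table in TABLES:
        cursor.execute(f"SELECT * FROM {table}")
        out_name = table.split(".")[-1] + ".csv"
        with open(out_name, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(col[0] for col in cursor.description)  # header row
            while True:
                rows = cursor.fetchmany(5000)   # stream in batches to keep memory flat
                if not rows:
                    break
                writer.writerows(rows)
```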
Renaud and all, thanks for taking the time to provide your responses. As you note, performance across the internet is the bottleneck. The fetching of blocks of data (vs a contiguous download) is exactly what I was hoping to avoid via an alternate approach.
Our workflow is evolving to better leverage both sides of the clock: User1 in the US completes their day's efforts in the local DB and then sends JUST their updates back to the server (based on timestamps). User2 in India, who also has a local copy of the same DB, grabs just the updated records off the server at the start of his day. So, pretty efficient for day-to-day stuff.
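Conceptually, that timestamp-based pull is nothing more than "give me the rows modified since my last sync, then remember the new high-water mark". A rough sketch of the idea (written here in Python with placeholder table/column names, purely to illustrate the shape of it):

```python
# Illustration only: pull the rows modified since the last sync and remember
# the new high-water mark. Table, column and server names are placeholders.
import datetime
import json
from pathlib import Path

import pyodbc

STATE_FILE = Path("last_sync.json")   # where the previous watermark is kept
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=prod-sql;DATABASE=Jobs;Trusted_Connection=yes;")  # placeholder

def load_watermark() -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_sync"]
    return "1900-01-01 00:00:00"      # first run: take everything

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"last_sync": value}))

last_sync = load_watermark()
run_started = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    # Only rows touched since the previous run cross the wire.
    cursor.execute(
        "SELECT JobID, Field1, Field2, LastModified "
        "FROM dbo.JobDetail WHERE LastModified > ?",
        last_sync,
    )
    changed = cursor.fetchall()

# ... apply `changed` to the local copy here ...
save_watermark(run_started)
```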
The primary issue is the initial download of the local DB tables from the server (a huge multi-year DB) for the current "job" - it should happen just once at the start of the effort (a roughly week-long process). This is the piece that takes 5-10 minutes for India to accomplish.
We currently do move the DB back and forth via FTP - DAILY. It is used as a SINGLE shared DB and is a bit LARGE due to temp tables. I was hoping my new timestamp-based push-pull of just the daily changes would be an overall plus. It seems to be, but the initial download hurdle remains.
I am building an application which needs to consume data from a source database. The source database has several issues, including:
Performance issues
Legacy structure with terrible keys, naming conventions, etc.
Lots of data my application doesn’t care about
I would like to set up an application-specific SQL Server database. The new database will be populated with a subset of data from the source database (and from a few other source systems). The data will always move one way, from the source databases to the application-specific database (i.e. data won't sync back to the source). It will have a different DDL model than the source database.
The data doesn't need to be synced absolutely real time, but any longer than a few minute lag could cause issues.
How should I move data from the source database into the application database? Should I use
Replication
Write Custom SSIS Packages
Abstract to a higher-level SOA solution like NServiceBus, AppFabric, etc.?
Some other ideas?
Pros/cons to each?
Sounds to me like you don't need a messaging service like NServiceBus - that would involve modifying the legacy system to publish events whenever data changes, something I expect you don't want to get into. Because it is acceptable in your case for your local store of data to be slightly out of date, an SSIS package could work.
However, if the source database is very large, this could be an issue, as you will be running the package every few minutes. Also, if users of the legacy system are already experiencing performance problems, an SSIS package running every few minutes won't help. Maybe you could introduce a timestamp on the source data, so that the package only copies new/modified rows? (See the sketch below.)
If the source data is very large and performance is seriously an issue, then maybe NServiceBus would be a good idea. You could also consider MassTransit or your own simple solution built on MSMQ. But this will mean getting your hands dirty with the legacy code.
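If you do go with a periodic copy driven by a timestamp/watermark (whether inside an SSIS package or a small custom job), the core pattern is: pull the delta from the source, land it in a staging table, then do a set-based upsert into your reshaped model. A rough sketch of that pattern in Python; every object name, the LastModified column and both connection strings are assumptions:

```python
# Rough sketch of an incremental one-way copy: read rows changed since the last
# watermark from the legacy source, land them in a staging table on the app
# database, then MERGE them into the reshaped app table. Every object name,
# the LastModified watermark column and the connection strings are assumptions.
import pyodbc

SRC = ("DRIVER={ODBC Driver 17 for SQL Server};"
       "SERVER=legacy-sql;DATABASE=LegacyDb;Trusted_Connection=yes;")
DST = ("DRIVER={ODBC Driver 17 for SQL Server};"
       "SERVER=app-sql;DATABASE=AppDb;Trusted_Connection=yes;")

def sync_customers(last_watermark):
    with pyodbc.connect(SRC) as src, pyodbc.connect(DST) as dst:
        src_cur, dst_cur = src.cursor(), dst.cursor()

        # 1. Pull only what changed since the previous run (renaming legacy columns).
        src_cur.execute(
            "SELECT CUST_NO, CUST_NM, LastModified "
            "FROM dbo.TBL_CUST WHERE LastModified > ?", last_watermark)
        rows = src_cur.fetchall()
        if not rows:
            return last_watermark

        # 2. Land the delta in a staging table on the app database.
        dst_cur.execute("TRUNCATE TABLE staging.Customer")
        dst_cur.executemany(
            "INSERT INTO staging.Customer (CustomerId, Name, SourceModified) VALUES (?, ?, ?)",
            [tuple(r) for r in rows])

        # 3. Set-based upsert into the application model.
        dst_cur.execute("""
            MERGE app.Customer AS tgt
            USING staging.Customer AS src
               ON tgt.CustomerId = src.CustomerId
            WHEN MATCHED THEN
                UPDATE SET tgt.Name = src.Name
            WHEN NOT MATCHED THEN
                INSERT (CustomerId, Name) VALUES (src.CustomerId, src.Name);""")
        dst.commit()

        # Persist this between runs so the next pass only picks up new changes.
        return max(r.LastModified for r in rows)
```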
I have a few questions about memory usage in SSIS packages.
If I am loading data from Server A to Server B and the SSIS package is on my desktop system, running through BIDS, will the buffer creation (memory usage) happen on my desktop? If so, performance will be slow, right, since my desktop has much less memory than the servers?
How can I make use of server resources while developing a package on my desktop system?
Also, if I have 3 SSIS developers all developing different packages at the same time, what is the best development method?
To expand on #3, the best way I have found to allow teams to work on a single SSIS solution is to decompose a problem (package) down into smaller and smaller chunks and control their invocation through a parent-child/master-slave type relationship.
For example, the solution concerns loading the data warehouse. I'd maybe have 2 Controller packages, FactController.dtsx and DimensionController.dtsx. Their responsibility is to call the various packages that solve the need (loading facts or dimensions). Perhaps my DimensionProductLoader package is dealing with a snowflake (it needs to update the Product and the SubProduct table) so that gets decomposed into 2 packages.
The goal of all of this is to break the development process down into manageable chunks to avoid concurrent access to a single package. Merging the XML will not be a productive use of your time.
The only shared resource for all of this is the SSIS project file (dtproj), which is just an XML document enumerating the packages that comprise the project. Create an upfront skeleton project with well-named, blank packages and you can probably skip some of the initial pain surrounding folks trying to merge the project back into your repository. I find that one-off merges like that go much better, for TFS at least, than everyone checking their XML globs back in.
Yes. A package runs on the same computer as the program that launches it. Even when a program loads a package that is stored remotely on another server, the package runs on the local computer.
If by server resources you mean server CPU, you can't. It's like trying to use the resources of any other computer on the network. Of course, if you have an OLE DB Source that runs a SELECT on SQL Server, the CPU that "runs" the SELECT will be the one on the SQL Server machine, obviously, but once the resultset is retrieved, it is handled by the computer where the package is running.
Like any other development method. If you have a class in a C# project being developed by 3 developers, how do you do it? You can have each developer working on the same file and merge the changes - after all, a package is an XML file - but it is more complicated, and I wouldn't recommend it. I've been in situations where more than one developer worked on the same package, but not at the exact same time.
Expanding on Diego's and Bill's Answers:
1) Diego has this mostly correct; I would just add: the package runs on the computer that runs it, and even worse, running a package through BIDS is not even close to what you will see on a server, since the process BIDS uses to run the package is a 32-bit process running locally. You will be slower due to the limits of the 32-bit subsystem, as well as the cost of copying all of your data across the network into the buffers in memory on your workstation, transforming it as it flows through your package, and then pushing it back across the network to your destination server. This is fine for testing small subsets of your data in a test environment, but should not be used to estimate performance on a server system.
2) Diego has this correct. If you want to see server performance, deploy it to a test server and run it there.
3) billinkc has this correct. One of the big drawbacks to SSIS in TFS is that there is no elegant way to share work on a single package. If you want more than one developer working on a single process, break it into smaller chunks and let only one developer work on each piece. As long as they are not developing the same package at the same time, you should be fine.
I am new to SSIS, and am trying to use its parallelism features to import data from a database.
My job is to do this: import a multi-terabyte database into a set of flat files as quickly as possible.
I was thinking of this:
I have a Microsoft HPC Server 2008 cluster (3 nodes) at my disposal. I was thinking of writing an HPC SOA job so that all three compute nodes can make independent connections to the SQL Server and import a portion of the data in parallel. Of course this would have nothing to do with SSIS and would be an independent utility.
Then I came across SSIS and its parallel import features. My SSIS server is not very high-end - only a 4 GB machine. I am somewhat inclined to use SSIS because that's the ideal Microsoft way of doing data import - I won't have to rewrite a lot of stuff, and can possibly reuse existing transformations, etc.
What is the best way to use Custom Tasks (or available ones) and do this import in parallel?
Gitmo, I may misunderstand your question but will give it a shot. You need to move data from a SQL Server instance to multiple files, correct? You want to leverage the parallelised data movement functionality provided by SSIS. That means multiple simultaneously running Data Flow Tasks (DFTs). For each target file you can have only one DFT because of problems with concurrent writes.
To get multiple simultaneously running Data Flow Tasks where your source is a SQL Server database and your target is a set of files, you can possibly try the following ways (please note there are upper limits on the parallelization you can get out of SSIS based upon many factors including your CPU Core count, whether you are running in BIDS/Visual Studio or not, and various settings in your packages, your server(s), your SQL Server instance, and many other considerations):
The Multiple Simultaneous DFT Solution: A single SSIS Package with one Connection Manager pointed to the source SQL Server database and many Connection Managers each pointed to a separate target file, plus one DFT for each target file. The DFTs are all disconnected from one another (no precedence constraints or green/red/blue lines/arrows). If there are pre- or post-ETL steps that need to run, a great way to parallelize these DFTs is to drop them all in a Sequence Container that is connected to the earlier and later tasks through precedence constraints/arrows. These disconnected DFTs in their own Sequence Container will all try to run simultaneously.
The Multiple Simultaneous DTEXEC Solution: Multiple SSIS packages each with their own target file-specific DFT. You manually run separate DTEXEC processes either through separate CMD windows or through the GUI. #3 below is a variation on this solution and possibly a better one.
The Parent Master Package Running Multiple Children Packages Solution: Wrap the per target file packages developed in #2 above in a single Parent Master package. In the Parent package have multiple simultaneously running Execute Package Tasks. Again these Execute Package Tasks would be disconnected from other tasks. A good way to do this is to drop the multiple Execute Package Tasks in their own Sequence Container. As before if the Execute Package Tasks are disconnected (no precedence constraints/arrows) they will all try to run simultaneously.
Take a look at this excellent article from the Microsoft SQLCAT Team for some more ideas/insight: Top 10 SQL Server Integration Services Best Practices
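And if you end up going the independent-utility route you mention, the core of a parallel export is not much code. A rough sketch that range-partitions one large table across worker threads, each with its own connection and its own target file (table name, key column, the hard-coded ranges and the connection string are all assumptions):

```python
# Rough sketch of a parallel export outside SSIS: each worker gets its own
# connection and its own target file, and pulls one key range of a large table.
# Table name, key column, the ranges and the connection string are assumptions.
import csv
from concurrent.futures import ThreadPoolExecutor

import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=bigdb-sql;DATABASE=Warehouse;Trusted_Connection=yes;")

# (low_key, high_key, output_file) partitions; in practice derive these from
# MIN/MAX of the key or from partition boundaries.
PARTITIONS = [
    (1,           50_000_000,  "orders_part1.csv"),
    (50_000_001,  100_000_000, "orders_part2.csv"),
    (100_000_001, 150_000_000, "orders_part3.csv"),
]

def export_range(low, high, out_file):
    # One connection per worker so the reads really run in parallel on the server.
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT * FROM dbo.Orders WHERE OrderId BETWEEN ? AND ?", low, high)
        with open(out_file, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(col[0] for col in cursor.description)
            while True:
                rows = cursor.fetchmany(10_000)   # stream, don't hold TBs in memory
                if not rows:
                    break
                writer.writerows(rows)
    return out_file

with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    for done in pool.map(lambda p: export_range(*p), PARTITIONS):
        print(f"finished {done}")
```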
There are likely variations on these same ideas and possibly other solutions available both inside and outside of SSIS. Good luck!
Please look at this post about using multithreading outside SSIS to achieve parallelism without modifying the package much: Multithreaded serial execution - http://sqljunkieshare.com/2011/12/21/parallelism-in-etl-process-ssis-2008-and-ssis-2012/
I'm writing a system at the moment that needs to copy data from a client's locally hosted SQL database to a hosted server database. Most of the data in the local database is copied to the live one, though optimisations are made to reduce the amount of data that actually has to be sent.
What is the best way of sending this data from one database to the other? At the moment I can see a few possibly options, none of them yet stand out as being the prime candidate.
Replication, though this is not ideal, and we cannot expect it to be supported in the version of SQL we use on the hosted environment.
Linked server, copying data direct - a slow and somewhat insecure method
Webservices to transmit the data
Exporting the data we require as XML and transferring to the server to be imported in bulk.
The copied data goes into copies of the tables, without identity fields, so data can be inserted/updated without any violations in that respect. This data transfer does not have to be done at the database level; it can be done from .NET or other facilities.
More information
The frequency of the updates will depend entirely on how often records are updated. But the basic idea is that if a record is changed, the user can publish it to the live database. Alternatively, we'll record the changes and send them across in a batch at a configurable frequency.
The number of records we're talking about is around 4,000 rows per table for the core tables (product catalog) at the moment, but this is completely variable depending on the client we deploy this to, as each would have their own product catalog ranging from hundreds to thousands of products. To clarify, each client is on a separate local/hosted database combination; they are not combined into one system.
As well as the individual publishing of items, we would also require a complete re-sync of data to be done on demand.
Another aspect of the system is that some of the data being copied from the local server is stored in a secondary database, so we're effectively merging the data from two databases into the one live database.
Well, I'm biased, I have to admit. I'd like to hypnotize you into shelling out for SQL Compare to do this. I've been faced with exactly this sort of problem in all its open-ended frightfulness. I got a copy of SQL Compare and never looked back. SQL Compare is actually a silly name for a piece of software that synchronizes databases. It will also do it from the command line once you have put together a working project with all the right knobs and buttons. Of course, you can only do this for reasonably small databases, but it really is a tool I wouldn't want to be seen in public without.
My only concern with your requirements is where you are collecting product catalogs from a number of clients. If they are all in separate tables, then all is fine, whereas if they are all in the same table, then this would make things more complicated.
How much data are you talking about? How many 'client' DBs are there? And how often does this need to happen? The answers to those questions will make a big difference to the path you should take.
There is an almost infinite number of solutions for this problem. In order to narrow it down, you'd have to tell us a bit about your requirements and priorities.
Bulk operations would probably cover a wide range of scenarios, and you should add that to the top of your list.
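To make the bulk option concrete: once the exported file has been transferred to the hosted server, a single BULK INSERT can load it into the copy table. A sketch (the table name, file path, format options and connection string are placeholders, and this assumes a CSV export rather than XML):

```python
# Sketch of the bulk-load half of the "export then import in bulk" option:
# once the exported file sits on the hosted server, a single BULK INSERT loads
# it into the copy table. Table name, file path and options are placeholders.
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=hosted-sql;DATABASE=LiveShop;UID=publisher;PWD=secret;")

with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    conn.execute("""
        BULK INSERT dbo.ProductCatalog_Copy
        FROM 'D:\\transfers\\product_catalog.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2);""")
```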
I would recommend using Data Transformation Services (DTS) for this. You could create a DTS package for appending and one for re-creating the data.
It is possible to invoke DTS package operations from your code so you may want to create a wrapper to control the packages that you can call from your application.
In the end I opted for a set of triggers to capture data modifications to a change log table. There is then an application that polls this table and generates XML files for submission to a webservice running at the remote location.
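For reference, the polling side of that is conceptually quite small: read the unprocessed rows from the change log, serialize them as XML, post them to the webservice, then mark them as processed. A sketch in Python where the change-log schema, XML layout and endpoint URL are placeholders:

```python
# Conceptual sketch of the polling step: read unprocessed rows from the change
# log, serialize them as XML, post to the remote webservice, then mark them done.
# The change-log schema, XML layout and endpoint URL are placeholders.
import urllib.request
import xml.etree.ElementTree as ET

import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=local-sql;DATABASE=Shop;Trusted_Connection=yes;")
ENDPOINT = "https://live.example.com/import"    # placeholder webservice URL

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    cursor.execute(
        "SELECT ChangeId, TableName, KeyValue, Operation, Payload "
        "FROM dbo.ChangeLog WHERE Processed = 0 ORDER BY ChangeId")
    changes = cursor.fetchall()

    if changes:
        root = ET.Element("changes")
        for c in changes:
            item = ET.SubElement(root, "change", {
                "table": c.TableName, "key": str(c.KeyValue), "op": c.Operation})
            item.text = c.Payload
        body = ET.tostring(root, encoding="utf-8", xml_declaration=True)

        req = urllib.request.Request(
            ENDPOINT, data=body, headers={"Content-Type": "application/xml"})
        with urllib.request.urlopen(req) as resp:   # raises HTTPError on failure
            resp.read()

        cursor.executemany(
            "UPDATE dbo.ChangeLog SET Processed = 1 WHERE ChangeId = ?",
            [(c.ChangeId,) for c in changes])
        conn.commit()
```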