Which system's memory will be used by an SSIS package? - sql-server

I have a few questions about memory usage in an SSIS package.
1. If I am loading data from Server A to Server B, and the SSIS package sits on my desktop machine and is run through BIDS, will the buffer creation (memory usage) happen on my desktop? If so, performance will be slower than on the servers (which have more memory), right?
2. How can I make the package use the servers' resources while I am developing it on my desktop?
3. Please help me with this one too: if I have 3 SSIS developers all developing different packages at the same time, what is the best development method?

To expand on #3, the best way I have found to allow teams to work on a single SSIS solution is to decompose a problem (package) down into smaller and smaller chunks and control their invocation through a parent-child/master-slave type relationship.
For example, the solution concerns loading the data warehouse. I'd maybe have 2 Controller packages, FactController.dtsx and DimensionController.dtsx. Their responsibility is to call the various packages that solve the need (loading facts or dimensions). Perhaps my DimensionProductLoader package is dealing with a snowflake (it needs to update the Product and the SubProduct table) so that gets decomposed into 2 packages.
The goal of all of this is to break the development process down into manageable chunks to avoid concurrent access to a single package. Merging the XML will not be a productive use of your time.
The only shared resource for all of this is the SSIS project file (dtproj), which is just an XML document enumerating the packages that comprise the project. Create an upfront skeleton project with well-named, blank packages and you can probably skip some of the initial pain surrounding folks trying to merge the project back into your repository. I find that one-off merges of that kind go much better, for TFS at least, than everyone checking their XML globs back in.

Yes. A package runs on the same computer as the program that launches it. Even when a program loads a package that is stored remotely on another server, the package runs on the local computer.
If by server resources you mean server CPU, you can't. It would be like using the resources of any other computer on the network. Of course, if you have an OleDBSource that runs a SELECT on SQL Server, the CPU that "runs" the SELECT will be the one on the SQL Server, but once the result set is retrieved, it is handled by the computer where the package is running.
Like any other development method. If you have a class in a C# project being developed by 3 developers, how do you do it? You can have each developer work on the same file and merge the changes (after all, a package is just an XML file), but that is more complicated and I wouldn't recommend it. I've been in situations where more than one developer worked on the same package, but not at the exact same time.

Expanding on Diego's and Bill's Answers:
1) Diego has this mostly correct; I would just add: the package runs on the computer that runs it, but even worse, running a package through BIDS is not even close to what you will see on a server, since the process BIDS uses to run the package is a 32-bit process running locally. You will be slower due to the limits of the 32-bit subsystem, as well as copying all of your data across the network into the buffers in memory on your workstation, transforming it as your package flows, and then pushing it across the network again to your destination server. This is fine for testing small subsets of your data in a test environment, but it should not be used to estimate the performance on a server system.
2) Diego has this correct. If you want to see server performance, deploy it to a test server and run it there.
3) billinkc has this correct. One of the big drawbacks to SSIS in TFS is that there is no elegant way to share work on a single package. If you want to use more than one developer in a single process, break it into smaller chunks and let only one developer work on each piece. As long as they are not developing the same package at the same time, you should be fine.

Related

Can SSIS packages copy large amounts of data between servers

I am confused as to what problems SSIS packages solve. I need to create an application to copy content from our local network to our live servers across a dedicated line that may be unreliable. From our live server the content needs to be replicated across all other servers. The database also needs to be updated with all the files that arrived successfully, so the content can be made available to the user.
I was told that SSIS can do this, but my question is: is this the right tool to use? SSIS is for data transformation, not for copying files from one network to the other. Can SSIS really do this?
My rule of thumb is: if no transformation, no aggregation, no data mapping and no disparate sources then no SSIS.
You may want to explore Transactional Replication:
http://technet.microsoft.com/en-us/library/ms151176.aspx
and if you are on SQL Server 2012 you can also take a look at Availability Groups: http://technet.microsoft.com/en-us/library/ff877884.aspx
I would use SSIS for this scenario. It has built-in restart functionality ("checkpoints") which I would use to manage partial retries when your line fails. It is also easy to configure the Control Flow so tasks can run in parallel, e.g. Site 2 isn't left waiting for data if Site 1 is slow.

SSIS Parallelism - Microsoft HPC Cluster?

I am new to SSIS, and am trying to use its Parallelism Feature to import data from a database.
My job is to do this: Import a multi terabyte database into a set of flat files as quickly as possible.
I was thinking of this:
I have a Microsoft Server 2008 HPC cluster (of 3 nodes) at my disposal. I was thinking of writing an HPC SOA job so that all three compute nodes can make independent connections to the SQL Server and import a portion of the data in parallel. Of course this would have nothing to do with SSIS and would be an independent utility.
Then I came across SSIS and its parallel import features. My SSIS server is not very high-end - only a 4 GB machine. I am somewhat inclined to use SSIS because that's the ideal Microsoft way of doing data import - and I won't have to rewrite a lot of stuff and can possibly use existing transformations etc.
What is the best way to use Custom Tasks (or available ones) and do this import in parallel?
Gitmo, I may misunderstand your question but will give it a shot. You need to move data from a SQL Server instance to multiple files, correct? You want to leverage the parallelised data movement functionality provided by SSIS. That means multiple simultaneously running Data Flow Tasks (DFTs). For each target file you can have only one DFT because of problems with concurrent writes.
To get multiple simultaneously running Data Flow Tasks where your source is a SQL Server database and your target is a set of files, you can try the following approaches (please note there are upper limits on the parallelization you can get out of SSIS, based upon many factors including your CPU core count, whether you are running in BIDS/Visual Studio or not, and various settings in your packages, your server(s), your SQL Server instance, and many other considerations):
The Multiple Simultaneous DFT Solution: A single SSIS package with one Connection Manager pointed to the source SQL Server database and many Connection Managers each pointed to a separate target file, plus one DFT for each target file. The DFTs are all disconnected from one another (no precedence constraints or green/red/blue lines/arrows). If there are pre- or post-ETL steps that need to run, a great way to parallelize these DFTs is to drop them all in a Sequence Container that is connected to the earlier and later tasks through precedence constraints/arrows. These disconnected DFTs in their own Sequence Container will all try to run simultaneously.
The Multiple Simultaneous DTEXEC Solution: Multiple SSIS packages each with their own target file-specific DFT. You manually run separate DTEXEC processes either through separate CMD windows or through the GUI. #3 below is a variation on this solution and possibly a better one.
The Parent Master Package Running Multiple Child Packages Solution: Wrap the per-target-file packages developed in #2 above in a single parent master package. In the parent package, have multiple simultaneously running Execute Package Tasks. Again, these Execute Package Tasks would be disconnected from other tasks. A good way to do this is to drop the multiple Execute Package Tasks in their own Sequence Container. As before, if the Execute Package Tasks are disconnected (no precedence constraints/arrows), they will all try to run simultaneously.
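If the per-file child packages from #2/#3 end up wrapped in SQL Agent jobs (one job per package), they can also be kicked off in parallel from plain T-SQL, since sp_start_job returns as soon as the job is started rather than waiting for it to finish. The job names below are made up purely for illustration:

    -- each job runs one per-file child package; sp_start_job is asynchronous,
    -- so these three exports run at the same time
    EXEC msdb.dbo.sp_start_job @job_name = N'Export_File1';
    EXEC msdb.dbo.sp_start_job @job_name = N'Export_File2';
    EXEC msdb.dbo.sp_start_job @job_name = N'Export_File3';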
Take a look at this excellent article from the Microsoft SQLCAT Team for some more ideas/insight: Top 10 SQL Server Integration Services Best Practices
There are likely variations on these same ideas and possibly other solutions available both inside and outside of SSIS. Good luck!
Please look at this post about using multithreading outside SSIS to achieve parallelism without modifying the package much: Multithreaded serial execution
http://sqljunkieshare.com/2011/12/21/parallelism-in-etl-process-ssis-2008-and-ssis-2012/

How do you manage databases during development?

My development team of four people has been facing this issue for some time now:
Sometimes we need to be working off the same set of data. So while we develop on our local computers, we connect remotely to the dev database.
However, sometimes we need to run operations on the db that will step on other developers' data, i.e. we break associations. For this, a local db would be nice.
Is there a best practice for getting around this dilemma? Is there something like an "SCM for data" tool?
In a weird way, keeping a text file of SQL insert/delete/update queries in the git repo would be useful, but I think this could get very slow very quickly.
How do you guys deal with this?
You may find my question How Do You Build Your Database From Source Control useful.
Fundamentally, effective management of shared resources (like a database) is hard. It's hard because it requires balancing the needs of multiple people, including other developers, testers, project managers, etc.
Often, it's more effective to give individual developers their own sandboxed environment in which they can perform development and unit testing without affecting other developers or testers. This isn't a panacea, though, because you now have to provide a mechanism to keep these multiple separate environments in sync with one another over time. You need to make sure that developers have a reasonable way of picking up each other's changes (data, schema, and code). This isn't necessarily easier. Good SCM practice can help, but it still requires a considerable level of cooperation and coordination to pull it off. Not only that, but providing each developer with their own copy of an entire environment introduces storage costs and calls for additional DBA resources to assist in the management and oversight of those environments.
Here are some ideas for you to consider:
Create a shared, public "environment whiteboard" (it could be electronic) where developers can easily see which environments are available and who is using them.
Identify an individual or group to own database resources. They are responsible for keeping track of environments, and helping resolve the conflicting needs of different groups (developers, testers, etc).
If time and budgets allow, consider creating sandbox environments for all of your developers.
If you don't already do so, consider separating developer "play areas", from your integration, testing, and acceptance testing environments.
Make sure you version control critical database objects - particularly those that change often like triggers, stored procedures, and views. You don't want to lose work if someone overwrites someone else's changes.
We use local developer databases and a single, master database for integration testing. We store creation scripts in SCM. One developer is responsible for updating the SQL scripts based on the "golden master" schema. A developer can make changes as necessary to their local database, populating as necessary from the data in the integration DB, using an import process, or generating data using a tool (Red Gate Data Generator, in our case). If necessary, developers wipe out their local copy and can refresh from the creation script and integration data as needed. Typically databases are only used for integration testing and we mock them out for unit tests so the amount of work keeping things synchronized is minimized.
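For illustration only, the kind of creation script kept in SCM in a setup like this can be plain DDL plus seed data; the database, table and column names below are made up:

    -- create_database.sql: rebuild a developer copy from scratch
    IF DB_ID(N'MyApp_Dev') IS NULL
        CREATE DATABASE MyApp_Dev;
    GO
    USE MyApp_Dev;
    GO
    CREATE TABLE dbo.Customer
    (
        CustomerId  int IDENTITY(1,1) PRIMARY KEY,
        Name        nvarchar(200) NOT NULL,
        CountryCode char(2)       NOT NULL
    );
    GO
    -- static/seed data lives in the same script so every rebuild is complete
    INSERT INTO dbo.Customer (Name, CountryCode) VALUES (N'Test customer', 'US');
    GO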
I recommend that you take a look at Scott Allen's views on this matter. He wrote a series of blogs which are, in my opinion, excellent.
Three Rules for Database Work,
The Baseline,
Change scripts,
Views, stored procs etc,
Branching and Merging.
I use these guidelines more or less, with personal changes and they work.
In the past, I've dealt with this several ways.
One is the SQL Script repository that creates and populates the database. It's not a bad option at all and can keep everything in sync (even if you're not using this method, you should still maintain these scripts so that your DB is in Source Control).
The other (which I prefer) was having a single instance of a "clean" dev database on the server that nobody connected to. When developers needed to refresh their dev databases, they ran an SSIS package that copied the "clean" database onto their dev copy. We could then modify our dev databases as needed without stepping on the feet of other developers.
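The same refresh can also be scripted without SSIS; a minimal sketch of the idea using plain BACKUP/RESTORE (the database names, logical file names and paths below are all made up) might look like this:

    -- back up the untouched "clean" database once
    BACKUP DATABASE CleanDev
        TO DISK = N'\\fileserver\backups\CleanDev.bak'
        WITH INIT;

    -- a developer overwrites their own copy from it whenever they need a reset
    RESTORE DATABASE Dev_Alice
        FROM DISK = N'\\fileserver\backups\CleanDev.bak'
        WITH REPLACE,
             MOVE N'CleanDev'     TO N'C:\Data\Dev_Alice.mdf',
             MOVE N'CleanDev_log' TO N'C:\Data\Dev_Alice_log.ldf';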
We have a database maintenance tool that we use that creates/updates our tables and our procs. We have a server that has an up-to-date database populated with data.
We keep local databases that we can play with as we choose, but when we need to go back to "baseline" we get a backup of the "master" from the server and restore it locally.
If/when we add columns/tables/procs, we update the dbMaintenance tool, which is kept in source control.
Sometimes it's a pain, but it works reasonably well.
If you use an ORM such as NHibernate, create a script that generates both the schema and the data in the LOCAL development database of your developers.
Improve that script during development to include typical data.
Test on a staging database before deployment.
We replicate the production database to a UAT database for the end users. That database is not accessible to developers.
It takes less than a few seconds to drop all tables, create them again and inject test data.
If you are using an ORM that generates the schema, you don't have to maintain the creation script.
Previously, I worked on a product that was data warehouse-related, and designed to be installed at client sites if desired. Consequently, the software knew how to go about "installation" (mainly creation of the required database schema and population of static data such as currency/country codes, etc.).
Because we had this information in the code itself, and because we had pluggable SQL adapters, it was trivial to get this code to work with an in-memory database (we used HSQL). Consequently we did most of our actual development work and performance testing against "real" local servers (Oracle or SQL Server), but all of the unit testing and other automated tasks against process-specific in-memory DBs.
We were quite fortunate in this respect that if there was a change to the centralised static data, we needed to include it in the upgrade part of the installation instructions, so by default it was stored in the SCM repository, checked out by the developers and installed as part of their normal workflow. On reflection this is very similar to your proposed DB changelog idea, except a little more formalised and with a domain-specific abstraction layer around it.
This scheme worked very well, because anyone could build a fully working DB with up-to-date static data in a few minutes, without stepping on anyone else's toes. I couldn't say if it's worthwhile if you don't need the install/upgrade functionality, but I would consider it anyway because it made the database dependency completely painless.
What about this approach:
Maintain a separate repo for a "clean db". The repo will be a sql file with table creates/inserts, etc.
Using Rails (I'm sure this could be adapted for any git repo), maintain the "clean db" as a submodule within the application. Write a script (a rake task, perhaps) that runs the SQL statements against a local dev db.
To clean your local db (and replace with fresh data):
git submodule init
git submodule update
then
rake dev_db:update ......... (or something like that!)
I've done one of two things. In both cases, developers working on code that might conflict with others run their own database locally, or get a separate instance on the dev database server.
Similar to what #tvanfosson recommended, you keep a set of SQL scripts that can build the database from scratch, or
On a well defined, regular basis, all of the developer databases are overwritten with a copy of production data, or with a scaled down/deidentified copy of production, depending on what kind of data we're using.
I would agree with all that LBushkin has said in his answer. If you're using SQL Server, we've got a solution here at Red Gate that should allow you to easily share changes between multiple development environments.
http://www.red-gate.com/products/sql_source_control/index.htm
If there are storage concerns that make it hard for your DBA to allow multiple development environments, Red Gate has a solution for this too. With Red Gate's HyperBac technology you can create virtual databases for each developer. These appear to be exactly the same as ordinary databases, but in the background the common data is shared between the different databases. This allows developers to have their own databases without taking up an impractical amount of storage space on your SQL Server.

What is the current trend for SQL Server Integration Services?

Could anybody tell me what the current trend for SQL Server Integration Services is? Is it better than other ETL tools available in the market, such as Informatica, Cognos, etc.?
I was introduced to SSIS a couple of weeks ago. Executive summary: I am unlikely to consider it for future projects.
I'm pretty sure flow charts (i.e. non-structured programs) were discredited as an effective programming paradigm a long time ago, except in a tiny minority of cases.
There's no point replacing a clean textual (source code) interface with a colourful connect-the-dots one if the user still needs to think like a programmer to know where to drag the arrows.
A program design that you can't access (e.g. fulltext search, navigate using alternative methods, effectively version control, ...) except by one prescribed method is a massive productivity killer. And a wonderful source of RSI.
It's possible there is a particular niche where it's just right, but I imagine most ETL tasks would outgrow it pretty quickly.
SSIS isn't great for production applications from my experience for the following reasons:
To call an SSIS package remotely, you have to call a stored procedure, that calls a job, that runs the SSIS package.
Using the above method, you can't pass in parameters.
Passing parameters means you have to call the SSIS package on the local server - meaning code running on a remote server will have to call code running on the SQL Server to execute the package (see the sketch just below).
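A rough T-SQL illustration of that indirection, with a made-up procedure and job name; note that sp_start_job only takes a job name, so there is no place in this chain to pass package parameters:

    -- wrapper procedure a remote caller can execute
    CREATE PROCEDURE dbo.usp_RunNightlyEtl
    AS
    BEGIN
        -- starts the SQL Agent job that runs the SSIS package and returns immediately
        EXEC msdb.dbo.sp_start_job @job_name = N'NightlyEtlJob';
    END;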
I would always rather write specific code to handle ETL and use SSIS for one off transforms.
In my opinion it's quite a good platform, and I see good progress on it. Many of the drawbacks that the 2005 version had, and that the community complained about, have been corrected in 2008.
From my point of view, the best thing is that you can extend and complement it with SQL or .NET code in an organized way as much as you want.
For instance, you can decide whether your solution should be 80% C# code and 20% ETL components, or 5% C# code and 95% ETL components.
Disclaimer: I work for Microsoft.
Now the answer:
SSIS (SQL Server Integration Services) is a great tool for ETL operations, and there is a lot of uptake in the marketplace. There is no additional cost beyond licensing SQL Server, and you can also use .NET languages to write tasks.
http://www.microsoft.com/sqlserver/2008/en/us/integration.aspx
http://msdn.microsoft.com/en-us/library/ms141026.aspx
I would list as benefits:
you use SSIS for bigger projects, probably/preferably once or in one run, and then use the integration project for many months with minor changes; the tasks, packages and everything in general are easily readable (of course, it depends on perspective)
the tool itself handles the scheduled runs and sends you mails with the logs, and, as far as my experience goes, it communicates very well with all the other tools (such as SSAS, SQL Server Management Studio, Microsoft Office Excel, Access etc., and other, non-Microsoft tools)
tasks that are configured manually and in detail handle the work reliably, leaving little room for errors
as also mentioned above, many of the former problems have been corrected in the new versions
I would recommend it for ETL, especially if you would continue with analytical processes, since the SSIS, SSAS and SSRS tools blend together quite smoothly.
Drawback: debugging/looking for errors is a bit harder until you get used to it.

Step-by-step instructions for updating an (SQL Server) database?

Just a question about best-practices when upgrading an existing database. Assuming there will be all kinds of modifications to the data itself, the structure, the relations, additional columns, disappearing columns and whatever more.
My problem is a simple one. I'm working on a project that will use SQL Server. No problem there, since I'm enough of an expert to handle this. But this project will be upgraded later on and I need to specify a protocol that needs to be followed by the upgrade mechanism. Basically, this protocol needs to be followed when creating upgrade scripts...
Right now, I have these simple steps:
Add the new columns to the tables.
Add constraints to the new columns.
Add new tables.
Drop constraints where needed.
Drop columns that need to be removed.
Drop tables that need to be removed.
Somehow, this list feels incomplete. Is there a more extended list somewhere describing the proper steps which needs to be followed during an upgrade?
Also, is it always possible to do a complete upgrade within a single database transaction (with SQL Server), or are there breakpoints that need to be included within the protocol where one transaction should end and another one start?
While automated tools would provide a nice solution, I still can't really use them. The development team working on this system has 4 developers, each with their own database on their local system. Every developer keeps track of their own updates to the structure by generating both an upgrade and a downgrade script for their modifications, covering both structural changes and data changes. These scripts can then be used by the other developers to keep their own systems up to date. Whenever the system is going to be released, those scripts are all merged into one big script.
The system does not include any stored procedures or other "special" features. The database is just that: data storage with tables and the relations between them. No roles, no users, no stored procedures, no triggers, no complex datatypes... The DB is used by an application whose users work 9-to-5, so shutting it down can be done easily, including upgrades for the clients. We also add a version number to the database, and applications will check whether they're linked to the correct database version.
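A minimal sketch of such a version check, assuming a single-row table (all names here are illustrative):

    -- one-row table maintained by the upgrade scripts
    CREATE TABLE dbo.DbVersion (VersionNumber int NOT NULL);
    INSERT INTO dbo.DbVersion (VersionNumber) VALUES (42);

    -- the application runs this at startup and refuses to continue
    -- if the number does not match the version it was built against
    SELECT VersionNumber FROM dbo.DbVersion;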
During development, all developers use their own database instance, which they can fully control. Since we're not the ones who use the application, we tend to develop against the Express edition, not a more expensive one. To be honest, we don't develop our application to support a lot of users, but we'll inform our users that, since it uses SQL Server, they can install the system on a bigger SQL Server platform according to their own needs. They will need their own DBA for this, though. We do have a bigger SQL Server available for ourselves, which we also use for our own web interface, but this server is located in a separate data center where it is maintained for us, not by us.
The project previously used MS Access for its data storage and was intended for single-user use, but as it turned out, many users still decided to share their databases, and this has shown that the data model itself is reliable enough for multi-user environments. So we migrated to SQL Server to support smaller offices with 3 or more users and some big organisations that will have 500 or more users at the same time.
Since we need to keep the cost of the software low, we don't have a big budget to spend on expensive tools or a more expensive server.
Check out Red Gate's SQL Compare (structure comparison), SQL Data Compare (data comparison), and SQL Packager (for packaging up update scripts into a C# project or a .NET executable).
They provide a nice, clean, fully functional and easy-to-use solution for all your database upgrade needs. They're well worth their license fees, which pay for themselves in a few weeks or months.
Highly recommended!
In my opinion, it's an absolute bear doing these manually. For Microsoft SQL Server, I'd recommend using the Database edition of Team System, since it includes complete source control capabilities for your database and can automatically build your scripts for upgrading/downgrading versions.
Another option is SQL Compare from Red Gate, which can also handle these kinds of upgrades/downgrades and will produce a very nice SQL script. I've used both, and keeping the historic scripts has helped us troubleshoot issues and resolve many a mystery.
If you are working with a manual script as above, don't forget to also account for stored procedure changes in your scripts. Also, any hand-edited script should be able to be executed multiple times on a database, i.e. if your script includes a table creation or drop, be sure to check for existence first, otherwise your script will fail if executed back to back.
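For example, a hand-written change script can be made safe to re-run with existence checks along these lines (the object names are made up):

    -- safe to execute more than once
    IF OBJECT_ID(N'dbo.OrderArchive', N'U') IS NULL
        CREATE TABLE dbo.OrderArchive
        (
            OrderId    int      NOT NULL PRIMARY KEY,
            ArchivedOn datetime NOT NULL
        );

    IF COL_LENGTH(N'dbo.Orders', N'ArchivedFlag') IS NULL
        ALTER TABLE dbo.Orders ADD ArchivedFlag bit NULL;

    IF OBJECT_ID(N'dbo.ObsoleteTable', N'U') IS NOT NULL
        DROP TABLE dbo.ObsoleteTable;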
Again, while it's possible to build a manual protocol I'd still fall back on using one of the purpose-built tools out there, and both Team System and SQL Compare will be able to output scripts that you could include as part of an installation/upgrade package.
With database updates I always believe it should be all or nothing. If any of the DB updates fail, your application will be left in an unknown state that could be harmful to the data, so I think it is best practice to either apply them all or none (one transaction around them all).
I also like to backup the database before applying updates so that if anything does go wrong the database can be rolled back (this has saved me numerous times when working with live data).
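A sketch of that backup-plus-all-or-nothing pattern (the database name, backup path and sample change are placeholders; THROW needs SQL Server 2012 or later, use RAISERROR on older versions):

    -- backup first, so a failed upgrade can be undone completely
    BACKUP DATABASE MyApp TO DISK = N'C:\Backups\MyApp_PreUpgrade.bak' WITH INIT;

    BEGIN TRY
        BEGIN TRANSACTION;

        -- all upgrade statements go here, for example:
        ALTER TABLE dbo.Orders ADD ShippedOn datetime NULL;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;  -- re-raise so the upgrade is reported as failed
    END CATCH;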
Hope this helps.
Best practices for upgrading a production database schema actually look pretty bad on the surface. Unless you can completely shut down your system for the upgrade, which is often not possible, your changes all need to be backwards compatible. If you have many clients accessing the database, you can't update them all simultaneously, so any schema changes you make need to allow old code to run.
That means never renaming a column, and making all new columns nullable. This doesn't mean you leave it like that forever. You write two scripts, one for the initial change, which is backwards compatible, then another to clean things up after all clients have been updated.
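For instance, "renaming" a column under this constraint really becomes two scripts, roughly like the following (table and column names are made up):

    -- Script 1: backwards compatible; old clients keep using the old column
    ALTER TABLE dbo.UserAccount ADD EmailAddress nvarchar(320) NULL;
    GO
    UPDATE dbo.UserAccount SET EmailAddress = Email WHERE EmailAddress IS NULL;

    -- Script 2: run only after every client has been updated to write EmailAddress
    UPDATE dbo.UserAccount SET EmailAddress = Email WHERE EmailAddress IS NULL;
    ALTER TABLE dbo.UserAccount ALTER COLUMN EmailAddress nvarchar(320) NOT NULL;
    GO
    ALTER TABLE dbo.UserAccount DROP COLUMN Email;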
Automated tools are great for validation of schemas, but they are not so good when it comes to actually modifying a complex system. You should break your changes up into many small, discrete change scripts so each can be run manually. If there's a failure, it's easier to pinpoint the cause and fix it. Basically, each feature gets its own script. Give each a unique name and then store that name in the database itself when you run the script so you can query the database to find out what's been run and what hasn't. This is invaluable when you have instances on developers' machines, test servers, production, etc.
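One way to sketch the "store the script name in the database" idea (the table and script names here are invented):

    -- one-time setup
    CREATE TABLE dbo.SchemaChangeLog
    (
        ScriptName nvarchar(255) NOT NULL PRIMARY KEY,
        AppliedOn  datetime      NOT NULL DEFAULT GETDATE()
    );

    -- pattern used in every change script
    IF NOT EXISTS (SELECT 1 FROM dbo.SchemaChangeLog WHERE ScriptName = N'0042_add_discountcode')
    BEGIN
        ALTER TABLE dbo.Orders ADD DiscountCode nvarchar(20) NULL;
        INSERT INTO dbo.SchemaChangeLog (ScriptName) VALUES (N'0042_add_discountcode');
    END;

    -- and any environment can be asked what has been applied
    SELECT ScriptName, AppliedOn FROM dbo.SchemaChangeLog ORDER BY AppliedOn;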
