Could anybody tell me what the current trend for SQL Server Integration Services is? Is it better than other ETL tools available in the market, such as Informatica, Cognos, etc.?
I was introduced to SSIS a couple of weeks ago. Executive summary: I am unlikely to consider it for future projects.
I'm pretty sure flow charts (i.e. non-structured programming) were discredited as an effective programming paradigm a long time ago, except in a tiny minority of cases.
There's no point replacing a clean textual (source code) interface with a colourful connect-the-dots one if the user still needs to think like a programmer to know where to drag the arrows.
A program design that you can't access except by one prescribed method (e.g. no full-text search, no alternative ways to navigate, no effective version control, ...) is a massive productivity killer. And a wonderful source of RSI.
It's possible there is a particular niche where it's just right, but I imagine most ETL tasks would outgrow it pretty quickly.
In my experience, SSIS isn't great for production applications, for the following reasons:
To call an SSIS package remotely, you have to call a stored procedure, which starts a job, which in turn runs the SSIS package (see the sketch after this list).
Using the above method, you can't pass in parameters.
Passing parameters means you have to run the SSIS package on the local server - meaning code running on a remote server has to call code running on the SQL Server box to execute the package.
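To make that indirection concrete, here is a minimal sketch of what the remote caller ends up doing - starting the SQL Agent job that wraps the package via msdb.dbo.sp_start_job. The server and job names are placeholders, and note that no package parameters can be passed this way:

    // Start the SQL Agent job that wraps the SSIS package. sp_start_job
    // accepts no package parameters and returns as soon as the job has been
    // started, not when the package finishes.
    using System.Data;
    using System.Data.SqlClient;

    class StartSsisJob
    {
        static void Main()
        {
            var connStr = "Server=SQLBOX01;Database=msdb;Integrated Security=true"; // placeholder server
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand("msdb.dbo.sp_start_job", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@job_name", "Nightly ETL Load"); // hypothetical job name
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }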
I would always rather write specific code to handle ETL and use SSIS for one-off transforms.
In my opinion it's quite a good platform, and I see good progress being made on it. Many of the drawbacks that the 2005 version had, and that the community complained about, have been corrected in 2008.
From my point of view, the best thing is that you can extend and complement it with SQL or .NET code in an organized way as much as you want.
For instance, you can decide whether your solution is 80% C# code and 20% ETL components, or 5% C# code and 95% ETL components.
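As a rough illustration of mixing C# into a package, here is the kind of code you might put in the Main method of an SSIS 2008 Script Task. The surrounding ScriptMain boilerplate generated by the designer is omitted, and the User::InputFolder variable is hypothetical:

    // Body of an SSIS Script Task: read a package variable, do some custom
    // .NET work, and report success back to the package.
    public void Main()
    {
        string folder = Dts.Variables["User::InputFolder"].Value.ToString();
        bool fireAgain = true;

        foreach (string file in System.IO.Directory.GetFiles(folder, "*.csv"))
        {
            // Logic that would be awkward to express with the stock ETL components
            Dts.Events.FireInformation(0, "Script Task", "Found file: " + file,
                                       string.Empty, 0, ref fireAgain);
        }

        Dts.TaskResult = (int)ScriptResults.Success;
    }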
Disclaimer: I work for Microsoft.
Now, the answer:
SSIS (SQL Server Integration Services) is a great tool for ETL operations, and there is a lot of uptake in the marketplace. There is no additional cost beyond licensing SQL Server, and you can also use .NET languages to write tasks.
http://www.microsoft.com/sqlserver/2008/en/us/integration.aspx
http://msdn.microsoft.com/en-us/library/ms141026.aspx
I would list as benefits:
you use SSIS for bigger projects, preferably built once, and then run the integration project for many months with only minor changes; the tasks, packages and everything in general are easily readable (of course, that depends on perspective)
the tool itself handles the scheduled runs and sends you mails with the logs, and - as far as my experience goes - it communicates very well with all the other tools (such as SSAS, SQL Server Management Studio, Microsoft Office Excel, Access, etc., and other, non-Microsoft tools)
tasks that are manually configured in detail take on most of the responsibility, leaving only a small chance for errors
as also mentioned above, many of the former problems have been corrected in the newer versions
I would recommend it for ETL, especially if you would continue with analytical processes, since the SSIS, SSAS and SSRS tools blend together quite smoothly.
Drawback: debugging/looking for errors is a bit harder until you get used to it.
I have a SQL Server database which is used to store data coming from a lot of different sources (writers).
I need to provide users with some aggregated data, but in SQL Server this data is stored across several different tables and querying it is too slow (a join over 5 tables with several million rows in each, with one-to-many relationships).
I'm currently thinking that the best way is to extract data, transform it and store it in a separate database (let's say MongoDB, since it will be used only for read).
I don't need the data to be live, just no older than 24 hours compared to the 'master' database.
But what's the best way to achieve this? Can you recommend any tools for it (preferably free) or is it better to write your own piece of software and schedule it to run periodically?
I recommend resisting the NIH (Not Invented Here) urge here; reading and transforming data is a well-understood exercise. There are several free ETL tools available, with different approaches and focus. Pentaho Data Integration (formerly Kettle) and Talend are UI-based examples. There are other ETL frameworks, like Rhino ETL, that merely hand you a set of tools for writing your transformations in code. Which one you prefer depends on your knowledge and, unsurprisingly, preference. If you are not a developer, I suggest using one of the UI-based tools.

I have used Pentaho in a number of smaller data warehousing scenarios; it can be scheduled using operating system tools (cron on Linux, Task Scheduler on Windows). More complex scenarios can make use of the Pentaho PDI repository server, which allows central storage and scheduling of your jobs and transformations. It has connectors for several database types, including MS SQL Server. I haven't used Talend myself, but I've heard good things about it and it should be on your list too.
The main advantage of sticking with a standard tool is that once your demands grow, you'll already have the tools to deal with them. You may be able to solve your current problem with a small script that executes a complex select and inserts the results into your target database. But experience shows those demands seldom stay the same for long, and once you have to incorporate additional databases or maybe even some information in text files, your scripts become less and less maintainable, until you finally give in and redo your work in a standard toolset designed for the job.
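For contrast, this is roughly what the 'small script' route looks like: one aggregation query pushed into MongoDB on a schedule. The connection strings, the query and the database/collection names are all placeholders, and this is exactly the kind of script that tends to become hard to maintain as the demands grow:

    // One-shot extract: run an aggregation query against SQL Server and
    // rebuild a read-only MongoDB collection from the results.
    using System.Data.SqlClient;
    using MongoDB.Bson;
    using MongoDB.Driver;

    class NightlyAggregateLoad
    {
        static void Main()
        {
            var mongo = new MongoClient("mongodb://localhost:27017");
            var db = mongo.GetDatabase("reporting");
            db.DropCollection("aggregates");                       // refresh the snapshot
            var target = db.GetCollection<BsonDocument>("aggregates");

            var sql = "SELECT CustomerId, SUM(Amount) AS Total FROM dbo.Orders GROUP BY CustomerId"; // placeholder query
            using (var conn = new SqlConnection("Server=.;Database=MasterDb;Integrated Security=true")) // placeholder
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Batch with InsertMany for real volumes; InsertOne keeps the sketch simple.
                        target.InsertOne(new BsonDocument
                        {
                            { "customerId", reader.GetInt32(0) },
                            { "total", (double)reader.GetDecimal(1) }
                        });
                    }
                }
            }
        }
    }

Scheduling it is then just a Task Scheduler or cron entry, which is also where the maintenance burden starts once more sources appear.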
Some people don't like SSIS for the following reasons:
You need to find and click through settings scattered across different places when designing even a slightly more complex package.
The Merge and Lookup components don't perform well; I've heard a lot of consultants simply recommend loading the data into SQL Server tables and doing the transformations in Transact-SQL.
I've used PowerShell in a small project to export data and create CSV files, and I like it. Is it a trend to replace some of the tasks traditionally done with SSIS with PowerShell, especially in export-only cases?
For very small projects/tasks, PowerShell is an OK tool.
For projects that need to be robust, maintainable, modular, handle errors and auditing, SSIS is vastly superior.
The truth is, too many SSIS implementations are crafted by devs that don't understand the strengths of the program. They simply try to replicate their current T-SQL ETL process into SSIS with minimal effort or leverage of its capabilities. Performance issues almost always go right along with this.
SSIS is not just a GUI way to get stored procedures and T-SQL to auto-run. If you really want to learn more on the subject, I suggest picking up a few books - and be careful about listening to narrowly specialised experts; their skill sets can easily fade from relevance and hold others back with them.
A PowerShell trend away from SSIS? Not anywhere close, where it counts.
This is an old topic, but I find it is well worth discussing. So I'd like to give a few reasons why I think SSIS is a bad choice 99% of the time as an ETL tool.
At this time, the only way I can think of in which SSIS is better than PowerShell is its performance when handling huge amounts of data with multiple sources/targets, mainly thanks to SSIS's internal parallelism and caching capability.
However, SSIS is notorious for its error messages, and it is almost impossible to debug once SSIS packages are deployed. From a source control perspective, SSIS packages, which are XML files, are difficult to compare between versions, and they are also very fragile once either source or target objects have minor changes (like a target column being widened by one character).
In my prod environment, there are many SSIS packages deployed and scheduled with SQL Agent jobs, so when there is a job failure, there is no way for me to figure out the problem until I go to TFS, find the SSIS project and open it in Visual Studio to work out the logic. It is a nightmare.
With PowerShell, the code you see is the code executed, and you can always get the logic from the PS code and do the trouble-shooting along the way.
With the many open-source PS modules available these days, PowerShell's power keeps increasing; it is indeed time to consider using PS as an alternative tool to SSIS.
I am looking for a stress tool for SQL Server. I've seen a lot of suggestions on Google, but nothing that is really what I need.
I am really looking for a tool that can run a list of stored procedures in parallel to see how much contention there is on resources. The collection and reporting features are not that important, but I also want something server-side based for our enterprise build server.
I am not looking for a replay feature (yes, it could do the trick, but it would be difficult to program a lot of different scenarios).
I've looked at the following tools:
RML Utilities from Microsoft
DTM DB Stress (this is the closest to what I'm looking for)
SQL Stress
I created a simple test tool for this scenario, check it out to see if it will be of any use to you. It's free, no licensing of any sort required. No guarantees on any performance or quality either ;-)
Usage: StressDb.exe <No. of instances> <Tot. Runtime (mins)> <Interval (secs)>
Connection string should reside in the configuration file.
All command line arguments are required. Use integers.
The stored proc to use is also in the config file.
You need to have .NET framework 3.5 installed. You can also run it from multiple workstations for additional load, or from multiple folders on same machine if trying to run additional stored procedures. You also need a SQL user Id, as currently it doesn't use a trusted connection.
This code was actually super simple; the only clever bit was making sure that the connections are not pooled.
http://......com/..../stressdb.zip
Let me know if you find it useful.
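If you end up rolling your own instead, the core of such a tool is small. Here is a minimal sketch that hammers one stored procedure from several parallel, non-pooled connections; the connection string, procedure name and counts are placeholders, and Parallel.For needs .NET 4 or later:

    // Pooling=false forces a real connection per call so server-side
    // contention actually shows up.
    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading.Tasks;

    class StressSketch
    {
        static void Main()
        {
            const string connStr =
                "Server=.;Database=MyDb;User Id=stress;Password=secret;Pooling=false"; // placeholders
            const int workers = 20;        // parallel instances
            const int iterations = 100;    // calls per worker

            Parallel.For(0, workers, w =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    using (var conn = new SqlConnection(connStr))
                    using (var cmd = new SqlCommand("dbo.MyStoredProc", conn)) // hypothetical proc
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        conn.Open();
                        cmd.ExecuteNonQuery();
                    }
                }
            });

            Console.WriteLine("Done.");
        }
    }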
I have a client that currently uses a local Advantage Database on their PC along with an application. They are thinking of scaling their setup up to have multiple applications communicating with a database server, i.e. a client-server environment.
They are now considering the best database for this approach. They are looking at the Advantage Database Server product in comparison to SQL Server Express (the application does not warrant a full SQL Server at this stage).
Obviously SQL Server is a more well known product probably with more support but I was hoping you could give me some opinions and thoughts on what you think the best product would be in terms of performance, stability and support.
One thing to note although not directly relevant is that the application is currently written in Delphi and there could be a move to C# to bring it up to date.
The migration from a local Advantage Database to a client/server Advantage database is a very simple process. It simply involves changing the connection properties within the program. There are no other coding changes that need to be done.
Advantage has a great support team and has been in development for over 15 years. The stability and support are at least equal to SQL Server.
Advantage also provides a .NET Data Provider which would allow for C# development.
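As a rough idea of what that looks like from C#: switching from local to client/server is again essentially a connection string change. I'm writing the provider class names and connection string keywords from memory, so treat the details as assumptions and check the Advantage documentation:

    // Hypothetical sketch using the Advantage .NET Data Provider
    // (Advantage.Data.Provider); the table and credentials are placeholders.
    using Advantage.Data.Provider;

    class AdsExample
    {
        static void Main()
        {
            // Local:  data source=C:\data\mydd.add; ServerType=local
            // Remote: data source=\\server:6262\mydata\mydd.add; ServerType=remote
            var connStr = @"data source=\\server:6262\mydata\mydd.add; ServerType=remote; user id=adssys; password=secret";

            using (var conn = new AdsConnection(connStr))
            using (var cmd = new AdsCommand("SELECT COUNT(*) FROM customers", conn)) // hypothetical table
            {
                conn.Open();
                System.Console.WriteLine("Customers: " + cmd.ExecuteScalar());
            }
        }
    }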
I have developed for both SQL Server and Advantage. They each have their pros and cons (although now I favor Advantage).
Given your situation, however, this decision appears to be a no-brainer: Advantage Database Server. Why? It's already done!
My Advantage programs run, unmodified, against the same database either locally or remotely. All I change is the connection string. I'm not saying that your customer's code won't have to be changed. I am saying it is likely to be trivial. Compare that to the greater effort involved in switching to a whole new database engine.
In general I'm a SQL Server person all the way. I work with it daily and have for almost ten years, but in your situation it seems silly to consider moving to a new database when there is a clear upgrade path to do what you want using the backend you already have. It would be much less work and far less likely to introduce new bugs to stay within the same database family.
ADS wins hands down. It is maintenance-free. It is extremely reliable. It is extremely fast. It is extremely scalable. SQL is very well supported, and the ADS newsgroups are responsive (answers within hours instead of days on the SQL Server fora) and well-informed. I have been using ADS since 1991 and it has never gone wrong! My users are incredibly demanding, and being able to turn solutions around within hours instead of days is both a joy to me and a business incentive to the end users and clients. Deployment is gentle, fast and simple. Platform support is better than SQL Server's. 64-bit server deployment abounds and is well-grounded, transparent and reliable. 64-bit clients are coming in the next version (10). My experience with ADS is wholly positive, whereas my ventures with SQL Server have been fraught with difficulties, idiosyncrasies and workarounds!
I happen to be a support rep for Advantage so when you say "Obviously SQL Server is a more well known product probably with more support" I have to argue a bit.
As Chris stated, switching from Advantage Local Server to the Advantage Remote (client/server) Server is a pretty painless process - they designed it that way.
Install the Advantage Database Server on a machine where the data is located (not a requirement but it's recommended). You can get a free trial here: http://marketing.ianywhere.com/forms/ADS91-30-Day
Within the application there will be TAdsConnection component(s) - change the TAdsConnection.ConnectionType to 'REMOTE' (http://devzone.advantagedatabase.com/dz/webhelp/Advantage9.1/mergedProjects/ade/sec7/connectiontype.htm)
You can specify the path (TAdsConnection.ConnectPath) from the clients in a couple different ways but the recommended is:
\\server:6262\mydata
http://devzone.advantagedatabase.com/dz/webhelp/Advantage9.1/mergedProjects/ade/sec7/connectpath_tadsconnection.htm
Note: 6262 is the port used by default (may need to add an exception to the firewall). Also if your application uses a data dictionary the path would include the name of the .ADD file (e.g. \\server:6262\mydata\mydd.add)
Hope this helps!
In "Maybe Normalizing Isn't Normal", Jeff Atwood says, "You're automatically measuring all the queries that flow through your software, right?" I'm not, but I'd like to.
Some features of the application in question:
ASP.NET
a data access layer which depends on the MS Enterprise Library Data Access Application Block
MS SQL Server
In addition to Brad's mention of SQL Profiler, if you want to do this in code, then all your database calls need to funnelled through a common library. You insert the timing code there, and voila, you know how long every query in your system takes.
A single point of entry to the database is a fairly standard feature of any ORM or database layer -- or at least it has been in any project I've worked on so far!
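A minimal sketch of that single point of entry, assuming nothing about your actual data layer: every query goes through one method that wraps the call in a Stopwatch and logs anything slow.

    // All database reads go through this one method, so the timing code
    // lives in exactly one place. Names, threshold and log target are placeholders.
    using System.Data;
    using System.Data.SqlClient;
    using System.Diagnostics;

    public static class Db
    {
        private const string ConnStr = "Server=.;Database=MyApp;Integrated Security=true"; // placeholder

        public static DataTable ExecuteQuery(string sql, params SqlParameter[] args)
        {
            var sw = Stopwatch.StartNew();
            try
            {
                using (var conn = new SqlConnection(ConnStr))
                using (var cmd = new SqlCommand(sql, conn))
                using (var adapter = new SqlDataAdapter(cmd))
                {
                    cmd.Parameters.AddRange(args);
                    var table = new DataTable();
                    adapter.Fill(table); // Fill opens and closes the connection itself
                    return table;
                }
            }
            finally
            {
                sw.Stop();
                if (sw.ElapsedMilliseconds > 100) // arbitrary threshold
                    Trace.WriteLine(string.Format("SLOW ({0} ms): {1}", sw.ElapsedMilliseconds, sql));
            }
        }
    }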
SQL Profiler is the tool I use to monitor traffic flowing to my SQL Server. It allows you to gather detailed data about your SQL Server. SQL Profiler has been distributed with SQL Server since at least SQL Server 2000 (but probably before that also).
Highly recommended.
Take a look at this chapter Jeff Atwood and I wrote about performance optimizations for websites. We cover a lot of ground, including quite a bit about database tracing and optimization:
Speed Up Your Site: 8 ASP.NET Performance Tips
The Dropthings project on CodePlex has a class for timing blocks of code.
The class is named TimedLog. It implements IDisposable. You wrap the block of code you wish to time in a using statement.
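I haven't reproduced the Dropthings class here, but the pattern it describes looks roughly like this minimal sketch (the class name and log target are my own stand-ins, not the actual TimedLog code):

    // RAII-style timing: the elapsed time is logged when the using block ends.
    using System;
    using System.Diagnostics;

    public sealed class TimedBlock : IDisposable
    {
        private readonly string _name;
        private readonly Stopwatch _watch = Stopwatch.StartNew();

        public TimedBlock(string name) { _name = name; }

        public void Dispose()
        {
            _watch.Stop();
            Trace.WriteLine(_name + " took " + _watch.ElapsedMilliseconds + " ms");
        }
    }

    // Usage:
    // using (new TimedBlock("GetCustomerOrders"))
    // {
    //     // run the query here
    // }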
If you use Rails, it automatically logs all the SQL queries, and the time they took to execute, in your development log file.
I find this very useful because if you do see one that's taking a while, it's one step to just copy and paste it straight off the screen/logfile and put 'explain' in front of it in MySQL.
You don't have to go digging through your code and reconstruct what's happening.
Needless to say this doesn't happen in production as it'd run you out of disk space in about an hour.
If you define a factory that creates SqlCommands for you, and always call it when you need a new command, you can return a RealProxy wrapping the SqlCommand.
This proxy can then measure how long ExecuteReader / ExecuteScalar etc. take using a Stopwatch and log it somewhere. The advantage of this kind of method over SQL Server Profiler is that you can get full stack traces for each executed piece of SQL.
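A sketch of that approach, with the factory and logging details left as placeholders. It works because SqlCommand ultimately derives from MarshalByRefObject (via Component), which is what RealProxy needs:

    // Transparent proxy that times every Execute* call on a SqlCommand and
    // logs the statement, the elapsed time and the current stack trace.
    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;
    using System.Reflection;
    using System.Runtime.Remoting.Messaging;
    using System.Runtime.Remoting.Proxies;

    public class TimedCommandProxy : RealProxy
    {
        private readonly SqlCommand _inner;

        public TimedCommandProxy(SqlCommand inner) : base(typeof(SqlCommand))
        {
            _inner = inner;
        }

        public override IMessage Invoke(IMessage msg)
        {
            var call = (IMethodCallMessage)msg;
            var sw = Stopwatch.StartNew();
            try
            {
                object result = call.MethodBase.Invoke(_inner, call.Args);
                return new ReturnMessage(result, null, 0, call.LogicalCallContext, call);
            }
            catch (TargetInvocationException ex)
            {
                return new ReturnMessage(ex.InnerException, call);
            }
            finally
            {
                sw.Stop();
                if (call.MethodName.StartsWith("Execute"))
                    Trace.WriteLine(string.Format("{0} ({1} ms): {2}\n{3}",
                        call.MethodName, sw.ElapsedMilliseconds,
                        _inner.CommandText, Environment.StackTrace));
            }
        }
    }

    // Hypothetical factory: create commands only through this, so every call is timed.
    public static class CommandFactory
    {
        public static SqlCommand Create(string sql, SqlConnection conn)
        {
            var real = new SqlCommand(sql, conn);
            return (SqlCommand)new TimedCommandProxy(real).GetTransparentProxy();
        }
    }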