ETL Tools and Build Tools - database

I am familiar with automated software build tools (such as Automated Build Studio). Now I am looking at ETL tools.
One thing that crosses my mind is that anything I can do in an ETL tool I can also do with a software build tool. ETL tools are tailored for data loading and manipulation, which take a lot of scripts to get done. A software build tool, on the other hand, is versatile enough to do any job, including running scripts that extract, transform and load data from any format into any format.
Am I right?

It is correct that you can roll your own ETL scripts using a development tool of your preference. Having said that, ETL jobs are frequently large (for lack of a better word) and demand considerable administration and attention to minute details (like programming). ETL tools allow the developer to focus on ETL tasks, as opposed to writing and debugging code, although that's part of it too. There are some open-source tools out there, so you can get a feeling for what an average tool does before jumping into custom development. More expensive tools, for example, provide data lineage, meaning you can (graphically) track every field on a report back to the originating table through all transformations (versions included); after a corporate merger that's quite a task to do.
For example, Pentaho has a community edition; if you have MS SQL Server, you can get SSIS. Also see if you can find something here.

The benefit of an ETL tool is maximized if you have many processes to build (I like the analogy in jsf80238's post about hammering in 100 nails). A key benefit of real ETL tools is the metadata they generate and the operational support they provide. Writing your scripts in Perl/Ruby/etc. is fairly easy, but that approach breaks down when problems need to be tracked down or someone other than the author has to figure out what's wrong. The ability for admin/support staff to quickly see what went wrong is what's worth paying money for. I have used Microsoft's SSIS (2005 - OK) and the latest Pentaho PDI (quite good). The Pentaho ETL GUI is used by business users (without IT support 99% of the time) at my workplace, and has replaced a tangle of SQL scripts and spreadsheets. Say what you like about the rest of the Pentaho stack, but the ETL component is, in my opinion, excellent "bang for buck".

The whole business of ETL is based on the premise that the source of the data is incompatible with the destination data store. Many times, the folks who dump the source data are not thinking about the fact that this data will need to be collected and aggregated. This is why the whole business of ETL exists.
A commercial ETL tool will not magically read the source input and transform data according to the rules of the destination database. Rules have to be defined and fed into the ETL tool. Interestingly, many companies offer training!!! on how to use their proprietary scripting language. So it is not always that easy. But for non-programmers, maybe this is the preferred route.
Personally, I think that it is always easier to write a proprietary ETL tool in a language like Perl. Simply write a state-machine algorithm to rip through the source data and convert it to the desired format. I use Perl to FTP into machines, read in the files, transform the data, and then load it into the database. This is always a superior solution and much faster if one is proficient in Perl or similar, or can hire someone who knows Perl.
And one final point: start with the end in mind. Dump your source data in a structured format to help out the analysis group in your company who want to aggregate and study it. This will make the ETL program easier and faster to develop.
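The state-machine approach described above can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl, and the BEGIN/END record format is a made-up assumption, not a format from the answer.

```python
from enum import Enum, auto


class State(Enum):
    OUTSIDE = auto()     # between records
    IN_RECORD = auto()   # inside a BEGIN ... END block


def parse_records(lines):
    """Turn a hypothetical 'BEGIN <id> / key=value / END' dump into dicts."""
    state = State.OUTSIDE
    current = {}
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if state is State.OUTSIDE:
            if line.startswith("BEGIN "):
                current = {"id": line[len("BEGIN "):]}
                state = State.IN_RECORD
            # any other line outside a record is ignored as noise
        else:  # State.IN_RECORD
            if line == "END":
                records.append(current)
                state = State.OUTSIDE
            elif "=" in line:
                key, value = line.split("=", 1)
                current[key.strip()] = value.strip()
    return records
```

Each returned dict could then be handed to whatever load step the destination database needs. Noise between records is skipped, which is usually the point of tracking state explicitly.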

I like Damir Sudarevic's answer and wanted to add that your choice of tool might also depend on how much work you have in front of you. If you have the occasional ETL task and are already familiar with a tool that will allow you to accomplish that task, use the tool you already know (this approach assigns a zero value to learning a new tool, which is perhaps undervaluing new knowledge). If you have a lot of ETL tasks, the up-front investment of learning a new tool might very well pay off. You can use pliers to drive a nail, and if you have only one nail you can use the pliers. If you have to drive 100 nails get yourself a hammer.

You can also do anything ETL tools can do with code. :-)
Both tool categories you mention can be used to solve this problem, but they are optimized for the class of problems they are trying to solve:
ETLs tend to come with a library of data manipulation tools (relational calculus, in-line computations, etc.), are optimized to handle large quantities of data, and have job management features (important if this isn't a single one-off data migration).
Build tools (for me, Ant comes to mind as a prototypical example) could do similar tasks, but are focused on compilation, file organization and manipulation, and packaging.

Related

ETL Tool suggestion for REST input and ODATA (or REST) output

We have a REST Data source and need to go to an ERP ODATA destination.
Data is tabular, mostly strings, numeric and simple types.
Not a lot of data (less than 1,000 rows each day).
We would like an ETL tool with the following features:
Logging
Command line friendly
Automated nightly operation (scheduled runs)
Simple GUI (we may have to do some simple transformations or filtering)
Alert (email) for errors
Fairly easy to learn (we have some NiFi gurus, but that might be too much)
We don't care if it is not Free or Open Source, in fact paid support is better
Not a cloud service. The data sources are local to the company network. Data cannot go offsite.
Google searches at this level of detail have not been that successful.
I don't understand what the use case for the command line is.
Some of the top ETL tools are:
Nifi
Pentaho
UC4
Airflow
If you don't have a lot of data, some CI tools such as Jenkins and TeamCity might actually be enough, but you would have to write the script yourself.
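For a sense of what "writing the script yourself" might involve, here is a minimal Python sketch that a CI scheduler could run nightly. The URLs, the field names (`name`, `amount`, `Name`, `Amount`) and the assumption that the OData endpoint accepts a plain JSON POST per row are all hypothetical. Email alerting is left out.

```python
import json
import logging
import urllib.request

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("nightly_etl")

# Hypothetical endpoints on the company network (assumptions).
SOURCE_URL = "http://rest.example.internal/api/rows"
TARGET_URL = "http://erp.example.internal/odata/Rows"


def fetch_rows(url=SOURCE_URL):
    """Read a JSON array of row objects from the REST source."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def transform(rows):
    """Drop rows without a name and normalise the simple types."""
    out = []
    for row in rows:
        name = str(row.get("name", "")).strip()
        if not name:
            continue
        out.append({"Name": name, "Amount": float(row.get("amount", 0))})
    return out


def push_row(row, url=TARGET_URL):
    """POST one entity as JSON to the OData entity set."""
    req = urllib.request.Request(
        url,
        data=json.dumps(row).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def run(fetch=fetch_rows, push=push_row):
    """Run one load; return the number of rows that failed to push."""
    rows = transform(fetch())
    failures = 0
    for row in rows:
        try:
            push(row)
        except Exception:
            log.exception("failed to push row %r", row)
            failures += 1
    log.info("pushed %d rows, %d failures", len(rows) - failures, failures)
    return failures
```

Returning the failure count lets the calling job exit non-zero, which is what most CI tools use to trigger their own error notifications.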
I would suggest you have a look at Informatica PowerCenter. It's a paid tool and the support is fairly good.
Logging - provides lots of ways to log the process
Command line friendly - good command line tools are available and can be called from any CI/CD tool
Automated nightly operation (scheduled runs) - the tool offers a built-in scheduler and also supports complex schedulers with the help of the command line
Simple GUI (we may have to do some simple transformations or filtering) - the GUI is fairly easy and lots of tutorials for this tool are available on YouTube
Alert (email) for errors - you can enable error emails with a simple command or drag-and-drop functionality, or make use of the backend command line
Fairly easy to learn (we have some NiFi gurus, but that might be too much) - this is subjective, but this tool is fairly simple
We don't care if it is not Free or Open Source, in fact paid support is better - this is a paid tool
Not a cloud service. The data sources are local to the company network. Data cannot go offsite - the best part of the tool: it is available both on-premise and in the cloud, your choice.

Automating jobs for service layer regression testing (Powershell/MSBuild)

In a .net development environment, I'm looking to implement some regression testing scripts to do some end-to-end (/blackbox) testing on a fully setup server based application, which will end up being fairly complex.
My initial thought was to roll my own powershell scripts/XML configuration of the steps. But I wanted to do some analysis to see if there is anything out there I could reuse, and perhaps what anyone else did which might prove to be a best practice (which I haven't found, as of yet).
I realised I could potentially just use an MSBuild project, along with the MSBuildExtensions and community tasks, but I've found these scripts to be harder to modify/maintain in the long run.
An example of some of the job steps I'd be coding for one of the applications:
Copy files to certain directories and trigger a service to load
Wait for the service to load files (checking sql tables for job completion)
Truncating tables (etc, on sql databases)
Comparing sql table output with expected results
Parsing log files
and so on
Some pretty simple PowerShell would be able to cater for most of these. I'd be interested in opinions: what do you use if you have some regression-style end-to-end testing? Rolling your own in order to have a fairly simple and specific implementation, or using a third-party tool (like MSBuild, or something else)?
Choosing the right tool for the job is often driven by personal preference but it really should be driven by effectiveness vs. maintainability.
MSBuild excels in task reuse and dependency chain resolution.
PowerShell shines in compressing complex processes into a set of few elegant commands.
In your scenario I'd probably use PowerShell for the integration-oriented DSLs of job queuing, database, IO. I'd keep MSBuild for producing the build artifacts.
No need for a third-party tool unless it's the top dog in its field and the price is right (= open source or already purchased by your company).
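Two of the job steps listed in the question (waiting for a job to complete by polling a SQL table, and comparing table output with expected results) are small enough to sketch. The version below is in Python with sqlite3 as a stand-in database, not PowerShell, and the `jobs` table with its `id` and `status` columns is a hypothetical schema.

```python
import sqlite3
import time


def wait_for_job(conn, job_id, timeout_s=60.0, poll_s=1.0):
    """Poll a (hypothetical) jobs table until the job is 'done' or we time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        row = conn.execute(
            "SELECT status FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        if row and row[0] == "done":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def table_diff(conn, query, expected_rows):
    """Compare query output with expected rows, ignoring row order."""
    actual = set(conn.execute(query).fetchall())
    expected = set(expected_rows)
    return {"missing": expected - actual, "unexpected": actual - expected}
```

A test step would pass when `wait_for_job` returns True and both sets in the `table_diff` result are empty. The same shape translates fairly directly to PowerShell functions.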

Load and Performance Testing of a Database

This is the first time my team has asked me to test a database, and I have no clue how to approach it. By testing I mean that I need to see how fast records can be inserted and how much pressure the database can handle, i.e. load and performance testing for a database. The database that we are about to use is XPRESSmp.
Can anyone tell me what kind of testing is usually done on a database, and which tools I should look into? Most of the articles I have seen were related to Oracle and MySQL, but this is a new database altogether.
One approach I can think of is to write a multithreaded program with X threads that pump data into XMP at very high speed, while measuring how much time each thread takes. What else can I do to test the database?
My team has asked me to break the database with my testing, but we should know in what situation it broke and what the reason was.
What important points should I know and take into consideration while testing the database?
P.S. I will be doing this testing on separate LnP machines.
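The multithreaded approach mentioned in the question could look roughly like this in Python. The `insert_record` callable is a placeholder assumption: it would wrap whatever client API XPRESSmp provides, which is not shown here.

```python
import threading
import time
from statistics import mean


def run_load_test(insert_record, num_threads=4, records_per_thread=1000):
    """Call insert_record(thread_id, i) from several threads.

    Returns the elapsed seconds measured inside each thread.
    """
    timings = [0.0] * num_threads

    def worker(thread_id):
        start = time.perf_counter()
        for i in range(records_per_thread):
            insert_record(thread_id, i)
        timings[thread_id] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings


def summarize(timings, records_per_thread):
    """Derive rough throughput numbers from the per-thread timings."""
    total = records_per_thread * len(timings)
    slowest = max(timings)
    return {
        "total_records": total,
        "slowest_thread_s": slowest,
        "mean_thread_s": mean(timings),
        "approx_records_per_s": total / slowest if slowest > 0 else float("inf"),
    }
```

Repeating the run with a growing `num_threads` and recording where throughput flattens or errors begin is one way to find the breaking point the team asked about.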
Usually, SysBench is used to test query performance on MySQL. It is not just for MySQL, though. I have only a basic knowledge of it, so I suggest you read the documentation rather than ask me:
http://sysbench.sourceforge.net/
You can use these tools:
HammerDB is an open source database load testing and benchmarking tool for Oracle, SQL Server, TimesTen, PostgreSQL, Greenplum, Postgres Plus Advanced Server, MySQL and Redis. HammerDB is automated, multi-threaded and extensible with dynamic scripting support. HammerDB includes complete built-in workloads based on industry standard benchmarks as well as capture and replay for the Oracle database.
To download it or for more information, visit http://hammerora.sourceforge.net/
p-unit
Description:
An open source framework for unit testing and performance benchmarking, initiated by Andrew Zhang and released under the GPL license. p-unit can run the same tests single-threaded or multi-threaded, tracks memory and time consumption, and generates results as plain text, an image or a PDF file.
http://p-unit.sourceforge.net/
DBMonster
Description:
DBMonster is an application to generate random data for testing SQL database driven applications under heavy load.
http://sourceforge.net/projects/dbmonster/
Replied here, use the k6 SQL extension.

Strategies for populating a Reporting/Data Warehouse database

For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different than that of the separate 'production' databases that we are aggregating so there is a good amount of business logic that goes into how the data is aggregated.
Right now this process is implemented by several stored procedures that run nightly. As we add more details to the reporting database the logic in the stored procedures keeps growing more fragile and unmanageable.
What are some other strategies that could be used to populate this reporting database?
SSIS? This has been considered but doesn't appear to offer a much cleaner, more maintainable approach than just the stored procedures.
A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write Unit Tests for the logic and organize the code in a much more maintainable manner.
I'm looking for any new ideas or additional thoughts on the above. Thanks!
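The unit-testing benefit of the separate-process option can be illustrated with a small sketch, written in Python here rather than C#. The `region`, `day` and `amount` fields are invented for illustration; the question does not describe the actual schema.

```python
from collections import defaultdict


def aggregate_sales(rows):
    """Sum 'amount' per (region, day) over rows pulled from several databases."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["region"], row["day"])] += row["amount"]
    return dict(totals)


def test_aggregate_sales():
    """A plain unit test: no database is needed to check the business logic."""
    rows = [
        {"region": "EU", "day": "2020-01-01", "amount": 10.0},
        {"region": "EU", "day": "2020-01-01", "amount": 5.0},
        {"region": "US", "day": "2020-01-01", "amount": 2.5},
    ]
    assert aggregate_sales(rows) == {
        ("EU", "2020-01-01"): 15.0,
        ("US", "2020-01-01"): 2.5,
    }
```

Because the aggregation is a pure function, the test runs without any of the production databases, which is hard to arrange for logic that lives inside stored procedures.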
Our general process is:
1. Copy data from source table(s) into tables with exactly the same structure in a loading database
2. Transform data into staging tables, which have the same structure as the final fact/dimension tables
3. Copy data from the staging tables to the fact/dimension tables
SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.
For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often transformations depend on other data in the loading database, e.g. using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup process in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set.
In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.
Step 3 is a series of INSERT... SELECT... statements: it's fast.
So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to do, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.
I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.
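The three-step process described in this answer can be sketched end to end. The version below uses Python with sqlite3 as a stand-in for SQL Server, keeps everything in one database for brevity, and all table and column names are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create source, loading, mapping, staging and fact tables (all hypothetical)."""
    conn.executescript("""
        CREATE TABLE src_orders      (id INTEGER, customer_code TEXT, amount REAL);
        CREATE TABLE load_orders     (id INTEGER, customer_code TEXT, amount REAL);
        CREATE TABLE map_customer    (customer_code TEXT, customer_key INTEGER);
        CREATE TABLE stg_fact_orders (order_id INTEGER, customer_key INTEGER, amount REAL);
        CREATE TABLE fact_orders     (order_id INTEGER, customer_key INTEGER, amount REAL);
    """)


def run_load(conn):
    """Run the three steps inside a single transaction."""
    with conn:
        # Step 1: 1:1 copy from the source into an identically shaped loading table.
        conn.execute(
            "INSERT INTO load_orders SELECT id, customer_code, amount FROM src_orders")
        # Step 2: set-based transform into staging, using a mapping table in a JOIN
        # instead of a row-by-row lookup.
        conn.execute("""
            INSERT INTO stg_fact_orders (order_id, customer_key, amount)
            SELECT l.id, m.customer_key, l.amount
            FROM load_orders AS l
            JOIN map_customer AS m ON m.customer_code = l.customer_code
        """)
        # Step 3: plain INSERT ... SELECT from staging into the fact table.
        conn.execute(
            "INSERT INTO fact_orders SELECT order_id, customer_key, amount "
            "FROM stg_fact_orders")
```

In a real setup the loading, staging and warehouse tables would typically live in separate databases, and rows with no match in the mapping table would need explicit handling, because the inner JOIN here drops them silently.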
I would take another look at SSIS. While there is a learning curve, it can be quite flexible. It has support for a lot of different ways to manipulate data including stored procedures, ActiveX scripts and various ways to manipulate files. It has the ability to handle errors and provide notifications via email or logging. Basically, it should be able to handle just about everything. The other option, a custom application, is probably going to be a lot more work (SSIS already has a lot of the basics covered) and is still going to be fragile - any changes to data structures will require a recompile and redeployment. I think a change to your SSIS package would probably be easier to make. For some of the more complicated logic you might even need to use multiple stages - a custom C# console program to manipulate the data a bit and then an SSIS package to load it to the database.
SSIS is a bit painful to learn and there are definitely some tricks to getting the most out of it but I think it's a worthwhile investment. A good reference book or two would probably be a good investment (Wrox's Expert SQL Server 2005 Integration Services isn't bad).
I'd look at ETL (extract/transform/load) best practices. You're asking about buying vs building, a specific product, and a specific technique. It's probably worthwhile to back up a few steps first.
A few considerations:
There are a lot of subtle tricks to delivering good ETL: making it run very fast, making it easy to manage, handling rule-level audit results, supporting high availability or at least reliable recovery, and even using it as the recovery process for the reporting solution (rather than database backups).
You can build your own ETL. The downside is that commercial ETL solutions have pre-built adapters (which you may not need anyway), and that custom ETL solutions tend to fail since few developers are familiar with the batch processing patterns involved (see your existing architecture). Since ETL patterns have not been well documented, you are unlikely to succeed in writing your own ETL solution unless you bring in a developer who is very experienced in this space.
When looking at commercial solutions note that the metadata and auditing results are the most valuable part of the solution: The GUI-based transform builders aren't really any more productive than just writing code - but the metadata can be more productive than reading code when it comes to maintenance.
Complex environments are difficult to cover with a single ETL product, because of network access, performance, latency, data format, security or other requirements that are incompatible with your ETL tool. So a combination of custom and commercial often results anyway.
Open source solutions like Pentaho are really commercial solutions if you want support or critical features.
So, I'd probably go with a commercial product if pulling data from commercial apps, if the requirements (performance, etc) are tough, or if you've got a junior or unreliable programming team. Otherwise you can write your own. In that case I'd get an ETL book or consultant to help understand the typical functionality and approaches.
I've run data warehouses that were built on stored procedures, and I have used SSIS. Neither is that much better than the other IMHO. The best tool I have heard of to manage the complexity of modern ETL is called Data Build Tool (DBT) (https://www.getdbt.com/). It has a ton of features that make things more manageable. Need to refresh a particular table in the reporting server? One command will rebuild it, including refreshing all the tables it depends on back to the source. Need dynamic SQL? This offers Jinja for scripting your dynamic SQL in ways you never thought possible. Need version control for what's in your database? DBT has you covered. After all that, it's free.

Need help designing big database update process

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up an SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research CREATE ASSEMBLY, moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.
Bad news: you have many options
use flat-file transformations: Extract all the data into flat files, manipulate them using grep, awk, sed, C, Perl into the required insert/update statements and execute those against the target database.
PRO: Fast; CON: extremely ugly ... a nightmare for maintenance; don't do this if you need it for longer than a week and a couple of dozen executions.
use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within the other, so one of the fastest ways to do this is to write it as a collection of insert / update / merge statements fed with select statements.
PRO: Fast, one technology only; CON: Requires a direct connection between the databases. You might reach the limit of SQL or of the available SQL knowledge pretty fast, depending on the kind of transformation.
use T-SQL, or whatever iterative language the database provides; everything else is similar to the pure SQL approach.
PRO: pretty fast since you don't leave the database; CON: I don't know T-SQL, but if it is anything like PL/SQL it is not the nicest language for complex transformations.
use an ETL tool: There are special tools for extracting, transforming and loading data. They often support various databases and have many strategies readily available for deciding whether an update or an insert is needed.
PRO: Sorry, you'll have to ask somebody else for that; so far I have had nothing but bad experiences with those tools.
CON: A highly specialized tool that you need to master. In my personal experience: slower in implementation and in execution of the transformation than handwritten SQL. A nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, version control, CI and testing you are stuck with whatever the tool provider gives you, if anything.
use a high-level language (Java, C#, VB ...): You would load your data into proper business objects, manipulate those and store them in the database. Pretty much what you seem to be doing right now, although it sounds like there are better ORMs available, e.g. NHibernate.
PRO: Even complex manipulations can be implemented in a clean, maintainable way; you can use all the fancy tools like good IDEs, testing frameworks and CI systems to support you while developing the transformation.
CON: It adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling the objects back into the target database). I'd go this way if it is a process that is going to be around for a long time.
Building on the last option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database or more than one target database. Or you could manually implement a multithreaded transformer in order to gain throughput. But I guess I am leaving the scope of your question.
I'm with John, SSIS is the way to go for any repeatable process to import large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need to do a hybrid of set-based and looping code to run on batches (of say 2000 records at a time) rather than lock up the table for the whole time a large insert would take.
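The batching idea, together with reusing one connection instead of opening one per property, can be sketched like this. Python with sqlite3 is used as a stand-in for ADO.NET and SQL Server, and the single `properties` table is a deliberate simplification of the 15-table schema in the question. SQLite's `INSERT OR REPLACE` plays the role that an insert-or-update statement such as MERGE would play on SQL Server.

```python
import sqlite3


def create_schema(conn):
    """One hypothetical properties table keyed by (object_id, name)."""
    conn.execute("""
        CREATE TABLE properties (
            object_id INTEGER,
            name      TEXT,
            value     TEXT,
            PRIMARY KEY (object_id, name)
        )
    """)


def _flush(conn, batch):
    """Write one batch in a single transaction on the shared connection."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO properties (object_id, name, value) "
            "VALUES (?, ?, ?)",
            batch,
        )
    return len(batch)


def write_properties(conn, rows, batch_size=2000):
    """Upsert (object_id, name, value) rows in batches; return rows written."""
    batch = []
    written = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            written += _flush(conn, batch)
            batch = []
    if batch:
        written += _flush(conn, batch)
    return written
```

Committing per batch keeps each transaction short, which is the same motivation as the 2000-record batches suggested above.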

Resources