Automating jobs for service layer regression testing (PowerShell/MSBuild) - sql-server

In a .NET development environment, I'm looking to implement some regression testing scripts to do end-to-end (black-box) testing against a fully set-up, server-based application, which will end up being fairly complex.
My initial thought was to roll my own PowerShell scripts/XML configuration of the steps. But I wanted to do some analysis to see if there is anything out there I could reuse, and perhaps learn what others have done that might prove to be a best practice (which I haven't found, as of yet).
I realised I could potentially just use an MSBuild project, along with the MSBuildExtensions and community tasks, but I've found these scripts to be harder to modify/maintain in the long run.
An example of some of the job steps I'd be coding for one of the applications:
Copying files to certain directories and triggering a service to load them
Waiting for the service to finish loading the files (checking SQL tables for job completion)
Truncating tables (and similar operations on SQL databases)
Comparing SQL table output with expected results
Parsing log files
and so on
Some pretty simple PowerShell would be able to cater for most of these. I'd be interested in opinions: what do you use if you have some regression-style end-to-end testing? Do you roll your own in order to have a fairly simple and specific implementation, or use a third-party tool (like MSBuild, or something else)?
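Roughly the kind of thing I have in mind, as a sketch only (the service name, paths, tables and columns below are placeholders, and Invoke-Sqlcmd assumes the SqlServer module):

# Copy input files into the drop directory and trigger the service to load them.
Copy-Item '.\testdata\*.csv' -Destination '\\APPSRV01\ImportDrop'
Restart-Service -Name 'FileLoaderService'   # or however the load is actually triggered

# Wait for the service to finish loading, polling a job table for completion.
do {
    Start-Sleep -Seconds 10
    $job = Invoke-Sqlcmd -ServerInstance 'APPSRV01' -Database 'AppDb' `
        -Query 'SELECT TOP 1 Status FROM dbo.ImportJobs ORDER BY JobId DESC;'
} until ($job.Status -eq 'Complete')

# Compare the resulting table output with an expected-results file.
$actual   = Invoke-Sqlcmd -ServerInstance 'APPSRV01' -Database 'AppDb' `
    -Query 'SELECT OrderId, Total FROM dbo.OrderSummary ORDER BY OrderId;'
$expected = Import-Csv '.\expected\OrderSummary.csv'
Compare-Object ($expected | ForEach-Object { "$($_.OrderId),$($_.Total)" }) `
               ($actual   | ForEach-Object { "$($_.OrderId),$($_.Total)" })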

Choosing the right tool for the job is often driven by personal preference, but it really should be driven by effectiveness vs. maintainability.
MSBuild excels in task reuse and dependency chain resolution.
PowerShell shines in compressing complex processes into a few elegant commands.
In your scenario I'd probably use PowerShell for the integration-oriented work: job queuing, database access, and file IO. I'd keep MSBuild for producing the build artifacts.
There's no need for a third-party tool unless it's the top dog in its field and the price is right (i.e. open source or already purchased by your company).

Related

force.com ISV development, deployment, support

We're an ISV that's completed our first app on force.com. It's an xRM-like app with extended workflow to build out complex campaigns (not simple marketing-like campaigns) and integration with on-premise software. The platform brings enormous value, and at the same time some challenges. Interested in other ISV experiences around the following:
Application upgrade process. Customers expect a cloud app upgrade to "just happen". The reality is that there are inevitable manual pre- and post-upgrade steps that can fill many pages. We don't want to burden the customer with this, and at the same time, while we're happy to do the upgrade work for the customer, we don't want access to customer data and the need for the elaborate security assurances that come along with that access. A conundrum.
Development environment. Agile/scrum development relies on achieving full test automation and continuous integration, yet full automation beyond unit test seems difficult or impossible.
Background processing. Constraints on scheduled jobs, callouts, and futures, and issues with transaction management present challenges to traditional software development.
Curious what other ISVs have found.
Thanks!
I am now working at my second Force.com ISV and so have a fair amount of experience releasing products on the platform (I have seen 4 separate product releases, one of which included 3 version releases and one that included another version update).
If possible, you should try to remove any pre/post install steps that the user is required to do. It sounds tough, and it is, but it's the biggest reason people don't adopt a product. The idea is that installation should be quick and easy, one click, and any extra effort detracts from the user experience. Ensuring your system is data independent is a good way of getting around the data security issues you referred to, and obviously you can offer consultancy to do the upgrade work. A sensible idea might be to have a list of all the objects and fields that are affected by your product's installation and then to do a check of the customer org before installing. I would also say that installing in a sandbox and doing a couple of weeks of user testing can highlight any problems you may have in future very effectively.
It is not true that full test automation beyond unit tests cannot be done; it is actually fairly simple. The key is having the necessary framework set up. You would have a central version control system where your code is stored (a key part of agile). Then you create a script so that when code is committed, it runs an install on an SFDC org, running all tests and reporting back. You can then get this script to run a set of Apex classes or upload a bunch of CSV files to load data, with either fuller Apex tests exercising the functionality or Selenium running a set of browser tests. You can then also use this test data and script for knocking out demo environments for the sales guys.
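As a very rough sketch of that commit-triggered step, all target and script names below are placeholders for whatever tooling you actually use (the Force.com Migration Tool, a CSV data loader, Selenium, and so on):

# 1. Deploy the freshly committed code to a CI org and run all Apex tests.
ant -f .\build\build.xml deployAndTest        # hypothetical target wrapping the Migration Tool deploy with tests enabled
if ($LASTEXITCODE -ne 0) { throw 'Deploy or Apex tests failed' }

# 2. Load seed/test data (CSV uploads) so functional and Selenium tests have data to work with.
ant -f .\build\build.xml loadTestData         # hypothetical target wrapping a CSV data load
if ($LASTEXITCODE -ne 0) { throw 'Test data load failed' }

# 3. Run the browser-level suite and fail the build on errors.
& .\tools\run-selenium-suite.ps1              # hypothetical wrapper around the Selenium tests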
The governor and background processing limits are a bit tight, but they keep being increased. Maybe you should integrate with Heroku or similar to do some larger external processing? I will say, though, that I think it improves programming abilities in general, making you think about what it is you're doing and the best way to do it. This then leads to a more pleasant end user experience. Batch Apex jobs are a good way of doing this processing, and you can use the AsyncApexJob object to report the status of a run back to users.
Hope that helps and gives you a different perspective!
Paul

Setup and Deployment of a WPF application

I'm currently developing a small WPF application using a file database (SQLCe).
Since I'm near release of the product and I've had no experience with setup and deployment, I would like to hear your thoughts on this subject.
The application is small, and the updates I will make are minor database changes (such as altering tables and columns) and DLL updates.
I've tried to play around with ClickOnce deployment but I don't understand how updates to a database should be handled.
On the other hand, a standard Setup and Deployment project feels rather complex for just a couple of database updates and DLL replacements.
Which one of these two "tools" would you recommend for my given scenario?
Are there any best practices or other tools that can ease the setup and deployment work?
Cheers!
Try NSIS (http://nsis.sourceforge.net/). It's a good tool and allows custom update programs to be written quite easily. It would be able to handle all of the DLL replacements and is well suited to this type of deployment.
In terms of the database updates: if you're going to be writing scripts to update the database tables, you will need to consider how you're going to connect to the local instance of the database to run the scripts against it. For a more automated solution, you may want a small helper application that looks up the location of the database and executes the scripts when the NSIS installer runs.
Small overhead, with a lot of flexibility.
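If PowerShell is available on the target machines, a minimal sketch of that helper, assuming SQL Server Compact 4.0 and made-up paths you would adjust:

# Sketch only: applies numbered update scripts to a local SQL CE database.
# The assembly path, database location and script folder are assumptions.
Add-Type -Path 'C:\Program Files\Microsoft SQL Server Compact Edition\v4.0\Desktop\System.Data.SqlServerCe.dll'

$dbPath = Join-Path $env:LOCALAPPDATA 'MyApp\MyApp.sdf'   # hypothetical install location
$conn   = New-Object System.Data.SqlServerCe.SqlCeConnection "Data Source=$dbPath"
$conn.Open()
try {
    foreach ($script in Get-ChildItem "$PSScriptRoot\updates\*.sql" | Sort-Object Name) {
        # SQL CE runs one statement at a time, so split each script on semicolons.
        foreach ($stmt in (Get-Content $script.FullName -Raw) -split ';' | Where-Object { $_.Trim() }) {
            $cmd = $conn.CreateCommand()
            $cmd.CommandText = $stmt
            [void]$cmd.ExecuteNonQuery()
        }
    }
}
finally {
    $conn.Close()
}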

Strategies for populating a Reporting/Data Warehouse database

For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different than that of the separate 'production' databases that we are aggregating so there is a good amount of business logic that goes into how the data is aggregated.
Right now this process is implemented by several stored procedures that run nightly. As we add more details to the reporting database the logic in the stored procedures keeps growing more fragile and unmanageable.
What are some other strategies that could be used to populate this reporting database?
SSIS? This has been considered but doesn't appear to offer a much cleaner, more maintainable approach than just the stored procedures.
A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write Unit Tests for the logic and organize the code in a much more maintainable manner.
I'm looking for any new ideas or additional thoughts on the above. Thanks!
Our general process is:
1. Copy data from the source table(s) into tables with exactly the same structure in a loading database
2. Transform the data into staging tables, which have the same structure as the final fact/dimension tables
3. Copy data from the staging tables to the fact/dimension tables
SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.
For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often transformations depend on other data in the loading database, e.g. using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup process in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set.
In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.
Step 3 is a series of INSERT... SELECT... statements: it's fast.
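As an illustration only (hypothetical table and server names; Invoke-Sqlcmd comes with the SqlServer module), a staging-to-fact load with the dimension-key lookup done as a set-based JOIN looks something like this:

$query = @"
INSERT INTO dbo.FactSales (DateKey, CustomerKey, Amount)
SELECT s.DateKey,
       c.CustomerKey,      -- key lookup done as a set-based JOIN, not row-by-row
       s.Amount
FROM   staging.Sales AS s
JOIN   dbo.DimCustomer AS c ON c.SourceCustomerId = s.CustomerId;
"@
Invoke-Sqlcmd -ServerInstance 'REPORTSRV' -Database 'Reporting' -Query $query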
So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to do, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.
I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.
I would take another look at SSIS. While there is a learning curve, it can be quite flexible. It has support for a lot of different ways to manipulate data including stored procedures, ActiveX scripts and various ways to manipulate files. It has the ability to handle errors and provide notifications via email or logging. Basically, it should be able to handle just about everything. The other option, a custom application, is probably going to be a lot more work (SSIS already has a lot of the basics covered) and is still going to be fragile - any changes to data structures will require a recompile and redeployment. I think a change to your SSIS package would probably be easier to make. For some of the more complicated logic you might even need to use multiple stages - a custom C# console program to manipulate the data a bit and then an SSIS package to load it to the database.
SSIS is a bit painful to learn and there are definitely some tricks to getting the most out of it, but I think it's a worthwhile investment. A good reference book or two would probably help (Wrox's Expert SQL Server 2005 Integration Services isn't bad).
I'd look at ETL (extract/transform/load) best practices. You're asking about buying vs. building, a specific product, and a specific technique. It's probably worthwhile to back up a few steps first.
A few considerations:
There are a lot of subtle tricks to delivering good ETL: making it run very fast, making it easy to manage, handling rule-level audit results, supporting high availability or reliable recovery, and even using it as the recovery process for the reporting solution (rather than database backups).
You can build your own ETL. The downside is that commercial ETL solutions have pre-built adapters (which you may not need anyway), and that custom ETL solutions tend to fail because few developers are familiar with the batch processing patterns involved (see your existing architecture). Since ETL patterns have not been well documented, you are unlikely to succeed in writing your own ETL solution unless you bring in a developer very experienced in this space.
When looking at commercial solutions note that the metadata and auditing results are the most valuable part of the solution: The GUI-based transform builders aren't really any more productive than just writing code - but the metadata can be more productive than reading code when it comes to maintenance.
Complex environments are difficult to address with a single ETL product - because of network access, performance, latency, data format, security or other requirements incompatible with your ETL tool. So a combination of custom and commercial often results anyway.
Open source solutions like Pentaho are really commercial solutions if you want support or critical features.
So, I'd probably go with a commercial product if pulling data from commercial apps, if the requirements (performance, etc) are tough, or if you've got a junior or unreliable programming team. Otherwise you can write your own. In that case I'd get an ETL book or consultant to help understand the typical functionality and approaches.
I've run data warehouses that were built on stored procedures, and I have used SSIS. Neither is that much better than the other, IMHO. The best tool I have heard of to manage the complexity of modern ETL is dbt (data build tool, https://www.getdbt.com/). It has a ton of features that make things more manageable. Need to refresh a particular table in the reporting server? One command will rebuild it, including refreshing all the tables it depends on back to the source. Need dynamic SQL? It offers Jinja for scripting your dynamic SQL in ways you never thought possible. Need version control for what's in your database? dbt has you covered. After all that, it's free.
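For example, a couple of hypothetical invocations (assuming dbt is installed, a profile is configured, and a model named fct_daily_sales exists):

# Rebuild one reporting table plus everything upstream of it (the '+' graph selector).
dbt run --select +fct_daily_sales

# Run the tests defined for the same subgraph.
dbt test --select +fct_daily_sales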

How should you build your database from source control?

There has been some discussion on the SO community wiki about whether database objects should be version controlled. However, I haven't seen much discussion about the best-practices for creating a build-automation process for database objects.
This has been a contentious point of discussion for my team - particularly since developers and DBAs often have different goals, approaches, and concerns when evaluating the benefits and risks of an automation approach to database deployment.
I would like to hear some ideas from the SO community about what practices have been effective in the real world.
I realize that it is somewhat subjective which practices are really best, but I think a good dialog about what works could be helpful to many folks.
Here are some of my teaser questions about areas of concern in this topic. These are not meant to be a definitive list - rather a starting point for people to help understand what I'm looking for.
Should both test and production environments be built from source control?
Should both be built using automation - or should production be built by copying objects from a stable, finalized test environment?
How do you deal with potential differences between test and production environments in deployment scripts?
How do you test that the deployment scripts will work as effectively against production as they do in test?
What types of objects should be version controlled?
Just code (procedures, packages, triggers, java, etc)?
Indexes?
Constraints?
Table Definitions?
Table Change Scripts? (eg. ALTER scripts)
Everything?
Which types of objects shouldn't be version controlled?
Sequences?
Grants?
User Accounts?
How should database objects be organized in your SCM repository?
How do you deal with one-time things like conversion scripts or ALTER scripts?
How do you deal with retiring objects from the database?
Who should be responsible for promoting objects from development to test level?
How do you coordinate changes from multiple developers?
How do you deal with branching for database objects used by multiple systems?
What exceptions, if any, can reasonably be made to this process?
Security issues?
Data with de-identification concerns?
Scripts that can't be fully automated?
How can you make the process resilient and enforceable?
To developer error?
To unexpected environmental issues?
For disaster recovery?
How do you convince decision makers that the benefits of DB-SCM truly justify the cost?
Anecdotal evidence?
Industry research?
Industry best-practice recommendations?
Appeals to recognized authorities?
Cost/Benefit analysis?
Who should "own" database objects in this model?
Developers?
DBAs?
Data Analysts?
More than one?
Here are some answers to your questions:
Should both test and production environments be built from source control? YES
Should both be built using automation - or should production be built by copying objects from a stable, finalized test environment?
Automation for both. Do NOT copy data between the environments
How do you deal with potential differences between test and production environments in deployment scripts?
Use templates, so that you actually produce a different set of scripts for each environment (e.g. references to external systems, linked databases, etc.)
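A minimal sketch of the templating idea using sqlcmd-style scripting variables (file, server and variable names are hypothetical):

# deploy.sql references $(Env) and $(LinkedServer); the same script is rendered
# differently per environment by passing different values.
$vars = @('Env=TEST', 'LinkedServer=TEST-EXTSYS01')
Invoke-Sqlcmd -ServerInstance 'TEST-DB01' -Database 'AppDb' `
    -InputFile '.\deploy.sql' -Variable $vars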
How do you test that the deployment scripts will work as effectively against production as they do in test?
You test them in a pre-production environment: test the deployment on an exact copy of the production environment (database and potentially other systems).
What types of objects should be version controlled?
Just code (procedures, packages, triggers, java, etc)?
Indexes?
Constraints?
Table Definitions?
Table Change Scripts? (eg. ALTER scripts)
Everything?
Everything, and:
Do not forget static data (lookup lists etc), so you do not need to copy ANY data between environments
Keep only the current version of the database scripts (version controlled, of course), and
Store ALTER scripts: one BIG script (or a directory of scripts named like 001_AlterXXX.sql, so that running them in natural sort order will upgrade from version A to B; see the sketch below)
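A minimal sketch of running those numbered scripts in natural sort order (folder, server and database names are hypothetical; Invoke-Sqlcmd requires the SqlServer module):

foreach ($s in Get-ChildItem '.\alter\*.sql' | Sort-Object Name) {   # 001_..., 002_..., ...
    Write-Host "Applying $($s.Name)"
    Invoke-Sqlcmd -ServerInstance 'DEV-DB01' -Database 'AppDb' -InputFile $s.FullName
}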
Which types of objects shouldn't be version controlled?
Sequences?
Grants?
User Accounts?
see 2. If your users/roles (or technical user names) are different between environments, you can still script them using templates (see 1.)
How should database objects be organized in your SCM repository?
How do you deal with one-time things like conversion scripts or ALTER scripts?
see 2.
How do you deal with retiring objects from the database?
deleted from DB, removed from source control trunk/tip
Who should be responsible for promoting objects from development to test level?
dev/test/release schedule
How do you coordinate changes from multiple developers?
Try NOT to create a separate database for each developer. You use source control, right? In that case developers change the database and check in the scripts. To be completely safe, re-create the database from the scripts during the nightly build.
How do you deal with branching for database objects used by multiple systems?
tough one: try to avoid at all costs.
What exceptions, if any, can reasonably be made to this process?
Security issues?
Do not store passwords for test/prod. You may allow it for dev, especially if you have automated daily/nightly DB rebuilds.
Data with de-identification concerns?
Scripts that can't be fully automated?
document and store with the release info/ALTER script
How can you make the process resilient and enforceable?
To developer error?
Test with a daily build from scratch, and compare the results to the incremental upgrade (from version A to B using ALTER scripts). Compare both the resulting schema and the static data.
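One hedged way to do the schema half of that comparison (database and server names are hypothetical):

function Get-SchemaFingerprint([string]$Database) {
    Invoke-Sqlcmd -ServerInstance 'BUILD-DB01' -Database $Database `
        -Query 'SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE FROM INFORMATION_SCHEMA.COLUMNS' |
        ForEach-Object { "$($_.TABLE_NAME).$($_.COLUMN_NAME) $($_.DATA_TYPE) $($_.IS_NULLABLE)" } |
        Sort-Object
}

# Any output from Compare-Object means the two builds have drifted apart.
Compare-Object (Get-SchemaFingerprint 'App_FromScratch') (Get-SchemaFingerprint 'App_Upgraded')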
To unexpected environmental issues?
use version control and backups
Compare the PROD database schema to what you think it is, especially before deployment. A SuperDuperCool DBA may have fixed a bug that was never in your ticket system :)
For disaster recovery?
How do you convince decision makers that the benefits of DB-SCM truly justify the cost?
Anecdotal evidence?
Industry research?
Industry best-practice recommendations?
Appeals to recognized authorities?
Cost/Benefit analysis?
If developers and DBAs agree, you do not need to convince anyone, I think (unless you need money to buy software like DB Ghost for MSSQL).
Who should "own" database objects in this model?
Developers?
DBAs?
Data Analysts?
More than one?
Usually DBAs approve the model (before check-in, or after as part of code review). They definitely own performance-related objects. But in general the team owns it [and the employer, of course :)]
I treat the SQL as source-code when possible
If I can write it in standards-compliant SQL then it generally goes in a file in my source control. The file will define as much as possible, such as stored procedures and table CREATE statements.
I also include dummy data for testing in source control:
proj/sql/setup_db.sql
proj/sql/dummy_data.sql
proj/sql/mssql_specific.sql
proj/sql/mysql_specific.sql
And then I abstract out all my SQL queries so that I can build the entire project for MySQL, Oracle, MSSQL or anything else.
Build and test automation uses these build scripts, as they are as important as the app source, and tests everything from integrity through triggers, procedures and logging.
We use continuous integration via TeamCity. At each checkin to source control, the database and all the test data is re-built from scratch, then the code, then the unit tests are run against the code. If you're using a code-generation tool like CodeSmith, it can also be placed into your build process to generate your data access layer fresh with each build, making sure that all your layers "match up" and do not produce errors due to mismatched SP parameters or missing columns.
Each build has its own collection of SQL scripts that are stored in the $project\SQL\ directory in source control, assigned a numerical prefix and executed in order. That way, we're practicing our deployment procedure at every build.
Depending on the lookup table, most of our lookup values are also stored in scripts and run to make sure the configuration data is what we expect for, say, "reason_codes" or "country_codes". This way we can make a lookup data change in dev, test it out and then "promote" it through QA and production, instead of using a tool to modify lookup values in production, which can be dangerous for uptime.
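For illustration, a hedged sketch of an idempotent lookup-data script (a made-up country_codes table) that can be promoted unchanged from dev through QA to production:

$seed = @"
MERGE dbo.country_codes AS target
USING (VALUES ('US', 'United States'), ('GB', 'United Kingdom')) AS source (code, name)
ON target.code = source.code
WHEN MATCHED THEN UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN INSERT (code, name) VALUES (source.code, source.name);
"@
Invoke-Sqlcmd -ServerInstance 'DEV-DB01' -Database 'AppDb' -Query $seed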
We also create a set of "rollback" scripts that undo our database changes, in case a build to production goes screwy. You can test the rollback scripts by running them, then re-running the unit tests for the build one version below yours, after its deployment scripts run.
+1 for Liquibase:
LiquiBase is an open source (LGPL), database-independent library for tracking, managing and applying database changes. It is built on a simple premise: All database changes (structure and data) are stored in an XML-based descriptive manner and checked into source control.
The good point is that DML changes are stored semantically, not just as diffs, so you can track the purpose of the changes.
It can be combined with Git version control for better interaction. I'm going to configure our dev-prod environment to try it out.
You could also use Maven or Ant build systems for building production code from the scripts.
The minus is that LiquiBase doesn't integrate into widespread SQL IDEs, so you have to do basic operations yourself.
In addition to this, you could use DBUnit for DB testing - this tool allows data generation scripts to be used for testing your production environment, with cleanup afterwards.
IMHO:
Store DML in files so that you can version them.
Automate the schema build process from source control.
For testing purposes, a developer can use a local DB built from source control via the build system, and load test data with scripts or DBUnit scripts (also from source control).
LiquiBase allows you to provide a "run sequence" of scripts to respect dependencies.
There should be a DBA team that checks the master branch with ALL changes before production use. I mean they check the trunk/branch from other DBAs before committing into the MASTER trunk, so that master is always consistent and production-ready.
We faced all of the problems mentioned (code changes, merging, rewriting) in our production billing database. This topic is great for discovering all that stuff.
By asking "teaser questions" you seem to be more interested in a discussion than someone's opinion of final answers. The active (>2500 members) mailing list agileDatabases has addressed many of these questions and is, in my experience, a sophisticated and civil forum for this kind of discussion.
I basically agree with every answer given by van. For more insight, my baseline for database management is the K. Scott Allen series (a must-read, IMHO; and Jeff's opinion too, it seems).
Database objects can always be rebuilt from scratch by launching a single SQL file (that can itself call other SQL files): Create.sql. This can include static data insertion (lists...).
The SQL scripts are parameterized so that no environment-dependent and/or sensitive information is stored in plain files.
I use a custom batch file to launch Create.sql: Create.cmd. Its goal is mainly to check for prerequisites (tools, environment variables...) and send parameters to the SQL script. It can also bulk-load static data from CSV files for performance reasons.
Typically, system user credentials would be passed as a parameter to the Create.cmd file.
IMHO, dynamic data loading should require another step, depending on your environment. Developers will want to load their database with test, junk or no data at all, while at the other end production managers will want to load production data. I would consider storing test data in source control as well (to ease unit testing, for instance).
Once the first version of the database has been put into production, you will need not only build scripts (mainly for developers), but also upgrade scripts (based on the same principles):
There must be a way to retrieve the version from the database (I use a stored procedure, but a table would do as well).
Before releasing a new version, I create an Upgrade.sql file (that can call other ones) that allows upgrading version N-1 to version N (N being the version being released). I store this script under a folder named N-1.
I have a batch file that does the upgrade: Upgrade.cmd. It can retrieve the current version (CV) of the database via a simple SELECT statement, launch the Upgrade.sql script stored under the CV folder, and loop until no folder is found. This way, you can automatically upgrade from, say, N-3 to N.
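A rough PowerShell rendering of that loop (the post uses a batch file; the version table, folder layout and server name here are hypothetical):

# Layout: .\upgrades\<version>\Upgrade.sql upgrades <version> to <version>+1, and each
# Upgrade.sql is expected to bump the row in dbo.SchemaVersion; otherwise the loop never ends.
while ($true) {
    $cv = (Invoke-Sqlcmd -ServerInstance 'PROD-DB01' -Database 'AppDb' `
        -Query 'SELECT MAX(Version) AS V FROM dbo.SchemaVersion;').V
    $script = ".\upgrades\$cv\Upgrade.sql"
    if (-not (Test-Path $script)) { break }   # no folder for the current version: nothing left to apply
    Write-Host "Upgrading from version $cv"
    Invoke-Sqlcmd -ServerInstance 'PROD-DB01' -Database 'AppDb' -InputFile $script
}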
Problems with this are :
It is difficult to automatically compare database schemas, depending on database vendors. This can lead to incomplete upgrade scripts.
Every change to the production environment (usually by DBAs for performance tuning) should find its way to the source control as well. To make sure of this, it is usually possible to log every modification to the database via a trigger. This log is reset after every upgrade.
More ideally, though, DBA initiated changes should be part of the release/upgrade process when possible.
As for what kind of database objects you want to have under source control? Well, I would say as much as possible, but not more ;-) If you want to create users with passwords, give them a default password (login/login, practical for unit testing purposes), and make the password change a manual operation. This happens a lot with Oracle, where schemas are also users...
We have our Silverlight project with an MSSQL database in Git version control. The easiest way is to make sure you've got a slimmed-down database (content-wise), and do a complete dump from e.g. Visual Studio. Then you can run 'sqlcmd' from your build script to recreate the database on each dev machine.
For deployment this is not possible since the databases are too large: that's the main reason for having them in a database in the first place.
I strongly believe that a DB should be part of source control, and to a large degree part of the build process. If it is in source control, then I have the same coding safeguards when writing a stored procedure in SQL as I do when writing a class in C#.
I do this by including a DB scripts directory under my source tree. This script directory doesn't necessarily have one file for one object in the database. That would be a pain in the butt! I develop in my DB just as I would in my code project. Then, when I am ready to check in, I do a diff between the last version of my database and the current one I am working on. I use SQL Compare for this, and it generates a script of all the changes. This script is then saved to my db_update directory with a specific naming convention, 1234_TasksCompletedInThisIteration, where the number is the next number in the set of scripts already there and the name describes what is being done in this check-in.
I do it this way because, as part of my build process, I start with a fresh database that is then built up programmatically using the scripts in this directory. I wrote a custom NAnt task that iterates through each script, executing its contents on the bare DB. Obviously, if I need some data to go into the DB, then I have data insert scripts too.
This has many benefits to it. One, all of my stuff is versioned. Two, each build is a fresh build, which means that there won't be any sneaky stuff eking its way into my development process (such as dirty data that causes oddities in the system). Three, when a new guy is added to the dev team, they simply need to get latest and their local dev is built for them on the fly. Four, I can run test cases (I didn't call it a "unit test"!) on my database, as the state of the database is reset with each build (meaning I can test my repositories without worrying about adding test data to the DB).
This is not for everyone.
This is not for every project. I usually work on green field projects which allows me this convenience!
Rather than get into ivory tower arguments, here's a solution that has worked very well for me on real-world problems.
Building a database from scratch can be summarised as managing SQL scripts.
DBdeploy is a tool that will check the current state of a database - e.g. which scripts have previously been run against it, which scripts are available to be run, and therefore which scripts need to be run.
It will then collate all the needed scripts together and run them. It then records which scripts have been run.
It's not the prettiest tool or the most complex, but with careful management it can work very well. It's open source and easily extensible. Once the running of the scripts is handled nicely, adding extra components, such as a shell script that checks out the latest scripts and runs dbdeploy against a particular instance, is easily achieved.
See a good introduction here:
http://code.google.com/p/dbdeploy/wiki/GettingStarted
You might find that Liquibase handles a lot of what you're looking for.
Every developer should have their own local database, and use source code control to publish to the team. My solution is here: http://dbsourcetools.codeplex.com/
Have fun,
- Nathan

ETL Tools and Build Tools

I am familiar with automated software build tools (such as Automated Build Studio). Now I am looking at ETL tools.
The one thing that crosses my mind is that I can do anything an ETL tool does by using a software build tool. ETL tools are tailored for data loading and manipulation, for which a lot of scripts are needed in order to do the job. A software build tool, on the other hand, is versatile enough to do any job, including writing scripts to extract, transform and load any data from any format into any format.
Am I right?
It is correct that you can roll out your own ETL scripts written using a development tool of your preference. Having said that, ETL jobs are frequently large (for lack of a better word) and demand considerable administration and attention to minute details (like programming). ETL tools allow the developer to focus on ETL tasks, as opposed to writing and debugging code, although that's part of it too. There are some open-source tools out there, so you can get a feel for what an average tool does before jumping into custom development. For example, the more expensive tools provide data lineage, meaning you can (graphically) track every field on a report back to the originating table through all transformations (versions included); after a corporate merger that's quite a task to do.
For example, Pentaho has a community edition; if you have MS SQL Server, you can get SSIS. Also see if you can find something here.
The benefit of an ETL tool is maximized if you have many processes to build (I like jsf80238's analogy about hammering in 100 nails). A key benefit of real ETL tools is the metadata they generate and the operational support. Writing your scripts in Perl/Ruby/etc. is fairly easy, but breaks down when problems need to be tracked down or someone other than the author has to figure out what's wrong. The ability for admin/support staff to quickly see what went wrong is what's worth paying money for. I have used Microsoft's SSIS (2005 - OK) and the latest Pentaho PDI (quite good). The Pentaho ETL GUI is used by business users (without IT support 99% of the time) at my workplace, and has replaced a tangle of SQL scripts and spreadsheets. Say what you like about the rest of the Pentaho stack, but the ETL component is, in my opinion, excellent "bang for buck".
The whole business of ETL is based on the premise that the source of the data is incompatible with the destination data source. And many times, the folks who dump the source data may not be thinking that this data needs to be collected and aggregated. This is why the whole business of ETL exists.
A commercial ETL tool will not magically read the source input and transform data according to the rules of the destination database. Rules have to be defined and fed into the ETL tool. Interestingly, many companies offer training!!! on how to use their proprietary scripting language. So it is not always that easy. But for non-programmers, maybe this is the preferred route.
Personally, I think it is always easier to write your own ETL tool in a language like Perl. Simply write a state-machine algorithm to rip through the source data and convert it to the desired format. I use Perl to FTP into machines, read in the files, transform the data, and then load it into the database. This is always a superior solution and much faster if one is proficient in Perl or similar, or can hire someone who knows Perl.
And one final point: start with the end in mind. Dump your source data in a structured format to help out the analysis group in your company who wants to aggregate and study the data. This will make the ETL program easier and faster to develop.
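To make the hand-rolled approach concrete, a rough PowerShell sketch of such a pipeline (the post describes Perl; the paths, columns and target table below are made up):

# Extract: read the exported file. Transform: clean and type the fields.
$rows = Import-Csv '\\dropzone\export\sales_20240101.csv' |
    Where-Object { ($_.Amount -as [decimal]) -ne $null } |          # drop malformed rows
    ForEach-Object {
        [pscustomobject]@{
            SaleDate = [datetime]::ParseExact($_.SaleDate, 'yyyyMMdd', $null)
            Region   = $_.Region.Trim().ToUpper()
            Amount   = [decimal]$_.Amount
        }
    }

# Load: row-by-row is fine for a sketch; for real volumes use BULK INSERT or SqlBulkCopy.
foreach ($r in $rows) {
    Invoke-Sqlcmd -ServerInstance 'DW-DB01' -Database 'Staging' -Query @"
INSERT INTO staging.Sales (SaleDate, Region, Amount)
VALUES ('$($r.SaleDate.ToString('yyyy-MM-dd'))', '$($r.Region)', $($r.Amount));
"@
}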
I like Damir Sudarevic's answer and wanted to add that your choice of tool might also depend on how much work you have in front of you. If you have the occasional ETL task and are already familiar with a tool that will allow you to accomplish that task, use the tool you already know (this approach assigns a zero value to learning a new tool, which is perhaps undervaluing new knowledge). If you have a lot of ETL tasks, the up-front investment of learning a new tool might very well pay off. You can use pliers to drive a nail, and if you have only one nail you can use the pliers. If you have to drive 100 nails get yourself a hammer.
You can also do anything ETL tools can do with code. :-)
Both tool categories you mention can be used to solve this problem, but they are optimized for the class of problems they are trying to solve:
ETLs tend to come with a library of data manipulation tools (relational calculus, in-line computations, etc.), are optimized to handle large quantities of data, and have job management features (important if this isn't a single one-off data migration).
Build tools (for me, Ant comes to mind as a prototypical example) could do similar tasks, but are focused on compilation, file organization and manipulation, and packaging.
