Database design for physics hardware - database

I have to develop a database for a unique environment. I don't have experience with database design and could use everybody's wisdom.
My group is designing a database for a piece of physics hardware and a data acquisition system. We need a system that will store all the hardware configuration parameters, and track the changes to these parameters as they are changed by the user.
The setup:
We have nearly 200 detectors and roughly 40 parameters associated with each detector. Of these 40 parameters, we expect only a few to change during the course of the experiment. Most parameters associated with a single detector are static.
We collect data for this experiment in timed runs. During these runs, the parameters loaded into the hardware must not change, although we should be able to edit the database at any time to prepare for the next run.
The current plan:
The database will provide the difference between the current parameters and the parameters used during the last run.
At the start of a new run, the most recent database changes will be loaded into hardware.
The settings used for the upcoming run must be tagged with a run number and the current date and time. This is essential. I need a run-by-run history of the experimental setup.
There will be several different clients that both read and write to the database. Although changes to the database will be infrequent, I cannot guarantee that the changes won't happen concurrently.
Must be robust and non-corruptible. The configuration of the experimental system depends on the hardware. Any breakdown of the database would prevent data acquisition, and our time is expensive. Database backups?
My current plan is to implement the above requirements using a sqlite database, although I am unsure if it can support all my requirements. Is there any other technology I should look into? Has anybody done something similar? I am willing to learn any technology, as long as it's mature.
Tips and advice are welcome.
Thank you,
Sean
Update 1:
Database access:
There are three lightweight applications that can read from and write to the database, and one application that can only read.
The applications with write access are responsible for setting a non-overlapping subset of the hardware parameters. To be specific, we have one application (of which there may be multiple copies) which sets the high voltage, one application which sets the remainder of the hardware parameters which may change during the experiment, and one GUI which sets the remainder of the parameters which are nearly static and are only essential for the proper reconstruction of the data.
The program with read access only is our data analysis software. It needs access to nearly all of the parameters in the database to properly format the incoming data into something we can analyze properly. The number of connections to the database should be >10.
Backups:
Another setup at our lab dumps an xml file every run. Even though I don't think xml is appropriate, I was planning to back up the system every run, just in case.

Some basic points about the design. Make sure that you never delete data from any table. Keep track of the most recent data (probably best done with a last-updated datetime column), and when a value changes, add the new value rather than deleting the old one. When a run is initiated, tag every table used with the Run ID (in another column). This way you maintain a full historical record of every setting, and can pin down exactly what state was used for a given run.
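As a rough illustration of the add-only idea, here is a minimal sketch using Python's built-in sqlite3 module, since the question mentions SQLite. The table and column names (param_history, run, last_change) are invented for the example. It is also a variation on the advice above: instead of stamping every row with a Run ID, each run records the highest history id that existed when it started, which is enough to reconstruct the settings for that run and to diff two runs.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not from the original post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE param_history (
    id          INTEGER PRIMARY KEY,
    detector_id INTEGER NOT NULL,
    param_name  TEXT    NOT NULL,
    value       REAL    NOT NULL,
    changed_at  TEXT    NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE run (
    run_number  INTEGER PRIMARY KEY,
    started_at  TEXT    NOT NULL DEFAULT (datetime('now')),
    last_change INTEGER NOT NULL  -- highest param_history.id visible to this run
);
""")

def set_param(detector_id, name, value):
    """Record a change by inserting a new row; old rows are never touched."""
    with conn:
        conn.execute(
            "INSERT INTO param_history (detector_id, param_name, value) "
            "VALUES (?, ?, ?)", (detector_id, name, value))

def start_run(run_number):
    """Tag the current state with a run number and a timestamp."""
    with conn:
        conn.execute(
            "INSERT INTO run (run_number, last_change) "
            "SELECT ?, COALESCE(MAX(id), 0) FROM param_history", (run_number,))

def settings_for_run(run_number):
    """Return {(detector_id, param_name): value} as it stood when the run began."""
    rows = conn.execute("""
        SELECT h.detector_id, h.param_name, h.value
        FROM param_history h
        JOIN run r ON r.run_number = ?
        WHERE h.id = (SELECT MAX(h2.id) FROM param_history h2
                      WHERE h2.detector_id = h.detector_id
                        AND h2.param_name = h.param_name
                        AND r.last_change >= h2.id)
    """, (run_number,)).fetchall()
    return {(d, p): v for d, p, v in rows}

def diff_runs(old, new):
    """Parameters whose value differs between two runs, as (old, new) pairs."""
    a, b = settings_for_run(old), settings_for_run(new)
    return {k: (a.get(k), b[k]) for k in b if a.get(k) != b[k]}

set_param(1, "high_voltage", 1500.0)
set_param(1, "gain", 2.0)
start_run(1)
set_param(1, "high_voltage", 1550.0)  # edited while run 1 is in progress
start_run(2)
changes = diff_runs(1, 2)
```

Because nothing is ever updated or deleted, an edit made during a run cannot alter what that run is recorded as having used.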

Ask around of your colleagues.
You don't say what kind of physics you're doing, or how big the working group is, but in my discipline (particle physics) there is a deep repository of experience in putting up and running just this type of system (we call it "slow controls" and similar). There is a pretty good chance that someone you work with has either done this or knows someone who has. There may be a detailed description of the last time out in someone's thesis.
I don't personally have much to do with this, but I do know this: one common feature is to have a no-delete-no-overwrite design. You can only add data, never remove it. This preserves your chances of figuring out what really happened in the case of trouble.
Perhaps I should explain a little more. While this is an important task and has to be done right, it is not really related to physics, so you can't look it up on Spires or on arXiv.org. No one writes papers on the design and implementation of medium sized slow controls databases. But they do sometimes put it in their dissertations. The easiest way to find a pointer really is to ask a bunch of people around the lab.

This is not a particularly large database by the sounds of things. So you might be able to get away with using Oracle's free database which will give you all kinds of great flexibility with journaling (not sure if that is an actual word) and administration.
Your mentioning of 'non-corruptible' right after you say "There will be several different clients that both read and write to the database" raises a red flag for me. Are you planning on creating some sort of application that has an interface for this? Or were you planning on direct access to the db via a tool like TOAD?
In order to preserve your data integrity you will need to get really strict on your permissions. I would only allow one (and a backup) person to have admin rights with the ability to do the data manipulation outside the GUI (which will make your life easier).
Backups? Yes, absolutely! Not only should you do daily, weekly and monthly backups, you should also do both full and incremental ones. Also, test your backup images often to confirm they are in fact working.
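To make the "test your backups" point concrete, here is a minimal sketch that assumes the SQLite route from the question rather than Oracle. It uses the online backup call in Python's sqlite3 module and then runs SQLite's integrity check on the copy. The file names and the setting table are made up for the example.

```python
import os
import sqlite3
import tempfile

def backup_and_verify(src_path, dest_path):
    """Copy a live SQLite database to dest_path, then check the copy."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # online backup; other clients can stay connected
        result = dest.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dest.close()
        src.close()
    return result == "ok"

# Hypothetical paths and table, for illustration only.
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "config.db")
copy = os.path.join(workdir, "config_backup.db")

con = sqlite3.connect(live)
con.execute("CREATE TABLE setting (name TEXT PRIMARY KEY, value REAL)")
con.execute("INSERT INTO setting VALUES ('high_voltage', 1500.0)")
con.commit()
con.close()

backup_ok = backup_and_verify(live, copy)

# Reading the data back from the copy is the real test of a backup.
check = sqlite3.connect(copy)
restored = check.execute("SELECT value FROM setting").fetchone()[0]
check.close()
```

A per-run backup as described in the question's update could call something like backup_and_verify at the start of each run, writing to a file name that includes the run number.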
As for the data structure I would need much greater detail in what you are trying to store and how you would access it. But from what you have put here I would say you need the following tables (to begin with):
Detectors
Parameters
Detector_Parameters
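A minimal sketch of how those three tables might fit together, again using SQLite through Python only because it is easy to run. Every column name here is an assumption; the real ones depend on the detail asked for above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Column names are illustrative guesses, not taken from the question.
CREATE TABLE detectors (
    detector_id INTEGER PRIMARY KEY,
    label       TEXT NOT NULL UNIQUE
);
CREATE TABLE parameters (
    parameter_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    unit         TEXT
);
-- One row per (detector, parameter) pair holds the value itself.
CREATE TABLE detector_parameters (
    detector_id  INTEGER NOT NULL REFERENCES detectors(detector_id),
    parameter_id INTEGER NOT NULL REFERENCES parameters(parameter_id),
    value        REAL    NOT NULL,
    PRIMARY KEY (detector_id, parameter_id)
);
""")

conn.executemany("INSERT INTO detectors VALUES (?, ?)",
                 [(1, "DET-001"), (2, "DET-002")])
conn.executemany("INSERT INTO parameters VALUES (?, ?, ?)",
                 [(1, "high_voltage", "V"), (2, "threshold", "mV")])
conn.executemany("INSERT INTO detector_parameters VALUES (?, ?, ?)",
                 [(1, 1, 1500.0), (1, 2, 30.0), (2, 1, 1480.0)])
conn.commit()

def parameters_for(label):
    """All parameter values currently stored for one detector."""
    return conn.execute("""
        SELECT p.name, dp.value, p.unit
        FROM detector_parameters dp
        JOIN detectors d  ON d.detector_id  = dp.detector_id
        JOIN parameters p ON p.parameter_id = dp.parameter_id
        WHERE d.label = ?
        ORDER BY p.name
    """, (label,)).fetchall()

det1 = parameters_for("DET-001")
```

With roughly 200 detectors and 40 parameters, the linking table would hold on the order of 8,000 rows, which is small for any of the engines discussed here.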
Some additional notes:
Since you will be doing so many changes I recommend using a version control like SVN to keep track of all your DDLs etc. I would also recommend using something like bugzilla for bug tracking (if needed) and using google docs for team document management.
Hope that helps.

Related

Creating a biological database: First steps?

My lab is doing a lot of sequencing, but the way the sequences are documented makes it difficult to retrieve them or keep track of the data. I would like to create a database that has following features:
-A Graphical user interface to allow one to upload/retrieve/view data, and can incorporate links to quickly BLAST or analyse the sequences with other online tools.
-allows one to access it in the command line
-that has another section on the GUI that has records of what's in the lab, what needs to be ordered etc.
I wanted to know if there are general database templates I can adopt and modify to suit my lab needs? I have no experience in database design but have read about mySQL.
What are the first steps I should take in embarking on this project?
Thank you!
This is an interesting question and problem domain (one I now have experience with, btw). Your first step is to decide on a general architecture and then select technologies for this.
For the web/graphical side, there are lots of off the shelf components (I assume you are aware of tools like AntiSMASH, JBrowse, etc). But you will need to evaluate these. That is way outside the scope of the db side however.
On the database side, PostgreSQL performs admirably here. I have worked on a heavily loaded 10+TB db which was specifically storing sequencing data, BLAST reports, and so forth. If you add stuff like PostBIS on top of that, you get something quite functional.
A lot of the heavier portions of the industry, however, are using Hadoop, because the quantity of data available is increasing very rapidly. The amount of expertise required to make that work is correspondingly higher.

Change in database structure

We already have a database structure, but it is not normalized, it is very confusing, and it needs to change. However, it already holds a large volume of data, for example all of the company's financial data, which the finance department staff are afraid of losing.
We are undecided between remodeling the entire structure of the database, migrating the most essential data and whatever else is possible, or continuing with the same model along with its problems.
I wonder if anyone has made a change like this, and whether you can actually transfer the data to a new structure.
thanks
Before you do anything, I would BACKUP!!! Next I would create a new database with the ideas that you had in mind. Remember, this is where all the real work should be; once it is created it is hard to go back. Put a lot of thought in and make the design bulletproof and a good fit for your company. Next, create some procedures to transform the data you have into the new database as you see fit. It would help if you mentioned the platform(s) you are using and maybe provided some generic examples.
I have found SSIS packages work well for projects like this if you are using SQL Server. While you will still need to write your transforms, the packages make it easier for others to see what is happening.
Anything can be done by you, the developer. However, it might make business sense to check out various 3rd party tools. There are many out there, and depending on exactly what you are doing you may benefit from doing some research.
Yes, it's called "database conversion". It is a very common practice, but it must be done carefully and methodically, ideally by someone who has done many of them and knows the pitfalls. It is not to be done casually by any means. Moreover, it is not unusual in the financial sector to run the "old system" in parallel with the new system for a couple of months, to reconcile month-end reports, before saying goodbye to the old system. Running parallel is a PITA, and can only be done if all of the conversion programs are in place, but it's better to be safe than sorry when the numbers must be correct to the penny.
I had the same problem. I solved it by designing a new database, then writing a script that copies the data from the old schema to the new one. It's not an easy task, because you need to take care over what you are copying from the old model to the new one, but it's doable!
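For a feel of what such a copy script looks like, here is a minimal sketch. The platform was never stated in the question, so this uses SQLite through Python purely as a stand-in, and both schemas (old_invoice, customer, invoice) are invented. The part worth copying is the pattern: do the whole move in one transaction and reconcile counts and totals before committing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical old table: customer details repeated on every row.
CREATE TABLE old_invoice (invoice_no INTEGER, customer_name TEXT,
                          city TEXT, amount_cents INTEGER);
INSERT INTO old_invoice VALUES
    (1, 'Acme',  'Leeds', 12500),
    (2, 'Acme',  'Leeds',  9900),
    (3, 'Birch', 'York',   4050);

-- Hypothetical new, normalized schema.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       name TEXT NOT NULL, city TEXT,
                       UNIQUE (name, city));
CREATE TABLE invoice (invoice_no   INTEGER PRIMARY KEY,
                      customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
                      amount_cents INTEGER NOT NULL);
""")

def migrate(conn):
    """Copy old rows into the new schema; roll back if totals do not match."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO customer (name, city) "
                     "SELECT DISTINCT customer_name, city FROM old_invoice")
        conn.execute("""
            INSERT INTO invoice (invoice_no, customer_id, amount_cents)
            SELECT o.invoice_no, c.customer_id, o.amount_cents
            FROM old_invoice o
            JOIN customer c ON c.name = o.customer_name AND c.city = o.city
        """)
        old = conn.execute(
            "SELECT COUNT(*), SUM(amount_cents) FROM old_invoice").fetchone()
        new = conn.execute(
            "SELECT COUNT(*), SUM(amount_cents) FROM invoice").fetchone()
        # Rows with a NULL city would be dropped by the join above;
        # the reconciliation catches that kind of silent loss.
        if old != new:
            raise RuntimeError(f"reconciliation failed: old={old}, new={new}")
    return new

migrated = migrate(conn)
customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Amounts are stored as integer cents here so that the before and after totals can be compared exactly.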
Absolutely, you can migrate the data to a new structure. The real question is 'how difficult (expensive/time consuming/reliable) will the migration be?' To answer that question one would have to know:
The accuracy of the existing data - does it have gaps, duplication that disagrees with each other and no way to resolve, errors, etc.
What structure do you imagine going to and is this going to introduce complexity to the migration
the skill level of the person/team doing the migration
How long the migration will take and will the platforms be changing (either the live system being modified or the new system design changing)

What arguments to use to explain why SQL Server is far better than a flat file

The higher-ups in my company were told by good friends that flat files are the way to go, and we should switch from SQL Server to them for everything we do. We have over 300 servers and hundreds of different databases. From just the few I'm involved with we have > 10 billion records in quite a few of them with upwards of 100k new records a day and who knows how many updates... Me and a couple others need to come up with a response saying why we shouldn't do this. Most of our stuff is ASP.NET with some legacy ASP. We thought of making a simple console app that tests/times the same interactions between a flat file (stored on the network) and SQL over the network, doing large inserts, searches, updates etc., along with things like random network disconnects. This would show them how bad flat files can be, especially when you are dealing with millions of records.
What things should I use in my response? What should I do with my demo code to illustrate this?
My short list so far:
Security
Concurrent access
Performance with large amounts of data
Amount of time to do such a massive rewrite/switch and huge $ cost
Lack of transactions
PITA to map relational data to flat files
NTFS doesn't support tons of files in a directory well
Lack of Adhoc data searching/manipulation
Enforcing data integrity
Recovery from network outage
Client delay while waiting for other clients changes to commit
Most everybody stopped using flat files for this type of storage long ago for good reason
Load balancing/replication
I fear that this will be a great post on the Daily WTF someday if I can't stop it now.
Additionally
Does anyone know if anything about HIPAA could be used in this fight? Many of our records are patient records...
1) Data integrity. First, you can enforce it in a database and cannot in a flat file. Second, you can ensure you have referential integrity between different entities to prevent orphaning rows.
2) Efficiency in storage depending on the nature of the data. If the data is naturally broken into entities, then a database will be more efficient than lots of flat files from the standpoint of the additional code that will need to be written in the case of flat files in order to join data.
3) Native query capabilities. You can query against a database natively whereas you cannot with a flat file. With a flat file you have to load the file into some other environment (e.g. a C# application) and use its capabilities to query against it.
4) Format integrity. The database format is more rigid which means more consistent. A flat file can easily change in a way that the code that reads the flat file(s) will break. The difference is related to #3. In a database, if the schema changes, you can still query against it using native tools. If the flat file format changes, you have to effectively do a search because the code that reads it will likely be broken.
5) "Universal" language. SQL is somewhat ubiquitous whereas the structure of the flat file is far more malleable.
I'd also mention data corruption. Most modern SQL databases can have the power killed on the server, or have the server instance crash, and you won't (shouldn't) lose data. Flat files aren't really that way.
Also I'd mention search times. Perhaps even write a simple flat file database with 1mil entries and show search times vs MS SQL. With indexes you should be able to search a SQL database thousands of times faster.
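A minimal sketch of that kind of comparison follows. It is not the demo from the question: SQLite via Python stands in for MS SQL, the file lives on local disk rather than the network, and the record count is cut to 100,000 so it runs quickly. The file layout and names are invented. Treat the timings it prints as illustrative only; the query plan is the more reliable evidence that the database is not scanning every row.

```python
import os
import sqlite3
import tempfile
import time

N = 100_000      # kept small so the sketch runs quickly; the answer suggests 1 million
target = N - 1   # worst case for a linear scan: the last record

# Hypothetical flat file: one "id,name" record per line.
path = os.path.join(tempfile.mkdtemp(), "records.txt")
with open(path, "w") as f:
    for i in range(N):
        f.write(f"{i},name{i}\n")

def flat_file_lookup(key):
    """Scan the file line by line until the key is found."""
    with open(path) as f:
        for line in f:
            k, value = line.rstrip("\n").split(",", 1)
            if int(k) == key:
                return value
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO record VALUES (?, ?)",
                 ((i, f"name{i}") for i in range(N)))
conn.execute("CREATE INDEX record_id_idx ON record (id)")
conn.commit()

def indexed_lookup(key):
    """Look the key up through the index instead of scanning."""
    row = conn.execute("SELECT name FROM record WHERE id = ?", (key,)).fetchone()
    return row[0] if row else None

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

flat_value, flat_secs = timed(flat_file_lookup, target)
db_value, db_secs = timed(indexed_lookup, target)
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM record WHERE id = ?", (target,)).fetchall()

print(f"flat file scan: {flat_secs:.4f}s, indexed lookup: {db_secs:.6f}s")
print(plan[0][-1])  # shows whether the index was used
```

For the real demo you would want both stores on the network, as the question describes, since that is where the flat file is likely to suffer most.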
I'd also be careful how quickly you write off flat files. I'd go so far as saying "it's a good idea for many cases, but in our case....". This way you won't sound like you're not listening to the other views. Tact in situations like this is a major thing to consider. They may be horribly wrong, but you have to convince your boss of that.
What do they gain from using flat files? The conversion process will be hundreds of hours - hours they pay for. How quickly can flat files generate a positive return on that investment? Provide a rough cost estimate. Translate the technical considerations into money (costs), and it puts the problem in their perspective.
On top of just the data conversion, add in the hidden costs for duplicating a database's capabilities...
Indexing
Transaction processing
Logging
Access control
Performance
Security
Databases allow you to easily index your data to be able to find particular records or groups of records by searching any number of different columns.
With flat files you have to write your own indexing mechanisms. There is no need to do all that work again when the database does it for you already.
If you use "text files", you'll need to build an interface on top of it which Microsoft has already done for you and called it SQL Server.
Ask your managers if it makes sense to your company to spend all these resources building a home-made database system (because really that's what it is), or would these resources be better spent focusing on the business.
Performance: SQL Server is built for storing conveniently searchable data. It has optimized data structures in memory built with searching/inserting/deleting in mind. Usage of the disk is lowered, as data regularly queried is kept in memory.
Business partners: if you ever plan to do B2B with 3rd party companies, SQL Server has built-in functionality for it called Linked Servers. If you have only a bunch of files, your business partner will give up on you as no data interconnection is possible. Unless you want to re-invent the wheel again, and build an interface for each business partner you have.
Clustering: you can easily cluster servers in SQL Server for high availability and speed, a lot more than what's possible with text based solution.
You have a nice start to your list. The items I would add include:
Data integrity - SQL engines provide built-in mechanisms (relationships, constraints, triggers, etc.) that make it very simple to reduce the amount of "bad" data in your system. You would need to hand code all data constraint separately if you use flat files.
Ad hoc data retrieval - SQL engines, through the use of SELECT statements, provide a means of filtering and summarizing your data with very little code. If you are using flat files, considerably more code is needed to get the same results.
These items can be replicated if you want to take the time to build a data engine, but what would be the point? SQL engines already provide these benefits.
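To show how little code the built-in mechanisms take, here is a minimal sketch of a relationship (foreign key) rejecting orphan rows. It uses SQLite through Python as a stand-in for SQL Server, and the patient and visit tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Hypothetical tables, for illustration only.
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE visit (
    visit_id   INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    visit_date TEXT    NOT NULL
);
INSERT INTO patient VALUES (1, 'Test Patient');
INSERT INTO visit   VALUES (10, 1, '2024-01-15');
""")

def try_sql(sql):
    """Run one statement and report whether the engine accepted it."""
    try:
        with conn:
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# A visit for a patient that does not exist.
orphan_insert = try_sql("INSERT INTO visit VALUES (11, 999, '2024-01-16')")
# Deleting a patient who still has visits would orphan those rows.
parent_delete = try_sql("DELETE FROM patient WHERE patient_id = 1")

print(orphan_insert)
print(parent_delete)
```

With flat files, each of these checks would have to be written by hand in every program that touches the data.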
I don't think I can even start to list the reasons. I think my head is going to explode. I'll take the risk though to try to help you...
Simulate a network outage and show what happens to one of the files at that point
Demo the horrors of a half-committed transaction because text files don't pass the ACID test
If it's a multi-user application, show how long a client has to wait when 500 connections are all trying to update the same text file
Try to politely explain why the best approach to making business decisions is to listen to the professionals who you are paying money and who know the domain (in this case, IT) and not your buddy who doesn't have a clue (maybe leave out that last bit)
Mention the fact that 99% (made up number) of the business world uses relational databases for their important data, not text files and there's probably a reason for that
Show what happens to your application when someone goes into the text file and types in "haha!" for a column that's supposed to be an integer
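Two of those demos have an obvious database-side counterpart, sketched below with SQLite via Python standing in for SQL Server. The account table is invented. The first part shows a two-step update being rolled back as a unit when the second step fails; the second shows a column constraint refusing "haha!" in an integer column. Note that SQLite is loosely typed by default, so the type check is spelled out in a CHECK constraint here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table, for illustration only.
CREATE TABLE account (
    account_id INTEGER PRIMARY KEY,
    balance    INTEGER NOT NULL
               CHECK (typeof(balance) = 'integer' AND balance >= 0)
);
INSERT INTO account VALUES (1, 100), (2, 0);
""")

def transfer(src, dst, amount):
    """Move money in one transaction: both updates happen or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE account SET balance = balance + ? WHERE account_id = ?",
                     (amount, dst))
        conn.execute("UPDATE account SET balance = balance - ? WHERE account_id = ?",
                     (amount, src))

def balances():
    return dict(conn.execute("SELECT account_id, balance FROM account"))

# The credit succeeds, the debit would overdraw account 1, so both are undone.
try:
    transfer(1, 2, 500)
except sqlite3.IntegrityError:
    pass
after_failed_transfer = balances()

# Someone "types haha! into the integer column".
try:
    with conn:
        conn.execute("UPDATE account SET balance = 'haha!' WHERE account_id = 1")
    bad_value_accepted = True
except sqlite3.IntegrityError:
    bad_value_accepted = False

print(after_failed_transfer, bad_value_accepted)
```

A text file has no equivalent of either safeguard unless you write it yourself.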
If you are a public company, the shareholders would be well served to know this is being seriously contemplated. "We" all know this is a ridiculous suggestion given the size and scope of your operation. Patient records must be protected, not only from security breaches but from irresponsible exposure to loss - lives may depend upon the data. If the Executives care at all about the patients, THIS should be their highest concern.
I worked with IBM 370 mainframes from '74 onwards and the day that DB2 took over from plain old flat files, VSAM and ISAM was a milestone day. Haven't looked back to flat-file storage, except for streaming data, in my 25 years with RDBMSs of 4 flavors.
If I owned stock in "you", dumping it in a hurry the moment the project took off would seem appropriate...
Your list is a great start of reasons for sticking with a database.
However, I would recommend that if you're talking to a non-technical person, you shy away from technical reasons in a recommendation, because they might come across as biased.
Here are my 2 points against flat file data storage:
1) Security - HIPAA audits require that patient data remain in a secure environment. The common database systems (Oracle, Microsoft SQL, MySQL) have methods for implementing HIPAA compliant security access. Doing so on a flat-file would be difficult, at best.
Side note: I've also seen medical practices that encrypt the patient name in the database to add extra layers of protection & compliance to ensure even if their DB is compromised that the patient records are not at risk.
2) Reporting - Reporting from any structured database system is simple and common. There are hundreds of thousands of developers that can perform this task. Reporting from flat-files will require an above-average developer. And, because there is no generally accepted method for doing reporting off of a flat-file database, one developer might do things different than another. This could impact the talent pool able to work on a home-grown flat-file system, and ultimately drive costs up by having to support that type of a system.
I hope that helps.
How do you create a relational model with plain text files?
Or are you planning to use a different file for each entity?
Pro file system:
Stable (less lines of code = less bugs, easier to understand, more reliable)
Faster with huge data blobs
Searching/sorting is somewhat slow (but sort can be faster than SQL's order by)
So you'd choose a filesystem to create log files, for example. Logging into a DB is useless unless you need to do complex analysis of the data.
Pro DB:
Transactions (which includes concurrent access)
It can search through huge amounts of records (but not through huge blobs of data)
Chopping the data in all kinds of ways with queries is easy (well, if you know your SQL and the special "oddities" of your DB)
So if you need to add data rarely but search it often, select parts of it by certain criteria or aggregate values, a DB is for you.
NTFS does not support mass amounts of .txt files well. Depending on how a flat file system is developed, the health of a harddrive can become an issue. A lot of older file systems use mass amount of small .txt files to store data. It's bad design, but tends to happen as a flat file system gets older.
Fragmentation becomes an issue, and you lose a text file here and there, causing you to lose small amounts of data. Health of a hard drive should not be an issue when it comes to database design.
This is indeed, on the part of your employer, a MAJOR WTF if he's seriously proposing flat files for everything...
You already know the reasons (oh - add Replication / Load Balancing to your list) - what you need to do now is to convince him of them. My approach on this would be twofold.
First of all, I would write a script in whatever tool you currently use to perform a basic operation using SQL, and have it timed. I would then write another script in which you sincerely try to get a flat text solution working, and then highlight the difference in performance. Give him both sets of code so he knows you aren't cheating.
Point out that technology evolves, and that just because someone was successful 20 years ago, this does not automatically entitle them to a credible opinion now.
You might also want to mention the scope for errors in decoding / encoding information in text files, that it would be trivial for someone to steal them, and the costs (justify your estimate) in adapting the current code base to use text files.
I would then ask serious questions of management - foremost amongst them, and I would ask this DIRECTLY, is "Why are you prepared to overrule your technical staff on technical matters" based on one other individual's opinion - especially when said individual is not as familiar with our set up as we are...
I'd also then use the phrase "I do not mean to belittle you, but I seriously feel I have to intervene at this point for the good of the company..."
Another approach - turn the tables - have Mr. Wonderful supply arguments as to why text files are the way forward. You'll then either a) Learn something (not likely), or b) Be in a position to utterly destroy his arguments.
Good luck with this - I feel your pain...
Martin
I suggest you get your retaliation in first: post on Daily WTF now.
As to your question: a business reason would be why does your boss want to rewrite all your systems. From scratch as you would, effectively, have to write your own database system.
For a development reason, you would lose access to the SQL server ecosystem, all the libraries, tools, utilities.
Perhaps the guy that suggested this is actually thinking of going into competition with your company.
Simplest way to refute this argument - name a fortune 500 company that processes data on this scale using flat files?
Now name a fortune 500 company that doesn't use a relational database...
Case closed.
Something is really fishy here. For someone to get the terminology right ("flat file") but not know how overwhelmingly stupid an idea that is, it just doesn't add up. I would be willing to bet your manager is non-technical, but the person your manager is talking to is. This sounds more like a lost in translation problem.
Are you sure they don't mean NoSQL? If you are in a document-centric environment, moving away from a relational database actually does make sense in some regards, while still keeping many of the positives of a traditional RDBMS.
So, instead of justifying why SQL is better than flat files, I would invert the problem and ask what problems flat files are meant to solve. I would put odds on money that this is a communication problem.
If it's not, and your company is actually considering replacing its DB with a home grown flat file system off the recommendation of "a friend", convincing your manager why he is wrong is the least of your worries. Instead, dust off and start circulating your resume.
•Amount of time to do such a massive rewrite/switch and huge $ cost
It's not just the amount of time, it is the introduction of new bugs. A rewrite of these proportions would cause things that currently work to break.
I'd suggest giving him a cost estimate of the hours to do such a rewrite for just one system and then the number of systems that would need to change. Once they have a cost estimate, they will run from this as fast as they can.
Managers like numbers, so do a formal written decision analysis. Compare the two proposals by benefits and risks, side by side with numeric values. When you get to cost 0 to maintain and 100,000,000 to convert they will get the point.
People who can't distinguish between flat files and SQL won't understand any of the arguments listed above.
The explanation must be as simple as possible, something like this:
SQL is a kind of search/concurrency wrapper around flat files.
All the problems that exist today will remain even if the company writes that wrapper from scratch.
You must also offer some other way to resolve the current problems; use smart words like advanced BLL or install/uninstall scripting environment. :)
You have to speak executive. Without saying it, make them realize they're in way over their heads here. Here's some ammunition:
Database theory is hardcore computer science. We're talking about building a scalable system that can handle millions of records and tolerate disasters without putting everyone out of business.
This is the work of PhD-level specialists. They've been refining the field for a good 20 years now, and the great thing about that is this: it allows us to specialize in building business systems.
If you have to, come right out and say that this just isn't done in the enterprise. It would be costly and the result would be inferior. It's exactly the kind of wheel that developers love to reinvent, and in my opinion the only time you should is if the result is going to be a product or service that you can sell. And it won't be.

Track changes on a database

I am not sure whether this has been asked before; I did a few searches but nothing appropriate showed up.
OK, now my problem:
I want to migrate an old application to a different programming language. The only requirement we have is to keep the database structure stable. So no changes in my database schema. For the rest of the application I am basically reimplementing everything from scratch without reusing old code.
My Idea: in order to verify my new code was to let users do certain actions or workflows, capture the state of the database before that and after that and then maybe create unit tests with the help of this data. Does anyone know an elegant solution to keep track of these changes? Copying the database (>10GB) is pretty expensive. I also can't modify the code of the old application in which the users will be performing these sample actions. I have to keep it on the database level.
My database is Oracle 10g.
You could capture the old application behavior with a trace and then validate the changes against your new code. But, honestly, trying to write a new application by capturing the data modifications it makes and then imitating that will be a very difficult task, as the inputs and the outputs to the original application are not guaranteed to be stateless (that is, the old application might do the same thing the first 1,000,000 times it is given a certain set of inputs and do something completely different on the 1,000,001st run).
Your best bet is to start over with the business requirements and use the old application as a functional reference.
Take a look at Oracle Flashback Queries.
It lets you run queries that return past data. The timeframe is limited, but it can be very useful.
In 10g the only way to do this is with FLASHBACK queries. In 11g we can do this with RAT (Real Application Testing). RAT is quite useful for these scenarios and also for load and volume testing.

How to approach an ETL mission?

I am supposed to perform ETL where the source is a large and badly designed SQL 2k database and the destination is a better designed SQL 2k5 database. I think SSIS is the way to go. Can anyone suggest a to-do list or a checklist of things to watch out for so that I don't forget anything? How should I approach this so that it does not bite me in the rear later on?
Some general ETL tips
Consider organising it by
destination (for example, all the
code to produce the Customer
dimension lives in the same module, regardless of source).
This is sometimes known as
Subject-oriented ETL. It makes
finding stuff much easier and will
increase the maintainability of your
code.
If the SQL2000 database is a mess,
you will probably find that SSIS
data flows are a clumsy way to deal
with the data. As a rule, ETL tools
scale poorly with complexity;
something like half of all data
warehouse projects in finance
companies are done with stored
procedure code as an explicit
architectural decision - for precisely this reason. If you have
to put a large amount of code in
sprocs, consider putting all of the
code in sprocs.
For a system involving lots of complex scrubbing or transformations, a 100% sproc approach is far more maintainable as it is the only feasible way to put all of the transformations and business logic in one place. With mixed ETL/sproc systems, you have to look in multiple places to track, troubleshoot, debug or change the whole transformation.
The sweet spot of ETL tools is on systems where you have a larger number of data sources with relatively simple transformations.
Make the code testable, so you can
pick apart the components and test
in isolation. Code that can only be executed from within the middle of a complex data flow in an ETL tool is much harder to test.
Make the data extract dumb with no
business logic, and copy into a
staging area. If you have business
logic spread across the extract and
transform layers, you will have
transformations that cannot be tested
in isolation and make it hard to
track down bugs. If the transform is
running from a staging area you
reduce the hard dependency on the
source system, again enhancing
testability. This is a particular win on sproc-based architectures as it allows an almost completely homogeneous code base.
Build a generic slowly-changing dimension handler or use one off the shelf if available. This makes it easier to unit test this functionality. If this can be unit tested, the system testing does not have to test all of the corner cases, merely whether the data presented to it is correct. This is not as complex as it sounds - The last one I wrote was about 600 or 700 lines of T-SQL code. The same goes for any generic scrubbing functions.
Load incrementally if possible.
Instrument your code - have it make log entries, possibly recording diagnostics such as check totals or counts. Without this, troubleshooting is next to impossible. Assertion checking is also a good way to think about error handling here (does the row count in A equal the row count in B? Is the A:B relationship really 1:1?).
Use synthetic keys. Using natural keys from the source systems ties your system to the data sources, and makes it difficult to add extra sources. The keys and relationships in the system should always line up - no nulls. For errors or 'not recorded' values, create specific 'error' or 'not recorded' entries in the dimension table and match to them.
If you build an Operational Data Store (the subject of many a religious war) do not recycle the ODS keys in the star schemas. By all means join on ODS keys to construct dimensions, but match on a natural key. This allows you to arbitrarily drop and recreate the ODS - possibly changing its structure - without disturbing the star schemas. Having this capability is a real maintenance win, as you can change ODS structure or do a brute-force re-deployment of the ODS at any point.
Points 1-2 and 4-5 mean that you can build a system where all of the code for any given subsystem (e.g. a single dimension or fact table) lives in one and only one place in the system. This type of architecture is also better for larger numbers of data sources.
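As a rough illustration of points 4 and 5, here is a minimal sketch in Python with SQLite (chosen only so that it runs standalone; the same shape applies to sprocs). The table names `customers`, `stg_customers` and `clean_customers`, and both function names, are made up for the example. The extract copies rows verbatim, and all scrubbing lives in one transform that reads only from staging, so it can be tested by loading staging by hand.

```python
import sqlite3


def extract_to_staging(source, warehouse):
    """Dumb extract: copy source rows verbatim, with no business logic."""
    rows = source.execute(
        "SELECT cust_no, cust_name, status FROM customers"
    ).fetchall()
    warehouse.execute("DELETE FROM stg_customers")
    warehouse.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", rows)
    warehouse.commit()


def transform_customers(warehouse):
    """All scrubbing lives here and reads only from the staging table."""
    warehouse.execute("DELETE FROM clean_customers")
    warehouse.execute(
        """
        INSERT INTO clean_customers (cust_no, cust_name, is_active)
        SELECT cust_no,
               TRIM(cust_name),
               CASE UPPER(status) WHEN 'A' THEN 1 ELSE 0 END
        FROM stg_customers
        """
    )
    warehouse.commit()


# Testing the transform in isolation: no source system is needed,
# the test data goes straight into staging.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    CREATE TABLE stg_customers (cust_no TEXT, cust_name TEXT, status TEXT);
    CREATE TABLE clean_customers (cust_no TEXT, cust_name TEXT, is_active INTEGER);
    """
)
warehouse.execute("INSERT INTO stg_customers VALUES ('1', '  Acme ', 'a')")
transform_customers(warehouse)
result = warehouse.execute("SELECT * FROM clean_customers").fetchall()
```

Because the transform never touches the source connection, a unit test only has to populate `stg_customers` and inspect `clean_customers`.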
Point 3 is a counterpoint to point 2. Basically the choice between SQL and ETL tooling is a function of transformation complexity and number of source systems. The simpler the data and larger the number of data sources, the stronger the case for a tools-based approach. The more complex the data, the stronger the case for moving to an architecture based on stored procedures. Generally it's better to exclusively or almost exclusively use one or the other but not both.
Point 6 makes your system easier to test. Testing SCDs or any change-based functionality is fiddly, as you have to be able to present more than one version of the source data to the system. If you move the change management functionality into infrastructure code, you can test it in isolation with test data sets. This is a win in testing, as it reduces the complexity of your system testing requirements.
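To make point 6 concrete, here is a minimal sketch of a generic Type 2 handler, written in Python against SQLite rather than T-SQL so it can run standalone. It is not the 600-700 line handler mentioned above; the function name `apply_scd2`, the `valid_from`/`valid_to` columns and the `dim_customer` table are assumptions invented for the example. It assumes the dimension has an integer synthetic primary key and that the current version of a row is the one with `valid_to` set to NULL.

```python
import sqlite3


def apply_scd2(conn, dim_table, natural_key, tracked_cols, rows, load_date):
    """Expire changed rows and insert new versions (Type 2 history).

    rows is a list of dicts holding the natural key and tracked columns.
    Table and column names are trusted identifiers, not user input.
    """
    cols = [natural_key] + list(tracked_cols)
    select_sql = (
        f"SELECT rowid, {', '.join(tracked_cols)} FROM {dim_table} "
        f"WHERE {natural_key} = ? AND valid_to IS NULL"
    )
    insert_sql = (
        f"INSERT INTO {dim_table} ({', '.join(cols)}, valid_from, valid_to) "
        f"VALUES ({', '.join('?' for _ in cols)}, ?, NULL)"
    )
    expire_sql = f"UPDATE {dim_table} SET valid_to = ? WHERE rowid = ?"
    for row in rows:
        current = conn.execute(select_sql, (row[natural_key],)).fetchone()
        new_values = tuple(row[c] for c in tracked_cols)
        if current is not None and tuple(current[1:]) == new_values:
            continue  # nothing changed, keep the current version
        if current is not None:
            conn.execute(expire_sql, (load_date, current[0]))
        conn.execute(insert_sql, (row[natural_key], *new_values, load_date))
    conn.commit()


# Presenting several versions of the source data, as a unit test would.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dim_customer (
           customer_key INTEGER PRIMARY KEY,  -- synthetic warehouse key
           customer_id TEXT, city TEXT, valid_from TEXT, valid_to TEXT)"""
)
args = (conn, "dim_customer", "customer_id", ["city"])
apply_scd2(*args, [{"customer_id": "C1", "city": "Leeds"}], "2009-01-01")
apply_scd2(*args, [{"customer_id": "C1", "city": "York"}], "2009-02-01")
apply_scd2(*args, [{"customer_id": "C1", "city": "York"}], "2009-03-01")
history = conn.execute(
    "SELECT customer_key, city, valid_from, valid_to "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
```

The third call presents unchanged data, so it should leave the table alone; that is exactly the kind of corner case that is easier to cover here than in a full system test.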
Point 7 is a general performance tip that you will need to observe for large data volumes. Note that you may only need incremental loading for some parts of a system; for smaller reference tables and dimensions you may not need it.
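One common way to load incrementally is a high-water mark on a modification timestamp. The sketch below assumes the source table has a reliable `modified_at` column, which many legacy systems do not; the `orders` and `stg_orders` tables and the `load_incremental` function are hypothetical. Using a strict `>` comparison can miss rows that share the watermark timestamp, so treat this as a starting point only.

```python
import sqlite3


def load_incremental(source, target):
    """Copy only source rows modified after the newest row already staged."""
    (watermark,) = target.execute(
        "SELECT COALESCE(MAX(modified_at), '') FROM stg_orders"
    ).fetchone()
    changed = source.execute(
        "SELECT order_id, amount, modified_at FROM orders WHERE modified_at > ?",
        (watermark,),
    ).fetchall()
    # order_id is the primary key, so changed rows overwrite the staged copy.
    target.executemany(
        "INSERT OR REPLACE INTO stg_orders (order_id, amount, modified_at) "
        "VALUES (?, ?, ?)",
        changed,
    )
    target.commit()
    return len(changed)


source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2009-01-01"), (2, 20.0, "2009-01-02")],
)
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE stg_orders (order_id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"
)

first = load_incremental(source, target)   # empty staging, so everything loads
source.execute(
    "UPDATE orders SET amount = 25.0, modified_at = '2009-01-03' WHERE order_id = 2"
)
source.execute("INSERT INTO orders VALUES (3, 30.0, '2009-01-03')")
second = load_incremental(source, target)  # only the changed and new rows
third = load_incremental(source, target)   # nothing changed since last load
```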
Point 8 is germane to any headless process. If it goes tits up during the night, you want some fighting chance of seeing what went wrong the next day. If the code doesn't properly log what's going on and catch errors, you will have a much harder job troubleshooting it.
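As a sketch of what such instrumentation might look like, the hypothetical `check_load` function below logs row counts and a check total for a staging table and its target, and raises if they disagree. The table names and the `LoadCheckError` exception are assumptions for the example; amounts are held as integer cents so the totals can be compared exactly.

```python
import logging
import sqlite3

log = logging.getLogger("etl")


class LoadCheckError(Exception):
    """Raised when a post-load assertion fails."""


def check_load(conn, source_table, target_table, amount_col):
    """Log counts and a check total; fail loudly if the two tables disagree."""
    query = "SELECT COUNT(*), COALESCE(SUM({col}), 0) FROM {table}"
    src_count, src_total = conn.execute(
        query.format(col=amount_col, table=source_table)
    ).fetchone()
    tgt_count, tgt_total = conn.execute(
        query.format(col=amount_col, table=target_table)
    ).fetchone()
    log.info(
        "%s -> %s: rows %d/%d, check total %d/%d",
        source_table, target_table, src_count, tgt_count, src_total, tgt_total,
    )
    if src_count != tgt_count:
        raise LoadCheckError(f"row count mismatch: {src_count} vs {tgt_count}")
    if src_total != tgt_total:
        raise LoadCheckError(f"check total mismatch: {src_total} vs {tgt_total}")
    return tgt_count, tgt_total


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_sales (sale_id INTEGER, amount_cents INTEGER);
    CREATE TABLE fact_sales (sale_id INTEGER, amount_cents INTEGER);
    INSERT INTO stg_sales VALUES (1, 500);
    INSERT INTO stg_sales VALUES (2, 750);
    INSERT INTO fact_sales VALUES (1, 500);
    INSERT INTO fact_sales VALUES (2, 750);
    """
)
ok = check_load(conn, "stg_sales", "fact_sales", "amount_cents")
```

In a real nightly job the log would go to a file or a log table, so the counts from the failed run are still there the next morning.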
Point 9 gives the data warehouse a life of its own. You can easily add and drop source systems when the warehouse has its own keys. Warehouse keys are also necessary to implement slowly changing dimensions.
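A minimal sketch of point 9, again in Python with SQLite: the dimension carries its own synthetic key, the source system's natural key is just an attribute, and a reserved 'Not recorded' row means fact rows never end up with a null key. The `dim_product` table, the reserved key value 0 and the `product_key` lookup function are all assumptions for illustration.

```python
import sqlite3

NOT_RECORDED_KEY = 0  # hypothetical reserved key for the 'Not recorded' row

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,  -- warehouse-owned synthetic key
        source_system TEXT NOT NULL,
        source_code   TEXT NOT NULL,        -- natural key from that source
        name          TEXT NOT NULL,
        UNIQUE (source_system, source_code)
    );
    INSERT INTO dim_product VALUES (0, 'n/a', 'n/a', 'Not recorded');
    INSERT INTO dim_product VALUES (1, 'erp', 'P-100', 'Widget');
    INSERT INTO dim_product VALUES (2, 'crm', '77', 'Widget (CRM)');
    """
)


def product_key(conn, source_system, source_code):
    """Map a source natural key to the warehouse key, never returning NULL."""
    row = conn.execute(
        "SELECT product_key FROM dim_product "
        "WHERE source_system = ? AND source_code = ?",
        (source_system, source_code),
    ).fetchone()
    return row[0] if row else NOT_RECORDED_KEY


known = product_key(conn, "erp", "P-100")
second_source = product_key(conn, "crm", "77")
missing = product_key(conn, "erp", None)  # no code supplied by the source
```

Adding a second source system here only meant adding rows with a different `source_system`; nothing keyed on `product_key` had to change.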
Point 10 is a maintenance and deployment win, as the ODS can be re-structured if you need to add new systems or change the cardinality of a record. It also means that a dimension can be loaded from more than one place in the ODS (think: adding manual accounting adjustments) without a dependency on the ODS keys.
I have experience with ETL processes pulling data from 200+ distributed databases to a central database on a daily, weekly, monthly and yearly basis. It is a massive amount of data and there are many issues we have had specific to our situation. But as I see it, there are several items to think about regardless of the situation:
Make sure that you take file locks into consideration, both on the source and destination side. Making sure that other processes do not have the files locked (and removing those locks if necessary and it makes sense) is important.
Lock the files for yourself. Especially on the source side, make sure that you lock the files while pulling out the data so that you do not get half-updated data.
If at all possible, pull deltas, not all of the data. Get a copy of the data and then pull only rows that have changed instead of everything. The larger your data set, the more important this becomes. Look at journals and triggers if you have to. As it becomes more important to have this data on a regular basis, this is probably the number one piece of advice I would give you, even if it adds a significant amount of time to the project.
Keep an execution log. Make sure you know when it worked and when it didn't; throwing specific errors in the process can really help in debugging.
Document, document, document. If you build this right, you are going to build it and then not think about it for a long time. But you can be guaranteed that you or someone else will need to come back to it at some point to enhance it or do a bug fix. Documentation is key in these situations.
HTH, I'll update this if I think of anything else.
Well, I'm developing an ETL for the company where I work.
We are working with SSIS, using the API to generate and build our own dtsx packages.
SSIS is not friendly when it comes to managing errors. Sometimes you get an "OleDb Error" that can have a lot of different meanings depending on the context.
Read the API documentation (it doesn't say much).
Some links to help you out starting there:
http://technet.microsoft.com/de-de/library/ms135932(SQL.90).aspx
http://msdn.microsoft.com/en-us/library/ms345167.aspx
http://msdn.microsoft.com/en-us/library/ms403356.aspx
http://www.codeproject.com/KB/database/SSISProgramming.aspx?display=PrintAll&fid=382208&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=26&select=2551674
http://www.codeproject.com/KB/database/foreachadossis.aspx
http://wiki.sqlis.com/default.aspx/SQLISWiki/ComponentErrorCodes.html
http://technet.microsoft.com/en-us/library/ms187670.aspx
http://msdn.microsoft.com/ja-jp/library/microsoft.sqlserver.dts.runtime.foreachloop.foreachenumerator.aspx
http://www.sqlis.com/post/Handling-different-row-types-in-the-same-file.aspx
http://technet.microsoft.com/en-us/library/ms135967(SQL.90).aspx
http://msdn.microsoft.com/en-us/library/ms137709(SQL.90).aspx
http://msdn.microsoft.com/en-us/library/ms345164(SQL.90).aspx
http://msdn.microsoft.com/en-us/library/ms141232.aspx
http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx
http://www.ivolva.com/ssis_code_generator.html
http://www.ivolva.com/ssis_wizards.html
http://www.codeplex.com/MSFTISProdSamples
http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/Q_23972361.html
http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&PostID=1404157
http://msdn.microsoft.com/en-us/library/aa719592(VS.71).aspx
http://forums.microsoft.com/MSDN/MigratedForum.aspx?siteid=1&ForumID=80
http://blogs.conchango.com/jamiethomson/archive/2005/06/11/SSIS_3A00_-Custom-Logging-Using-Event-Handlers.aspx
http://blogs.conchango.com/jamiethomson/archive/2007/03/13/SSIS_3A00_-Property-Paths-syntax.aspx
http://search.live.com/results.aspx?q=%s&go=Buscar&form=QBJK&q1=macro%3Ajamiet.ssis
http://toddmcdermid.blogspot.com/2008/09/using-performupgrade.html?showComment=1224715020000
http://msdn.microsoft.com/en-us/library/ms136082.aspx
http://support.microsoft.com/kb/839279/en-us
Sorry for the "spam", but they are very useful to me.
We're doing a huge ETL (moving a client from legacy AS400 apps to Oracle EBS), and we actually have a process that (with modifications) I can recommend:
1. Identify the critical target tables/fields.
2. Identify the critical source tables/fields.
3. Work with the business users to map source to target.
4. Analyze the source data for quality issues.
5. Determine who's responsible for data quality issues identified.
6. Have responsible parties clean up the data in the source.
7. Develop the actual ETL based on the information from steps 1 - 3.
The trickiest steps are 2 & 3 in my experience - it's sometimes difficult to get the business users to correctly identify all the bits they need in one pass, and can be even harder to properly identify exactly where the data is coming from (though that may have something to do with cryptic file and field names that I'm seeing!). However, this process should help you avoid major misses.
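For the data quality analysis in step 4, a simple per-column profile is often enough to start the conversation with the business users. The sketch below is only an illustration in Python with SQLite; the `profile_column` function and the `legacy_items` table with its `item_code` column are hypothetical, and a real AS400 source would need its own connection and SQL dialect.

```python
import sqlite3


def profile_column(conn, table, column):
    """Count rows, NULLs, blank strings and distinct values for one column.

    Table and column names are trusted identifiers, not user input.
    """
    total, nulls, blanks, distinct = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"SUM(CASE WHEN TRIM({column}) = '' THEN 1 ELSE 0 END), "
        f"COUNT(DISTINCT {column}) "
        f"FROM {table}"
    ).fetchone()
    # SUM returns NULL on an empty table, so fall back to 0.
    return {
        "rows": total,
        "nulls": nulls or 0,
        "blanks": blanks or 0,
        "distinct": distinct,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_items (item_code TEXT)")
conn.executemany(
    "INSERT INTO legacy_items VALUES (?)",
    [("A1",), (None,), ("  ",), ("A1",)],
)
profile = profile_column(conn, "legacy_items", "item_code")
```

Running this over each critical source field from step 2 gives a short list of fields with missing or duplicated values to hand to whoever owns the data in step 5.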
This thread is old, but I want to draw your attention to ConcernedOfTunbridgeWells' answer. It is incredibly good advice, on all points. I could reiterate a few, but that would diminish the rest, and they all deserve close study.

Resources