Advice on how to write robust data transfer processes? - database

I have a daily process that relies on flat files being delivered to a "drop box" directory on the file system. Their arrival kicks off a load of this comma-delimited data (exported from an external company's Excel, etc.) into a database via a piecemeal Perl/Bash application. That database is used by multiple applications and is also edited directly with some GUI tools. Some of the data is then replicated by another Perl app into the database that I mainly use.
Needless to say, all of that is complicated and error prone: incoming data is sometimes corrupt, and sometimes an edit breaks the chain. My users often complain about missing or incorrect data. Diffing the flat files and the databases to find where the process breaks is time consuming, and with each passing day the data becomes more out of date and harder to analyze.
I plan to fix or rewrite parts or all of this data transfer process.
Before I embark on this, I am looking for recommended reading: websites and articles on how to write robust, failure-resistant, and auto-recoverable ETL processes. Other advice would be appreciated as well.

This is precisely what Message Queue Managers are designed for. Some examples are here.
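For a concrete flavor of that idea, here is a minimal, hypothetical sketch (assuming a RabbitMQ broker and the Python pika client; the queue and file names are made up) of publishing a "file arrived" event to a durable queue. A separate consumer would load the file and acknowledge the message only after a successful load, so a crash mid-load leaves the message queued for retry:

```python
# Hypothetical sketch: instead of kicking off the load directly from the drop box,
# publish a "file arrived" event to a durable RabbitMQ queue via the pika client.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queue + persistent messages survive a broker restart.
channel.queue_declare(queue="incoming_files", durable=True)

def announce_file(path):
    """Tell downstream loaders that a new flat file is ready to be processed."""
    channel.basic_publish(
        exchange="",
        routing_key="incoming_files",
        body=json.dumps({"path": path}),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )

announce_file("/dropbox/customers_2023-04-01.csv")
connection.close()
```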

You don't say what database backend you have, but in SQL Server I would write this as an SSIS package. We have a system designed to also write data to a metadata database that tells us when the file was picked up, whether it processed successfully, and why if it did not. It also records things like how many rows the file had (which we can then use to determine whether the current row count is abnormal). One of the beauties of SSIS is that I can set up configurations on package connections and variables, so that moving the package from development to prod is easy (I don't have to go in and manually change the connections each time once I have a configuration set up in the config table).
In SSIS we do various checks to ensure the data is correct, or clean up the data before inserting it into our database. Actually, we do lots and lots of checks. Questionable records can be removed from the file processing and put in a separate location for the DBAs to examine and possibly pass back to the customer. We can also check whether the data in various columns (and the column names, if given; not all files have them) is what would be expected. So if the zipcode field suddenly has 250 characters, we know something is wrong and can reject the file before processing. That way, when the client swaps the lastname column with the firstname column without telling us, we can reject the file before importing 100,000 new incorrect records. In SSIS we can also use fuzzy logic to match against existing records. So if the record for John Smith says his address is 213 State St., it can match our record that says he lives at 215 State Street.
It takes a lot to set up a process this way, but once you do, the extra confidence that you are processing good data is worth its weight in gold.
Even if you can't use SSIS, this should at least give you some ideas of the types of things you should be doing to get the information into your database.
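If SSIS isn't available and you stay with your existing scripting, the same validation ideas translate directly. Here is a rough Python sketch, with invented column names and limits, that rejects a file with unexpected headers and quarantines questionable rows before anything touches the database:

```python
# Rough sketch of pre-load validation in plain Python: reject the whole file when
# the header is wrong, quarantine individual suspect rows, and pass only clean
# rows on to the database load. Column names and limits are invented examples.
import csv

EXPECTED_HEADER = ["firstname", "lastname", "zipcode"]
MAX_ZIP_LEN = 10

def validate(path, quarantine_path):
    good, bad = [], []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_HEADER:
            # e.g. the client swapped firstname/lastname without telling us
            raise ValueError(f"Unexpected columns {reader.fieldnames}; rejecting file")
        for row in reader:
            if len(row["zipcode"]) > MAX_ZIP_LEN or not row["lastname"].strip():
                bad.append(row)        # set aside for a DBA/human to examine
            else:
                good.append(row)
    with open(quarantine_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=EXPECTED_HEADER)
        writer.writeheader()
        writer.writerows(bad)
    return good                         # only clean rows proceed to the load step
```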

I found this article helpful for the error handling aspects of running cron jobs:
IBM DeveloperWorks article: "Build intelligent, unattended scripts"

Related

How and where to store the current customer purchasing history data?

I am now working on a project which requires showing the transaction history of a customer and whether the product the customer bought is still under warranty. I need to use the data from the current system, which provides a Web API that returns a .csv file. How can I make use of the current system's data?
A solution I have thought of is to download all the .csv files and write scripts to insert every record into the database I have built, which contains the necessary tables and relations to hold the data I retrieve. Then I would have the new database I want. Because I have never done this before, I want to know if it is feasible.
And one more question: should I store the data locally or use a cloud database like Firebase?
High-end databases like SQL Server and Oracle come with utilities that allow you to read directly from a csv file. Check the docs. Having done this many times, the best procedure I found was to read the file into one holding table. This gives you the chance to examine the data and find any unexpected quirks or missing fields. This allows you to correct the data, where possible.
Then write the scripts to move the data from the holding table into the proper tables you have designed. This must be done in a logical manner. For example, move the customer data before the buy transactions. Thus any error messages you get will not be because you tried to store a transaction before you stored the customer. (You will have referential integrity set up, yes?) This gives you more chances to correct or adjust the data or just identify problems more or less at your leisure.
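As a sketch of that two-step approach (using Python and sqlite3 purely for brevity; the file, table, and column names are invented), the key point is that the raw CSV lands in a holding table first, and the real tables are populated parents-before-children:

```python
# Step 1: dump the raw CSV into a holding (staging) table so it can be inspected
# and corrected. Step 2: move data in a logical order -- customers before their
# transactions -- so referential integrity errors point at real problems.
import csv, sqlite3

conn = sqlite3.connect("warranty.db")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS transactions (
    customer_id TEXT REFERENCES customers(id), product TEXT, bought_on TEXT);
CREATE TABLE IF NOT EXISTS staging (customer_id TEXT, name TEXT, product TEXT, bought_on TEXT);
""")

with open("transactions.csv", newline="") as f:
    rows = [(r["customer_id"], r["name"], r["product"], r["bought_on"])
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", rows)

# Customers first, so the foreign key on transactions has something to point at.
conn.execute("INSERT OR IGNORE INTO customers (id, name) "
             "SELECT DISTINCT customer_id, name FROM staging")
conn.execute("INSERT INTO transactions (customer_id, product, bought_on) "
             "SELECT customer_id, product, bought_on FROM staging")
conn.commit()
```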
Whether or not to store the data in the cloud is strictly according to the preferences of your employer.

Writing to PostgreSQL database format without using PostgreSQL

I am collecting lots of data from lots of machines. These machines cannot run PostgreSQL and they cannot connect to a PostgreSQL database. At the moment I save the data from these machines in CSV files and use the COPY FROM command to import the data into the PostgreSQL database. Even on high-end hardware this process takes hours. I was therefore thinking about writing the data directly in the format of PostgreSQL's database files. I would then simply copy these files into the /data directory and start the PostgreSQL server. The server would then find the database files and accept them as databases.
Is such a solution feasible?
Theoretically this might be possible if you studied the source code of PostgreSQL very closely.
But you essentially wind up (re)writing the core of PostgreSQL, which qualifies as "not feasible" from my point of view.
Edit:
You might want to have a look at pg_bulkload which claims to be faster than COPY (haven't used it though)
Why can't they connect to the database server? If it is because of library dependencies, I suggest that you set up some sort of client-server solution (web services, perhaps) that could queue and submit data along the way.
Relying on batch operations will always give you a headache when dealing with large amounts of data, and if COPY FROM isn't fast enough for you, I don't think anything will be.
Yeah, you can't just write the files out in any reasonable way. In addition to the data page format, you'd need to replicate the commit logs, part of the write-ahead logs, some transaction visibility parts, any conversion code for types you use, and possibly the TOAST and varlena code. Oh, and the system catalog data, as already mentioned. Rough guess, you might get by with only needing to borrow 200K lines of code from the server. PostgreSQL is built from the ground up around being extensible; you can't even interpret what an integer means without looking up the type information around the integer type in the system catalog first.
There are some tips for speeding up the COPY process at Bulk Loading and Restores. Turning off synchronous_commit in particular may help. Another trick that may be useful: if you start a transaction, TRUNCATE a table, and then COPY into it, that COPY goes much faster. It doesn't bother with the usual write-ahead log protection. However, it's easy to discover COPY is actually bottlenecked on CPU performance, and there's nothing useful you can do about that. Some people split the incoming file into pieces and run multiple COPY operations at once to work around this.
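A small sketch of that TRUNCATE-then-COPY trick from Python with psycopg2 (the DSN, table, and file names are placeholders):

```python
# Truncate and COPY inside one transaction so PostgreSQL can skip most of the
# usual WAL overhead for the COPY (when server settings permit).
import psycopg2

conn = psycopg2.connect("dbname=metrics user=loader")    # placeholder DSN
with conn:                                                # commits on success
    with conn.cursor() as cur:
        cur.execute("TRUNCATE measurements")              # same transaction as the COPY
        with open("measurements.csv") as f:
            cur.copy_expert("COPY measurements FROM STDIN WITH (FORMAT csv)", f)
conn.close()
```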
Realistically, pg_bulkload is probably your best bet, unless it too gets CPU bound--at which point a splitter outside the database and multiple parallel loading is really what you need.

Is it better to log to file or database?

We're still using old Classic ASP and want to log whenever a user does something in our application. We'll write a generic subroutine to take in the details we want to log.
Should we log this to, say, a txt file using FileSystemObject or log it to a MS SQL database?
In the database, should we add a new table to the one existing database or should we use a separate database?
Edit
In hindsight, a better answer is to log to BOTH the file system (first, immediately) and then to a centralized database (even if delayed). Most modern logging frameworks follow a publish-subscribe model (often called logging sources and sinks), which allows multiple logging sinks (targets) to be defined.
The rationale behind writing to the file system is that if an external infrastructure dependency such as the network, the database, or a security issue prevents you from writing remotely, you at least have a fallback, provided you can recover the data from the server's hard disk (something akin to a black box in the airline industry). Log data written to the file system can be deleted as soon as it is confirmed that the central database has recorded it, so file system retention sizes or rotation times generally need not be large.
Enterprise log managers like Splunk can be configured to scrape your local server log files (e.g. as written by log4net, the EntLib Logging Application Block, et al) and then centralize them in a searchable database, where data logged can be mined, graphed, shown on dashboards, etc.
But from an operational perspective, where it is likely that you will have a farm or cluster of servers, and assuming that both the local file system and remote database logging mechanisms are working, the 99% use case for actually trying to find anything in a log file will still be via the central database (ideally with a decent front end system to allow you to query, aggregate, graph and build triggers or notifications from log data).
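As an illustration of the both-sinks idea, here is a sketch using Python's standard logging module, with sqlite3 standing in for the central database (a real setup would more likely use log4net or a similar framework plus a log collector such as Splunk):

```python
# Two sinks on one logger: an immediate local file handler (the "black box")
# plus a database handler. If the DB write fails, the file still has the event.
import logging, sqlite3

class DatabaseHandler(logging.Handler):
    def __init__(self, path):
        super().__init__()
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS audit (created TEXT, level TEXT, message TEXT)")

    def emit(self, record):
        try:
            self.conn.execute("INSERT INTO audit VALUES (datetime('now'), ?, ?)",
                              (record.levelname, self.format(record)))
            self.conn.commit()
        except Exception:
            self.handleError(record)   # DB down? the file handler still recorded it

log = logging.getLogger("app")
log.setLevel(logging.INFO)
log.addHandler(logging.FileHandler("app.log"))      # local, immediate
log.addHandler(DatabaseHandler("central_logs.db"))  # centralized, queryable
log.info("user 42 updated invoice 1001")
```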
Original Answer
If you have the database in place, I would recommend using this for audit records instead of the filesystem.
Rationale:
typed and normalized classification of data (severity, action type, user, date ...)
it is easier to find audit data (select ... from Audits where ... ) vs Grep
it is easier to clean up (e.g. delete from Audits where Date ...)
it is easier to back up
The decision to use existing db or new one depends - if you have multiple applications (with their own databases) and want to log / audit all actions in all apps centrally, then a centralized db might make sense.
Since you say you want to audit user activity, it may make sense to audit in the same db as your users table / definition (if applicable).
I agree with the above, with the perhaps obvious exception of database failures, which would make logging to the database problematic. This has come up for me in the past when I was dealing with infrequent but regular network failovers.
Either works. It's up to your preference.
We have one central database where ALL of our apps log their error messages. Every app we write is set up in a table with a unique ID, and the error log table contains a foreign key reference to the AppId.
This has been a HUGE bonus for us in giving us one place to monitor errors. In the past we had done it with the file system or by sending emails to a monitored inbox, but we were able to create a fairly nice web app for interacting with the error logs. We have different error levels, and we have an "acknowledged" flag field, so we have a page where we can view unacknowledged events by severity, etc.
Looking at the responses, I think the answer may actually be both.
If it's a user error that's likely to happen during expected usage (e.g. user enters an invalid email etc.), that should go into a database to take advantage of easy queries.
If it's a code error that shouldn't happen (can't get username of a logged in user), that should be reserved for a txt file.
This also nicely splits the errors between non-critical and critical. Hopefully the critical error list stays small!
I'm creating a quick prototype of a new project right now, so I'll stick with txt for now.
On another note, email is great for this. Arguably you could just email them to a "bug" account and not store them locally. However this shares the database risk of bad connections.
Should we log this to say a txt file using FileSystemObject or log it to a MSSQL database?
Another idea is to write the log file in XML and then query it using XPath. I like to think that this is the best of both worlds.
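For example, here is a tiny sketch with Python's xml.etree.ElementTree (the file name is invented), writing entries and then pulling back only the errors with its limited XPath support:

```python
# Write log entries as XML elements, then filter them with an XPath expression.
import xml.etree.ElementTree as ET

root = ET.Element("log")
ET.SubElement(root, "entry", level="INFO",  user="alice").text = "logged in"
ET.SubElement(root, "entry", level="ERROR", user="bob").text = "could not resolve username"
ET.ElementTree(root).write("app-log.xml")

tree = ET.parse("app-log.xml")
for e in tree.findall(".//entry[@level='ERROR']"):
    print(e.get("user"), "-", e.text)
```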

What arguments to use to explain why SQL Server is far better than a flat file

The higher-ups in my company were told by good friends that flat files are the way to go, and that we should switch from SQL Server to them for everything we do. We have over 300 servers and hundreds of different databases. From just the few I'm involved with, we have more than 10 billion records in quite a few of them, with upwards of 100k new records a day and who knows how many updates. A couple of others and I need to come up with a response saying why we shouldn't do this. Most of our stuff is ASP.NET with some legacy ASP. We are thinking of making a simple console app that tests/times the same interactions against a flat file (stored on the network) and against SQL Server over the network: large inserts, searches, updates, etc., along with things like random network disconnects. This would show them how bad flat files can be, especially when you are dealing with millions of records.
What things should I use in my response? What should I do with my demo code to illustrate this?
My sort list so far:
Security
Concurrent access
Performance with large amounts of data
Amount of time to do such a massive rewrite/switch and huge $ cost
Lack of transactions
PITA to map relational data to flat files
NTFS doesn't support tons of files in a directory well
Lack of ad-hoc data searching/manipulation
Enforcing data integrity
Recovery from network outage
Client delay while waiting for other clients changes to commit
Most everybody stopped using flat files for this type of storage long ago for good reason
Load balancing/replication
I fear that this will be a great post on the Daily WTF someday if I can't stop it now.
Additionally
Does anyone know if anything about HIPAA could be used in this fight? Many of our records are patient records...
Data integrity. First, you can enforce it in a database and cannot in a flat file. Second, you can ensure you have referential integrity between different entities to prevent orphaned rows.
Efficiency in storage depending on the nature of the data. If the data is naturally broken into entities, then a database will be more efficient than lots of flat files from the standpoint of the additional code that will need to be written in the case of flat files in order to join data.
Native query capabilities. You can query against a database natively whereas you cannot with a flat file. With a flat file you have to load the file into some other environment (e.g. a C# application) and use its capabilities to query against it.
Format integrity. The database format is more rigid which means more consistent. A flat file can easily change in a way that the code that reads the flat file(s) will break. The difference is related to #3. In a database, if the schema changes, you can still query against it using native tools. If the flat file format changes, you have to effectively do a search because the code that reads it will likely be broken.
"Universal" language. SQL is somewhat ubiquitous where as the structure of the flat file is far more malleable.
I'd also mention data corruption. Most modern SQL databases can have the power killed on the server, or have the server instance crash, and you won't (shouldn't) lose data. Flat files aren't really that way.
Also I'd mention search times. Perhaps even write a simple flat-file database with a million entries and show search times vs MS SQL. With indexes you should be able to search a SQL database thousands of times faster.
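Something like the following would do as a quick-and-dirty version of that demo (a sketch only, in Python with sqlite3 standing in for SQL Server; your real demo would presumably be a .NET console app against SQL Server): a linear scan of a CSV versus an indexed lookup over the same million rows.

```python
# Compare a flat-file linear scan with an indexed database lookup on identical data.
import csv, sqlite3, time, random

rows = [(i, f"user{i}@example.com") for i in range(1_000_000)]
with open("users.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_email ON users(email)")

target = f"user{random.randrange(1_000_000)}@example.com"

t0 = time.perf_counter()
with open("users.csv", newline="") as f:
    flat_hit = next(r for r in csv.reader(f) if r[1] == target)   # full scan
t1 = time.perf_counter()
db_hit = conn.execute("SELECT id FROM users WHERE email = ?", (target,)).fetchone()
t2 = time.perf_counter()

print(f"flat file scan: {t1 - t0:.4f}s, indexed query: {t2 - t1:.4f}s")
```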
I'd also be careful how quickly you write off flat files. I'd go so far as to say "it's a good idea for many cases, but in our case...". This way you won't sound like you're not listening to the other views. Tact in situations like this is a major thing to consider. They may be horribly wrong, but you have to convince your boss of that.
What do they gain from using flat files? The conversion process will be hundreds of hours - hours they pay for. How quickly can flat files generate a positive return on that investment? Provide a rough cost estimate. Translate the technical considerations into money (costs), and it puts the problem in their perspective.
On top of just the data conversion, add in the hidden costs for duplicating a database's capabilities...
Indexing
Transaction processing
Logging
Access control
Performance
Security
Databases allow you to easily index your data so you can find particular records or groups of records by searching any number of different columns.
With flat files you have to write your own indexing mechanisms. There is no need to do all that work again when the database does it for you already.
If you use "text files", you'll need to build an interface on top of it which Microsoft has already done for you and called it SQL Server.
Ask your managers if it makes sense to your company to spend all these resources building a home-made database system (because really that's what it is), or would these resources be better spent focusing on the business.
Performance: SQL Server is built for storing conveniently searchable data. It has optimized data structures in memory built with searching/inserting/deleting in mind. Usage of the disk is lowered, as data regularly queried is kept in memory.
Business partners: if you ever plan to do B2B with 3rd party companies, SQL Server has built-in functionality for it called Linked Servers. If you have only a bunch of files, your business partner will give up on you as no data interconnection is possible. Unless you want to re-invent the wheel again, and build an interface for each business partner you have.
Clustering: you can easily cluster servers in SQL Server for high availability and speed, a lot more than what's possible with text based solution.
You have a nice start to your list. The items I would add include:
Data integrity - SQL engines provide built-in mechanisms (relationships, constraints, triggers, etc.) that make it very simple to reduce the amount of "bad" data in your system. You would need to hand code each data constraint separately if you use flat files.
Ad-hoc data retrieval - SQL engines, through the use of SELECT statements, provide a means of filtering and summarizing your data with very little code. If you are using flat files, considerably more code is needed to get the same results.
These items can be replicated if you want to take the time to build a data engine, but what would be the point? SQL engines already provide these benefits.
I don't think I can even start to list the reasons. I think my head is going to explode. I'll take the risk though to try to help you...
Simulate a network outage and show what happens to one of the files at that point
Demo the horrors of a half-committed transaction because text files don't pass the ACID test
If it's a multi-user application, show how long a client has to wait when 500 connections are all trying to update the same text file
Try to politely explain why the best approach to making business decisions is to listen to the professionals who you are paying money and who know the domain (in this case, IT) and not your buddy who doesn't have a clue (maybe leave out that last bit)
Mention the fact that 99% (made up number) of the business world uses relational databases for their important data, not text files and there's probably a reason for that
Show what happens to your application when someone goes into the text file and types in "haha!" for a column that's supposed to be an integer
If you are a public company, the shareholders would be well served to know this is being seriously contemplated. "We" all know this is a ridiculous suggestion given the size and scope of your operation. Patient records must be protected, not only from security breaches but from irresponsible exposure to loss - lives may depend upon the data. If the executives care at all about the patients, THIS should be their highest concern.
I worked with IBM 370 mainframes from '74 onwards and the day that DB2 took over from plain old flat files, VSAM and ISAM was a milestone day. Haven't looked back to flat-file storage, except for streaming data, in my 25 years with RDBMSs of 4 flavors.
If I owned stock in "you", dumping it in a hurry the moment the project took off would seem appropriate...
Your list is a great start of reasons for sticking with a database.
However, I would recommend that if you're talking to a technical person, to shy away from technical reasons in a recommendation because they might come across as biased.
Here are my 2 points against flat file data storage:
1) Security - HIPAA audits require that patient data remain in a secure environment. The common database systems (Oracle, Microsoft SQL, MySQL) have methods for implementing HIPAA-compliant security access. Doing so on a flat file would be difficult, at best.
Side note: I've also seen medical practices that encrypt the patient name in the database to add extra layers of protection & compliance to ensure even if their DB is compromised that the patient records are not at risk.
2) Reporting - Reporting from any structured database system is simple and common. There are hundreds of thousands of developers who can perform this task. Reporting from flat files will require an above-average developer. And, because there is no generally accepted method for doing reporting off a flat-file database, one developer might do things differently than another. This could impact the talent pool able to work on a home-grown flat-file system, and ultimately drive costs up by having to support that type of system.
I hope that helps.
How do you create a relational model with plain text files?
Or are you planning to use a different file for each entity?
Pro file system:
Stable (fewer lines of code = fewer bugs, easier to understand, more reliable)
Faster with huge data blobs
Searching/sorting is somewhat slow (but sort can be faster than SQL's order by)
So you'd choose a filesystem to create log files, for example. Logging to a DB is useless unless you need to do complex analysis of the data.
Pro DB:
Transactions (which includes concurrent access)
It can search through huge amounts of records (but not through huge blobs of data)
Chopping the data in all kinds of ways with queries is easy (well, if you know your SQL and the special "oddities" of your DB)
So if you need to add data rarely but search it often, select parts of it by certain criteria or aggregate values, a DB is for you.
NTFS does not support massive numbers of .txt files well. Depending on how a flat-file system is developed, the health of a hard drive can become an issue. A lot of older flat-file systems use masses of small .txt files to store data. It's bad design, but it tends to happen as a flat-file system gets older.
Fragmentation becomes an issue, and you lose a text file here and there, causing you to lose small amounts of data. Health of a hard drive should not be an issue when it comes to database design.
This is indeed, on the part of your employer, a MAJOR WTF if he's seriously proposing flat files for everything...
You already know the reasons (oh - add Replication / Load Balancing to your list) - what you need to do now is to convince him of them. My approach to this would be two-fold.
First of all, I would write a script in whatever tool you currently use to perform a basic operation using SQL, and have it timed. I would then write another script in which you sincerely try to get a flat text solution working, and then highlight the difference in performance. Give him both sets of code so he knows you aren't cheating.
Point out that technology evolves, and that just because someone was successful 20 years ago, this does not automatically entitle them to a credible opinion now.
You might also want to mention the scope for errors in decoding / encoding information in text files, that it would be trivial for someone to steal them, and the costs (justify your estimate) in adapting the current code base to use text files.
I would then ask serious questions of management - foremost amongst them, and I would ask this DIRECTLY, is "Why are you prepared to overrule your technical staff on technical matters" based on one other individual's opinion - especially when said individual is not as familiar with our set up as we are...
I'd also then use the phrase "I do not mean to belittle you, but I seriously feel I have to intervene at this point for the good of the company..."
Another approach - turn the tables - have Mr. Wonderful supply arguments as to why text files are the way forward. You'll then either a) Learn something (not likely), or b) Be in a position to utterly destroy his arguments.
Good luck with this - I feel your pain...
Martin
I suggest you get your retaliation in first: post on The Daily WTF now.
As to your question: a business reason would be to ask why your boss wants to rewrite all your systems from scratch, as you would, effectively, have to write your own database system.
For a development reason, you would lose access to the SQL server ecosystem, all the libraries, tools, utilities.
Perhaps the guy that suggested this is actually thinking of going into competition with your company.
Simplest way to refute this argument - name a fortune 500 company that processes data on this scale using flat files?
Now name a fortune 500 company that doesn't use a relational database...
Case closed.
Something is really fishy here. For someone to get the terminology right ("flat file") but not know how overwhelmingly stupid an idea that is just doesn't add up. I would be willing to bet your manager is non-technical, but the person your manager is talking to is. This sounds more like a lost-in-translation problem.
Are you sure they don't mean NoSQL? If you are in a document-centric environment, moving away from a relational database actually does make sense in some regards, while still retaining many of the positives of a traditional RDBMS.
So, instead of justifying why SQL is better than flat files, I would invert the problem and ask what problems the flat files are meant to solve. I would put money on this being a communication problem.
If it's not, and your company is actually considering replacing its DB with a home-grown flat-file system on the recommendation of "a friend", convincing your manager why he is wrong is the least of your worries. Instead, dust off and start circulating your resume.
• Amount of time to do such a massive rewrite/switch and huge $ cost
It's not just the amount of time; it is the introduction of new bugs. A rewrite of these proportions would cause things that currently work to break.
I'd suggest giving him a cost estimate of the hours to do such a rewrite for just one system, and then the number of systems that would need to change. Once they have a cost estimate, they will run from this as fast as they can.
Managers like numbers, so do a formal written decision analysis. Compare the two proposals by benefits and risks, side by side with numeric values. When they see a cost of 0 to maintain versus 100,000,000 to convert, they will get the point.
People who don't distinguish between flat files and SQL won't understand all of the arguments given above.
The explanation must be as simple as possible, something like this:
SQL is a kind of search/concurrency wrapper around flat files.
All the problems that exist currently will remain even if the company writes that wrapper from scratch.
You must also offer some other way to resolve the current problems; use smart words like "advanced BLL" or "install/uninstall scripting environment". :)
You have to speak executive. Without saying it, make them realize they're in way over their heads here. Here's some ammunition:
Database theory is hardcore computer science. We're talking about building a scalable system that can handle millions of records and tolerate disasters without putting everyone out of business.
This is the work of PhD-level specialists. They've been refining the field for a good 20 years now, and the great thing about that is this: it allows us to specialize in building business systems.
If you have to, come right out and say that this just isn't done in the enterprise. It would be costly and the result would be inferior. It's exactly the kind of wheel that developers love to reinvent, and in my opinion the only time you should is if the result is going to be a product or service that you can sell. And it won't be.

How do you handle small sets of data?

With really small sets of data, the policy where I work is generally to stick them into text files, but in my experience this can be a development headache. Data generally comes from the database, and when it doesn't, the process involved in setting and storing it is generally hidden in the code. With the database you can generally see all the data available to you and the ways in which it relates to other data.
Sometimes for really small sets of data I just store them in an internal data structure in the code (like a Perl hash), but then when a change is needed, it's in the hands of a developer.
So how do you handle small sets of infrequently changed data? Do you have set criteria of when to use a database table or a text file or..?
I'm tempted to just use a database table for absolutely everything but I'm not sure if there are any implications to this.
Edit: For context:
I've been asked to put a new contact form on the website for a handful of companies, with more to be added occasionally in the future. Except, companies don't have contact email addresses; the users inside these companies do (as they post jobs through their own accounts). Now, though, we want a "speculative application" type of functionality, and the form needs an email address to send these applications to. But we also don't want to put an email address as a property of the form, or else spammers can just use it as an open email gateway. So clearly, we need an ID -> contact_email type of relationship with companies.
So, I can either add a column to a table with millions of rows, which will be used, literally, about 20 times, OR create a new table that at most is going to hold about 20 rows. Typically how we have handled this in the past is just to create a nasty text file and read it from there. But this creates maintenance nightmares, and these text files are frequently overlooked when the data they depend on changes. Perhaps this is a fault with the process, but I'm just interested in hearing views on this.
Put it in the database. If it changes infrequently, cache it in your middle tier.
The example that springs to mind immediately is what is appropriate to have stored as an enumeration and what is appropriate to have stored in a "lookup" database table.
I tend to "draw the line" with the rule that if it will result in a column in the database containing a "magic number" that maps to an enumeration value, then the enumeration should really exist as a lookup table. If it's unrelated to the data stored in the database (eg. Application configuration data rather than user generated data), then it's an enumeration all the way.
Surely it depends on the user of the software tool you've developed to consume the set of data, regardless of size?
It might just be that they know Excel, so your tool would have to parse a .csv file that they create.
If it's written for the developers, then who cares what you use. I'm not a fan of cluttering databases with minor or transient data however.
We have a standard config file format (key:value) and a class to handle it. We just use that on all projects. Mostly we're just setting persistent properties for our applications (mobile phone development) so that's an appropriate thing to do. YMMV
In cases where the program accesses a database, I'll store everything in there: easier for backup and moving data around.
For small programs without database access I store my data in the .net settings, which are stored in an xml file - of course this is a feature of c#, so it might not apply to you.
Anyway, I make sure to store all data in one place. Usually a database.
Have you considered SQLite? It's file-based, which addresses your feeling that "just a file might do" (zero configuration), but it's a perfectly good database and scales remarkably well. It supports a number of APIs and there are numerous front ends for administering it.
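A minimal example of how little ceremony that involves (Python's built-in sqlite3 module; the file and table names are made up for the contact-email case described above):

```python
# Zero configuration: the whole "database" is one file next to your code,
# but you still get SQL, transactions, and indexes.
import sqlite3

conn = sqlite3.connect("company_contacts.db")   # file is created on first use
conn.execute("""CREATE TABLE IF NOT EXISTS company_contacts (
                    company_id INTEGER PRIMARY KEY,
                    contact_email TEXT NOT NULL)""")
conn.execute("INSERT OR REPLACE INTO company_contacts VALUES (?, ?)",
             (42, "careers@example.com"))
conn.commit()

email, = conn.execute("SELECT contact_email FROM company_contacts WHERE company_id = ?",
                      (42,)).fetchone()
print(email)
```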
If it's small, config-like data, I use some simple and common format. INI, JSON, and YAML are usually OK. Java and .NET fans also like XML. In short, use something that you can easily read into an in-memory object and forget about it.
I would add it to the database in the main table:
Backup and recovery (you do want to recover this text file, right?)
Ad-hoc querying (since you can do it with a SQL tool and join it to the other database data)
If the database column is empty, the storage requirements for it should be minimal (nothing if it's a NULL column at the end of the table in Oracle)
It will be easier if you want to have multiple application servers as you will not need to keep multiple copies of some extra config file around
Putting it into a little child table only complicates the design without giving any real benefits
You may well already be going to that same row in the database as part of your processing anyway, so performance is not likely to be a problem. If you are not, you could cache it in memory.
