How to split DB2 load files by node on ETL server? - database

I'm building a DB2 "Infosphere" data warehouse and am expecting to have 8-16 nodes or partitions.
Since I'll be loading from 130-300 million rows a day, and my load process is also my recovery process - I want the loads to be as fast as possible. I'm not surprised to find this tip in the IBM "infocenter" documentation:
"Better performance can be expected if the database partitions participating in the distribution process are different from the loading database partitions, since there is less contention for CPU cycles."
I'd prefer not to dedicate an expensive DB2 node just to splitting load files by hashkey - since my ETL servers are so cheap (we use python, not a licensed commercial product). Plus, since I rely on archived loads for recovery - I may have to convert them in case we add nodes to the database. I'd like that also done on an ETL server. Note - I believe DataStage also performs this task on the ETL server rather than through DB2.
Can anyone suggest how our python ETL process can efficiently use the same hashing algorithm and mapping tables that DB2 will use? And other tips?
Thanks

First of all:
You do not need to pre-split the data inside your ETL process. The LOAD utility will handle splitting the data for you. Your python process can either write the data to load to a flat file or write directly to a pipe (that the LOAD utility reads from). In almost every case, it is easier to let the database handle partitioning the data for you.
The InfoCenter comment about the splitters taking up CPU cycles is probably not something you need to worry about. This generally applies only in extreme situations, where there are many more database partitions (i.e., when you need to have multiple processes splitting the data) and when CPU utilization on the database nodes is very high.
From a LOAD perspective, the amount of time you'll save by having pre-split data is negligible. The limiting factor when loading data is writing the data out to disk – not partitioning it. If reloading data is your primary method of recovery, then I wouldn't worry too much about this.
If all of this does not convince you and you really want to go down the path of having your ETL process split the data, DB2 does provide an API (in C) that applications can call to handle this: db2GetDistMap() and db2GetRowPartNum(). You may be able to write a native python module to handle this.
These are most useful in cases where an application is using SQL to INSERT rows into the table (as opposed to using the LOAD utility), and spawns multiple threads to write data to each partition independently (i.e., each thread is doing the transformation and loading in parallel). If you can't parallelize the transformation portion, then don't bother with this.
Obviously, there are a lot of variables, so YMMV.

Related

MPI how to send and receive SQLite database

I have a big SQLite database to process, so I would like to use MPI for parallelization to accelerate the speed. What I want to do is sending a database from root to every slave, and sending the modified databases to root after slave add some table into it. I want to use MPI_Type_create_struct to create a datatype to store database, but the database is too complicated. IS there any other way to handle this situation? Thank you in advance!
I recently dealt with a similar problem - I have a large MPI application that uses SQLite as a configuration store. Handling multi-process writes is a challenge with an embedded SQL database. My experience with this involves a massively parallel application (running up to 65,535 ranks) with a shared filesystem.
Based on the FAQ from SQLite and some experience with database engines, there are a few ways to approach this problem. I am making the assumption that you are operating with a shared distributed file system, and multiple separate computers (a standard HPC cluster setup).
Since SQLite will block when multiple processes write to the database (but not read), reads will most likely not be an issue. Each process can run multiple SELECT commands at the same time without issue.
The challenge will be in the writing. Disk I/O is several orders of magnitude slower than computation, so generally this will be the bottleneck. Having said that, network communication may also be a significant slowdown, so how you approach the problem really depends on where the weakest link of your running environment will be.
If you have a fast network and slow disk speed, or if you want to implement this in the most straightforward way possible, your best bet is to have a single MPI rank in charge of writing to the database. Your compute processes would independently run SELECT commands until computation was complete, then send the new data to the MPI database process. The database control process would then write the new data to disk. I would not try to send the structure of the database across the network, rather I would send the data that should be written, along with (possibly) a flag that would identify what table/insert query the data should be written with. This technique is sort of similar to how a RDBMS works - while RDBMS servers do support concurrent writes, there is a "central" process in control of the ordering of write operations.
One thing to note is that if a process writes to the SQLite database, the file is locked for all processes that are trying to read or write to it. You will need to either handle the SQLITE_BUSY return code in your worker processes, register a callback to handle this, change the busy behavior, or use an alternate technique. In my application, I found that loading the database as an in-memory database, (https://www.sqlite.org/inmemorydb.html) for the readers provided a good workaround. Readers access the in-memory database, but sent results to the controlling process for writes. The downside is that you will have multiple copies of the database in memory.
Another option that might be less network intensive is to do the reads concurrently and have each worker process write out to their own file. You could write out to separate SQLite database files, or even export something like CSV (depending on the complexity of the data). When writes are complete, you would then have a single process merge the individual files into a single result database file - see How can I merge many SQLite databases?. This method has its own issues, but depending on where your bottlenecks are and how the system as a whole is laid out, this technique may work.
Finally, you might consider reading from the SQLite database and saving the data to a proper distributed file format, such as HDF5 (or using MPI IO). Once the computation is done, it would be pretty straightforward to write a script that would create a new SQLite database from this foreign file format.

Which approach and database to use in performance-critical solution

I have the following scenario:
Around 70 million of equipments send a signal every 3~5 minutes to
the server sending its id, status (online or offiline), IP, location
(latitude and longitude), parent node and some other information.
The other information might not be in an standard format (so no schema for me) but I still need to query it.
The equipments might disappear for some time (or forever) not sending
signals in the process. So I need a way to "forget" the equipments if
they have not sent a signal in the last X days. Also new equipments
might come online at any time.
I need to query all this data. Like knowing how many equipments are offline on a specific region or over
an IP range. There won't be many queries running at the same time.
Some of the queries need to run fast (less than 3 min per query) and
at the same time as the database is updating. So I need indexes on
the main attributes (id, status, IP, location and parent node). The
query results do not need to be 100% accurate, eventual consistency
is fine as long as it doesn't take too long (more than 20 min on
avarage) for them to appear in the queries results.
I don't need
persistence at all, if the power goes out it's okay to lose
everything.
Given all this I thought of using a noSQL approach maybe MongoDB or CouchDB since I have experience with MapReduce and Javascript but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or if they are fit at all to handle this massive workload. I don't even know if I actually need a "traditional" database since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problem I detect are the following:
Need to insert/update lots of tuples really fast and I don't know
beforehand if the signal I receive is already in the database or not.
Almost all of the signals will be in the same state as they were the
last time, so maybe query by id and check to see if the tuple changed if not do nothing, if it did update?
Forgeting offline equipments. A batch job that runs during the night
removing expired tuples would solve this problem.
There won't be many queries running at the same time, but they need
to run fast. So I guess I need to have a cluster that perform a
single query on multiple nodes of the cluster (does CouchDB MapReduce
splits the workload to multiple nodes of the cluster?). I'm not
enterily sure I need a cluster though, could a single more expensive
machine handle all the load?
I have never used a noSQL system before, but I have theoretical
knowledge of the subject.
Does this make sense?
Apache Flume for collecting the signals.
It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Easy to configure and scale. Store the data in HDFS as files using Flume.
Hive for batch queries.
Map the data files in HDFS as external tables in Hive warehouse. Write SQL like queries using HiveQL whenever you need offline-batch processing.
HBase for random real-time reads/writes.
Since HDFS, being a FS, lacks the random read/write capability, you would require a DB to serve that purpose. Looking at your use case HBase seems good to me. I would not say MongoDB or CouchDB as you are not dealing with documents here and both these are document-oriented databases.
Impala for fast, interactive queries.
Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive it does not use MapReduce. It instead leverages the power of MPP so it's good for real time stuff. And it's easy to use since it uses the same metadata, SQL syntax (Hive SQL), ODBC driver etc as Hive.
HTH
Depending on the type of analysis, CouchDB, HBase of Flume may be all be good choices. For strictly numeric "write-once" metrics data graphite is a very popular open source solution.

Auto sharding postgresql?

I have a problem where I need to load alot of data (5+ billion rows) into a database very quickly (ideally less than an 30 min but quicker is better), and I was recently suggested to look into postgresql (I failed with mysql and was looking at hbase/cassandra). My setup is I have a cluster (currently 8 servers) that generates alot of data, and I was thinking of running databases locally on each machine in the cluster it writes quickly locally and then at the end (or throughout the data generating) data is merged together. The data is not in any order so I don't care which specific server its on (as long as its eventually there).
My questions are , is there any good tutorials or places to learn about PostgreSQL auto sharding (I found results of firms like sykpe doing auto sharding but no tutorials, I want to play with this myself)? Is what I'm trying to do possible? Because the data is not in any order I was going to use auto-incrementing ID number, will that cause a conflict if data is merged (this is not a big issue anymore)?
Update: Frank's idea below kind of eliminated the auto-incrementing conflict issue I was asking about. The question is basically now, how can I learn about auto sharding and would it support distributed uploads of data to multiple servers?
First: Do you really need to insert the generated data from your cluster straight into a relational database? You don't mind merging it at the end anyway, so why bother inserting into a database at all? In your position I'd have your cluster nodes write flat files, probably gzip'd CSV data. I'd then bulk import and merge that data using a tool like pg_bulkload.
If you do need to insert directly into a relational database: That's (part of) what PgPool-II and (especeially) PgBouncer are for. Configure PgBouncer to load-balance across different nodes and you should be pretty much sorted.
Note that PostgreSQL is a transactional database with strong data durability guarantees. That also means that if you use it in a simplistic way, doing lots of small writes can be slow. You have to consider what trade-offs you're willing to make between data durability, speed, and cost of hardware.
At one extreme, each INSERT can be its own transaction that's synchronously committed to disk before returning success. This limits the number of transactions per second to the number of fsync()s your disk subsystem can do, which is often only in the tens or hundreds per second (without battery backup RAID controller). This is the default if you do nothing special and if you don't wrap your INSERTs in a BEGIN and COMMIT.
At the other extreme, you say "I really don't care if I lose all this data" and use unlogged tables for your inserts. This basically gives the database permission to throw your data away if it can't guarantee it's OK - say, after an OS crash, database crash, power loss, etc.
The middle ground is where you will probably want to be. This involves some combination of asynchronous commit, group commits (commit_delay and commit_siblings), batching inserts into groups wrapped in explicit BEGIN and END, etc. Instead of INSERT batching you could do COPY loads of a few thousand records at a time. All these things trade data durability off against speed.
For fast bulk inserts you should also consider inserting into tables without any indexes except a primary key. Maybe not even that. Create the indexes once your bulk inserts are done. This will be a hell of a lot faster.
Here are a few things that might help:
The DB on each server should have a small meta data table with that server's unique characteristics. Such as which server it is; servers can be numbered sequentially. Apart from the contents of that table, it's probably wise to try to keep the schema on each server as similar as possible.
With billions of rows you'll want bigint ids (or UUID or the like). With bigints, you could allocate a generous range for each server, and set its sequence up to use it. E.g. server 1 gets 1..1000000000000000, server 2 gets 1000000000000001 to 2000000000000000 etc.
If the data is simple data points (like a temperature reading from exactly 10 instruments every second) you might get efficiency gains by storing it in a table with columns (time timestamp, values double precision[]) rather than the more correct (time timestamp, instrument_id int, value double precision). This is an explicit denormalisation in aid of efficiency. (I blogged about my own experience with this scheme.)
Use citus for PostgreSQL auto sharding. Also this link is helpful.
Sorry I don't have a tutorial at hand, but here's an outline of a possible solution:
Load one eight of your data into a PG instance on each of the servers
For optimum load speed, don't use inserts but the COPY method
When the data is loaded, do not combine the eight databases into one. Instead, use plProxy to launch a single statement to query all databases at once (or the right one to satisfy your query)
As already noted, keys might be an issue. Use non-overlapping sequences or uuids or sequence numbers with a string prefix, shouldn't be too hard to solve.
You should start with a COPY test on one of the servers and see how close to your 30-minute goal you can get. If your data is not important and you have a recent Postgresql version, you can try using unlogged tables which should be a lot faster (but not crash-safe). Sounds like a fun project, good luck.
You could use mySQL - which supports auto-sharding across a cluster.

Importing new database table

Where I'm at there is a main system that runs on a big AIX mainframe. To facility reporting and operations there is nightly dump from the mainframe into SQL Server, such that each of our 50-ish clients is in their own database with identical schemas. This dump takes about 7 hours to finish each night, and there's not really anything we can do about it: we're stuck with what the application vendor has provided.
After the dump into sql server we use that to run a number of other daily procedures. One of those procedures is to import data into a kind of management reporting sandbox table, which combines records from a particularly important table from across the different databases into one table that managers who don't know sql so can use to run ad-hoc reports without hosing up the rest of the system. This, again, is a business thing: the managers want it, and they have the power to see that we implement it.
The import process for this table takes a couple hours on it's own. It filters down about 40 million records spread across 50 databases into about 4 million records, and then indexes them on certain columns for searching. Even at a coupld hours it's still less than a third as long as the initial load, but we're running out of time for overnight processing, we don't control the mainframe dump, and we do control this. So I've been tasked with looking for ways to improve one the existing procedure.
Currently, the philosophy is that it's faster to load all the data from each client database and then index it afterwards in one step. Also, in the interest of avoiding bogging down other important systems in case it runs long, a couple of the larger clients are set to always run first (the main index on the table is by a clientid field). One other thing we're starting to do is load data from a few clients at a time in parallel, rather than each client sequentially.
So my question is, what would be the most efficient way to load this table? Are we right in thinking that indexing later is better? Or should we create the indexes before importing data? Should we be loading the table in index order, to avoid massive re-ordering of pages, rather than the big clients first? Could loading in parallel make things worse by causing to much disk access all at once or removing our ability to control the order? Any other ideas?
Update
Well, something is up. I was able to do some benchmarking during the day, and there is no difference at all in the load time whether the indexes are created at the beginning or at the end of the operation, but we save the time building the index itself (it of course builds nearly instantly with no data in the table).
I have worked with loading bulk sets of data in SQL Server quite a bit and did some performance testing on the Index on while inserting and the add it afterwards. I found that BY FAR it was much more efficient to create the index after all data was loaded. In our case it took 1 hour to load with the index added at the end, and 4 hours to add it with the index still on.
I think the key is to get the data moved as quick as possible, I am not sure if loading it in order really helps, do you have any stats on load time vs. index time? If you do, you could start to experiment a bit on that side of things.
Loading with the indexes dropped is better as a live index will generate several I/O's for every row in the database. 4 million rows is small enough that you would not expect to get a significant benefit from table partitioning.
You could get a performance win by using bcp to load the data into the staging area and running several tasks in parallel (SSIS will do this). Write a generic batch file wrapper for bcp that takes the file path (and table name if necessary) and invoke a series of jobs in half a dozen threads with 'Execute Process' tasks in SSIS. For 50 jobs it's probably not worth trying to write a data-driven load controller process. Wrap these tasks up in a sequence container so you don't have to maintain all of the dependencies explicitly.
You should definitely drop and re-create the indexes as this will greatly reduce the amount of I/O during the process.
If the 50 sources are being treated identically, try loading them into a common table or building a partitioned view over the staging tables.
Index at the end, yes. Also consider setting the log level setting to BULK LOGGED to minimize writes to the transaction log. Just remember to set it back to FULL after you've finished.
To the best of my knowledge, you are correct - it's much better to add the records all at once and then index once at the end.

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.
By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be bandwidth-limited more-so than writing to a DB which then writes, expressly because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.
I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table, if what ever application framework supports batched inserts, then use these, it will speed it up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example if you are storing if the HTTP type of the request (get/post/etc), this can only ever be a couple of types, and better to store as an Int, and get improved I/O + query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out of band (i.e seperate process from the webserver) suck these files up and BCP them into the database. This will be at the expense of more moving parts, and potentially, a greater delay between receiving requests and them finding their way into the database
Hope this helps, Ace
When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistant storage (disk drives) to complete each transaction which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts sending them in bulk within a single transaction will lead to more effecient write behavior on disk reducing the number of flush operations.
My recommendation is to queue the messages and periodically .. say every 15 seconds or so start a transaction ... send all queued inserts ... commit the transaction.
If your database supports sending multiple log entries in a single request/command doing so can have a noticable effect on performance when there is some network latency between the application and RDBMS by reducing the number of round trips.
Some systems support bulk operations (BCP) providing a very effecient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.
Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than the your filesystem.
The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such example case where the person logged to a file in reverse order, being the most recent entries came out first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded phps memory limit, and as such, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.
The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have batch job that processes this data. The advantages of the database approach is that you can query/report on the data and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to text file, you could open up more possibilities for corruption.
My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.
There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...
Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.
Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL there are several different options/techniques to load efficiently a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
have good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, i can give more specific advices.
If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.

Resources