I am running lots of very slow computations with reusable results (and often computing something new relies on a computation that was already performed before). To make use of them, I want to store the results somewhere (permanently). The computations can be uniquely identified by two identifiers: experiment name and computation name, and the value is an array of floats (which I currently store as raw binary data).

The results need to be individually accessed (read and written) by experiment and computation name very often, and sometimes also just by experiment name (i.e. all computations with their results for a given experiment). They are also sometimes concatenated, but if reading and writing is fast, no additional support for this operation is needed.

This data will not be accessed by any web application (it is used only by non-production scripts that need the results of the computations, but calculating them each time is not feasible), and there is no need for transactions, but every write needs to be atomic (e.g. turning off the computer should not result in corrupted/partial data). Reading also needs to be atomic (e.g. if two processes try to access the result of one computation, and it's not there, so one of them starts saving the new result, the other process should either receive it when it's done, or receive nothing at all). Accessing the data remotely is not required, but would be helpful.
So, TL;DR requirements:
permanent storage of binary data (no metadata other than the identifier needs to be stored)
very fast access (read/write) based on a compound identifier
ability to read all data by one part of a compound identifier
concurrent, atomic read/write
no need for transactions, complex queries, etc.
remote access would be nice to have, but not required
the whole thing is there mostly to save time, so speed is critical
The solutions I tried so far are:
storing them as individual files (one directory per experiment, one binary file per computation) - this requires manual handling of atomicity (see the sketch after this list), and most file systems only support file names up to 255 characters (computation names may be longer than that), so an additional mapping would be required; I'm also not sure whether ext4 (which is the filesystem I'm using and can't change) is designed to handle millions of files
using a sqlite database (with just one table and a compound primary key) - at first it seemed perfect, but once we got to hundreds of gigabytes of data (millions of ~100 KB blobs, and both their number and their size will increase), it became really slow, even after applying optimizations found on the internet
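For reference, the manual atomicity handling mentioned in the first option would look roughly like the sketch below - a minimal version that assumes POSIX rename semantics on ext4 and hashes long computation names (the directory layout and hashing scheme are only illustrative):

    import hashlib
    import os
    import tempfile

    def save_result(experiment: str, computation: str, data: bytes, root: str = "results") -> None:
        # Hash the computation name to stay under the 255-character file name limit.
        fname = hashlib.sha256(computation.encode()).hexdigest()
        exp_dir = os.path.join(root, experiment)
        os.makedirs(exp_dir, exist_ok=True)
        # Write to a temporary file in the same directory, then atomically rename it:
        # a reader sees either the complete old file, the complete new file, or nothing.
        fd, tmp_path = tempfile.mkstemp(dir=exp_dir)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())    # make sure the bytes are on disk before the rename
            os.replace(tmp_path, os.path.join(exp_dir, fname))    # atomic on POSIX
        except BaseException:
            os.unlink(tmp_path)
            raise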
Naturally, after sqlite failed, the first idea was to just move to a "proper" database like postgres, but then I realized that perhaps in this case a relational database is not really the way to go (especially since speed is critical here, and I don't need most of their features) - and postgres in particular is probably not the way to go, since the closest thing to a blob is bytea, which requires additional conversions (so a performance hit is guaranteed). However, after researching a bit about key-value databases (which seemed to fit my problem), I found that none of the databases I checked support compound keys, and they often have length limitations for keys (e.g. Couchbase allows just 250 bytes). So, should I just go with a normal relational database, try one of the NoSQL databases, or maybe something completely different like HDF5?
One way to improve on the database solution is to externalize the data blob.
You can use SeaweedFS https://github.com/chrislusf/seaweedfs as an object store: upload the blob, get a file id back, and then store the file id in the database. (I am working on SeaweedFS.)
This should reduce the database load quite a bit, and querying will be much faster.
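As a rough sketch of that flow in Python (the master address is an assumption, and the /dir/assign and /dir/lookup endpoints follow the SeaweedFS docs, so check them against your version):

    import requests

    MASTER = "http://localhost:9333"    # assumed SeaweedFS master address

    def store_blob(data: bytes) -> str:
        # Ask the master for a file id plus a volume server to upload to.
        assign = requests.get(MASTER + "/dir/assign").json()
        fid, volume = assign["fid"], assign["url"]
        requests.post("http://" + volume + "/" + fid, files={"file": data})
        return fid    # store this fid in the database next to (experiment, computation)

    def load_blob(fid: str) -> bytes:
        # Look up which volume server holds the fid's volume, then fetch the blob.
        volume_id = fid.split(",")[0]
        lookup = requests.get(MASTER + "/dir/lookup", params={"volumeId": volume_id}).json()
        volume = lookup["locations"][0]["url"]
        return requests.get("http://" + volume + "/" + fid).content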
So, I ended up using a relational database anyway (since only there I could use compound keys without any hacks).
I performed a benchmark to compare sqlite with postgres and mysql: 500,000 inserts of ~60 KB blobs, followed by 50,000 selects by the full key (a sketch of the setup appears at the end of this answer). This was not enough to slow sqlite down to the unacceptable levels I was experiencing, but it set a point of reference (i.e. the speed at which sqlite ran with this few records was acceptable to me). I assumed that mysql and postgres would not suffer a huge performance hit when more records were added (since they were designed to work with much larger amounts of data than sqlite), and when I finally settled on one of them, that turned out to be true.
The settings (other than the defaults) were the following:
sqlite: journal mode=wal (required for parallel access), isolation level autocommit, values as BLOB
postgres: isolation level autocommit (can't turn off transactions, and doing everything in one huge transaction was not an option for me), values as BYTEA (which sadly includes the double conversion I wrote about)
mysql: engine=aria, transactions disabled, values as MEDIUMBLOB
As you can see, I was able to customize mysql much more to fit the task at hand. The results below reflect it well:
               sqlite      postgres         mysql
selects     90.816292    191.910514    106.363534
inserts   4367.483822   7227.473075   5081.281370
MySQL had similar speed to sqlite, with postgres being significantly slower.
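For reference, a simplified sketch of the kind of benchmark I describe above (not the exact script I used; table and column names are illustrative), using the sqlite settings listed earlier:

    import os
    import sqlite3
    import time

    conn = sqlite3.connect("bench.db", isolation_level=None)    # autocommit
    conn.execute("PRAGMA journal_mode=WAL")                      # needed for parallel access
    conn.execute("""CREATE TABLE IF NOT EXISTS results (
                        experiment  TEXT,
                        computation TEXT,
                        data        BLOB,
                        PRIMARY KEY (experiment, computation))""")

    N_INSERTS, N_SELECTS, BLOB_SIZE = 500_000, 50_000, 60 * 1024

    start = time.time()
    for i in range(N_INSERTS):
        conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                     (f"exp{i % 100}", f"comp{i}", os.urandom(BLOB_SIZE)))
    print("inserts:", time.time() - start)

    start = time.time()
    for i in range(N_SELECTS):
        conn.execute("SELECT data FROM results WHERE experiment=? AND computation=?",
                     (f"exp{i % 100}", f"comp{i}")).fetchone()
    print("selects:", time.time() - start)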
I am dealing with large amounts of scientific data that are stored in tab-separated .tsv files. The typical operations to be performed are reading several large files, keeping only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.
The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.
As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.
Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as backend? It seems to me like a very generic problem, that virtually all scientists get to deal with in one way or the other.
There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is about 33%: it takes 5 bytes to store most 16-bit integers, plus a delimiter, for 6 bytes total, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
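A quick back-of-the-envelope check of that overhead in Python (the values are arbitrary):

    import struct

    values = [54321, 12, 40000]                                  # arbitrary 16-bit integers
    tsv_bytes = len("\t".join(str(v) for v in values)) + 1       # +1 for the newline
    bin_bytes = len(struct.pack("<%dH" % len(values), *values))  # 2 bytes per value
    print(tsv_bytes, bin_bytes)                                  # 15 vs 6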
I would consider taking the existing data, and instead of storing raw, processing it in the following two ways:
Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format (see the reading sketch below).
Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.
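As a sketch of the first option: a compressed .tsv stays easy to process, since tools like Python can stream gzip files transparently (the file name and column index here are hypothetical):

    import csv
    import gzip

    # Stream a gzip-compressed .tsv without ever materialising an uncompressed copy.
    with gzip.open("measurements.tsv.gz", "rt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # e.g. keep only rows whose third column exceeds some threshold
            if float(row[2]) > 1.0:
                print(row[0], row[2])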
You should be able to reduce your storage space, and should not need twice as much space as you state.
Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.
One of these NoSQL databases might work. I highly doubt any are configurable to sit on top of flat, delimited files. You might look at one of the open source projects and write your own database layer.
Scalability begins at a point beyond tab-separated ASCII.
Just be practical - don't academicise it - convention frees your fingers as well as your mind.
I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store it in a different format, like the database Jason was suggesting, you pay the parsing cost on every operation instead of just once when you initially process it.
You can do this with LINQ to Objects if you are in a .NET environment: streaming/deferred execution, a functional programming model and all of the SQL operators. The joins will work in a streaming model, but one table gets pulled into memory, so it works best when joining a large table to a smaller one.
The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.
LINQ against a delimited text file is a common demonstration of LINQ. You need to provide the ability to feed LINQ a tabular model. Google "LINQ for text files" for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).
Expect a learning curve, but it's a good solution for your problem. One of the best treatments of the subject is Jon Skeet's C# in Depth. Pick up the "MEAP" version from Manning for early access to his latest edition.
I've done work like this before with large mailing lists that needed to be cleansed, deduped and appended. You are invariably I/O bound. Try solid-state drives, particularly Intel's "E" series, which has very fast write performance, and RAID them in as parallel a configuration as possible. We also used grids, but had to adjust the algorithms to multi-pass approaches that would reduce the data.
Note that I agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case you're basically doing ETL, which is a well-understood problem in the warehousing community. If the data is ad hoc, however (scientists just drop their results in a directory), you need "agile/just-in-time" transformations, and if most transformations are a single-pass select ... where ... join, then you're approaching it the right way.
You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33 GB XML file; it takes 18 minutes to read in, persist as objects (one per Wikipedia topic) and store in compact databases. Many samples showing how to read in tab-separated text files are included in the download.
The question's already been answered, and I agree with the bulk of the statements.
At our centre, we have a standard talk we give, "so you have 40TB of data", as scientists are newly finding themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those who are new to it. The basic points we try to get across:
Plan your I/O:
- Binary files
- As much as possible, large files
- File formats that can be read in parallel, with subregions extracted
- Avoid zillions of files
- Especially avoid zillions of files in a single directory

Data management must scale:
- Include metadata for provenance
- Reduce the need to re-do work
- Sensible data management
- A hierarchy of data directories only if that will always work
- Databases and formats that allow metadata

Use scalable, automatable tools:
- For large data sets, parallel tools: ParaView, VisIt, etc.
- Scriptable tools: gnuplot, python, R, ParaView/VisIt...
- Scripts provide reproducibility!
We have a fair amount of stuff on large-scale I/O generally, as this is an increasingly common stumbling block for scientists.
I'm trying to work out the best way to scale my site, and I have a question on how MSSQL will scale.
The table currently looks like this:
cache_id - int - identifier
cache_name - nvarchar(256) - used for lookup along with cache_event_id
cache_event_id - int - basically a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - data size will be from 2 KB to 5 KB
The data stored is a byte array that's basically a compressed cached instance of a page on my site.
The different ways of storing it that I see are:
1) One large table; it would contain tens of millions of records and easily become several gigabytes in size.
2) Multiple tables to hold the data above, meaning each table would have 200k to a million records.
The data from this table will be used to render web pages, so anything over 200 ms to get a record is bad in my eyes (I know some people think a 1-2 second page load is OK, but I think that's slow and want to do my best to keep it lower).
So it boils down to: what is it that slows down SQL Server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it become cost-effective to use multiple database servers?
If it's close to impossible to predict these things, I'll accept that as a reply too. I'm not a DBA, and I'm basically trying to design my DB so I don't have to redesign it later once it contains huge amounts of data.
So it boils down to: what is it that slows down SQL Server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it become cost-effective to use multiple database servers?
This is all a 'rule of thumb' view:
The load on a DB (and therefore, to a considerable extent, its performance) is largely a function of two issues: data volume and transaction load, with the second, IMHO, generally being the more relevant.
With regard to data volume, one can hold many gigabytes of data and get acceptable access times by way of normalisation, indexing, partitioning, fast I/O systems, appropriate buffer cache sizes, etc. Some of these, e.g. normalisation, are issues one considers at DB design time; others, e.g. adding or removing indexes and sizing the buffer cache, come up during system tuning.
The transactional load is largely a function of code design and the total number of users. Code design includes factors like getting the transaction size right (small and fast is the general goal, but like most things it is possible to take it too far and have transactions that are too small to retain integrity, or so small that they themselves add load).
When scaling, I advise scaling up first (a bigger, faster server), then out (multiple servers). The admin issues of a multi-server setup are significant, and I suggest it is only worth considering for a site with the OS, network and DBA skills and processes to match.
Normalize and index.
How, we can't tell you, because you haven't told us what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context that only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.
I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.)
Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).
My question is this: In an applied analytical environment, how do the different column based DB's differ? How should I be thinking about them? Anyone have practical experience with multiple column based systems? Can I leverage my SQL experience with these DBs or am I going to have to learn a new language?
I am ultimately going to be pulling data into R for analysis.
EDIT: I was asked for some clarification about what exactly I am trying to do. So, here's an example of what I would like to do:
Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation from min, and percent deviation from max for each row and add them to the original table. This table data does not get new rows each day; it gets TOTALLY replaced, and the process is repeated. Heaven forbid the process must be stopped. And the logs... ohhhhh the logs! :)
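For concreteness, a rough sketch of the shape of that pipeline (shown here in pandas with made-up column names; it is only meant to illustrate the access pattern, not how I actually run it):

    import numpy as np
    import pandas as pd

    # Hypothetical table: 5 dimension columns d1..d5 and 15 fact columns f1..f15.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({**{f"d{i}": rng.integers(0, 50, 100_000) for i in range(1, 6)},
                       **{f"f{i}": rng.random(100_000) for i in range(1, 16)}})
    facts = [f"f{i}" for i in range(1, 16)]

    for dim in [f"d{i}" for i in range(1, 6)]:
        # One aggregation per dimension: min/max/mean of every fact, joined back in.
        agg = df.groupby(dim)[facts].agg(["min", "max", "mean"])
        agg.columns = [f"{fact}_{stat}_by_{dim}" for fact, stat in agg.columns]
        df = df.join(agg, on=dim)
        for fact in facts:                      # percent deviation per row
            for stat in ("min", "max", "mean"):
                base = df[f"{fact}_{stat}_by_{dim}"]
                df[f"{fact}_dev_{stat}_by_{dim}"] = (df[fact] - base) / base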
The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.
A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.
Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.
Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
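A toy illustration of that arithmetic (dictionary encoding of a low-cardinality string column; the strings are made up):

    import math

    values = [f"region-{i:02d}-somewhere" for i in range(25)] * 40_000    # ~20-byte strings
    distinct = sorted(set(values))
    bits_per_value = math.ceil(math.log2(len(distinct)))    # 25 distinct values -> 5 bits
    raw_bytes = sum(len(v) for v in values)
    dict_bytes = len(values) * bits_per_value // 8 + sum(len(v) for v in distinct)
    print(bits_per_value, raw_bytes // dict_bytes)           # about a 30x size reduction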
Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements also improve locality of reference, reduce cache paging activity, and allow greater compression factors, so the CPU gains as well.
Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
The other factor is whether to use a massively parallel processing (MPP) database. There are MPP row-store and column-store databases. MPP databases can scale up to hundreds or thousands of nodes and allow you to store humongous amounts of data, but sometimes come with compromises like a weaker notion of transactions or a not-quite-SQL query language.
I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is an open-source column-store database, optimized for analytic applications, and it also has other features such as bitmap indexes. It currently only runs on one node, but utilizes several cores effectively and can handle reasonable volumes of data without much effort.
4 million rows times 20 columns times 8 bytes for a double is 640 MB. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 GB. That is not a lot by today's standards.
So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of RAM (say 8 GB or more). Installing Ubuntu or Debian (possibly the server version) can be done in a few minutes.
I have some experience with Infobright Community Edition, a column-oriented DB based on mysql.
Pro:
you can use the mysql interfaces/ODBC mysql drivers, from R too
fast enough queries when selecting big chunks of data (thanks to the KnowledgeGrid & data packs)
a very fast native data loader and connectors for ETL (Talend, Kettle)
optimized for exactly the operations that I (and I think most of us) use (selection by factor levels, joining, etc.)
a special "lookup" option for optimized storage of R factor variables ;) (OK, char/varchar variables with a relatively small number of levels compared to the number of rows)
FOSS
paid support option
?
Cons:
no insert/update operations in the Community Edition (yet?); data loading only via the native data loader/ETL connectors
no official UTF-8 support (collation/sorting etc.), planned for Q3 2009
no functions in aggregate queries (e.g. SELECT MONTH(date) FROM ...) yet, planned for July(?) 2009, but because of the column storage I prefer to simply create date columns for every aggregation level (week number, month, ...) I need
cannot be installed on an existing mysql server as a storage engine (because of its own optimizer, if I understood correctly), but you may install Infobright & mysql on different ports if you need to
?
Summary:
A good FOSS solution for daily analytical tasks and, I think, for your tasks as well.
Here are my 2 cents: SQL Server does not scale well. We attempted to use SQL Server to store financial data in real time (i.e. price ticks coming in for 100 symbols). It worked perfectly for the first 2 weeks - then it went slower and slower as the database size increased, and finally ground to a halt, too slow to insert each price as it was received. We tried to work around it by moving data from the active database to offline storage every night, but ultimately the project was abandoned as it just didn't work.
Bottom line: if you're planning on storing a lot of data ( >1GB) you need something that scales properly, and that probably means a column database.
It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.
Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.
We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.
I'm creating a database, and am prototyping and benchmarking first. I am using H2, an open-source, commercially free, embeddable, relational Java database. I am not currently indexing any column.
After the database grew to about 5 GB, its batch write speed halved (the rate of writing slowed to about half the original rate). I was writing roughly 25 rows per millisecond with a fresh, clean database, and now at 7 GB I'm writing roughly 7 rows/ms. My rows consist of a short, an int, a float, and a byte[5].
I do not know much about database internals or even how H2 was programmed. I would also like to note I'm not badmouthing H2, since this is a problem with other DBMSs I've tested.
What factors might slow down the database like this if there's no indexing overhead? Does it mainly have something to do with the file system structure? From my results, I assume the way Windows XP and NTFS handle files makes it slower to append data to the end of a file as the file grows.
One factor that can complicate inserts as a database grows is the number of indexes on the table, and the depth of those indexes if they are B-trees or similar. There's simply more work to do, and it may be that you're causing index nodes to split, or you may simply have moved from, say, a 5-level B-tree to a 6-level one (or, more generally, from N to N+1 levels).
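A back-of-the-envelope way to see that N-to-N+1 jump (the fanout of ~200 keys per index page is an assumption, not an H2-specific figure):

    import math

    def btree_levels(rows: int, fanout: int = 200) -> int:
        # Roughly log_fanout(rows) levels; each extra level is one more page
        # touched (and potentially split) on every insert.
        return max(1, math.ceil(math.log(rows, fanout)))

    for rows in (10**6, 10**8, 10**10):
        print(f"{rows:>14,} rows -> ~{btree_levels(rows)} levels")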
Another factor could be disk space usage -- if you are using cooked files (that's the normal kind for most people most of the time; some DBMS use 'raw files' on Unix, but it is unlikely that your embedded system would do so, and you'd know if it did because you'd have to tell it to do so), it could be that your bigger tables are now fragmented across the disk, leading to worse performance.
If the problem were with SELECT performance, many other factors could also be affecting your system's performance.
This sounds about right. Database performance usually drops significantly as the data can no longer be held in memory and operations become disk bound. If you are using normal insert operations, and want a significant performance improvement, I suggest using some sort of a bulk load API if H2 supports it (like Oracle sqlldr, Sybase BCP, Mysql 'load data infile'). This type of API writes data directly to the data-file bypassing many of the database subsystems.
This is most likely caused by variable-width fields. I don't know if H2 allows this, but in MySQL you have to create your table with all fixed-width fields, then explicitly declare it as a fixed-width-field table. This allows MySQL to calculate exactly where it needs to go in the database file to do the insert. If you aren't using a fixed-width table, it has to read through the table to find the end of the last row.
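A toy illustration of why fixed-width rows help, using the row layout from the question (the exact packing is an assumption):

    import struct

    # One way to pack a short, an int, a float and 5 raw bytes with no padding.
    ROW_FORMAT = "<hif5s"
    ROW_SIZE = struct.calcsize(ROW_FORMAT)    # 15 bytes with this packing

    def row_offset(k: int) -> int:
        # With fixed-width rows, finding row k (or the end of the file) is pure
        # arithmetic - no scanning through variable-length records is needed.
        return k * ROW_SIZE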
Appending data (if done right) is an O(n) operation, where n is the length of the data to be written. It doesn't depend on the file length; seek operations skip over the existing data easily.
For most databases, appending to a database file is definitely slower than pre-growing the file and then adding rows. See if H2 supports pre-growing the file.
Another factor is whether the entire database can be held in memory or whether the OS has to do a lot of disk swapping to find the location in which to store the record.
I would blame it on I/O, especially if you're running your database on a normal PC with a normal hard disk (by that I mean not on a server with super-fast drives, etc.).
Many database engines maintain an implicit integer primary key for each row, so even if you haven't declared any indexes, your table is still indexed. This may be a factor.
Using H2 for a 7 GB data file is the wrong choice from a technological point of view. As you said, it's embeddable. What kind of "embedded" application do you have, if you need to store that much data?
Are you performing incremental commits? Since H2 is an ACID-compliant database, if you are not performing incremental commits, then there is some type of redo log so that, in the case of an accidental failure (say, a power outage) or a rollback, the changes can be rolled back.
In that case, your redo log may be growing large, overflowing memory buffers and needing to be written out to disk along with your actual data, adding to your I/O overhead.
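A generic sketch of what incremental commits look like (using Python's sqlite3 module as a stand-in for H2 over JDBC; the pattern is the same, and the data source here is invented to match the question's row shape):

    import os
    import sqlite3

    def generate_rows(n=100_000):
        # Stand-in data source: a short, a float and 5 raw bytes per row.
        for i in range(n):
            yield (i % 32768, float(i), os.urandom(5))

    BATCH = 10_000
    conn = sqlite3.connect("bench.db")
    conn.execute("CREATE TABLE IF NOT EXISTS t (a INTEGER, b REAL, c BLOB)")

    for i, row in enumerate(generate_rows(), start=1):
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", row)
        if i % BATCH == 0:
            conn.commit()    # keep the pending transaction (and its undo/redo log) bounded
    conn.commit()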