Scalable, fast, text file backed database engine?

I am dealing with large amounts of scientific data that are stored in tab separated .tsv files. The typical operations to be performed are reading several large files, filtering out only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.
The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.
As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.
Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as the backend? It seems to me like a very generic problem that virtually all scientists get to deal with in one way or another.
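For concreteness, a typical pipeline of the kind I mean looks roughly like this minimal Python sketch (the file names and column names are made up for illustration):

    # Stream a large .tsv, keep only some columns/rows, join against a smaller
    # table held in memory, add a calculated value, and write a new .tsv.
    import csv

    def read_tsv(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f, delimiter="\t")

    # Smaller source of data pulled fully into memory for the join.
    annotations = {r["gene_id"]: r["description"] for r in read_tsv("annotations.tsv")}

    with open("result.tsv", "w", newline="") as out:
        writer = csv.DictWriter(out, delimiter="\t",
                                fieldnames=["gene_id", "score", "ratio", "description"])
        writer.writeheader()
        for row in read_tsv("measurements.tsv"):
            if float(row["score"]) >= 0.05:                        # row filter
                continue
            writer.writerow({
                "gene_id": row["gene_id"],                         # column selection
                "score": row["score"],
                "ratio": float(row["a"]) / float(row["b"]),        # calculated value
                "description": annotations.get(row["gene_id"], ""),  # join
            })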

There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is about 33%: most 16-bit integers take 5 characters plus a delimiter = 6 bytes, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
I would consider taking the existing data, and instead of storing raw, processing it in the following two ways:
Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.
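If you keep the archive copy compressed, the data is still plain .tsv inside and can be streamed without a conversion step; a minimal sketch, assuming Python's gzip and csv modules (the paths and column names are hypothetical):

    # Stream a gzip-compressed .tsv straight from the archive; the data stays
    # plain tab-separated text inside the compressed container.
    import csv
    import gzip

    with gzip.open("archive/run_0042.tsv.gz", mode="rt", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["sample"] == "control":       # hypothetical filter
                print(row["value"])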
You should be able to reduce your storage space and should not need twice as much storage space, as you state.
Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.

One of the NoSQL databases might work. I highly doubt any are configurable to sit on top of flat, delimited files. You might look at one of the open source projects and write your own database layer.

Scalability begins at a point beyond tab-separated ASCII.
Just be practical - don't academicise it - convention frees your fingers as well as your mind.

I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store it in a different format, like the database Jason was suggesting, you pay the parsing cost on every operation instead of just once when you initially process it.

You can do this with LINQ to Objects if you are in a .NET environment. Streaming/deferred execution, a functional programming model and all of the SQL operators. The joins will work in a streaming model, but one table gets pulled entirely into memory, so you need a large-table-joined-to-smaller-table situation.
The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.
LINQ against a delimited text file is a common demonstration of LINQ. You need to provide the ability to feed LINQ a tabular model. Google LINQ for text files for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).
Expect a learning curve, but it's a good solution for your problem. One of the best treatments on the subject is Jon Skeet's C# in Depth. Pick up the "MEAP" version from Manning for early access to his latest edition.
I've done work like this before with large mailing lists that needed to be cleansed, deduplicated and appended. You are invariably IO bound. Try solid-state drives, particularly Intel's "E" series, which has very fast write performance, and RAID them in as parallel a configuration as possible. We also used grids, but had to adjust the algorithms to do multi-pass approaches that would reduce the data.
Note I would agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case, you're basically doing ETL, which is a well understood problem in the warehousing community. If the data is ad hoc, however (scientists just drop their results in a directory, you need "agile/just-in-time" transformations, and most transformations are single-pass select ... where ... join), then you're approaching it the right way.
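If you end up outside .NET, the same streaming/deferred-execution idea can be sketched with plain Python generators; this is only an analogue of the LINQ pipeline described above, not LINQ itself, and the names are made up:

    # Deferred execution with generators: nothing is read or computed until the
    # final loop consumes the pipeline, and only the smaller table is held in memory.
    import csv

    def scan(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f, delimiter="\t")

    def where(rows, predicate):
        return (r for r in rows if predicate(r))

    def select(rows, projection):
        return (projection(r) for r in rows)

    def hash_join(big_rows, small_rows, key):
        # Pull the smaller table into memory, stream the big one past it.
        lookup = {}
        for r in small_rows:
            lookup.setdefault(key(r), []).append(r)
        for r in big_rows:
            for match in lookup.get(key(r), []):
                yield {**r, **match}

    pipeline = select(
        where(hash_join(scan("big.tsv"), scan("small.tsv"), key=lambda r: r["id"]),
              lambda r: float(r["value"]) > 0.0),
        lambda r: (r["id"], r["value"]))
    for row in pipeline:      # execution actually happens here, row by row
        print(row)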

You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33 GB XML file. This file takes 18 minutes to read in and persist as objects (one per Wikipedia topic) and store in compact databases. Many samples are shown for how to read in tab-separated text files as part of the download.

The question's already been answered, and I agree with the bulk of the statements.
At our centre, we have a standard talk we give, "so you have 40TB of data", as scientists are finding themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those who are new to it. The basic points we try to get across:
Plan your I/O:
  Binary files
  As much as possible, large files
  File formats that can be read in parallel, subregions extracted
  Avoid zillions of files
  Especially avoid zillions of files in a single directory
Data management must scale:
  Include metadata for provenance
  Reduce the need to re-do
  Sensible data management
  Hierarchy of data directories only if that will always work
  Databases, formats that allow metadata
Use scalable, automatable tools:
  For large data sets, parallel tools - ParaView, VisIt, etc.
  Scriptable tools - gnuplot, python, R, ParaView/VisIt...
  Scripts provide reproducibility!
We have a fair amount of stuff on large-scale I/O generally, as this is an increasingly common stumbling block for scientists.
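As one concrete illustration of the "binary, large, parallel-readable files with provenance metadata" points above, here is a minimal sketch assuming HDF5 via h5py; the dataset names and attributes are made up:

    # Consolidate many small per-run results into one large HDF5 file instead of
    # zillions of small files, and attach provenance as attributes.
    import h5py
    import numpy as np

    with h5py.File("results.h5", "a") as f:
        grp = f.require_group("experiment_17/run_0042")
        data = np.loadtxt("run_0042.tsv", delimiter="\t")   # the original (numeric) text output
        dset = grp.create_dataset("measurements", data=data,
                                  compression="gzip", chunks=True)
        dset.attrs["source_file"] = "run_0042.tsv"
        dset.attrs["code_version"] = "v1.3.2"
        dset.attrs["created_by"] = "analysis_pipeline.py"

    # Later, read back only a subregion without touching the rest of the file.
    with h5py.File("results.h5", "r") as f:
        window = f["experiment_17/run_0042/measurements"][1000:2000, :3]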

Related

Storing and accessing a large number of relatively small files

I am running lots of very slow computations with reusable results (and often computing something new relies on a computation that was already performed before). To make use of them, I want to store the results somewhere (permanently). The computations can be uniquely identified by two identifiers: experiment name and computation name, and the value is an array of floats (which I currently store as raw binary data).
They need to be individually accessed (read and written) by experiment and computation name very often, and sometimes also just by experiment name (i.e. all computations with their results for a given experiment). They are also sometimes concatenated, but if reading and writing is fast, no additional support for this operation is needed.
This data will not need to be accessed by any web application (it is used only by non-production scripts that need the results of the computations, but calculating them each time is not feasible), and there is no need for transactions, but every write needs to be atomic (e.g. turning off the computer should not result in corrupted/partial data). Reading also needs to be atomic (e.g. if two processes try to access the result of one computation and it's not there, so one of them starts saving the new result, the other process should either receive it when it's done, or receive nothing at all). Accessing the data remotely is not required, but helpful.
So, TL;DR requirements:
permanent storage of binary data (no metadata other than the identifier needs to be stored)
very fast access (read/write) based on a compound identifier
ability to read all data by one part of a compound identifier
concurrent, atomic read/write
no need for transactions, complex queries, etc.
remote access would be nice to have, but not required
the whole thing is there mostly to save time, so speed is critical
The solutions I tried so far are:
storing them as individual files (one directory per experiment, one binary file per computation) - requires manual handling of atomicity, and also most file systems support file names only up to 255 characters long (and computation names may be longer than that), so an additional mapping would be required; also I'm not sure if ext4 (which is the filesystem I'm using and can't change) is designed to handle millions of files
using a sqlite database (with just one table and a compound primary key) - at first it seemed perfect, but when we got to hundreds of gigabytes of data (millions of ~100 KB blobs, and both number of them and their size will increase), it started being really slow, even after applying optimizations found on the internet
Naturally, after sqlite failed, the first idea was to just move to a "proper" database like postgres, but then I realized that perhaps in this case a relational database is not really the way to go (especially since speed is critical here, and I don't need most of their features) - and postgres especially is probably not the way to go, since the closest thing to a blob is bytea, which requires additional conversions (so a performance hit is guaranteed). However, after researching a bit about key-value databases (which seemed to apply to my problem), I found out that all of the databases I checked do not support compound keys, and often have length limitations for keys (e.g. couchbase allows just 250 bytes). So, should I just go with a normal relational database, try one of the NoSQL databases, or maybe something completely different like HDF5?
One way to improve on the database solution is to externalize the data blob.
You can use SeaweedFS https://github.com/chrislusf/seaweedfs as an object store, upload the blob and get a file id, and then store the file id in the database. (I am working on SeaweedFS.)
This should reduce the database load quite a bit, and querying will be much faster.
So, I ended up using a relational database anyway (since only there I could use compound keys without any hacks).
I performed a benchmark to compare sqlite with postgres and mysql - 500 000 inserts of ~60 KB blobs and then 50 000 selects by the whole key. This was not enough to slow down sqlite to the unacceptable levels I was experiencing, but set a point of reference (i.e. the speed at which sqlite was running with this few records was acceptable to me). I assumed that I wouldn't experience a huge performance hit when adding more records with mysql and postgres (since they were designed to work with much larger amounts of data than sqlite), and when finally using one of them, that turned out to be true.
The settings (other than defaults) were following:
sqlite: journal_mode=WAL (required for parallel access), isolation level autocommit, values as BLOB
postgres: isolation level autocommit (can't turn off transactions, and doing everything in one huge transaction was not an option for me), values as BYTEA (which sadly includes the double conversion I wrote about)
mysql: engine=aria, transactions disabled, values as MEDIUMBLOB
As you can see, I was able to customize mysql much more to fit the task at hand. The results below reflect it well:
             sqlite        postgres      mysql
selects        90.816292     191.910514    106.363534
inserts      4367.483822    7227.473075   5081.281370
Mysql had similar speed to sqlite, with postgres being significantly slower.
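For reference, the sqlite variant of that setup boils down to something like this minimal sketch (Python's sqlite3; the table and column names are hypothetical):

    # Compound primary key, WAL for parallel access, autocommit, raw BLOB values,
    # roughly matching the sqlite settings listed above.
    import sqlite3

    conn = sqlite3.connect("results.db", isolation_level=None)   # autocommit
    conn.execute("PRAGMA journal_mode=WAL")                       # parallel access
    conn.execute("""
        CREATE TABLE IF NOT EXISTS results (
            experiment  TEXT NOT NULL,
            computation TEXT NOT NULL,
            value       BLOB NOT NULL,
            PRIMARY KEY (experiment, computation)
        )""")

    def save(experiment, computation, raw_bytes):
        conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
                     (experiment, computation, raw_bytes))

    def load(experiment, computation):
        row = conn.execute("SELECT value FROM results WHERE experiment=? AND computation=?",
                           (experiment, computation)).fetchone()
        return None if row is None else row[0]

    def load_experiment(experiment):
        return conn.execute("SELECT computation, value FROM results WHERE experiment=?",
                            (experiment,)).fetchall()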

Excel file size efficiency

I am working with an organisation that is using Excel as its data storage center right now. Their data might scale up in the near future, so I was wondering at what point the data size becomes too large for Excel to handle efficiently. The data stored is just strings and integers of limited lengths (below 50 characters, and integer values between 1 and 100000).
Is there some kind of general guideline to how big an Excel sheet can be before it becomes inefficient and you should use a database like SQL Server instead?
I have usually found that once the Excel file size crosses 30 MB it becomes very slow and unresponsive. To manipulate larger data sizes I started using an add-in called Power Pivot, which is very fast! See instructions on how to enable it here - https://support.office.com/en-us/article/Start-the-Power-Pivot-in-Microsoft-Excel-add-in-a891a66d-36e3-43fc-81e8-fc4798f39ea8
Efficiency is subjective and depends on your goals. The thing is that an RDBMS like SQL Server will always be more efficient, because you get optimized searches instead of reading through a file. Also, relations between tables are handled well, and if you design your DB schema well, you will have benefits like handling concurrent access and avoiding redundancy and inconsistency. It is worth the effort in the long run.

Need a highly compressed datastore for Crawl data and log data

I have to store a lot of crawl and log data in a datastore with an efficient compression ratio.
So far I have tried and installed Cassandra, Couchbase, MySQL and a flat-file format, and I have read the architectural overviews of Google Bigtable, Hypertable and the LevelDB file layout.
Cassandra and Couchbase are about 1/5 of the disk size of the uncompressed MySQL database, but I want better results.
So I need a simple data store with high compression features like those in the Vertica, Teradata, Oracle and SQL Server products (page-level compression).
The actual flat-file dataset looks like:
/oil_type/gas_station/2014-03/2014-03-05-23.csv
/oil_type/gas_station/2014-03/2014-03-06-00.csv
/oil_type/gas_station/2014-03/2014-03-06-01.csv
Each file contains about 400 highly redundant entries of about 5 KB each.
A file can be compressed from 1722 KB to 39 KB, so a compression ratio of 44:1, and up to 100:1 depending on the compression chunk size, should be possible.
Defining the use case:
I have to poll all relevant gas_station webpages/APIs every 30 seconds to get up-to-the-minute pricing information. Because it is not possible to write a parser for every gas station, a generic solution is required for index creation. With a database holding all crawled gas station pages, a generic parser can easily be developed and backtested. With this raw data model, data loss through broken station-specific converters should be avoided.
With keys like "oil_type-gas_station-timestamp-content", it's easy and efficient to compare two gas stations' pricing over time. For reading a time series that is smaller than the compression chunk size, only 2 to 4 chunks should need to be decompressed.
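To illustrate the chunk-size trade-off, here is a rough sketch (Python's zlib; the record layout and sizes are made up) of compressing records in fixed-size chunks with a key-to-chunk index, so a point lookup decompresses only one chunk while larger chunks give the redundant data a better ratio:

    # Group highly redundant records into chunks of roughly CHUNK_TARGET bytes,
    # compress each chunk separately, and remember which chunk holds each key.
    # (Assumes keys contain no tabs and payloads contain no raw newlines.)
    import zlib

    CHUNK_TARGET = 256 * 1024      # the tunable compression chunk size

    def build(records):
        # records: iterable of (key, payload_bytes), key e.g. "oil_type-gas_station-timestamp"
        chunks, index, buf, size = [], {}, [], 0
        for key, payload in records:
            index[key] = len(chunks)            # this record lands in the next chunk
            buf.append(key.encode() + b"\t" + payload + b"\n")
            size += len(buf[-1])
            if size >= CHUNK_TARGET:
                chunks.append(zlib.compress(b"".join(buf), 9))
                buf, size = [], 0
        if buf:
            chunks.append(zlib.compress(b"".join(buf), 9))
        return chunks, index

    def lookup(chunks, index, key):
        raw = zlib.decompress(chunks[index[key]])          # only one chunk touched
        for line in raw.split(b"\n"):
            if line.startswith(key.encode() + b"\t"):
                return line.split(b"\t", 1)[1]
        return None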
So the following features are optimal:
SSTables
Configurable compression options (level, compression engine, chunk size from 64 KB to 10 MB)
Range Scans
Java Bindings
Column data store for better compression
Nice to have:
Replication
Multi Master
Write quorum of 1
Forward and backward iteration over the data (to compare two time series)
Configurable replica distribution
Few dependencies
Question:
Which free database is able to hold archives of highly redundant crawl data (only a few bytes change between entries), compresses well, and does not take too much time to query a random record?
(As opposed to the MySQL ARCHIVE format, which has to decompress the whole table up to the requested row.)
Maybe there is a log database that is able to index a lot of log lines and compress them internally? (In the scope of Logstash, Fluentd, Flume.)
If someone knows of benchmarks or numbers on this topic, it would help a lot in evaluating the right technology.
I am grateful for any help!
Assuming you are in a multithreaded, possibly multi-process environment, LevelDB is NOT a good idea.
Cassandra is written in Java, therefore you'll see excessive memory consumption when handling a big load of big files, at least without tweaking the JVM. Also, since it's written in Java, it probably won't be fast enough for really good compression.
I use HyperTable on my Linux-box to store photos and movies.
You can use HyperTable from any language with Thrift-support.
Additionally, if you need it, you can use the C++ drivers, for extra-speed.
One thing that is nice about HyperTable is that it doesn't add a dependency on Java, since it's written in C++, which also means it's blazing fast and not garbage-collected (no memory overhead).
HyperTable does have a Java client out of the box, however.
I use my own C# Thrift-client, which I ported from Java.
See >here< for the code.
Since HyperTable operates on Byte-Arrays, you can simply put your file in the thrift-client as byte-array, and HyperTable will compress it for you automagically, if you have told it to do so in the column definition.
You could also try MongoDb, if you absolutely want to.
Mongo actually derives from humongous, by the way.
However, I must say I have never "really" used it.

Flexible storage and retrieval of motion capture data

I want to flexibly access motion capture data from C/C++ code. We currently have a bunch of separate files (.c3d format). We can expect the full set of data to be several hours long, tracking about 50 markers (4 floats each) per frame, sampled at 60 Hz. So we're probably looking at a couple of gigabytes of data.
I'd like to have a database that can hold the data, allowing it to be relatively rapidly retrieved, augmented, and modified. I'd also like to be able to apply labels to the data and retrieve sequences of frames by label, time indices (e.g., frames 400-2000, or every 30th frame) or other potential criteria.
Does such a thing already exist? Could I do it with SQLite for example? Does anyone have an intuition for what kind of performance I might get?
Currently, I'm just loading one .c3d file at a time and processing it. I haven't yet begun to apply meta-data/labels to sequences. I'll be accessing the sequences for visualization, statistical analysis, and training for machine-learning.
If you need to store multi-gigabytes of data with a known schema you might want to look into a binary flat file database. Of those available, I would recommend HDF5. It is not a relational database like SQLite, but provides rich support for array and matrix data with excellent performance. It also includes MPI support, if you ever expand your machine-learning onto a cluster.
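As a rough sketch of what that could look like with h5py (the file, group and label names are made up; the shapes follow the 50 markers x 4 floats at 60 Hz description above):

    # One (n_frames, 50, 4) float32 dataset per take, chunked and compressed, so
    # frame ranges or strided subsamples can be read without loading the whole take.
    import h5py
    import numpy as np

    with h5py.File("mocap.h5", "a") as f:
        take = f.require_group("takes/walk_cycle_03")
        frames = np.zeros((12000, 50, 4), dtype="f4")    # ~200 s at 60 Hz, e.g. from a .c3d
        d = take.create_dataset("markers", data=frames,
                                chunks=(600, 50, 4), compression="gzip")
        d.attrs["sample_rate_hz"] = 60.0
        d.attrs["labels"] = np.array([b"walking", b"turn_left"])   # sequence labels

    with h5py.File("mocap.h5", "r") as f:
        d = f["takes/walk_cycle_03/markers"]
        clip = d[400:2000]       # frames 400-2000
        thinned = d[::30]        # every 30th frame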

How should I store extremely large amounts of traffic data for easy retrieval?

For a traffic accounting system I need to store large numbers of datasets about internet packets sent through our gateway router (containing timestamp, user id, destination or source IP, number of bytes, etc.).
This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.
What is a good way to do this? I already have some ideas:
Create a file for each user and day and append every dataset to it.
Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
Use a database
Advantage: It's very easy to find specific data with the right SQL query.
Disadvantage: I'm not sure if there is a database engine that can efficiently handle a table with possibly hundreds of millions of datasets.
Perhaps it's possible to combine the two approaches: Using an SQLite database file for each user.
Advantage: It would be easy to get information for one user using SQL queries on his file.
Disadvantage: Getting overall information would still be difficult.
But perhaps someone else has a very good idea?
Thanks very much in advance.
First, get The Data Warehouse Toolkit before you do anything.
You're doing a data warehousing job, you need to tackle it like a data warehousing job. You'll need to read up on the proper design patterns for this kind of thing.
[Note Data Warehouse does not mean crazy big or expensive or complex. It means Star Schema and smart ways to handle high volumes of data that's never updated.]
SQL databases are slow, but that slowness buys you flexible retrieval.
The filesystem is fast. It's a terrible thing for updating, but you're not updating, you're just accumulating.
A typical DW approach for this is the following:
Define the "Star Schema" for your data: the measurable facts and the attributes ("dimensions") of those facts. Your fact appears to be # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.
Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.). Each dimension will have all the attributes you might ever want to know. This grows; people are always adding attributes to dimensions.
Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.
Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.
When someone wants to do analysis, build them a datamart.
For the selected IP address or time frame or whatever, get all the relevant facts, plus the associated master dimension data and bulk load a datamart.
You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SELECT SUM(...) with various GROUP BY, HAVING and WHERE clauses.
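A rough sketch of that star schema and load step, using Python and SQLite purely to keep it self-contained (all table and column names are hypothetical):

    # One fact table (# of bytes) plus dimension tables, and a load step that
    # resolves dimension keys before writing fact rows.
    import sqlite3

    db = sqlite3.connect("traffic_mart.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS dim_user (user_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS dim_ip   (ip_id   INTEGER PRIMARY KEY, addr TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS dim_day  (day_id  INTEGER PRIMARY KEY, day  TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS fact_traffic (
            user_id INTEGER REFERENCES dim_user(user_id),
            ip_id   INTEGER REFERENCES dim_ip(ip_id),
            day_id  INTEGER REFERENCES dim_day(day_id),
            bytes   INTEGER            -- the measurable fact
        );""")

    def dim_key(table, column, key_column, value):
        # Resolve (or create) the dimension row and return its surrogate key.
        db.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
        return db.execute(f"SELECT {key_column} FROM {table} WHERE {column} = ?",
                          (value,)).fetchone()[0]

    def load(log_rows):
        # log_rows: parsed router log entries as (user, ip, day, nbytes) tuples.
        for user, ip, day, nbytes in log_rows:
            db.execute("INSERT INTO fact_traffic VALUES (?, ?, ?, ?)",
                       (dim_key("dim_user", "name", "user_id", user),
                        dim_key("dim_ip", "addr", "ip_id", ip),
                        dim_key("dim_day", "day", "day_id", day),
                        nbytes))
        db.commit()

    # Mart queries then reduce to SUM/COUNT with GROUP BY, e.g.:
    #   SELECT u.name, SUM(f.bytes) FROM fact_traffic f
    #   JOIN dim_user u ON u.user_id = f.user_id GROUP BY u.name;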
I think the proper answer really depends on the definition of a "dataset". As you mention in your question, you are storing individual sets of information for each record: timestamp, user id, destination IP, source IP, number of bytes, etc.
SQL Server is perfectly capable of handling this type of data storage with hundreds of millions of records without any real difficulty. Granted, this type of logging is going to require some good hardware to handle it, but it shouldn't be too complex.
Any other solution in my opinion is going to make reporting very hard, and from the sounds of it that is an important requirement.
So you are in one of the cases where you have much more write activity than read, you want your writes not to block you, and you want your reads to be "reasonably fast" but not critical. It's a typical business intelligence use case.
You should probably use a database and store your data in a "denormalized" schema to avoid complex joins and multiple inserts for each record. Think of your table as a huge log file.
In this case, some of the "new and fancy" NoSQL databases are probably what you're looking for: they provide relaxed ACID constraints, which you should not terribly mind here (in case of a crash, you can lose the last lines of your log), but they perform much better for insertion, because they don't have to sync journals on disk at each transaction.
