C: Sorting Big Data; Not in Memory - c

I'm learning to work with large amounts of data.
I've generated a file of 10,000,000 ints. I want to perform a number of sorts on the data and time the sorts (maybe plot the values for performance analysis?) but I've never worked with big data before and don't know how to sort (say, even bubble sort!) data that isn't in memory! I want to invoke the program like so:
./mySort < myDataFile > myOutFile
How would I go about sorting data that can't fit into a linked list, or array?

There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
One of the best references on this, though rather technical and dense is Donald Knuth's treatment of tape sorting algorithms. Back in the day where data was stored on tape and could only be read sequentially and then written out to other tapes this kind of sorting was often done by repeatedly shuffling data back and forth between different tape drives.
Depending upon the size and type of dataset you are working with it may be worthwhile to make use of either a dedicated database to load the data into, or to make use of a cloud based service like Google's BigQuery. BigQuery has no cost to upload and download your dataset, you just pay for the processing. The first TB of processed data each month is free and you have less than even one GB of data.
Edit: Here's a very nice set of undergraduate lecture notes on external sorting algorithms. http://www.math-cs.gordon.edu/courses/cs321/lectures/external_sorting.html

You need to use external sorting
Bring in part of data at a time , sort it in memory and then merge it
More details here
http://en.m.wikipedia.org/wiki/External_sorting

Related

How to train a machine learning model on huge amount of data?

KEY POINT: the Dataset is so large that I am barely able to store it in hardware. (PetaBytes)
Say I have trillions and trillion of rows in a dataset. This dataset is too large to be stored in memory. I want to train a machine learning model, say logisitc regression, on this dataset. How do I go about this?
Now, I know Amazon/Google does machine learning on huge amounts of data. How do they go about it? For example, click dataset, where globally each smart devices' inputs are stored in a dataset.
Desperately looking for new ideas and open to corrections.
My train of thoughts:
load a part of data into the memory
Perform gradient descent
This way the optimization is mini batch descent.
Now the problem is, in the optimization, be it SGD or mini batch, it stops when it has gone through ALL the data in the worst case. Traversing the whole dataset is not possible.
So I had the idea of early stopping. Early stopping reserves a validation set and will stop optimization when the error stops going down/converges on the validation set. But again this might not be feasible due to the size of the dataset.
Now I am thinking of simply random sampling a training set and a test set, with workable sizes to train the model.
Pandas read function loads the entire data into ram, which can be an issue.To solve this process the data in chunks.
In case of a huge amount of data, you can use batches for training the dataset. Use complex models such as Neural networks, xgboost instead of Logistic Regression.
Check out this website for more information on how to handle big data.

Suggestions on how to store and retrieve time-series data

I am currently working on a project that requires us to store a large amount of time series data, but more importantly, retrieve large amounts of it quick.
There will be N devices (>10,000) which will periodically send data to the system, lets say every 5 seconds. This data will quickly build up, but we are generally only interested in the most recent data, and want to compact the older data. We don't want to remove it, as it is still useful, but instead of having thousands of data point for a day, we might save just 5 or 10 after N days/weeks/months have passed.
Specifically we want to be able to fetch sampled data over a large time period, say a year or two. There might be millions of points here, but we just want a small, linearly distributed, sample of this data.
Today we are experimenting with influxdb, which initially seemed like an alright solution. It was fast enough and allows us to store our data in a reasonable structure, but we have found that it is not completely satisfactory. We were unable to perform the sample query described above and in general the system does not feel mature enough for us.
Any advice on how we can proceed, or alternative solutions, is much appreciated.
You might be interested in looking at TimescaleDB:
https://github.com/timescale/timescaledb
It builds a time-series DB on top of Postgres and so offers full SQL support, as well as generally the Postgres ecosystem/reliability. This can give you a lot greater query flexibility, which sounds like you want.
In terms of your specific use case, there would really be two solutions.
First, what people typically would do is to create two "hypertables", one for raw data, another for sampled data. These hypertables look like standard tables to the user, although heavily partitioned under the covers for much better scalability (e.g., 20x insert throughput vs. postgres for large table sizes).
Then you basically do a roll-up from the raw to the sampled table, and use a different data retention policy on each (so you keep raw data for say 1 month, with sampled data for years).
http://docs.timescale.com/getting-started/setup/starting-from-scratch
http://docs.timescale.com/api/data-retention
Second, you can go with a single hypertable, and then just schedule a normal SQL query to delete individual rows from data that's older than a certain time period.
We might even in the future add better first-class support for this latter approach if it becomes a common-enough requested feature, although most use cases we've encountered to date seemed more focused on #1, esp. in order to to keep statistical data about removed data-points, as opposed to just straight samples.
(Disclaimer: I'm one of the authors of TimescaleDB.)

How to store 7.3 billion rows of market data (optimized to be read)?

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
1 big table with 7.3B rows?
1000 tables (one for each stock symbol) with 7.3M rows each?
any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to deal with datasets this big, so this is an excellent opportunity for me to learn. I will appreciate a lot your help and advice.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS so I'm flexible on that
So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a hand-full of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
char ticker_code[2];
double stock_price;
timespec when;
etc
};
Then calculate sizeof(StockPrice[N]) where N is the number of records. (On a 64-bit system) It should only be a few hundred gig, and fit on a $50 HDD.
Then truncate a file to that size and mmap (on linux, or use CreateFileMapping on windows) it into memory:
//pseduo-code
file = open("my.data", WRITE_ONLY);
truncate(file, sizeof(StockPrice[N]));
void* p = mmap(file, WRITE_ONLY);
Cast the mmaped pointer to StockPrice*, and make a pass of your data filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmaped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
stocks[i] = ParseNextStock(stock_indata_file);
}
close(file);
You can now mmap it again read-only from any program and your data will be readily available:
file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.
I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing once and reading many times time-based numerical data is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with the InfluxDB, which is open sourced, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open sourced benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the last 10 minutes of data:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
Tell us about the queries, and your hardware environment.
I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll co-incidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.
If You're doing what are effectively big runs against many data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because
you can better exploit parallelism on relatively small commodity hardware.
you can also better implement high reliability and redundancy
many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.
Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.
It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
I think any major RDBMS would handle this. At the atomic level, a one table with correct partitioning seems reasonable (partition based on your data usage if fixed - this is ikely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example if your data is at day, but you often get data back at the wekk or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done though a cached view (various names for different DB solutions - but basically its a view on the atomic data, but once run the view is cached/hardened intoa fixed temp table - that is queried for subsequant matching queries. This can be dropped at interval to free up memory/disk space).
I guess we could help you more with some idea as to the data usage.
Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database which should be good for OLAP analysis, a free open source project:
http://github.com/kriasoft/market-data
First, there isn't 365 trading days in the year, with holidays 52 weekends (104) = say 250 x the actual hours of day market is opened like someone said, and to use the symbol as the primary key is not a good idea since symbols change, use a k_equity_id (numeric) with a symbol (char) since symbols can be like this A , or GAC-DB-B.TO , then in your data tables of price info, you have, so your estimate of 7.3 billion is vastly over calculated since it's only about 1.7 million rows per symbol for 14 years.
k_equity_id
k_date
k_minute
and for the EOD table (that will be viewed 1000x over the other data)
k_equity_id
k_date
Second, don't store your OHLC by minute data in the same DB table as and EOD table (end of day) , since anyone wanting to look at a pnf, or line chart, over a year period , has zero interest in the by the minute information.
Let me recommend that you take a look at apache solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).
You should compare the slow solutions with a simple optimized in memory model. Uncompressed it fits in a 256 GB ram server. A snapshot fits in 32 K and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as open of one often equals closing of the previous.
[edit] Why do you think it makes sense to use a database at all (rdbms or nosql)? This data doesn't change, and it fits in memory. That is not a use case where a dbms can add value.
If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.
Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.
Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory - its super-fast for reads and writes, you can do basic queries with it (non-RDBMS though) but is also an in-memory database.
Like others have said, knowing more about the queries you will be running will help.
You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.
A free columnar database is MonetDB.
If your use case is to simple read rows without aggregation, you can use Aerospike cluster. It's in memory database with support of file system for persistence. It's also SSD optimized.
If your use case needs aggregated data, go for Mongo DB cluster with date range sharding. You can club year vise data in shards.

Architecture and pattern for large scale, time series based, aggregation operation

I will try to describe my challenge and operation:
I need to calculate stocks price indices over historical period. For example, I will take 100 stocks and calc their aggregated avg price each second (or even less) for the last year.
I need to create many different indices like this where the stocks are picked dynamically out of 30,000~ different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as i can.
For that reason, i think a traditional RDBMS are too slow, and so i am looking for a sophisticated and original solution.
Here is something i had In mind, using NoSql or column oriented approach:
Distribute all stocks into some kind of a key value pairs of time:price with matching time rows on all of them. Then use some sort of a map reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
I would like some feedback on my approach, suggestion for tools and use cases, or suggestion of a completely different design pattern. My guidelines for the solution is price (would like to use open source), ability to handle huge amounts of data and again, fast lookup (I don't care about inserts since it is only made one time and never change)
Update: by fast lookup i don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to achieve this within minutes or so.
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
On a lower level, the time series files could be organized as a binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the several consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the meta data for each time series.
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the IO throughput of the HDs.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
More info about logparser:
Wikipedia
Coding Horror
What you really need is a relational database that has built in time series functionality, IBM released one very recently Informix 11.7 ( note it must be 11.7 to get this feature). What is even better news is that for what you are doing the free version, Informix Innovator-C will be more than adequate.
http://www.freeinformix.com/time-series-presentation-technical.html

Flexible storage and retrieval of motion capture data

I want to flexibly access motion capture data from C/C++ code. We currently have a bunch of separate files (.c3d format). We can expect the full set of data to be several hours long and tracking about 50 markers (4 floats each) per frame, sampled at 60 hz. So we're probably looking at a couple of gigabytes of data.
I'd like to have a database that can hold the data, allowing it to be relatively rapidly retrieved, augmented, and modified. I like to be able to apply labels to the data and retrieve sequences of frames by label, time indices (e.g., frame 400-2000, or every 30th frame) or other potential criteria.
Does such a thing already exist? Could I do it with SQLite for example? Does anyone have an intuition for what kind of performance I might get?
Currently, I'm just loading one .c3d file at a time and processing it. I haven't yet begun to apply meta-data/labels to sequences. I'll be accessing the sequences for visualization, statistical analysis, and training for machine-learning.
If you need to store multi-gigabytes of data with a known schema you might want to look into a binary flat file database. Of those available, I would recommend HDF5. It is not a relational database like SQLite, but provides rich support for array and matrix data with excellent performance. It also includes MPI support, if you ever expand your machine-learning onto a cluster.

Resources