How to train a machine learning model on a huge amount of data?

KEY POINT: the dataset is so large (petabytes) that I can barely store it on disk.
Say I have trillions of rows in a dataset. It is far too large to fit in memory. I want to train a machine learning model, say logistic regression, on this dataset. How do I go about this?
Now, I know Amazon/Google do machine learning on huge amounts of data. How do they go about it? For example, a click dataset, where the inputs from smart devices across the globe are all stored in one dataset.
Desperately looking for new ideas and open to corrections.
My train of thought:
- Load a part of the data into memory.
- Perform gradient descent on that part.
This way the optimization is mini-batch gradient descent.
Now the problem: whether I use SGD or mini-batch descent, in the worst case the optimization only stops after it has gone through ALL the data, and traversing the whole dataset is not possible.
So I had the idea of early stopping. Early stopping reserves a validation set and halts the optimization once the error on that validation set converges or stops going down. But again, this might not be feasible given the size of the dataset.
Now I am thinking of simply drawing a random training set and test set of workable sizes to train the model on.
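One way to draw that random sample without ever holding the full dataset in memory is reservoir sampling; this is my suggestion rather than something the thread names, and the file name below is a placeholder. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown,
    possibly huge, length, using a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(stream):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)   # keep the new row with probability k / (i + 1)
            if j < k:
                sample[j] = row
    return sample

# Example: draw a 1,000,000-row training sample from a file too big for memory.
# "huge_dataset.csv" is a placeholder for illustration.
if __name__ == "__main__":
    with open("huge_dataset.csv") as f:
        train_rows = reservoir_sample(f, k=1_000_000)
```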

Pandas' read functions load the entire file into RAM, which can be an issue. To get around this, process the data in chunks.
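A minimal sketch of that chunked approach, assuming the data lives in a CSV file; `events.csv` and the `clicked` column are made-up names for illustration. `pd.read_csv(..., chunksize=N)` yields DataFrames of N rows at a time instead of loading everything at once:

```python
import pandas as pd

# Stream the file in 1-million-row chunks instead of loading it all at once.
# "events.csv" and the "clicked" column are placeholders for illustration.
total_rows = 0
click_sum = 0

for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    total_rows += len(chunk)
    click_sum += chunk["clicked"].sum()   # any per-chunk work goes here

print("rows:", total_rows, "click rate:", click_sum / total_rows)
```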

With a huge amount of data, you can train the model in batches. You can also consider more expressive models such as neural networks or XGBoost instead of logistic regression.
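To combine batching with an actual model, one option (my choice, not something this answer specifies) is scikit-learn's `SGDClassifier` with logistic loss, which supports incremental fitting via `partial_fit`; the file and column names below are again placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Incremental (mini-batch) logistic regression: the model only ever sees one
# chunk of the file at a time. Assumes all feature columns are already numeric.
clf = SGDClassifier(loss="log_loss")   # logistic regression fitted by SGD
classes = np.array([0, 1])             # all labels must be declared for partial_fit

for chunk in pd.read_csv("events.csv", chunksize=100_000):
    X = chunk.drop(columns=["clicked"]).to_numpy()
    y = chunk["clicked"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```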

Check out this website for more information on how to handle big data.

Related

C: Sorting Big Data; Not in Memory

I'm learning to work with large amounts of data.
I've generated a file of 10,000,000 ints. I want to perform a number of sorts on the data and time the sorts (maybe plot the values for performance analysis?) but I've never worked with big data before and don't know how to sort (say, even bubble sort!) data that isn't in memory! I want to invoke the program like so:
./mySort < myDataFile > myOutFile
How would I go about sorting data that can't fit into a linked list, or array?
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
One of the best references on this, though rather technical and dense, is Donald Knuth's treatment of tape sorting algorithms. Back in the day when data was stored on tape and could only be read sequentially and then written out to other tapes, this kind of sorting was often done by repeatedly shuffling data back and forth between different tape drives.
Depending upon the size and type of dataset you are working with, it may be worthwhile to load the data into a dedicated database, or to make use of a cloud-based service like Google's BigQuery. BigQuery charges nothing to upload and download your dataset; you just pay for the processing. The first TB of processed data each month is free, and your dataset is well under a gigabyte anyway.
Edit: Here's a very nice set of undergraduate lecture notes on external sorting algorithms. http://www.math-cs.gordon.edu/courses/cs321/lectures/external_sorting.html
You need to use external sorting.
Bring in part of the data at a time, sort it in memory, and then merge the sorted runs.
More details here
http://en.m.wikipedia.org/wiki/External_sorting
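A rough sketch of that recipe, in Python rather than C purely for brevity: sort fixed-size runs in memory, spill each sorted run to a temporary file, then lazily k-way merge the runs with `heapq.merge`:

```python
import heapq
import sys
import tempfile

def _write_run(sorted_run):
    # Spill one sorted run to a temporary file, one int per line.
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{x}\n" for x in sorted_run)
    f.seek(0)
    return f

def external_sort(ints, run_size=1_000_000):
    """Yield the ints in sorted order without holding them all in memory:
    sort fixed-size runs, spill them to disk, then k-way merge the runs."""
    run_files, run = [], []
    for x in ints:
        run.append(x)
        if len(run) == run_size:
            run_files.append(_write_run(sorted(run)))
            run = []
    if run:
        run_files.append(_write_run(sorted(run)))
    readers = ((int(line) for line in f) for f in run_files)
    yield from heapq.merge(*readers)

# Filter-style usage, analogous to `./mySort < myDataFile > myOutFile`.
if __name__ == "__main__":
    numbers = (int(line) for line in sys.stdin)
    sys.stdout.writelines(f"{x}\n" for x in external_sort(numbers))
```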

Database System for timeseries & aggregation

I'm currently creating a raspberry pi based logging device for logging the power which is fed into the grid by a solar array.
The "main table" will be growing at ~ 20 entries representing the "current" power produced by several parts of the array.
Basically this isn't that much and can be handled with acceptable performance on a Raspberry Pi, but as the amount of data grows, queries like "select the last 10 years, grouped by month" probably won't be very efficient... (the data should be displayed via an interactive web interface)
I thought of doing some "background aggregation" and maintaining several tables containing the aggregated data for various timeframes, but this seems like a problem that many people have probably dealt with before.
What do you suggest I do?
You do not know how much data growth is needed to affect performance.
You do not know by how much performance will be affected then.
You do not know if performance will be affected at all.
As long as you do not have even an estimate of how much performance improvement you need, it does not make sense to try to do optimizations.
Or, as Donald Knuth put it:
Premature optimization is the root of all evil.
If you really do want to create caches of aggregated values, I'd suggest using triggers to keep the cache consistent after any change to the original data.
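As an illustration of the trigger approach, here is a minimal sketch using SQLite through Python's sqlite3 module; SQLite and every table and column name here are assumptions, since the question doesn't name a database. Each insert into the raw readings table updates a per-month aggregate, so a "last 10 years grouped by month" query becomes a cheap read of a small table:

```python
import sqlite3

conn = sqlite3.connect("solar.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS readings (
    ts    TEXT NOT NULL,              -- ISO-8601 timestamp
    part  TEXT NOT NULL,              -- which part of the array
    watts REAL NOT NULL               -- "current" power reading
);

CREATE TABLE IF NOT EXISTS monthly_agg (
    month      TEXT NOT NULL,         -- e.g. '2024-06'
    part       TEXT NOT NULL,
    watts_sum  REAL    NOT NULL DEFAULT 0,
    n_readings INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (month, part)
);

-- Keep the monthly cache consistent on every insert into the raw table.
CREATE TRIGGER IF NOT EXISTS readings_to_monthly AFTER INSERT ON readings
BEGIN
    INSERT OR IGNORE INTO monthly_agg (month, part)
        VALUES (strftime('%Y-%m', NEW.ts), NEW.part);
    UPDATE monthly_agg
       SET watts_sum  = watts_sum + NEW.watts,
           n_readings = n_readings + 1
     WHERE month = strftime('%Y-%m', NEW.ts) AND part = NEW.part;
END;
""")

conn.execute("INSERT INTO readings VALUES ('2024-06-15T12:00:00', 'array-east', 412.5)")
conn.commit()

# "Group by month" now reads the small cached table instead of scanning raw rows.
print(conn.execute(
    "SELECT month, part, watts_sum / n_readings AS avg_watts FROM monthly_agg"
).fetchall())
```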

A.I.: How would I train a Neural Network across multiple machines?

So, for larger networks with large data sets, they take a while to train. It would be awesome if there was a way to share the computing time across multiple machines. However, the issue with that is that when a neural network is training, the weights are constantly being altered every iteration, and each iteration is more or less based on the last -- which makes the idea of distributed computing at the very least a challenge.
I've thought that for each portion of the network, the server could send maybe 1,000 sets of data to train a network on... but... you'd end up with roughly the same computing time, since I wouldn't be able to train on different sets of data simultaneously (which is what I want to do).
But even if I could split up the network's training into blocks of different data sets to train on, how would I know when I'm done with that set of data, especially if the amount of data sent to the client machine isn't enough to achieve the desired error?
I welcome all ideas.
Quoting http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation:
When multicore computers are used multithreaded techniques can greatly decrease the amount of time that backpropagation takes to converge. If batching is being used, it is relatively simple to adapt the backpropagation algorithm to operate in a multithreaded manner.
The training data is broken up into equally large batches for each of the threads. Each thread executes the forward and backward propagations. The weight and threshold deltas are summed for each of the threads. At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network.
which is essentially what other answers here describe.
Depending on your ANN model, you can exploit some parallelism across multiple machines by running the same model with the same training and validation data on each machine, but with different ANN properties for each run: initial values, ANN parameters, noise, etc.
I used to do this a lot to make sure I'd explored the problem space effectively and wasn't stuck in local minima. This is a very easy way to take advantage of multiple machines without having to recode your algorithm. Just another approach you might want to consider.
My assumption is you have more than 1 training set, and you have a gold standard. Also, I assume you have some way of storing the state of the neural network (whether it's a list of probability weights for each node, or something along those lines).
Using as many compute nodes in a cluster as you can, launch the program on a data set on each node. Save the results for each and test them against the gold standard. Whichever neural network state performs best becomes the input for the next round of training. Repeat as often as you see fit.
If I understand correctly, you're trying to figure out a way to train an ANN on a cluster of machines? As you stated, partitioning the network isn't the right approach, and as far as I know, is seemingly unfeasible for most models. A possible approach might be to partition the training sets and run local copies of your network, and then merge the results. An intuitive way to do this and gain some validation along the way would be with cross-validation. As you stated, knowing when the network has had the right amount of training is a problem, but that variability is a problem inherent to neural nets in general, not in parallelizing the work.
As you also stated, the updates that happen during each iteration of training are dependent on the current state of the weights, but without mixing up training sets/validation, you're likely overfitting. This is why CV is nice, because your training sets will all get a chance to play a role in the training, and the validating, across multiple samples.
If you do batch training, the weights are only altered after you have been through the entire dataset. You can compute the weight update vector for each data point in the set on a separate machine/core, add them up at the end, and then proceed with the next epoch.
Here is a link to a question about batch training.
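A toy sketch of that batch-parallel scheme with Python's multiprocessing; the single logistic unit stands in for a real network, and all names and data are invented for illustration. Each worker computes the gradient for its shard, the shard gradients are summed, and the weights are updated once per epoch:

```python
import numpy as np
from multiprocessing import Pool

def shard_gradient(args):
    """Gradient of the log-loss for one shard of the batch."""
    w, X, y = args
    p = 1.0 / (1.0 + np.exp(-X @ w))   # forward pass on this shard
    return X.T @ (p - y)               # backward pass (gradient) on this shard

def train(X, y, n_workers=4, epochs=50, lr=0.1):
    w = np.zeros(X.shape[1])
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        for _ in range(epochs):
            grads = pool.map(shard_gradient,
                             [(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)])
            w -= lr * sum(grads) / len(y)   # one update per epoch, after summing
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)
    print(train(X, y))
```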

Efficient dataset size for feed-forward neural network training

I'm using a feed-forward neural network in Python with the pybrain implementation. For the training, I'll be using the back-propagation algorithm. I know that with neural networks, we need to have just the right amount of data in order not to under- or over-train the network. I could get about 1200 different templates of training data for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training? Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too many sizes. The results were quite good with this last size, but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
How do I calculate the optimal amount of data for my training?
It's completely solution-dependent. There's also a bit of art with the science. The only way to know if you're into overfitting territory is to be regularly testing your network against a set of validation data (that is data you do not train with). When performance on that set of data begins to drop, you've probably trained too far -- roll back to the last iteration.
The results were quite good with this last size but I would like to find the optimal amount.
"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.
The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.
You should customize your dataset to include and reinforce the data you want the network to learn.
After you have crafted this custom dataset, you have to start playing with the amount of samples, as it is completely dependent on your problem.
For example: If you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that do not have peaks. There lies the importance of customizing your training dataset no matter how many samples you have.
Technically speaking, in the general case, and assuming all examples are correct, then more examples are always better. The question really is, what is the marginal improvement (first derivative of answer quality)?
You can test this by training it with 10 examples, checking quality (say 95%), then 20, and so on, to get a table like:
examples  quality
10        95%
20        96%
30        96.5%
40        96.55%
50        96.56%
You can then clearly see your marginal gains and make your decision accordingly.
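A rough way to produce that kind of table in practice, again sketched with scikit-learn rather than pybrain and with synthetic data standing in for the real 1200 templates: train on progressively larger subsets and score each model on the same held-out validation set:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data with 7 inputs, matching the network described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 7))
y = (X.sum(axis=1) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Train on progressively larger slices and watch the marginal gain flatten out.
for n in (100, 200, 400, 800, len(X_train)):
    net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
    net.fit(X_train[:n], y_train[:n])
    print(n, round(net.score(X_val, y_val), 3))
```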

Computed Values

Like most people, I work on a data-driven object-oriented business application. I use a relational database to store my data.
So far, I have designed my application so that it never stores computed values. That is, if a user would like to consult the output of a "simulation" he ran last year, my application simply recomputes the report from existing historical data instead of reading a stored result of the simulation. Since the report takes very little time to create - it's mostly simple arithmetic - can I safely assume I can get by without storing the results of these reports? I'm having a hard time imagining a future business requirement that would make me regret not having stored the results in the first place.
Not storing the result of computations reduces redundancy and is generally, in my opinion, a good thing - but it comes down to a trade-off between normalization and the computational power an operation requires. Often, databases don't get 100% normalized because we don't live in the ideal world (infinite CPU power and I/O speed) where that would work, so it should really be decided on a case-by-case basis.
If you can't foresee a need for storing the results of a computation in the DB, I'd suggest you don't store it. A DB is generally easier to maintain, the more normalized it is.
Tax changes, or any other regulation that applies to all the data.
You can get around it by storing the values that applied at the time, but once any calculation that has since changed is involved, the complexity starts to build up.
If you have all the data and can compute the simulations based on that data, then you should be fine. If in the future you find that these simulations are taking too long to run, you can begin storing the computed values and simply change your application to pull historical values from there.
