I am trying to apply the exponential SRGM (software reliability growth model) to a large dataset of about 50,000 failure times. Fitting takes forever, and even the online tools crash because there are too many data points. Can anyone suggest how I can solve this problem and fit the exponential (Goel-Okumoto) model to obtain maximum likelihood estimates (MLEs)?
I learned that one good way to do this is to transform the data into failure-counts format. So I did the failure-counts transformation using equal time intervals (one per year), which reduced my dataset to 28 points. Then I could apply any failure-counting model to fit the data and make predictions. The article based on this study is available at https://books.google.com/books?id=uYiRDgAAQBAJ&pg=PA244&lpg=PA244&dq=An+Open+Source+Tool+to+Support+the+Quantitative+Assessment+of+Cybersecurity.+In+Proc.+International+Conference+on+Cyber+Warfare+and+Security&source=bl&ots=gJX5I0b8eH&sig=fp-EDU0z8AR1ZCVvjgqxrb1WF0c&hl=en&sa=X&ved=0ahUKEwjj1K6N09nUAhXBRCYKHZUWDfAQ6AEIMDAA#v=onepage&q=An%20Open%20Source%20Tool%20to%20Support%20the%20Quantitative%20Assessment%20of%20Cybersecurity.%20In%20Proc.%20International%20Conference%20on%20Cyber%20Warfare%20and%20Security&f=false
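For what it's worth, here is a minimal sketch of fitting Goel-Okumoto to grouped failure counts by numerically maximizing the grouped-data log-likelihood, using SciPy as one possible optimizer; the yearly interval boundaries and the counts below are made-up placeholders, and m(t) = a(1 - e^{-bt}) is the usual Goel-Okumoto mean value function.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder grouped data: interval end times t_i (e.g. years) and the
# number of failures observed in each interval (t_{i-1}, t_i].
t = np.arange(1, 29, dtype=float)        # 28 equal yearly intervals
n = np.random.poisson(50, size=28)       # placeholder failure counts

def neg_log_lik(params):
    a, b = params                        # a = total expected failures, b = detection rate
    m = a * (1.0 - np.exp(-b * t))       # Goel-Okumoto mean value function m(t)
    dm = np.diff(np.concatenate(([0.0], m)))  # expected failures per interval
    dm = np.clip(dm, 1e-12, None)        # guard against log(0)
    # Grouped-data log-likelihood (constant log(n_i!) terms dropped)
    return -np.sum(n * np.log(dm) - dm)

# Reasonable starting values: a near the total observed count, b small.
res = minimize(neg_log_lik, x0=[n.sum(), 0.1],
               method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
a_hat, b_hat = res.x
print("MLEs:", a_hat, b_hat)
```

With only 28 counts this runs in milliseconds, which is the whole point of the grouping step.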
KEY POINT: the dataset is so large that I can barely store it at all (petabytes).
Say I have trillions and trillions of rows in a dataset. This dataset is too large to be stored in memory. I want to train a machine learning model, say logistic regression, on this dataset. How do I go about this?
Now, I know Amazon/Google do machine learning on huge amounts of data. How do they go about it? For example, a click dataset, where inputs from smart devices around the globe are stored in one dataset.
Desperately looking for new ideas and open to corrections.
My train of thought:
Load a part of the data into memory.
Perform gradient descent on that part.
This way the optimization is mini-batch gradient descent.
Now the problem is that the optimization, be it SGD or mini-batch, in the worst case only stops after it has gone through ALL the data. Traversing the whole dataset is not possible.
So I had the idea of early stopping. Early stopping reserves a validation set and stops the optimization when the error on that validation set stops going down / converges. But again, this might not be feasible due to the size of the dataset.
Now I am thinking of simply random-sampling a training set and a test set of workable sizes to train the model.
Pandas' read functions load the entire dataset into RAM, which can be an issue. To solve this, process the data in chunks.
With a huge amount of data, you can train the model in batches. You can also consider more complex models such as neural networks or XGBoost instead of logistic regression.
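A minimal sketch of the chunked/incremental approach, assuming the data sits in a CSV with a binary "label" column and numeric features (the file name and column names are placeholders of my own); pandas' chunksize and scikit-learn's SGDClassifier.partial_fit do the heavy lifting.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# "huge.csv" and the "label" column are hypothetical placeholders.
clf = SGDClassifier(loss="log_loss")        # logistic regression fitted by SGD
                                            # (use loss="log" on older scikit-learn)
classes = [0, 1]                            # must be declared up front for partial_fit

for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)  # one incremental update per chunk

# The model never holds more than one chunk in memory at a time.
```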
Check out this website for more information on how to handle big data.
I'm a bit confused here... I'm being offered a project where there would be an array of sensors that give off a reading every millisecond (yes, 1000 readings per second). Each reading would be a 3- or 4-digit number, for example 818 or 1529. These readings need to be stored in a database on a server and accessed remotely.
I have never worked with such big amounts of data. What do you think, roughly how many MB would the readings from one sensor come to per day? 4 (digits) x 1000 x 60 x 60 x 24 = 345,600,000 bits... right? About 42 MB per day... doesn't seem too bad, right?
Therefore a DB of, say, 1 GB would hold about 23 days of info from one sensor, correct?
I understand that MySQL & PHP probably would not be able to handle it... What would you suggest? Maybe some apps? Azure? Oracle?
A 3- or 4-digit number is:
4 bytes if you store it as a string,
2 bytes if you store it as a 16-bit (0-65535) integer.
1000/sec -> 60,000/minute -> 3,600,000/hour -> 86,400,000/day
As string: 86,400,000 * 4 bytes ≈ 330 MB/day
As integer: 86,400,000 * 2 bytes ≈ 165 MB/day
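Just to make the arithmetic reproducible, a quick Python sanity check using the same 4-byte-string and 2-byte-integer assumptions:

```python
readings_per_day = 1000 * 60 * 60 * 24   # one reading per millisecond
bytes_as_string  = readings_per_day * 4  # "1529" stored as 4 characters
bytes_as_uint16  = readings_per_day * 2  # 16-bit integer (0-65535)

print(readings_per_day)                  # 86,400,000 readings/day
print(bytes_as_string / 1024**2)         # ~329.6 MiB/day as strings
print(bytes_as_uint16 / 1024**2)         # ~164.8 MiB/day as integers
```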
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. Optimizing a DB for large-scale retrieval slows things down for fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database and doing an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
Most likely you will not need to keep the data at such a high resolution for long. There are several options for minimizing the volume. First, after some period of time you can collapse the raw data into min/max/avg values per hour; you can keep detailed data only for unstable situations you detect, or for situations that by definition require detailed data. Also, many things can be turned into event logging. These approaches were implemented and successfully used a couple of decades ago in some industrial automation systems provided by the company I was working for at the time, when available storage devices were many times smaller than what you can find today.
So, first, analyse the data you will be storing and then decide how to optimize its storage.
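As an illustration of the collapse-into-aggregates idea, a small pandas sketch; the DataFrame below is synthetic placeholder data, and the one-minute bucket is arbitrary:

```python
import numpy as np
import pandas as pd

# One reading per millisecond for a few minutes of synthetic data.
idx = pd.date_range("2024-01-01", periods=5 * 60 * 1000, freq="ms")
raw = pd.DataFrame({"value": np.random.randint(100, 10000, size=len(idx))}, index=idx)

# After some retention period, keep only per-minute (or per-hour) aggregates.
summary = raw["value"].resample("1min").agg(["min", "max", "mean"])
print(summary.head())
```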
Following MarcB's numbers, 2 bytes at 1 kHz is just 2 KB/s, or 16 kbit/s. This is not really too much of a problem.
I think a sensible and flexible approach should be to construct a queue of sensor readings which the database can simply pop until it is clear. At these data rates, the problem is not the throughput (which could be handled by a dial-up modem) but the gap between the timings. Any system caching values will need to be able to get out of the way fast enough for the next value to be stored; 1ms is not long to return, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
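A minimal sketch of that queue pattern in Python; the sensor source, the batch size, and bulk_insert() are all stand-ins of my own (the "insert" just prints), but the producer/consumer split is the point:

```python
import queue
import random
import threading
import time

readings = queue.Queue()

def sensor_producer(n_readings=5000):
    # Stand-in for the real sensor: one 3- or 4-digit reading per millisecond.
    for _ in range(n_readings):
        readings.put((time.time(), random.randint(100, 9999)))
        time.sleep(0.001)
    readings.put(None)  # sentinel: no more data

def bulk_insert(batch):
    # Placeholder for a single multi-row INSERT (or COPY) into the real database.
    print(f"inserted {len(batch)} rows, latest value {batch[-1][1]}")

def db_consumer(batch_size=1000):
    # Pop readings in bulk so the database sees one big insert, not 1000 tiny ones per second.
    batch = []
    while True:
        item = readings.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            bulk_insert(batch)
            batch = []
    if batch:
        bulk_insert(batch)

producer = threading.Thread(target=sensor_producer)
consumer = threading.Thread(target=db_consumer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The cheap put() keeps the sensor side responsive; all the database cost is paid in bulk on the consumer side.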
If you do not need a relational database, you can use a NoSQL database like MongoDB, or even a much simpler solution like JDBM2 if you are using Java.
I will try to describe my challenge and operation:
I need to calculate stock price indices over a historical period. For example, I will take 100 stocks and calculate their aggregated average price each second (or even more frequently) for the last year.
I need to create many different indices like this, where the stocks are picked dynamically out of roughly 30,000 different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as I can.
For that reason, I think a traditional RDBMS is too slow, and so I am looking for a sophisticated and original solution.
Here is something I had in mind, using a NoSQL or column-oriented approach:
Distribute all stocks into some kind of key-value pairs of time:price, with matching time rows across all of them. Then use some sort of map-reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion of a completely different design pattern. My guidelines for the solution are price (I would like to use open source), the ability to handle huge amounts of data and, again, fast lookup (I don't care about inserts, since they are only made once and the data never changes).
Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to get that down to minutes or so.
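As a toy illustration of that map-reduce idea over time:price pairs (the per-stock series and symbols below are made up, and plain Python dicts stand in for a real NoSQL store):

```python
from collections import defaultdict

# Hypothetical per-stock series: stock -> list of (time, price) with aligned times.
series = {
    "AAA": [(0, 10.0), (1, 10.5), (2, 10.4)],
    "BBB": [(0, 20.0), (1, 19.8), (2, 20.1)],
    "CCC": [(0, 30.0), (1, 30.2), (2, 29.9)],
}
picked = ["AAA", "BBB"]          # stocks chosen dynamically for this index

# "Map": emit (time, price) for every picked stock; "reduce": average per time.
sums = defaultdict(lambda: [0.0, 0])
for stock in picked:
    for t, price in series[stock]:
        sums[t][0] += price
        sums[t][1] += 1

index = {t: total / count for t, (total, count) in sorted(sums.items())}
print(index)   # {0: 15.0, 1: 15.15, 2: 15.25}
```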
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the metadata for each time series.
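A rough sketch of that run layout using Python's struct module; the exact field widths (8-byte Unix timestamp, 4-byte run length, 8-byte float prices) are my assumption rather than anything prescribed above, and an in-memory dict stands in for the MySQL index table.

```python
import struct

RUN_HEADER = struct.Struct("<qI")   # int64 start timestamp, uint32 number of prices

def append_run(path, index, start_ts, prices):
    """Append one run of consecutive per-second prices and record its offset."""
    with open(path, "ab") as f:
        offset = f.seek(0, 2)                    # current end of file
        f.write(RUN_HEADER.pack(start_ts, len(prices)))
        f.write(struct.pack(f"<{len(prices)}d", *prices))
    index[start_ts] = offset                     # stand-in for an index row in MySQL

def read_run(path, offset):
    """Seek straight to a run without scanning the file from the start."""
    with open(path, "rb") as f:
        f.seek(offset)
        start_ts, count = RUN_HEADER.unpack(f.read(RUN_HEADER.size))
        prices = struct.unpack(f"<{count}d", f.read(8 * count))
    return start_ts, list(prices)

index = {}
append_run("AAPL.ts", index, 1_600_000_000, [151.2, 151.2, 151.3])
print(read_run("AAPL.ts", index[1_600_000_000]))
```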
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the IO throughput of the HDs.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
More info about logparser:
Wikipedia
Coding Horror
What you really need is a relational database that has built-in time series functionality. IBM released one very recently, Informix 11.7 (note that it must be 11.7 to get this feature). Even better news is that for what you are doing, the free version, Informix Innovator-C, will be more than adequate.
http://www.freeinformix.com/time-series-presentation-technical.html
I'm using a feed-forward neural network in Python using the pybrain implementation. For the training, I'll be using the back-propagation algorithm. I know that with neural networks, we need to have just the right amount of data in order not to under- or over-train the network. I could get about 1200 different templates of training data for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training? Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too many sizes. The results were quite good with this last size, but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
How do I calculate the optimal amount of data for my training?
It's completely solution-dependent. There's also a bit of art with the science. The only way to know if you're into overfitting territory is to be regularly testing your network against a set of validation data (that is data you do not train with). When performance on that set of data begins to drop, you've probably trained too far -- roll back to the last iteration.
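To make the validation-set idea concrete, here is a minimal self-contained early-stopping loop (plain NumPy on a toy linear model rather than pybrain, so the mechanics are visible): train, check validation error each epoch, keep the best weights so far, and stop once validation error has not improved for a while.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 7 inputs, 1 output, split into train and validation sets.
X = rng.normal(size=(1200, 7))
y = X @ rng.normal(size=7) + 0.1 * rng.normal(size=1200)
X_tr, y_tr, X_val, y_val = X[:900], y[:900], X[900:], y[900:]

w = np.zeros(7)
best_w, best_val, patience, bad_epochs = w.copy(), np.inf, 20, 0

for epoch in range(5000):
    # One gradient-descent step on the training set (stand-in for backprop).
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.01 * grad

    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, bad_epochs = val_err, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation error stopped improving
            break                    # "roll back to the last iteration"

w = best_w
print(f"stopped after {epoch + 1} epochs, best validation MSE {best_val:.4f}")
```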
The results were quite good with this last size but I would like to find the optimal amount.
"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.
The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.
You should customize your dataset to include and reinforce the data you want the network to learn.
After you have crafted this custom dataset, you have to start playing with the number of samples, as it is completely dependent on your problem.
For example: If you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that do not have peaks. There lies the importance of customizing your training dataset no matter how many samples you have.
Technically speaking, in the general case, and assuming all examples are correct, then more examples are always better. The question really is, what is the marginal improvement (first derivative of answer quality)?
You can test this by training it with 10 examples, checking quality (say 95%), then 20, and so on, to get a table like:
10 95%
20 96%
30 96.5%
40 96.55%
50 96.56%
You can then clearly see your marginal gains and make your decision accordingly.
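One way to run that experiment, sketched with scikit-learn's MLPClassifier and a synthetic dataset purely as stand-ins for your network and your data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data: 7 features, like the 7-input network in the question.
X, y = make_classification(n_samples=1200, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n in (20, 50, 100, 200, 400, len(X_train)):
    model = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
    model.fit(X_train[:n], y_train[:n])
    print(f"{n:4d} training samples -> test accuracy {model.score(X_test, y_test):.3f}")
```

Plotting those numbers gives exactly the kind of table above, from which the marginal gain per extra sample can be read off.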
I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Date and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it into samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in its own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see if any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways, I'd go with separate rows for each data point. I presume most of the effort required to write this program well will be on the display side, so you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevant and you would never want to run queries against it. In this case, your data will be the contents of the database and therefore very relevant.
I think you should:
1. Store every sample in a row of its own: when I receive a signal, break it into samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
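For illustration, a minimal sqlite3 sketch of the row-per-sample layout together with a composite-signal query; the sensor names A/B/C, the timestamps and the table layout are my own placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sensor TEXT, ts INTEGER, value INTEGER)")

# A few normalized timestamps for three hypothetical sensors.
rows = [(s, ts, v) for ts in (0, 900, 1800)
        for s, v in (("A", 10 + ts), ("B", 5), ("C", 2))]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

# Composite signal A + B - C, one value per timestamp.
composite = conn.execute("""
    SELECT ts,
           SUM(CASE sensor WHEN 'A' THEN value
                           WHEN 'B' THEN value
                           WHEN 'C' THEN -value END) AS a_plus_b_minus_c
    FROM samples
    WHERE sensor IN ('A', 'B', 'C')
    GROUP BY ts
    ORDER BY ts
""").fetchall()
print(composite)   # [(0, 13), (900, 913), (1800, 1813)]
```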
This might also be "good enough" when querying the data, unless performance becomes an issue or the volume gets massive: 100,000 sensors x 4 samples per hour x 24 hours = 9,600,000 rows per day, or roughly 17.5 billion rows over five years. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse every day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.