Calculate Elasticsearch Query Complexity (Scalabilty) - database

I'm doing a project with ElasticSearch and the goal of the project is to optimise the time of request, now I'm trying with 1Go of data and the request took about 1200ms, I wanna calculate the time with 60Go of data, I'm asking if there is techniques to calculate complexity of my query ?
Thanks

Unfortunately, it's not as easy as extrapolating, i.e. if the request takes 1200ms with 1GB of data, it doesn't mean it'll take 60 times more with 60GB. Going from 1GB to 60GB of data has unpredictable effects and totally depends on the hardware you're running your ES on. The server might be OK for 1GB but not for 60GB, so you might need to scale out, but you won't really know until you have that big an amount of data.
The only way to really know is to get to 60GB of data (and scale your cluster appropriately (start big & scale down) and do your testing on a real amount of data.

Related

How to train a machine learning model on huge amount of data?

KEY POINT: the Dataset is so large that I am barely able to store it in hardware. (PetaBytes)
Say I have trillions and trillion of rows in a dataset. This dataset is too large to be stored in memory. I want to train a machine learning model, say logisitc regression, on this dataset. How do I go about this?
Now, I know Amazon/Google does machine learning on huge amounts of data. How do they go about it? For example, click dataset, where globally each smart devices' inputs are stored in a dataset.
Desperately looking for new ideas and open to corrections.
My train of thoughts:
load a part of data into the memory
Perform gradient descent
This way the optimization is mini batch descent.
Now the problem is, in the optimization, be it SGD or mini batch, it stops when it has gone through ALL the data in the worst case. Traversing the whole dataset is not possible.
So I had the idea of early stopping. Early stopping reserves a validation set and will stop optimization when the error stops going down/converges on the validation set. But again this might not be feasible due to the size of the dataset.
Now I am thinking of simply random sampling a training set and a test set, with workable sizes to train the model.
Pandas read function loads the entire data into ram, which can be an issue.To solve this process the data in chunks.
In case of a huge amount of data, you can use batches for training the dataset. Use complex models such as Neural networks, xgboost instead of Logistic Regression.
Check out this website for more information on how to handle big data.

If we make a number every millisecond, how much data would we have in a day?

I'm a bit confused here... I'm being offered to get into a project, where would be an array of certain sensors, that would give off reading every millisecond ( yes, 1000 reading in a second ). Reading would be a 3 or 4 digit number, for example like 818 or 1529. This reading need to be stored in a database on a server and accessed remotely.
I never worked with such big amounts of data, what do you think, how much in terms of MBs reading from one sensor for a day would be?... 4(digits)x1000x60x60x24 ... = 345600000 bits ... right ? about 42 MB per day... doesn't seem too bad, right?
therefor a DB of, say, 1 GB, would hold 23 days of info from 1 sensor, correct?
I understand that MySQL & PHP probably would not be able to handle it... what would you suggest, maybe some aps? azure? oracle?
3 or 4 digit number =
4 bytes if you store it as a string.
2 bytes storing it as a 16bit (0-65535) integer
1000/sec -> 60,000/minute -> 3,600,000/hour, 86,400,000/day
as string: 86,400,000 * 4 bytes = 329megabytes/day
as integer:86,400,000 * 2bytes = 165megabytes/day
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. optimizing a DB for largescale retrieval slows things down for fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database, and do an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
Most likely you will need not to keep the data with such a high discretization for a long time. You may use several options to minimize the volumes. First, after some period of time you may collapse hourly data into min/max/avg values; you may keep detailed info only for some unstable situations detected or situations that require to keep detailed data by definition. Also, many things may be turned into events logging. These approaches were implemented and successfully used a couple of decades ago in some industrial automation systems provided by the company I have been working for at that time. The available storage devices sizes were times smaller than you can find today.
So, first, you need to analyse the data you will be storing and then decide how to optimize it's storage.
Following #MarcB's numbers, 2 bytes at 1kHz, is just 2KB/s, or 16Kbit/s. This is not really too much of a problem.
I think a sensible and flexible approach should be to construct a queue of sensor readings which the database can simply pop until it is clear. At these data rates, the problem is not the throughput (which could be handled by a dial-up modem) but the gap between the timings. Any system caching values will need to be able to get out of the way fast enough for the next value to be stored; 1ms is not long to return, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
İf you do not need relational database you can use a NoSQL database like mongodb or even a much simper solution like JDBM2, if you are using java.

Which NoSQL Database for Mostly Writing

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are good solutions for us. What we are going to store is generally documents (usually around 100K but occasionally can be much larger or smaller) annotated with some metadata. Query performance is not top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1Gbps, but we might be moving on 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify, your data rate is ~1 gigaBYTE per 10 seconds. So you are filling a 1TB hard drive every 20 minutes or so?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low RAM to Data ratio. You want to keep at least primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Larges indices defeat the purpose of using an index.
According to your numbers, you are writing 10M documents / 20 mins or about 30M / hour. Each document needs about 16+ bytes for an index entry. 12 bytes for ObjectID + 4 bytes for pointer into the 2GB file + 1 byte for pointer to file + some amount of padding.
Let's say that every index entry needs about 20 bytes, then your index is growing at 600MB / hour or 14.4GB / day. And that's just the default _id index.
After 4 days, your main index will no longer fit into RAM and your performance will start to drop off dramatically. (this is well-documented under MongoDB)
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes are much faster than reads. Probably, that's what you're looking for.

Recommended Document Structure for CouchDB

We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:
Approximately 2000 connections, polled every 5 minutes, for approximately 600,000 new rows per day. In Postgres, we store this data, partitioned by day:
t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage. etc.
We write data with an optimistic stored proc that presumes the partition exists and creates it if necessary. We can insert very quickly.
For reading of the data, our use cases, in order of importance and current performance are:
* Single Service, Single Day Usage : Good Performance
* Multiple Services, Month Usage : Poor Performance
* Single Service, Month Usage : Poor Performance
* Multiple Services, Multiple Months : Very Poor Performance
* Multiple Services, Single Day : Good Performance
This makes sense because the partitions are optimised for days, which is by far our most important use case. However, we are looking at methods of improving the secondary requirements.
We often need to parameterise the query by hours as well, for example, only giving results between 8am and 6pm, so summary tables are of limited use. (These parameters change with enough frequency that creating multiple summary tables of data is prohibitive).
With that background, the first question is: Is CouchDB appropriate for this data? If it is, given the above use cases, how would you best model the data in CouchDB documents? Some options I've put together so far, which we are in the process of benchmarking are (_id, _rev excluded):
One Document Per Connection Per Day
{
service_id:555
day:20100101
usage: {1265248762: {in:584,out:11342}, 1265249062: {in:94,out:1242}}
}
Approximately 60,000 new documents a month. Most new data would be updates to existing documents, rather than new documents.
(Here, the objects in usage are keyed on the timestamp of the poll, and the values the bytes in and byes out).
One Document Per Connection Per Month
{
service_id:555
month:201001
usage: {1265248762: {in:584,out:11342}, 1265249062: {in:94,out:1242}}
}
Approximately 2,000 new documents a month. Moderate updates to existing documents required.
One Document Per Row of Data Collected
{
service_id:555
timestamp:1265248762
in:584
out:11342
}
{
service_id:555
timestamp:1265249062
in:94
out:1242
}
Approximately 15,000,000 new documents a month. All data would be an insert to a new document. Faster inserts, but I have questions about how efficient it's going to be after a year or 2 years with hundreds of millions of documents. The file IO would seem prohibitive (though I'm the first to admit I don't fully understand the mechanics of it).
I'm trying to approach this in a document-oriented way, though breaking the RDMS habit is difficult :) The fact you can only minimally parameterise views as well has me a bit concerned. That said, which of the above would be the most appropriate? Are there other formats that I haven't considered which will perform better?
Thanks in advance,
Jamie.
I don't think it's a horrible idea.
Let's consider your Connection/Month scenario.
Given that an entry is ~40 (that's generous) characters long, and you get ~8,200 entries per month, your final document size will be ~350K long at the end of the month.
That means, going full bore, you're be reading and writing 2000 350K documents every 5 minutes.
I/O wise, this is less than 6 MB/s, considering read and write, averaged for the 5m window of time. That's well within even low end hardware today.
However, there is another issue. When you store that document, Couch is going to evaluate its contents in order to build its view, so Couch will be parsing 350K documents. My fear is that (at last check, but it's been some time) I don't believe Couch scaled well across CPU cores, so this could easily pin the single CPU core that Couch will be using. I would like to hope that Couch can read, parse, and process 2 MB/s, but I frankly don't know. With all it's benefits, erlang isn't the best haul ass in a straight line computer language.
The final concern is keeping up with the database. This will be writing 700 MB every 5 minutes at the end of the month. With Couchs architecture (append only), you will be writing 700MB of data every 5 min, which is 8.1GB per hour, and 201GB after 24 hrs.
After DB compression, it crushes down to 700MB (for a single month), but during that process, that file will be getting big, and quite quickly.
On the retrieve side, these large documents don't scare me. Loading up a 350K JSON document, yes it's big, but it's not that big, not on modern hardware. There are avatars on bulletin boards bigger than that. So, anything you want to do regarding the activity of a connection over a month will be pretty fast I think. Across connections, obviously the more you grab, the more expensive it will get (700MB for all 2000 connections). 700MB is a real number that has real impact. Plus your process will need to be aggressive in throwing out the data you don't care about so it can throw away the chaff (unless you want to load up 700MB of heap in your report process).
Given these numbers, Connection/Day may be a better bet, as you can control the granularity a bit better. However, frankly, I would go for the coarsest document you can, because I think that gives you the best value from the database, solely because today all the head seeks and platter rotations are what kill a lot of I/O performance, many disk stream data very well. Larger documents (assuming well located data, since Couch is constantly compacted, this shouldn't be a problem) stream more than seek. Seeking in memory is "free" compared to a disk.
By all means run your own tests on our hardware, but take all these considerations to heart.
EDIT:
After more experiments...
Couple of interesting observations.
During import of large documents CPU is equally important to I/O speed. This is because of the amount of marshalling and CPU consumed by converting the JSON in to the internal model for use by the views. By using the large (350k) documents, my CPUs were pretty much maxed out (350%). In contrast, with the smaller documents, they were humming along at 200%, even though, overall, it was the same information, just chunked up differently.
For I/O, during the 350K docs, I was charting 11MB/sec, but with the smaller docs, it was only 8MB/sec.
Compaction appeared to be almost I/O bound. It's hard for me to get good numbers on my I/O potential. A copy of a cached file pushes 40+MB/sec. Compaction ran at about 8MB/sec. But that's consistent with the raw load (assuming couch is moving stuff message by message). The CPU is lower, as it's doing less processing (it's not interpreting the JSON payloads, or rebuilding the views), plus it was a single CPU doing the work.
Finally, for reading, I tried to dump out the entire database. A single CPU was pegged for this, and my I/O pretty low. I made it a point to ensure that the CouchDB file wasn't actually cached, my machine has a lot of memory, so a lot of stuff is cached. The raw dump through the _all_docs was only about 1 MB/sec. That's almost all seek and rotational delay than anything else. When I did that with the large documents, the I/O was hitting 3 MB/sec, that just shows the streaming affect I mentioned a benefit for larger documents.
And it should be noted that there are techniques on the Couch website about improving performance that I was not following. Notably I was using random IDs. Finally, this wasn't done as a gauge of what Couch's performance is, rather where the load appears to end up. The large vs small document differences I thought were interesting.
Finally, ultimate performance isn't as important as simply performing well enough for you application with your hardware. As you mentioned, you're doing you're own testing, and that's all that really matters.

When is the size of the database call more expensive than the frequency of calls?

Can someone give me a relative idea of when it makes more sense to hit the database many times for small query results vs caching a large number of rows and querying that?
For example, if I have a query returning 2,000 results. And then I have additional queries on those results that take maybe 10-20 items, would it be better to cache the 2000 results or hit the database every time for each set of 10 or 20 results?
Other answers here are correct -- the RDBMS and your data are key factors. However, another key factor is how much time it will take to sort and/or index your data in memory versus in the database. We have one application where, for performance, we added code to grab about 10,000 records into an in-memory DataSet and then do subqueries on that. As it turns out, keeping that data up to date and selecting out subsets is actually slower than just leaving all the data in the database.
So my advice is: do it the simplest possible way first, then profile it and see if you need to optimize for performance.
It depends on a variety of things. I will list some points that come to mind:
If you have a .Net web app that is caching data in the client, you do not want to pull 2k rows.
If you have a web service, they are almost always better Chunky than Chatty because of the added overhead of XML on the transport.
In a fairly decently normalized and optimized database, there really should be very few times that you have to pull 2k rows out at a time unless you are doing reports.
If the underlying data is changing at a rapid pace, then you should really be careful caching it on the middle tier or the presentation layer because what you present will you will be out of date.
Reports (any DSS) will pull and chomp through much larger data sets, but since they are not interactive, we denormalize and let them have their fun.
In cases of cascading dropdowns and such, AJAX techniques will prove to be more efficient and effective.
I guess I'm not really giving you one answer to your question. "It depends" is the best I can do.
Unless there is a big performance problem (e.g. a highly latent db connection), I'd stick with leaving the data in the database and letting the db take care of things for you. A lot of things are done efficiently on the database level, for example
isolation levels (what happens if other transactions update the data you're caching)
fast access using indexes (the db may be quicker to access a few rows than you searching through your cached items, especially if that data already is in the db cache like in your scenario)
updates in your transaction to the cached data (do you want to deal with updating your cached data as well or do you "refresh" everything from the db)
There are a lot of potential issues you may run into if you do your own caching. You need to have a very good performance reason befor starting to take care of all that complexity.
So, the short answer: It depends, but unless you have some good reasons, this smells like premature optimizaton to me.
in general, network round trip latency is several orders of magnitude greater than the capacity of a database to generate and feed data onto the network, and the capacity of a client box to consume it from a network connection.
But look at the width of your network bus ( Bits/sec ) and compare that to the average round trip time for a database call...
On 100baseT ethernet, for example you are about 12 MBytes / sec data transfer rate. If your average round trip time is say, 200 ms, then your network bus can deliver 3 MBytes in each 200 ms round trip call..
If you're on gigabit ethernet, that number jumps to 30 Mbytes per round trip...
So if you split up a request for data into two round trips, well that's 400 ms, and each query would have to be over 3Mb (or 30Mb for gigibit ) before that would be faster...
This likely varies from RDBMS to RDBMS, but my experience has been that pulling in bulk is almost always better. After all, you're going to have to pull the 2000 records anyway, so you might as well do it all at once. And 2000 records isn't really a large amount, but that depends largely on what you're doing.
My advice is to profile and see what works best. RDBMSes can be tricky beasts performance-wise and caching can be just as tricky.
"I guess I'm not really giving you one answer to your question. "It depends" is the best I can do."
yes, "it depends". It depends on the volatility of the data that you are intending to cache, and it depends on the level of "accuracy" and reliability that you need for the responses that you generate from the data that you intend to cache.
If volatility on your "base" data is low, then any caching you do on those data has a higher probability of remaining valid and correct for a longer time.
If "caching-fault-tolerance" on the results you return to your users is zero percent, you have no option.
The type of data your bringing back affects the decision as well. You don't want to be caching volatile data or data for potential updates that may get stale.

Resources