Most efficient tool to store frequent temperature samples - database

I'm working on a system that will need to store lots of temperature data. I could potentially store 5 samples per second or more.
I've done this in the past with a relatively simple mysql database and performance became unbearable. Inserts were not too bad, but had a noticeable load. Queries, however, could take minutes.
At that time, I had something like 50 gb of data, which is ridiculous. I can think of many ways to compress or discard data without losing critical information, but that's a completely different problem.
I'd like to pick a tool/database that is optimized for this kind of data, preferably cross platform (at a minimum, linux/c++).
RRD (Round Robin Database) seems built for this kind of thing, but it seems designed more for processing data than for storing it.
What other tools are available?
Edit: more info...
This will be running on an embedded system (a Raspberry Pi) so an ideal tool has low computing overhead, low memory footprint, and few library dependencies.
The storage may not necessarily be on the same device.
I suppose growth could reach as much as 500k samples per hour in a contrived, extreme case. More likely it will be about 20k samples per hour.
Internet access should not be presumed.

Looks like you're looking for a time series DB.
I know of two candidates:
http://tempo-db.com/
http://opentsdb.net/
If you could be a bit more specific about your requirements (req/s, daily data growth, API type, self-hosted or a fully managed solution, etc), I'd be able to go into more details or recommend other solutions.
Good luck.

Related

Best DB architecture to maintain/update counters in near real time

I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real time analysis (i.e. apply a predictive model) and update them when needed while parsing the messages.
Considering the high throughput and the need to be in real time what DB solution would be the better choice?
I was thinking about a key-value in memory DB that will persist data on disk periodically (like Redis).
Thanks in advance for the help.
The best choice depends on many factors we don’t know, like what tech stack is your team already using, how open are they to learning new things, how much operational burden are you willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.

Performance Testing - How much data should I create

I'm very new to performance engineering, so I have a very basic question.
I'm working in a client-server system that uses SQL server backend. The application is a huge tax-related application that requires testing performance at peak load. Meaning that there should be like 10 million tax returns in the system when we run scenarios related to creating tax returns and submitting them. Then there will also be proportional number of users that need to be created.
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
When one talks about creating a smaller dataset and extrapolating the performance planning, a very common answer I hear is that we need to 10 million records because we cannot tell from a smaller data set how the database or network will behave.
So how does one plan capacity and test performance on large enterprise application without creating peak level of data or running peak number of scenarios?
Thanks.
Personally, I would throw as much data and traffic at it as you can. Forget what traffic you "think you need to handle". And just see how much traffic you CAN handle and go from there. Knowing the limits of your system is more valuable than simply knowing it can handle 10 million records.
Maybe it does handle 10 million, but at 11 million it dies a horrible death. Or maybe it's well written and will scale to 100 million before it dies. There's a very distinct difference between the two even though both pass the "10 million test"
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
Why do you think so?
Of course you can (and should) test with limited amounts of data, but you also really, really need to test with a realistic load, which means testing with the amount (and type) of data that you will use in production.
This is just a special case of a general rule: For system or integration testing, you need to test in a scenario that is as close as possible to production; ideally you just copy/clone a live production system, data, config and all and use that for testing. That is actually what we do (if we technically can and the client agrees). We just run a few SQL scripts to randomize personal data in the test data set, so prevent privacy concerns.
There are always issues that crop up because production data is somehow different from what you tested on, and this is the only way to prevent (or at least limit) these problems.
I've planned and implemented reporting and imports, and they invariably break or misbehave the first time they're exposed to real data, because there are always special cases or scaling problems you didn't expect. You want that breakage to happen during development, not in production :-).
In short:
Bite the bullet, and (after having done all the tests with "toy data"), get a realistic dataset to test on. If you don't have the hardware to handle that, then you don't have the right hardware for your tests :-).
I would take a look at Redgate's SQL Data Generator. It does a good job of generating representative data.
Have a peek at "The art of application performance testing / Ian Molyneaux, O’Reilly, 2009".
Your test data is ideally a realistic variety of records. But for first approximations you could have just a few unique records, and duplicate them until you have the desired size. Then use ApacheBench to roughly approximate the traffic.
To help generate data look at ruby faker and perl data faker. I have had good luck with it in generating large data sets for testing. SQL generator from redgate is good too.

Database/data storage for high volume of simple transactions

I have written a PHP application which requires storage of millions of integers between 0 and 10,000,000 inclusive. Each number is incremented by one very frequently (on average 100 values are updated every second) and read very frequently (20,000 reads per second). The numbers are reset to 0 either nightly, weekly, monthly or annually.
I've got a fairly good handle on MySQL but it feels like overkill, and not very efficient in the process.
Has anyone had to deal with this before, and/or could shed some light on a suitable data storage system?
Depending on how much memory you have, and how long you have to keep them around - you could just use the APC cache, or memcache. memcache has a nice increment operation, so it's crazy efficient doing it.
BTW, you said you need 20K reads. And you are doing it in PHP? on one server or on a cluster? Sounds like you need some architectural guidance..... :)
If it's a web-app, you are going to be using a cluster. And if it's not a web app, I'd suggest you think about re-doing it in something which isn't PHP. Java, C++ and c# come to mind.

If you were asked if a system could sustain double growth, what 3 things would you do to answer?

Let's say at your job your boss says,
That system over there, which has lost all institutional knowledge but seems to run pretty good right now, could we dump double the data in it and survive?
You're completely unfamiliar with the system.
It's in SQL Server 2000 (primarily a database app).
There's no test environment.
You might be able to hijack it on the weekends if you needed to run a benchmark.
What would be the 3 things you'd do to convince yourself and then your manager that you could take on that extra load. And if you couldn't do it, on the same hardware... the extra hardware (measured in dollars) it would take to satisfy that request.
To address the response from doofledorfer, you assumptions are almost all 180 degrees off. But that's my fault for an ambiguous question.
One of the main servers runs 7x24 at 70% base and spikes from there and no one knows what it is doing.
This isn't an issue of buy-in or whining... Our company may not have much of a choice in the matter.
Because this is being externally mandated, delays in implementation could result in huge fines. So large meeting to assess risk are almost impossible. There is one risk, that dumping double the data would take the system down for the existing customers.
I was hoping someone would say something like, see if you take the system off line Sunday night at midnight and run SQLIO tests to see how close the storage subsystem is to saturation. Things like that.
Set up a test environment, even if I have to do it on my laptop.
Enable some kind of logging on the production system to get an idea of the volume of transactions in addition to the volume of data.
Read the source code as I run stress tests on my laptop with increasing amounts of data.
Having said that, I sympathize with this assignment, because it's unfair. It's like asking someone in a boat if the boat can float with twice the cargo -- but you can't get out of the boat or take it out of its regular service.
You've just described a typical Agile project. Your answer should be:
I don't know, and I won't be able to tell without testing.
In addition to data volume, there might be issues with usage patterns, application interactions, database and server tuning, etc.
So let's work through a basic list of risk factors, and how we might resolve them.
Once we've done that, let's work through them in inverse order of risk; and make a stop/continue decision as we develop the results.
etc.
Without management buy-in and participation at least at that level, any other answer you might give is high-risk wishing, and "3 most important" is a non sequitur.
I'd be optimistic unless your current system is substantially loaded already. Most servers should run at less than 50% capacity on all resources, or else be on life-support.
And I expect you wouldn't be having the conversation if the existing server were already dealing with load issues; although "seems to run pretty good right now" is imprecise enough to be worrisome.
it mostly depends on its current level. If doubling is going from 2GB to 4GB just do it. If it's going from 1TB to 2TB you've got some planning to do.
I'd collect some info using Performance Monitor and provide it to help make an educated decision.
It depends what you mean by "double the data".
If that is going to affect one table only (say product table) then you are probably safe as most queries that are referring to that one table are most likely to double the time of execution (that assumes that you do not reference the same time twice in a query).
The problem will arise if you double the amount of data in all the tables as the execution time may grow in exponential fashion then and it can lead to some serious issues.
But in general I would support the answer by doofledorfer

What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. Those "modules", developed on top of this framework, can behave quite differently but they are all pretty resources consuming even though some are more than others. They all receive data in input, analyse and/or transform it, and send it further.
We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.
My idea is to simply start sequentially each module with a well chosen bunch of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transaction or operations per second)? Is it latency (time it takes to execute a transaction)? Do you care about average performance? Do I care about worst case performance? Do you care about the absolute worst case or I care that 90%, 95% or some other percentile get adequate performance?
Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst case performance you will add load to the system until up to whatever you consider "realistic" (or whatever the system design says it should support.)
Second, you do not say if these modules are going to be CPU bound, I/O bound, if they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions you may find that your application benefits more from a great I/O subsystem vs. a huge number of CPUs.
Third, the best benchmark (and the hardest) is to put realistic load into the system. Meaning, you record data from a production environment, and put the new hardware solution through this data. Getting this done is harder than it sounds, often, this means adding all kinds of measure points in the system to see how it behaves (if you do not have them already,) modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
If you're system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the trade offs of hardware for performance. One configuration may give you the best performance, but costs too much. A less expensive configuration may give you adequate performance.
It's important to emulate the "worst case" or "peak hour" of load. It's also important to test with "typical" volumes. It's a balancing act to get good server utilization, that doesn't cost too much, that gives the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some operations users (or processes) are doing with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
Which functions are most often used?
How much data is transferred?
Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.
Create a top ten for 1+2 and work from that.
That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).
If you have a specialized system, the numbers may be completely different but usually, new hardware doesn't change much. For example, adding an useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, microbenchmarks, when you try to evaluate a piece of code in isolation or how a system deals with narrowly defined workload. Compare two sorting algorithms written in Java. Compare two web browsers how fast can each perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), when you try to evaluate a software system under a realistic workload. Compare my Python based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and such like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
Microbenchmarking
During the first run, CPU caches are getting filled with the necessary data. The same goes for disk caches. During few subsequent runs the VM continues to warm up, meaning JIT compiles what it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics. Mean, median, standard deviation, plot a chart. Look at it and see how much it changes. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, some other process may start some background task (like virus scan), OS may decide move the process on a different CPU core, if you have NUMA architecture, the results would be even more marked.
In case of microbenchmarks, all of this is a problem. Kill what processes you can before you begin. Use a benchmarking library that can do some of it for you. Like https://github.com/google/caliper and such like.
System benchmarking
In case of benchmarking a system under a realistic workload, these details do not really interest you and your problem is "only" to know what a realistic workload is, how to generate it and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (how long did a web page render) and these are I/O bound so the code gathering data does not slow down the system. (The page needs to be shipped to the user over the network, it does not matter if we also log a few numbers in the process).
Be mindful of the difference between profiling and benchmarking. Benchmarking can give you absolute time spent doing something, profiling gives you relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (common technique is to stop-the-world every few hundred ms and save a stack trace) and the instrumentation slows everything down significantly.

Resources