How to save frequent received data in database? - database

Me and 10 students are doing a big project where we need to receive temperature data from hardware in form av nodes, that should be uploaded and stored on a server. As we are all engineers in embedded systems and having minor database knowledge, I am turning to you guys.
I want to receive data from the nodes lets say, every 30 seconds. The table that will store that data in the database would quickly become very long if you store: [nodeId, time, temp] in a table. Do you have any suggestions how to store the data in another way?
A solution could be to store it like mentioned for a period of time and then "compromize" it somehow to a matrix of some sort? I still want to be able to reach old data.

One row every 30 seconds is not a lot of data. It's 2880 rows per day per node. I once designed a database which had 32 million rows added per day, every day. I haven't looked at it for a while but I know it's currently got more than 21 billion rows in it.
The only thing to bear in mind is that you need to think about how you're going to query it, and make sure it has appropriate indexes.
Have fun!

Related

Daily data generation and insertion

I'm facing a problem that perhaps someone around here can help me with.
I work in a business intelligence company and I'd like to simulate the whole usage cycle of our product the way our clients use it.
The short version is that our customers are inserting some 20 million records to their database on a daily basis, and our product crunches the new data at the end of the day.
I would like to automatically create around 20 million records and insert them into some database, everyday (MSSQL probably).
I should point out that the number of records should change from day to day between 15 to 25 million. Other than that, the data is supposed to be inserted to 6 tables linked with foreign keys.
I ususally use Redgate's SQL Generator to create data, but as far as I can tell it's good for one time data generation as opposed to the on going data generation I'm looking for.
If anyone knows of methods/tools adequate to this situation, please let me know.
Thanks!
You could also write a small Java (or similar) program to get the starting ID from the database, pick a random number of rows to insert, and then execute the data-generation tool as a child process.
For example, see Runtime.exec():
http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html
You can then run your program as a scheduled task or cron job.

How to store and retrieve large numbers of data points for graphical visualization?

I'm thinking about building a web-based data logging and visualization service. The basic idea is that at some timed interval something (e.g. a sensor) reports a value (e.g. temperature) to the server. The server records this value into a database. There would be a web-based UI that allows me to view this data on a time-based graph. Ideally this graph would have various resolutions (last 30 seconds, last week, last year, etc). In a super ideal world, I would be able to zoom into the data for any point in time.
The problem is that the sensors are going to generate enormous amounts of data. For example, a sensor that reports a value every 5 seconds will generate about 18k values a day. I'm imagining a system that has thousands of sensors. Over time, this becomes lots of data.
The naive solution is to throw this data into a relational database and retrieve it in the various ways I want, but that won't scale.
The simple solution is to reduce the amount of data by performing periodic roll-ups of the data. New data might go into a table that has data points every 5 seconds. Every hour, some system pumps this data into another table that has data points every minute and the original data is deleted. This repeats for a few levels. The downside to this is that the further back in time you go, the less detailed the data is. That's probably fine. I would imagine that I would need enormous amounts of hardware to support full resolution of data over all time as compared to a system with this sort of rollup.
Is there a better way to do this? Is there an existing solution? I have to imagine this is a fairly common problem.
You probably want a fixed sized database like RRDTool: http://oss.oetiker.ch/rrdtool/
Also Graphite is built on top of a similar datastore implementation: http://graphite.wikidot.com/

Best way to access averaged static data in a Database (Hibernate, Postgres)

Currently I have a project (written in Java) that reads sensor output from a micro controller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns worth of data every second. Once the data is written it will stay static forever.This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Being that we gather and write data every second we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful to not specify too long of a time period (startTimem and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google visualization API that powers Google Finance. With regards to time scaling, i.e. the longer the time period the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for say half a year. I imagine the interface in this scenario would scale the amount of rows to average based on the user request. I.E. if the user asks for a month of data the interface will request an avg of every 86400 rows which would return ~30 data points whereas if the user asks for a day of data the interface will request an avg of every 2880 rows which will also return 30 data points but of more granularity.
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do is to periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views to show summary data by using the summary tables which have been created, combined with detail table where the summary table doesn't exist.
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html

Printing the names of all the people greater than age 18?

This was a pretty good question that was posed to me recently. Suppose we have a hypothetical (insert your favorite data storage tool here) database that consists of the names, ages and address of all the people residing on this planet. Your task is to print out the names of all the people whose age is greater than 18 within an HTML table. How would you go about doing that? Lets say that hypothetically the population is growing at the rate of 1200/per second and the database is updated accordingly(don't ask how). What would be your strategy to print the names of all these people and their addresses on an HTML table?
Storing the ages in a DB tables sounds like a recipe for trouble to me - it would be impossible to maintain. You would be better off storing the birth dates, then building an index on that column/attribute.
You have to get an initial dump of the table for display. Just calculate the date 18 years ago (let's say D0) and use a query for any person born earlier than that.
Use DB triggers to receive notifications about deaths, so that you can remove them from the table immediately.
Since people only get older (unfortunately?), you can use ranged queries to get new additions (i.e. people that become 18 years old since yo last queried the table). E.g. if you want to update the display the next day, you issue a query for the people that were born in day D0 + 1 only - no need to request the whole table again.
You could even prefetch the people who reach 18 years of age the next day, keep the entries in memory, and add them to the display at the exact moment they reach that age.
BTW, even with 2KB of data for each person, you get a 18TB database (assuming 50% overhead). Any slightly beefed up server should be able to handle this kind of DB size. On the other hand, the thought of a 12 TB HTML table terrifies me...
Oh, and beware of timezone and DST issues - time is such a relative thing these days...
I don't see what the problem is. You don't have to worry about new records being added at all, since none of them will be included in your query unless that query takes 18 or more years to run. If you have an index on age, and presumably any DB technology sufficient to handle that much data and 1200 inserts a second updates indexes on insert, it should just work.
In the real world, using existing technologies or something like it, I would create a daily snapshot once a day and do queries on that read-only snapshot that would not include records for that day. That table would certainly be good enough for this query, and most others.
Are you forced to aggregate all of the entries into one table?
It would be simpler if you were to create a table for each age group (only around 120 tables would be needed) and just insert the inputs into those, as it's computationally simpler to look over 120 tables when you insert an entry than to look over 6,000,000,000 when looking for entries.

Inspiration needed: Selecting large amounts of data for a highscore

I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem some how, but I can't see how. So I'm seeking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load data and generate the highscore 25.000 users which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and formost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straight forward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for chacheing, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.

Resources