I am working on a project in which we collect a lot of data from many sensors. In many cases, these sensors return low-precision floats represented as 1- and 2-byte integers. These integers are mapped back to floats via some simple relation; for instance,
x_{float} = (x_{int} + 5) / 3
Each sensor will return 70+ variables of this kind.
Currently, we expect to store a minimum of 10 million entries per day, possibly even 100+ million entries per day. However, we only require 2 or 3 of these variables on a daily basis; the others will rarely be used (we require them for modeling purposes).
So, in order to save some space, I was considering storing these low-precision integers directly in the DB instead of the float values (with the exception of the 2-3 variables we read regularly, which will be stored as floats to avoid the constant overhead of mapping them back from ints). In theory, this should reduce the size of the database by almost half.
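For what it's worth, the mapping itself is trivial to apply at read time. A minimal sketch, assuming each variable carries its own offset/divisor pair (the variable name and constants below are made up, following the example relation above):

    # Hypothetical per-variable calibration: x_float = (x_int + offset) / divisor
    CALIBRATION = {
        "temp_raw": (5, 3),              # x_float = (x_int + 5) / 3
    }

    def decode(variable: str, raw: int) -> float:
        """Map a stored 1- or 2-byte integer back to its float value."""
        offset, divisor = CALIBRATION[variable]
        return (raw + offset) / divisor

    def encode(variable: str, value: float) -> int:
        """Inverse mapping, applied once when a reading is ingested."""
        offset, divisor = CALIBRATION[variable]
        return round(value * divisor) - offset

    print(decode("temp_raw", 1003))      # -> 336.0

Applied in bulk (e.g. vectorized over a whole column), the decode step is typically far cheaper than reading the rows in the first place.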
My question is: is this a good idea? Will this backfire when we have to map all the data back to train models?
Thanks in advance.
P.S. We are using Cassandra; I don't know if this is relevant to the question.
Related
I'm planning a web app with a database in the background that stores big numbers. I'm wondering whether it could be possible that integers need more space than storing the same number as a string.
So normally an integer is stored as a base-2 number:
That means for 0 and 1 I need 1 bit, while I would need 8 bits to write them as a char.
Writing 2 I would need 2 bits, but still 8 bits as a char.
Is there something like a break-even point, therefore? If so, at what number is it?
Thanks so far.
Optimizing bitwise operations is not something that people using databases do (with perhaps some minor exceptions).
You are using the database for its ACID properties, and perhaps for its ability to query and manage data. You are using it because it scales easily, manages multiple processors, manages multiple disks, and manages memory hierarchies. You are not using it because it stores the data in the smallest amount of space.
You should worry about other aspects of your application and the data model you want to use.
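For what it's worth, the break-even the question asks about is easy to estimate. A back-of-the-envelope sketch, assuming one byte per ASCII digit for the string form and a fixed-width integer column, and ignoring the per-row overhead every real database adds on top:

    def string_bytes(n: int) -> int:
        return len(str(n))        # one byte per ASCII digit (sign and terminator ignored)

    def int32_bytes(n: int) -> int:
        return 4                  # a 32-bit integer column costs 4 bytes regardless of value

    for n in (7, 999, 9999, 99999, 2_000_000_000):
        print(f"{n:>13}: {string_bytes(n)} bytes as string, {int32_bytes(n)} bytes as int32")

    # Strings are smaller only for numbers of 1-3 digits; at 4 digits they tie,
    # and from 5 digits up the fixed-width integer wins. A 64-bit column ties at
    # 8 digits and wins from 9 digits onward.

Either way, the per-value difference is a handful of bytes, which is usually dwarfed by row and index overhead.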
I am designing a Microsoft Access database to store results from lab equipment. They are in the form of hundreds of lists of frequency vs. response curves which I have previously stored rather easily, but inefficiently in Excel.
The difficulty comes from the fact that the frequency can vary from 1 to 50E9 Hz, the step size between data points can vary from 1 to 1E9 Hz, and the number of points can vary from ~100 to 40,000. This has brought up a challenge when it comes to table design, because everything I try seems to be very inefficient.
I have considered using links to external text files to store the data points which solves the table design, but seems to violate good database design. I've considered using tables of arrays (i.e. Start Freq, Stop Freq, Freq Step Size, and Array of Responses), but the array sizes could vary greatly which seems just as inefficient.
Is there a recommended practice for storing this type of data? It seems like a common task when storing instrument data, but I can't seem to find anything in web searches. Any assistance will be greatly appreciated.
Looks like a classic 1:N relationship to me. "1" is the measurement session and "N" is all the measurements (i.e. data points) taken in that session. This is modeled by two tables and one foreign key between them, similar to this:
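One possible shape for those two tables, sketched here as a small SQLite script with illustrative field names:

    import sqlite3

    conn = sqlite3.connect("measurements.db")
    conn.executescript("""
        -- the "1" side: one row per measurement session / sweep
        CREATE TABLE IF NOT EXISTS session (
            session_id  INTEGER PRIMARY KEY,
            instrument  TEXT,
            started_at  TEXT                -- ISO timestamp of the sweep
        );

        -- the "N" side: one row per data point, linked back to its session
        CREATE TABLE IF NOT EXISTS measurement (
            session_id   INTEGER NOT NULL REFERENCES session(session_id),
            frequency_hz REAL    NOT NULL,  -- anywhere from 1 Hz to 50 GHz
            response     REAL    NOT NULL
        );

        CREATE INDEX IF NOT EXISTS ix_measurement_session
            ON measurement (session_id, frequency_hz);
    """)
    conn.commit()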
Tweak the fields to suit your needs, but this general design should be more than able to handle large amounts of data and varying numbers of measurements per session.
That being said, MS Access has historically had significant limitations on the size of the data that can be stored in a single database. If you hit these limits, consider using a "real" DBMS.
I'm a bit confused here... I'm being offered the chance to get into a project where there would be an array of certain sensors that would give off a reading every millisecond (yes, 1000 readings in a second). A reading would be a 3- or 4-digit number, for example 818 or 1529. These readings need to be stored in a database on a server and accessed remotely.
I never worked with such big amounts of data. What do you think - how much, in terms of MBs, would the readings from one sensor for a day come to?... 4 (digits) x 1000 x 60 x 60 x 24 ... = 345,600,000 bits ... right? About 42 MB per day... doesn't seem too bad, right?
Therefore a DB of, say, 1 GB would hold 23 days of info from 1 sensor, correct?
I understand that MySQL & PHP probably would not be able to handle it... what would you suggest - maybe some apps? Azure? Oracle?
A 3- or 4-digit number =
4 bytes if you store it as a string
2 bytes if you store it as a 16-bit (0-65535) integer
1000/sec -> 60,000/minute -> 3,600,000/hour -> 86,400,000/day
as string: 86,400,000 * 4 bytes = 329 megabytes/day
as integer: 86,400,000 * 2 bytes = 165 megabytes/day
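The same arithmetic, spelled out (here "megabytes" means MiB; real storage adds per-row overhead on top of these raw payload sizes):

    readings_per_day = 1000 * 60 * 60 * 24          # 86,400,000 samples/day
    as_string = readings_per_day * 4                # 4 ASCII digits per reading
    as_int16  = readings_per_day * 2                # one 16-bit integer per reading

    MiB = 1024 * 1024
    print(f"as string : {as_string / MiB:.0f} MiB/day")   # ~330 MiB/day
    print(f"as int16  : {as_int16 / MiB:.0f} MiB/day")    # ~165 MiB/day

At roughly 165 MB/day of raw 16-bit values, a 1 GB database holds about 6 days of one sensor's readings, not 23 - the question's 42 MB/day figure counted each digit as a bit rather than a byte.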
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. Optimizing a DB for large-scale retrieval slows things down for fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database and doing an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
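A rough sketch of that staging-then-archive pattern, using SQLite purely for illustration (file and table names are made up; the real system would be whatever DBMS you end up on):

    import sqlite3

    staging = sqlite3.connect("staging.db")
    staging.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, value INTEGER)")

    def record(ts: float, value: int) -> None:
        """Cheap, frequent insert into the small staging database."""
        staging.execute("INSERT INTO readings VALUES (?, ?)", (ts, value))
        staging.commit()

    def hourly_rollover() -> None:
        """Run once an hour: bulk-copy staged rows into the archive, then clear staging."""
        staging.execute("ATTACH DATABASE 'archive.db' AS archive")
        staging.execute("CREATE TABLE IF NOT EXISTS archive.readings (ts REAL, value INTEGER)")
        staging.execute("INSERT INTO archive.readings SELECT ts, value FROM readings")
        staging.execute("DELETE FROM readings")
        staging.commit()
        staging.execute("DETACH DATABASE archive")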
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
Most likely you will not need to keep the data at such a high sampling resolution for a long time. You may use several options to minimize the volumes. First, after some period of time you may collapse hourly data into min/max/avg values; you may keep detailed info only for unstable situations that were detected, or for situations that by definition require detailed data. Also, many things may be turned into event logging. These approaches were implemented and successfully used a couple of decades ago in some industrial automation systems provided by the company I was working for at that time. The available storage devices were many times smaller than what you can find today.
So, first, you need to analyse the data you will be storing and then decide how to optimize its storage.
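A minimal sketch of that kind of collapse, assuming the raw samples are (unix timestamp, value) pairs; in a real system this would run as a scheduled job or inside the database itself:

    from collections import defaultdict
    from datetime import datetime, timezone

    def collapse_hourly(samples):
        """Reduce raw (unix_ts, value) samples to one (hour, min, max, avg) row per hour."""
        buckets = defaultdict(list)
        for ts, value in samples:
            hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
                minute=0, second=0, microsecond=0)
            buckets[hour].append(value)
        return [(hour, min(v), max(v), sum(v) / len(v))
                for hour, v in sorted(buckets.items())]

    # Example: three readings in the same hour collapse to a single summary row.
    print(collapse_hourly([(0, 818), (60, 1529), (120, 900)]))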
Following #MarcB's numbers, 2 bytes at 1 kHz is just 2 KB/s, or 16 Kbit/s. This is not really too much of a problem.
I think a sensible and flexible approach should be to construct a queue of sensor readings which the database can simply pop until it is clear. At these data rates, the problem is not the throughput (which could be handled by a dial-up modem) but the gap between the timings. Any system caching values will need to be able to get out of the way fast enough for the next value to be stored; 1ms is not long to return, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
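A sketch of that queue arrangement in Python (the batch size and the storage call are placeholders; the point is that the sensor side only ever does a cheap put, while the writer drains in bulk at its own pace):

    import queue
    import threading

    readings: "queue.Queue[tuple[float, int]]" = queue.Queue()

    def sensor_callback(ts: float, value: int) -> None:
        """Called every millisecond; must return immediately."""
        readings.put((ts, value))

    def store_batch(batch) -> None:
        """Placeholder for the real bulk insert, e.g. executemany() against the DB."""
        pass

    def writer_loop(batch_size: int = 1000) -> None:
        """Drain the queue and hand the database one bulk insert at a time."""
        while True:
            batch = [readings.get()]            # block until there is at least one reading
            while len(batch) < batch_size:
                try:
                    batch.append(readings.get_nowait())
                except queue.Empty:
                    break
            store_batch(batch)

    threading.Thread(target=writer_loop, daemon=True).start()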
If you do not need a relational database, you can use a NoSQL database like MongoDB, or even a much simpler solution like JDBM2 if you are using Java.
I am creating a little hobby database-driven, browser-based game and I stumbled across this problem: I store money owned by users as a 32-bit integer field (to be precise: two fields; one stores money in the player's hand, the other money stored in the bank). We all know that the maximum value which can be stored in 32 bits is 2^32-1.
I am absolutely sure that 95% of players will not be able to reach the upper limit - but on the other hand (and after doing some calculations today), good players will be able to accumulate that much.
Having that in mind I came with the following ideas:
store money in 64 bits, which doubles the space taken by those fields.
store money as a string and convert to/from a long long at runtime.
change game mechanics so players will not be able to gain that amount of wealth.
I know that the existence of a reachable upper limit is rather limiting for some players, so for me the third option is the worst of the proposed ones.
Are there any other ways of dealing with this kind of problem? Which one would you go for?
Taking an example from the real world, why not have different denominations of coins, e.g. a column for a million units of the currency?
Changing to a larger data type is likely the easiest solution, and considerations of disk space/memory aren't likely to be significant unless your game is huge in scale. Have 5,000 users playing your game? Changing from 32 bits to 64 bits will consume roughly 20k extra. That's not enough to lose any sleep over.
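The arithmetic behind that "roughly 20k" (assuming a single 4-byte column per player widened to 8 bytes; double it if you widen both the cash and the bank columns):

    players = 5_000
    extra_bytes_per_column = 8 - 4               # 32-bit -> 64-bit
    print(players * extra_bytes_per_column)      # 20,000 bytes, about 20 KB per widened column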
The best answer would likely come from someone familiar with how banks handle these types of situations, though their requirements may be far more complicated than what you need.
Memory shouldn't be a problem, depending on the number of players you'll have simultaneously, but storing money as a string will definitely use more disk space.
But seriously, 4 294 967 296 rupees/simoleons/furlongs? Who are they? Sim Gates?
Why not store money the way it should be stored, as a Money data type? This assumes, of course, that you are using SQL Server. The money data type won't have this limitation and won't be affected by rounding issues.
I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Date and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in its own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see whether any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways, I'd go with separate rows for each data point. Most of the effort required to write this program well is likely on the display side, so you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevant and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore very relevant.
I think you should:
1. Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
This might also be "good enough" when querying the data. However, if performance is an issue, and/or if you get massive volumes [100,000 sensors x 4 samples per hour x 24 hours = 9,600,000 rows per day, which over 5 years comes to 17,529,600,000 or so rows], it may not be. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
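As a concrete illustration of the row-per-sample option and the composite-signal query it enables, here is a SQLite sketch with made-up sensor IDs (a production system at this volume would add partitioning or a warehouse-style layout, as discussed above):

    import sqlite3

    db = sqlite3.connect("samples.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS sample (
            sensor_id INTEGER NOT NULL,
            ts        INTEGER NOT NULL,   -- normalized 15-minute timestamp (epoch seconds)
            value     INTEGER NOT NULL,
            PRIMARY KEY (sensor_id, ts)
        )
    """)

    # Composite signal "A + B - C" over a time window, one value per timestamp.
    rows = db.execute("""
        SELECT ts,
               SUM(CASE sensor_id WHEN :a THEN value
                                  WHEN :b THEN value
                                  WHEN :c THEN -value END) AS combined
        FROM sample
        WHERE sensor_id IN (:a, :b, :c) AND ts BETWEEN :start AND :end
        GROUP BY ts
        ORDER BY ts
    """, {"a": 1, "b": 2, "c": 3, "start": 0, "end": 2_000_000_000}).fetchall()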
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse every day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.