DB design for sensor data (lots and LOTS of data)

I am writing an application for viewing and managing sensor data. I can have an unlimited number of sensors, and each sensor takes one reading every minute and records the values as (time, value, sensor_id, location_id, [a bunch of other doubles]).
As an example, I might have 1000 sensors and collect data every minute for each one of them, which ends up generating 525,600,000 rows after a year. Multiple users (say up to 20) can plot the data of any time period, zoom in and out of any range, and add annotations to the data of one sensor at a time. Users can also modify certain data points, and I need to keep track of both the raw data and the modified data.
I'm not sure what the database for an application like this should look like! Should it be just one table, SensorData, with indices on time, sensor_id and location_id? Should I partition this single table by sensor_id? Should I save the data in files for each sensor each day (say .csv files) and load them into a temp table upon request? How should I manage annotations?
I have not decided on a DBMS yet (maybe MySQL or PostgreSQL), but my intention is to get a general insight into data management in applications like this.

I am assuming that the users cannot change the fields you show (time, value, sensor_id, location_id), but can change the other, implied fields.
In that case, I would suggest Version Normal Form. The fields you name are static, that is, once entered, they never change. However, the other fields are changeable by many users.
You fail to state whether users see all users' changes or only their own. I will assume all changes are seen by all users. You should be able to make the appropriate changes if that assumption is wrong.
First, let's explain Version Normal Form. As you will see, it is just a special case of Second Normal Form.
Take a tuple of the fields you have named, rearranged to group the key values together:
R1( sensor_id(k), time(k), location_id, value )
As you can see, the location_id (assuming the sensors are movable) and value are dependent on the sensor that generated the value and the time the measurement was made. This tuple is in 2NF.
Now you want to add updatable fields:
R2( sensor_id(k), time(k), location_id, value, user_id, date_updated, ... )
But the updatable fields (represented by the ellipsis) are dependent not only on the original key fields but also on user_id and date_updated. The tuple is no longer in 2NF.
So we add the new fields not to the original tuple, but create a normalized tuple:
R1( sensor_id(k), time(k), location_id, value )
Rv( sensor_id(k), time(k), user_id(k), date_updated(k), ... )
This makes it possible to have a series of any number of versions for each original reading.
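As a concrete sketch of R1 and Rv as SQL tables (the table names and data types below are assumptions for illustration; adapt them to your DBMS and to the extra double columns you actually have):
-- Sketch only: names and types are illustrative.
CREATE TABLE sensor_reading (            -- R1: the immutable raw readings
    sensor_id    INT              NOT NULL,
    time         TIMESTAMP        NOT NULL,
    location_id  INT              NOT NULL,
    value        DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (sensor_id, time)
);

CREATE TABLE sensor_reading_version (    -- Rv: any number of user edits/annotations per reading
    sensor_id    INT       NOT NULL,
    time         TIMESTAMP NOT NULL,
    user_id      INT       NOT NULL,
    date_updated TIMESTAMP NOT NULL,
    -- the updatable fields (modified value, annotation text, ...) go here
    PRIMARY KEY (sensor_id, time, user_id, date_updated),
    FOREIGN KEY (sensor_id, time) REFERENCES sensor_reading (sensor_id, time)
);
The composite primary keys double as the indexes that make lookups by (sensor_id, time) fast.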
To query the latest update for a particular reading:
select R1.sensor_id, R1.time, R1.location_id, R1.value, R2.user_id, R2.date_updated, R2.[...]
from R1
left join Rv as R2
  on R2.sensor_id = R1.sensor_id
  and R2.time = R1.time
  and R2.date_updated = (
    select max( date_updated )
    from Rv
    where sensor_id = R2.sensor_id
      and time = R2.time )
where R1.sensor_id = :ThisSensor
  and R1.time = :ThisTime;
To query the latest update for a particular reading made by a particular user, just add the user_id value to the filtering criteria of the main query and subquery. It should be easy to see how to get all the updates for a particular reading or just those made by a specific user.
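For instance, the latest update made by a particular user could look like this (a sketch only, in the style of the query above; :ThisUser is an assumed parameter):
select R1.sensor_id, R1.time, R1.location_id, R1.value, R2.user_id, R2.date_updated, R2.[...]
from R1
left join Rv as R2
  on R2.sensor_id = R1.sensor_id
  and R2.time = R1.time
  and R2.user_id = :ThisUser
  and R2.date_updated = (
    select max( date_updated )
    from Rv
    where sensor_id = R2.sensor_id
      and time = R2.time
      and user_id = R2.user_id )
where R1.sensor_id = :ThisSensor
  and R1.time = :ThisTime;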
This design is very flexible in how you can access the data and, because the key fields are also indexed, it is very fast even on Very Large Tables.

Looking for an answer I came across this thread. While it is not entirely the same as my case, it answers many of my questions, such as whether using a relational database is a reasonable way of doing this (the answer is "Yes"), and what to do about partitioning, maintenance, archiving, etc.
https://dba.stackexchange.com/questions/13882/database-redesign-opportunity-what-table-design-to-use-for-this-sensor-data-col

Related

A problem with relationships between tables, and data overload caused by it

I don't know how to explain my exact problem, but in any case I will try:
In the item_period table, if I modify any data that has already been used in invoices, a new period is created for the item. When creating this period I have to copy all units and all pricing into the new period. In my view this causes a large inflation of the data, especially since there may be 100,000 items or more, in which case the volume of data becomes very large. My method may be incorrect, but I use it so that each period keeps its correct data.
I also forgot to mention the billing table: invoices reference the item through id_item_rate rather than the item or unit name, so id_item_rate is used to look up the price, the item and its unit.
My question is: how do I avoid this data inflation when data in the item_period table is modified?

Database design for application with wiki-like functions

I'm making an API for movies/TV/actors etc. with Web API 2 and SQL Server. The database now has >30 tables, most of them storing data users will be able to edit.
How should I store old version of entries?
Say someone edits the description, runtime and tagline for an entry (movie) in the movies table.
I'll have a table (movies_old) where I store the editable fields from 'movies' plus who edited it and when.
All in the same database. The '???_old' tables have no relationships.
I'm very new to database design. Is there something obviously wrong with this?
To my mind, there are two issues here: what table you store the data in, and what goes in the "historical value" field.
On the first question, there are two obvious options: Store old and new records in the same table, with some sort of indication of which is "current" and which is "history", or have a separate table for history.
The main advantage of one table is that you have a simpler schema. This is especially true if the table contains many fields. If there are two tables, then all the field definitions are duplicated. When you move data from the current table to the history table, you have to copy every field, and if the list of fields changes, or their formats change, you have to remember to update the copy. Any queries that show the history have to read two tables. Etc. But with one table, all that goes away. Converting a record from current to history just means changing the setting of the "is_current" flag or however you indicate it.
The main advantages of two tables are, (a) Access is probably somewhat faster, as you don't have so many irrelevant records to skip over. (b) When reading the current table you don't have to worry about excluding the history records.
Oh, an annoying thing about SQL: In principle you could put a date on each record, and then the record with the latest date is the current one. In practice this is a pain: you usually have to have an inner query to find the latest date, and then feed this back in to an outer query that re-reads the record with that date. (Some SQL engines have ways around this. Postgres, for example.) So in practice, you need an "is_current" flag, probably 1 for current and 0 for history or some such.
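As a sketch of both approaches (movies_history, movie_id and edited_at are assumed names, not from the question):
-- Generic SQL: find the latest history row per movie with a correlated subquery
select h.*
from movies_history h
where h.edited_at = (
    select max(h2.edited_at)
    from movies_history h2
    where h2.movie_id = h.movie_id );

-- PostgreSQL shortcut: DISTINCT ON keeps only the newest row per movie
select distinct on (movie_id) *
from movies_history
order by movie_id, edited_at desc;
With an "is_current" flag, both collapse to a simple "where is_current = 1".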
The other issue is what to put in the contents. If you're dealing with short fields, customer number and amount billed and so forth, then the simple and easy thing to do is just store the complete old contents in one record and the complete new contents in the new record. But if you're dealing with a long text block, like a plot synopsis or a review, there could be many small editorial changes. If every time someone fixes a grammar or spelling error, we have a whole new record with the entire 1000 characters, of which 5 characters are different, this could really clutter up the database. If that's the case you might want to investigate ways to store changes more efficiently. May or may not be an issue to you.

Range Key Querying on composed keys

Currently I have a collection which contains the following fields:
userId
otherUserId
date
status
For my Dynamo collection I used userId as the hashKey, and for the rangeKey I wanted to use date:otherUserId. By doing it like this I could retrieve all userId entries sorted on date, which is good.
However, for my use case I shouldn't have any duplicates, meaning I shouldn't have the same userId-otherUserId value in my collection. This means I should do a query first to check if that 'couple' exists, remove it if needed, and then do the insert, right?
EDIT:
Thanks for your help already :-)
The goal of my usecase would be to store when userA visits the profile of userB.
Now, The kind of queries I would like to do are the following:
Retrieve all the UserBs that visited UserA's profile, unique (no duplicate UserBs) and sorted by time.
Retrieve a particular pair visit of UserA and UserB
I think you have a lot of choices, but here is one that might work based on the assumption that your application is time-aware i.e. you want to query for interactions in the last N minutes, hours, days etc.
hash_key = userA
range_key = iso8601_timestamp + userB + uuid
First, the uuid trick is there to avoid overwriting a record of an interaction between userA and userB happening at exactly the same time (which can occur depending on the granularity/precision of your clock). So insert-wise we are safe: no duplicates, no overwrites.
Query-wise, here is how things get done:
Retrieve all the UserBs that visited UserA's profile, unique (no duplicate UserBs) and sorted by time.
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix = 2013-01 for all interactions in Jan 2013
This will retrieve all records in that time range, sorted by the range key. Then in the application code you filter them to retain only those concerning userB. Unfortunately, the DynamoDB API doesn't support a list of range key conditions (otherwise you could just save some time by passing an additional CONTAINS userB condition).
Retrieve a particular pair visit of UserA and UserB
query(hash_key=userA, range_key_condition=BEGINS_WITH(common_prefix))
where common_prefix could be much more precise if we can assume you know the timestamp of the interaction.
Of course, this design should be evaluated with respect to the properties of the data stream you will handle. If you can (most often) specify a meaningful time range for your queries, it will be fast and bounded by the number of interactions recorded in that time range for userA.
If your application is not so time-oriented - and we can assume a user most often has only a few interactions - you might switch to the following schema:
hash_key = userA
range_key = userB + iso8601_timestamp + uuid
This way you can query by user:
query(hash_key=userA, range_key_condition=BEGINS_WITH(userB))
This alternative will be fast and bounded by the number of userA - userB interactions over all time ranges, which could be meaningful depending on your application.
So basically you should check example data and estimate which orientation is meaningful for your application. Both orientations (time or user) might also be sped up by manually creating and maintaining indexes in other tables - at the cost of a more complex application code.
(historical version: trick to avoid overwriting records with time-based keys)
A common trick in your case is to postfix the range key with a generated unique id (uuid). This way you can still do query calls with BETWEEN condition to retrieve records that were inserted in a given time period, and you don't need to worry about key collision at insertion time.

Database Normalization and User Defined Data Storage

I am looking to let the users of my web application define their own attributes for products and then enter data for those products. I have found out that this technique is called n(th) normal form.
The following is the DB structure I am currently considering deploying, and I was wondering what the positives and negatives would be in regards to integrity and scalability (and any other -ity's you can think of).
EDIT
(Sorry, This is more what I mean)
I have been staring at this for the last 15 minutes and I know that the part where the red arrow is induces duplication, and hence you would have to have integrity checks. But I just don't understand how else what I want could be done.
The products would number no more than 10. The variables would number no more than 200 (max 20 per product). The number of product instances would not exceed 100,000, therefore the maximum size of pVariable_data would not exceed 2 million rows.
This model is called a database in a database and is not nice. Sometimes, though, it is unavoidable; first check whether you really need it and whether your database is really the right tool for the job.
With PostgreSQL you could use http://www.postgresql.org/docs/8.4/static/hstore.html, which is a ready-made solution for this kind of issue.
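A rough sketch of what that looks like (PostgreSQL 9.1+ syntax; table, column and key names are just examples):
-- hstore ships with PostgreSQL as a contrib extension
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE product (
    id    serial PRIMARY KEY,
    name  text NOT NULL,
    attrs hstore              -- user-defined attributes as key/value pairs
);

INSERT INTO product (name, attrs)
VALUES ('Widget', 'color => "red", weight_kg => "1.5"');

-- Query on a user-defined attribute
SELECT name, attrs -> 'color' AS color
FROM product
WHERE attrs -> 'color' = 'red';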
Assuming that pVariable is more of a pVariable type, drop the reference to product_fk. It would mean that you need a new entry in that table for every Product record. Maybe try something like this:
Product(id, active, allow_new)
pVariable_type(id, name)
pVariable_data(id, product_fk, pvariable_fk, non_typed_value, bool, int, etc)
I would use the non_typed_value as your text value, and (unless you are keeping streams) write a record into that field along with the typed value. It will mean keeping the value of a record twice (and more of a pain on updates etc) but it will make querying easier, along with reporting (anything you just need to display the value for).
Note: it would also be a good idea to pull anything that is common to all products into the product table. For example, all products will most likely have a name, suggested price, etc.
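A rough DDL sketch of that layout (data types are assumptions, and the bool/int columns are renamed here to avoid reserved words):
CREATE TABLE Product (
    id        int     PRIMARY KEY,
    active    boolean NOT NULL,
    allow_new boolean NOT NULL
);

CREATE TABLE pVariable_type (
    id   int          PRIMARY KEY,
    name varchar(100) NOT NULL
);

CREATE TABLE pVariable_data (
    id              int PRIMARY KEY,
    product_fk      int NOT NULL,
    pvariable_fk    int NOT NULL,
    non_typed_value varchar(255),   -- text copy of the value for display/reporting
    bool_value      boolean,        -- typed columns: only the matching one is populated
    int_value       int,
    FOREIGN KEY (product_fk)   REFERENCES Product (id),
    FOREIGN KEY (pvariable_fk) REFERENCES pVariable_type (id)
);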

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let's forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType)? Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with and I would try it and see how it scales before going with something more complicated.
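A sketch of that single-table variant, with a point-in-time lookup across both dimensions (column types and parameter names are assumptions):
CREATE TABLE Client (
    client_id           int NOT NULL,
    name                varchar(200),
    client_type         varchar(50),
    relationship_type   varchar(50),
    relationship_status varchar(50),
    valid_dt_start      timestamp NOT NULL,   -- when the facts were true in the real world
    valid_dt_end        timestamp NOT NULL,
    audit_dt_start      timestamp NOT NULL,   -- when the row was recorded/superseded in the database
    audit_dt_end        timestamp NOT NULL,
    PRIMARY KEY (client_id, valid_dt_start, audit_dt_start)
);

-- "What did we believe at :as_of_audit about the client's state at :as_of_valid?"
SELECT *
FROM Client
WHERE client_id = :this_client
  AND valid_dt_start <= :as_of_valid AND :as_of_valid < valid_dt_end
  AND audit_dt_start <= :as_of_audit AND :as_of_audit < audit_dt_end;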
