Data modelling in InfluxDB

I would like your suggestions on data modelling in InfluxDB. I have vehicles which produce a lot of values like soc, odometer, lat, lon, etc.
I see two options for modelling this data in InfluxDB:
1) Create a single measurement 'x' with vehicle_id as a tag and various fields under it, such as soc, odometer, lat, lon
2) Create multiple measurements (soc, odometer, lat, lon, ...), each with vehicle_id as a tag
Which one would you prefer, and why?
I am currently using the first one, which seems to be the more natural approach.

According to the official documentation, a measurement is conceptually a table, and fields are more like attributes under a single measurement. Because InfluxDB is a columnar database, records/points are organised into series, where each series belongs to a combination of a database, a measurement and field(s), with one or more fields acting as the identifier.
To answer your question: if you would like to query multiple values together, you should put them as fields under one single measurement; if they have no inter-connection whatsoever, then putting them in separate measurements would be more human-readable.
Moreover, in theory you could put everything into one single measurement with hundreds of fields and query only the relevant ones. However, this would create high cardinality, which puts pressure on reads and writes; you would then need to manage it through batching and cache settings in the InfluxDB configuration. It is therefore more read- and write-efficient to use as few fields as possible.
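To make the two options concrete, here is a minimal sketch of what the stored points could look like, written as InfluxDB line protocol strings inside a small Python script (the measurement and field names come from the question; the values are made up):

# Option 1: one "vehicle" measurement, vehicle_id as a tag, several fields per point.
option_1 = [
    "vehicle,vehicle_id=ac31b2ea31 soc=87.5,odometer=123144i,lat=42.0,lon=22.0",
]

# Option 2: one measurement per value, each with the same vehicle_id tag.
option_2 = [
    "soc,vehicle_id=ac31b2ea31 value=87.5",
    "odometer,vehicle_id=ac31b2ea31 value=123144i",
    "lat,vehicle_id=ac31b2ea31 value=42.0",
    "lon,vehicle_id=ac31b2ea31 value=22.0",
]

for line in option_1 + option_2:
    print(line)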

Go with option 1.
It'll be easier to query this data later, e.g. a single "vehicle" measurement:
+------------+-----+----------+-----+-----+
| vehicle_id | soc | odometer | lat | lon |
+------------+-----+----------+-----+-----+
| ac31b2ea31 | ?   | 123144   | 42  | 22  |
+------------+-----+----------+-----+-----+
Make the vehicle_id a tag, and the rest fields.
Keep in mind that the more tags you have, and the more unique values (aka "cardinality") there are in those tags, the larger and slower your index.
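If it helps, here is a minimal write sketch using the InfluxDB 2.x Python client (influxdb-client); the URL, token, org and bucket names are placeholders:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One "vehicle" point: vehicle_id as the tag, everything else as fields.
point = (
    Point("vehicle")
    .tag("vehicle_id", "ac31b2ea31")
    .field("soc", 87.5)
    .field("odometer", 123144)
    .field("lat", 42.0)
    .field("lon", 22.0)
)
write_api.write(bucket="telemetry", record=point)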

In our case some of the devices we manage report multiple values (e.g. a GPS provides lat, lon, etc.), hence it felt natural to have one measurement for the device (GPS) and enclose the different values in fields. We add a tag for the vehicle id.
This approach also means that no naming convention needs to be enforced on the field names.
Please share what you finally used.

The cardinality problem is the same in both models: what counts towards cardinality is the number of unique combinations of (measurement, tags, field), so splitting fields into measurements does not reduce cardinality.
What you should consider is the type of analysis you would like to perform on your data. Historically (and I believe it is still the case), InfluxQL could not do computations across measurements, so splitting your various fields into separate measurements would prevent you from doing some computations.
The tabular model is not the ideal approach when it comes to time series; it is better to consider each time series as a separate object and leverage functions which can work on a set of series. That way, no matter how you decide to organize your data in the storage layer, the analytics layer can make use of it.
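As a small illustration of that point about InfluxQL (using the InfluxDB 1.x Python client, influxdb, with placeholder connection details): with option 1 a single InfluxQL query can read and combine several fields, whereas with option 2 the same values live in different measurements and InfluxQL cannot compute across them.

from influxdb import InfluxDBClient  # InfluxDB 1.x client

client = InfluxDBClient(host="localhost", port=8086, database="telemetry")

# Option 1: the fields share the "vehicle" measurement, so one InfluxQL query
# can select (and combine) them together.
result = client.query(
    "SELECT soc, odometer, lat, lon FROM vehicle WHERE vehicle_id = 'ac31b2ea31'"
)
print(list(result.get_points()))

# Option 2: soc, odometer, lat and lon are separate measurements, and InfluxQL
# cannot join or compute across measurements in one query.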

Related

Sensor network calculating features - design database

How would you model a system involving sensors
which have some spatial data like lat, lon, altitude, ...
and produce raw measurements, let's say distance, temperature, ...
for the purpose of calculating "features" of the location they are at, like volume or temperature?
Here's my idea for the tables:
Sensor
- id
- model
- type
SensorMetadata
- sensorId
- timestamp // for time series, to account for changes over time
- lat
- long
- alt
- metadata: json // some dynamic, changeable data based on the domain, let's say relative distance to something...
- unit
Measurements
- id
- timestamp
- sensorId
- value
Feature
- id
- type // complex features like volume
- value
- timestamp
- // relation to location (maybe sensorMetadata)
1) Would you model it to be specific to the domain if, let's say, tanks of fluids are in question, or would you model it generically using "features" language?
2) How and when would you calculate "features" based on measurements? Clearly I'm missing here
which sensors I would take into account to calculate the feature of the location (let's say multiple sensors are used in some cases).
Regarding Sensor, don't be afraid to use something like SensorInstance - ambiguity is something to be avoided if you can, especially if you can do so without making really long names. In the long run, a clear (unambiguous) database design is usually better than a concise one.
SensorMetadata - SensorState is another option.
Measurements - try to avoid plural names; singular (like Sensor) is usually better. This page has some good guidance: https://www.sqlshack.com/learn-sql-naming-conventions/
The big issue I can see, though, is Measurements.value - how do you interpret the value? What data type is it - i.e. what data type(s) are your sensors producing? Measurements relate to sensors - do sensor measurements always result in one value (as implied by your design)?
I'm not clear what the Feature table's purpose is.
For design challenges like this, have a look for existing reference architectures and designs - this outfit looks like they might have some useful whitepapers: https://crate.io/use-cases/iot-sensor-data/ This might also be useful reading for you but won't provide a direct answer itself: https://hackernoon.com/ingesting-iot-and-sensor-data-at-scale-ee548e0f8b78
Update 28-Jun-21
So the Value column in Measurement would be interpreted based on the SensorId it came from, and since a Sensor has SensorMetadata which stores its Unit, I would be able to interpret the Value - is that OK?
Should be fine.
For Feature - a couple of approaches:
Calculate it later - i.e. just gather raw data, and do bulk calculations later. For example you could ETL the data from the data capture system (which due to the table design is basically OLTP data model) to a reporting system (OLAP data model).
Calculate it now - i.e. as it's captured by the application feeding the database, or a trigger/logic in the database. For that you'll want some sort of reference between the feature table and the measurement and/or sensor table. That way you are able to test the logic that calculated the Feature.Value, because if you can't do that - can you trust your data? It would also allow you to calculate new values if you decided you wanted to change the calculation formula.
Personally I think #1 is more flexible: once you have the raw data you can retrospectively add any new feature you like, but that might not fit with how you need the system to work.
Option 2 is tricky. If Feature.Value is calculated off more than one Measurement.Value (and potentially more than one sensor) then it's also likely the number of measurements and sensors will vary across different features - so you'd need a many-to-many relationship between them (Features and Measurements).
Having a many-to-many relationship is fine: a joining table is common practice (in general terms), and fits into the OLTP model which is good for transactions i.e. data gathering. Flipping this into a more reporting friendly OLAP model will be more complicated though.
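A rough sketch of that joining-table shape, using SQLite via Python (the table and column names follow the design above; FeatureMeasurement is just an illustrative name, not a prescribed one):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Measurement (
    id INTEGER PRIMARY KEY,
    sensorId INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    value REAL NOT NULL
);

CREATE TABLE Feature (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,      -- e.g. 'volume'
    value REAL NOT NULL,
    timestamp TEXT NOT NULL
);

-- Joining table: a Feature can be derived from many Measurements,
-- and a Measurement can feed many Features.
CREATE TABLE FeatureMeasurement (
    featureId INTEGER NOT NULL REFERENCES Feature(id),
    measurementId INTEGER NOT NULL REFERENCES Measurement(id),
    PRIMARY KEY (featureId, measurementId)
);
""")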
Lastly, regarding your question #2 - I think I have partially answered it indirectly. Basically it depends: how soon is the Feature data needed? If it's needed at runtime, to provide immediate information to people or systems, then you'll want to calculate it as soon as possible. Maybe post that as a second question, and provide information on the feature use case, solution context (who is using it, under what circumstances), etc.

Cassandra Geolocation, to index or not to index?

My goal is to be able to write a query such that I can find all of the rows in a table between a certain radius of a lat and long.
So a query like this:
SELECT * FROM some_table WHERE lat > someVariableMinLat AND
lat < someVariableMaxLat AND
lng > someVariableMinLng AND lng < someVariableMaxLng;
along those lines.
Now, my thought is that these should of course be indexed; I just wanted to confirm that. Related reading or info would be great, thank you!
Your query requires ALLOW FILTERING to run, assuming you've set lat and lng as secondary indices.
Since you're interested in related readings and information, I will gladly share my little knowledge with you. Let me start with ALLOW FILTERING. You've created a rather complex query that (1) uses < and > instead of = and (2) touches more than one non-primary-key column.
What ALLOW FILTERING does is query the database first and then apply part of your conditions to the result. It is therefore far from efficient if performance is your concern.
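For illustration, here is roughly what running that query looks like with the DataStax Python driver (cassandra-driver); the contact point, keyspace and bind values are placeholders, and the query is only accepted once ALLOW FILTERING is appended:

from cassandra.cluster import Cluster  # DataStax cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # placeholder keyspace

# With lat/lng as secondary indexes, range predicates on both columns are only
# accepted with ALLOW FILTERING, and Cassandra will scan and filter rather than
# do an efficient index lookup.
rows = session.execute(
    """
    SELECT * FROM some_table
    WHERE lat > %s AND lat < %s
      AND lng > %s AND lng < %s
    ALLOW FILTERING
    """,
    (40.0, 41.0, -74.5, -73.5),
)
for row in rows:
    print(row)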
Speaking of performance, it's important to note that a column that tends to have many distinct values is not a good candidate for a secondary index. You may find out more about this topic here.
How would I do that?
I'm not sure about your requirements, but you could consider using a geohash. A geohash is an encoded form of both longitude and latitude, and it can get pretty precise as well. By using geohash strings, you can play a trade-off game between the length of your geohash in characters and its precision (the longer the string, the more precise it becomes). Perhaps you could set the geohash as your index column, which implies that the longer the geohash, the more distinct values the column would have. You may even consider setting it as the primary key to take the performance to a higher level.
Or maybe, you could set two primary key columns: one to keep the short geohash, and another one to keep the longer hash for the same location, if you want different levels of precision :)
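A tiny sketch of the idea, assuming the pygeohash package is available; the coordinates and precisions are arbitrary:

import pygeohash  # assumed third-party package: pip install pygeohash

lat, lon = 40.7484, -73.9857

coarse = pygeohash.encode(lat, lon, precision=5)  # large cell, few distinct values
fine = pygeohash.encode(lat, lon, precision=9)    # small cell, many distinct values

print(coarse, fine)
# Nearby points share a geohash prefix, so a short geohash can serve as the
# partition/index column to narrow the search to one cell (and its neighbours),
# after which the longer hash or the exact lat/lon refines the match.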

Making a table with fixed columns versus key-valued pairs of metadata?

I was asked to create a table to store paid-hours data from multiple attendance systems from multiple geographies from multiple sub-companies. This table would be used for high level reporting so basically it is skipping the steps of creating tables for each system (which might exist) and moving directly to what the final product would be.
The request was to have a dimension for each type of hours or pay like this:
date       | employee_id | type          | hours | amount
2016-04-22 | abc123      | regular       | 80    | 3500
2016-04-22 | abc123      | overtime      | 6     | 200
2016-04-22 | abc123      | adjustment    | 1     | 13
2016-04-22 | abc123      | paid time off | 24    | 100
2016-04-22 | abc123      | commission    |       | 600
2016-04-22 | abc123      | gross total   |       | 4413
There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.
The data is coming from several sources and I was told not to worry about the ETL, but just design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.
I have only seen the raw data from one system, and it looks like this:
date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours | OT_amount | classification | amount | hours
It is pretty messy. There are multiple rows per employee, and values like gross_total repeat on each row. There is a classification column which has items like PTO (paid time off), adjustments, empty values, commission, etc. Because of the repeating values, it is impossible to simply sum the data up so that it equals the gross_total_amount.
Anyway, I would kind of prefer a column-based approach where each row describes an employee's paid hours for a cut-off. One problem is that I won't know all of the possible types of hours, so I can't necessarily make a table like:
date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount
I am more used to data formatted that way, though. The concern is that you might not capture all of the necessary columns, or that something new gets added. (For example, I know there is maternity leave, paternity leave and bereavement leave, and in other geographies there are labor laws about working at night, etc.)
Any advice? Is the table which was suggested to me by my superior a viable solution?
TAM makes lots of good points, and I have only two additional suggestions.
First, I would generate some fake data in the table as described above, and see if it can generate the required reports. Show your manager each of the reports based on the fake data, to check that they're OK. (It appears that the reports are the ultimate objective, so work back from there.)
Second, I would suggest that you get sample data from as many of the input systems as you can. This is to double check that what you're being asked to do is possible for all systems. It's not so you can design the ETL, or gather new requirements, just testing it all out on paper (do the ETL in your head). Use this to update the fake data, and generate fresh fake reports, and check the reports again.
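As a rough illustration of that "fake data first" idea, here is a small Python/SQLite sketch; the table follows the shape suggested above, and the report is just one example aggregate:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paid_hours (date TEXT, employee_id TEXT, type TEXT, hours REAL, amount REAL)"
)

# Fake data shaped like the proposed table.
fake_rows = [
    ("2016-04-22", "abc123", "regular", 80, 3500),
    ("2016-04-22", "abc123", "overtime", 6, 200),
    ("2016-04-22", "abc123", "paid time off", 24, 100),
    ("2016-04-22", "abc123", "commission", None, 600),
]
conn.executemany("INSERT INTO paid_hours VALUES (?, ?, ?, ?, ?)", fake_rows)

# One report you might be asked for: total amount per employee per period.
for row in conn.execute(
    "SELECT date, employee_id, SUM(amount) FROM paid_hours GROUP BY date, employee_id"
):
    print(row)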
Let me recapitulate what I understand to be the basic task.
You get data from different sources, having different structures. Your task is to consolidate them in a single database to be able to answer questions about all these data. I understand the hint about "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain all the detail information that might be present in the original data, just enough information to fulfill the specific requirements of the consolidated database.
This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.
In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. For example, if there are detail records and gross-total records, as in your example, it would be wrong to add up the hours of all records, as this would always yield twice the hours actually worked. So someone will have to worry about ETL, namely interpreting each set of records (probably consisting of all entries for an employee and one working day), finding out what they mean, and transforming them into the consolidated structure.
I understand another part of the question to be about the usage of metadata. You can have different columns for notions like holiday leave and maternity leave, or you can have a metadata table containing these notions as key-value pairs and refer to the key from your main table. The metadata way is sometimes praised as being more flexible, as you can introduce a new type (like paternity leave) without redesigning your database. However, you will need to redesign the software filling, and probably also querying, your tables to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
There is one major difference between a broad table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the broad table: just make all attributes "not null", and you're done. Ensuring this for the metadata solution would mean some rather complicated constraint that may or may not be available depending on the database system you use.
If that's not a main requirement, I would go the pragmatic way and use different columns if I expect only a handful of those types, and a separate key-value table otherwise.
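To make the two shapes concrete, here is a small sketch of both in SQLite via Python; the table and column names are made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")

# Broad table: one pair of columns per known type of hours/pay.
# NOT NULL makes the "all values present" rule trivial to enforce.
conn.execute("""
CREATE TABLE paid_hours_wide (
    date TEXT NOT NULL,
    employee_id TEXT NOT NULL,
    regular_hours REAL NOT NULL,
    regular_amount REAL NOT NULL,
    overtime_hours REAL NOT NULL,
    overtime_amount REAL NOT NULL
)""")

# Key-value (metadata) variant: the types live in a lookup table, so new types
# need no schema change, but the "all or none" rule is harder to enforce.
conn.execute("CREATE TABLE pay_type (code TEXT PRIMARY KEY, description TEXT)")
conn.execute("""
CREATE TABLE paid_hours_kv (
    date TEXT NOT NULL,
    employee_id TEXT NOT NULL,
    pay_type_code TEXT NOT NULL REFERENCES pay_type(code),
    hours REAL,
    amount REAL
)""")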
All these considerations relied on your superior's assertion (as I understand it) that your consolidated table will only need to fulfill the requirements known today, so you are free to throw original detail information away if it's not needed due to these requirements. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone asks for a report also containing this information, where present. This won't be possible if your data structure only contains what's needed today.
There are two ways to handle this, i.e. to provide for future needs. You can, after knowing the data coming from each additional source, extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. Also, there is some probability that not all of your effort will be worth the trouble, as not all of the detail information you get will actually be needed for your consolidated database. Another more elegant way would therefore be to keep the original data that you import for each source, and only in case of a concrete new requirement, extend your database and reimport the data from the sources to cover the additional details. Prices of storage being low as they are, this might yield an optimal cost-benefit ratio.

Efficiently search large DB for similar records

Suppose the following: as input, one would get a record consisting of N numbers and booleans. This vector has to be compared to a database of vectors, which include M additional "result" elements. That means the database holds P vectors of size N+M.
Each vector in the database holds a boolean as its last element. The aim of the exercise is to find, as fast as possible, the record(s) which are the closest match to the input vector AND whose result vector ends with a TRUE boolean.
To make the above a bit more comprehensible, consider the following example:
A database with personal health information, consisting of records holding:
age
gender
weight
length
heart issues (boolean)
lung issues (boolean)
residence
alternative plan Chosen (if done)
accepted offer
The program would then get an input like
36 Male 185pound 68in FALSE FALSE NYC
It would then find out which plan would be the best to offer the client, based on what's in the database.
I know of a few methods which would help to do this, e.g. the Levenshtein distance. However, most methods would involve searching the entire database for the best matches.
Are there any algorithms, methods which would cut back on the processing power/time required? I can't imagine that eg. insurance agencies don't use more efficient methods to search their databases...
Any insights into this area would be greatly appreciated!
Assumption: this is a relational database. If instead it were NOSQL then please provide more info on which db.
Do you have the option to create bitmap indexes? They can cut down the number of records returned. That is useful for almost all of the columns, since the cardinalities are low.
After that the only one left is the residence, and you should use a Geo distance for that.
If you are unable to create bitmap indexes then what are your filtering options? If none then you have to do a full table scan.
For each of the components e.g. age, gender, etc. you need to
(a) determine a distance metric
(b) determine how to compute both the metric and the distance between different records.
I'm not sure Levenshtein would work here - you need to take each field separately to find its contribution to the whole distance measure.
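A toy sketch of that per-field idea in Python; the field order, scaling factors and data are made up, and a real system would push the boolean filter and coarse matching into the database rather than scanning in application code:

# Each record: (age, is_male, weight_lb, height_in, heart_issues, lung_issues, accepted_offer)
records = [
    (34, True, 180, 69, False, False, True),
    (36, True, 190, 68, False, True, False),
    (52, False, 140, 64, True, False, True),
]

query = (36, True, 185, 68, False, False)

def distance(rec, q):
    # Per-field contributions: numeric fields are scaled, booleans count 0 or 1.
    return (
        abs(rec[0] - q[0]) / 50      # age
        + (rec[1] != q[1])           # gender
        + abs(rec[2] - q[2]) / 100   # weight
        + abs(rec[3] - q[3]) / 20    # height
        + (rec[4] != q[4])           # heart issues
        + (rec[5] != q[5])           # lung issues
    )

# Only rows whose last element is TRUE (accepted offer) are candidates.
candidates = [r for r in records if r[-1]]
best = min(candidates, key=lambda r: distance(r, query))
print(best)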

Strategy in storing ad-hoc numbers/constants?

I have a need to store a number of ad-hoc figures and constants for calculation.
These numbers change periodically, but they are different types of values. One might be a balance or money amount, another might be an interest rate, and yet another might be a ratio of some kind.
These numbers are then used in a calculation that involve other more structured figures.
I'm not certain what the best way to store these in a relational DB is - that's the choice of storage for the app.
One way, which I've done before, is to create a very generic table that stores the values as text. I might store the data type along with it, but the consumer knows what type it is, so in some situations I didn't even need to store the data type. This kind of works fine, but I am not very fond of the solution.
Should I break down each of the numbers into specific categories and create tables that way? For example, create Rates table, and Balances table, etc.?
Yes, you should definitely structure your database accordingly. Having a generic table holding text values is not a great solution, and it also adds overhead when using those values in programs that may pull that data for some calculations.
Keeping the tables and values separated allows you to do things like adding dates and statuses to your values (perhaps some are active while others aren't?) and also allows you to keep an accurate history (what if I want to see a particular rate from last year?). It also makes things easier for those who come after you to sift through your data.
I suggest reading this article on database normalization.
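As a small illustration of the structured approach (the table and column names are just examples), a rates table with effective dates keeps the type explicit and preserves history:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE interest_rate (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,            -- e.g. 'late_payment_rate'
    rate REAL NOT NULL,            -- stored as a number, not text
    effective_from TEXT NOT NULL,  -- lets you keep and query history
    effective_to TEXT              -- NULL while the rate is current
)""")
conn.execute(
    "INSERT INTO interest_rate (name, rate, effective_from) VALUES (?, ?, ?)",
    ("late_payment_rate", 0.035, "2021-01-01"),
)

# Current value of a named rate:
row = conn.execute(
    "SELECT rate FROM interest_rate WHERE name = ? AND effective_to IS NULL",
    ("late_payment_rate",),
).fetchone()
print(row[0])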
