Is QuestDB suitable for querying interval data, such as energy meters?

I want to design a database to store IoT data from several utility meters (electricity, gas, water) using the MQTT protocol.
Is QuestDB suitable for this type of data, where it would store meter readings (it is the difference between readings that is mainly of interest, as opposed to the readings themselves)? More specifically, would the database allow me to quickly and easily run the following queries? If so, some example queries would be helpful.
calculate the energy consumption of a specific meter, or of all meters, in a given date period (essentially taking the difference between the readings at two dates, for example)
calculate the rate of energy consumption with time over a specific period for trending purposes
Also, could it cope with situations where a faulty meter is replaced and the reading is therefore reset to 0, for example, without consumption queries being affected?
My initial ideas:
Create a table with Timestamp, MeterID, UtilityType, MeterReading, ConsumptionSinceLastReading fields
When entering a new meter reading record, the application would calculate the consumption since the last reading and store it in the relevant table field. This doesn't quite feel like the right approach, though; perhaps a time series database like QuestDB has a built-in solution for this kind of problem?
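For reference, a minimal sketch of what such a table could look like in QuestDB (the column names follow the idea above, with ts matching the timestamp column used in the answer below; the designated timestamp and daily partitioning are assumptions):
create table Readings (
  ts timestamp,
  MeterID symbol,
  UtilityType symbol,
  MeterReading double,
  ConsumptionSinceLastReading double
) timestamp(ts) partition by day;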

I think you have approached the problem the right way.
By storing both the actual reading and the difference from the previous reading, you have everything you need.
You could calculate the energy consumption and the rate with SQL statements.
Consumption for a period:
select sum(ConsumptionSinceLastReading) from Readings
where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-19T00:00:00.000000Z';
Rate (consumption/sec) within a period:
select (max(MeterReading) - min(MeterReading)) / (cast(max(ts) - min(ts) as long) / 1000000.0) from (
  select MeterReading, ts from Readings
  where ts > '2022-01-15T00:00:00.000000Z' and ts < '2022-01-20T00:00:00.000000Z' limit 1
  union
  select MeterReading, ts from Readings
  where ts > '2022-01-15T00:00:00.000000Z' and ts < '2022-01-20T00:00:00.000000Z' limit -1
);
There is no specific built-in support for your use case in QuestDB, but it would do a good job. The selects above are optimised because they use the LIMIT keyword (LIMIT 1 and LIMIT -1 fetch just the first and last rows of the period).
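If you also want consumption as a trend over time rather than a single number, QuestDB's SAMPLE BY is designed for that. A minimal sketch, assuming ts is the designated timestamp of Readings and MeterID is a symbol column (daily consumption per meter over the period):
select ts, MeterID, sum(ConsumptionSinceLastReading) as consumption
from Readings
where ts > '2022-01-15T00:00:00.000000Z' and ts < '2022-01-20T00:00:00.000000Z'
sample by 1d;
Summing the stored deltas, rather than differencing raw readings, also helps with the meter-replacement case: as long as the delta recorded for the first reading after a swap reflects the actual consumption (or zero) instead of new reading minus old reading, consumption queries are unaffected.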

Related

Get all latest sensor values for each device in PostgreSQL & TimescaleDB

Description
So, I'm working on a project that stores sensor measurements from multiple devices in a PostgreSQL + TimescaleDB database.
The structure of the table (hypertable):
---------------------------------------------------------------------
column_name | type                     | comment
---------------------------------------------------------------------
identifier  | text                     | device identifier
key         | text                     | name of the metric
value_num   | double precision         | numeric measurement value
value_text  | text                     | text measurement value
timestamp   | timestamp with time zone | timestamp of the measurement
---------------------------------------------------------------------
The table has indexes on (identifier, timestamp) and (identifier, key, timestamp).
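For reference, a minimal sketch of the corresponding DDL, assuming the hypertable is named sensors (the name used in the answer further down); timestamp is quoted only because it is also a type name:
create table sensors (
    identifier  text not null,
    key         text not null,
    value_num   double precision,
    value_text  text,
    "timestamp" timestamptz not null
);
select create_hypertable('sensors', 'timestamp');
create index on sensors (identifier, "timestamp");
create index on sensors (identifier, key, "timestamp");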
Measurement value
A measurement has its value in either the value_num or the value_text column, depending on the measurement type.
Metric types
Each device can have different metrics. For example, one device (FOO) might have:
temperature_air (with value_num as that metric has numeric measurement)
current_program_identifier (with value_text as that metric has text measurement)
and another device (BAR) might have:
temperature_water (with value_num as that metric has numeric measurement)
water_level (with value_num as that metric has numeric measurement)
current_program_identifier (with value_text as that metric has text measurement)
Now I want to have a query, or, better yet, a materialized view, that would show me the most recent measurement of every metric, grouped by device. That means I would expect something like:
---------------------------------------------------------------------------
device | temperature_air | temperature_water | current_program_identifier
---------------------------------------------------------------------------
FOO    | 24.0            | NULL              | H41S
BAR    | NULL            | 32.05             | W89G
---------------------------------------------------------------------------
Even better would be if the query could derive the column a measurement should go to, so the result could be reduced to:
---------------------------------------------------
device | temperature | current_program_identifier
---------------------------------------------------
FOO    | 24.0        | H41S
BAR    | 32.05       | W89G
---------------------------------------------------
Requirements
The query needs to be fast, because:
each device generates ~500k rows per day, so the dataset is quite big and grows fast;
the query will be executed asynchronously from multiple client computers every few seconds.
Other thoughts
Database remodeling
I've thought about remodeling the database into something more normalized, but that appears to be a no-go because the collected metrics are constantly changing and we have no control over them, so we need a table structure that allows us to store any metric. If you have any ideas for a better table structure, please share them with me.
Having a separate table
I've thought that I could simply store the latest values of the metrics that interest us in a separate table at ingestion time, but the data isn't guaranteed to arrive in the correct time order. That would add the big overhead of reading the current data, determining whether the data received is newer than what is already in the DB, and only then performing the insert into that separate table, so that was a no-go. Also, the metrics come in separate messages, and each message contains a timestamp only for that specific metric, so each metric column would have to be accompanied by its own timestamp column.
Maybe that is not a real issue: if that table holds only the single latest record (rather than the full history), it will always be in the cache, so you would not be reading from disk on every insert.
Also, if you need a very flexible schema, I'd recommend Promscale, which stores one metric per table. You can also use PromQL to fetch and join metrics in the same query. A significant advantage I see here is that you can have different retention policies for each metric, which is valuable because some metrics will probably be more important than others.
Through labels you also gain the flexibility to attach more data to a metric, in case you need to enrich some metrics with more information.
Remote write allows you to send the data, and the hypertables are created on the fly for you.
So, I've solved my problem by creating a _log table and adding a trigger to my main table which, on every insert, updates the _log table with the latest data.
Now I have a sensors table, which contains all sensor readings from all devices, and a sensors_log table, which contains only the latest sensor readings for each device.
Basically, the approach is described in https://www.timescale.com/blog/select-the-most-recent-record-of-many-items-with-postgresql/ as Option 5.
It seems to be working quite well for the moment, but I will dig into other methods in the future and may update this answer if I find a more efficient way to solve this.
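For anyone curious, a rough sketch of that trigger approach (Option 5 in the linked post), using the column names from the question; the upsert only overwrites when the incoming row is newer, which covers the out-of-order arrivals mentioned above:
create table sensors_log (
    identifier  text not null,
    key         text not null,
    value_num   double precision,
    value_text  text,
    "timestamp" timestamptz not null,
    primary key (identifier, key)
);

create or replace function update_sensors_log() returns trigger as $$
begin
    insert into sensors_log (identifier, key, value_num, value_text, "timestamp")
    values (new.identifier, new.key, new.value_num, new.value_text, new."timestamp")
    on conflict (identifier, key) do update
        set value_num   = excluded.value_num,
            value_text  = excluded.value_text,
            "timestamp" = excluded."timestamp"
        where sensors_log."timestamp" < excluded."timestamp";  -- keep only newer data
    return null;  -- after-row trigger, the return value is ignored
end;
$$ language plpgsql;

create trigger sensors_latest
    after insert on sensors
    for each row execute function update_sensors_log();
Reading the pivoted "latest values" result from sensors_log is then cheap, for example:
select identifier as device,
       max(value_num)  filter (where key like 'temperature%')            as temperature,
       max(value_text) filter (where key = 'current_program_identifier') as current_program_identifier
from sensors_log
group by identifier;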

Speed up ETL transformation - Pentaho Kettle

For a project, I have to deal with time series data from many sensors.
I have an industrial machine that produces some artifacts. For each work (max 20 minutes long), sensors record oil pressure and temperature, plus some other vibrational data (at very high frequencies). All these time series are recorded in .csv files, one for each sensor and each work. Each file is named:
yyyy_mm_dd_hh_mm_ss_sensorname.csv
and contains just a sequence of real numbers.
I have to store this kind of data somehow. I am benchmarking many solutions, relational and not, such as MySQL, Cassandra, Mongo, etc.
In particular, for Cassandra and Mongo, I am using Pentaho Data Integration as ETL tool.
I have designed a common schema for both DBs (a single column family/collection):
---------------------------------------
id | value | timestamp | sensor | order
---------------------------------------
The problem is that I am forced to extract the timestamp and sensor information from the filenames, and I have to apply many transformations to get the desired formats.
This slows my whole job down: uploading a single work (with just a single high-frequency metric, for a total of roughly 3M rows) takes 3 minutes for MongoDB and 8 minutes for Cassandra.
I am running both DBs on a single node (for now), with 16 GB RAM and a 15-core CPU.
I am sure I am doing the transformation wrong, so the question is: how can I speed things up?
Here is my KTR file: https://imgur.com/a/UZu4kYv (not enough rep to post images)
Unfortunately, you cannot use the filename from the Additional output fields tab, because that field is populated in parallel and there is a chance it is not yet known when you use it in computations.
However, in your case you can put the filename in a field, for example with a Data Grid step, and use it to compute the timestamp and sensor. In parallel, you apply the needed transforms to id, value and order, and when finished you put the streams back together. I added a Unique rows step on the common flow, just in case the input is buggy and has more than one row per timestamp and sensor.

Optimal Riak storage strategy

I'm planning to use Riak to store some sensor data, but the sensors belong to different users. My plan is to make a structure like this:
Bucket = user id
key = time, new key each minute (or two minutes maybe)
When I say a new key each minute: the readings are not always continuous and are not real time; they are uploaded later. They are recorded at certain periods of the day. The metering frequency is quite high, 250 samples a second. If I make a new key for each measurement, I will get an explosion of keys very fast, and I don't think that will be good for performance. Besides that, I do not really need to know the precise value at each given moment; I will use them more sequentially over a period (values from minute N to minute M).
So I'm thinking of "grouping" the results for each minute, and storing them like that as some JSON.
Does this strategy look feasible?
Also, I'm thinking about using LevelDB as the storage engine, just to be on the safe side as far as RAM usage goes.
A lower key count seems better to me than a key for each event. How would you use this data later?
If the data is intended for further analysis, LevelDB and secondary indexes allow you to pick out data for a certain period (if your keys are somehow ordered, by datetime for instance) in a MapReduce job (with additional effort this could be done in the background).
Also, LevelDB does not store all keys in memory, which is good for a continuously growing dataset if you plan to store all the data forever.
If your application depends on predictable latency and needs a fixed amount of data per query, it is better to group the data the way the application wants it (for example, all keys for a 10-minute period in one object).
One more concern is total object size: as the Riak docs say, it is better not to exceed 10 MB for a single object.
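For illustration, a single per-minute object might look something like this (field names invented; at 250 samples a second a full minute holds 15,000 values, which serialized as JSON is on the order of 100 KB, comfortably below the ~10 MB guideline; samples truncated here for brevity):
{
  "user_id": "user-42",
  "start": "2014-05-01T12:03:00Z",
  "sample_rate_hz": 250,
  "samples": [12.4, 12.5, 12.5, 12.6]
}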

Best way to access averaged static data in a Database (Hibernate, Postgres)

Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable, but I am looking for input as to which one would scale and perform best.
Because we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal would be to have a user interface similar to the Google Visualization API that powers Google Finance, with regard to time scaling: the longer the time period, the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for, say, half a year. I imagine the interface in this scenario would scale the number of rows to average based on the user's request, i.e. if the user asks for a month of data the interface would request an average of every 86,400 rows, which would return ~30 data points, whereas if the user asks for a day of data it would average every 2,880 rows, which also returns ~30 data points but at finer granularity.
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
Option A won't tend to scale very well once you have large quantities of data to pass over; Option B will probably tend to start relatively slow compared to A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement this one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do instead is periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views that show summary data by using the summary tables which have been created, combined with the detail table where summary data doesn't exist yet.
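A rough sketch of what such a crontab-driven summary table could look like (data, time_stamp and the measurement columns are taken from the SELECT shown in the question; data_1min is an assumed name):
create table data_1min (
    bucket      timestamptz primary key,
    pressure    double precision,
    temperature double precision,
    speed       double precision
);

-- run periodically, e.g. from cron; summarises only complete, not-yet-summarised minutes
insert into data_1min (bucket, pressure, temperature, speed)
select date_trunc('minute', time_stamp), avg(pressure), avg(temperature), avg(speed)
from data
where time_stamp >= coalesce((select max(bucket) + interval '1 minute' from data_1min), '-infinity')
  and time_stamp <  date_trunc('minute', now())
group by 1;
The same pattern repeated at coarser intervals (hourly, daily) gives the plotting code a choice of resolution, which is essentially Option C maintained outside the insert path.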
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html

Architecture and pattern for large scale, time series based, aggregation operation

I will try to describe my challenge and operation:
I need to calculate stock price indices over a historical period. For example, I will take 100 stocks and calculate their aggregated average price each second (or even more often) for the last year.
I need to create many different indices like this, where the stocks are picked dynamically out of ~30,000 different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as I can.
For that reason, I think a traditional RDBMS is too slow, so I am looking for a sophisticated and original solution.
Here is something I had in mind, using a NoSQL or column-oriented approach:
Distribute all stocks into some kind of key-value pairs of time:price with matching time rows across all of them. Then use some sort of map-reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion of a completely different design pattern. My guidelines for the solution are price (I would like to use open source), the ability to handle huge amounts of data and, again, fast lookup (I don't care about inserts, since the data is only written once and never changes).
Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to get this down to minutes or so.
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series many consecutive prices would be the same, as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the several consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the metadata for each time series.
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the IO throughput of the HDs.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
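As a rough illustration of what that looks like (the column and file names here are invented, and the exact switches should be checked against the LogParser documentation):
logparser -i:CSV -o:CSV "SELECT second, AVG(price) AS index_price FROM prices_*.csv GROUP BY second ORDER BY second"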
More info about logparser:
Wikipedia
Coding Horror
What you really need is a relational database with built-in time series functionality. IBM released one very recently: Informix 11.7 (note it must be 11.7 to get this feature). Even better news is that for what you are doing, the free version, Informix Innovator-C, will be more than adequate.
http://www.freeinformix.com/time-series-presentation-technical.html
