Speed up inserts into QuestDB using the InfluxDB Line Protocol - database

I'm doing a project related to storing live trading data.
Currently, we are using QuestDB and inserting into it using the InfluxDB Line Protocol over TCP.
Here is the link: https://questdb.io/docs/reference/api/ilp/overview.
However, about 5% of packets are dropped, and I want that to be less than 1%.
Packets are most likely to be dropped when more than 2 million packets are received per minute.
Is there any way to speed up the inserts, or another time series database I could try?
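One thing that usually helps is batching many ILP lines per TCP send over a single long-lived connection instead of writing one line at a time. A rough sketch, assuming a hypothetical trades table with made-up columns:

import socket
import time

HOST, PORT = "localhost", 9009      # 9009 is QuestDB's default ILP/TCP port
BATCH_SIZE = 1000                   # lines per socket send (arbitrary)

def ilp_line(symbol, price, size, ts_ns):
    # ILP format: table,symbol_columns field_columns timestamp\n
    return f"trades,symbol={symbol} price={price},size={size} {ts_ns}\n"

def send_rows(rows):
    buf = []
    with socket.create_connection((HOST, PORT)) as sock:
        for symbol, price, size in rows:
            buf.append(ilp_line(symbol, price, size, time.time_ns()))
            if len(buf) >= BATCH_SIZE:
                sock.sendall("".join(buf).encode())
                buf.clear()
        if buf:                     # flush the last partial batch
            sock.sendall("".join(buf).encode())

The batch size here is arbitrary; the point is to amortise the per-send overhead rather than paying it for every row.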

Related

Is QuestDB suitable for querying interval data, such as energy meters

I want to design a database to store IoT data from several utility meters (electricity, gas, water) using the MQTT protocol.
Is QuestDB suitable for this type of data where it would store meter readings (it is the difference between readings that is mainly of interest as opposed to the readings themselves)? More specifically, I am asking if the database would allow me to quickly and easily query the following? If so, some example queries would be helpful.
calculate energy consumption of a specific/all meters in a given date period (essentially taking the difference between readings between two dates for example)
calculate the rate of energy consumption with time over a specific period for trending purposes
Also, could it cope with situations where a faulty meter is replaced and therefore the reading is reset to 0 for example, but consumption queries should not be affected?
My initial ideas:
Create a table with Timestamp, MeterID, UtilityType, MeterReading, ConsumptionSinceLastReading fields
When entering a new meter reading record, it would calculate the consumption since the last reading and store it in the relevant table field. This doesn't seem like quite the right approach, though; perhaps a time series DB like QuestDB has a built-in solution for this kind of problem?
I think you have approached the problem the right way.
By storing both the actual reading and the difference to the previous reading you have everything you need.
You could calculate the energy consumption and the rate with SQL statements.
Consumption for a period:
select sum(ConsumptionSinceLastReading) from Readings
where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-19T00:00:00.000000Z';
Rate (consumption/sec) within a period:
select (max(MeterReading)-min(MeterReading)) / (cast(max(ts)-min(ts) as long)/1000000.0) from (
  -- first reading in the period
  select MeterReading, ts from Readings
  where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-20T00:00:00.000000Z' limit 1
  union
  -- last reading in the period
  select MeterReading, ts from Readings
  where ts>'2022-01-15T00:00:00.000000Z' and ts<'2022-01-20T00:00:00.000000Z' limit -1
);
There is no specific built-in support for your use case in QuestDB, but it would do a good job here. The above selects are cheap because the LIMIT keyword means only the first and last rows of the interval need to be read.
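For the meter-replacement case mentioned in the question, the delta can be computed at ingest time, before the row is written; a rough sketch, where treating a lower-than-previous reading as a reset is an assumption:

last_reading = {}   # (MeterID, UtilityType) -> previous MeterReading

def consumption_since_last(meter_id, utility_type, reading):
    # Returns the ConsumptionSinceLastReading value to store with the row.
    key = (meter_id, utility_type)
    prev = last_reading.get(key)
    last_reading[key] = reading
    if prev is None:
        return 0.0            # first reading for this meter
    if reading < prev:
        return reading        # assumed meter replacement: counter restarted at 0
    return reading - prev

Because the consumption queries sum ConsumptionSinceLastReading rather than differencing MeterReading, a reset handled this way does not distort the period totals.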

Speed up ETL transformation - Pentaho Kettle

For a project, I have to deal with many sensors Time Series data.
I have an industrial machine that produces some artifacts. For each work (max 20 mins in time) sensors record oil pressure and temperature, and some other vibrational data (very high frequencies). All these Time Series are recorded in a .csv file, one for each sensor and for each work. Each file is named:
yyyy_mm_dd_hh_mm_ss_sensorname.csv
and contains just a sequence of real numbers.
I have to store this kind of data somehow. I am benchmarking many solutions, relational and not, such as MySQL, Cassandra, MongoDB, etc.
In particular, for Cassandra and Mongo, I am using Pentaho Data Integration as ETL tool.
I have designed a common scheme for both DBs (unique column family/collection):
---------------------------------------
id | value | timestamp | sensor | order
---------------------------------------
The problem is that I am forced to extract the timestamp and sensor information from the filenames, and I have to apply many transformations to get the desired formats.
This slows my whole job down: uploading a single work (with just a single high-frequency metric, about 3M rows in total) takes 3 minutes for MongoDB and 8 minutes for Cassandra.
I am running both DBs on a single node (for now), with 16 GB RAM and a 15-core CPU.
I am sure I am doing the transformation wrong, so the question is: how can I speed things up?
Here is my KTR file: https://imgur.com/a/UZu4kYv (not enough rep to post images)
Unfortunately, you cannot use the filename from the Additional output fields tab, because this field is populated in parallel and there is a chance it is not yet known when you use it in computations.
However, in your case, you can put the filename in a field, for example with a Data grid, and use it to compute the timestamp and sensor. In parallel, you apply the needed transforms to id, value and order. When finished, you put them back together again. I added a Unique rows step on the common flow, just in case the input is buggy and has more than one timestamp or sensor.
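If the transformation is still the bottleneck, the filename parsing itself is cheap to do outside Kettle as a pre-processing step; a rough sketch for the yyyy_mm_dd_hh_mm_ss_sensorname.csv pattern (the per-row id and the order value are placeholders, since the question doesn't say how they are assigned):

import re
from datetime import datetime
from pathlib import Path

NAME_RE = re.compile(r"^(\d{4}_\d{2}_\d{2}_\d{2}_\d{2}_\d{2})_(.+)\.csv$")

def rows_from_file(path, order_id):
    # Derive the work start time and sensor name from the filename,
    # then emit (id, value, timestamp, sensor, order) tuples.
    m = NAME_RE.match(Path(path).name)
    if not m:
        raise ValueError(f"unexpected filename: {path}")
    start = datetime.strptime(m.group(1), "%Y_%m_%d_%H_%M_%S")
    sensor = m.group(2)
    with open(path) as f:
        for i, line in enumerate(f):
            # each line is a single real number; per-sample timestamps would
            # need the sensor's sampling rate, which isn't given, so the
            # file's start time is used here
            yield (f"{Path(path).stem}_{i}", float(line), start, sensor, order_id)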

Frequently Updated Table in Cassandra

I am doing an IoT sensor based project. In this project, each sensor sends data to the server every minute. I am expecting a maximum of 100k sensors in the future.
I am logging the data sent by each sensor in a history table, but I also have a Live Information table in which the latest status of each sensor is updated.
So I want to update the row corresponding to each sensor in the Live table every minute.
Is there any problem with this? I read that frequent update operations are bad in Cassandra.
Is there a better way?
I am already using Redis in my project for storing sessions etc. Should I move this Live table to Redis?
This is what you're looking for: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_memtable_thruput_c.html
How you tune memtable thresholds depends on your data and write load. Increase memtable throughput under either of these conditions:
The write load includes a high volume of updates on a smaller set of data.
A steady stream of continuous writes occurs. This action leads to more efficient compaction.
So increasing commitlog_total_space_in_mb will make Cassandra flush memtables to disk less often. This means most of your updates will happen in memory only and you will have fewer duplicates of data.
In Cassandra there are consistency levels for reads and consistency levels for writes. If you are going to have only one node, this does not apply and there is no problem, but if you are going to use more than one data center or rack, you need to raise the read consistency level to guarantee that what you retrieve is the latest version of the updated row, or use a high consistency level at write time. In my case I'm using ANY for writes and QUORUM for reads. This allows writes to succeed with all nodes except one down, and reads as long as 51% of the nodes are up. This is a trade-off in the CAP theorem. Please take a look at:
http://docs.datastax.com/en/cassandra/latest/cassandra/dml/dmlConfigConsistency.html
https://wiki.apache.org/cassandra/ArchitectureOverview
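For the live table itself, note that an UPDATE and an INSERT on the same primary key take the same write path in Cassandra, so the per-minute refresh is effectively an upsert. A rough sketch with the ANY/QUORUM split described above (keyspace, table and column names are made up):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")            # hypothetical keyspace

write_stmt = SimpleStatement(
    "INSERT INTO live_status (sensor_id, updated_at, status) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ANY,     # cheap writes, as above
)
read_stmt = SimpleStatement(
    "SELECT sensor_id, updated_at, status FROM live_status WHERE sensor_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,  # read the latest version
)

def update_live(sensor_id, ts, status):
    # one row per sensor: each write simply overwrites the previous status
    session.execute(write_stmt, (sensor_id, ts, status))

def read_live(sensor_id):
    return session.execute(read_stmt, (sensor_id,)).one()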

Hadoop on periodically generated files

I would like to use Hadoop to process input files which are generated every n minutes. How should I approach this problem? For example, I have temperature measurements of cities in the USA received every 10 minutes, and I want to compute average temperatures per day, per week and per month.
PS: So far I have considered Apache Flume to get the readings. It would collect data from multiple servers and write it periodically to HDFS, from where I can read and process it.
But how can I avoid working on the same files again and again?
You should consider a Big Data stream processing platform like Storm (which I'm very familiar with; there are others, though), which might be better suited for the kinds of aggregations and metrics you mention.
Either way, however, you're going to implement something which has the entire set of processed data in a form that makes it very easy to apply the delta of just-gathered data to give you your latest metrics. Another output of this merge is a new set of data to which you'll apply the next hour's data. And so on.
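A rough sketch of that "apply the delta" idea, keeping running per-day sums and counts so each pass only touches the newly gathered readings (the in-memory dictionary stands in for whatever store holds the merged state):

from collections import defaultdict

# (city, "YYYY-MM-DD") -> [running sum, running count]
state = defaultdict(lambda: [0.0, 0])

def apply_delta(new_readings):
    # new_readings: iterable of (city, iso_timestamp, temperature);
    # only the just-gathered batch is folded in, old files are never re-read
    for city, ts, temp in new_readings:
        acc = state[(city, ts[:10])]    # day = date prefix of the ISO timestamp
        acc[0] += temp
        acc[1] += 1

def daily_average(city, day):
    total, count = state[(city, day)]
    return total / count if count else None

Weekly and monthly averages can be derived from the same per-day sums and counts, so nothing needs to be recomputed from the raw files.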

Low cost way to host a large table yet keep the performance scalable?

I have a growing table storing time series data, 500M entries now, and 200K new records every day. The total size is around 15GB for now.
My clients are querying the table via a PHP script mostly, and the size of the result set is around 10K records (not very large).
select * from T where timestamp > X and timestamp < Y and additionFilters
And I want this operation to be cheap.
Currently my table is hosted in Postgres 7, on a single box with 16 GB of memory, and I would love to see some good suggestions for hosting this at low cost while still letting me scale up for performance if needed.
The table serves:
1. Query: 90%
2. Insert: 9.9%
3. Update: 0.1% <-- very rare.
PostgreSQL 9.2 supports partitioning and partial indexes. If there are a few hot partitions, and you can put those partitions or their indexes on a solid state disk, you should be able to run rings around your current configuration.
There may or may not be a low cost, scalable option. It depends on what low cost and scalable mean to you.
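A rough sketch of what 9.2-style partitioning could look like, using table inheritance with one child table per month and an index per child (table, column and range names are made up, and it assumes the parent table already exists; declarative PARTITION BY only arrived in later PostgreSQL versions):

import psycopg2

DDL = """
CREATE TABLE t_2024_01 (
    CHECK (ts >= '2024-01-01' AND ts < '2024-02-01')
) INHERITS (t);
CREATE INDEX t_2024_01_ts_idx ON t_2024_01 (ts);
"""

with psycopg2.connect("dbname=metrics") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # with constraint_exclusion enabled, the range predicate lets the
        # planner skip child tables whose CHECK constraint cannot match
        cur.execute(
            "SELECT * FROM t WHERE ts > %s AND ts < %s",
            ("2024-01-10", "2024-01-12"),
        )
        rows = cur.fetchall()

The recent child tables or their indexes can be placed on SSD via tablespaces, which covers the "hot partitions" point above; a partial index (CREATE INDEX ... WHERE some hot predicate) is the matching trick when only a slice of one table is hot.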
