I am streaming data into QuestDB using the ILP protocol with one of their official clients. I would expect to see the data available immediately after sending, but that's not the case.
If I go to the web interface, the table has been created, but if I run SELECT count() FROM sensors or SELECT * FROM sensors I am not getting any results.
The logs are not showing any errors either.
Thanks
update: If I check after a few minutes, the data is in there, but it always takes at least 5 minutes until I can see it
This used to be one of the most frequently asked questions by QuestDB's new users. Before QuestDB version 6.6.1 (released in November 2022), QuestDB would use a mechanism called "CommitLag" to trade off ingestion performance and readiness of fresh data in your queries.
This was designed specifically for data arriving out of order (relative to the designated timestamp), but in many cases it would have side effects also when data was ingested in order. CommitLag defaulted to 5 minutes, but it could be changed (down to the millisecond) for individual tables.
The reason why this was needed for out-of-order data (or o3 in QuestDB terms), is because QuestDB stores data physically sorted by increasing designated timestamp, so data arriving late means the engine needs to rewrite the partitions where those data belong.
Starting from version 6.6.1, QuestDB changed the way it persist data to the table files, introducing "Dynamic Commits". This new mechanism automatically decides how often to physically write to the table files. As long as data is arriving in order, writes are immediate and your data will be able in your SELECT statements straight away.
If data starts coming out of order (for example, due to network lag in the origin, or because the business logic allows for older data being sent), QuestDB will figure out how late the data is arriving and will adjust the write frequency in consequence. This heuristic is calculated once every second, so responding to changes in the ingestion pattern is very fast.
The new functionality is configuration-free and works out-of-the-box when you are using QuestDB 6.6.1 or above, so my advice would be to upgrade to the latest version.
Related
I have previously done some very basic real-time applications using the help of sockets and have been reading more about it just for curiosity. One very interesting article I read was about Operational Transformation and I learned several new things. After reading it, I kept thinking of when or how this data is really saved to the database if I were to keep it. I have two assumptions/theories about what might be going on, but I'm not sure if they are correct and/or the best solutions to solve this issue. They are as follow:
(For this example lets assume it's a real-time collaborative whiteboard:)
For every edit that happens (ex. drawing a line), the socket will send a message to everyone collaborating. But at the same time, I will store the data in my database. The problem I see with this solution is the amount of time I would need to access the database. For every line a user draws, I would be required to access the database to store it.
Use polling. For this theory, I think of saving every data in temporal storage at the server, and then after 'x' amount of time, it will get all the data from the temporal storage and save them in the database. The issue for this theory is the possibility of a failure in the temporal storage (ex. electrical failure). If the temporal storage loses its data before it is saved in the database, then I would never be able to recover them again.
How do similar real-time collaborative applications like Google Doc, Slides, etc stores the data in their databases? Are they following one of the theories I mentioned or do they have a completely different way to store the data?
They prolly rely on logs of changes + latest document version + periodic snapshot (if they allow time traveling the document history).
It is similar to how most database's transaction system work. After validation the change is legit, the database writes the change in very fast data-structure on disk aka. the log that will only append the changed values. This log is replicated in-memory with a dedicated data-structure to speed up reads.
When a read comes in, the database will check the in-memory data-structure and merge the change with what is stored in the cache or on the disk.
Periodically, the changes that are present in memory and in the log, are merged with the data-structure on-disk.
So to summarize, in your case:
When an Operational Transformation comes to the server, two things happens:
It is stored in the database as is, to avoid any loss (equivalent of the log)
It updates an in-memory datastructure to be able to replay the change quickly in case an user request the latest version (equivalent of the memory datastructure)
When an user request the latest document, the server check the in-memory datastructre and replay the changes against the last stored consolidated document that might be lagging behind because of the following point
Periodically, the log is applied to the "last stored consolidated document" to reduce the amount of OT that must be replayed to produce the latest document.
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.
Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2kb/second) and will send them off to a remote server once per minute resulting in about ~18 mil records and 200MB of data per day per sensor. Not sure how many sensors we will need but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure the time period guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable .
Was thinking about using elastic search (then maybe use x-pack or sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS so we have access to tools like kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing, they'll be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services?
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or ElasticSearch, depending on your team's preferences. Some would say that this is time series data so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using a RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.
We are currently developing a tool to count wildlife passing through defined areas. The gadget that automatically counts the animals will be sending data (weather, # of animals passing etc.) in a 5 minute interval via HTTP to our API. There will be hundreds of these measurement stations and it should be scalable.
Now the question arised whether to use a filesystem or a RDBMS to save this data.
Pro DB
save exact time and date when the entry was created
directly related to area# via indexed key
Pro Filesystem
Collecting data is not as resource intensive since for every API call only 1 line will be appended to the file
Properties of the data:
only related to 1 DB entry (the area #)
the measurement stations are in remote areas we have to account for outages
What will be done with the data
Give a overview over timeperiods per area#
act as a early warning system if the # of animals is surprisingly low/high
Probably by using a cronjob and comparing to simliar data
We are thinking to chose a RDBMS to save the data but I am worried that after millions of entries the DB will slow down and eventually stop working. This question was asked here where 360M entries is not really considered "big data" so I'm not quite sure about my task either.
Should we chose these recommended techniques (MongoDB ...) or can this task be handled by PostgreSQL or MySQL?
I have created such a system for marine boyes. The devices sends data over GPRS / iridum using HTTP or raw tcp sockets (to minimize bandwidth).
The recieving server stores the data in a db-table, with data provided and timestamp.
The data is then parsed and records are created in another table.
The devices can also request UTC-time from the server, thus not needing a RTC.
Before any storage is made to the "raw" table, a row is appended to a text-file. This is puerely for logging or being able to recover from database downtime.
As for database type, I'd recommend regular RDBMS. Define markers for your data. We use 4-digit codes that gives headroom for 10000 types of measure values.
On the system I am developing, we have a PostgreSQL database that contains set up data which when updated must be transfered to handsets when and while those handsets are "docked". While the handsets are docked, our "service software" can talk to them, but not while they are undocked (they are not wireless).
At the moment, the service software that the handsets talk to loads the set up data from the database on startup and caches it. Thereafter it queries the latest timestamp of the setup data every 5 seconds and reloads parts of the set up if the timestamp queried is higher than the latest cached timestamp.
However, I find this method haphazard. It may be possible to miss an update for instance if an update transaction takes longer than a second, or at least if the period between submitting the transaction and completion of the transaction takes it over a 1-second boundary (the now() function is resolved at the beginning of the transaction by PostgreSQL). The only way I can think of round that is to do a table level lock before querying the latest timestamp. I'm not a fan of table locks but it is the only way I can think of to get round the problem.
Another problem with this approach is, I have to query for new data based on the update timetamp being >= the last latest timestamp, as opposed to just > the last latest timestamp. Why? Because a record may of been committed within the same second, just after my query - so I'd miss the record.
Another approach I've thought of is, storing "last synchronised date-time" data in the database for each logical item of data that must be stored on the handsets. I would do this on a per handset basis. I can then periodically query for all data not currently synchronised on a particular handset, and then mark it as synchronised once the handset is up to date (I have worked out a mechanism for this to be failsafe which takes into account the data being updated during synchronisation).
My only problem with this approach is that it means the database is storing non-business centric data - as in it is storing data to make the system work. I'm not convinced data about what handsets are in sync is "business" data. To me it is more the responsibility of the handset service software / handset software to know how to keep itself up to date, though it is tempting as it describes perfectly what data is and is not on each handset and allows queries to only return the data needed.
The first approach however at least only uses data appropriate to the business - i.e. the timestamp of when the data was last changed.
The ideal way would be to use some kind of notification system, but unfortunately postgres only has a basic NOTIFY / EVENT system and that doesn't seem to work over ODBC (which I foolishly decided to use and do not have time to change just now). If I was using Oracle I could use Streams..
Thoughts?
Note: The database is purely relational - I am not interested in any "object oriented" approaches to this problem or any framework based solutions.
Thanks.
First of all, if you are using PostgreSQL version at least 7.2, the now function returns values with microsecond precision rather that second precision; although the value is ultimately derived from the operating system and will be accurate only up to a few hundredths of a second.
The method that you describe appears to be safe against permanently missing any updates. Just make sure that you reload data every time unless the timestamps prove that you have reloaded long enough after the last update. Alternately, you could update a timestamp upon data update in a separate transaction; in that case, ever seeing such a timestamp is a proof that all updates had finished before the timestamp value.
Another approach I've thought of is, storing "last synchronised
date-time" data in the database for each logical item of data that
must be stored on the handsets. I would do this on a per handset
basis. I can then periodically query for all data not currently
synchronised on a particular handset, and then mark it as synchronised
once the handset is up to date
I can not recommend this for the following reasons:
As synchronization is a state of a handset and not a state of the data, this information should better be stored on the handset.
The database should be scalable to many handsets and it ideally should not have to keep track of them.
If a handset can change its identity, or be wiped or restored (reimaged) to a previous state without changing its identity, the database will get out of sync with the real state of the headset and no mechanism will ensure proper synchronization.
While NOTIFY is certainly preferable to constant polling, it is a problem orthogonal to where you store the synchronization progress. You still need to have a polling capability to be able to deal with a freshly connected devide, and notifications would be just a bandwidth/latency optimization.
I am designing an application and I have two ideas in mind (below). I have a process that collects data appx. 30 KB and this data will be collected every 5 minutes and needs to be updated on client (web side-- 100 users at any given time). Information collected does not need to be stored for future usage.
Options:
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
Collect data and put it into Topic or Queue. Now multiple clients (consumers) can go to Queue and obtain data.
I am looking for option 2 as better solution because it is faster (no DB calls) and no redundancy of storage.
Can anyone suggest which would be ideal solution and why ?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right.
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web-clients are getting updates real-time without some kind of push/long-pull or whatever.
Seems to me that you need to use a queue and publisher/subscriber pattern.
This is an article about RabitMQ and Publish/Subscribe pattern.
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
You can program your application to be event oriented. For ie, raise domain events and publish your message for your subscribers.
When you use a queue, the subscriber will dequeue the message addressed to him and, ofc, obeying the order (FIFO). In addition, there will be a guarantee of delivery, different from a database where the record can be delete, and yet not every 'subscriber' have gotten the message.
The pitfalls of using the database to accomplish this is:
Creation of indexes makes querying faster, but inserts slower;
Will have to control the delivery guarantee for every subscriber;
You'll need TTL (Time to Live) strategy for the records purge (considering delivery guarantee);