First of all, very nice that SiteWise now allows ingestion of data that is up to 7 days old: https://aws.amazon.com/about-aws/whats-new/2020/11/aws-iot-sitewise-now-allows-ingestion-of-data-up-to-7-days-in-past/
And I can confirm that, as long as the late data is less than 7 days old, SiteWise will accept it. On the SiteWise Monitor dashboard, I can see that the late sensor measurements, as well as their transforms, always show up.
However, AWS IoT SiteWise seems inconsistent in calculating (or recalculating) the related metrics when late data arrives. In my experiments, the metrics are sometimes generated and sometimes not, as shown in this dashboard:
Any insight? Thank you.
FYI:
Sometimes the metrics do get generated.
Before ingesting the late data, notice that there is no data between 13:37:09 and 14:00:00.
After ingesting the late data, we can see that SiteWise accepts it and recognises its lateness, and the metrics are generated.
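For reference, re-ingesting a late value with boto3 looks roughly like the sketch below; the asset and property IDs are placeholders, and the timestamp is set 6 hours in the past (well within the 7-day window):

import time
import boto3

# Hypothetical late-data ingestion via BatchPutAssetPropertyValue.
sitewise = boto3.client('iotsitewise')
late_ts = int(time.time()) - 6 * 3600  # 6 hours ago

resp = sitewise.batch_put_asset_property_value(
    entries=[{
        'entryId': 'late-sample-1',
        'assetId': 'YOUR-ASSET-ID',          # placeholder
        'propertyId': 'YOUR-PROPERTY-ID',    # placeholder
        'propertyValues': [{
            'value': {'doubleValue': 42.0},
            'timestamp': {'timeInSeconds': late_ts, 'offsetInNanos': 0},
            'quality': 'GOOD',
        }],
    }],
)
print(resp['errorEntries'])  # an empty list means the late value was accepted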
I am streaming data into QuestDB using the ILP protocol with one of their official clients. I would expect to see the data available immediately after sending, but that's not the case.
If I go to the web interface, the table has been created, but if I run SELECT count() FROM sensors or SELECT * FROM sensors I am not getting any results.
The logs are not showing any errors either.
Thanks
Update: if I check after a few minutes, the data is there, but it always takes at least 5 minutes before I can see it.
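For illustration, a minimal version of this flow looks something like the sketch below (shown with the official Python client, which is an assumption; the table name and fields are made up, and the exact client API varies by version):

import datetime
from questdb.ingress import Sender

# Minimal ILP write with the 1.x-era Python client (TCP port 9009);
# table name and fields are made up for illustration.
with Sender('localhost', 9009) as sender:
    sender.row(
        'sensors',
        symbols={'id': 'sensor-1'},
        columns={'temperature': 22.5},
        at=datetime.datetime.now(tz=datetime.timezone.utc))
    sender.flush()

# Immediately afterwards, SELECT count() FROM sensors returns nothing
# until a few minutes have passed.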
This used to be one of the most frequently asked questions by QuestDB's new users. Before QuestDB version 6.6.1 (released in November 2022), QuestDB would use a mechanism called "CommitLag" to trade off ingestion performance and readiness of fresh data in your queries.
This was designed specifically for data arriving out of order (relative to the designated timestamp), but in many cases it could also have side effects when data was ingested in order. CommitLag defaulted to 5 minutes, but it could be changed (down to the millisecond) for individual tables.
The reason this was needed for out-of-order data (or O3 in QuestDB terms) is that QuestDB stores data physically sorted by increasing designated timestamp, so data arriving late means the engine needs to rewrite the partitions where those rows belong.
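For completeness, on those older versions the lag could also be tuned per table via SQL; something like the sketch below, going from memory of the old syntax, with 'sensors' as a placeholder table name:

import requests

# Pre-6.6.1 only, and from memory of the old docs: lower CommitLag on one
# table so out-of-order rows become visible sooner, at some ingestion cost.
requests.get(
    'http://localhost:9000/exec',
    params={'query': 'ALTER TABLE sensors SET PARAM commitLag = 10s'})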
Starting from version 6.6.1, QuestDB changed the way it persists data to the table files, introducing "Dynamic Commits". This new mechanism automatically decides how often to physically write to the table files. As long as data is arriving in order, writes are immediate and your data will be available in your SELECT statements straight away.
If data starts coming in out of order (for example, due to network lag at the origin, or because the business logic allows older data to be sent), QuestDB will figure out how late the data is arriving and adjust the write frequency accordingly. This heuristic is recalculated every second, so it responds to changes in the ingestion pattern very quickly.
The new functionality is configuration-free and works out-of-the-box when you are using QuestDB 6.6.1 or above, so my advice would be to upgrade to the latest version.
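A quick way to check after upgrading is to write a row and query it immediately, for example over the REST API on the default port 9000 ('sensors' is again a placeholder table name):

import requests

# On 6.6.1+ with in-order ingestion, a freshly written row should be
# visible to this query right away.
resp = requests.get(
    'http://localhost:9000/exec',
    params={'query': 'SELECT count() FROM sensors'})
print(resp.json())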
I’m working on a new project that will involve storing events from various systems at random intervals. Events such as deployment completions, production bugs, continuous integration events etc. This is somewhat time series data, although the volume should be relatively low, a few thousand a day etc.
I had been thinking maybe InfluxDB was a good option here, as the app will revolve mostly around plotting timelines and durations etc., although there will need to be a small amount of data stored with these datapoints: information like error messages, descriptions, URLs and maybe Twitter-sized strings. I would say that there is a good chance most events will not actually have a numerical value but will just act as a point-in-time reference for an event.
As an example, I would expect a lot of events to look like this (in Influx line protocol format):
events,stream=engineering,application=circleci,category=error message="Web deployment failure - Project X failed at step 5",url="https://somelink.com",value=0
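For reference, the same event written with the official InfluxDB 2.x Python client would look roughly like this (URL, token, org and bucket are placeholders):

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details, for illustration only.
client = InfluxDBClient(url='http://localhost:8086', token='my-token', org='my-org')
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (Point('events')
         .tag('stream', 'engineering')
         .tag('application', 'circleci')
         .tag('category', 'error')
         .field('message', 'Web deployment failure - Project X failed at step 5')
         .field('url', 'https://somelink.com')
         .field('value', 0.0))

write_api.write(bucket='events', record=point)
client.close()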
My question here is, am I approaching this wrong? Is InfluxDB the wrong choice for this type of data? I have read a few horror stories about data corruption and I'm a bit nervous there, but I'm not entirely sure of any better (but also affordable) options.
How would you go about storing this type of data in a way that can be accessed at high frequency, for a purpose such as a realtime dashboard?
I'm resisting the urge to just roll out a Postgres database.
Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2 KB/second), and they will send them off to a remote server once per minute, resulting in roughly 18 million records and 200 MB of data per day per sensor. Not sure how many sensors we will need, but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure of the time period; guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable.
Was thinking about using Elasticsearch (then maybe use X-Pack or Sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS, so we have access to tools like Kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing and will be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services.
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
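To make that concrete, a minute's batch of readings can be pushed to Kinesis with a single PutRecords call; a rough boto3 sketch (stream name and record shape are assumptions):

import json
import boto3

kinesis = boto3.client('kinesis')

# One minute of readings from one sensor (made-up shape for illustration).
readings = [{'sensor_id': 'sensor-1', 'ts': 1700000000 + i, 'value': 21.5}
            for i in range(200)]

resp = kinesis.put_records(
    StreamName='sensor-ingest',
    Records=[{
        'Data': json.dumps(r).encode('utf-8'),
        'PartitionKey': r['sensor_id'],  # keeps a sensor's records ordered within a shard
    } for r in readings],
)
print('Failed records:', resp['FailedRecordCount'])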
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or Elasticsearch, depending on your team's preferences. Some would say that this is time-series data so you should use a time-series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1,000 writes per second. I would not recommend an RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safekeeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data, and the rule is "everything breaks, all the time". If your analytical stack crashes, you probably don't want to lose the data entirely.
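The lifecycle rule is a one-off piece of configuration; for example, with boto3 (bucket name, prefix and retention period are assumptions):

import boto3

s3 = boto3.client('s3')

# Delete raw messages older than 90 days from a placeholder archive bucket.
s3.put_bucket_lifecycle_configuration(
    Bucket='sensor-raw-archive',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-raw-events',
            'Filter': {'Prefix': 'raw/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 90},
        }],
    },
)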
As for alerting, it depends on 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.
I'm currently reading up on Kafka, trying to find a way to separate our time-series database storage engine from our application by making it more of a generic stand-alone microservice rather than an integral part of our application, as it currently is.
We currently store our sample data (with timestamps) in our in-house-developed time-series database, and our application enables us to do a range of analyses dedicated to our industry.
Kafka seems ideal for continuously streaming data into it and out of it (which is what we need as well), but querying a data source over a set period of time in the past, to get a result stream that has a beginning and an end, does not seem to be part of Kafka's scope.
That is, I can't find a proper way to create that in Kafka yet.
Having read this: https://www.confluent.io/blog/hello-world-kafka-connect-kafka-streams/ I think I'm very close to what I want, but I can't yet see how Kafka handles different queries for different recorded sample sets over different periods of time.
We have a lot of sample data sets covering a long period of time (3+ years of tens of thousands of sample sets, sampled every 5 seconds to every minute), and since our storage is limited, I hope Kafka offers a more 'transient' way to get our data every time we want to run an analysis, rather than storing the result data of every request for 2 days (the default, if I understand it correctly).
I'm just so close, but I can't get my head around how to do this properly in Kafka.
Thank you very very much for your time.
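For what it's worth, one way to read a bounded time range back out of a topic is to map timestamps to offsets with offsets_for_times and consume between them; a rough kafka-python sketch (topic, partition and time range are placeholders):

from datetime import datetime, timezone
from kafka import KafkaConsumer, TopicPartition

# Placeholders: topic 'samples', partition 0, a one-day window in the past.
tp = TopicPartition('samples', 0)
start_ms = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
end_ms = int(datetime(2023, 1, 2, tzinfo=timezone.utc).timestamp() * 1000)

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         enable_auto_commit=False,
                         consumer_timeout_ms=10000)
consumer.assign([tp])

# Translate the time window into offsets (None means no data at/after that time).
start = consumer.offsets_for_times({tp: start_ms})[tp]
end = consumer.offsets_for_times({tp: end_ms})[tp]

if start is not None:
    stop_at = end.offset if end is not None else consumer.end_offsets([tp])[tp]
    consumer.seek(tp, start.offset)
    for msg in consumer:
        if msg.offset >= stop_at:
            break
        print(msg.timestamp, msg.value)  # the bounded "result stream"

consumer.close()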
I think I'm going to use crontab to run a bunch of scripts that will:
close all expired posts
accept uncontested disputes
add interest charges
email out invoices
send "about to expire" notifications
I want the expired stuff to be removed pretty shortly after the event occurs, so I'm thinking about writing one script that will run and check all these various dates every 5 to 15 minutes. Can I expect any trouble doing this? At what number of "posts" might I start seeing performance issues? Are we talking thousands or millions?
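For illustration, the single-script approach could be as simple as the sketch below, run from cron every 10 minutes; the table and column names are invented, and SQLite stands in for whatever database the app actually uses:

#!/usr/bin/env python3
# Example crontab entry:  */10 * * * * /usr/bin/python3 /opt/app/housekeeping.py
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect('/opt/app/app.db')  # placeholder path
now = datetime.now(timezone.utc).isoformat()
week_ago = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()

with conn:
    # Close all expired posts.
    conn.execute("UPDATE posts SET status = 'closed' "
                 "WHERE status = 'open' AND expires_at <= ?", (now,))
    # Accept disputes nobody contested within 7 days.
    conn.execute("UPDATE disputes SET status = 'accepted' "
                 "WHERE status = 'open' AND created_at <= ?", (week_ago,))
    # Add interest to overdue invoices (rate is a made-up placeholder).
    conn.execute("UPDATE invoices SET balance = balance * 1.01 "
                 "WHERE due_at <= ? AND balance > 0", (now,))
    # Emailing invoices and "about to expire" notifications would follow the same pattern.

conn.close()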
If you're doing anything with the "posts" and there's only 15k per year... I wouldn't worry about any performance issues with them on reasonably modern hardware for quite some time. Unless you're doing something pretty crazy with invoices, you shouldn't have issues with 250k per year for a good number of years. This is shooting from the hip, though; it all depends on the hardware and what exactly you're doing.