Save measurement data into Database or Filesystem - database

We are currently developing a tool to count wildlife passing through defined areas. The device that automatically counts the animals will send data (weather, number of animals passing, etc.) at 5-minute intervals via HTTP to our API. There will be hundreds of these measurement stations, and the system should be scalable.
Now the question arose whether to use a filesystem or an RDBMS to save this data.
Pro DB
save exact time and date when the entry was created
directly related to area# via indexed key
Pro Filesystem
Collecting data is not as resource intensive since for every API call only 1 line will be appended to the file
Properties of the data:
only related to 1 DB entry (the area #)
the measurement stations are in remote areas, so we have to account for outages
What will be done with the data
Give an overview over time periods per area#
act as an early warning system if the # of animals is surprisingly low/high
Probably by using a cronjob and comparing to similar data
We are thinking of choosing an RDBMS to save the data, but I am worried that after millions of entries the DB will slow down and eventually stop working. This question was asked here, where 360M entries is not really considered "big data", so I'm not quite sure about my task either.
Should we choose these recommended techniques (MongoDB ...) or can this task be handled by PostgreSQL or MySQL?

I have created such a system for marine buoys. The devices send data over GPRS / Iridium using HTTP or raw TCP sockets (to minimize bandwidth).
The receiving server stores the data in a DB table, with the data provided and a timestamp.
The data is then parsed and records are created in another table.
The devices can also request UTC time from the server, thus not needing an RTC.
Before anything is stored in the "raw" table, a row is appended to a text file. This is purely for logging and for being able to recover from database downtime.
As for database type, I'd recommend a regular RDBMS. Define markers for your data. We use 4-digit codes, which gives headroom for 10,000 types of measured values.
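The log-then-store flow described above might look roughly like the following sketch. SQLite only stands in for whatever RDBMS is chosen; the table name raw_messages, the log file layout and the function names are made up for illustration and are not taken from the system described.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) raw table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_messages ("
        " id INTEGER PRIMARY KEY,"
        " station_id TEXT NOT NULL,"
        " received_at TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def store_measurement(conn, log_path, station_id, payload):
    """Append the message to a text log first, then insert it into the raw table."""
    received_at = datetime.now(timezone.utc).isoformat()
    # 1) Append-only log line: still written if the database is down,
    #    so the data can be replayed into the table later.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{received_at}\t{station_id}\t{payload}\n")
    # 2) Raw table row carrying the server-side timestamp.
    with conn:
        conn.execute(
            "INSERT INTO raw_messages (station_id, received_at, payload)"
            " VALUES (?, ?, ?)",
            (station_id, received_at, payload),
        )
    return received_at
```

Parsing the raw rows into a second table can then run as a separate step, so a parsing bug never loses incoming data.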

Why is QuestDB not showing me the data I just ingested?

I am streaming data into QuestDB using the ILP protocol with one of their official clients. I would expect to see the data available immediately after sending, but that's not the case.
If I go to the web interface, the table has been created, but if I run SELECT count() FROM sensors or SELECT * FROM sensors I am not getting any results.
The logs are not showing any errors either.
Thanks
update: If I check after a few minutes, the data is in there, but it always takes at least 5 minutes until I can see it
This used to be one of the most frequently asked questions by QuestDB's new users. Before QuestDB version 6.6.1 (released in November 2022), QuestDB would use a mechanism called "CommitLag" to trade off ingestion performance and readiness of fresh data in your queries.
This was designed specifically for data arriving out of order (relative to the designated timestamp), but in many cases it would have side effects also when data was ingested in order. CommitLag defaulted to 5 minutes, but it could be changed (down to the millisecond) for individual tables.
The reason this was needed for out-of-order data (or o3 in QuestDB terms) is that QuestDB stores data physically sorted by increasing designated timestamp, so data arriving late means the engine needs to rewrite the partitions where those data belong.
Starting from version 6.6.1, QuestDB changed the way it persists data to the table files, introducing "Dynamic Commits". This new mechanism automatically decides how often to physically write to the table files. As long as data is arriving in order, writes are immediate and your data will be available in your SELECT statements straight away.
If data starts coming out of order (for example, due to network lag in the origin, or because the business logic allows for older data being sent), QuestDB will figure out how late the data is arriving and will adjust the write frequency accordingly. This heuristic is calculated once every second, so responding to changes in the ingestion pattern is very fast.
The new functionality is configuration-free and works out-of-the-box when you are using QuestDB 6.6.1 or above, so my advice would be to upgrade to the latest version.
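To make "out of order relative to the designated timestamp" concrete, here is a small illustration. It is not QuestDB's actual heuristic, only a sketch of the quantity such a heuristic has to estimate: how far behind the newest timestamp seen so far a row can arrive. The function name max_lateness is made up.

```python
from datetime import datetime, timedelta


def max_lateness(timestamps):
    """Return the largest gap by which a row arrived behind the newest row seen so far.

    A result of zero means every row arrived in timestamp order.
    """
    newest = None
    worst = timedelta(0)
    for ts in timestamps:
        if newest is None or ts >= newest:
            newest = ts  # in order: this row becomes the new high-water mark
        else:
            worst = max(worst, newest - ts)  # late row: record how late it was
    return worst
```

If your client sends rows whose timestamps only ever increase, this value stays at zero, which is the case where writes become visible immediately.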

How do real-time collaborative applications save their data?

I have previously built some very basic real-time applications with the help of sockets and have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things. After reading it, I kept wondering when and how this data is actually saved to the database if I were to keep it. I have two assumptions/theories about what might be going on, but I'm not sure if they are correct and/or the best solutions to this issue. They are as follows:
(For this example let's assume it's a real-time collaborative whiteboard:)
For every edit that happens (e.g. drawing a line), the socket will send a message to everyone collaborating. But at the same time, I will store the data in my database. The problem I see with this solution is the number of times I would need to access the database. For every line a user draws, I would be required to access the database to store it.
Use polling. For this theory, I think of saving all data in temporary storage on the server, and then after 'x' amount of time, it will get all the data from the temporary storage and save it in the database. The issue with this theory is the possibility of a failure in the temporary storage (e.g. a power failure). If the temporary storage loses its data before it is saved in the database, then I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way to store the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time traveling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legitimate, the database writes the change to a very fast on-disk data structure, a.k.a. the log, to which changed values are only ever appended. This log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the changes with what is stored in the cache or on disk.
Periodically, the changes that are present in memory and in the log are merged with the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation comes to the server, two things happen:
It is stored in the database as is, to avoid any loss (equivalent of the log)
It updates an in-memory data structure to be able to replay the change quickly in case a user requests the latest version (equivalent of the in-memory data structure)
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document.
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. Etherpad.
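The summary above can be sketched as a toy model. Everything here is hypothetical: the class name DocumentStore, the ('insert', position, text) operation format, and the use of plain Python lists in place of a real database log. The operational transformation itself (rebasing concurrent edits) is assumed to have happened before receive() is called.

```python
class DocumentStore:
    """Toy model: durable operation log + in-memory pending ops + periodic snapshot."""

    def __init__(self):
        self.log = []       # stands in for the append-only log kept in the database
        self.pending = []   # in-memory operations not yet folded into the snapshot
        self.snapshot = ""  # last stored consolidated document

    @staticmethod
    def apply(doc, op):
        """Apply one operation; only ('insert', position, text) is supported here."""
        kind, pos, text = op
        if kind != "insert":
            raise ValueError(f"unsupported operation: {kind}")
        return doc[:pos] + text + doc[pos:]

    def receive(self, op):
        self.log.append(op)      # 1) persist as-is, to avoid any loss
        self.pending.append(op)  # 2) keep in memory for fast replay

    def latest(self):
        """Replay pending operations on top of the (possibly lagging) snapshot."""
        doc = self.snapshot
        for op in self.pending:
            doc = self.apply(doc, op)
        return doc

    def compact(self):
        """Periodic step: fold pending operations into the consolidated document."""
        self.snapshot = self.latest()
        self.pending.clear()
```

With this shape, every edit costs one cheap append, and the expensive "rewrite the whole document" step only runs when compact() is called, for example on a timer.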

Processing a million records as a batch in BizTalk

I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We will be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received will be over a million. Each record has some identifying information that needs to be sent to a web service, which comes back with essentially a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment), where I control the number of scatter/worker orchestrations I spin up. The workers call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the Gather/Aggregator orchestration, which collects all the messages processed by the workers from the messagebox via correlation and creates two files to be routed to Agency A and Agency B.
So every file that gets dropped will have its own set of workers and an aggregator to process the file.
This works well for files with fewer records, but if a file has over 100k records, I see throttling happen, and the file takes a long time to process and generate the two files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and does not really aggregate the records processed by the workers until all of them are processed, and I think that since the ratio of messages published vs. processed is very large, it is throttling.
Approach 2:
Assuming that the Aggregator orchestration is the bottleneck, instead of accumulating them in an orchestration, I pushed the processed records to a SQL DB and 'split' the records into two XML files (basically a concatenation of the messages going to Agency A/B, wrapped in an XML declaration and using the correct message type, based on writing some of the context properties to the SQL table along with the record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goalposts have moved again with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing (duh):
It looks like if each worker processed fewer records in its queue/subscription, it finished its queue quickly. When testing this 100k-record file, using 100 workers completed in under 3 hours. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them whether they can handle 1000 calls, in which case maybe the existing solution would scale, given my observations.
I have adjusted a few settings for the host with regard to message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to touch these settings before and could use advice on which counters to monitor.
The post is a bit long, but I am hoping this gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for...well, whatever needs to be done. Calling the service etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
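The shape of that pattern (stage the records in a table, drain them through the service, write the result back, then query out one batch per outcome) can be sketched as follows. This is only an illustration: SQLite stands in for SQL Server, a thread pool stands in for the BizTalk app, and lookup() is a hypothetical placeholder for the real web service call.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def lookup(record_key):
    """Hypothetical stand-in for the web service call; returns 'Y' or 'N'."""
    return "Y" if int(record_key) % 2 == 0 else "N"


def process(conn, workers=8):
    """Drain unprocessed rows, call the service concurrently, store each result."""
    rows = conn.execute(
        "SELECT id, record_key FROM staging WHERE result IS NULL"
    ).fetchall()
    # The worker count caps concurrent calls, so the service is not overloaded.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda row: lookup(row[1]), rows))
    with conn:
        conn.executemany(
            "UPDATE staging SET result = ? WHERE id = ?",
            [(res, row[0]) for row, res in zip(rows, results)],
        )


def batch(conn, result):
    """Return all records for one outcome, ready to be written as a single file."""
    return [r[0] for r in conn.execute(
        "SELECT record_key FROM staging WHERE result = ? ORDER BY id", (result,)
    )]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging (id INTEGER PRIMARY KEY, record_key TEXT, result TEXT)"
)
conn.executemany(
    "INSERT INTO staging (record_key) VALUES (?)", [(str(i),) for i in range(10)]
)
process(conn)
```

Because the result lives in the table, a failed run can simply be restarted: only rows whose result is still NULL are picked up again.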
In such scenarios, you should consider the following approach:
De-batch the file and store individual records to MSMQ. You can easily achieve this without any extra coding effort, all you need is to create a send port using MSMQ adapter or WCF custom with netmsmq binding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try using messaging-only scenarios; you can handle the service response using a pipeline component if required. You can use a map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of a single message, without any complex pattern.
You can then push the messages to two MSMQ queues, one per agency, based on the web service response.
You can then receive those messages again and write them to file. You can simply use a send port with the FileAppend option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need complex orchestration patterns, which usually end up having many persistence points.
If the web service becomes a bottleneck, you can control the rate at which messages are received from MSMQ by 1) using Ordered Delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling by changing two properties: set Message Count in Db to a very low number (e.g. 1000 instead of the 50K default) and increase Spool and Tracking Data Multiplier accordingly (e.g. 500 instead of the default 10), so that the product of the two numbers is large enough not to cause throttling due to messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it using add/remove features. You can also use IBM MQ if your organization has the infrastructure. But for one million messages, MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table. Since the lookup web service is part of the same organization and network (although it uses a different stack), its team has agreed to let us query the lookup table on which their web service is based. We use a 'merge' between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped using BT. It now takes a couple of minutes for a 1.5-million-record file to be processed and for the split files to be sent.
Appreciate all the advice provided here.
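For readers wondering what such a table-to-table 'merge' looks like, here is a minimal sketch. The real schema and matching rule are not given above, so the table names incoming and lookup, the record_key column, and the assumption that "a matching row in the lookup table means Y" are all hypothetical. SQLite stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE incoming (id INTEGER PRIMARY KEY, record_key TEXT, line TEXT);
    CREATE TABLE lookup (record_key TEXT PRIMARY KEY);
    INSERT INTO incoming (record_key, line) VALUES
        ('A1', 'record one'), ('B2', 'record two'), ('C3', 'record three');
    INSERT INTO lookup (record_key) VALUES ('A1'), ('C3');
""")

# One set-based query flags every record, instead of one service call per record.
flagged = conn.execute("""
    SELECT i.line,
           CASE WHEN l.record_key IS NULL THEN 'N' ELSE 'Y' END AS flag
    FROM incoming AS i
    LEFT JOIN lookup AS l ON l.record_key = i.record_key
    ORDER BY i.id
""").fetchall()

# Split into the two output batches.
yes_lines = [line for line, flag in flagged if flag == "Y"]
no_lines = [line for line, flag in flagged if flag == "N"]
```

Replacing a million individual web service round trips with a single join is what brings the run time down from hours to minutes.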

Does SQLite feature consolidation functions for multiple entries?

I am planning to use an SQLite database on an embedded Linux computer (Raspberry Pi or similar specs) to store sensor data of about 16 floats for a period of one to two years.
The data will be accessed through a web interface which is served from the embedded board as well. The purpose is to visualize the data with graphs, etc.
Let's say the user wants to view the data for a whole year in a graph. In order not to flood the client browser with millions of data points, it makes sense to consolidate that data before it goes to the browser. For example, one year will be described with average values for each week of the year.
Does SQLite feature such data aggregation commands, like averaging or summing huge numbers of entries in a single table?
Is this operation performant on an embedded computer whose specs are similar to those of the famous Raspberry Pi?
Do these operations lock up the database, so that new entries have to wait before they can enter the database?
The simple answer is 'Yes':
https://www.sqlite.org/lang_aggfunc.html
But you may want to consider that many factors contribute to the speed of a query, not least of which are the schema/data model design and the indexes on the tables used.
See https://www.sqlite.org/queryplanner.html for discussion on how queries are executed.
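As an illustration of the aggregate functions linked above, weekly averages over a year can be computed with AVG() and GROUP BY. The table and column names (readings, ts, temperature) are assumptions for this sketch; timestamps are assumed to be stored as ISO-8601 text so that strftime() and the index on ts can be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT NOT NULL, temperature REAL NOT NULL)")
# An index on the timestamp lets the WHERE range below avoid a full table scan.
conn.execute("CREATE INDEX idx_readings_ts ON readings (ts)")
conn.executemany(
    "INSERT INTO readings (ts, temperature) VALUES (?, ?)",
    [
        ("2023-01-02 08:00:00", 10.0),
        ("2023-01-03 08:00:00", 14.0),
        ("2023-01-09 08:00:00", 20.0),
        ("2023-01-10 08:00:00", 22.0),
    ],
)

# One row per calendar week: average value and number of samples behind it.
weekly = conn.execute(
    """
    SELECT strftime('%Y-%W', ts) AS year_week,
           AVG(temperature)       AS avg_temperature,
           COUNT(*)               AS samples
    FROM readings
    WHERE ts >= ? AND ts < ?
    GROUP BY year_week
    ORDER BY year_week
    """,
    ("2023-01-01 00:00:00", "2024-01-01 00:00:00"),
).fetchall()
```

The browser then receives at most about 54 rows per year per sensor instead of the raw samples.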
You have 3 options for this:
1) Pre-calculate the data when it is generated: whenever you capture new sensor data, update your aggregates. The downside is limited flexibility for the user to change parameters: they get a fixed list of aggregates and time periods, and that's it.
2) Send the data to a more powerful central server and have the client log in and use the horsepower of the central server to compute the aggregates. The downside is that the sensor collectors need to be connected to the central server, and there will be scaling issues because all data for all clients is calculated centrally. More clients, more horsepower needed. There are many server-side scaling paradigms, so this is more a cost constraint than a technical one.
3) Send raw data to the client and let the client machine handle aggregation. The downside is data transmission if you are talking about millions of records. However, with client-side DB engines like Google's lovefield, this is the most future-proof architecture option in my opinion, as it allows you to give significant power to the user via client-side libraries and to use the client machine's resources. You could also look at a mixed model with lower-level pre-aggregation, where some data is pre-aggregated on the server before being sent to the client to reduce the data size.
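Option 1 could be sketched like this: each new reading is folded into a running per-day aggregate at insert time, so the web interface only ever reads the small aggregate table. The table daily_stats and the functions record() and daily_average() are hypothetical names, and daily granularity is just an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_stats ("
    " day TEXT PRIMARY KEY, samples INTEGER NOT NULL, total REAL NOT NULL)"
)


def record(conn, ts, value):
    """Fold one new reading into the running aggregate for its day."""
    day = ts[:10]  # assumes ISO-8601 timestamps such as '2023-01-02 08:00:00'
    with conn:
        # Create the day's row on first sight, then update it in place.
        conn.execute(
            "INSERT OR IGNORE INTO daily_stats (day, samples, total)"
            " VALUES (?, 0, 0.0)",
            (day,),
        )
        conn.execute(
            "UPDATE daily_stats SET samples = samples + 1, total = total + ?"
            " WHERE day = ?",
            (value, day),
        )


def daily_average(conn, day):
    """Return the average for one day, or None if nothing was recorded."""
    row = conn.execute(
        "SELECT total / samples FROM daily_stats WHERE day = ?", (day,)
    ).fetchone()
    return row[0] if row else None
```

Storing the count and the sum rather than the average itself keeps the update exact and lets coarser periods (weeks, months) be derived from the daily rows later.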

What methods are available to implement a local cache of a large DB driven data?

My company maintains a number of large time series databases of process data. We implement a replica of a subset at a pseudo-central location. I access the data from my laptop. The data access over our internal WAN even to the pseudo-central server is fairly expensive (time).
I would like to cache data requests locally on my laptop so that when I access it for a second time I actually pull from a local db.
There is a fairly ugly client-side DAO that I can wrap to maintain the cache, but I'm unsure how I can get the "official" client applications to talk to the cache easily. I have the freedom to write my own "client" graphing/plotting system, and I already have a custom application that does some data mining. The custom application dumps data into .csv files that are manually moved around on a very ad-hoc basis.
What is the best approach to this sort of caching/synchronization? What tools could implement the cache?
For further info, I have estimated the raw data set at approx. 5-8 TB of raw time series data per year, with at least half of the data being very compressible. I only want to cache, say, a few hundred MB locally. When ad-hoc queries are made on the data, they tend to be very repetitive over very small chunks of the data.
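To make the idea concrete, wrapping the DAO in a read-through cache backed by a local SQLite file might look something like the sketch below. All names are hypothetical: fetch_remote stands for whatever call the existing DAO exposes, and the cache key assumes that queries are identified by a tag name and a time range. Eviction to stay within a few hundred MB is not shown.

```python
import json
import sqlite3


class LocalCache:
    """Read-through cache: query results are stored in a local SQLite database."""

    def __init__(self, fetch_remote, path=":memory:"):
        self.fetch_remote = fetch_remote  # hypothetical wrapper around the existing DAO
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache"
            " (key TEXT PRIMARY KEY, data TEXT NOT NULL)"
        )

    def get(self, tag, start, end):
        key = f"{tag}|{start}|{end}"
        row = self.conn.execute(
            "SELECT data FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no WAN round trip
        data = self.fetch_remote(tag, start, end)  # cache miss: ask the server
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, data) VALUES (?, ?)",
                (key, json.dumps(data)),
            )
        return data
```

Passing a file path instead of ":memory:" makes the cache survive between sessions, which is what matters when the same small chunks are queried repeatedly.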
