MQTT-Broker meets WebServer with Database - database

I have a question about a combination of a MQTT-Broker and a Webserver.
Please check out the image below.
Is this a good way to save data from different sensors in a database?
In the picture, the web server which communicates with the database is an MQTT client. The web server just subscribes to all topics via #.
Is this scalable? I mean, if there are 100,000 sensors out there and all of them send messages to this one web server...?

If you want to keep a record of all sensor data then it's about the only option (unless you have different clients for different sensor types, to split things up a bit). The only alternative to a separate client subscribed to # would be to use a broker like HiveMQ, which has a plugin mechanism that can record all messages in a database.
Also, # should probably be sensors/# in order to skip any other messages that may be on the system.
100,000 sensors isn't the deciding factor here; the rate at which these sensors deliver messages is the important point, as it determines the actual load.
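As a rough illustration of that setup, here is a minimal sketch of such a recording client using the Eclipse Paho Java client and JDBC. The broker URL, PostgreSQL database, and table name are assumptions for the example, not details from the question.

    import org.eclipse.paho.client.mqttv3.*;
    import java.sql.*;

    public class SensorRecorder {
        public static void main(String[] args) throws Exception {
            // Assumed PostgreSQL database and table; adjust to your environment.
            Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/sensors", "user", "pass");
            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO readings (topic, payload, received_at) VALUES (?, ?, now())");

            MqttClient client = new MqttClient("tcp://localhost:1883", "db-recorder");
            client.setCallback(new MqttCallback() {
                public void messageArrived(String topic, MqttMessage message) throws Exception {
                    // One row per message; batching inserts would reduce load at high message rates.
                    insert.setString(1, topic);
                    insert.setString(2, new String(message.getPayload()));
                    insert.executeUpdate();
                }
                public void connectionLost(Throwable cause) { /* reconnect logic goes here */ }
                public void deliveryComplete(IMqttDeliveryToken token) { }
            });
            client.connect();
            // Subscribe below the sensors/ prefix rather than a bare # to skip unrelated topics.
            client.subscribe("sensors/#", 1);
        }
    }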

Related

Save measurement data into Database or Filesystem

We are currently developing a tool to count wildlife passing through defined areas. The gadget that automatically counts the animals will be sending data (weather, # of animals passing, etc.) at a 5 minute interval via HTTP to our API. There will be hundreds of these measurement stations and it should be scalable.
Now the question arose whether to use a filesystem or an RDBMS to save this data.
Pro DB
save exact time and date when the entry was created
directly related to area# via indexed key
Pro Filesystem
Collecting data is not as resource intensive since for every API call only 1 line will be appended to the file
Properties of the data:
only related to 1 DB entry (the area #)
the measurement stations are in remote areas, so we have to account for outages
What will be done with the data
Give an overview over time periods per area #
act as an early warning system if the # of animals is surprisingly low/high
probably by using a cron job and comparing to similar data
We are thinking of choosing an RDBMS to save the data, but I am worried that after millions of entries the DB will slow down and eventually stop working. This question was asked here, where 360M entries is not really considered "big data", so I'm not quite sure about my task either.
Should we choose the recommended technologies (MongoDB, ...) or can this task be handled by PostgreSQL or MySQL?
I have created such a system for marine buoys. The devices send data over GPRS/Iridium using HTTP or raw TCP sockets (to minimize bandwidth).
The receiving server stores the data in a DB table, with the data as provided and a timestamp.
The data is then parsed and records are created in another table.
The devices can also request UTC-time from the server, thus not needing a RTC.
Before any storage is made to the "raw" table, a row is appended to a text file. This is purely for logging and for being able to recover from database downtime.
As for database type, I'd recommend a regular RDBMS. Define markers for your data; we use 4-digit codes, which gives headroom for 10,000 types of measured values.
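A minimal sketch of that "append to a text file before touching the database" idea, assuming a JDBC connection and invented file/table names:

    import java.nio.file.*;
    import java.sql.*;
    import java.time.Instant;

    public class RawIngest {
        private final Connection db;
        private final Path rawLog = Paths.get("/var/log/ingest/raw.log"); // assumed location

        public RawIngest(Connection db) { this.db = db; }

        public void ingest(String payload) throws Exception {
            String line = Instant.now() + "\t" + payload + System.lineSeparator();
            // 1) Append to a plain text file first, so data survives database downtime.
            Files.write(rawLog, line.getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            // 2) Store the untouched payload with a timestamp in the "raw" table.
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO raw_messages (received_at, payload) VALUES (?, ?)")) {
                ps.setTimestamp(1, Timestamp.from(Instant.now()));
                ps.setString(2, payload);
                ps.executeUpdate();
            }
            // 3) A separate step then parses raw_messages into typed measurement rows.
        }
    }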

Processing a million records as a batch in BizTalk

I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the flat file disassembler and adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment), where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the gather/aggregator orchestration, which collects all the messages processed by the workers from the MessageBox via correlation and creates the two files to be routed to Agency A and Agency B.
So, every file that gets dropped has its own set of workers and an aggregator that processes the file.
This works well for files with fewer records, but if a file has over 100k records I see throttling happen, and it takes a long time to process the file and generate the two output files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and not really aggregating the records processed by the workers until all of them are processed, and I think that since the ratio of messages published vs. processed is very large, it is throttling.
Approach 2:
Assuming that the aggregator orchestration is the bottleneck, instead of accumulating the records in an orchestration I pushed the processed records to a SQL DB and 'split' the records into two XML files (basically a concatenation of the messages going to Agency A/B, wrapped in an XML declaration and using the correct message type, based on writing some of the context properties to the SQL table along with each record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goalposts/requirements have again changed with regard to expected volume, I am trying to see whether BizTalk is even a feasible choice anymore.
I have indicated that BizTalk is not the right tool for such a task, but the client is suggesting we add more servers to make it work. I am also looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing (duh):
It looks like if each worker processes fewer records in its queue/subscription, it finishes its queue quickly. When testing the 100k record file, using 100 workers completed in under 3 hours. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me the theoretical maximum number of concurrent connections they can handle. I am leaning towards asking whether they can handle 1,000 calls, in which case the existing solution might scale, given my observations.
I have adjusted a few settings on the host with regard to message count and physical memory thresholds so it won't balk at the volume, but I am still unsure. I didn't have to touch these settings before and could use advice on which particular counters to monitor.
The post is a bit long, but I am hoping it gives an idea of what I have done so far. Any help/insight into tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks, but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for... well, whatever needs to be done: calling the service, etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
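BizTalk and SSIS specifics aside, the staging pattern described above can be sketched in a language-agnostic way. The following Java/JDBC outline uses invented table and service names purely to illustrate the load → process → update → export flow; it is not BizTalk code.

    import java.sql.*;

    public class StagedBatch {
        // Hypothetical lookup service: returns true for "Agency A", false for "Agency B".
        interface LookupService { boolean check(String recordKey) throws Exception; }

        public void run(Connection db, LookupService service) throws Exception {
            // 1) Records were bulk-loaded into staging_records beforehand (SSIS, bcp, etc.).
            // 2) Drain unprocessed rows, call the service, write the result back.
            try (PreparedStatement select = db.prepareStatement(
                     "SELECT id, record_key FROM staging_records WHERE result IS NULL");
                 PreparedStatement update = db.prepareStatement(
                     "UPDATE staging_records SET result = ? WHERE id = ?");
                 ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    boolean yes = service.check(rs.getString("record_key"));
                    update.setString(1, yes ? "Y" : "N");
                    update.setLong(2, rs.getLong("id"));
                    update.executeUpdate();
                }
            }
            // 3) When everything is marked, export the Y and N batches as two large extracts:
            //    SELECT ... FROM staging_records WHERE result = 'Y'  (and likewise for 'N').
        }
    }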
In such scenarios, you should consider the following approach:
De-batch the file and store individual records in MSMQ. You can easily achieve this without any extra coding effort; all you need is to create a send port using the MSMQ adapter, or WCF-Custom with the netMsmqBinding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try to use messaging-only scenarios; you can handle the service response with a pipeline component if required, and you can use a map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of one message without any complex pattern.
You can then push messages back to two MSMQ queues for the two different agencies based on the web service response.
You can then receive those messages again and write them to file. You can simply use a send port with the file Append option, or a custom pipeline component, to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need complex orchestration patterns, which usually end up having many persistence points.
If the web service becomes a bottleneck, you can control the rate of messages received from MSMQ by 1) using ordered delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling: change the Message Count in DB threshold to a very low number (e.g. 1,000 instead of the 50k default) and increase the Spool and Tracking Data multipliers accordingly (e.g. 500 instead of the default 10), so that the product of the two numbers is large enough not to cause throttling due to the messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it via Add/Remove Features. You can also use IBM MQ if your organization has the infrastructure, but for one million messages MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table. Since the lookup web service is part of the same organization and network (although using a different stack), they have agreed to let us query the lookup table their web service is based on, and we use a merge between those tables to identify 'Y' or 'N' and export the split files via SSIS as well.
In short, we've skipped using BizTalk. It now takes only a couple of minutes for a 1.5 million record file to be processed and the split files sent.
Appreciate all the advice provided here.

Does SQLite feature consolidation functions for multiple entries?

I am planning to use an SQLite database on an embedded Linux computer (Raspberry Pi or similar specs) to store sensor data of about 16 floats for a period of one to two years.
The data will be accessed through a web interface which is served from the embedded board as well. The purpose is to visualize the data with graphs, etc.
Let's say the user wants to view the data for a whole year in a graph. In order not to flood the client browser with millions of data points, it makes sense to consolidate the data before it goes to the browser. For example, one year could be described with average values for each week of the year.
Does SQLite offer such data aggregation commands, like averaging or summing huge numbers of entries in a single table?
Is this operation performant on an embedded computer whose specs are similar to those of the famous Raspberry Pi?
Do these operations lock up the database, so that new entries have to wait before they can be inserted?
Simple answer is 'Yes'
https://www.sqlite.org/lang_aggfunc.html
But you may want to consider that there are many factors that contribute to the speed of a query, not least of which are the schema/data model design and the indexes on the tables used.
See https://www.sqlite.org/queryplanner.html for discussion on how queries are executed.
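For illustration, a weekly-average query of the kind the question describes might look like this. The sketch assumes the xerial sqlite-jdbc driver and a readings(ts, value) table with ISO-8601 text timestamps; none of these names come from the original post.

    import java.sql.*;

    public class WeeklyAverages {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:/home/pi/sensors.db");
                 Statement st = db.createStatement();
                 // strftime('%Y-%W', ts) buckets rows by year and week number,
                 // provided ts is stored in a format SQLite's date functions understand.
                 ResultSet rs = st.executeQuery(
                     "SELECT strftime('%Y-%W', ts) AS week, AVG(value) AS avg_value, COUNT(*) AS n " +
                     "FROM readings GROUP BY week ORDER BY week")) {
                while (rs.next()) {
                    System.out.printf("%s avg=%.3f (n=%d)%n",
                            rs.getString("week"), rs.getDouble("avg_value"), rs.getLong("n"));
                }
            }
        }
    }

Regarding the locking question: with the default rollback journal a long read does block writers, but enabling WAL mode (PRAGMA journal_mode=WAL) lets readers run concurrently with a single writer, so aggregate queries need not stall new inserts.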
You have 3 options for this:
1) Pre-calculate the data when it is generated: whenever you take in new sensor data, update your aggregates (a minimal sketch of this appears after the list below). The downside is limited flexibility for the user to change parameters; they get a set list of aggregates and time periods, and that's it.
2) Send the data to a more powerful central server and have the client log in and use the horsepower of the central server to compute the aggregates. The downside is that the sensor collectors will need to be connected to the central server, and there will be scaling issues since all data for all clients is calculated centrally: more clients, more horsepower needed. There are many server-side scaling paradigms, so this is more a cost constraint than a technical one.
3) Send the raw data to the client and let the client machine handle the aggregation. The downside is data transmission if you are talking about millions of records. However, with client-side DB engines like Google's Lovefield, this is in my opinion the most future-proof architecture option, as it allows you to give significant power to the user via client-side libraries and to use the client machine's resources. You could also look at a mixed model where some data is pre-aggregated on the server before being sent to the client, to reduce the data size.
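As a minimal sketch of option 1 (pre-aggregating at write time), assuming SQLite 3.24+ for UPSERT support, a unique index on weekly_rollup(sensor, week), and invented table names:

    import java.sql.*;

    public class PreAggregatingWriter {
        public void record(Connection db, String sensor, double value, String isoTimestamp)
                throws SQLException {
            try (PreparedStatement raw = db.prepareStatement(
                     "INSERT INTO readings (sensor, ts, value) VALUES (?, ?, ?)");
                 PreparedStatement rollup = db.prepareStatement(
                     // Maintain a weekly sum/count so averages are cheap to read later.
                     "INSERT INTO weekly_rollup (sensor, week, total, n) " +
                     "VALUES (?, strftime('%Y-%W', ?), ?, 1) " +
                     "ON CONFLICT(sensor, week) DO UPDATE SET total = total + excluded.total, n = n + 1")) {
                raw.setString(1, sensor); raw.setString(2, isoTimestamp); raw.setDouble(3, value);
                raw.executeUpdate();
                rollup.setString(1, sensor); rollup.setString(2, isoTimestamp); rollup.setDouble(3, value);
                rollup.executeUpdate();
            }
        }
    }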

I want to log all MQTT messages of the broker. How should I design the database schema, avoiding duplicate entries and allowing fast searching?

I am implementing a callback in Java to store messages in a database. I have a client subscribing to '#', but the problem is that when this '#' client disconnects and reconnects, it adds duplicate entries of retained messages to the database. If I search for previous entries, bigger tables will be expensive in computing power. So should I allot a separate table for each sensor, or one per broker? I would really appreciate it if you could suggest better designs.
Subscribing to wildcard with a single client is definitely an anti-pattern. The reasons for that are:
Wildcard subscribers get all messages of the MQTT broker. Most client libraries can't handle that load, especially not when transforming / persisting messages.
If your wildcard subscriber dies, you will lose messages (unless the broker queues endlessly for you, which also doesn't work).
You essentially have a single point of failure in your system. Use MQTT brokers which are hardened for production use; these are much more robust single points of failure than your hand-written clients. (You can overcome the SPOF through clustering and load balancing, though.)
So to solve the problem, I suggest the following:
Use a broker which can handle shared subscriptions (like HiveMQ or MessageSight), so you can balance all messages between many clients
Use a custom plugin for doing the persistence at the broker instead of the client.
You can also read more about that topic here: http://www.hivemq.com/blog/mqtt-sql-database
Also consider using QoS 2 for all messages to make sure each message is delivered exactly once. You may also consider time-stamping each message, to avoid inserting duplicate messages if the QoS requirement is not met.
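On the duplicate-entry problem specifically: retained messages are redelivered on every new subscription, so one option is to guard against them in the callback and/or let a unique constraint reject duplicates at the database. A sketch using the Eclipse Paho Java client, with an assumed PostgreSQL-style ON CONFLICT clause and invented table/column names:

    import org.eclipse.paho.client.mqttv3.*;
    import java.sql.*;

    public class DedupingCallback implements MqttCallback {
        private final PreparedStatement insert;

        DedupingCallback(Connection db) throws SQLException {
            // A unique index on (topic, payload_hash, ts) is assumed, so duplicates are silently dropped.
            insert = db.prepareStatement(
                "INSERT INTO messages (topic, payload, payload_hash, ts) VALUES (?, ?, ?, ?) " +
                "ON CONFLICT DO NOTHING");
        }

        @Override
        public void messageArrived(String topic, MqttMessage message) throws Exception {
            if (message.isRetained()) {
                // Retained messages are replayed on every (re)subscription; skip them here
                // if they were already stored when originally published.
                return;
            }
            String payload = new String(message.getPayload());
            insert.setString(1, topic);
            insert.setString(2, payload);
            insert.setInt(3, payload.hashCode());
            insert.setTimestamp(4, new java.sql.Timestamp(System.currentTimeMillis()));
            insert.executeUpdate();
        }

        @Override public void connectionLost(Throwable cause) { }
        @Override public void deliveryComplete(IMqttDeliveryToken token) { }
    }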

One database, multiple frontends, maintaining request ordering

Assuming I have one database keeping a simple history with multiple front ends talking to it (one front end per server), I wonder what are the common solutions to deal with time. As soon as I have multiple servers, I cannot assume a global consistent clock, and I was interested in the possible solutions to maintain some kind of ordering between requests.
For a concrete example, let's say I want to record histories of customers, where history is defined as time ordered set of records. The record table would be as simple as (customer_id, time, data), and history would be all the rows where customer_id == requested id. Each request sent by the user would contain one record sent to one customer. Ideally, the time should refer to the "actual" time the request was sent to the front end by the customer (as that's the time as seen from the user POV). To be exact, I only care about preserving the ordering between records for each customer, not about the absolute time.
I am aware of solutions such as vector clocks, etc., but that seems rather complex, and I would expect this to be a rather common issue?
Solutions which are not acceptable in my case:
Changing the requests arriving at the front end: I unfortunately have to work under the constraint that the requests are passed as is. I have complete control of whatever communication protocol is needed between front ends and database, though.
Server time clocks are synchronized
All request which require being ordered to each other are handled by the same front end server
[EDIT]: the question may sound a bit like a red herring, so here is my rationale for asking it: while this is not my issue right now, I am interested in the possibility of moving to a platform like Google App Engine, which explicitly says that its servers are not guaranteed to be time-synchronized. The solution to request ordering in that case does not sound obvious to me; maybe something like a vector clock is actually the only "good" solution?
When you perform any action that records history data to the database you could record two sets of datetime info:
the datetime as set by the DB when the record was inserted
the datetime passed through with the data as a legitimate piece of metadata.
The former would give you a central view of the world if you ever needed it, and the latter would let you reconstruct the datetime from the customer's perspective.
If you were ultra-keen, you could also pass through the datetime from the user's browser by filling in some sort of parameter/field using JavaScript.
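A small sketch of that two-timestamp idea, with an assumed PostgreSQL-style schema and invented names: the database fills in the server-side time via a column default, and the front end forwards the client-observed time as ordinary data.

    import java.sql.*;
    import java.time.OffsetDateTime;

    public class HistoryWriter {
        // Assumed schema:
        //   CREATE TABLE history (customer_id BIGINT, server_ts TIMESTAMPTZ DEFAULT now(),
        //                         client_ts TIMESTAMPTZ, data TEXT);
        public void append(Connection db, long customerId, OffsetDateTime clientTime, String data)
                throws SQLException {
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO history (customer_id, client_ts, data) VALUES (?, ?, ?)")) {
                ps.setLong(1, customerId);
                // The front end forwards the time it observed; server_ts is filled by the DB default.
                ps.setObject(2, clientTime);
                ps.setString(3, data);
                ps.executeUpdate();
            }
        }
    }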
"As soon as I have multiple servers, I cannot assume a global consistent clock"
Well, you can configure servers to sync their clocks to a time server. You could also configure your database server to sync to a time server, and configure the other servers to sync to your database server as often as you need to. (I'm not saying that's a great idea, just saying it's possible. If you have access to all the servers.)
Anyway... so the front ends are the only pieces of software you have that actually know when a request arrives. Is that right?
If that's right, then it's the front end's job to record the time of the customer's request, possibly in UTC, and then forward that timestamp to the database.
If you can't synchronize the servers' clocks, then I think your only hope is to have every front end ask just one specific server (maybe your database server, but maybe not) what time it is when a customer request arrives. A front end can do that by asking for the daytime on port 13 (DAYTIME protocol, RFC 867), asking for the time on port 37 (TIME protocol, RFC 868), or asking a time server on port 123 (either the NTP or SNTP protocol, RFC 1305 and RFC 2030).
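If you go the "ask one designated time source" route, here is a minimal sketch using Apache Commons Net's SNTP client (an assumed dependency, not something mentioned in the question):

    import java.net.InetAddress;
    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    public class CentralClock {
        // Returns the current time in millis as seen by one designated time server.
        public static long centralTimeMillis(String timeHost) throws Exception {
            NTPUDPClient ntp = new NTPUDPClient();
            ntp.setDefaultTimeout(3000); // don't hang the request path on a slow time server
            try {
                TimeInfo info = ntp.getTime(InetAddress.getByName(timeHost));
                info.computeDetails(); // computes the offset between local and server clocks
                long offset = info.getOffset() == null ? 0 : info.getOffset();
                return System.currentTimeMillis() + offset;
            } finally {
                ntp.close();
            }
        }
    }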
But after reading your edit, I think what you want is impossible. You seem to be saying that:
what the front ends send doesn't contain enough information to reconstruct the "true" ordering
what the front ends send cannot be changed
If the front ends can't send you any other information, vector clocks and interval tree clocks won't help.
