"Standard" approach to collecting data from/distributing data to multiple devices/servers? - database

I'll start with the scenario I am most interested in:
We have multiple devices (2 - 10) which all need to know about
a growing set of data (thousands to hundreds of thousands of small chunks,
say 100 - 1000 bytes each).
Data can be generated on any device, and we
want every device to be able to get all the data. (Edit: eventually, that is. Devices are not connected and/or online all the time, but they synchronize now and then.) No data needs
to be deleted or modified.
There are of course a few naive approaches to handle this, but I think
they all have some major drawbacks. Naively sending everything I
have to everyone else will lead to poor performance with lots of old data
being sent again and again. Sending an inventory first and then letting
other devices request what they are missing won't do much good for small
data. So maybe having each device remember when and with whom it last talked
could be a worthwhile tradeoff? As long as the number of partners
is relatively small, saving the date of our last sync does not use much
space, and it should be easy to just send what has been added since then.
But that's all just conjecture.
This could be a very broad
topic and I am also interested in the problem as a whole: (Decentralized) version control probably does something similar
to what I want, as does a piece of
software syncing photos from a user's smartphone, tablet and camera to an online
storage, and so on.
Somehow they're all different though, and there are many factors to keep in mind, like data size, bandwidth, consistency requirements, processing power, or how many devices have aggregated new data between syncs. So what is the theory about this?
Where do I have to look to find
papers and such about what works and what doesn't, or is each case just so much
different from all the others that there are no good all round solutions?
Clarification: I'm not looking for ready-made software solutions/products. It's more like the question of which search algorithm to use to find paths in a graph. Computer science books will probably tell you it depends on the features of the graph (directed? weighted? hypergraph? Euclidean?) or whether you will eventually need every possible path or just a few. There are different algorithms for whatever you need. I also considered posting this question on https://cs.stackexchange.com/.

In your situation, I would investigate a messaging service that implements the AMQP standard, such as RabbitMQ or OpenAMQ. Each time a new chunk is emitted, it should be sent to the AMQP broker, which will broadcast it to all devices' queues. The message may then be pushed to the consumers or pulled from the queue.
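The fan-out behaviour described here can be illustrated without a real broker. The sketch below is a toy in-memory model, not the RabbitMQ or AMQP API; the `FanoutBroker` class and its method names are made up for illustration. Each device gets its own queue, so a device that was offline simply finds its backlog waiting when it pulls.

```python
from collections import deque


class FanoutBroker:
    """Toy stand-in for a broker that copies each message to every queue."""

    def __init__(self):
        self.queues = {}  # device name -> queue of undelivered chunks

    def bind(self, device):
        """Create a queue for a device if it does not have one yet."""
        self.queues.setdefault(device, deque())

    def publish(self, sender, chunk):
        """Copy the chunk into the queue of every device except the sender."""
        for device, pending in self.queues.items():
            if device != sender:
                pending.append(chunk)

    def pull(self, device):
        """Return and remove everything waiting for this device."""
        pending = self.queues[device]
        items = list(pending)
        pending.clear()
        return items
```

Note that this design needs the broker to be reachable by every device, which is a different trade-off from the peer-to-peer synchronization the question describes.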

You can also consider Kafka for streaming data from several producers to several consumers. Another possibility is ZeroMQ. It depends on your specific needs.

Have you considered using Amazon Simple Notification Service (SNS) to solve this problem?
You can create a topic for each group of devices you want to keep in sync. Whenever there is an update to the dataset, the device can publish it to the topic, and SNS will push it to all subscribed devices.

Related

Best DB architecture to maintain/update counters in near real time

I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real time analysis (i.e. apply a predictive model) and update them when needed while parsing the messages.
Considering the high throughput and the need to be in real time, which DB solution would be the best choice?
I was thinking about an in-memory key-value DB that persists data to disk periodically (like Redis).
Thanks in advance for the help.
The best choice depends on many factors we don’t know, like what tech stack is your team already using, how open are they to learning new things, how much operational burden are you willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.
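Whichever store is chosen, the counting logic itself is small. Below is a hedged sketch of one common way to keep per-id counts over a time frame: fixed-size time buckets that are dropped once they fall out of the window. The `WindowCounter` class, its parameters and the assumption that timestamps arrive in non-decreasing order are illustrative choices, not something from the question, and the result is only accurate to the bucket granularity.

```python
from collections import defaultdict, deque


class WindowCounter:
    """Approximate per-key counts over the last `window_seconds`."""

    def __init__(self, window_seconds=3600, bucket_seconds=60):
        self.window = window_seconds
        self.bucket = bucket_seconds
        # key -> deque of [bucket_start, count], oldest bucket first
        self.buckets = defaultdict(deque)

    def _expire(self, key, now):
        """Drop buckets that lie entirely outside the window."""
        pending = self.buckets[key]
        cutoff = now - self.window
        while pending and pending[0][0] + self.bucket <= cutoff:
            pending.popleft()

    def add(self, key, now):
        """Record one occurrence of `key` at time `now` (in seconds)."""
        start = now - (now % self.bucket)
        pending = self.buckets[key]
        if pending and pending[-1][0] == start:
            pending[-1][1] += 1
        else:
            pending.append([start, 1])
        self._expire(key, now)

    def count(self, key, now):
        """Occurrences of `key` in the window ending at `now`."""
        self._expire(key, now)
        return sum(count for _, count in self.buckets[key])
```

At about 100 messages per second this fits comfortably in memory for a single process; the same bucket layout could be mapped onto keys with expiry in an external key-value store if the counts need to be shared.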

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2 kB/second). Each sensor will send its records off to a remote server once per minute, resulting in ~18 million records and 200 MB of data per day per sensor. Not sure how many sensors we will need, but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure of the time period, guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable.
Was thinking about using Elasticsearch (then maybe use X-Pack or Sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS, so we have access to tools like Kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing, they'll be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services?
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
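The throughput estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the figure of 5 sensors is the assumption used in the estimate above (the question only says "single digits").

```python
SECONDS_PER_DAY = 24 * 60 * 60

records_per_second_per_sensor = 200  # from the question
sensors = 5                          # assumed sensor count

# Exact daily volume per sensor at a steady 200 records per second.
per_sensor_per_day = records_per_second_per_sensor * SECONDS_PER_DAY

# Steady-state rate across all sensors.
total_per_second = records_per_second_per_sensor * sensors

# Same rate derived from the rounded figure of 18 million records per day.
rounded_estimate_per_second = 18_000_000 * sensors / SECONDS_PER_DAY

print(per_sensor_per_day)                  # 17280000
print(total_per_second)                    # 1000
print(round(rounded_estimate_per_second))  # 1042
```

Because each sensor uploads once per minute, the write load on the server arrives in bursts of roughly 12,000 records per sensor rather than as a smooth stream, which is another reason to put a queue in front of the database.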
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or Elasticsearch, depending on your team's preferences. Some would say that this is time series data so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using an RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends on 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.

Message queue like RabbitMQ for high volume writes to SQL database?

The scenario is needing to write high volume data, like tracking clicks or mouse movements, from a web application to a SQL database. The data doesn't need to be written right away because the analysis on the data happens on some recurring basis, like daily or weekly.
I want some feedback on a solution that comes to mind:
The click and mouse data is published to a message queue. This stores the queue items in memory so it should be fast and faster than SQL. Then on some other server a job plugs away on retrieving the next queue item and writing the data to SQL.
Does anyone know of implementations like this? What pitfalls am I failing to see? If this solution is not a good one are there other alternatives?
Regards
RabbitMQ is meant for real-time message exchange, not for temporarily buffering data. If you are able to consume all data as soon as it arrives in your queues, then this solution will work for you. Otherwise RabbitMQ's memory use will grow until it eventually dies, and you will have to configure it to throw some data away (there are a lot of options for choosing the rules for this).
You could instead store the data in a Redis cache, which you can do as fast as you can publish your events to RabbitMQ. Then you can listen for new changes in Redis from a remote server and fill up whatever database storage you use, or even use Redis itself as your data storage.
To solve a very similar problem I was considering doing exactly this. In the end we decided not to go for it because we did need access to the data very quickly. However I still like the idea.
I've also recently learnt that under the hood this is exactly the way that Microsoft Dynamics CRM does its database updates, using message passing.
Things I think you would need to pay careful attention to.
Make sure that if your RabbitMQ instance disappeared it wouldn't have any effect on your client. Rabbit dying is bad enough; your client erroring because Rabbit is down would be terrible.
If it's truly very high volume (and it's good practice for reliability anyway), clustering is something worth looking at.
Obviously, paying attention to your dead-letter queues is a must. But the ability to play back messages which failed for some reason is awesome; in theory, at least, your data should always eventually get to your database, even if it went down for a period of time.
Make sure you can keep up with the number of messages being passed in. Of course, this should be solvable by adding more consumers to a given queue. Which leads to...
Idempotency of messages. Given that your messages relate directly to a DB write, they HAVE to be idempotent.
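The last two points can be sketched together: a consumer drains the queue in batches and writes each batch in one transaction, and a unique event id makes a replayed message harmless. This is a minimal illustration using only the Python standard library, with `queue.Queue` standing in for the message broker and SQLite for the SQL database; the table layout, the `event_id` field and the function names are assumptions made for the example.

```python
import queue
import sqlite3


def drain(pending, max_batch=500):
    """Take up to max_batch events off the queue without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    return batch


def write_batch(conn, batch):
    """Write a batch in one transaction; duplicates by event_id are ignored."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO clicks (event_id, x, y) VALUES (?, ?, ?)",
            batch,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks (event_id TEXT PRIMARY KEY, x INTEGER, y INTEGER)"
)

pending = queue.Queue()
# The third event is a redelivery of the first one.
for event in [("e1", 10, 20), ("e2", 11, 21), ("e1", 10, 20)]:
    pending.put(event)

write_batch(conn, drain(pending))
stored = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(stored)  # 2
```

`INSERT OR IGNORE` is SQLite syntax; other databases express the same idea differently (for example with an upsert or a conflict clause), so treat the statement as a placeholder for whatever the target database offers.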

Are AKKA actors a good solution for optimizing my setup?

I work on a project, where much of concurrent reading and writing to the DB is bringing performance down. Imagine that I need to sort of reindex the entire DB from time to time, so, the simplest way possible is to set a "dirty" flag to true, and let multiple machines grab at the "dirty" items, do some processing, and then set their state to "clean" again. As you can imagine, this is a paradise for deadlocks.
I want to optimize this, and leave the DB IO operations to one coordinating machine, and the rest of the possible concurrent computing to other machines. I thought that Akka, with its distributed actor model, could be an ideal fit for this. My idea is to have the coordinating actor read batches of "dirty" items from the DB and fire off numerous processing actors, passing an item to each one. The idea is that the processing actors will reside on different machines, but neither the coordinator nor the processors should be aware of this. This seems to be made possible by, and is one of the big advantages of, using Akka. I want to make this a deployment configuration issue, to make scaling at will possible.
After the processing actors finish the processing, they can send the result as a message to the coordinating actor, which will use the same connection to save its state.
Am I going in the right direction with this setup?
You can use Cluster Singleton as the coordinator. Note that the coordinating actor will handle all requests sequentially, so it should be pretty lightweight. At the very least, you may want to separate bulk reading from writing back. And maybe (if you don't have triggers) also read dirty blocks with pagination so as not to block the actor for a long time. I used an iterator defined over a ResultSet (Oracle JDBC drivers automatically do pagination) and something like case ScheduledBulk if previousBulkFinished => future {getIterator(...).foreach(coordinator ! _)}.
If you want fast writes to the DB, you may use a Fixed Size Router to distribute writes between many actors (count < size of your DB connection pool).
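The division of labour discussed here (one coordinator owns all DB IO, workers only compute) does not depend on Akka and can be shown in a small sketch. The example below is not Akka code: it uses a Python thread pool in place of remote processing actors and SQLite in place of the real database, and the `items` table, the `process` function and the batch size are all hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def process(item):
    """Pure computation done by a worker; it never touches the database."""
    item_id, payload = item
    return item_id, payload.upper()


def reindex(conn, pool, batch_size=100):
    """Coordinator loop: read a dirty batch, fan out, write results back."""
    total = 0
    while True:
        batch = conn.execute(
            "SELECT id, payload FROM items WHERE dirty = 1 LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not batch:
            return total
        results = list(pool.map(process, batch))
        # Only the coordinator writes, so workers cannot deadlock on rows.
        with conn:
            conn.executemany(
                "UPDATE items SET payload = ?, dirty = 0 WHERE id = ?",
                [(payload, item_id) for item_id, payload in results],
            )
        total += len(results)
```

Reading in bounded batches keeps the coordinator responsive, which mirrors the pagination advice above; in an actor system the `pool.map` call would be replaced by messages to and from the processing actors.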

Queues against Tables in messaging systems [closed]

Closed 11 years ago.
I've been experiencing the good and the bad sides of messaging systems in real production environments, and I must admit that a well organized table or schema of tables simply beats every time any other form of messaging queue, because:
Data are permanently stored on a table. I've seen so many java (jms) applications that lose or vanish messages on their way for uncaught exceptions or other bugs.
Queues tend to fill up. Db storage is virtually infinite, instead.
Tables are easily accessible, while you have to use esoteric instruments to read from a queue.
What's your opinion on each approach?
The phrase "beats every time" totally depends on what your requirements were to begin with. Certainly it's not going to beat every time for everyone.
If you are building a single system which is already using a database, you don't have very high throughput requirements, and you don't have to communicate with any other teams or systems, then you're probably right.
For simple, low-throughput, mostly single-threaded stuff, databases are a totally fine alternative to message queues.
Where a message queue shines is when
you want a high performance, highly concurrent and scalable load balancer so you can process tens of thousands of messages per second concurrently across many servers/processes (using a database table you'd be lucky to process a few hundred a second and processing with multiple threads is pretty hard as one process will tend to lock the message queue table)
you need to communicate between different systems using different databases (so don't have to hand out write access to your systems database to other folks in different teams etc)
For simple systems with a single database, team and fairly modest performance requirements - sure use a database. Use the right tool for the job etc.
However where message queues shine is in large organisations where there are lots of systems that need to communicate with each other (and so you don't want a business database to be a central point of failure or place of version hell) or when you have high performance requirements.
In terms of performance a message queue will always beat a database table - as message queues are specifically designed for the job and don't rely on pessimistic table locks (which are required for a database implementation of a queue - to do the load balancing) and good message queues will perform eager loading of messages to queues to avoid the network overhead of a database.
Similarly - you'd never use a database to do load balancing of HTTP requests across your web servers - as it'd be too slow - if you have high performance requirements for your load balancer you'd not use a database either.
I've used tables first, then refactor to a full-fledged msg queue when (and if) there's reason - which is trivial if your design is reasonable.
The biggest benefits are a.) it's easier, b.) it's a better audit trail because you have the other tables to join to, c.) if you know the database tools really well, they are easier to use than the message queue tools, d.) it's generally a bit easier to set up a test/dev environment in a context that already exists for your app (if the same familiarity applies).
Oh, and e.) for perhaps you and others, it's not another product to learn, install, configure, administer, and support.
IMPE, it's just as reliable and disconnectable, and you can convert if it needs to be more scalable.
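For reference, a table used as a queue can be as small as the sketch below. It is an illustration only, using SQLite from the Python standard library; the `jobs` table, its `status` and `owner` columns and the `claim_next` function are invented for the example. The guarded UPDATE is what keeps two workers from taking the same row, and the caller has to poll when nothing is available.

```python
import sqlite3


def claim_next(conn, worker):
    """Claim the oldest unprocessed job for `worker`, or return None."""
    with conn:
        row = conn.execute(
            "SELECT id, body FROM jobs WHERE status = 'new' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE makes the claim safe if another worker
        # grabbed the same row between our SELECT and UPDATE.
        claimed = conn.execute(
            "UPDATE jobs SET status = 'taken', owner = ? "
            "WHERE id = ? AND status = 'new'",
            (worker, row[0]),
        )
        if claimed.rowcount != 1:
            return None
        return row


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, body TEXT, "
    "status TEXT DEFAULT 'new', owner TEXT)"
)
conn.executemany("INSERT INTO jobs (body) VALUES (?)", [("first",), ("second",)])
conn.commit()
```

Because rows are only marked, not deleted, the table doubles as the audit trail mentioned in point b.); the cost is the polling and row contention that the other answers describe.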
Data are permanently stored on a table. I've seen so many java (jms) applications that lose or vanish messages on their way for uncaught exceptions or other bugs.
Which JMS implementation? Sun sells reliable queue which can't lose messages. Perhaps you just purchased a cheesy JMS-compliant product. IBM's MQ is extremely reliable, and there are JMS libraries to access it.
Queues tend to fill up. Db storage is virtually infinite, instead.
Ummm... If your queue fills up, it sounds like something is broken. If your apps crash, that's not a good thing, and queues have little to do with that. If you've purchased a really poor JMS implementation, I can see where you might be unhappy with it. It's a competitive market-place. Find a better queue manager. Sun's JCAPS has a really good queue manager, formerly the SeeBeyond message queue.
Tables are easily accessible, while you have to use esoteric instruments to read from a queue.
That doesn't fit with my experience. Tables are accessed through this peculiar "other language" (SQL), and requires that I be aware of structure mappings from tables to objects and data type mappings from VARCHAR2 to String. Further, I have to use some kind of access layer (JDBC or an ORM which uses JDBC). That seems very, very complex. A queue is accessed through MessageConsumers and MessageProducers using simple sends and receives.
It sounds as though the problems you've experienced are not inherent to messaging, but rather are artifacts of poorly-implemented messaging systems. Is building messaging systems harder than building database systems? Yes, if all you ever do is build database systems.
Losing messages to uncaught exceptions? That's hardly the fault of the message queue. The applications you're using are poorly engineered. They're removing messages from the queue before processing completes. They're not using transactions, or journalling.
Message queues fill up while DB storage is "virtually infinite"? You talk as though managing disk space were something that databases didn't require. Message queue servers require administration, just like database servers do.
Esoteric instruments to read from a queue? Maybe if you find asynchronous methods esoteric. Maybe if you find serialization and deserialization esoteric. (At least, those are the things I found esoteric when I was learning messaging. Like many seemingly-esoteric technologies, they're actually quite mundane once you understand them, and understanding them is an important part of the seasoned developer's education.)
Aspects of messaging that make it superior to databases:
Asynchronous processing. Message queues notify waiting processes when new messages arrive. To accomplish this functionality in a database, the waiting processes have to poll the database.
Separation of concerns. The communications channel is decoupled from the implementation details of the message content. Only the sender and the receiver need to know anything about the format of the data stream within a given message.
Fault-tolerance. Messaging can function when connections between servers are intermittent. Message queues can store messages locally and only forward them to remote servers when the connection is live.
Systems integration. In the Windows world, at least, messaging is built into the operating system. It uses the OS's security model, it's managed through the OS's tools, etc.
If you don't need these things, you probably don't need messaging.
Here's a simple example of an application for messaging: I'm building a system right now where users, distributed across multiple networks, are entering fairly intricate sets of transactions that are used to produce printed output. Output generation is computationally expensive and not part of their workflow; i.e. the users don't care when the output gets generated, just that it does.
So we serialize the transactions into a message and drop it in a queue. A process running on a server grabs messages from the queue, produces the output, and stores the output in an imaging system.
If we used a database as our message store, we'd have to come up with a schema to store a transaction format that right now only the sender and receiver care about, we'd need to make sure every workstation on the network had permanent persistent connections to the database server, we'd have no capacity to distribute this transaction load across multiple servers, and our output server would have to query the database thousands of times a day waiting to see if there were new jobs to process.
Queues provide reliable messaging. The store-and-forward, disconnected nature of queueing makes it much more scalable than databases, not to mention more robust.
And queues shouldn't really be used for permanent storage of information - it is best to think of them as temporary inboxes, unlike databases.

Resources