I have a Flink project that receives an events streams, and executes some logic to add a flag of this event, then it saves the flag and the eventID for a while to be reused or to be queried by other system.
in this case, the volume of data is not too many, and need to be good reliability, of course, better to be updated in time before being used.
Traditionally, we can use an external database to save this kind of data.
But after I learned the state, I saw it seems to be very useful, and has a good backends mechanism, and can be queryable.
So I am asking question to listen more to your arguments and evidence.
I am moving my last two comments to here as an answer since I realized I am essentially doing that.
Ok, It might have been the Uber keynote then. But the bottom line is that there are companies that are using extremely large state to hold data that you need to perform calculations against effectively.
For example, I made a program that took in messages that with an unique ID and a value field(int). I then had a stateful function that was keyed by the ID of the received message and every message I received for that ID would be added to a stateful value object, updating the the total for that ID. You could make a stateful list object to hold all the messages you received if you needed that. An alternative to that is to use a "new age" database that is designed for quick read/writes, like Cassandra, to store that. But that approach comes with its own limitations because of the I/O (long story short, Flink and Cassandra could handle lots of dat fast, the network bandwidth could not).
So keeping all that data in state in flink can be done and used well and has many benefits.
The one thing that I have to caveat this with is that I do not know if Flink's state has the same sort of failsafes like that of Cassandra or Kafka. Whereas they replicate their data across nodes so that if one goes down, then the others can handle everything and repopulate the other node when it is restarted. Flink's state can be stored on a remote backend like an s3 bucket or hdfs
(see: https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/state_backends.html), but I do not know if there is replication of the state. So if the state is stored all on one node that goes down, if it is gone for good or is backed up on another node. That is something to look into more since that should be a big decision in your choice.
Hope that at least gave you some info and a brief idea of what questions to ask.
I am using Camel and I have a business problem. We consume order messages from an activemq queue. The first thing we do is check in our DB to see if the customer exists. If the customer doesn't exist then a support team needs to populate the customer in a different system. Sometimes this can take a 10 hours or even the following day.
My question is how to handle this. It seems to me at a high level I can dequeue these messages, store them in our DB and re-run them at intervals (a custom coded solution) or I could note the error in our DB and then return them back to the activemq queue with a long redelivery policy and expiration, say redeliver every 2 hours for 48 hours.
This would save a lot of code but my question is if approach 2 is a sound approach or could lead to resource issues or problems with not knowing where messages are?
This is a pretty common scenario. If you want insight into how the jobs are progressing, then it's best to use a database for this.
Your queue consumption should be really simple: consume the message, check if the customer exists; if so process, otherwise write a record in a TODO table.
Set up a separate route to run on a timer - every X minutes. It should pull out the TODO records, and for each record check if the customer exists; if so process, otherwise update the record with the current timestamp (the last time the record was retried).
This allows you to have a clear view of the state of the system, that you can then integrate into a console to see what the state of the outstanding jobs is.
There are a couple of downsides with your Option 2:
you're relying on the ActiveMQ scheduler, which uses a KahaDB variant sitting alongside your regular store, and may not be compatible with your H/A setup (you need a shared file system)
you can't see the messages themselves without scanning through the queue, which is an antipattern - using a queue as a database - you may as well use a database, especially if you can anticipate needing to ever selectively remove a particular message.
Imagine the following case:
1,000 clients are connected to a Meteor page displaying the content of the "Somestuff" collection.
"Somestuff" is a collection holding 1,000 items.
Someone inserts a new item into the "Somestuff" collection
What will happen:
All Meteor.Collections on clients will be updated i.e. the insertion forwarded to all of them (which means one insertion message sent to 1,000 clients)
What is the cost in term of CPU for the server to determine which client needs to be updated?
Is it accurate that only the inserted value will be forwarded to the clients, and not the whole list?
How does this work in real life? Are there any benchmarks or experiments of such scale available?
The short answer is that only new data gets sent down the wire. Here's
how it works.
There are three important parts of the Meteor server that manage
subscriptions: the publish function, which defines the logic for what
data the subscription provides; the Mongo driver, which watches the
database for changes; and the merge box, which combines all of a
client's active subscriptions and sends them out over the network to the
client.
Publish functions
Each time a Meteor client subscribes to a collection, the server runs a
publish function. The publish function's job is to figure out the set
of documents that its client should have and send each document property
into the merge box. It runs once for each new subscribing client. You
can put any JavaScript you want in the publish function, such as
arbitrarily complex access control using this.userId. The publish
function sends data into the merge box by calling this.added, this.changed and
this.removed. See the
full publish documentation for
more details.
Most publish functions don't have to muck around with the low-level
added, changed and removed API, though. If a publish function returns a Mongo
cursor, the Meteor server automatically connects the output of the Mongo
driver (insert, update, and removed callbacks) to the input of the
merge box (this.added, this.changed and this.removed). It's pretty neat
that you can do all the permission checks up front in a publish function and
then directly connect the database driver to the merge box without any user
code in the way. And when autopublish is turned on, even this little bit is
hidden: the server automatically sets up a query for all documents in each
collection and pushes them into the merge box.
On the other hand, you aren't limited to publishing database queries.
For example, you can write a publish function that reads a GPS position
from a device inside a Meteor.setInterval, or polls a legacy REST API
from another web service. In those cases, you'd emit changes to the
merge box by calling the low-level added, changed and removed DDP API.
The Mongo driver
The Mongo driver's job is to watch the Mongo database for changes to
live queries. These queries run continuously and return updates as the
results change by calling added, removed, and changed callbacks.
Mongo is not a real time database. So the driver polls. It keeps an
in-memory copy of the last query result for each active live query. On
each polling cycle, it compares the new result with the previous saved
result, computing the minimum set of added, removed, and changed
events that describe the difference. If multiple callers register
callbacks for the same live query, the driver only watches one copy of
the query, calling each registered callback with the same result.
Each time the server updates a collection, the driver recalculates each
live query on that collection (Future versions of Meteor will expose a
scaling API for limiting which live queries recalculate on update.) The
driver also polls each live query on a 10 second timer to catch
out-of-band database updates that bypassed the Meteor server.
The merge box
The job of the merge box is to combine the results (added, changed and removed
calls) of all of a client's active publish functions into a single data
stream. There is one merge box for each connected client. It holds a
complete copy of the client's minimongo cache.
In your example with just a single subscription, the merge box is
essentially a pass-through. But a more complex app can have multiple
subscriptions which might overlap. If two subscriptions both set the
same attribute on the same document, the merge box decides which value
takes priority and only sends that to the client. We haven't exposed
the API for setting subscription priority yet. For now, priority is
determined by the order the client subscribes to data sets. The first
subscription a client makes has the highest priority, the second
subscription is next highest, and so on.
Because the merge box holds the client's state, it can send the minimum
amount of data to keep each client up to date, no matter what a publish
function feeds it.
What happens on an update
So now we've set the stage for your scenario.
We have 1,000 connected clients. Each is subscribed to the same live
Mongo query (Somestuff.find({})). Since the query is the same for each client, the driver is
only running one live query. There are 1,000 active merge boxes. And
each client's publish function registered an added, changed, and
removed on that live query that feeds into one of the merge boxes.
Nothing else is connected to the merge boxes.
First the Mongo driver. When one of the clients inserts a new document
into Somestuff, it triggers a recomputation. The Mongo driver reruns
the query for all documents in Somestuff, compares the result to the
previous result in memory, finds that there is one new document, and
calls each of the 1,000 registered insert callbacks.
Next, the publish functions. There's very little happening here: each
of the 1,000 insert callbacks pushes data into the merge box by
calling added.
Finally, each merge box checks these new attributes against its
in-memory copy of its client's cache. In each case, it finds that the
values aren't yet on the client and don't shadow an existing value. So
the merge box emits a DDP DATA message on the SockJS connection to its
client and updates its server-side in-memory copy.
Total CPU cost is the cost to diff one Mongo query, plus the cost of
1,000 merge boxes checking their clients' state and constructing a new
DDP message payload. The only data that flows over the wire is a single
JSON object sent to each of the 1,000 clients, corresponding to the new
document in the database, plus one RPC message to the server from the
client that made the original insert.
Optimizations
Here's what we definitely have planned.
More efficient Mongo driver. We
optimized the driver
in 0.5.1 to only run a single observer per distinct query.
Not every DB change should trigger a recomputation of a query. We
can make some automated improvements, but the best approach is an API
that lets the developer specify which queries need to rerun. For
example, it's obvious to a developer that inserting a message into
one chatroom should not invalidate a live query for the messages in a
second room.
The Mongo driver, publish function, and merge box don't need to run
in the same process, or even on the same machine. Some applications
run complex live queries and need more CPU to watch the database.
Others have only a few distinct queries (imagine a blog engine), but
possibly many connected clients -- these need more CPU for merge
boxes. Separating these components will let us scale each piece
independently.
Many databases support triggers that fire when a row is updated and
provide the old and new rows. With that feature, a database driver
could register a trigger instead of polling for changes.
From my experience, using many clients with while sharing a huge collection in Meteor is essentially unworkable, as of version 0.7.0.1. I'll try to explain why.
As described in the above post and also in https://github.com/meteor/meteor/issues/1821, the meteor server has to keep a copy of the published data for each client in the merge box. This is what allows the Meteor magic to happen, but also results in any large shared databases being repeatedly kept in the memory of the node process. Even when using a possible optimization for static collections such as in (Is there a way to tell meteor a collection is static (will never change)?), we experienced a huge problem with the CPU and Memory usage of the Node process.
In our case, we were publishing a collection of 15k documents to each client that was completely static. The problem is that copying these documents to a client's merge box (in memory) upon connection basically brought the Node process to 100% CPU for almost a second, and resulted in a large additional usage of memory. This is inherently unscalable, because any connecting client will bring the server to its knees (and simultaneous connections will block each other) and memory usage will go up linearly in the number of clients. In our case, each client caused an additional ~60MB of memory usage, even though the raw data transferred was only about 5MB.
In our case, because the collection was static, we solved this problem by sending all the documents as a .json file, which was gzipped by nginx, and loading them into an anonymous collection, resulting in only a ~1MB transfer of data with no additional CPU or memory in the node process and a much faster load time. All operations over this collection were done by using _ids from much smaller publications on the server, allowing for retaining most of the benefits of Meteor. This allowed the app to scale to many more clients. In addition, because our app is mostly read-only, we further improved the scalability by running multiple Meteor instances behind nginx with load balancing (though with a single Mongo), as each Node instance is single-threaded.
However, the issue of sharing large, writeable collections among multiple clients is an engineering problem that needs to be solved by Meteor. There is probably a better way than keeping a copy of everything for each client, but that requires some serious thought as a distributed systems problem. The current issues of massive CPU and memory usage just won't scale.
The experiment that you can use to answer this question:
Install a test meteor: meteor create --example todos
Run it under Webkit inspector (WKI).
Examine the contents of the XHR messages moving across the wire.
Observe that the entire collection is not moved across the wire.
For tips on how to use WKI check out this article. It's a little out of date, but mostly still valid, especially for this question.
This is still a year old now and therefore I think pre-"Meteor 1.0" knowledge, so things may have changed again? I'm still looking into this.
http://meteorhacks.com/does-meteor-scale.html
leads to a "How to scale Meteor?" article
http://meteorhacks.com/how-to-scale-meteor.html
On the system I am developing, we have a PostgreSQL database that contains set up data which when updated must be transfered to handsets when and while those handsets are "docked". While the handsets are docked, our "service software" can talk to them, but not while they are undocked (they are not wireless).
At the moment, the service software that the handsets talk to loads the set up data from the database on startup and caches it. Thereafter it queries the latest timestamp of the setup data every 5 seconds and reloads parts of the set up if the timestamp queried is higher than the latest cached timestamp.
However, I find this method haphazard. It may be possible to miss an update for instance if an update transaction takes longer than a second, or at least if the period between submitting the transaction and completion of the transaction takes it over a 1-second boundary (the now() function is resolved at the beginning of the transaction by PostgreSQL). The only way I can think of round that is to do a table level lock before querying the latest timestamp. I'm not a fan of table locks but it is the only way I can think of to get round the problem.
Another problem with this approach is, I have to query for new data based on the update timetamp being >= the last latest timestamp, as opposed to just > the last latest timestamp. Why? Because a record may of been committed within the same second, just after my query - so I'd miss the record.
Another approach I've thought of is, storing "last synchronised date-time" data in the database for each logical item of data that must be stored on the handsets. I would do this on a per handset basis. I can then periodically query for all data not currently synchronised on a particular handset, and then mark it as synchronised once the handset is up to date (I have worked out a mechanism for this to be failsafe which takes into account the data being updated during synchronisation).
My only problem with this approach is that it means the database is storing non-business centric data - as in it is storing data to make the system work. I'm not convinced data about what handsets are in sync is "business" data. To me it is more the responsibility of the handset service software / handset software to know how to keep itself up to date, though it is tempting as it describes perfectly what data is and is not on each handset and allows queries to only return the data needed.
The first approach however at least only uses data appropriate to the business - i.e. the timestamp of when the data was last changed.
The ideal way would be to use some kind of notification system, but unfortunately postgres only has a basic NOTIFY / EVENT system and that doesn't seem to work over ODBC (which I foolishly decided to use and do not have time to change just now). If I was using Oracle I could use Streams..
Thoughts?
Note: The database is purely relational - I am not interested in any "object oriented" approaches to this problem or any framework based solutions.
Thanks.
First of all, if you are using PostgreSQL version at least 7.2, the now function returns values with microsecond precision rather that second precision; although the value is ultimately derived from the operating system and will be accurate only up to a few hundredths of a second.
The method that you describe appears to be safe against permanently missing any updates. Just make sure that you reload data every time unless the timestamps prove that you have reloaded long enough after the last update. Alternately, you could update a timestamp upon data update in a separate transaction; in that case, ever seeing such a timestamp is a proof that all updates had finished before the timestamp value.
Another approach I've thought of is, storing "last synchronised
date-time" data in the database for each logical item of data that
must be stored on the handsets. I would do this on a per handset
basis. I can then periodically query for all data not currently
synchronised on a particular handset, and then mark it as synchronised
once the handset is up to date
I can not recommend this for the following reasons:
As synchronization is a state of a handset and not a state of the data, this information should better be stored on the handset.
The database should be scalable to many handsets and it ideally should not have to keep track of them.
If a handset can change its identity, or be wiped or restored (reimaged) to a previous state without changing its identity, the database will get out of sync with the real state of the headset and no mechanism will ensure proper synchronization.
While NOTIFY is certainly preferable to constant polling, it is a problem orthogonal to where you store the synchronization progress. You still need to have a polling capability to be able to deal with a freshly connected devide, and notifications would be just a bandwidth/latency optimization.
I am designing an application and I have two ideas in mind (below). I have a process that collects data appx. 30 KB and this data will be collected every 5 minutes and needs to be updated on client (web side-- 100 users at any given time). Information collected does not need to be stored for future usage.
Options:
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
Collect data and put it into Topic or Queue. Now multiple clients (consumers) can go to Queue and obtain data.
I am looking for option 2 as better solution because it is faster (no DB calls) and no redundancy of storage.
Can anyone suggest which would be ideal solution and why ?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right.
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web-clients are getting updates real-time without some kind of push/long-pull or whatever.
Seems to me that you need to use a queue and publisher/subscriber pattern.
This is an article about RabitMQ and Publish/Subscribe pattern.
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
You can program your application to be event oriented. For ie, raise domain events and publish your message for your subscribers.
When you use a queue, the subscriber will dequeue the message addressed to him and, ofc, obeying the order (FIFO). In addition, there will be a guarantee of delivery, different from a database where the record can be delete, and yet not every 'subscriber' have gotten the message.
The pitfalls of using the database to accomplish this is:
Creation of indexes makes querying faster, but inserts slower;
Will have to control the delivery guarantee for every subscriber;
You'll need TTL (Time to Live) strategy for the records purge (considering delivery guarantee);