Use Google App Engine's NDB as a message queue? - google-app-engine

Has anyone tried to use NDB as a message queue? We have several consumers and producers, which may want to do broadcast, multicast, and publish-subscribe. I've read several documents on why using a RDBMS as a message queue is bad. But in my case, my app can tolerate latency of several seconds. So eventual consistency should not be as much of an issue, because almost all replication in NDB should complete within a few seconds. In terms of message ordering, I could use timestamps.
Another alternative is to use NDB's strong consistency feature with a buffer (e.g. memcache).

Why not use the Task Queue? It's optimized for both push (broadcast, multicast) and pull (subscribe).

Related

Message queue with conditional processing and consensus between workers

I'm implementing workflow engine, where a job request is received first and executed later by a pool of workers. Sounds like a typical message queue use case.
However, there are some restrictions for parallel processing. For example, it's not allowed to run concurrent jobs for the same customer. In other words, there must be some sort of consensus between workers.
I'm currently using database table with business identifiers, status flags, row locking and conditional queries to store and poll available jobs according to spec. It works, but using database for asynchronous processing feels counterintuitive. Does messaging systems support my requirements of conditional processing?
As an author of a few workflow engines, I believe that the persistence component for maintaining state is essential. I cannot imagine a workflow engine that only uses queues.
Unless you are just doing it just for fun implementing your own is a weird idea. A fully featured workflow engine is an extremely complex piece of software comparable to a database. I would recommend looking into existing ones instead of building your own if it is for production use. You can start from my open source project temporal.io :). It is used by thousands of companies for mission-critical applications and can scale to almost any rate given enough DB capacity.

I want to log all mqtt messages of the broker. How should I design schema of database. Avoiding dulplicate entries and fast searching

I am implementing a callback in java to store messages in a database. I have a client subscribing to '#'. But the problem is when this # client disconnects and reconnect it adds duplicate entries in the database of retained messages. If I search for previous entries bigger tables will be expensive in computing power. So should I allot a separate table for each sensor or per broker. I would really appreciate if you suggest me better designs.
Subscribing to wildcard with a single client is definitely an anti-pattern. The reasons for that are:
Wildcard subscribers get all messages of the MQTT broker. Most client libraries can't handle that load, especially not when transforming / persisting messages.
If you wildcard subscriber dies, you will lose messages (unless the broker queues endlessly for you, which also doesn't work)
You essentially have a single point of failure in your system. Use MQTT brokers which are hardened for production use. These are much more robust single point of failures than your hand-written clients. (You can overcome the SIP through clustering and load balancing, though).
So to solve the problem, I suggest the following:
Use a broker which can handle shared subscriptions (like HiveMQ or MessageSight), so you can balance all messages between many clients
Use a custom plugin for doing the persistence at the broker instead of the client.
You can also read more about that topic here: http://www.hivemq.com/blog/mqtt-sql-database
Also consider using QoS = 3 for all message to make sure one and only one message is delivered. Also you may consider time-stamp each message to avoid inserting duplicate messages if QoS requirement is not met.

Golang on App Engine Datastore - Using PutMulti to Improve Performance

I have a GAE Golang app that should be able to handle hundreds of concurrent requests, and for each requests, I do some work on the input and then store it in the datastore.
Using the task queue (appengine/delay lib) I am getting pretty good performance, but it still seems very inefficient to perform single-row inserts for each request (even though the inserts are deferred using task queue).
If this was not app engine, I would probably append the output a file, and every once in a while I would batch load the file into the DB using a cron job / some other kind of scheduled service.
So my questions are:
Is there an equivalent scheme I can implement on app engine? I was
thinking - perhaps I should write some of the rows to memecache, and
then every couple of seconds I will bulk load all of the rows from
there and purge the cache.
Is this really needed? Can the datastore
handle thousands of concurrent writes - a write per http request my
app is getting?
Depends really on your setup. Are you using ancestor queries? If so then your are limited to 1 write per second PER ancestor (and all children, grand children). The datastore has a natural queue so if you try and write too quickly it will queue it. It only becomes an issue if you are writing too many way too quickly. You can read some best practices here.
If you think you will be going over that limit use a pull queues with async multi puts. You would put each entity in the queue. With a backed module (10 minute timeouts) you can pull in the entries in batches (10-50-100...) and do a put_async on them in batches. It will handle putting them in at the proper speed. While its working you can queue up the next batch. Just be wary of the timeout.

Resuming Camel Processing after power failure

I'm currently developing a Camel Integration app in which resumption from a previous state of processing is important. When there's a power outage, for instance, it's important that all previously processed messages are not re-processed. The processing should resume from where it left off before the outage.
I've gone through a number of possible solutions including Terracotta and Apache Shiro. I'm not sure how to use either as documentation on the integration with Apache Camel is scarce. I've not settled on the two, however.
I'm looking for suggestions on the potential alternatives I can use or a pointer to some tutorial to get me started.
The difficulty in surviving outages lies primarily in state, and what to do with in-flight messages.
Usually, when you're talking state within routes the solution is to flush it to disk, or other nodes in the cluster. Taking the aggregator pattern as an example, aggregated state is persisted in an aggregation repository. The default implementation is in memory, so if the power goes out, all the state is lost. However, there are other implementations, including one for JDBC, and another using Hazelcast (a lightweight in-memory data grid). I haven't used Hazelcast myself, but JDBC does a synchronous write to disk. The aggregator pattern allows you to resume from where you left off. A similar solution exists for idempotent consumption.
The second question, around in-flight messages is a little more complicated, and largely depends on where you are consuming from. If you're in the middle of handling a web service request, and the power goes out, does it matter if you have lost the message? The user can simply retry. Any effects on external systems can be wrapped in a transaction, or an idempotent consumer with JDBC idempotent repository.
If you are building out integrations based on messaging, you should consume within a transaction, so that if your server goes down, the messages go back into the broker and can be replayed to another consumer.
Be careful when using seda: or threads blocks, these use an in-memory queue to pass exchanges between threads, any messages flowing down these sorts of routes will be lost if someone trips over the power cable. If you can't afford message loss, and need this sort of processing model, consider using a JMS queue as the endpoints between the two routes (with transactions to ensure you pick up where you left off).

What are the low hanging fruit for optimizing google app engine with respect to quota usage?

Everyone learns to use Memcache pretty quick. Another one I've learned recently is setting indexed=False for Model properties that I am not going to query against. What are some others? What are the big ones?
Don't use offset in queries. Use cursors instead.
Explanations: offset loads all data up to offset+limit and charges you for it, but only returns limit entities.
Minimize instance use, by tweaking idle instances and pending latency appropriately for your app.
A couple helped us (not all may be low-hanging at first). First, we denormalized our datastore to reduce joins. I'm using SQL terms because I came from a SQL background. By spreading commonly queried elements around, we reduced the number of reads we had to make considerably, even after factoring in Memcache. Potentially increases writes but for most apps, the number of reads far outweighs the number of writes.
Next, we started using task queues, backends, and the channel API more often. I don't remember specific examples but I do remember we were able to reduce our front-end usage down below the free quota mark by moving some processing around to queues and backends and by sending data down via channel rather than having the client poll.
Also, we use objectify for our data access which we configure to automatically use memcache wherever appropriate.

Resources