Message queue with conditional processing and consensus between workers - database

I'm implementing workflow engine, where a job request is received first and executed later by a pool of workers. Sounds like a typical message queue use case.
However, there are some restrictions for parallel processing. For example, it's not allowed to run concurrent jobs for the same customer. In other words, there must be some sort of consensus between workers.
I'm currently using database table with business identifiers, status flags, row locking and conditional queries to store and poll available jobs according to spec. It works, but using database for asynchronous processing feels counterintuitive. Does messaging systems support my requirements of conditional processing?

As an author of a few workflow engines, I believe that the persistence component for maintaining state is essential. I cannot imagine a workflow engine that only uses queues.
Unless you are just doing it just for fun implementing your own is a weird idea. A fully featured workflow engine is an extremely complex piece of software comparable to a database. I would recommend looking into existing ones instead of building your own if it is for production use. You can start from my open source project temporal.io :). It is used by thousands of companies for mission-critical applications and can scale to almost any rate given enough DB capacity.

Related

How can I access states computed from an external Flink job ? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a usecase consisting in enriching transactions coming from Kafka with a lot of historical features (e.g number of past transactions between same source and same target), then score this transaction with a machine learning model.
For now, features are all kept in Flink states and the same job is scoring the enriched transaction. But I'd like to separate the features computation job from the scoring job and I'm not sure how to do this.
The queryable state doesn't seem to fit for this, as the job id is needed, but tell me if I'm wrong !
I've thought about querying directly RocksDB but maybe there's a more simple way ?
Is the separation in two jobs for this task a bad idea with Flink ? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check if it has any positive impact on latency)
Some extra information : I'm using Flink 1.3 (but willing to upgrade if it's needed) and the code is written in Scala
Thanks in advance for your help !
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Apache Apex vs Apache Flink

As both are streaming frameworks which processes event at a time, What are the core architectural differences between these two technologies/streaming framework?
Also, what are some particular use cases where one is more appropriate than the other?
As you mentioned both are streaming platform which to in memory computation in real time. But there are some architectural differences when you take a closer look.
Apex is yarn native architecture, it fully utilises yarn for scheduling, security & multi-tenancy where as Flink integrates with yarn. Apex can do resource allocation at operator (container) level with yarn.
Partitioning: Apex supports several sophisticated stream partitioning schemes and also allows controlling operator locality & stream locality. Flink supports simple hash partitions and custom partitions.
Apex allows dynamic changes to topology without having to take down the application. Apex allows the application to be updated at runtime so you can add and remove operators, update properties of operators, or automatically scale the application at runtime. Apache Flink does not support any of these capabilities.
Buffer Server: There is a message bus called buffer server between operators. Subscribers can connect to buffer server and fetch data from particular offsets. This is window aware, and holds data as long as no subscriber needs it.
Fault tolerance: Apex has incremental recovery model, on failure it can only part of topology can be restarted no need to go back to source, where in flink it goes back to source.
Apex has high level api as well as low level api. Flink only has high level api.
Apex has a library called Apache Malhar which has vast variety of well tested connectors and processing operators which can be reused easily.
Lastly Apex is more focused on productizing big data applications so has many features which will help in easy development and maintenance of applications.
Note: I am a committer to Apache Apex, so I might sound biased to Apex :)

Distributed store with transactions

I currently develop an application hosted at google app engine. However, gae has many disadvantages: it's expensive and is very hard to debug since we can't attach to real instances.
I am considering changing the gae to an open source alternative. Unfortunately, none of the existing NOSQL solutions which satisfy me support transactions similar to gae's transactions (gae support transactions inside of entity groups).
What do you think about solving this problem? I am currently considering a store like Apache Cassandra + some locking service (hazelcast) for transactions. Did anyone has any experience in this area? What can you recommend
There are plans to support entity groups in cassandra in the future, see CASSANDRA-1684.
If your data can't be easily modelled without transactions, is it worth using a non transcational database? Do you need the scalability?
The standard way to do transaction like things in cassandra is described in this presentation, starting at slide 24. Basically you write something similar to a WAL log entry to 1 row, then perform the actual writes on multiple rows, then delete the WAL log row. On failure, simply read and perform actions in the WAL log. Since all cassandra writes have a user supplied time stamp, all writes can be made idempotent, just store the time stamp of your write with the WAL log entry.
This strategy gives you the Atomic and Durable in ACID, but you do not get Consistency and Isolation. If you are working at scale that requires something like cassandra, you probably need to give up full ACID transactions anyway.
You may want to try AppScale or TyphoonAE for hosting applications built for App Engine on your own hardware.
If you are developing under Python, you have very interesting debugging options with the Werkzeug debugger.

Good Strategy for Message Queuing?

I'm currently designing an application which I will ultimately want to move to Windows Azure. In the short term, however, it will be running on a server which I will host myself.
The application involves a number of separate web applications - some of these are essentially WCF services which receive data, and some are sites for users to manage data. In addition, there will need to be a worker service running in the background which will process data in various ways.
I'm very keen to use a decoupled architecture for this. Ideally I'm wanting the components (i.e. web apps and worker service) to know as little as possible about each other. It seems like using a message queue will be the best solution here - the web apps can enqueue messages with work units into the queue and the worker service can pick them out and process them as needed.
However, I want to work out a good set of technologies for doing this, bearing in mind that I'll ultimately be moving to Azure and want to minimise the amount of re-work I'll need to do when I migrate to the cloud. Azure has a Queue component built in which looks ideal for my needs. What I'd like to do is create something myself which will mimic this as closely as possible.
It looks like there are several options (I'm using .NET on Windows, with a SQL Server 2005 back end) - the ones I've found so far are:
MSMQ
SQL Server service broker
Rolling my own using a database table and some stored procs
I was wondering if anyone has any suggestions for this - or if anyone has done anything similar and has advice on things to do/to avoid. I realise that every situation is different, but in this case I think my queuing requirements are pretty generic so I'd love to hear anyone else's thoughts about the best way to do this.
Thanks in advance,
John
If you have Azure in mind, perhaps you should start straight on Azure as the APIs and semnatics are significantly different between Azure queues and any of MSMQ or SSB.
A quick 3048 meters comparison of MSMQ vs. SSB (I'll leave a custom table-as-queue out of comparison as it really depends how you implement it...)
Deployment: MSMQ is a Windows component, SSB is a SQL compoenent. SSB requires a SQL instance to store any message, so disconencted clients need access to an instance (can be Express). MSMQ requires deployment of MSMQ on the client (part of OS, but optional install).
Programmability: MSMQ offers a fully fledged, supported, WCF channel. SSB offers only an experimental WCF channel at http://ssbwcf.codeplex.com
Performance: SSB will be significantly faster than MSMQ in transacted mode. MSMQ will be faster if let operate in untransacted mode (best effort, unordered, delivery)
Queriability: SSB queues can be SELECTE-ed uppon (view any message, full SQL JOIN/WHERE/ORDER/GROUP power), MSMQ queues can be peeked (only next message)
Recoverability: SSB queues are integrated in the database so they are backed up and restored with the database, keeping a consitent state with the application state. MSMQ queues are backed up in the NT file backup subsytem, so to keep the backup in sync (coherent) the queue and database have to be suspended.
Transactions (since every enqueue/dequeue is always accompanied by a database update): SSB is fully integrated in SQL so dequeueing and enqueueing are local transaction operations. MSMQ is a separate TM (Transaction Manager) so queue/dequeue has to be a Distributed Transaction operation to enroll both SQL and MSMQ in the transaction.
Management and Monitoring: both equaly bad. No tools whatsoever.
Correlated Messages processing: SSB can block processing of correlated message by concurent threads via built-in Conversation Group Locking.
Event Driven: SSB has Activation to launch stored procedures, MSMQ uses Windows Activation service. Similar. SSB though has self load balancing capalities due to the way WAITFOR(RECEIVE) and MAX_QUEUE_READERS interact.
Availability: SSB piggybacks on the SQL Server High Availability story, it can work either in a clustered or in database miroring environment. MSMQ rides the Windows clustering story only. Database Mirroring is much cheaper than clustering as a HA solution.
In addition I'd add that SSB and MSMQ differ significantly at the level ofthe primitive they offer: SSB primitive is a conversation, while MSMQ primitive is a message. Think TCP vs. UDP semantics.
Pick a queue back end that works for you, or that is better suited to your environment. #Remus has given a great comparison between MSMQ and SSB. MSMQ is going to be the easier one to implement, but has some notable limitations, while SSB is going to feel very heavy as its at the other end of the spectrum.
Have It Your Way
To minimize the rework from you applications, abstract the queues access behind an interface, and then provide an implementation for the queue transport you ultimately decide to go with. When its time to move to Azure, or another queue transport, you just provide a new implementation of your interface.
You get to control the semantics of how you want to interact with the queue to give a consistent usable API from your applications.
A rough idea might be:
interface IQueuedTransport
{
void SendMessage(XmlDocument);
XmlDocument ReceiveMessage();
}
public class MSMQTransport : IQueuedTransport {}
public class AzureQueueTransport : IQueuedTransport {}
You may not be building the be-all queuing transport, just what meets your needs. If you work with Xml, pass xml. If you work in byte arrays, pass byte arrays. :)
Good luck!
Z
Use Win32 Mailslots. They will be reliable on a single server, are easy to implement, and do not require any extra software.

Queues against Tables in messaging systems [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I've been experiencing the good and the bad sides of messaging systems in real production environments, and I must admit that a well organized table or schema of tables simply beats every time any other form of messaging queue, because:
Data are permanently stored on a table. I've seen so many java (jms) applications that lose or vanish messages on their way for uncaught exceptions or other bugs.
Queues tend to fill up. Db storage is virtually infinite, instead.
Tables are easily accessible, while you have to use esotic instruments to read from a queue.
What's your opinion on each approach?
The phrase beats every time totally depends on what your requirements were to begin with. Certainly its not going to beat every time for everyone.
If you are building a single system which is already using a database, you don't have very high performance throughput requirements and you don't have to communicate with any other teams or systems then you're probably right.
For simple, low thoughput, mostly single threaded stuff, database are a totally fine alternative to message queues.
Where a message queue shines is when
you want a high performance, highly concurrent and scalable load balancer so you can process tens of thousands of messages per second concurrently across many servers/processes (using a database table you'd be lucky to process a few hundred a second and processing with multiple threads is pretty hard as one process will tend to lock the message queue table)
you need to communicate between different systems using different databases (so don't have to hand out write access to your systems database to other folks in different teams etc)
For simple systems with a single database, team and fairly modest performance requirements - sure use a database. Use the right tool for the job etc.
However where message queues shine is in large organisations where there are lots of systems that need to communicate with each other (and so you don't want a business database to be a central point of failure or place of version hell) or when you have high performance requirements.
In terms of performance a message queue will always beat a database table - as message queues are specifically designed for the job and don't rely on pessimistic table locks (which are required for a database implementation of a queue - to do the load balancing) and good message queues will perform eager loading of messages to queues to avoid the network overhead of a database.
Similarly - you'd never use a database to do load balancing of HTTP requests across your web servers - as it'd be too slow - if you have high performance requirements for your load balancer you'd not use a database either.
I've used tables first, then refactor to a full-fledged msg queue when (and if) there's reason - which is trivial if your design is reasonable.
The biggest benefits are a.) it's easier, (b. it's a better audit trail because you have the other tables to join to, c.) if you know the database tools really well, they are easier to use than the Message Queue tools, d.) it's generally a bit easier to set up a test/dev environment in a context that already exists for your app (if same familiarity applies).
Oh, and e.) for perhaps you and others, it's not another product to learn, install, configure, administer, and support.
IMPE, it's just as reliable, disconnectable, and you can convert if it needs more scalable.
Data are permanently stored on a table. I've seen so many java (jms) applications that loose or vanish messages on their way for uncaught exceptions or other bugs.
Which JMS implementation? Sun sells reliable queue which can't lose messages. Perhaps you just purchased a cheesy JMS-compliant product. IBM's MQ is extremely reliable, and there are JMS libraries to access it.
Queues tend to fill up. Db storage is virtually infinite, instead.
Ummm... If your queue fills up, it sounds like something is broken. If your apps crash, that's not a good thing, and queues have little to do with that. If you've purchased a really poor JMS implementation, I can see where you might be unhappy with it. It's a competitive market-place. Find a better queue manager. Sun's JCAPS has a really good queue manager, formerly the SeeBeyond message queue.
Tables are easily accessible, while you have to use esotic instruments to read from a queue.
That doesn't fit with my experience. Tables are accessed through this peculiar "other language" (SQL), and requires that I be aware of structure mappings from tables to objects and data type mappings from VARCHAR2 to String. Further, I have to use some kind of access layer (JDBC or an ORM which uses JDBC). That seems very, very complex. A queue is accessed through MessageConsumers and MessageProducers using simple sends and receives.
It sounds as though the problems you've experienced are not inherent to messaging, but rather are artifacts of poorly-implemented messaging systems. Is building messaging systems harder than building database systems? Yes, if all you ever do is build database systems.
Losing messages to uncaught exceptions? That's hardly the fault of the message queue. The applications you're using are poorly engineered. They're removing messages from the queue before processing completes. They're not using transactions, or journalling.
Message queues fill up while DB storage is "virtually infinite"? You talk as though managing disk space were something that databases didn't require. Message queue servers require administration, just like database servers do.
Esoteric instruments to read from a queue? Maybe if you find asynchronous methods esoteric. Maybe if you find serialization and deserialization esoteric. (At least, those are the things I found esoteric when I was learning messaging. Like many seemingly-esoteric technologies, they're actually quite mundane once you understand them, and understanding them is an important part of the seasoned developer's education.)
Aspects of messaging that make it superior to databases:
Asynchronous processing. Message queues notify waiting processes when new messages arrive. To accomplish this functionality in a database, the waiting processes have to poll the database.
Separation of concerns. The communications channel is decoupled from the implementation details of the message content. Only the sender and the receiver need to know anything about the format of the data stream within a given message.
Fault-tolerance.. Messaging can function when connections between servers are intermittent. Message queues can store messages locally and only forward them to remote servers when the connection is live.
Systems integration. In the Windows world, at least, messaging is built into the operating system. It uses the OS's security model, it's managed through the OS's tools, etc.
If you don't need these things, you probably don't need messaging.
Here's a simple example of an application for messaging: I'm building a system right now where users, distributed across multiple networks, are entering fairly intricate sets of transactions that are used to produce printed output. Output generation is computationally expensive and not part of their workflow; i.e. the users don't care when the output gets generated, just that it does.
So we serialize the transactions into a message and drop it in a queue. A process running on a server grabs messages from the queue, produces the output, and stores the output in an imaging system.
If we used a database as our message store, we'd have to come up with a schema to store a transaction format that right now only the sender and receiver care about, we'd need to make sure every workstation on the network had permanent persistent connections to the database server, we'd have no capacity to distribute this transaction load across multiple servers, and our output server would have to query the database thousands of times a day waiting to see if there were new jobs to process.
Queues provide reliable messaging. The store-and-forward, disconnected nature of queueing make it much more scalable than databases, not to mention more robust.
And queues shouldn't really be used for permanent storage of information - it is best to think of them as temporary inboxes, unlike databases.

Resources