I am using Camel and I have a business problem. We consume order messages from an ActiveMQ queue. The first thing we do is check in our DB to see if the customer exists. If the customer doesn't exist, a support team needs to populate the customer in a different system. Sometimes this can take 10 hours or even until the following day.
My question is how to handle this. It seems to me that, at a high level, I can either dequeue these messages, store them in our DB, and re-run them at intervals (a custom-coded solution), or I could note the error in our DB and return them to the ActiveMQ queue with a long redelivery policy and expiration, say redeliver every 2 hours for 48 hours.
This would save a lot of code, but is approach 2 sound, or could it lead to resource issues or to problems with not knowing where messages are?
This is a pretty common scenario. If you want insight into how the jobs are progressing, then it's best to use a database for this.
Your queue consumption should be really simple: consume the message, check if the customer exists; if so process, otherwise write a record in a TODO table.
Set up a separate route to run on a timer - every X minutes. It should pull out the TODO records, and for each record check if the customer exists; if so process, otherwise update the record with the current timestamp (the last time the record was retried).
This allows you to have a clear view of the state of the system, that you can then integrate into a console to see what the state of the outstanding jobs is.
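To make this concrete, here is a rough sketch of the two routes in Camel's Java DSL. The bean names (customerService, todoDao), the orders queue name, and the 15-minute retry period are all assumptions, not part of the original question:

import org.apache.camel.builder.RouteBuilder;

public class OrderRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Main consumer: process the order if the customer exists,
        // otherwise park it in the TODO table.
        from("activemq:queue:orders")
            .choice()
                .when(method("customerService", "customerExists"))
                    .to("direct:processOrder")
                .otherwise()
                    .to("bean:todoDao?method=saveAsTodo");

        // Retry route: every 15 minutes pull the outstanding TODO records
        // and re-check each one.
        from("timer:retryTodos?period=900000")
            .to("bean:todoDao?method=findOutstanding")   // returns a List of records
            .split(body())
                .choice()
                    .when(method("customerService", "customerExists"))
                        .to("direct:processOrder")
                    .otherwise()
                        .to("bean:todoDao?method=touchRetryTimestamp");
    }
}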
There are a couple of downsides with your Option 2:
you're relying on the ActiveMQ scheduler, which uses a KahaDB variant sitting alongside your regular store, and may not be compatible with your H/A setup (you need a shared file system)
you can't see the messages themselves without scanning through the queue, which is an antipattern (using a queue as a database); you may as well use a database, especially if you anticipate ever needing to selectively remove a particular message.
Related
We currently have a payment tracking system which uses MS SQL Server Enterprise. When a client requests a service, they have to make the payment within 24 hours; otherwise we send them an SMS reminder. Our current implementation simply records the date and time of the purchase and constantly polls the records to find "expired" purchases.
This is generating so much load on the database that we have to implement some form of replication in order to offload these operations to another server.
I was thinking: is there a way to combine CLR triggers with some kind of a scheduler that would be triggered only once, that is, 24 hours after the purchase is created?
Please keep in mind that we have tens of thousands of transactions per hour.
I am not sure how you are thinking that SQLCLR will solve this problem. I don't think this needs to be handled in the DB at all.
Since the request time doesn't change, why not load all requests into a memory-based store that you can hit constantly? You would load the 24-hour-from-request time so that you only need to compare those times to now. If the customer pays prior to the 24-hour deadline, you remove the entry from the cache. Otherwise, the polling process will eventually find it, process it, and remove it from the cache.
Or, similarly, you can use a scheduler and, upon each request, schedule a future event (the SMS message) based on the 24-hour-from-request time. This is similar to scheduling an action using "AT". Again, if someone pays prior to that time, just remove the scheduled task/event/reminder.
You would store just the 24-hour-after-time and the RequestID. If the time is reached, the service would refer back to the DB using that RequestID to get the current info.
You just need to make sure to de-list items from the cache / scheduler if payment is made prior to the 24-hour-after time.
And if the system crashes / restarts, you just load all entries that are a) unpaid, and b) have not yet reached their 24-hour-after time.
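As a plain-Java illustration of that scheduler idea, here is a minimal sketch using ScheduledExecutorService. The ReminderAction callback and the in-memory map keyed by RequestID are assumptions; on a restart you would reload unpaid, not-yet-due entries as described above:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ReminderScheduler {

    // Hypothetical callback: looks the request up by RequestID and sends the SMS.
    public interface ReminderAction {
        void remind(long requestId);
    }

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
    private final Map<Long, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
    private final ReminderAction action;

    public ReminderScheduler(ReminderAction action) {
        this.action = action;
    }

    // Called when a purchase/request is created: schedule the reminder 24 hours out.
    public void onRequestCreated(long requestId) {
        ScheduledFuture<?> task = scheduler.schedule(() -> {
            pending.remove(requestId);
            action.remind(requestId);   // re-check payment status in the DB here
        }, 24, TimeUnit.HOURS);
        pending.put(requestId, task);
    }

    // Called when payment arrives before the deadline: de-list the reminder.
    public void onPaymentReceived(long requestId) {
        ScheduledFuture<?> task = pending.remove(requestId);
        if (task != null) {
            task.cancel(false);
        }
    }
}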
I'm starting to build a bulk upload tool and I'm trying to work out how to accomplish one of the requirements.
The idea is that a user will upload a CSV file and the tool will parse it and send each row of the CSV to the task queue as a task to be run. Then once all the tasks (relating to that specific CSV file) are completed, a summary report will be sent to the user.
I'm using Google App Engine and in the past I've used the standard Task Queue to handle tasks. However, with the standard Task Queue there is no way of knowing when the queue has finished; no event is fired to trigger the report generation, so I'm not sure how to achieve this.
I've looked into it more and I understand that Google also offers Google Pub/Sub. This is more sophisticated and seems more suited, but I still can't find out how to trigger an event when a Pub/Sub queue is finished. Any ideas?
Seems that you could use a counter for this. Create an entity with an Integer property that is set to the number of lines in the CSV file. Each task decrements the counter in a transaction when it finishes processing its row. The task that brings the counter to 0 could trigger the event. This might cause too much contention, though.
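A minimal sketch of that counter, using the GAE low-level Datastore API. The "CsvJob" entity and its "remaining" property are assumptions; the entity would be created with "remaining" set to the row count before the tasks are enqueued:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Transaction;

public class RowCounter {

    private static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Call at the end of each row task; returns true if this was the last row,
    // in which case the caller can trigger the summary report.
    public static boolean decrementAndCheck(Key jobKey) throws EntityNotFoundException {
        Transaction txn = ds.beginTransaction();
        try {
            Entity job = ds.get(txn, jobKey);
            long remaining = (Long) job.getProperty("remaining") - 1;
            job.setProperty("remaining", remaining);
            ds.put(txn, job);
            txn.commit();
            return remaining == 0;
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}

If the commit fails because another row task updated the counter at the same time, the task queue's retry simply re-runs the decrement; that retry churn is the contention risk mentioned above.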
Another possibility could be to have each task create an entity of a specific kind when it finishes processing a row. You can then count the number of these entities to determine when all the rows have been processed.
It might be easier to use the GAE Pipeline API, which takes care of this as a basic part of its functionality.
There's a nice article explaining it a bit here.
And a related SO question which happens to mention the same reason for moving to this API and has an excellent answer: Google AppEngine Pipelines API
I haven't used it myself yet, but it's just a matter of time :)
It's also possible to implement a scheme to track which related tasks are still active; see Figure out group of tasks completion time using TaskQueue and Datastore.
You can also check the (approximate) queue status; see Get number of tasks in a named queue?
I faced a similar problem earlier this week and managed to find a nice workaround for it. What I did was create an extra 'task_status' column in the table that each task inserts data into. Once a specific task is completed, it updates this column to 'done'; otherwise it is left as the default NULL. Then, when the user refreshes the page, goes to a specific URL, or you make an AJAX call to query the task status for a specific id in your table, you can see whether it is complete or not.
select * from table where task_status is not null and id = ?;
You can also create a 'tasks' table where you can store relevant columns there instead of modifying existing tables.
Hope this is of some use.
I have two systems, call them A and B.
When some significant object changes in A, A sends it to B through Apache Camel.
However, I have encountered one case where A actually has a change log of an object, while B must reflect only the actual state of the object.
Moreover, the change log in A can contain "future" records.
This means that an object's state change is scheduled for some moment in the future.
A user of system A can edit this change log: remove change records, add new change records with any timestamp (in the past or in the future), and even update existing changes.
Of course, A sends these change records to B, but B needs only the actual state of the object.
Note that I can query objects from A, but A is a performance-critical system, and therefore I am not going to query it, as that could cause additional load.
Also, the API for querying data from A is overcomplicated, and I would like to avoid it whenever possible.
I can see two problems here.
The first is determining whether a particular change log record causes a change in the actual state.
I am going to store change log in an intermediate database.
As a change log record arrives, I am going to add/remove/update it in the intermediate database, then calculate the actual state of the object and send this state to B.
The second is tracking the change schedule.
I couldn't come up with anything other than running a periodic job at a constant interval (say, every 15 minutes).
This job would scan all records that fall in the time interval from the last invocation until the current invocation.
What I like Apache Camel for is its component-based approach, where you only need to connect endpoints to get everything working, with only a small amount of coding.
Are there any pre-existing primitives for this problem, either in Apache Camel or in EIP?
I am actually working on a very similar use case, where system A sends a snapshot and updates that require translation before being sent to system B.
First, you need to trigger the mechanism that gives you the initial state (the "snapshot") from system A; the timer: component can start the one-time startup logic.
Now you will receive the snapshot data (you didn't specify how; perhaps it is an ftp: file or a jms: endpoint). Validate the data, split it into items, and store each item in a local in-memory cache:, keyed uniquely, as Sergey suggests in his comment. Use an expiry policy that makes sense (e.g. 48 hours).
From there, continually process the "update" data from the ftp: endpoint. For each update, you will need to match the update with the data in the cache: and determine what (and when) needs to be sent to system B.
Data that needs to be sent to system B later will need to be persisted in memory or in a database.
Finally, you need a scheduling mechanism to determine every 15 minutes whether new data should be sent; you can easily use timer: or quartz: for this.
In summary, you can build this integration from the following components: timer, cache, ftp, and quartz, plus some custom beans/processors to perform the custom logic.
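For the scheduling piece specifically, a quartz: triggered route could look roughly like this; the pendingStore bean and the direct:sendToSystemB endpoint are assumptions standing in for your own persistence and delivery logic:

import org.apache.camel.builder.RouteBuilder;

public class ScheduledSendRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Every 15 minutes, pull whatever has become due and push it towards system B.
        from("quartz://scheduledSend?cron=0+0/15+*+*+*+?")
            .to("bean:pendingStore?method=findDueItems")   // returns a List of due items
            .split(body())
            .to("direct:sendToSystemB");
    }
}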
The main challenges are handling data that is cached and then updated, and working out the control mechanisms for what should happen on the initial connect, on a disconnect, or if your Camel application is restarted.
Good luck ;)
I'm creating an app on GAE with Java, and looking for advice on how to handle scheduling user notifications (which will be email, text, push, whatever). There are a couple of ways notifications will be generated: when a producer creates content, and on a consumer's schedule. The latter is the tricky part, because a consumer can change its schedule at any time. Here are the options I have considered and my concerns so far:
Keep an entry in the datastore for each consumer, indexed by the time until the next notification. My concern is over the lag for an eventually-consistent index. The longest lag I've seen reported is about 4 hours, which would be unacceptable for this use-case. A user should not delay their schedule by a week, then 4 hours later receive a notification from the old schedule.
The same as above, but with each entry sharing a common parent so that I can use an ancestor query to eliminate its eventual-ness. My concern is that there could be enough consumers to cause a problem with contention. In my wildest dreams I could foresee something like 10,000 schedule changes per minute at peak usage.
Schedule a task for each consumer. When changing the schedule, it could delete the old task and create a new one at the new time. My concern has to do with the interaction of tasks and datastore transactions, since the schedule will be stored in the datastore. The documentation notes that enqueing a task plays nicely with transactions, but what about deleting one? I would not want a task to be deleted only to have the add fail as part of its transaction.
Edit: I experimented with deleting tasks (for option 3), and unfortunately a delete that is part of a failed transaction still succeeds. That is a disappointing asymmetry. I will probably end up going that route anyway, but adding some extra logic and datastore flags to ensure rogue tasks that didn't get deleted properly simply do nothing when they execute.
Eventual consistency in the Datastore typically measures in seconds. As Google states:
the time delay is typically small, but may be longer (even minutes or more in exceptional circumstances).
Save the time of the next notification for each user. Run a cron job periodically (e.g. once per hour) and send notifications to all users whose notification is due (i.e. now >= next notification).
Create a task with a countdown value for each user when the user's schedule is created. When a task executes, it creates the next task for that user; see the sketch below.
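A minimal sketch of that second approach using the GAE task queue API. The /notify worker URL and the userId parameter are assumptions; the worker would send the notification and then call this method again with the next delay:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class NotificationScheduler {

    // Enqueue the next notification for a user, delayMillis from now.
    public static void scheduleNext(String userId, long delayMillis) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/notify")              // hypothetical worker servlet
                .param("userId", userId)
                .countdownMillis(delayMillis));  // fire after the delay
    }
}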
The first approach is probably more efficient, especially if you choose a large enough window for your cron job.
As for transactions, I don't see why you need them. You can design your system so that in the very rare failure situation a user will receive two notifications instead of one (old schedule and new schedule). This is not such a bad thing that you need to design around it.
I am designing an application and I have two ideas in mind (below). I have a process that collects approximately 30 KB of data every 5 minutes, and this data needs to be updated on the client (web side; 100 users at any given time). The information collected does not need to be stored for future use.
Options:
I can get the data and insert it into the database every 5 minutes. Then a client call is made to the DB to retrieve the data and update the UI.
Collect the data and put it into a Topic or Queue. Multiple clients (consumers) can then go to the Queue and obtain the data.
I am leaning toward option 2 as the better solution because it is faster (no DB calls) and there is no redundant storage.
Can anyone suggest which would be the ideal solution, and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients are getting updates in real time without some kind of push/long-poll or whatever.
It seems to me that you need to use a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the Publish/Subscribe pattern.
I can get the data and insert it into the database every 5 minutes. Then a client call is made to the DB to retrieve the data and update the UI.
You can program your application to be event oriented. For example, raise domain events and publish messages to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, of course in order (FIFO). In addition, there is a delivery guarantee, unlike a database, where a record can be deleted before every 'subscriber' has seen the message.
The pitfalls of using a database to accomplish this are:
Creation of indexes makes querying faster, but inserts slower;
You will have to manage the delivery guarantee for every subscriber yourself;
You'll need a TTL (time-to-live) strategy to purge records (while respecting the delivery guarantee).
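To illustrate the topic-based option, here is a minimal JMS publish/subscribe sketch. ActiveMQ, the broker URL, and the topic name are just assumptions for the example; every active subscriber to the topic receives each 5-minute update:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class DataPublisher {

    // Publish one 30 KB payload to the topic; all current subscribers receive it.
    public static void publish(String payload) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("data.updates");   // assumed topic name
            MessageProducer producer = session.createProducer(topic);
            producer.send(session.createTextMessage(payload));
        } finally {
            connection.close();
        }
    }
}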