Are there any ways using the data already stored by Service Broker to form statistics such as Average Message Lifetime or Average Message Processing Time for a specific queue? I'm not finding any date/time information on any of the Service Broker tables that I know of. Conversations/Dialogs can have an expiration lifetime so there must be some of this information somewhere. The most helpful information would be if there is a message add/created and errored/completed timestamps available without insertions into custom tables.
Recent variants(post 2012?) of sys.transmission_queue expose enqueue_time for debugging. But there is no end-to-end timing info (time created, time in target, time to process etc). Anything you build would have to be based on adding metadata to the message itself, in the application payload, and tracking it in your own tables.
Related
Please help me to understand the functionality of Google cloud Pubsub subscription/num_undelivered_messages metric with pull subscription.
From docs: subscription/num_undelivered_messages is
Number of unacknowledged messages (a.k.a. backlog messages) in a
subscription. Sampled every 60 seconds. After sampling, data is not
visible for up to 120 seconds.
And for Pull delivery from docs
In pull delivery, your subscriber application initiates requests to
the Cloud Pub/Sub server to retrieve messages. The subscribing
application explicitly calls the pull method, which requests messages
for delivery.
Now I setup a pull subscription against a Google public topic named projects/pubsub-public-data/topics/taxirides-realtime which is suppose to continuously provide stream of taxi rides data.
Now my requirement is to calculate number of taxi rides in past 1 hour. The usual approach came in my mind is to pull all messages from topic and perform aggregation over it.
However while searching I found these 2 links link1 and link2 which I feel like can solve the problem but below question 1 is lingering as doubt for this solution and confuses me!
So overall my question is
1. How does a pub subscription finds value of num_undelivered_messages from a topic, even when subscription didn't made any pull call? Actually I can see this metric in stackdriver monitoring by filtering on subscription id.
What is the right way to calculate aggregate of number of messages present in a topic in a certain duration?
The number of undelivered messages is established based on when the subscription is created. Any messages published after that are messages that should be delivered to the subscription. Therefore, any of these messages not pulled and acked by the subscription will count toward num_undelivered_messages.
For solving your particular problem, it would be better to read the feed and aggregate the data. The stats like num_undelivered_messages are useful for examining the health of subscribers, e.g., if the count is building up, it could indicate that something is wrong with the subscribers or that the data published has changed in some way. You could look at the difference in the number between the end of your desired time interval and the beginning to get an estimate of the number of messages published in that time frame, assuming you aren't also consuming and acking any messages.
However, it is important to keep in mind that the time at which messages are published in this feed may not exactly correspond to the time at which a taxi ride occurred. Imagine there was an issue with the publisher and it was unable to publish the messages for a period of time and then once fixed, published all of the messages that had built up during that time. In this scenario, the timestamp in the messages themselves indicating when the taxi ride occurred would not match the time at which the message was received by Cloud Pub/Sub.
We currently have a payment tracking system which uses MS SQL Server Enterprise. When a client requests a service, he would have to do the payment within 24 hours, otherwise we would send him an SMS Reminder. Our current implementation simply records the date and time of the purchase, and keep on polling constantly the records in order to find "expired" purchases.
This is generating so much load on the database that we have to implement some form of replication in order to offload these operations to another server.
I was thinking: is there a way to combine CLR triggers with some kind of a scheduler that would be triggered only once, that is, 24 hours after the purchase is created?
Please keep in mind that we have tens of thousands of transactions per hour.
I am not sure how you are thinking that SQLCLR will solve this problem. I don't think this needs to be handled in the DB at all.
Since the request time doesn't change, why not load all requests into a memory-based store that you can hit constantly. You would load the 24-hour-from-request time so that you only need to compare those times to Now. If the customer pays prior to the 24-hour period then you remove the entry from the cache. Else, the polling process will eventually find it, process it, and remove it from the cache.
OR, similarly, you can use a scheduler and load a future event to be the SMS message, based on the 24-hour-from-request time, upon each request. Similar to scheduling an action using "AT". Again, if someone pays prior to that time, just remove the scheduled task/event/reminder.
You would store just the 24-hour-after-time and the RequestID. If the time is reached, the service would refer back to the DB using that RequestID to get the current info.
You just need to make sure to de-list items from the cache / scheduler if payment is made prior to the 24-hour-after time.
And if the system crashes / restarts, you just load all entries that are a) unpaid, and b) have not yet reached their 24-hour-after time.
I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day(around 50) from various partners and the theoretical total number of records received would be over a million records. Each record has some identifying information that will need to be sent to a web service which would come back essentially with a YES or NO based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port configurable properties for the scatter part(following Richard Seroter's suggestion of implementing a round-robin assignment) where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B' and finally push a control message that spins up the Gather/Aggregator orchestration that collects all the messages that are processed from the workers into the messagebox via correlation and creates two files to be routed to Agency A and Agency B.
So, every file that gets dropped will have it's own set of workers and a aggregator that would process the file.
This works well for files with fewer number of records but if a file has over 100k records, I see throttling happen and the file takes a long time to process and generate the two files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears to be that the gatherer seems to be dehydrated and not really aggregating the records processed by the workers until all of them are processed and i think since the ratio of msgs published vs processed is very large, it is throttling.
Approach 2:
Assuming that the Aggregator orchestration is the bottleneck, instead of accumulating them in an orchestration, i pushed the processed records to a SQL db and 'split' the records into two XML files(basically a concatenate of msgs going to Agency A/B and wrapping it in XML declaration and using the correct msg type based on writing some of the context properties to the SQL table along with the record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal post/requirement has again changed with regard to expected volume, i am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing(duh):
It looks like if each worker processed a fewer number of records in it's queue/subscription, they finished their queue quickly. When testing this 100k record file, using 100 workers completed in under 3 hrs. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum no of concurrent connection they can handle. I am leaning towards asking them to see if they can handle 1000 calls and maybe the existing solution would scale with my observations.
I have adjusted a few settings for the host with regard to message count and physical memory threshold so it won't balk with the volume but I am still unsure. I didn't have to mess with these settings before and can use advice to monitor any particular counters.
The post is a bit long but I am hoping this gives an idea on what I did so far. Any help/insight appreciated in tackling this problem. If you are suggesting alternatives, i am restricted to .NET or MS based tools/frameworks but would love to hear on other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into you BizTalk app for...well, whatever needs to be done. Calling the service etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider following approach:
De-batch the file and store individual records to MSMQ. You can easily achieve this without any extra coding effort, all you need is to create a send port using MSMQ adapter or WCF custom with netmsmq binding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try using messaging only scenarios, you can handle service response using a pipeline component if required. You can use Map on send port itself. In worst case if you need orchestration, it should only be to handle one message processing without any complex pattern.
You can again push messages back to two MSMQ for two different agencies based of web service response.
You can then receive those messages again and write them to file, you can simply use a send port with FileAppend option or use a custom pipeline component to write the received messages to file without aggregating them in orchestration. You can gather them in orchestration, if per file you don't have more than few thousand messages.
With this approach you won't have any bottleneck within BizTalk and you don't need to use complex orchestration pattern which usually end up having many persistent points.
If web service becomes a bottleneck, then you can control the rate of received message from MSMQ using 1) Ordered Delivery on MSMQ receive location and if required 2) using BizTalk host throttling by changing two properties Message Count in Db to a very low number e.g. 1000 from 50K default and increasing Spool and Tracking Data Multiplier accordingly e.g. 500 from 10 default to make sure the multiply of both number is enough for not to cause throttling due to messages within BizTalk. You can also reduce the number of worker threads on BizTalk host to make it little slow.
Please note MSMQ is part of Windows OS and does not require any additional setup. Usually installed by default, if not you can add using add-remove features. You can also use IBM MQ if your organization has the infrastructure. But for one million messages, MSMQ will be just fine.
Apologies on the late update*
We've decided to use SSIS to bulk import the file to a table and since the lookup web service is part of the same organization and network although using a different stack, they have agreed to allow us to call their lookup table upon which their web service is based on and we are using a 'merge' between those tables to identify 'Y' or 'N' and export them out via SSIS as well.
In short, we've skipped using BT. The time it now takes is within a couple of mins for a 1.5 million record file to be processed and send the split files.
Appreciate all the advice provided here.
I've developed a python app that registers information from incoming emails and saves this information to the GAE Datastore. Registering the emails works just fine. As part of the registration, emails with the same subject and recipients get a conversation ID. However, sometimes emails enter the system so fast after each other, that emails from the same conversation don't get the same ID. This happens because two emails from the same conversation are being processed at the same time and GAE doesn't see the other entry yet when running a query for this conversation.
I've been thinking of a way to prevent this, and think it would be best if the system processes only one email per user at a time (each sender has his own account). This could be done by having a push task queue that first checks if there is currently an email being processed for this user, and if so, put the new task in a pull queue from which it can be retrieved as soon as the previous task has been finished.
The big disadvantage of this, is that (I think) I can't run the push queue asynchronous, which obviously is a big performance disadvantage. Any ideas on what would be a better way to setup such a process?
Apparently this was a typical race-condition. I've made use of the Transactions functionality to prevent multiple processes writing at the same time. Documentation can be found here: https://cloud.google.com/appengine/docs/python/datastore/transactions
I am designing an application and I have two ideas in mind (below). I have a process that collects data appx. 30 KB and this data will be collected every 5 minutes and needs to be updated on client (web side-- 100 users at any given time). Information collected does not need to be stored for future usage.
Options:
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
Collect data and put it into Topic or Queue. Now multiple clients (consumers) can go to Queue and obtain data.
I am looking for option 2 as better solution because it is faster (no DB calls) and no redundancy of storage.
Can anyone suggest which would be ideal solution and why ?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right.
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web-clients are getting updates real-time without some kind of push/long-pull or whatever.
Seems to me that you need to use a queue and publisher/subscriber pattern.
This is an article about RabitMQ and Publish/Subscribe pattern.
I can get data and insert into database every 5 minutes. And then client call will be made to DB and retrieve data and update UI.
You can program your application to be event oriented. For ie, raise domain events and publish your message for your subscribers.
When you use a queue, the subscriber will dequeue the message addressed to him and, ofc, obeying the order (FIFO). In addition, there will be a guarantee of delivery, different from a database where the record can be delete, and yet not every 'subscriber' have gotten the message.
The pitfalls of using the database to accomplish this is:
Creation of indexes makes querying faster, but inserts slower;
Will have to control the delivery guarantee for every subscriber;
You'll need TTL (Time to Live) strategy for the records purge (considering delivery guarantee);