Please help me to understand the functionality of Google Cloud Pub/Sub's subscription/num_undelivered_messages metric with a pull subscription.
From the docs, subscription/num_undelivered_messages is:
Number of unacknowledged messages (a.k.a. backlog messages) in a
subscription. Sampled every 60 seconds. After sampling, data is not
visible for up to 120 seconds.
And for pull delivery, from the docs:
In pull delivery, your subscriber application initiates requests to
the Cloud Pub/Sub server to retrieve messages. The subscribing
application explicitly calls the pull method, which requests messages
for delivery.
I have set up a pull subscription against a Google public topic named projects/pubsub-public-data/topics/taxirides-realtime, which is supposed to continuously provide a stream of taxi ride data.
My requirement is to calculate the number of taxi rides in the past hour. The usual approach that came to my mind is to pull all messages from the subscription and perform an aggregation over them.
However, while searching I found these 2 links, link1 and link2, which I feel could solve the problem, but question 1 below lingers as a doubt about this solution and confuses me!
So overall my questions are:
1. How does a pull subscription find the value of num_undelivered_messages for a topic, even when the subscription hasn't made any pull calls? I can actually see this metric in Stackdriver Monitoring by filtering on the subscription ID.
2. What is the right way to calculate the aggregate number of messages published to a topic in a certain duration?
The number of undelivered messages is established based on when the subscription is created. Any messages published after that are messages that should be delivered to the subscription. Therefore, any of these messages not pulled and acked by the subscription will count toward num_undelivered_messages.
For solving your particular problem, it would be better to read the feed and aggregate the data. The stats like num_undelivered_messages are useful for examining the health of subscribers, e.g., if the count is building up, it could indicate that something is wrong with the subscribers or that the data published has changed in some way. You could look at the difference in the number between the end of your desired time interval and the beginning to get an estimate of the number of messages published in that time frame, assuming you aren't also consuming and acking any messages.
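If it helps to make the "read the feed and aggregate" idea concrete, here is a minimal sketch using the google-cloud-pubsub Python client. The project and subscription IDs are placeholders, it assumes the subscription has been accumulating (unacked) messages for at least the window you care about, and it counts by publish_time, which, as noted below, may not match the time the ride actually happened.

```python
# Rough sketch: pull from a subscription on the taxirides topic and count
# messages published in the last hour. Project/subscription IDs are
# placeholders; counting uses publish_time, not the ride's own timestamp.
import datetime
import threading
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

PROJECT_ID = "your-project"        # placeholder
SUBSCRIPTION_ID = "taxirides-sub"  # placeholder pull subscription

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
ride_count = 0
lock = threading.Lock()

def callback(message):
    global ride_count
    # publish_time is when Pub/Sub accepted the message, which may lag the
    # actual ride time if the publisher was delayed.
    if message.publish_time >= cutoff:
        with lock:
            ride_count += 1
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        # Run the stream for a bounded time for illustration; a real job
        # would keep pulling until the backlog for the window is drained.
        streaming_pull_future.result(timeout=60)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()

print("taxi rides counted in the last hour:", ride_count)
```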
However, it is important to keep in mind that the time at which messages are published in this feed may not exactly correspond to the time at which a taxi ride occurred. Imagine there was an issue with the publisher and it was unable to publish the messages for a period of time and then once fixed, published all of the messages that had built up during that time. In this scenario, the timestamp in the messages themselves indicating when the taxi ride occurred would not match the time at which the message was received by Cloud Pub/Sub.
I'm working on an application where I will be getting 40 million records a day, so can Pub/Sub handle that? I have also seen that in some cases Pub/Sub sends duplicate messages; how can we avoid this?
40 million records a day (~460/s) is definitely feasible for Pub/Sub, yes. The service is designed to scale horizontally with your load to tens of GB per second. Pub/Sub is an at-least-once delivery service by default, which means that duplicates are possible. There is an exactly-once delivery feature currently in public preview, which allows one to get stronger guarantees, including:
- Only one delivery of a message can be outstanding at a time.
- A successful response to the Ack call means that the message is guaranteed not to be redelivered.
This does mean that if you don't ack a message before the deadline, the message will get redelivered, so it doesn't mean you avoid duplicates entirely. If you need exactly once processing, then Dataflow can be a good choice.
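For what it's worth, here is a rough sketch (an illustration, not the official sample) of what acking looks like in the Python client when exactly-once delivery is enabled on the subscription; the project and subscription IDs are placeholders.

```python
# Rough sketch: acking with exactly-once delivery enabled. ack_with_response()
# lets the subscriber confirm the ack succeeded; a successful result means the
# message will not be redelivered.
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber import exceptions as sub_exceptions

subscriber = pubsub_v1.SubscriberClient()
# Placeholders; the subscription must have exactly-once delivery enabled.
subscription_path = subscriber.subscription_path("your-project", "your-sub")

def callback(message):
    print("processing", message.message_id)  # stand-in for real processing
    ack_future = message.ack_with_response()
    try:
        ack_future.result()  # success => guaranteed no redelivery
    except sub_exceptions.AcknowledgeError as e:
        # e.g. the ack deadline already expired; the message may come back.
        print("ack failed:", e.error_code)

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```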
I am working on an application that requires checking the user's inbox for new messages every 5 minutes.
The current approach we've taken is to utilise the list.histories endpoint based on push notifications. We're running into an edge case where we get stale history IDs for accounts that go cold for a few weeks. I understand the remedy in this situation is to do a full sync using list.messages.
I was wondering if it is possible to just poll the list of messages every 5-10 minutes using the list.messages endpoint with q filters to restrict the timeframe. The implementation would involve querying with overlapping timeframes of 1 minute; the idea is that the overlap would let us figure out where we left off and stitch the sequence correctly (a rough sketch follows the list of concerns below). We would no longer be using pub/sub or list.histories.
The concerns I have are:
- This approach isn't listed in the guide.
- Is it possible for a message with a history_id greater than that of the message preceding it to have an older internal date?
- Does anyone else have experience with this?
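To make the idea concrete, here is a rough sketch of what I have in mind; the credentials file and the processing hook are placeholders, and the windows are just the ones described above.

```python
# Rough sketch: poll messages.list with overlapping windows and de-dupe by id.
# token.json and handle_new_message are placeholders.
import time

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

POLL_INTERVAL = 5 * 60  # poll every 5 minutes
OVERLAP = 60            # 1-minute overlap between windows, as described above

creds = Credentials.from_authorized_user_file("token.json")  # placeholder token
service = build("gmail", "v1", credentials=creds)

seen_ids = set()  # in practice this would need pruning/persistence
window_start = int(time.time()) - POLL_INTERVAL

while True:
    now = int(time.time())
    # Gmail's q filter accepts epoch seconds for after:/before:.
    query = "after:%d before:%d" % (window_start - OVERLAP, now)
    request = service.users().messages().list(userId="me", q=query)
    while request is not None:
        resp = request.execute()
        for msg in resp.get("messages", []):
            if msg["id"] not in seen_ids:
                seen_ids.add(msg["id"])
                # handle_new_message(msg["id"])  # placeholder processing hook
        request = service.users().messages().list_next(request, resp)
    window_start = now
    time.sleep(POLL_INTERVAL)
```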
Are there any ways using the data already stored by Service Broker to form statistics such as Average Message Lifetime or Average Message Processing Time for a specific queue? I'm not finding any date/time information on any of the Service Broker tables that I know of. Conversations/Dialogs can have an expiration lifetime so there must be some of this information somewhere. The most helpful information would be if there is a message add/created and errored/completed timestamps available without insertions into custom tables.
Recent versions (post-2012?) of sys.transmission_queue expose enqueue_time for debugging, but there is no end-to-end timing info (time created, time in target, time to process, etc.). Anything you build would have to be based on adding metadata to the message itself, in the application payload, and tracking it in your own tables.
I've developed a Python app that registers information from incoming emails and saves this information to the GAE Datastore. Registering the emails works just fine. As part of the registration, emails with the same subject and recipients get a conversation ID. However, sometimes emails enter the system so quickly after one another that emails from the same conversation don't get the same ID. This happens because two emails from the same conversation are processed at the same time, and GAE doesn't yet see the other entry when running a query for the conversation.
I've been thinking of a way to prevent this, and I think it would be best if the system processes only one email per user at a time (each sender has his own account). This could be done by having a push task queue that first checks if there is currently an email being processed for this user, and if so, puts the new task in a pull queue from which it can be retrieved as soon as the previous task has finished.
The big disadvantage of this is that (I think) I can't run the push queue asynchronously, which is obviously a big performance hit. Any ideas on what would be a better way to set up such a process?
Apparently this was a typical race condition. I've made use of the Transactions functionality to prevent multiple processes from writing at the same time. Documentation can be found here: https://cloud.google.com/appengine/docs/python/datastore/transactions
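For anyone hitting the same issue, here is a minimal sketch of one way to do this with ndb (get_or_insert runs in a transaction); the model and the key derivation are illustrative assumptions, not the exact code used.

```python
# Rough sketch: derive a deterministic key from the normalized subject and
# recipients, then use get_or_insert(), which runs in a transaction, so two
# emails processed concurrently end up with the same conversation entity.
import hashlib

from google.appengine.ext import ndb


class Conversation(ndb.Model):  # illustrative model, not the app's real one
    subject = ndb.StringProperty()
    recipients = ndb.StringProperty(repeated=True)


def conversation_key_name(subject, recipients):
    # Same subject + same recipients => same key name.
    normalized = subject.strip().lower() + "|" + ",".join(sorted(recipients))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def get_or_create_conversation(subject, recipients):
    key_name = conversation_key_name(subject, recipients)
    # Transactional: only one of two concurrent calls creates the entity,
    # the other simply gets the existing one back.
    return Conversation.get_or_insert(
        key_name, subject=subject, recipients=sorted(recipients))
```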
I'm building a multiuser realtime application with Google App Engine (Python) that would look like the Facebook livestream plugin: https://developers.facebook.com/docs/reference/plugins/live-stream/
Which means: 1 to 1 000 000 users on the same webpage can perform actions that are instantly notified to everyone else. It's like a group chat but with a lot of people...
My questions:
- Is App Engine able to scale to that kind of number?
- If yes, how would you design it?
- If no, what would be your suggestions?
Right now, this is my design:
- I'm using the App Engine Channel API
- I store every connected user in memcache
- Every time an action is performed, a notification task is added to a task queue
- The task consists of retrieving all users from memcache and sending each of them a notification.
I know my bottleneck is in the task: everybody is notified through the same task/request. Right now, with 30 users connected, it takes about 1 second, so for 100,000 users you can imagine how long it could take.
How would you correct this?
Thanks a lot
How many updates per user do you expect per second? If each of your 1,000,000 users updates just once every hour, you'll be sending 10^12 messages per hour, because every incoming message results in 1,000,000 outgoing sends. Put another way, one message per user per hour works out to roughly 277 incoming messages per second, but about 277 million outgoing messages per second.
So I think your basic design is flawed. But the underlying question: "how do I broadcast the same message to lots of users" is still valid, and I'll address it.
As you have discovered, the Channel API isn't great for broadcast because each call takes about 50 ms. You could work around this with multiple tasks executing in parallel.
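For example, here is a rough sketch (placeholder memcache key and worker URL, assuming the design described in the question) of fanning the broadcast out over several tasks instead of one:

```python
# Rough sketch: shard the connected client ids into batches and enqueue one
# task per batch, so the Channel API sends run in parallel instead of in one
# long task. Memcache key and worker URL are placeholders.
import json

import webapp2
from google.appengine.api import channel, memcache, taskqueue

CONNECTED_KEY = "connected_clients"  # placeholder: list of channel client ids
BATCH_SIZE = 100


def broadcast(message):
    client_ids = memcache.get(CONNECTED_KEY) or []
    for i in range(0, len(client_ids), BATCH_SIZE):
        batch = client_ids[i:i + BATCH_SIZE]
        taskqueue.add(url="/notify_batch",
                      params={"clients": json.dumps(batch), "message": message})


class NotifyBatchHandler(webapp2.RequestHandler):
    def post(self):
        message = self.request.get("message")
        for client_id in json.loads(self.request.get("clients")):
            channel.send_message(client_id, message)


app = webapp2.WSGIApplication([("/notify_batch", NotifyBatchHandler)])
```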
For cases like this (lots of clients who need the exact same stateless data), I would encourage you to use polling rather than the Channel API, since every client is going to receive the exact same information; there's no need to send individualized messages to each client. Decide on an acceptable average latency (e.g. 1 second) and poll at an interval of twice that (e.g. every 2 seconds). Write a very lightweight, memcache-backed servlet that just returns the most recent block of data and let the clients de-dupe.
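As a sketch only (placeholder cache key, classic webapp2/memcache APIs), the polling endpoint could be as small as this, with whatever writes the broadcast data updating the same memcache key:

```python
# Rough sketch: a tiny polling endpoint that serves the latest block of
# broadcast data straight out of memcache; every client gets the same
# response and de-dupes on its side. Cache key is a placeholder.
import json

import webapp2
from google.appengine.api import memcache

LATEST_KEY = "livestream:latest"  # placeholder; writers update this key


class LatestHandler(webapp2.RequestHandler):
    def get(self):
        # One memcache read per poll, identical payload for every client.
        payload = memcache.get(LATEST_KEY) or json.dumps({"events": []})
        self.response.headers["Content-Type"] = "application/json"
        self.response.write(payload)


app = webapp2.WSGIApplication([("/latest", LatestHandler)])
```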