Split CalDAV PROPFIND response when it is too large - mobile

I have a home-grown CalDAV implementation which normally works fine, but with one problem.
There are clients with hundreds of calendars which are synchronized over a mobile network.
Each time iCalendar issues a PROPFIND with depth=1, my server must answer with the full list of calendars, giving a huge response which sometimes fails because of the unstable mobile network.
I guess splitting the response into smaller chunks (say, 30 per response) would help, but I don't know if it is really possible.
So the question is: can I force the client to PROPFIND calendars in consecutive requests, in chunks of N calendars?

No, there's no agreed-upon standard for this.
That being said: (1) are you compressing the response? (2) have you looked at https://datatracker.ietf.org/doc/html/draft-murchison-webdav-prefer-05?
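For point (1), here is a minimal sketch of what server-side compression could look like, assuming a WSGI-style handler where you already have the multistatus XML as bytes (the function and names are illustrative, not from any particular framework):

    import gzip

    def compress_if_accepted(multistatus_xml: bytes, accept_encoding: str):
        """Gzip the PROPFIND multistatus body when the client advertises support."""
        if "gzip" in (accept_encoding or "").lower():
            return gzip.compress(multistatus_xml), {"Content-Encoding": "gzip"}
        return multistatus_xml, {}

XML compresses extremely well, so even a list of hundreds of calendars usually shrinks to a small fraction of its original size.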

When you say "100's of calendars" do you mean "100's of events"? Because a PROPFIND response with 100's of items actually isn't big; that's perfectly normal. But quite often the list of events in a calendar can be large. Apple usually do quite good CalDAV clients, though, and they should be doing REPORT requests with an appropriate date range so they don't get too many events.
It's possible that your CalDAV server doesn't implement the full range of reports, so the client is falling back to a simpler approach.
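To illustrate: a well-behaved client fetches events with a calendar-query REPORT scoped to a date range (RFC 4791), roughly like the request below. This is only a sketch using Python's requests library to show the shape of the request; the URL, credentials, and dates are placeholders.

    import requests

    # Placeholder URL, credentials, and time range - only the request shape matters.
    CALENDAR_URL = "https://caldav.example.com/calendars/user/work/"
    REPORT_BODY = """<?xml version="1.0" encoding="utf-8"?>
    <c:calendar-query xmlns:d="DAV:" xmlns:c="urn:ietf:params:xml:ns:caldav">
      <d:prop><d:getetag/><c:calendar-data/></d:prop>
      <c:filter>
        <c:comp-filter name="VCALENDAR">
          <c:comp-filter name="VEVENT">
            <c:time-range start="20240101T000000Z" end="20240201T000000Z"/>
          </c:comp-filter>
        </c:comp-filter>
      </c:filter>
    </c:calendar-query>"""

    resp = requests.request(
        "REPORT", CALENDAR_URL, data=REPORT_BODY,
        headers={"Depth": "1", "Content-Type": "application/xml; charset=utf-8"},
        auth=("user", "password"))

If the server doesn't advertise or implement this report, the client may fall back to PROPFIND-ing everything.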

As mentioned by brad, you need to differentiate between the number of calendars and the number of events within a calendar.
Having hundreds of calendars is very unusual, but just that should be kinda OK. A PROPFIND with 100 results isn't too big. Also note the CTag: if that's available, you know whether you need to sync the calendar contents in the first place.
Most likely you are actually asking about some calendars containing a lot of events, possibly thousands. In such cases the PROPFIND:1 on the calendar to grab the ETags to check for changes can become large and slow. (In any case, make sure you support Accept-Encoding: gzip and Brief: t.)
For this case there is an RFC'd solution: RFC 6578. With sync-reports you only need to return the records which changed since the last sync. It's supported by iOS and iCal.
The spec also supports batching (called truncation in the RFC), but this isn't implemented in all clients.
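For reference, a sync-collection REPORT looks roughly like the sketch below (again using requests; the collection URL, credentials, and sync token are placeholders - the token is whatever opaque value your server returned on the previous sync, or empty for the initial sync):

    import requests

    SYNC_BODY = """<?xml version="1.0" encoding="utf-8"?>
    <d:sync-collection xmlns:d="DAV:">
      <d:sync-token>http://example.com/ns/sync/1234</d:sync-token>
      <d:sync-level>1</d:sync-level>
      <d:prop><d:getetag/></d:prop>
    </d:sync-collection>"""

    # The server answers with only the members changed since that token,
    # plus a new sync token for the next round.
    resp = requests.request(
        "REPORT", "https://caldav.example.com/calendars/user/work/",
        data=SYNC_BODY,
        headers={"Depth": "0", "Content-Type": "application/xml; charset=utf-8"},
        auth=("user", "password"))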

Related

Gmail API all messages

I need to get all messages in the Inbox with the Gmail API, but I see only one way to do it.
Get the list of messages (id, threadId):
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages?labelIds=INBOX&key={YOUR_API_KEY}
With the ids, get all messages in a loop:
While
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages/147199d21bbaf5a5?key={YOUR_API_KEY}
End of While
But this way requires an enormous number of requests.
Does anybody have an idea how to get all messages (or just the payload field) with one request?
Use batch and request 100 messages at a time. You will need to make 1000 requests but the good news is that's quite fine and it'll be easier for everyone (no downloading 1GB response in a single request!).
Documented at:
https://developers.google.com/gmail/api/guides/batch
There are a few other people who have asked about batching the Gmail API here on Stack Overflow, so just do a quick search to find answers and examples.
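A rough sketch of that pattern with the google-api-python-client, assuming creds is an already-authorized credentials object (the callback and variable names are illustrative):

    from googleapiclient.discovery import build

    service = build("gmail", "v1", credentials=creds)  # creds: assumed authorized credentials

    def on_message(request_id, response, exception):
        # Called once per sub-request; collect or process each full message here.
        if exception is None:
            print(response["id"], response.get("snippet", ""))

    # 1) Page through the id list for the INBOX.
    msg_ids, page_token = [], None
    while True:
        listing = service.users().messages().list(
            userId="me", labelIds=["INBOX"], pageToken=page_token).execute()
        msg_ids += [m["id"] for m in listing.get("messages", [])]
        page_token = listing.get("nextPageToken")
        if not page_token:
            break

    # 2) Fetch the full messages 100 at a time in batch requests.
    for i in range(0, len(msg_ids), 100):
        batch = service.new_batch_http_request(callback=on_message)
        for msg_id in msg_ids[i:i + 100]:
            batch.add(service.users().messages().get(userId="me", id=msg_id))
        batch.execute()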
The approach you are taking is correct, as there is no 'GetAll' API to download them all.
Reasons include:
Unbounded Result Sets
Pulling out an unlimited amount of emails (aka unbounded result set) is a resource hog on Google servers. Did you want the attachments AND images? These could be gigabytes of data.
Network Problems
Google has to read gigabytes from disk, store them in memory, and send them over the internet. Google's servers could handle it, but the bandwidth of the internet connection would be the limiting factor. Worst of all, if you issued this request again and again, you could effectively perform a DDoS attack on Google.
Security Risk
If someone gains another user's API key, they could download that user's entire mailbox. Hence Google provides paging to ensure a more secure service and to reduce resource contention.
Therefore, it is there to protect you and other users, and themselves.

What is a good way to send large data sets to a client through API requests?

A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, and a pull will be on the order of 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API will be RESTful. After the request is received, I was thinking that the payload would be retrieved from the database, converted to JSON, and GZIP encoded/transferred over HTTP back to the client.
I'm concerned about the processing this may take with many clients connected, each pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all: do not optimize prematurely! That means: don't sacrifice the simplicity and maintainability of your code for a gain you don't even know you need.
Let's see. 50,000 records does not really say anything without specifying the size of a record. I would advise you to start from a basic implementation and optimize when needed. So try this:
Implement a simple JSON response with those 50,000 records, and try to call it from the consumer app. Measure the size of the data and the response time, and evaluate carefully whether this is really a problem for a once-a-day operation.
If yes, turn on compression for that JSON response; this is usually a HUGE change with JSON because of all the repetitive text. One tip here: set the content type header to "application/javascript", since Azure has dynamic compression enabled by default for that content type. Again, try it and evaluate whether the size of the data or the response time is a problem.
If it is still a problem, maybe it is time for some serialization optimization after all, but I would strongly recommend something standard and proven here (no custom CSV mess), for example Google Protocol Buffers: https://code.google.com/p/protobuf-net/
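To get a feel for how big the change from compression can be, here is a quick sanity check you could run locally (the record shape is made up; your real records will differ):

    import gzip
    import json

    # Hypothetical 50,000-record payload, just to compare raw vs. gzipped size.
    records = [{"id": i, "name": f"item {i}", "status": "active"} for i in range(50000)]
    raw = json.dumps(records).encode("utf-8")
    packed = gzip.compress(raw)
    print(f"raw: {len(raw) / 1024:.0f} KiB, gzipped: {len(packed) / 1024:.0f} KiB")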
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is just the database on Azure, or is your entire hosting on Azure? I've never done any production work on Azure myself.
What are you trying to optimize for: total round-trip response time, total server CPU time, or perhaps something else?
For example, if your database server is on Azure but your web server is local, perhaps you can simply optimize the database request and depend on scaling via multiple web servers if needed.
If the data changes with each request, you should never compress it if you are trying to optimize server CPU load, but you should compress it if you are trying to optimize bandwidth usage; either can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If your data is a single table, you might see significant savings by using something like CSV (including the first row as a record header, for a sanity check if nothing else). If your result comes from joining multiple tables, i.e., it is hierarchical, using JSON would be recommended simply to avoid the complexity of rolling your own hierarchical representation.
Are you using SSL on your web server? If so, SSL could be your bottleneck (unless this is handled via other hardware).
What is the nature of the data you are sending? Is it mostly text, numbers, images? Text usually compresses well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect that you have little if any binary data though.
If compressing JSON, it can be a very efficient format since the repeated field names mostly compress out of your result. XML likewise (but less so, since the tags come in pairs).
ADDED
If you know what the client will be fetching beforehand and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off-peak hours, create it as a static .gz file, and let IIS serve it directly when needed. Your API could simply be in two parts: 1) retrieve a list of static .gz files available to the client, 2) confirm processing of said files so you can delete them.
Presumably you know that JSON & XML are not as fragile as CSV, i.e., adding or deleting fields in your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML; XML is easier for some clients to parse, and to be honest if you use Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to be flexible. Personally, I like Json.NET quite a lot; it is simple and fast.
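A minimal sketch of that off-peak preparation step, assuming a nightly job that already has the day's records in hand (the path and function name are made up):

    import gzip
    import json
    from datetime import date

    def write_daily_dump(records, out_dir="dumps"):
        """Serialize the daily extract once, pre-compressed, so the web server
        can hand the .gz file to the client as-is (Content-Encoding: gzip)."""
        path = f"{out_dir}/extract-{date.today().isoformat()}.json.gz"
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            json.dump(records, fh)
        return path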
Normally what happens with such large requests is pagination, so included in the JSON response is a URL to request the next lot of information.
Now the next question is: what is your client? e.g. a browser or a behind-the-scenes application.
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application, then your current approach of 50,000 records in a single JSON call would be acceptable; the only thing you need to watch here is the load on the DB pulling the records, especially if you have many clients.
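For completeness, the paginated shape usually looks something like the sketch below: each page carries the URL of the next one, and the client follows "next" until it is null (the field names and helper are arbitrary):

    def build_page(records, offset, limit, base_url="/api/records"):
        """Return one page of results plus a link to the next page, if any."""
        page = records[offset:offset + limit]
        has_more = offset + limit < len(records)
        next_url = f"{base_url}?offset={offset + limit}&limit={limit}" if has_more else None
        return {"data": page, "next": next_url}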
If you are willing to use a third-party library, you can try Heavy-HTTP which solves this problem out of the box. (I'm the author of the library)

Strategy for caching of remote service; what should I be considering?

My web app contains data gathered from an external API over which I have no control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database, and each of these items is essentially a cached version. Consider that it takes 1 request to update the cache of 1 item. Obviously, it is not possible to have a perfectly up-to-date cache under these circumstances. So, what should I be considering when developing a strategy for caching the data? These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of:
time since item was created (less time means more important)
number of 'likes' a particular item has (could mean higher probability of being viewed)
time since last updated
A few more details: the items are photos. Every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients (therefore they should take priority). Though I only have 250K items in the database now, that number increases rather rapidly (it will not be long until the 1 million mark is reached, maybe 5 months).
Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there's new (and maybe updated?) images for you to check out. Would that do the trick?
Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.
Anyway, to throw out some thoughts on standards for crawling:
Update more often the things that turn out to have changed when you update them. So, if an item hasn't changed in the last five updates, then maybe you could assume it won't change as often and update it less frequently.
Create a score for each image, and update the ones with the highest scores. Or the lowest scores (depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
A statistical model of the chance of an image being updated and needing to be recached.
An importance score for each image, using things like the recency of the image, or the currency of its event.
Update things that are being viewed frequently.
Update things that have many views.
Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? Slow down the frequency of checks of older images.
Allocate part of your requests to slowly updating everything, and split the other parts across several different algorithms running simultaneously. So, for example, have the following (numbers are for example only, I just pulled them out of a hat; a rough sketch of this kind of allocation follows the list):
5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
2,500 requests processing new images (which you mentioned are more important)
2,500 requests processing images of current events
2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
2,500 requests processing images that have been viewed at least
Total: 15,000 requests per hour.
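Here is a sketch of how such an hourly budget could be wired up; the pool names and quotas just mirror the example numbers above, and how you prioritize items inside each pool is up to you:

    # Quotas mirror the example split above; tweak to taste.
    HOURLY_BUDGET = {
        "full_sweep": 5000,        # slow churn through the whole database
        "new_images": 2500,
        "current_events": 2500,
        "most_viewed": 2500,
        "frequently_viewed": 2500,
    }

    def plan_hour(pools):
        """pools maps a pool name to a list of item ids, highest priority first.
        Returns the ids to refresh this hour, capped at each pool's quota."""
        plan = []
        for name, quota in HOURLY_BUDGET.items():
            plan.extend(pools.get(name, [])[:quota])
        return plan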
How many (unique) photos / events are viewed on your site per hour? Those photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked that often.
andyg0808 has good, detailed information; however, it is important to know the patterns of your data usage before applying it in practice.
At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.

Best practices to limit the number of calls to Mirror API

I, like everyone else I imagine, have a courtesy limit of 1000 Mirror API calls per day.
I see there's a batching facility that looks promising, but it appears to be able to batch only requests for a single credential. So even one customer pushing to the API every 60 seconds will use 1,440 requests/day. Ideally, 30 seconds is where I'd like to be; 2,880 requests/day would then be multiplied by the number of customers. It will get really big really fast.
I might be missing something, but I don't see a way around that.
If it were available I could glom all updates across all clients in the 30 second period into one giant message...
Is there a better design pattern to keep cards up-to-date with telemetry that's changing in real-time?
You can send requests to multiple users with a single batch request: instead of setting the Authorization header in the batch request, simply set the Authorization header in each sub-request.
Our Python and Java Quick Start projects have an example of using a batch request to send an update to up to 10 users. This is also mentioned in the Building Glass Services with the Google Mirror API I/O session.
Otherwise, you can check the protocol documentation in our reference guide.
As Scarygami mentioned, each sub-request will consume quota so the only optimization is to save on bandwidth and HTTP requests, especially if using gzip encoding.
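A hedged sketch of that multi-user fan-out with the Python client library, assuming user_credentials maps each user id to that user's authorized credentials (the card body and names are placeholders). Because each sub-request is built from a service object authorized as that user, each part of the batch carries its own Authorization header:

    from googleapiclient.discovery import build
    from googleapiclient.http import BatchHttpRequest

    def push_card_to_users(user_credentials, card_body):
        """Insert the same timeline card for many users in one HTTP round trip."""
        batch = BatchHttpRequest()
        for user_id, creds in user_credentials.items():
            mirror = build("mirror", "v1", credentials=creds)
            batch.add(mirror.timeline().insert(body=card_body), request_id=user_id)
        batch.execute()

    # Example (placeholder card body):
    # push_card_to_users(user_credentials, {"text": "Telemetry update"})

As noted above, quota is still consumed per sub-request; the batch only saves HTTP overhead and bandwidth.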

Working with new channel creation limits

Google app engine seems to have recently made a huge decrease in free quotas for channel creation from 8640 to 100 per day. I would appreciate some suggestions for optimizing channel creation, for a hobby project where I am unwilling to use the paid plans.
It is specifically mentioned in the docs that there can be only one client per channel ID. It would help if there were a way around this, even if it were only for multiple clients on one computer (such as multiple tabs).
It occurred to me I might be able to simulate channel functionality by repeatedly sending XHR requests to the server to check for new messages, thereby bypassing the limits. However, I fear this method might be too slow. Are there any existing libraries that work on this principle?
One Client per Channel
There's not an easy way around the one client per channel ID limitation, unfortunately. We actually allow two, but this is to handle the case where a user refreshes his page, not for actual fan-out.
That said, you could certainly implement your own workaround for this. One trick I've seen is to use cookies to communicate between browser tabs. Then you can elect one tab the "owner" of the channel and fan out data via cookies. See this question for info on how to implement the inter-tab communication: Javascript communication between browser tabs/windows
Polling vs. Channel
You could poll instead of using the Channel API if you're willing to accept some performance trade-offs. Channel API delivery speed is on the order of 100-200 ms; if you could accept a 500 ms average, then you could poll every second. Depending on the type of data you're sending, and how much you can fit in memcache, this might be a workable solution. My guess is your biggest problem is going to be instance-hours.
For example, if you have, say, 100 clients you'll be looking at 100 qps. You should experiment and see if you can serve 100 requests in a second for the data you need to serve without spinning up a second instance. If not, keep increasing your latency (i.e., decreasing your polling frequency) until you get to one instance able to serve your requests.
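A minimal sketch of what the polling endpoint could look like on the legacy Python 2.7 runtime, fanning messages out through memcache so most polls stay cheap (the handler path, key scheme, and client_id parameter are made up):

    import json

    import webapp2
    from google.appengine.api import memcache

    class PollHandler(webapp2.RequestHandler):
        def get(self):
            # Clients poll this every second or so instead of holding a channel open.
            client_id = self.request.get("client_id")
            messages = memcache.get("msgs:%s" % client_id) or []
            self.response.headers["Content-Type"] = "application/json"
            self.response.write(json.dumps(messages))

    app = webapp2.WSGIApplication([("/poll", PollHandler)])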
Hope that helps.

Resources