Strategy for caching of remote service; what should I be considering?

My web app contains data gathered from an external API which I do not control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database, and each of them is essentially a cached copy of an item from the API. Updating the cache for one item takes one request, so it is obviously not possible to keep a perfectly up-to-date cache under these circumstances. What should I be considering when developing a strategy for caching the data? These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of:
time since item was created (less time means more important)
number of 'likes' a particular item has (could mean higher probability of being viewed)
time since last updated
A few more details: the items are photos. Every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients (therefore those photos should take priority). Though I only have 250K items in the database now, that number is growing rapidly; it will not be long until the 1 million mark is reached, maybe 5 months.

Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there are new (and maybe updated?) images for you to check out. Would that do the trick?
Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.
Anyway, to throw out some thoughts on standards for crawling:
Update most often the things that actually change when you check them. If an item hasn't changed in the last five updates, you can assume it changes rarely and update it less often.
Create a score for each image, and update the ones with the highest scores. Or the lowest scores (depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
A statistical model of the chance of an image being updated and needing to be recached.
An importance score for each image, using things like the recency of the image, or the currency of its event.
Update things that are being viewed frequently.
Update things that have many views.
Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? Slow down the frequency of checks of older images.
Allocate part of your requests to slowly updating everything, and split the rest across several different strategies running simultaneously. So, for example, you might have the following (the numbers are for illustration only; I just pulled them out of a hat, and a rough sketch of this kind of split appears after the list):
5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
2,500 requests processing new images (which you mentioned are more important)
2,500 requests processing images of current events
2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
2,500 requests processing images that have been viewed at least
Total: 15,000 requests per hour.
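To make that concrete, here is a minimal Python sketch combining a priority score with a per-strategy budget. All the weights, field names, and budget numbers are placeholders to illustrate the shape of the approach, not a tested implementation:

```python
import heapq
import time

# Hypothetical split of the 20,000/hour budget across crawl strategies
BUDGET = {
    "full_sweep": 5000,       # slowly churn through everything
    "new_items": 2500,        # recently created photos
    "current_events": 2500,   # photos from events happening now
    "most_viewed": 2500,      # top viewed photos
    "recently_viewed": 2500,  # photos viewed in the last hour
}

def score(item, now=None):
    """Combine the signals above into one priority number.
    The weights are placeholders; tune them against real usage data."""
    now = now or time.time()
    staleness_hours = (now - item["last_fetched_at"]) / 3600.0
    age_hours = (now - item["created_at"]) / 3600.0
    return (2.0 * item["likes"]
            + 5.0 * staleness_hours
            - 0.5 * age_hours
            + (50.0 if item["event_is_live"] else 0.0))

def plan_hour(candidates_by_strategy):
    """candidates_by_strategy maps a strategy name to its candidate items;
    returns the items to refresh this hour, respecting each sub-budget."""
    to_refresh = []
    for strategy, items in candidates_by_strategy.items():
        n = BUDGET.get(strategy, 0)
        to_refresh.extend(heapq.nlargest(n, items, key=score))
    return to_refresh
```

Each hour you would build the candidate lists from your database, call plan_hour, and spend your API requests on whatever it returns.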

How many (unique) photos / events are viewed on your site per hour? Photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked that often.
andyg0808 has good detailed information; however, it is important to know the patterns of your data usage before applying any of it in practice.
At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.

Related

Notion API Pagination Random Database Entry

I'm trying to retrieve a random entry from a database using the Notion API. There is a limit on how many entries you can retrieve at once, so pagination is used to sift through the pages 100 entries at a time. Since there is no database attribute telling you how long the database is, you have to walk the pages in order until reaching the end before you can choose a random entry. This is fine for small databases, but I have a cron job that regularly chooses a random entry from a Notion database with thousands of entries. Additionally, if I make too many calls simultaneously I risk being rate limited pretty often. Is there a better way to go about choosing a random value from a database that uses pagination? Thanks!
I don't think there is a better way to do it right now (sadly). If your entries don't change often, think about caching the pages; that saves you a lot of execution time in your cron job. For the rate limit, if you use Node.js, you can build a rate-limited queue (3 requests/second) pretty easily with something like bull.
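For illustration, here is a rough Python sketch of the same idea (the original suggestion is Node.js with bull, but the approach is language-agnostic). The token, database ID, and cache file name are placeholders, and deciding when to invalidate the cache is left up to you:

```python
import json
import random
import time
import requests

NOTION_TOKEN = "secret_..."        # your integration token (placeholder)
DATABASE_ID = "your-database-id"   # placeholder
CACHE_FILE = "notion_pages.json"   # local page cache (placeholder)

HEADERS = {
    "Authorization": "Bearer " + NOTION_TOKEN,
    "Notion-Version": "2022-06-28",
    "Content-Type": "application/json",
}

def fetch_all_pages():
    """Walk the paginated query endpoint, throttling to stay under ~3 req/s."""
    url = "https://api.notion.com/v1/databases/%s/query" % DATABASE_ID
    results, cursor = [], None
    while True:
        body = {"page_size": 100}
        if cursor:
            body["start_cursor"] = cursor
        resp = requests.post(url, headers=HEADERS, json=body)
        resp.raise_for_status()
        data = resp.json()
        results.extend(data["results"])
        if not data.get("has_more"):
            return results
        cursor = data["next_cursor"]
        time.sleep(0.4)  # simple throttle

def random_entry():
    """Use the cached list if present; otherwise refresh it from the API."""
    try:
        with open(CACHE_FILE) as f:
            pages = json.load(f)
    except FileNotFoundError:
        pages = fetch_all_pages()
        with open(CACHE_FILE, "w") as f:
            json.dump(pages, f)
    return random.choice(pages)
```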

How can I address the 10GB limit on Google App Engine?

We are trying to index inboxes by sitting on top of Gmail, and are using the App Engine Search API, but we are hitting the 10 GB limit. This is because we are indexing the whole organization's emails so we can search across the whole team's inbox. How can we work around this? One way might be to have an individual index per person and somehow combine the results manually, but I'm worried that merging results might be really complex. What options are available?
This is a typical problem in any document retrieval system, and the solution is to slice the entire corpus into multiple buckets. You should choose a slicing strategy based on your requirements/usage pattern.
One possibility is to slice messages by their date. You keep adding messages to an index until you come close to the limit, at which point you start a new index for newer messages. Or you can do it by calendar intervals (per year, per quarter or per month, depending on your volume).
Merging results from several indexes is simple. You can also give users a chance to choose how far back in time they want to go in their search. Often people know that they are looking for something recent or something that happened a long time ago.
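For example, a minimal sketch with the App Engine Search API in Python, slicing by quarter. The index naming scheme and the merge step are simplified for illustration; a real merge would re-sort the combined hits by score or date:

```python
from google.appengine.api import search

def index_name_for(date):
    """One index per quarter, e.g. 'mail-2014-q2' (naming is illustrative)."""
    quarter = (date.month - 1) // 3 + 1
    return "mail-%d-q%d" % (date.year, quarter)

def index_message(msg_id, sender, body, date):
    """Add one message to the index for its time slice."""
    doc = search.Document(
        doc_id=msg_id,
        fields=[
            search.TextField(name="sender", value=sender),
            search.TextField(name="body", value=body),
            search.DateField(name="date", value=date),
        ],
    )
    search.Index(name=index_name_for(date)).put(doc)

def search_messages(query, index_names, limit=20):
    """Query each slice the user selected and merge the results."""
    hits = []
    for name in index_names:
        result = search.Index(name=name).search(query)
        hits.extend(result.results)
    return hits[:limit]
```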
File a feature request:
https://code.google.com/p/googleappengine/wiki/FilingIssues?tm=3
This was filed too, so maybe star it: https://code.google.com/p/googleappengine/issues/detail?id=10667

How can I combine similar tasks to reduce total workload?

I use App Engine, but the following problem could very well occur in any server application:
My application uses memcache to cache both large (~50 KB) and small (~0.5 KB) JSON documents which aggregate information that is expensive to refresh from the datastore. These JSON documents can change often, but the changes are sparse within a document (i.e., one item out of hundreds may change at a time). Currently, the application invalidates an entire document if something changes, and then lazily re-creates it later when it is needed. However, I want to move to a more efficient design which updates the particular value that changed directly in the cached JSON document.
One particular concern is contention from multiple tasks / request handlers updating the same document, but I have ways to detect this issue and mitigate it. However, my main concern is that it's possible that there could be rapid changes to a set of documents within a small period of time coming from different request handlers, and I don't want to have to edit the JSON document in the cache separately for each one. For example, it's possible that 10 small changes affecting the same set of 20 documents of 50 KB each could be triggered in less than a minute.
So this is my problem: What would be an effective solution to combine these changes together? In my old solution, although it is expensive to re-create an entire document when a small item changes, the benefit at least is that it does it lazily when it needs it (which could be a while later). However, to update the JSON document with a small change seems to require that it be done immediately (not lazily). That is, unless I come up with a complex solution that lazily applies a set of changes to the document later on. I'm hoping for something efficient but not too complicated.
Thanks.
Pull queue. Everyone using GAE should watch this video:
http://www.youtube.com/watch?v=AM0ZPO7-lcE
When a call comes in, update memcache and do an async_add to your pull queue. You could likely run a process that handles thousands of updates each minute without a lot of overhead (i.e. instance issues). You still have an issue should memcache get purged prior to your updates, but that is not too hard to work around. HTH. -stevep
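As a rough sketch of that flow (Python, classic App Engine APIs; the queue name and the apply_to_datastore helper are made up for illustration):

```python
import json
from google.appengine.api import memcache, taskqueue

QUEUE = taskqueue.Queue("json-updates")  # a pull queue defined in queue.yaml

def record_change(doc_key, field, value):
    """Called from request handlers: patch memcache and enqueue the change."""
    doc = memcache.get(doc_key) or {}
    doc[field] = value
    memcache.set(doc_key, doc)
    QUEUE.add(taskqueue.Task(
        payload=json.dumps({"doc": doc_key, "field": field, "value": value}),
        method="PULL"))

def drain_queue():
    """Run periodically: lease a batch, coalesce changes per document,
    write each document once, then delete the leased tasks."""
    tasks = QUEUE.lease_tasks(60, 1000)
    if not tasks:
        return
    merged = {}
    for task in tasks:
        change = json.loads(task.payload)
        merged.setdefault(change["doc"], {})[change["field"]] = change["value"]
    for doc_key, fields in merged.items():
        apply_to_datastore(doc_key, fields)  # hypothetical persistence helper
    QUEUE.delete_tasks(tasks)
```

The point is that ten rapid changes to the same document collapse into a single datastore write when the queue is drained.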

Google App Engine and data replication with multiple instances

I'm using the Google App Engine as the backend for an iOS game that was just released.
Through the act of playing the game, players create levels, which are then shared with their friends and the world at large. GAE is used to store and retrieve those levels. GAE also manages players' high scores, since they are more complex than Game Center can handle.
As a whole, GAE works great. I like how GAE spins up new instances as they are needed without me having to constantly monitor load. For this game, GAE is running around 10 instances and serving around 8 queries a second.
But there is a small problem.
I've noticed that sometimes players will get on the high score table twice. This should not be possible since I remove any old scores before putting up the new scores (this is done in one query to GAE).
After some testing and poking around, it seems that what is happening is that a player will get a high score and instance 1 handles the removing of the old score and the adding of the new one. The player then gets a new high score, but this time instance 4 is the one that handles the request and it doesn't know about the other score yet.
At their fastest, it might take a player 10 seconds to get a new high score. It was my understanding that the replication of data only took 2 or 3 seconds.
I never saw this problem during testing because load rarely caused 2 instances to be started.
Does this seem like a plausible explanation for what is happening and how data is stored for each instance?
Is there a way to guarantee that data added, deleted or altered in one instance will be available in another? High scores are not "mission critical", so I'm not too worried about it, but I would like to use GAE for some more complex situations where it is very important that data is consistent.
Is that possible with GAE, or should I be looking at other solutions?
It is possible to guarantee that data will be consistent across all data centers (strong consistency). You need to use ancestor queries to achieve it. However, doing so places a restriction on how many writes per second you can achieve. Currently the limit is about 1 write per second per entity group.
If the write limit is too slow for you, one alternative is to add a cache layer. So you will still be using the eventual consistency model, but you will mix those results with the ones in memcache.
See the doc Structuring for Strong Consistency for further details.
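For illustration, a minimal sketch with Python NDB, putting all of a player's scores into one entity group so ancestor queries always see the latest write (the model and property names are made up):

```python
from google.appengine.ext import ndb

class HighScore(ndb.Model):
    player = ndb.StringProperty()
    score = ndb.IntegerProperty()

def player_key(player_id):
    """All of a player's scores live in one entity group."""
    return ndb.Key("Player", player_id)

@ndb.transactional
def replace_score(player_id, new_score):
    """Delete old scores and write the new one atomically within the group."""
    parent = player_key(player_id)
    old = HighScore.query(ancestor=parent).fetch(keys_only=True)
    ndb.delete_multi(old)
    HighScore(parent=parent, player=player_id, score=new_score).put()

def top_scores_for(player_id):
    """Ancestor query: strongly consistent, so no stale duplicates appear."""
    q = HighScore.query(ancestor=player_key(player_id)).order(-HighScore.score)
    return q.fetch(10)
```

Because the remove-and-add happens in one transaction on one entity group, no instance can see (and re-add) the old score after the new one is written.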

What is the best way to do basic View tracking on a web page?

I have a web-facing, anonymously accessible blog directory and blogs, and I would like to track the number of views each of the blog posts receives.
I want to keep this as simple as possible; accuracy need only be an approximation. This is not for analytics (we have Google for that) and I don't want to do any log analysis to pull out the stats, as running background tasks in this environment is tricky and I want the numbers to be as fresh as possible.
My current solution is as follows:
A web control that simply records a view in a table for each GET.
Excludes a list of known web crawlers using a regex and UserAgent string
Provides for the exclusion of certain IP Addresses (known spammers)
Provides for locking down some posts (when the spammers come for it)
This actually seems to do a pretty good job, but a couple of things annoy me. The spammers still hit some posts, thereby skewing the views. I still have to manually monitor the views and update my list of "bad" IP addresses.
Does anyone have some better suggestions for me? Anyone know how the views on StackOverflow questions are tracked?
It sounds like your current solution is actually quite good.
We implemented one where the server code which delivered the view content also updated a database table which stored the URL (actually a special ID code for the URL since the URL could change over time) and the view count.
This was actually for a system with user-written posts that others could comment on but it applies equally to the situation where you're the only user creating the posts (if I understand your description correctly).
We had to do the following to minimise (not eliminate, unfortunately) skew.
For logged-in users, each user could only add one view point to a post. EVER. NO exceptions.
For anonymous users, each IP address could only add one view point to a post each month. This was slightly less reliable as IP addresses could be 'shared' (NAT and so on) from our point of view. The reason we relaxed the "EVER" requirement above was for this sharing reason.
The posts themselves were limited to having one view point added per time period (the period started low (say, 10 seconds) and gradually increased (to, say, 5 minutes) so new posts were allowed to accrue views faster, due to their novelty). This took care of most spam-bots, since we found that they tend to attack long after the post has been created.
Removal of a spam comment on a post, or a failed attempt to bypass CAPTCHA (see below), automatically added that IP to the blacklist and reduced the view count for that post.
If a blacklisted IP hadn't tried to leave a comment in N days (configurable), it was removed from the blacklist. This rule, and the previous rule, minimised the manual intervention in maintaining the blacklist, we only had to monitor responses for spam content.
CAPTCHA. This solved a lot of our spam problems, especially since we didn't just rely on OCR-type things (like "what's this word -> 'optionally'"); we actually asked questions (like "what's 2 multiplied by half of 8?") that break the dumb character recognition bots. It won't beat the hordes of cheap-labour CAPTCHA breakers (unless their maths is really bad :-) but the improvement over having no CAPTCHA was impressive.
Logged-in users weren't subject to CAPTCHA but spam got the account immediately deleted, IP blacklisted and their view subtracted from the post.
I'm ashamed to admit we didn't actually discount the web crawlers (I hope the client isn't reading this :-). To be honest, they're probably only adding a minimal number of view points each month due to our IP address rule (unless they're swarming us with multiple IP addresses).
So basically, I'm suggesting the following as possible improvements. You should, of course, always monitor how they go to see whether they're working or not.
CAPTCHA.
Automatic blacklist updates based on user behaviour.
Limiting view count increases from identical IP addresses.
Limiting view count increases to a certain rate.
No scheme you choose will be perfect (e.g., our one month rule) but, as long as all posts are following the same rule set, you still get a good comparative value. As you said, accuracy need only be an approximation.
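A rough sketch of rules 2 and 3 above (a per-IP window plus a per-post rate limit that relaxes as the post ages); this is plain Python with dicts standing in for whatever tables you actually use:

```python
import time

VIEW_WINDOW_SECONDS = 30 * 24 * 3600  # one view per IP per post per month
MIN_INTERVAL_START = 10                # new posts: at most one view per 10 s
MIN_INTERVAL_MAX = 300                 # old posts: at most one view per 5 min

def min_interval(post_age_seconds):
    """The allowed rate of new view points decays with post age."""
    days = post_age_seconds / 86400.0
    return min(MIN_INTERVAL_MAX, MIN_INTERVAL_START + days * 10)

def should_count_view(post, ip, last_seen, blacklist, now=None):
    """last_seen maps (post_id, ip) -> timestamp of that IP's last counted view."""
    now = now or time.time()
    if ip in blacklist:
        return False
    if now - post["last_view_counted_at"] < min_interval(now - post["created_at"]):
        return False
    previous = last_seen.get((post["id"], ip))
    if previous is not None and now - previous < VIEW_WINDOW_SECONDS:
        return False
    last_seen[(post["id"], ip)] = now
    post["last_view_counted_at"] = now
    return True
```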
Suggestions:
Move the hit count logic from a user control into a base Page class.
Redesign the exclusions list to be dynamically updatable (i.e. store it in a database or even in an xml file)
Record all hits. On a regular interval, have a cron job run through the new hits and determine whether they are included or excluded. If you do the exclusion for each hit, each user has to wait for the matching logic to take place.
Come up with some algorithm to automatically detect spammers/bots and add them to your blacklist. And/Or subscribe to a 3rd party blacklist.
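For the third suggestion, a small sketch of what the cron-job pass over the raw hit log might look like (the field names and the crawler regex are illustrative):

```python
import re

# Stand-in for the exclusion rules described above
CRAWLER_RE = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def process_new_hits(raw_hits, blacklisted_ips):
    """Cron-job pass: take the raw hits recorded per GET and decide, outside
    the request path, which of them actually count as views per post."""
    counts = {}
    for hit in raw_hits:
        if hit["ip"] in blacklisted_ips:
            continue
        if CRAWLER_RE.search(hit.get("user_agent", "")):
            continue
        counts[hit["post_id"]] = counts.get(hit["post_id"], 0) + 1
    return counts
```

Recording raw hits cheaply and filtering them in a batch keeps the matching logic out of every page request.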
