How can I address the 10GB limit on Google App Engine?

We are trying to index inboxes by sitting on top of Gmail, and are using the App Engine Search API, but we are hitting the 10 GB limit. This is because we are indexing the whole organization's email so we can search across the whole team's inboxes. How can we work around this? One option might be to keep an individual index per person and somehow combine the results manually, but we are worried that merging results could get really complex. What options are available?

This is a typical problem in any document retrieval system, and the solution is to slice the entire corpus into multiple buckets. You should choose a slicing strategy based on your requirements/usage pattern.
One possibility is to slice messages by their date. You keep adding messages to an index until you come close to the limit, at which point you start a new index for newer messages. Or you can do it by calendar intervals (per year, per quarter or per month, depending on your volume).
Merging results from several indexes is straightforward. You can also give users the choice of how far back in time they want to search; often people know whether they are looking for something recent or something that happened a long time ago.
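For example, here is a minimal Python sketch of the per-quarter slicing with the Search API; the index naming scheme and the field layout are assumptions for illustration, not the only way to do it:

```python
# A minimal sketch of date-sharded Search API indexes (assumed naming
# scheme: one index per calendar quarter).
from google.appengine.api import search

def index_name_for(date):
    # e.g. a message from 2014-05-02 lands in 'messages_2014_q2'
    quarter = (date.month - 1) // 3 + 1
    return 'messages_%d_q%d' % (date.year, quarter)

def add_message(doc_id, body, date):
    doc = search.Document(
        doc_id=doc_id,
        fields=[search.TextField(name='body', value=body),
                search.DateField(name='date', value=date)])
    search.Index(name=index_name_for(date)).put(doc)

def search_messages(query_string, index_names, limit=20):
    # Query the shards from newest to oldest and stop as soon as we
    # have enough hits; a "recent only" search never touches old shards.
    hits = []
    for name in index_names:
        results = search.Index(name=name).search(query_string)
        hits.extend(results.results)
        if len(hits) >= limit:
            break
    return hits[:limit]
```

The slicing works because the quota applies per index, so each shard gets its own headroom instead of the whole corpus sharing one ceiling.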

File a feature request:
https://code.google.com/p/googleappengine/wiki/FilingIssues?tm=3
A related issue has already been filed, so you may want to star it: https://code.google.com/p/googleappengine/issues/detail?id=10667

Related

Notion API Pagination Random Database Entry

I'm trying to retrieve a random entry from a database using the Notion API. There is a limit on how many entries you can retrieve at once, so pagination is used to sift through the pages 100 entries at a time. Since there is no database attribute telling you how many entries there are, you have to go through the pages in order until reaching the end in order to choose a random entry. This is fine for small databases, but I have a cron job that regularly chooses a random entry from a Notion database with thousands of entries. Additionally, if I make too many calls simultaneously, I risk being rate limited pretty often. Is there a better way to go about choosing a random value from a database that uses pagination? Thanks!
I don't think there is a better way to do it right now (sadly). If your entries don't change often, think about caching the pages; it saves you a lot of execution time in your cron job. As for the rate limit, if you use Node.js you can build a rate-limited queue (3 requests/second) pretty easily with something like bull.
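To illustrate the caching idea (sketched in Python here, though the same shape works in Node.js), here is a rough example; the token, database ID, and cache file path are placeholders, not real values:

```python
# A hedged sketch: walk the paginated Notion database once, cache the
# page IDs on disk, then pick randomly from the cache on later runs.
import json
import random
import requests

NOTION_TOKEN = 'secret_...'          # placeholder
DATABASE_ID = 'your-database-id'     # placeholder
HEADERS = {
    'Authorization': 'Bearer %s' % NOTION_TOKEN,
    'Notion-Version': '2022-06-28',
}

def fetch_all_page_ids():
    # Page through the database 100 entries at a time.
    url = 'https://api.notion.com/v1/databases/%s/query' % DATABASE_ID
    ids, cursor = [], None
    while True:
        payload = {'page_size': 100}
        if cursor:
            payload['start_cursor'] = cursor
        data = requests.post(url, headers=HEADERS, json=payload).json()
        ids.extend(page['id'] for page in data['results'])
        if not data.get('has_more'):
            return ids
        cursor = data['next_cursor']

def random_entry_id(cache_path='notion_ids.json'):
    # Use the cached ID list if present; rebuild it otherwise.
    try:
        with open(cache_path) as f:
            ids = json.load(f)
    except (IOError, ValueError):
        ids = fetch_all_page_ids()
        with open(cache_path, 'w') as f:
            json.dump(ids, f)
    return random.choice(ids)
```

With the cache in place, the cron job only pays the full pagination cost when the cache is rebuilt, e.g. on a schedule or when a lookup of the chosen ID fails.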

Strategy for caching of remote service; what should I be considering?

My web app contains data gathered from an external API over which I have no control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database, and each of these items is essentially a cached version; it takes 1 request to update the cache of 1 item. Obviously, it is not possible to have a perfectly up-to-date cache under these circumstances. So what should I be considering when developing a strategy for caching the data? These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of:
time since item was created (less time means more important)
number of 'likes' a particular item has (could mean higher probability of being viewed)
time since last updated
A few more details: the items are photos. Every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients (therefore they should take priority). Though I only have 250K items in the database now, that number is increasing rather rapidly (it will not be long until the 1 million mark is reached, maybe 5 months).
Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there's new (and maybe updated?) images for you to check out. Would that do the trick?
Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.
Anyway, to throw out some thoughts on standards for crawling:
Update the things that change often more frequently. So, if an item hasn't changed in the last five updates, you could assume it won't change as often and update it less.
Create a score for each image, and update the ones with the highest scores. Or the lowest scores (depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
A statistical model of the chance of an image being updated and needing to be recached.
An importance score for each image, using things like the recency of the image, or the currency of its event.
Update things that are being viewed frequently.
Update things that have many views.
Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? Slow down the frequency of checks of older images.
Allocate part of your requests to slowly updating everything, and split the rest across several different algorithms running simultaneously. So, for example, you might have the following (the numbers are for illustration only; I just pulled them out of a hat, and a rough sketch of this kind of scoring and budget split follows the list):
5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
2,500 requests processing new images (which you mentioned are more important)
2,500 requests processing images of current events
2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
2,500 requests processing images that have been viewed at least
Total: 15,000 requests per hour.
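Here is the promised sketch of one way to fold those signals into a score and spend the hourly budget on the highest scorers; the weights and the photo fields are assumptions pulled out of the air, not a tested model:

```python
# A rough priority model: score every cached photo, refresh the top ones.
import time

REQUESTS_PER_HOUR = 15000  # leave headroom under the 20,000 cap

def update_priority(photo, now=None):
    # Higher score == more urgent to refresh. 'photo' is assumed to be a
    # dict with created_at/last_updated timestamps and a few counters.
    now = now or time.time()
    hours_old = (now - photo['created_at']) / 3600.0
    hours_stale = (now - photo['last_updated']) / 3600.0
    score = 10.0 / (1.0 + hours_old)          # newer photos matter more
    score += 0.5 * hours_stale                # staleness builds urgency
    score += 0.01 * photo['views_last_hour']  # frequently viewed right now
    if photo['event_is_live']:
        score *= 3.0                          # current events take priority
    if photo['unchanged_checks'] >= 5:
        score *= 0.25                         # stable lately; back off
    return score

def pick_batch(photos, budget=REQUESTS_PER_HOUR):
    # Spend the hour's request budget on the highest-priority photos.
    return sorted(photos, key=update_priority, reverse=True)[:budget]
```

The nice property of a single ranking function is that recency, popularity, event currency, and change history all compete in one queue instead of each needing a separate crawler.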
How many (unique) photos/events are viewed on your site per hour? The photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events/photos? Old events might not be as popular, so perhaps they don't have to be checked that often.
andyg0808 has good, detailed information; however, it is important to know the patterns of your data usage before applying it in practice.
At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.

Google App Engine and data replication with multiple instances

I'm using the Google App Engine as the backend for an iOS game that was just released.
Through the act of playing the game, players create levels, which are then shared with their friends and the world at large. GAE is used to store and retrieve those levels. GAE also manages players' high scores, since they are more complex than Game Center can handle.
As a whole, GAE works great. I like how GAE spins up new instances as they are needed without me having to constantly monitor load. For this game, GAE is running around 10 instances and serving around 8 queries a second.
But there is a small problem.
I've noticed that sometimes players will get on the high score table twice. This should not be possible since I remove any old scores before putting up the new scores (this is done in one query to GAE).
After some testing and poking around, it seems that what is happening is that a player will get a high score and instance 1 handles the removing of the old score and the adding of the new one. The player then gets a new high score, but this time instance 4 is the one that handles the request and it doesn't know about the other score yet.
At their fastest, it might take a player 10 seconds to get a new high score. It was my understanding that the replication of data only took 2 or 3 seconds.
I never saw this problem during testing because load rarely caused 2 instances to be started.
Does this seem like a plausible explanation for what is happening and how data is stored for each instance?
Is there a way to guarantee that data added, deleted or altered in one instance will be available in another? High scores are not "mission critical", so I'm not too worried about it, but I would like to use GAE for some more complex situations where it is very important that data is consistent.
Is that possible with GAE, or should I be looking at other solutions?
It is possible to guarantee that data will be consistent across all data centers (strong consistency). You need to use ancestor queries to achieve it. However, doing so imposes a restriction on how many writes per second you can achieve; currently the limit is 1 write per second per entity group.
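As a minimal sketch of what this looks like with NDB (the Leaderboard parent key and the HighScore model are assumptions for illustration):

```python
# All scores live under one parent key, so ancestor queries over them
# are strongly consistent; the trade-off is that the whole entity group
# shares the ~1 write/sec limit.
from google.appengine.ext import ndb

class HighScore(ndb.Model):
    player_id = ndb.StringProperty()
    score = ndb.IntegerProperty()

LEADERBOARD_KEY = ndb.Key('Leaderboard', 'global')

@ndb.transactional
def replace_score(player_id, new_score):
    # Delete the player's old scores and write the new one atomically,
    # so no instance can ever observe two scores for the same player.
    old = HighScore.query(HighScore.player_id == player_id,
                          ancestor=LEADERBOARD_KEY).fetch(keys_only=True)
    ndb.delete_multi(old)
    HighScore(parent=LEADERBOARD_KEY, player_id=player_id,
              score=new_score).put()

def top_scores(n=10):
    # An ancestor query sees replace_score's writes immediately,
    # regardless of which instance handled the request.
    return HighScore.query(ancestor=LEADERBOARD_KEY).order(
        -HighScore.score).fetch(n)
```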
If the write limit is too slow for you, one alternative is to add a cache layer. So you will still be using the eventual consistency model, but you will mix those results with the ones in memcache.
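A rough sketch of that mixing, reusing the HighScore model from above (the key name and the 60-second window are arbitrary choices):

```python
# Keep very recent scores in memcache and merge them with the possibly
# stale datastore query at read time. Illustration only: the memcache
# read-modify-write below is not safe under concurrent writers.
from google.appengine.api import memcache

RECENT_KEY = 'recent_scores'

def record_score(player_id, score):
    # Plain (non-ancestor) write; other instances may not see it yet.
    HighScore(player_id=player_id, score=score).put()
    recent = memcache.get(RECENT_KEY) or []
    recent.append({'player_id': player_id, 'score': score})
    memcache.set(RECENT_KEY, recent[-100:], time=60)

def top_scores_mixed(n=10):
    rows = [{'player_id': s.player_id, 'score': s.score}
            for s in HighScore.query().order(-HighScore.score).fetch(n)]
    rows.extend(memcache.get(RECENT_KEY) or [])
    # Keep only each player's best score, then re-sort.
    best = {}
    for r in rows:
        if r['score'] > best.get(r['player_id'], {'score': -1})['score']:
            best[r['player_id']] = r
    return sorted(best.values(), key=lambda r: r['score'],
                  reverse=True)[:n]
```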
See the doc Structuring for Strong Consistency for further details.

Sunspot with Solr 3.5. Manually updating indexes for real time search

I'm working with Rails 3 and Sunspot Solr 3.5. My application uses Solr to index user-generated content and make it searchable for other users. The goal is to let users search this data as soon as possible after it is uploaded. I don't know if this qualifies as real-time search.
My application has two models
Posts
PostItems
I index posts by including data from their post items, so that when a user searches on the description provided in a post_item record, the corresponding post object is returned in the search results.
Users frequently update post_items, so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item will be available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index! # instantly updates the index and commits
which, according to this documentation, instantly updates the index and commits. This works, but is it the right way to handle indexing in this scenario? I read here that calling index while searching may break Solr. Also, frequent manual index calls are not the way to go.
Any suggestions on the right way to do this? Are there alternatives other than switching to ElasticSearch?
Try the sunspot_index_queue gem: https://github.com/bdurand/sunspot_index_queue
You will then be able to batch reindex, let's say, every minute, and it definitely will not break the index.
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. Our conclusion was that Solr was built/optimized for indexing huge documents (Word/PDF content) in large numbers (billions?), with the index updated once a day or every couple of days while nobody is searching.
It was the wrong choice for a consumer Rails application where documents are small and numerous (in the millions), updates are random and continuous, and the search needs to be somewhat real-time (a delay of 5-10 sec is fine).
Some of the tricks we applied to tune the server:
removed all commits (i.e., !) from the Rails code,
use Solr auto-commit every 5/20 seconds,
have master/slave configuration,
run index optimization(on Master) every 1 hour
and more.
And we still see high CPU usage on the slaves when a commit triggers. As a result, some searches take a long time (> 60 seconds at times).
I also doubt that the batched indexing of the sunspot_index_queue gem can remedy the high CPU issue.

Dumping Twitter Streaming API tweets as-is to Apache Cassandra for post-processing

I am using the Twitter Streaming API to monitor several keywords/users. I am planning to dump the tweet JSON strings I get from Twitter directly, as-is, into a Cassandra database and do post-processing on them later.
Is such a design practical? Will it scale up when I have millions of tweets?
Things I will do later include getting top followed users, top hashtags etc. I would like to save the stream as is for mining them later for any new information that I may not know of now.
What is important is not so much the number of tweets as the rate at which they arrive. Cassandra can easily handle thousands of writes per second, which should be fine (Twitter currently generates around 1200 tweets per second in total, and you will probably only get a small fraction of those).
However, tweets per second are highly variable. In the aftermath of a heavy spike in writes, you may see some slowdown in range queries. See the Acunu blog posts on Cassandra under heavy write load part i and part ii for some discussion of the problem and ways to solve it.
In addition to storing the raw JSON, I would extract some common features that you are almost certain to need, such as the user ID and the hashtags, and store those separately as well. This will save you a lot of processing effort later on.
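For instance, here is a hedged sketch with the DataStax Python driver; the keyspace, the table schemas, and the exact fields extracted are assumptions for illustration:

```python
# Store the raw JSON as-is, and also index the user ID and hashtags in
# separate tables so later queries don't have to re-parse every tweet.
import json
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('tweets_ks')  # assumed keyspace

insert_raw = session.prepare(
    'INSERT INTO raw_tweets (tweet_id, body) VALUES (?, ?)')
insert_user = session.prepare(
    'INSERT INTO tweets_by_user (user_id, tweet_id) VALUES (?, ?)')
insert_tag = session.prepare(
    'INSERT INTO tweets_by_hashtag (hashtag, tweet_id) VALUES (?, ?)')

def store_tweet(raw_json):
    tweet = json.loads(raw_json)
    tweet_id = tweet['id_str']
    session.execute(insert_raw, (tweet_id, raw_json))
    session.execute(insert_user, (tweet['user']['id_str'], tweet_id))
    for tag in tweet.get('entities', {}).get('hashtags', []):
        session.execute(insert_tag, (tag['text'].lower(), tweet_id))
```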
Another factor to consider is planning for how the stored data will grow over time. Cassandra can scale very well, but you need a strategy for keeping the load balanced across your cluster and for adding nodes as your database grows. Adding nodes can be a painful experience if you haven't planned how to allocate tokens to new nodes in advance. Waiting until you have an overloaded node before adding a new one is a good way to make your cluster fall down.
You can easily store millions of tweets in Cassandra.
For processing the tweets and getting stats such as top followed users and top hashtags, look at Brisk from DataStax, which builds on top of Cassandra.
