AppEngine task to create many entities - google-app-engine

I'm building an application for people attending events. I need to create a Ticket entity for a subset of Person entities, for a specific Event. The number of people may exceed 50,000.
Obviously I can't just do a for-loop where I iterate over a Person query and generate these Tickets.
How do I structure this on App Engine? Is there a way to take advantage of MapReduce?

You may want to take a look at the Deferred library. You can spool up a bunch of tasks in parallel to do the work that you want. You may also want to look at the Mapper example class in the Google docs, which may help push you in the right direction.
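A minimal sketch of that idea, assuming hypothetical Person and Ticket ndb models (the event and person properties are placeholders): each deferred task writes one batch and chains the next one through a query cursor.

from google.appengine.ext import deferred, ndb
from google.appengine.datastore.datastore_query import Cursor

BATCH_SIZE = 500  # keep each task comfortably under the task deadline

def create_tickets(event_key, cursor_str=None):
    # Person and Ticket are hypothetical models
    cursor = Cursor(urlsafe=cursor_str) if cursor_str else None
    people, next_cursor, more = Person.query(Person.event == event_key).fetch_page(
        BATCH_SIZE, start_cursor=cursor)
    ndb.put_multi([Ticket(person=p.key, event=event_key) for p in people])
    if more:
        # chain the next batch as a new deferred task
        deferred.defer(create_tickets, event_key, next_cursor.urlsafe())

# kick it off from a request handler with:
# deferred.defer(create_tickets, event.key)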

You can iterate in a single for loop if you use a Backend, where a request can run for a long time. But such long-running processes are not a good thing in my opinion. I guess proper use of task queues is more than enough.
I read about the Deferred library. Sometimes it behaves strangely, and pickling your data can introduce some headaches. I would go for the Task Queue API.

I do not suggest the Deferred library. Although it makes the code very easy to write, the disadvantage is that it pickles your data, puts it into one entity, and loads and unpickles it later, which costs a lot of overhead. Putting 30K entities that way cost me about 3 CPU hours!
The cheapest way is to just use the Task Queue: split the Person entities into batches and enqueue tasks with keys or other positional information (see the sketch after this answer). Inserting the same 30K entities this way used less than 1 CPU hour.
As for your question: fetching a million entities and iterating over them is fast, thanks to the datastore's design, so just do it. The slowest part is storing the new Ticket entities.
BTW, why not just Person.all().filter("something like attending events")?
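A rough sketch of that plain Task Queue approach, passing only the cursor and keys as task params instead of a pickled payload (the worker URL and models are again placeholders):

import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb
from google.appengine.datastore.datastore_query import Cursor

class TicketWorker(webapp2.RequestHandler):
    def post(self):
        event_key = ndb.Key(urlsafe=self.request.get('event'))
        cursor_arg = self.request.get('cursor')
        cursor = Cursor(urlsafe=cursor_arg) if cursor_arg else None
        # Person and Ticket are hypothetical models
        people, next_cursor, more = Person.query(Person.event == event_key).fetch_page(
            500, start_cursor=cursor)
        ndb.put_multi([Ticket(person=p.key, event=event_key) for p in people])
        if more:
            # enqueue the next batch against the same worker URL
            taskqueue.add(url='/tasks/create_tickets', params={
                'event': event_key.urlsafe(),
                'cursor': next_cursor.urlsafe(),
            })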

Related

Notion API Pagination Random Database Entry

I'm trying to retrieve a random entry from a database using the Notion API. There is a limit on how many entries you can retrieve at once, so pagination is used to sift through the results 100 entries at a time. Since there is no database attribute telling you how long the database is, you have to page through the results in order until you reach the end before you can choose a random entry. This is fine for small databases, but I have a cron job that regularly chooses a random entry from a Notion database with thousands of entries. Additionally, if I make too many calls simultaneously I risk being rate limited pretty often. Is there a better way to go about choosing a random value from a database that uses pagination? Thanks!
I don't think there is a better way to do it right now (sadly). If your entries don't change often, think about caching the page results; that saves you a lot of execution time in your cron job. For the rate limit, if you use Node.js, you can build a rate-limited queue (3 requests per second) pretty easily with something like bull.
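A rough Python sketch of the cache-the-pagination idea (the token, database ID, and cache file are placeholders, and the endpoint details are as I remember them from the Notion API docs):

import json
import random
import requests

NOTION_TOKEN = 'secret_xxx'        # placeholder
DATABASE_ID = 'your-database-id'   # placeholder
HEADERS = {
    'Authorization': 'Bearer %s' % NOTION_TOKEN,
    'Notion-Version': '2022-06-28',
    'Content-Type': 'application/json',
}

def fetch_all_page_ids():
    # walk every page of results once, 100 entries at a time
    ids, cursor = [], None
    while True:
        body = {'page_size': 100}
        if cursor:
            body['start_cursor'] = cursor
        resp = requests.post(
            'https://api.notion.com/v1/databases/%s/query' % DATABASE_ID,
            headers=HEADERS, json=body).json()
        ids.extend(page['id'] for page in resp['results'])
        if not resp.get('has_more'):
            return ids
        cursor = resp['next_cursor']

def random_entry(cache_path='notion_ids.json'):
    # reuse the cached ID list so the cron job rarely has to re-paginate
    try:
        with open(cache_path) as f:
            ids = json.load(f)
    except IOError:
        ids = fetch_all_page_ids()
        with open(cache_path, 'w') as f:
            json.dump(ids, f)
    return random.choice(ids)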

GAE transaction failure and idempotency

The Google App Engine documentation contains this paragraph:
Note: If your application receives an exception when committing a transaction, it does not always mean that the transaction failed. You can receive DatastoreTimeoutException, ConcurrentModificationException, or DatastoreFailureException exceptions in cases where transactions have been committed and eventually will be applied successfully. Whenever possible, make your Datastore transactions idempotent so that if you repeat a transaction, the end result will be the same.
Wait, what? It seems like there's a very important class of transactions that just simply cannot be made idempotent because they depend on current datastore state. For example, a simple counter, as in a like button. The transaction needs to read the current count, increment it, and write out the count again. If the transaction appears to "fail" but doesn't REALLY fail, and there's no way for me to tell that on the client side, then I need to try again, which will result in one click generating two "likes." Surely there is some way to prevent this with GAE?
Edit:
It seems that this is a problem inherent in distributed systems, as per none other than Guido van Rossum -- see this link:
app engine datastore transaction exception
So it looks like designing idempotent transactions is pretty much a must if you want a high degree of reliability.
I was wondering if it was possible to implement a global system across a whole app for ensuring idempotency. The key would be to maintain a transaction log in the datastore. The client would generate a GUID and include that GUID with the request (the same GUID would be re-sent on retries of the same request). On the server, at the start of each transaction, it would look in the datastore for a record in the Transactions entity group with that ID. If it found one, then this is a repeated transaction, so it would return without doing anything.
Of course this would require enabling cross-group transactions, or having a separate transaction log as a child of each entity group. Also, there would be a performance hit if failed key lookups are slow, because almost every transaction would include a failed lookup (most GUIDs would be new).
The additional dollar cost of the extra datastore interactions would probably still be less than making every transaction idempotent by other means, since that would require a lot of checking of what's in the datastore at every level.
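A sketch of that idea in Python ndb, using the "transaction log as a child of each entity group" variant so no cross-group transaction is needed (the model and the like_count property are invented):

from google.appengine.ext import ndb

class TxRecord(ndb.Model):
    # keyed by the client-supplied GUID; its existence means "already applied"
    pass

@ndb.transactional
def apply_once(guid, target_key):
    # the log record lives in the same entity group as the target entity
    if TxRecord.get_by_id(guid, parent=target_key) is not None:
        return  # repeated request: do nothing
    TxRecord(id=guid, parent=target_key).put()
    thing = target_key.get()
    thing.like_count += 1  # the actual, non-idempotent mutation
    thing.put()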
dan wilkerson, simon goldsmith, et al. designed a thorough global transaction system on top of app engine's local (per entity group) transactions. at a high level, it uses techniques similar to the GUID one you describe. dan dealt with "submarine writes," ie the transactions you describe that report failure but later surface as succeeded, as well as many other theoretical and practical details of the datastore. erick armbrust implemented dan's design in tapioca-orm.
i don't necessarily recommend that you implement his design or use tapioca-orm, but you'd definitely be interested in the research.
in response to your questions: plenty of people implement GAE apps that use the datastore without idempotency. it's only important when you need transactions with certain kinds of guarantees like the ones you describe. it's definitely important to understand when you do need them, but you often don't.
the datastore is implemented on top of megastore, which is described in depth in this paper. in short, it uses multi-version concurrency control within each entity group and Paxos for replication across datacenters, both of which can contribute to submarine writes. i don't know if there are public numbers on submarine write frequency in the datastore, but if there are, searches with these terms and on the datastore mailing lists should find them.
amazon's S3 isn't really a comparable system; it's more of a CDN than a distributed database. amazon's SimpleDB is comparable. it originally only provided eventual consistency, and eventually added a very limited kind of transactions they call conditional writes, but it doesn't have true transactions. other NoSQL databases (redis, mongo, couchdb, etc.) have different variations on transactions and consistency.
basically, there's always a tradeoff in distributed databases between scale, transaction breadth, and strength of consistency guarantees. this is best known by eric brewer's CAP theorem, which says the three axes of the tradeoff are consistency, availability, and partition tolerance.
The best way I came up with for making counters idempotent is to use a set instead of an integer in order to count. Thus, when a person "likes" something, instead of incrementing a counter I add the like to the thing, like this:
import java.util.HashSet;
import java.util.Set;

class Thing {
    private Set<User> likes = new HashSet<>();

    public void like(User u) {
        likes.add(u);  // adding the same user again is a no-op
    }

    public Integer getLikeCount() {
        return likes.size();
    }
}
This is in Java, but I hope you get my point even if you are using Python.
This method is idempotent: you can add the same user as many times as you like, and it will only be counted once. Of course, it has the penalty of storing a potentially huge set instead of a simple counter. But hey, don't you need to keep track of the likes anyway? If you don't want to bloat the Thing object, create another object, ThingLikes, and cache the like count on the Thing object.
another option worth looking into is app engine's built in cross-group transaction support, which lets you operate on up to five entity groups in a single datastore transaction.
if you prefer reading on stack overflow, this SO question has more details.
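in python ndb that's just a flag on the transaction decorator; a minimal sketch with a hypothetical Account model split across two entity groups:

from google.appengine.ext import ndb

@ndb.transactional(xg=True)  # allow up to five entity groups in this transaction
def transfer(debit_key, credit_key, amount):
    # debit_key and credit_key live in different entity groups
    debit, credit = debit_key.get(), credit_key.get()
    debit.balance -= amount
    credit.balance += amount
    ndb.put_multi([debit, credit])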

What are the low hanging fruit for optimizing google app engine with respect to quota usage?

Everyone learns to use Memcache pretty quick. Another one I've learned recently is setting indexed=False for Model properties that I am not going to query against. What are some others? What are the big ones?
Don't use offset in queries. Use cursors instead.
Explanation: offset loads (and charges you for) all entities up to offset+limit, but only returns limit entities.
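For example, paging with NDB cursors looks roughly like this (Article is a hypothetical model):

from google.appengine.ext import ndb
from google.appengine.datastore.datastore_query import Cursor

def article_page(urlsafe_cursor=None, page_size=20):
    cursor = Cursor(urlsafe=urlsafe_cursor) if urlsafe_cursor else None
    articles, next_cursor, more = Article.query().order(-Article.created).fetch_page(
        page_size, start_cursor=cursor)
    # hand next_cursor.urlsafe() to the client for the "next page" link
    return articles, next_cursor.urlsafe() if more else None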
Minimize instance use by tweaking idle instances and pending latency appropriately for your app.
A couple of things helped us (not all may be low-hanging at first). First, we denormalized our datastore to reduce joins (I'm using SQL terms because I came from a SQL background). By spreading commonly queried elements around, we reduced the number of reads we had to make considerably, even after factoring in Memcache (a small sketch follows). This potentially increases writes, but for most apps the number of reads far outweighs the number of writes.
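Here is what that can look like with hypothetical ndb models: copy the commonly displayed author fields onto each post at write time, so listing posts needs no extra gets.

from google.appengine.ext import ndb

class Post(ndb.Model):
    author_key = ndb.KeyProperty(kind='Author')
    # denormalized copies of the author fields shown in post listings
    author_name = ndb.StringProperty()
    author_avatar_url = ndb.StringProperty()
    body = ndb.TextProperty()

The cost is rewriting those copies when the author's profile changes, which is usually the much rarer operation.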
Next, we started using task queues, backends, and the channel API more often. I don't remember specific examples but I do remember we were able to reduce our front-end usage down below the free quota mark by moving some processing around to queues and backends and by sending data down via channel rather than having the client poll.
Also, we use objectify for our data access which we configure to automatically use memcache wherever appropriate.

Google App Engine low memcache performance

Memcache is one of those things where the solution could be absolutely anything, and no one ever really gives a decent answer, maybe because there is none. So I'm not looking for a direct answer, but maybe just something to get me going in the right direction.
For a typical request, here is my AppStats info:
So, out of a total 440 ms request, I spend 342 ms in memcache. And here I figured memcache was supposed to be lightning fast. I must be doing something wrong.
Looking at my memcache statistics in my admin console, I have this:
Hit count: 3848
Miss count: 21382
Hit ratio: 15%
I'm no expert on this stuff, but I'm pretty sure 15% is terrible.
The typical request above is a bit too detailed to explain, but basically, I create and put a new entity, which also updates and puts a parent entity, which also updates and puts any users associated with the parent entity.
Throughout all this, I always get by key, never query. Oh and I'm using NDB, so all the basic memcache stuff is handled automatically. So I never actually touch memcache manually on my own in my code.
Any ideas?
Edit: Here is the breakdown of my request
So I only have 2 datastore gets and 2 puts. The rest is automatically handled memcache stuff. Why is it doing so much work? Would I be better off handling this stuff manually?
Let's take a closer look at your data. Seven memcache writes took as much time as two datastore writes, which actually shows that memcache is about 3.5 times faster than the Datastore.
If a typical request to your application requires updating at least three datastore entities, followed by an update of more entities (the associated users), you can't make this operation "lightning fast." Memcache helps when you read entries much more frequently than you write them. If the numbers of reads and writes to a User's record are on par, you should consider turning the cache off for this model (see the sketch below).
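If you do decide to turn caching off for a write-heavy model, NDB lets you do it per model class; as far as I recall, the flags look like this (the name property is just a placeholder):

from google.appengine.ext import ndb

class User(ndb.Model):
    _use_memcache = False  # skip the shared memcache for this kind
    _use_cache = True      # keep the cheap per-request in-context cache
    name = ndb.StringProperty()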
You can also try asynchronous operations and task queues. From your description, it looks like you first update the entity, and only update its parent after that update completes, because it feels natural. You could run these concurrently; this will probably require some refactoring, but it's worth it.
Second, updating "all the associated users" can perhaps be deferred to a task spawned in the background; Task Queues have a very convenient interface for this. The "associated users" won't be updated immediately, but they probably don't need to be, and the latency of your request will be lower.
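A sketch of both ideas together, with made-up model and property names: write the entity and its parent concurrently with async puts, and push the fan-out to the associated users into a deferred task.

from google.appengine.ext import deferred, ndb

def update_associated_users(parent_key):
    # runs later, in a task queue, outside the user-facing request
    users = User.query(User.parent_ref == parent_key).fetch()
    for user in users:
        user.needs_refresh = True
    ndb.put_multi(users)

def handle_update(entity, parent):
    futures = ndb.put_multi_async([entity, parent])  # both writes run concurrently
    deferred.defer(update_associated_users, parent.key)
    ndb.Future.wait_all(futures)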

Dumping Twitter Streaming API tweets as-is to Apache Cassandra for post-processing

I am using the Twitter Streaming API to monitor several keywords/users. I am planning to dump the tweet JSON strings I get from Twitter directly, as-is, into a Cassandra database and do post-processing on them later.
Is such a design practical? Will it scale up when I have millions of tweets?
Things I will do later include getting the top followed users, top hashtags, etc. I would like to save the stream as-is so I can mine it later for any new information that I may not know of now.
What is important is not so much the number of tweets as the rate at which they arrive. Cassandra can easily handle thousands of writes per second, which should be fine (Twitter currently generates around 1200 tweets per second in total, and you will probably only get a small fraction of those).
However, tweets per second are highly variable. In the aftermath of a heavy spike in writes, you may see some slowdown in range queries. See the Acunu blog posts on Cassandra under heavy write load part i and part ii for some discussion of the problem and ways to solve it.
In addition to storing the raw json, I would extract some common features that you are almost certain to need, such as the user ID and the hashtags, and store those separately as well. This will save you a lot of processing effort later on.
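A rough sketch of that with the DataStax Python driver (the keyspace, table names, and schema are invented; adapt them to your own data model):

import json
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('twitter')  # hypothetical keyspace

insert_raw = session.prepare(
    "INSERT INTO raw_tweets (tweet_id, body) VALUES (?, ?)")
insert_tag = session.prepare(
    "INSERT INTO tweets_by_hashtag (hashtag, tweet_id) VALUES (?, ?)")

def store_tweet(raw_json):
    tweet = json.loads(raw_json)
    session.execute(insert_raw, (tweet['id_str'], raw_json))
    # denormalize features you already know you will query on
    for tag in tweet.get('entities', {}).get('hashtags', []):
        session.execute(insert_tag, (tag['text'].lower(), tweet['id_str']))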
Another factor to consider is how the stored data will grow over time. Cassandra can scale very well, but you need a strategy for keeping the load balanced across your cluster and for adding nodes as your database grows. Adding nodes can be a painful experience if you haven't planned how to allocate tokens to the new nodes in advance. Waiting until you have an overloaded node before adding a new one is a good way to make your cluster fall down.
You can easily store millions of tweets in Cassandra.
For processing the tweets and getting stats such as top followed users and top hashtags, look at Brisk from DataStax, which builds on top of Cassandra.
