App Engine data pipelines talk: for a fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at Google I/O:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? That is, is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map

A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
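As a rough illustration of the continuation idea above, here is a minimal sketch in plain Python. It is not the ShardedForkJoinQueue code: the list offset stands in for a Datastore query cursor, the deque stands in for the task queue, and the names run_worker and drain are made up for this example. It runs the "tasks" one after another, whereas real continuations would run in parallel.

```python
from collections import deque

BATCH_SIZE = 10  # the "reasonably sized chunk" from the text


def run_worker(items, cursor, task_queue, results):
    """Process one batch starting at `cursor`; hand the rest to a continuation."""
    batch = items[cursor:cursor + BATCH_SIZE]
    next_cursor = cursor + len(batch)
    # Enqueue the continuation first, so that in a real system it could
    # start while this batch is still being applied.
    if next_cursor < len(items):
        task_queue.append(next_cursor)
    results.append(batch)


def drain(items):
    """Run continuation tasks until no work is left (sequentially, for the sketch)."""
    task_queue = deque([0])  # stand-in for the push queue, seeded with one task
    results = []
    while task_queue:
        run_worker(items, task_queue.popleft(), task_queue, results)
    return results


batches = drain(list(range(25)))
print([len(b) for b in batches])  # [10, 10, 5]
```

Because each continuation starts exactly where the previous batch ended, no two workers touch the same item, and the number of workers follows from the amount of queued work rather than from static configuration.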
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
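To make the contrast concrete, here is a small sketch of the two key styles. The helper name work_index, the shard name "terrain" and the exact key layout are assumptions for illustration; the real fork_join_queue.py may build its index values differently.

```python
import hashlib


def work_index(shard_name: str, sequence_number: int) -> str:
    """Hypothetical helper: hash the sequence number so that successive
    batches land far apart in a lexically sorted index."""
    raw = f"{shard_name}:{sequence_number}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


# Monotonic keys share a long prefix from one insert to the next,
# so neighbouring writes sort next to each other in the index.
sequential = [f"shard_number=1|applied=False|seq={n:08d}" for n in range(4)]

# Hashed work indexes for the same sequence numbers have no such ordering.
hashed = [work_index("terrain", n) for n in range(4)]

for plain, scattered in zip(sequential, hashed):
    print(plain, "->", scattered[:12])
```

The sequential keys come out already sorted, which is exactly what concentrates writes on one tablet; the hashed values are deterministic per sequence number but spread across the keyspace.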

One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!

Related

How to use Google Cloud Memcache to save/update unique items

In my application I run a cron job that loops over all users (2,500 users) and chooses an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user takes a unique item that wasn't taken by anyone else, so the relation is one-to-one.
To achieve this, the cron job has to loop over the users one by one, sequentially: pick the item for a user, remove it from the list (so it can't be chosen by the next users), then move on to the next user.
The number of users and items in my system is growing every day, and this cron job now takes 2 hours to assign items to all users.
I need to improve this. One thing I thought about is using threads, but I can't do that since I'm using automatic scaling, so I started thinking about push queues. When the cron job runs, it would loop like this:
for (User user : users) {
    getMyItem(user.getId());
}
where getMyItem pushes a task to a servlet that handles it and chooses the best item for this person based on their data.
If I do that, what is the best/most robust way to avoid assigning an item to more than one user?
Since I'm using basic scaling and 8 instances, I can't rely on static variables.
One idea that came to mind is to create a table in the DB that accepts only unique items and insert the taken items into it. If the insertion succeeds, nobody else took this item, so I can assign it to that person. But this lowers performance a bit, because I need a DB write operation with every call (I want to avoid that).
I also thought about Memcache. It's really fast but not robust enough: if I save a Set of items in it (which accepts only unique items) and more than one thread tries to update this Set at the same time, only one thread will be able to save its data, and the other threads' data might be overwritten and lost.
I hope you can help me find a solution for this problem. Thanks in advance :)
First - I would advise against using solely memcache for such an algorithm - the key thing to remember about memcache is that it is volatile and might disappear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the cache's LRU policy. Changes in the cache configuration or datacenter maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache prior to expiration for reasons other than memory pressure. While memcache is resilient to server failures, memcache values are not saved to disk, so a service failure can cause values to become unavailable.
I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:
- to query for unassigned items when you want to make assignments
- to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
- to be certain that an item can uniquely be assigned to only a single user
- to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
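A minimal sketch of the "assigned" idea, using an in-memory stand-in rather than the Datastore: the class ItemStore, its methods and the lock (which plays the role of a transaction) are all assumptions for illustration. The point it shows is that the check "is assigned still unset?" and the write happen atomically, so a stale query result cannot cause a double assignment.

```python
import threading


class ItemStore:
    """In-memory stand-in for the item entities; the lock plays the role
    of a datastore transaction."""

    def __init__(self, item_ids):
        self._assigned = {item_id: None for item_id in item_ids}
        self._lock = threading.Lock()

    def unassigned(self):
        # Like an eventually consistent query: the result may be stale
        # by the time the caller acts on it.
        return [i for i, user in self._assigned.items() if user is None]

    def claim(self, item_id, user_id):
        """Set `assigned` only if it is still unset; True means this caller won."""
        with self._lock:
            if self._assigned[item_id] is not None:
                return False
            self._assigned[item_id] = user_id
            return True


def assign_item(store, user_id, candidates):
    """Try candidates in order and return the first item successfully claimed."""
    for item_id in candidates:
        if store.claim(item_id, user_id):
            return item_id
    return None


store = ItemStore(["a", "b"])
stale = store.unassigned()                # every worker sees the same stale list
first = assign_item(store, "u1", stale)   # claims "a"
second = assign_item(store, "u2", stale)  # "a" is taken, falls through to "b"
third = assign_item(store, "u3", stale)   # nothing left
```

In this sketch the per-user selection logic is reduced to "first free candidate"; a real implementation would order the candidates by whatever user info drives the choice.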
As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.
To get a general idea of how such a solution works, take a peek at Google appengine: Task queue performance (it's Python, but the general idea is the same).
If you are planning to run push jobs from a cron and want the jobs to update key-value pairs as an add-on to improve speed and performance, you can split the users and the items into multiple key-to-list-of-values pairs. Each push job then picks a key at random (you need to write the logic for picking one key out of 4 or 5), removes an item from that key's list, and writes the key back. Take a lock before doing this part. Example of key-value pairs:
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]

Paging of frequently changing data

I'm developing a web application which displays a list of, let's say, "threads". The list can be sorted by the number of likes a thread has. There can be thousands of threads in one list.
The application needs to work in a scenario where the likes of a thread can change more than 10 times per second. The application is furthermore distributed over multiple servers.
I can't figure out an efficient way to enable paging for this sort of list, and I can't transmit the whole list sorted by likes to a user at once.
As soon as a user goes to page 2 of this list, the list has likely changed and may contain threads already shown on page 1.
Solutions which don't work:
- Storing the seen threads on the client side (could be too many on mobile)
- Storing the seen threads on the server side (too many users and threads)
- Snapshotting the list in a temp database table (the data changes too frequently and needs to be current)
(If it matters, I'm using MongoDB + C#.)
How would you solve this kind of problem?
Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.
So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-02 17:30". That makes your query super easy to implement. Then, when you pull down all the next elements, you just look for anything that occurred before the last element on the page.
The downside of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. It's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store a timestamp for every record in your thread list, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.
It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.
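A small sketch of timestamp-based paging, under the assumption that each thread carries a Unix creation timestamp. The names Thread and page_before are made up for this example, and the in-memory filter stands in for a database query of the form "created < X ORDER BY created DESC LIMIT N".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    title: str
    created: int  # Unix timestamp in seconds


def page_before(threads, before, page_size):
    """Return the newest `page_size` threads created strictly before `before`,
    plus the timestamp to use for the next page (None when exhausted)."""
    older = sorted((t for t in threads if t.created < before),
                   key=lambda t: t.created, reverse=True)
    page = older[:page_size]
    next_before = page[-1].created if len(older) > page_size else None
    return page, next_before


threads = [Thread(f"t{n}", 1407050000 + n) for n in range(5)]
page1, cursor1 = page_before(threads, 1407051000, 2)

# A thread created while the user is reading page 1 does not shift page 2.
threads.append(Thread("new", 1407050500))
page2, cursor2 = page_before(threads, cursor1, 2)
print([t.title for t in page1], [t.title for t in page2])
```

One caveat of this sketch: with a strict "before" comparison, threads sharing the exact timestamp of the last element on a page would be skipped, so a real implementation would likely add a tie-breaker such as the record id.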
This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. They may be too heavy for this application, but here's a summary in case someone else needs it.
OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.
You start off with a "Time" table:
Time {
    timestamp long (PK)
    created datetime
    last_queried datetime
}
This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.
Now we can start talking about "Threads":
Threads {
    id long (PK)
    identifier long
    last_modified datetime
    title string
    body string
    score int
}
The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no uniqueness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.
Now, how do we query this? You can't just directly query Threads. You need to include a star table:
ThreadStar {
    timestamp long (FK -> Time.timestamp)
    thread_id long (FK -> Threads.id)
    thread_identifier long (matches Threads[thread_id].identifier)
    (timestamp, thread_identifier should be unique)
}
This table gives you a mapping from what time it is to what the state of all of the threads is. Given a specific timestamp you can get the state of a thread by doing:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
AND Threads.identifier = {thread_identifier}
That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
ORDER BY Threads.score DESC
Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.
So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT; the only difference is whether you use an existing identifier or create a new one.
So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.
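The insert-then-copy step can be sketched with SQLite from the Python standard library. This is a simplified illustration, not a production schema: the created, last_queried, last_modified and body columns are omitted, timestamps are small integers, and the helpers write_thread and titles_at are names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (timestamp INTEGER PRIMARY KEY);
CREATE TABLE Threads (id INTEGER PRIMARY KEY AUTOINCREMENT,
                      identifier INTEGER, title TEXT, score INTEGER);
CREATE TABLE ThreadStar (timestamp INTEGER, thread_id INTEGER,
                         thread_identifier INTEGER,
                         UNIQUE (timestamp, thread_identifier));
""")


def write_thread(ts, identifier, title, score):
    """Insert a new row version, then copy the latest snapshot forward to `ts`."""
    prev = conn.execute("SELECT MAX(timestamp) FROM Time").fetchone()[0]
    conn.execute("INSERT INTO Time (timestamp) VALUES (?)", (ts,))
    cur = conn.execute(
        "INSERT INTO Threads (identifier, title, score) VALUES (?, ?, ?)",
        (identifier, title, score))
    new_id = cur.lastrowid
    if prev is not None:
        # Carry over every mapping except the thread being replaced.
        conn.execute(
            "INSERT INTO ThreadStar "
            "SELECT ?, thread_id, thread_identifier FROM ThreadStar "
            "WHERE timestamp = ? AND thread_identifier != ?",
            (ts, prev, identifier))
    conn.execute("INSERT INTO ThreadStar VALUES (?, ?, ?)",
                 (ts, new_id, identifier))


def titles_at(ts):
    """List thread titles by score as they were in the snapshot `ts`."""
    rows = conn.execute(
        "SELECT Threads.title FROM Threads "
        "JOIN ThreadStar ON Threads.id = ThreadStar.thread_id "
        "WHERE ThreadStar.timestamp = ? ORDER BY Threads.score DESC", (ts,))
    return [r[0] for r in rows]


write_thread(1, 100, "alpha", 5)
write_thread(2, 200, "beta", 3)
write_thread(3, 200, "beta", 9)  # beta overtakes alpha in the newest snapshot

print(titles_at(2))  # ['alpha', 'beta']
print(titles_at(3))  # ['beta', 'alpha']
```

A reader who started paging at snapshot 2 keeps seeing alpha ahead of beta, even though the newest snapshot has them the other way around.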
You don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.
Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budget, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).
Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.
This general mechanism also works if you have more nested data, but your queries get harder.
Final note: you'll find that your inserts get really slow because of the sheer amounts of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!
Yeah. Make sure your tests are solid.
But at least you aren't sensitive to re-ordered data mid-paging.
For constantly changing data such as likes, I would use a two-stage approach. For the frequently changing data, I would use an in-memory DB to keep up with the change rate and flush it periodically to the "real" DB.
Once you have that, the query for constantly changing data is easy:
1. Query the DB.
2. Query the in-memory DB.
3. Merge the frequently changed data from the in-memory DB with the "slow" DB data.
4. Remember which results you have already displayed, so that pressing the next button will not display an already displayed value a second time on a different page because its rank has changed.
If many people look at the same data, it might help to cache the results of step 3 to reduce the load on the real DB even further.
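The merge and the "already displayed" bookkeeping described above can be sketched as follows. Everything here is an assumption for illustration: plain dicts stand in for the persistent DB and the in-memory DB, the in-memory side is assumed to hold like-count deltas not yet flushed, and merged_page is a made-up name.

```python
def merged_page(db_scores, recent_deltas, seen, page_size):
    """Combine persisted scores with unflushed in-memory deltas, skip ids
    already shown, and return the next page sorted by score."""
    totals = dict(db_scores)
    for thread_id, delta in recent_deltas.items():
        totals[thread_id] = totals.get(thread_id, 0) + delta
    # Highest score first; the id breaks ties so the order is deterministic.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    page = [t for t in ranked if t not in seen][:page_size]
    seen.update(page)
    return page


db = {"a": 10, "b": 8, "c": 7, "d": 1}
seen = set()
page1 = merged_page(db, {}, seen, 2)        # ['a', 'b']
# "b" gains 5 likes and now outranks "a", but it was already displayed.
page2 = merged_page(db, {"b": 5}, seen, 2)  # ['c', 'd']
```

Note that the seen set is per-user state, which is the cost the question was trying to avoid; this sketch only shows the mechanics of the approach.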
Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.
I would cache all 'thread' results on the server the first time the user hits the database. Then return the first page of data to the user, and for each subsequent next-page call return cached results.
To minimize memory usage you can cache only record IDs and fetch the whole data when the user requests it.
The cache can be evicted each time the user exits the current page. If it isn't a ton of data I would stick to this solution, because the user won't get annoyed by data constantly changing.

Improve throughput of ndb query over large data

I am trying to perform some data processing in a GAE application over data that is stored in the Datastore. The bottleneck is the rate at which the query returns entities, and I wonder how to improve the query's performance.
What I do in general:
- everything works in a task queue, so we have plenty of time (10 minute deadline).
- I run a query over the ndb entities in order to select which entities need to be processed.
- as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
- the stored data is going to be large (say 500K-1M entities) and there is a chance that the 10 minute deadline is not enough. Therefore, when the task is reaching the taskqueue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped.
The problem is the rate at which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward
while has_more and theres_more_time:
    entities, cursor, more = query.fetch_page(1000, ...)
    send_to_process_queue(entities)
    has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C)
The overall performance remains the same as above. I tried various numbers of segments (from 2 to 10) in order to have 2-10 parallel fetch_async() calls. In all cases, the overall time remained the same. The more parallel fetch_page_async() calls are made, the longer it takes for each one to complete. I also tried 20 parallel fetches and it got worse. Changing the page size or the frontend instance class did not have an impact either.
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not unsuitable altogether), as the instance may run out of memory, plus I don't get a cursor in case I need to spawn another task (in fact I won't even have the ability to do so; the task will simply exceed the deadline). I tried this out of curiosity in order to see how it performs and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly 60K entities per minute. That is approx. 3 times faster than fetch_page(). I wonder why this happens.
Use query.iter() in a single loop.
This is much like the first approach. It makes use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page(), but much slower than fetch().
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is why fetch(), fetch_page() and iter() behave so differently, and how to make either fetch_page() or iter() perform as well as fetch(). Another critical question is whether these throughputs (20-60K entities per minute, depending on the API call) are the best we can do in GAE.
I'm aware of the MapReduce API but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries and I don't want to scan all the Datastore entities (it would be too costly and slow - the query may return only a few results). Last, but not least, I have to stick to GAE. Resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?
In case anyone is interested, I was able to significantly increase the throughput of the data processing by re-designing the component - it was suggested that I change the data models but that was not possible.
First, I segmented the data and then processed each data segment in a separate taskqueue.Task instead of calling multiple fetch_page_async from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially utilizing only a single Fx instance. To achieve parallelization of the tasks, I moved the component to a specific GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option.
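A rough sketch of the planning step described above, in plain Python. It does not call the task queue API: it only builds plain dicts describing one task per data segment and spreads them round-robin over instances. The function name plan_segment_tasks, the integer segment boundaries and the dict fields are all assumptions; in the real setup each dict would become a task whose 'target' option addresses the chosen basic-scaling instance.

```python
def plan_segment_tasks(boundaries, instance_count):
    """Pair consecutive boundaries into segments and spread them round-robin
    over instances. Returns plain dicts, not real task objects."""
    tasks = []
    for n, (start, end) in enumerate(zip(boundaries, boundaries[1:])):
        tasks.append({
            "start": start,                   # inclusive lower bound of the segment
            "end": end,                       # exclusive upper bound of the segment
            "instance": n % instance_count,   # which instance should run it
        })
    return tasks


# Five equal segments over 20,000 entities, one per instance.
tasks = plan_segment_tasks([0, 4000, 8000, 12000, 16000, 20000], 5)
for task in tasks:
    print(task)
```

With more segments than instances, the modulo simply wraps around, so each instance gets several segments to work through.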
With this design, I was able to process 20.000 entities in total within 4-5 seconds (instead of 40'-60'!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We'll have to fine-tune the type and number of basic instances we need.
The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like the Approach C in the other linked question).
Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
Design a solution from the output perspective - what the simplest query is that produces the required results, then what the entity structure is to support such a query, then what work is needed to create and maintain such an entity structure from the current data.

How should a row key be designed in HBase?

I am writing a program that converts an RDBMS into HBase. I selected a sequential entity as the row key, like Employee ID (1, 2, 3, ...), but I read somewhere that a row key shouldn't be a sequential entity. My question is: why is selecting a sequential row key not recommended? What are the design considerations?
Although sequential rowkeys allow faster scans, they become a problem after a certain point because they cause undesirable RegionServer hotspotting at read/write time. By default, HBase stores rows with similar keys in the same region, which allows faster range scans. So if rowkeys are sequential, all of your data will start going to the same machine, causing uneven load on that machine. This is called RegionServer hotspotting and is the main motivation behind not using sequential keys. I'll take "writes" to explain the problem here.
When records with sequential keys are being written to HBase all writes hit one Region. This would not be a problem if a Region was served by multiple RegionServers, but that is not the case – each Region lives on just one RegionServer. Each Region has a pre-defined maximal size, so after a Region reaches that size it is split in two smaller Regions. Following that, one of these new Regions takes all new records and then this Region and the RegionServer that serves it becomes the new hotspot victim. Obviously, this uneven write load distribution is highly undesirable because it limits the write throughput to the capacity of a single server instead of making use of multiple/all nodes in the HBase cluster.
You can find a very good explanation of the problem along with its solution here.
You might also find this page helpful, which shows us how to design rowkeys efficiently.
Hope this answers your question.
Mostly because sequentially increasing row keys will be written to the same region, and not evenly distributed in terms of writes. If you have a write-intensive application, it makes sense to have some randomness in your row-key.
This is a great explanation (with graphics) on why a sequentially increasing row-key is a bad idea for HBase.
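One common way to add that randomness is to "salt" the key with a bucket prefix derived from a hash of the id. The sketch below only builds key strings and does not talk to HBase; the helper name salted_row_key, the bucket count of 16 and the key layout are assumptions for illustration.

```python
import hashlib


def salted_row_key(employee_id: int, buckets: int = 16) -> str:
    """Prefix the sequential id with a stable, hash-derived bucket number."""
    digest = hashlib.md5(str(employee_id).encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{employee_id:010d}"


keys = [salted_row_key(n) for n in range(1, 9)]
for key in sorted(keys):
    print(key)
```

Because the bucket is computed from the id itself, a point lookup can rebuild the full key without any extra state. The trade-off is that a range scan over consecutive ids no longer maps to one contiguous key range and has to be issued once per bucket.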

Where should I handle concurrent operations on the same data? In the application or the database?

So I am building a Java webapp with Spring and Hibernate. In the application, users can add points to an object, and I'd like to count the points given in order to rank my objects. The objects are also stored in the database. And hopefully hundreds of people will give points to the objects at the same time.
But how do I count the points and save them in the database at the same time? Usually I would just have a property on my object and just increase the points. But that would mean that I have to lock the data in the database with a pessimistic transaction in order to prevent concurrency issues (reading the amount of points while another thread is half way through changing it already). That would possibly make my app much slower (at least I imagine it would).
The other solution would be to store the amount of given points in an associated object and store them separately in the database while counting the points in memory within a "small" synchronized block or something.
Which solution has the least performance impact when handling many concurrent operations on the same objects? Or are there any other fitting solutions?
If you would like the values to be persisted, then you should persist them in your database.
Given that, try the following:
- create a very narrow row, like just OBJ_ID and POINTS.
- create an index only on OBJ_ID, so not a lot of time is spent updating indexes when values are inserted, updated or deleted.
- use InnoDB, which has row-level locking, so the locks will be smaller
- MySQL will give you the last committed (consistent) value
That's all pretty simple. Give it a whirl! Set up a test case that mimics your expected load and see how it performs. Post back if you get stuck.
Good luck.
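As a starting point for such a test case, here is a minimal sketch of the narrow-row idea. It uses SQLite from the Python standard library as a stand-in for MySQL/InnoDB, so it says nothing about InnoDB's locking behaviour under load; the table name object_points and the helper add_points are assumptions. What it does show is the increment done in a single UPDATE statement, so the application never reads the value and writes it back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object_points ("
    "obj_id INTEGER PRIMARY KEY, points INTEGER NOT NULL)")
conn.execute("INSERT INTO object_points VALUES (1, 0)")


def add_points(obj_id, amount):
    """Increment in one statement; no read-modify-write in application code."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE object_points SET points = points + ? WHERE obj_id = ?",
            (amount, obj_id))


for _ in range(100):
    add_points(1, 1)

total = conn.execute(
    "SELECT points FROM object_points WHERE obj_id = 1").fetchone()[0]
print(total)  # 100
```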
