How to make datastore keys mapreduce-friendly(-er)?

How to make datastore keys mapreduce-friendly(-er)? - google-app-engine

Edit: See my answer. Problem was in our code. MR works fine, it may have a status reporting problem, but at least the input readers work fine.
I ran an experiment several times now and I am now sure that mapreduce (or DatastoreInputReader) has odd behavior. I suspect this might have something to do with key ranges and splitting them, but that is just my guess.
Anyway, here's the setup we have:
we have an NDB model called "AdGroup", when creating new entities
of this model - we use the same id returned from AdWords (it's an
integer), but we use it as string: AdGroup(id=str(adgroupId))
we have 1,163,871 of these entities in our datastore (that's what
the "Datastore Admin" page tells us - I know it's not entirely
accurate number, but we don't create/delete adgroups very often, so
we can say for sure, that the number is 1.1 million or more).
mapreduce is started (from another pipeline) like this:
yield mapreduce_pipeline.MapreducePipeline(
job_name='AdGroup-process',
mapper_spec='process.adgroup_mapper',
reducer_spec='process.adgroup_reducer',
input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
mapper_params={
'entity_kind': 'model.AdGroup',
'shard_count': 120,
'processing_rate': 500,
'batch_size': 20,
},
)
So, I've tried to run this mapreduce several times today without changing anything in the code and without making changes to the datastore. Every time I ran it, mapper-calls counter had a different value ranging from 450,000 to 550,000.
Correct me if I'm wrong, but considering that I use the very basic DatastoreInputReader - mapper-calls should be equal to the number of entities. So it should be 1.1 million or more.
Note: the reason why I noticed this issue in the first place is because our marketing guys started complaining that "it's been 4 days after we added new adgroups and they still don't show up in your app!".
Right now, I can think of only one workaround - write all keys of all adgroups into a blobstore file (one per line) and then use BlobstoreLineInputReader. The writing to blob part would have to be written in a way that does not utilize DatastoreInputReader, of course. Should I go with this for now, or can you suggest something better?
Note: I have also tried using DatastoreKeyInputReader with the same code - the results were similar - mapper-calls were between 450,000 and 550,000.
So, finally questions. Is it important how you generate ids for your entities? Is it better to use int ids instead of str ids? In general, what can I do to make it easier for mapreduce to find all of my entities mapping them?
PS: I'm still in the process of experimenting with this, I might add more details later.

After further investigation we have found that the error was actually in our code. So, mapreduce actually works as expected (mapper is called for every single datastore entity).
Our code was calling some google services functions that were sometimes failing (the wonderful cryptic ApplicationError messages). Due to these failures, MR tasks were being retried. However, we have set a limit on taskqueue retries. MR did not detect nor report this in any way - MR was still showing "success" in the status page for all shards. That is why we thought that everything is fine with our code and that there is something wrong with the input reader.

Related

Understanding Datastore Get RPCs in Google App Engine

I'm using sharded counters (https://cloud.google.com/appengine/articles/sharding_counters) in my GAE application for performance reasons, but I'm having some trouble understanding why it's so slow and how I can speed things up.
Background
I have an API that grabs a set of 20 objects at a time and for each object, it gets a total from a counter to include in the response.
Metrics
With Appstats turned on and a clear cache, I notice that getting the totals for 20 counters makes 120 RPCs by datastore_v3.Get which takes 2500ms.
Thoughts
This seems like quite a lot of RPC calls and quite a bit of time for reading just 20 counters. I assumed this would be faster and maybe that's where I'm wrong. Is it supposed to be faster than this?
Further Inspection
I dug into the stats a bit more, looking at these two lines in the get_count method:
all_keys = GeneralCounterShardConfig.all_keys(name)
for counter in ndb.get_multi(all_keys):
If I comment out the get_multi line, I see that there are 20 RPC calls by datastore_v3.Get totaling 185ms.
As expected, this leaves get_multi to be the culprit for 100 RPC calls by datastore_v3. Get taking upwards of 2500 ms. I verified this, but this is where I'm confused. Why does calling get_multi with 20 keys cause 100 RPC calls?
Update #1
I checked out Traces in the GAE console and saw some additional information. They show a breakdown of the RPC calls there as well - but in the sights they say to "Batch the gets to reduce the number of remote procedure calls." Not sure how to do that outside of using get_multi. Thought that did the job. Any advice here?
Update #2
Here are some screen shots that show the stats I'm looking at. The first one is my base line - the function without any counter operations. The second one is after a call to get_count for just one counter. This shows a difference of 6 datastore_v3.Get RPCs.
Base Line
After Calling get_count On One Counter
Update #3
Based on Patrick's request, I'm adding a screenshot of info from the console Trace tool.

Try splitting up the for loop that goes through each item and the actual get_multi call itself. So something like:
all_values = ndb.get_multi(all_keys)
for counter in all_values:
# Insert amazeballs codes here
I have a feeling it's one of these:
The generator pattern (yield from for loop) is causing something funky with get_multi execution paths
Perhaps the number of items you are expecting doesn't match actual result counts, which could reveal a problem with GeneralCounterShardConfig.all_keys(name)
The number of shards is set too high. I've realized that anything over 10 shards causes performance issues.

When I've dug into similar issues, one thing I've learned is that get_multi can cause multiple RPCs to be sent from your application. It looks like the default in the SDK is set to 1000 keys per get, but the batch size I've observed in production apps is much smaller: something more like 10 (going from memory).
I suspect the reason it does this is that at some batch size, it actually is better to use multiple RPCs: there is more RPC overhead for your app, but there is more Datastore parallelism. In other words: this is still probably the best way to read a lot of datastore objects.
However, if you don't need to read the absolute most current value, you can try setting the db.EVENTUAL_CONSISTENCY option, but that seems to only be available in the older db library and not in ndb. (Although it also appears to be available via the Cloud Datastore API).
Details
If you look at the Python code in the App Engine SDK, specifically the file google/appengine/datastore/datastore_rpc.py, you will see the following lines:
max_count = (Configuration.max_get_keys(config, self.__config) or
self.MAX_GET_KEYS)
...
if is_read_current and txn is None:
max_egs_per_rpc = self.__get_max_entity_groups_per_rpc(config)
else:
max_egs_per_rpc = None
...
pbsgen = self._generate_pb_lists(indexed_keys_by_entity_group,
base_req.ByteSize(), max_count,
max_egs_per_rpc, config)
rpcs = []
for pbs, indexes in pbsgen:
rpcs.append(make_get_call(base_req, pbs,
self.__create_result_index_pairs(indexes)))
My understanding of this:
Set max_count from the configuration object, or 1000 as a default
If the request must read the current value, set max_gcs_per_rpc from the configuration, or 10 as a default
Split the input keys into individual RPCs, using both max_count and max_gcs_per_rpc as limits.
So, this is being done by the Python Datastore library.

App engine datastore inconsistent?

This is so weird...
First of all this query works in the datastore viewer, ie. it returns the correct row.
SELECT * FROM Level where short_id = 'Ec71eN'
But if I run this
Level.all().filter("short_id = ", 'Ec71eN').get()
it returns None, if I run this:
db.GqlQuery("SELECT * FROM Level where short_id = '%s'" % 'Ec71eN').get()
it also returns None. If I run this:
level = Level.get_by_id(189009)
it returns the correct row (189009 is the id for the correct row)
Puzzling? What can be wrong here? I have never seen anything like this before, it has worked correctly for at least a couple of weeks in production... I think I have at least two cases now where it dosent work starting today.
UPDATE: This can not be a eventually consistent problem since the row was 7 hours old when I tried the above. I had two rows with same symptoms, strangely booth generated by the same users. They where booth "fixed" after I did a manual fecth of their ids by uploading special case code like:
if short_id==CASE_1_SHORT_ID:
level = Level.get_by_id(CASE_1_ID)
After that the query worked as usual.

Are you using the HRD? Nothing's wrong. You know it's supposed to be eventually consistent right?
Query operations are eventually consistent.
Get-by-id operations are fully consistent.
What you describe is correct datastore behavior. It's a bit odd that the datastore viewer operation returns the correct result, but it might have hit a separate tablet on the datastore operation.

Given that it was created 7 hours ago, the 'eventual consistency' generally should take seconds to minutes.
If eventual consistency IS the problem, run the same query method a bunch of times and see if returns the same result. If it continuously returns the same result with the same method, then it is more than likely not an eventual consistency problem. You should switch to the NDB API for querying data as well - it's 1000 times better and Guido worked on it - so you know it's good. Does NDB show the same inconsistency?

App Engine Datastore Viewer, how to show count of records using GQL?

I would think this would be easy for an SQL-alike! What I want is the GQL equivalent of:
select count(*) from foo;
and to get back an answer something similar to:
1972 records.
And I want to do this in GQL from the "command line" in the web-based DataStore viewer. (You know, the one that shows 20 at a time and lets me see "next 20")
Anyway -- I'm sure it's brain-dead easy, I just can't seem to find the correct syntax. Any help would be appreciated.
Thanks!

With straight Datastore Console, there is no direct way to do it, but I just figured out how to do it indirectly, with the OFFSET keyword.
So, given a table, we'll call foo, with a field called type that we want to check for values named "bar":
SELECT * FROM foo WHERE type="bar" OFFSET 1024
(We'll be doing a quick game of "warmer, colder" here, binary style)
Let's say that query returns nothing. Change OFFSET to 512, then 256, 128, 64, ... you get the idea. Same thing in reverse: Go up to 2048, 4096, 8192, 16384, etc. until you see no records, then back off.
I just did one here at work. Started with 2048, and noticed two records came up. There's 2049 in the table. In a more extreme case, (lets say there's 3300 records), you could start with 2048, notice there's a lot, go to 4096, there's none... Take the midpoint (1024 between 2048 and 4096 is 3072) next and notice you have records... From there you could add half the previous midpoint (512) to get 3584, and there's none. Whittle back down half (256) to get 3328, still none. Once more down half (128) to get 3200 and there's records. Go up half of the last val (64) and there's still records. Go up half again (32) to 3296 - still records, but so small you can easily see there's exactly 3300.
The nice thing about this vs. Datastore statistics to see how many records are in a table is you can limit it by the WHERE clause.

I don't think there is any direct way to get the count of entities via GQL. However you can get the count directly from the dashbaord
;
More details - https://cloud.google.com/appengine/docs/python/console/managing-datastore

As it's stated in other questions, it looks like there is no count aggregate function in GQL. The GQL Reference also doesn't say there is the ability to do this, though it doesn't explicitly say that it's not possible.
In the development console (running your application locally) it looks like just clicking the "List Entities" button will show you a list of all entities of a certain type, and you can see "Results 1-10 of (some number)" to get a total count in your development environment.
In production you can use the "Datastore Statistics" tab (the link right underneath the Datastore Viewer), choose "Display Statistics for: (your entity type)" and it will show you the total number of entities, however this is not the freshest view of the data (updated "at least once per day").
Since you can't run arbitrary code in production via the browser, I don't think saying "use .count() on a query" would help, but if you're using the Remote API, the .count() method is no longer capped at 1000 entries as of August, 2010, so you should be able to run print MyEntity.all().count() and get the result you want.

This is one of those surprising things that the datastore just can't do. I think the fastest way to do it would be to select __KEY__ from foo into a List, and then count the items in the list (which you can't do in the web-based viewer).
If you're happy with statistics that can be a little bit stale, you can go to the Datastore Statistics page of the admin console, which will tell you how many entities of each type there were some time ago. It seems like those stats are usually less than 10 hours old. Unfortunately, you can't query them more specifically.

There's no way to get a total count in GQL. Here's a way to get a count using python:
def count_models(model_class, max_fetch=1000):
total = 0
cursor = None
while True:
query = model_class.all(keys_only=True)
if cursor:
query.with_cursor(cursor)
results = query.fetch(max_fetch)
total += len(results)
print('still counting: ' + total)
if (len(results) < max_fetch):
return total
cursor = query.cursor()
You could run this function using the remote_api_shell, or add a custom page to your admin site to run this query. Obviously, if you've got millions of rows you're going to be waiting a while. You might be able to increase max_fetch, I'm not sure what the current fetch limit is.

What's your experience developing on Google App Engine?

Is GQL easy to learn for someone who knows SQL? How is Django/Python? Does App Engine really make scaling easy? Is there any built-in protection against "GQL Injections"? And so on...
I'd love to hear the not-so-obvious ups and downs of using app engine.
Cheers!

My experience with google app engine has been great, and the 1000 result limit has been removed, here is a link to the release notes:
app-engine release notes
No more 1000 result limit - That's
right: with addition of Cursors and
the culmination of many smaller
Datastore stability and performance
improvements over the last few months,
we're now confident enough to remove
the maximum result limit altogether.
Whether you're doing a fetch,
iterating, or using a Cursor, there's
no limits on the number of results.

The most glaring and frustrating issue is the datastore api, which looks great and is very well thought out and easy to work with if you are used to SQL, but has a 1000 row limit across all query resultsets, and you can't access counts or offsets beyond that. I've run into weirder issues, with not actually being able to add or access data for a model once it goes beyond 1000 rows.
See the Stack Overflow discussion about the 1000 row limit
Aral Balkan wrote a really good summary of this and other problems
Having said that, app engine is a really great tool to have at ones disposal, and I really enjoy working with it. It's perfect for deploying micro web services (eg: json api's) to use in other apps.

GQL is extremely simple - it's a subset of the SQL 'SELECT' statement, nothing more. It's only a convenience layer over the top of the lower-level APIs, though, and all the parsing is done in Python.
Instead, I recommend using the Query API, which is procedural, requires no run-time parsing, and makes 'GQL injection' vulnerabilities totally impossible (though they are impossible in properly written GQL anyway). The Query API is very simple: Call .all() on a Model class, or call db.Query(modelname). The Query object has .filter(field_and_operator, value), .order(field_and_direction) and .ancestor(entity) methods, in addition to all the facilities GQL objects have (.get(), .fetch(), .count()), etc.) Each of the Query methods returns the Query object itself for convenience, so you can chain them:
results = MyModel.all().filter("foo =", 5).order("-bar").fetch(10)
Is equivalent to:
results = MyModel.gql("WHERE foo = 5 ORDER BY bar DESC LIMIT 10").fetch()

A major downside when working with AppEngine was the 1k query limit, which has been mentioned in the comments already. What I haven't seen mentioned though is the fact that there is a built-in sortable order, with which you can work around this issue.
From the appengine cookbook:
def deepFetch(queryGen,key=None,batchSize = 100):
"""Iterator that yields an entity in batches.
Args:
queryGen: should return a Query object
key: used to .filter() for __key__
batchSize: how many entities to retrieve in one datastore call
Retrieved from http://tinyurl.com/d887ll (AppEngine cookbook).
"""
from google.appengine.ext import db
# AppEngine will not fetch more than 1000 results
batchSize = min(batchSize,1000)
query = None
done = False
count = 0
if key:
key = db.Key(key)
while not done:
print count
query = queryGen()
if key:
query.filter("__key__ > ",key)
results = query.fetch(batchSize)
for result in results:
count += 1
yield result
if batchSize > len(results):
done = True
else:
key = results[-1].key()
The above code together with Remote API (see this article) allows you to retrieve as many entities as you need.
You can use the above code like this:
def allMyModel():
q = MyModel.all()
myModels = deepFetch(allMyModel)

At first I had the same experience as others who transitioned from SQL to GQL -- kind of weird to not be able to do JOINs, count more than 1000 rows, etc. Now that I've worked with it for a few months I absolutely love the app engine. I'm porting all of my old projects onto it.
I use it to host several high-traffic web applications (at peak time one of them gets 50k hits a minute.)

Google App Engine doesn't use an actual database, and apparently uses some sort of distributed hash map. This will lend itself to some different behaviors that people who are accustomed to SQL just aren't going to see at first. So for example getting a COUNT of items in regular SQL is expected to be a fast operation, but with GQL it's just not going to work the same way.
Here are some more issues:
http://blog.burnayev.com/2008/04/gql-limitations.html
In my personal experience, it's an adjustment, but the learning curve is fine.

SSIS/VB.NET Equivalent of SQL IN (anonymous array.Contains())

I've got some SQL which performs complex logic on combinations of GL account numbers and cost centers like this:
WHEN (#IntGLAcct In (
882001, 882025, 83000154, 83000155, 83000120, 83000130,
83000140, 83000157, 83000010, 83000159, 83000160, 83000161,
83000162, 83000011, 83000166, 83000168, 83000169, 82504000,
82504003, 82504005, 82504008, 82504029, 82530003, 82530004,
83000000, 83000100, 83000101, 83000102, 83000103, 83000104,
83000105, 83000106, 83000107, 83000108, 83000109, 83000110,
83000111, 83000112, 83000113, 83100005, 83100010, 83100015,
82518001, 82552004, 884424, 82550072, 82552000, 82552001,
82552002, 82552003, 82552005, 82552012, 82552015, 884433,
884450, 884501, 82504025, 82508010, 82508011, 82508012,
83016003, 82552014, 81000021, 80002222, 82506001, 82506005,
82532001, 82550000, 82500009, 82532000))
Overall, the whole thing is poorly performing in a UDF, especially when it's all nested and the order of the steps is important etc. I can't make it table-driven just yet, because the business logic is so terribly convoluted.
So I'm doing a little exploratory work in moving it into SSIS to see about doing it in a little bit of a different way. Inside my script task, however, I've got to use VB.NET, so I'm looking for an alternative to this:
Select Case IntGLAcct = 882001 OR IntGLAcct = 882025 OR ...
Which is obviously a lot more verbose, and would make it terribly hard to port the process.
Even something like ({90605, 90607, 90610} AS List(Of Integer)).Contains(IntGLAcct) would be easier to port, but I can't get the initializer to give me an anonymous array like that. And there are so many of these little collections, I'm not sure I can create them all in advance.
It really all NEEDS to be in one place. The business changes this logic regularly. My strategy was to use the udf to mirror their old "include" file, but performance has been poor. Now each of the functions takes just 2 or three parameters. It turns out that in a dark corner of the existing system they actually build a multi-million row table of all these results - even though the pre-calced table is not used much.
So my new experiment is to (since I'm still building the massive cross join table to reconcile that part of the process) go ahead and use the table instead of the code, but go ahead and populate this table during an SSIS phase instead of calling the udf 12 million times - because my udf version just basically stopped working within a reasonable time frame and the DBAs are not of much help right now. Yet, I know that SSIS can process these rows pretty efficiently - because each month I bring in the known good results dozens of multi-million row tables from the legacy system in minutes AND run queries to reconcile that there are no differences with the new versions.
The SSIS code would theoretically become the keeper of the business logic, and the efficient table would be built from that (based on all known parameter combinations). Of course, if I can simplify the logic down to a real logic table, that would be the ultimate design - but that's not really foreseeable at this point.

Try this:
Array.IndexOf(New Integer() {90605, 90607, 90610}, IntGLAcct) >-1

What if you used a conditional split transform on your incoming data set and then used expressions or something similar (I'm not sure if your GL Accounts are fixed or if you're going to dynamically pass them in) to apply to the results? You can then take the resulting data from that and process as necessary.