How to avoid datastore_v3.Next calls - google-app-engine

My datastore processing takes a lot of time.
So I checked which operations take time with Stackdriver Trace.
It turns out "datastore_v3.Next" is called many times.
I found following document.
https://cloud.google.com/appengine/docs/standard/python/tools/appstats
Datastore queries usually involve a datastore_v3.RunQuery followed by
zero or more datastore_v3.Next calls. (RunQuery returns the first few
results, so the API only uses Next when fetching many results.
Avoiding unnecessary Next calls may speed up your app!)
But I don't understand:
In what cases will datastore_v3.Next be called many times?
How can I avoid Next calls?
Add:
My code is below.
@classmethod
def get_foo(cls, user_key, foo):
    search_key = ndb.Key('UserInformation', user_key.id(), 'Foo', foo)
    return search_key.get()
The UserInformation and Foo kinds each have 1 million entities.

The primary way is the one mentioned in the doc: simply don't return an unnecessarily large result set.
If you only care about the first few results of a query, look into using fetch with a limit, which returns a bounded set of results instead of the whole thing.
In addition, you can make good use of memcache with NDB; fewer datastore calls means fewer .Next calls overall.
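For illustration, here is a minimal NDB sketch (the Foo model and its properties are made up for this example, not taken from your code) showing a limited fetch instead of iterating over a full result set:
from google.appengine.ext import ndb

class Foo(ndb.Model):  # hypothetical model for illustration
    owner = ndb.KeyProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

def latest_foos(user_key, limit=20):
    # fetch(limit) caps the result set, so the API issues at most a few
    # datastore_v3.Next calls instead of paging through everything.
    return Foo.query(Foo.owner == user_key).order(-Foo.created).fetch(limit)
Note that NDB already caches get() results in its in-context cache and memcache by default, so a single-key lookup like your get_foo is usually not the source of many Next calls; long-running queries are.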

Related

Raku : is there a SUPER fast way to turn an array into a string without the spaces separating the elements?

I need to convert thousands of binary byte strings, each about a megabyte long, into ASCII strings. This is what I have been doing, and it seems too slow:
sub fileToCorrectUTF8Str ($fileName) { # binary file
    my $finalString = "";
    my $fileBuf = slurp($fileName, :bin);
    for @$fileBuf { $finalString = $finalString ~ $_.chr; };
    return $finalString;
}
~@b turns @b into a string with all elements separated by a space, but this is not what I want. If @b = < a b c d >, then ~@b is "a b c d"; but I just want "abcd", and I want to do this REALLY fast.
So, what is the best way? I can't really use hyper for parallelism because the final string is constructed sequentially. Or can I?
TL;DR On an old rakudo, .decode is about 100X as fast.
In longer form to match your code:
sub fileToCorrectUTF8Str ($fileName) { # binary file
    slurp($fileName, :bin).decode
}
Performance notes
First, here's what I wrote for testing:
# Create a million-and-one-byte-long file:
spurt 'foo', "1234\n6789\n" x 1e5 ~ 'Z', :bin;
# (`say` the last character to check the work is done)
say .decode.substr(1e6) with slurp 'foo', :bin;
# (or swap in fileToCorrectUTF8Str 'foo' to time the original)
say now - INIT now;
On TIO.run's 2018.12 rakudo, the above .decode weighs in at about .05 seconds per million byte file instead of about 5 seconds for your solution.
You could/should of course test on your system and/or using later versions of rakudo. I would expect the difference to remain in the same order, but for the absolute times to improve markedly as the years roll by.[1]
Why is it 100X as fast?
Well, first, using @ on a Buf / Blob explicitly forces raku to view the erstwhile single item (a buffer) as a plural thing (a list of elements, aka multiple items). That means high-level iteration which, for a million-element buffer, immediately means a million high-level iterations/operations instead of just one high-level operation.
Second, using .decode not only avoids iteration but only incurs relatively slow method call overhead once per file whereas when iterating there are potentially a million .chr calls per file. Method calls are (at least semantically) late-bound which is in principle relatively costly compared to, for example, calling a sub instead of a method (subs are generally early bound).
That all said:
Remember Caveat Empty[1]. For example, rakudo's standard classes generate method caches, and it's plausible the compiler just in-lines the method anyway, so it's possible there is negligible overhead for the method call aspect.
See also the doc's Performance page, especially Use existing high performance code.
Is the Buf.Str error message LTA?
Update See Liz++'s comment.
If you try to use .Str on a Buf or Blob (or equivalent, such as using the ~ prefix on it) you'll get an exception. Currently the message is:
Cannot use a Buf as a string, but you called the Str method on it
The doc for .Str on a Buf/Blob currently says:
In order to convert to a Str you need to use .decode.
It's arguably LTA that the error message doesn't suggest the same thing.
Then again, before deciding what to do about this, if anything, we need to consider what, and how, folk could learn from anything that goes wrong, including signals about it, such as error messages, and also what and how they do in fact currently learn, and bias our reactions toward building the right culture and infrastructure.
In particular, if folk can easily connect between an error message they see, and online discussion that elaborates on it, that needs to be taken into account and perhaps encouraged and/or made easier.
For example, there's now this SO covering this issue with the error message in it, so a google is likely to get someone here. Leaning on that might well be a more appropriate path forward than changing the error message. Or it might not. The change would be easy...
Please consider commenting below and/or searching existing rakudo issues to see if improvement of the Buf.Str error message is being considered and/or whether you wish to open an issue to propose it be altered. Every rock moved is at least great exercise, and, as our collective effort becomes increasingly wise, improves (our view of) the mountain.
Footnotes
[1] As the well known Latin saying Caveat Empty goes, both absolute and relative performance of any particular raku feature, and more generally any particular code, is always subject to variation due to factors including one's system's capabilities, its load during the time it's running the code, and any optimization done by the compiler. Thus, for example, if your system is "empty", then your code may run faster. Or, as another example, if you wait a year or three for the compiler to get faster, advances in rakudo's performance continue to look promising.

Using FDQuery.RecordCount to tell if the query is empty or not may return negative value in Delphi

I have an FDQuery on a form, and an action that enables and disables components according to its RecordCount (enabled when > 0).
Most of the time the RecordCount property returns the actual count of records in the query. Sometimes, RecordCount returns a negative value, even though I can see records in a grid associated with the query.
So far, RecordCount has returned values between -5 and -1.
How can I solve this? Why does it return negative values?
why does it return negative values?
That is not FireDAC-specific; that is normal behavior for all Delphi libraries providing the TDataSet interface.
TDataSet.RecordCount was never guaranteed to work. The documentation even says so. It is a bonus functionality: sometimes it may work, other times it will not. It is only expected to work reliably for non-SQL table data like CSV, DBF, Paradox tables and other similar ISAM engines.
How can I solve this?
Relying on this function in the modern world is always a risky endeavor.
So you had better design your program so that this function is never used, or only in a very few very specific scenarios.
Instead, you should understand what question your program really asks of the library, and then find a "language" to ask that question by calling other functions better tailored to your case.
Now tell me: when you search in Google, how often do you read through to the 10th page, or the 100th page? Almost never, right? Your program's users would also almost never scroll the data grid really far down. Keep this in mind.
You always need to show users the first data, and do it fast. But rarely the last data.
Now, three examples.
1) You read data from some remote server over a slow internet connection. You can only read 100 rows per second. Your grid has room to show the first 20 rows. The rest the user has to scroll to. In total the query can filter 10 000 rows for you.
If you just show those 20 rows to the user, then it works almost instantly: it is only 0.2 seconds from when you start reading data to when you have filled your grid and presented it to the user. The rest of the data would only be fetched if the user requested it by scrolling (I am simplifying a bit here for clarity, I know about pre-caching and TDataSet.Buffers).
So if you call the RecordCount function, what does your program do? It downloads ALL the records into your local memory, where it counts them. And at that speed it would take 10 000 / 100 = 100 seconds, more than a minute and a half.
Just by calling the RecordCount function you effectively called the FetchAll procedure and made your program respond to the user not in 0.2 seconds but in 1m40s instead.
The user would be very nervous waiting for that to finish.
2) Imagine you are fetching the data from some stored procedure, or from a table where another application is inserting rows. In other words, it is not some static read-only data; it is live data that is being generated while you are downloading it.
So, how many rows are there then? At this moment it is, for example, 1000 rows; in a second it would be 1010 rows; in two seconds it might be 1050 rows, and so forth.
What is the One True Value when the value keeps changing?
Oookey, you called the RecordCount function, you SetLength-ed your array to 1000, and now you read all the data from your query. But it takes some time to download the data. It usually is fast, but it is never instantaneous. So, for example, it took you one second to download those 1000 rows of query data into your array (or grid). But while you were doing that, 10 more rows were generated/inserted, your query has not hit .EOF, and you keep fetching rows #1001, #1002, ... #1010, putting them into array/grid rows that just do not exist!
Would it be good?
Or would you cancel your query when you went out of the array/grid boundaries?
That way you would not have Access Violation.
But instead you would have those most recent 10 rows ignored and missed.
Is that good?
3) Your query, when you debug it, returns 10 000 rows. You download them all into your program's local memory by calling the RecordCount function, it works like a charm, and you deploy it.
Your client uses your program, the data grows, and one day your query returns not 10 000 rows but 10 000 000.
Your program calls the RecordCount function to download all those rows; it downloads, for example, 9 000 000 of them...
...and then it crashes with an Out Of Memory error.
enable and disable components according to its recordCount > 0
That is the wrong approach: fetching data you do not ever need (the exact quantity of rows) and then discarding it. The examples above show how that makes your program fragile and slow.
All you really want to know is if there are any rows or none at all.
You do not need to count all the rows and learn their exact number; you only wonder whether the query is empty or not.
And that is exactly what you should ask by calling the TDataSet.IsEmpty function instead of RecordCount.

GQL: is myList.contains(myField) any faster than 30 separate myField == myList(i) queries?

I've got a list with 100 Longs in it, and an Entity kind with a Long field. I want to find all the entities whose field value is in the list.
I was thinking, "Great! I'll just write where :p1.contains(field)," but AppEngine will only split this out on less than 31 elements (new Baskin Robbins' slogan?). So, now I'm thinking I'll have to split the list of 100 into multiple lists of 30.
But at this point, my hopes of a one-liner shot, I realized I could do something like
for (Long number : list)
    GQL("select * from Kind where field = " + number)
essentially splitting into all the subqueries myself. My question is... is this equivalent to letting appengine split my list of 30 into 30 separate queries? Or is there some back-end magic they do to fetch all 30 sub-queries simultaneously?
Sub-queries are functionally identical to regular queries. Currently, sub-queries are run serially, so doing it yourself isn't going to be any slower than letting the SDK do it for you. In future, though, it's likely that sub-queries will be executed in parallel, making them much more efficient than doing it yourself. It's also possible functionality like 'IN' could be pushed to the backend, avoiding the issue in the app server altogether.
You should be aware, though, that anything that requires you to execute 100 queries is going to be very, very slow. If you can, you should find a way to work around this without doing piles of queries and merging the results.
That function likely utilizes the IN (arg1,arg2,...) operator.
The problem being:
A single query containing != or IN operators is limited to 30 sub-queries.
Each item in the list counts as a sub-query, so you can only fetch 30 items at a time with that query style.
Using SELECT * FROM YourKind WHERE value IN list_property will cause multiple sub-queries to be run. It is these sub-queries that are limited to 30.
Using SELECT * FROM YourKind WHERE list_property = value will use the index that is built for your list_property, so it should work with as many entries as your list_property can contain†.
† I think this would be limited to the 5000 max items for an index
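If you do need to check against more than 30 values, one common workaround is to chunk the list yourself and merge the results. A hedged sketch in Python (the Kind model and field property are hypothetical; the question is Java, but the same 30-sub-query limit applies), keeping in mind the earlier warning that running many queries will be slow:
from google.appengine.ext import ndb

class Kind(ndb.Model):  # hypothetical model for illustration
    field = ndb.IntegerProperty()

def fetch_by_values(values, chunk_size=30):
    # Each value in an IN filter becomes one sub-query, and a single
    # query allows at most 30 sub-queries, so chunk the value list.
    results = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        results.extend(Kind.query(Kind.field.IN(chunk)).fetch())
    return results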

GAE/J Low-level API: FetchOptions usage

What should I know about FetchOptions withLimit, prefetchSize and chunkSize? The docs say the following:
prefetchSize is the number of results retrieved on the first call to the datastore.
chunkSize determines the internal chunking strategy of the Iterator returned by PreparedQuery.asIterator(FetchOptions)
...
prefetchSize and chunkSize have no impact on the result of the PreparedQuery, but rather only the performance of the PreparedQuery.
I'm not too sure how to use that in anger. What are the performance implications of different options? Any examples of how changes you've made have improved performance?
Setting a bigger chunkSize/prefetchSize will improve performance when iterating over big result sets, but it will also increase the latency of the first results. So bigger values work better when you know you are going to iterate over a big result set.
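For what it's worth, the Python runtime has an analogous knob; this is an assumption-laden sketch using NDB's batch_size query option (the LogEntry model is made up), not the Java FetchOptions API from the question:
from google.appengine.ext import ndb

class LogEntry(ndb.Model):  # hypothetical model for illustration
    created = ndb.DateTimeProperty(auto_now_add=True)

def count_all_entries():
    # A larger batch_size means each round trip to the datastore (each
    # datastore_v3.Next call) carries more results, which helps big
    # iterations at the cost of a slower first batch.
    count = 0
    for entry in LogEntry.query().order(LogEntry.created).iter(batch_size=500):
        count += 1  # real per-entity work would go here
    return count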

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of that kind when I put it into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
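A hedged sketch of that "grab a block, then sample locally" idea (the Record model and rand property are made up here; it assumes each entity already stores a random value, as described in the question):
import random

from google.appengine.ext import ndb

class Record(ndb.Model):  # hypothetical model for illustration
    rand = ndb.FloatProperty()  # assigned randomly at put() time

def pick_n_random(n, block_size=100):
    # One query fetches a block of entities starting at a random spot
    # in the randomised index; random.sample then picks N locally.
    pivot = random.random()
    block = Record.query(Record.rand > pivot).order(Record.rand).fetch(block_size)
    if len(block) < n:
        # Near the top of the range: wrap around with a second query.
        block += Record.query(Record.rand <= pivot).order(Record.rand).fetch(block_size - len(block))
    return random.sample(block, min(n, len(block)))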
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
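A minimal sketch of the batch-get step under those assumptions (the Item model is hypothetical, and last_id is whatever your IdGenerator reports):
import random

from google.appengine.ext import ndb

class Item(ndb.Model):  # hypothetical model whose entities use integer ids 1..last_id
    pass

def get_n_random(n, last_id):
    # One batch get by key instead of N separate queries. Ids whose
    # entity was deleted come back as None and need to be retried,
    # as discussed below.
    ids = random.sample(range(1, last_id + 1), n)
    keys = [ndb.Key(Item, i) for i in ids]
    return [e for e in ndb.get_multi(keys) if e is not None]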
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
Looks like the only method is to store a random integer value in a special property on each entity and query on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing all existing entities once if your datastore is already populated.
It's weird, I know.
I agree with the answer from Steve: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually return results with an evenly distributed probability. The probability of returning a given entity depends on the gap between its randomly assigned number and the next higher random number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "2" 8 times more often than "1".
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share it.
I just had the same problem. I decided not to assign IDs to my already existing entries in the datastore and did this instead, since I already had the total count from a sharded counter.
This selects "count" entries out of "totalcount" entries, sorted by key.
import logging
import random

from google.appengine.ext import db

# select $count from the complete set
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()
pagesize = 1000

# init buckets: one bucket per page of `pagesize` keys
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through the results, one bucket (one key-range page) at a time
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # query for the current page; advanced after each bucket
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = baseq.filter("__key__ >", lastkey)
Beware that this is somewhat complex to me, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
And beware that if count is close to totalcount this can be very expensive.
And beware that on millions of rows it might not be possible to do this within App Engine's time limits.
If I understand correctly, you need to retrieve N random instances.
It's easy. Just do a keys-only query, call random.choice on the resulting list of keys N times, then get the entities by fetching those keys.
import random

from google.appengine.ext import db

keys = MyModel.all(keys_only=True)
n = 5  # 5 random instances
all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)
    result_keys.append(key)
# result_keys now contains 5 random keys.
entities = db.get(result_keys)  # fetch the actual entities in one batch call
