GAE/J Low-level API: FetchOptions usage - google-app-engine

What should I know about FetchOptions withLimit, prefetchSize and chunkSize? The docs say the following:
prefetchSize is the number of results retrieved on the first call to the datastore.
chunkSize determines the internal chunking strategy of the Iterator returned by PreparedQuery.asIterator(FetchOptions) ...
prefetchSize and chunkSize have no impact on the result of the PreparedQuery, but rather only the performance of the PreparedQuery.
I'm not too sure how to use that in anger. What are the performance implications of different options? Any examples of how changes you've made have improved performance?

Setting a bigger chunkSize/prefetchSize will improve the performance of iterating over big result sets, but it will also increase latency, since each batch takes longer to return. So bigger values work better when you know you are going to iterate over a big result set.
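For concreteness, here is a minimal sketch of how these options are typically passed to a query in the Java low-level API. The kind name "LogEntry" and the batch size of 500 are invented for illustration; tune them against your own data.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import java.util.Iterator;

public class FetchOptionsExample {
    public void scanLargeKind() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        PreparedQuery pq = ds.prepare(new Query("LogEntry"));

        // Larger prefetch/chunk sizes mean fewer round trips to the datastore
        // while iterating, at the cost of a slower first batch.
        FetchOptions options = FetchOptions.Builder
                .withPrefetchSize(500)  // results returned by the first datastore call
                .chunkSize(500);        // batch size for each subsequent fetch

        Iterator<Entity> it = pq.asIterator(options);
        while (it.hasNext()) {
            Entity entity = it.next();
            // process entity ...
        }
    }
}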

Related

Raku : is there a SUPER fast way to turn an array into a string without the spaces separating the elements?

I need to convert thousands of binary byte strings, each about a megabyte long, into ASCII strings. This is what I have been doing, and it seems too slow:
sub fileToCorrectUTF8Str ($fileName) { # binary file
    my $finalString = "";
    my $fileBuf = slurp($fileName, :bin);
    for @$fileBuf { $finalString = $finalString ~ $_.chr; };
    return $finalString;
}
~@b turns @b into a string with all elements separated by a space, but this is not what I want. If @b = < a b c d >; then ~@b is "a b c d"; but I just want "abcd", and I want to do this REALLY fast.
So, what is the best way? I can't really use hyper for parallelism because the final string is constructed sequentially. Or can I?
TL;DR On an old rakudo, .decode is about 100X as fast.
In longer form to match your code:
sub fileToCorrectUTF8Str ($fileName) { # binary file
    slurp($fileName, :bin).decode
}
Performance notes
First, here's what I wrote for testing:
# Create million and 1 bytes long file:
spurt 'foo', "1234\n6789\n" x 1e5 ~ 'Z', :bin;
# (`say` the last character to check work is done)
say .decode.substr(1e6) with slurp 'foo', :bin;
# fileToCorrectUTF8Str 'foo' );
say now - INIT now;
On TIO.run's 2018.12 rakudo, the above .decode weighs in at about .05 seconds per million byte file instead of about 5 seconds for your solution.
You could/should of course test on your system and/or using later versions of rakudo. I would expect the difference to remain in the same order, but for the absolute times to improve markedly as the years roll by.[1]
Why is it 100X as fast?
Well, first, @ on a Buf / Blob explicitly forces Raku to view the erstwhile single item (a buffer) as a plural thing (a list of elements, aka multiple items). That means high level iteration which, for a million element buffer, is immediately a million high level iterations/operations instead of just one high level operation.
Second, using .decode not only avoids iteration but also incurs the relatively slow method-call overhead only once per file, whereas when iterating there are potentially a million .chr calls per file. Method calls are (at least semantically) late-bound, which is in principle relatively costly compared to, for example, calling a sub instead of a method (subs are generally early-bound).
That all said:
Remember Caveat Empty[1]. For example, rakudo's standard classes generate method caches, and it's plausible the compiler just in-lines the method anyway, so it's possible there is negligible overhead for the method call aspect.
See also the doc's Performance page, especially Use existing high performance code.
Is the Buf.Str error message LTA?
Update See Liz++'s comment.
If you try to use .Str on a Buf or Blob (or equivalent, such as using the ~ prefix on it) you'll get an exception. Currently the message is:
Cannot use a Buf as a string, but you called the Str method on it
The doc for .Str on a Buf/Blob currently says:
In order to convert to a Str you need to use .decode.
It's arguably LTA that the error message doesn't suggest the same thing.
Then again, before deciding what to do about this, if anything, we need to consider what, and how, folk could learn from anything that goes wrong, including signals about it, such as error messages, and also what and how they do in fact currently learn, and bias our reactions toward building the right culture and infrastructure.
In particular, if folk can easily connect between an error message they see, and online discussion that elaborates on it, that needs to be taken into account and perhaps encouraged and/or made easier.
For example, there's now this SO covering this issue with the error message in it, so a google is likely to get someone here. Leaning on that might well be a more appropriate path forward than changing the error message. Or it might not. The change would be easy...
Please consider commenting below and/or searching existing rakudo issues to see if improvement of the Buf.Str error message is being considered and/or whether you wish to open an issue to propose it be altered. Every rock moved is at least great exercise, and, as our collective effort becomes increasingly wise, improves (our view of) the mountain.
Footnotes
[1] As the well known Latin saying Caveat Empty goes, both absolute and relative performance of any particular raku feature, and more generally any particular code, is always subject to variation due to factors including one's system's capabilities, its load during the time it's running the code, and any optimization done by the compiler. Thus, for example, if your system is "empty", then your code may run faster. Or, as another example, if you wait a year or three for the compiler to get faster, advances in rakudo's performance continue to look promising.

MongoDB - Performance Bottleneck with Simple Query

We are currently investigating MongoDB as a possible solution for a highly distributed database for scientific data. Given our query requirements we have chosen to go for a single collection consisting of documents, each document representing an object and its properties which number ~450. A typical document would be structured as follows:
d = {'patch': '12345-1,1',
'X': { (120 key-value pairs) },
'A': { (64 key-value pairs) },
...
(4 more such embedded documents)
}
Within X, there is an integer flag. The flag is a 32-bit integer, each bit representing a Boolean flag. This is a common method of storing Boolean flags when their number is rather large. There is a lookup table which shows which position corresponds to which Boolean property. The 15th bit is the one of relevance to our specific set of queries. The total number of documents is 600,000, sharded across 3 desktops (8 GB RAM and i7 CPU, standard 5400 RPM spinning hard drives).
The query is simple: we want a count of all documents for which the 15th bit of a particular flag integer is set to 1.
db.coll.find(
{'X.flag1': {$bitsAllSet: [14]}}
).count()
The average time taken for this query is 19,783 ms. This is not an acceptable time for us. We tried to improve this using an aggregation instead of a standard find() based query.
db.coll.aggregate([
    {'$match': {
        'X.flag1': {$bitsAllSet: [14]}
    }},
    {'$group': {
        _id: 0,
        count: {$sum: 1}
    }}
])
This takes about 10,000 ms. While this is an improvement (which I think is because of the highly efficient C++ implementation of the aggregation framework), it is still short of the kind of performance we desire. The next step was to actually isolate the flag hidden in the 15th bit and make it a separate key in the document. This results in the same queries as above, but instead of using $bitsAllSet: [14] we use X.is_primary: 1. For find() and aggregate() the times were 19,000 ms and 8,500 ms respectively. There is very little improvement.
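For reference, this is what the extracted-flag variant described above looks like in the mongo shell (using the X.is_primary field name already introduced; the explain() call simply shows whether the winning plan is a full collection scan):

// Count using the extracted Boolean flag instead of $bitsAllSet
db.coll.find({'X.is_primary': 1}).count()

// Inspect the query plan and per-stage timings
db.coll.find({'X.is_primary': 1}).explain('executionStats')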
So, my two questions which I hope people may help with are:
Is this the final performance I can expect from MongoDB Community Edition? I am aware that there is an Enterprise Edition which will come with an In-Memory Engine. But my question is more specific to the Community Edition. Is there any trick that I could use to speed up the query?
I am slowly finding that, at least for the complex server-side analytics and querying we need, MongoDB is proving hard to use, both in terms of the complexity of the queries we are writing and the performance bottlenecks. Any advice on what other databases I might consider would be welcome.
Edit: As suggested, I am sharing the output of .explain(). The output is for a collection where is_primary is not indexed. But as discussed in the comments section, for a Boolean value the presence of an index should not make a difference to the performance of a query based on Boolean flags.
Pastebin Link (expiry of 2 weeks)

How to avoid datastore_v3.Next calls

My datastore processing takes a lot of time.
So, I checked which calls take time with Stackdriver -> Trace.
It turns out that "datastore_v3.Next" is called many times.
I found following document.
https://cloud.google.com/appengine/docs/standard/python/tools/appstats
Datastore queries usually involve a datastore_v3.RunQuery followed by zero or more datastore_v3.Next calls. (RunQuery returns the first few results, so the API only uses Next when fetching many results. Avoiding unnecessary Next calls may speed up your app!)
But I don't understand:
In what cases will datastore_v3.Next be called many times?
How can I avoid Next calls?
Added:
My code is below:
@classmethod
def get_foo(cls, user_key, foo):
    search_key = ndb.Key('UserInformation', user_key.id(), 'Foo', foo)
    return search_key.get()
The UserInformation and Foo kinds have 1 million entities.
The primary way is the one mentioned in the doc: simply don't return an unnecessarily large result set.
If you only care about the first few results in the query, you can look into using fetch which can return a limited set of results instead of the whole thing.
In addition, you can also make good use of memcache with NDB; fewer datastore calls mean fewer .Next calls overall.
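As an illustration of both points, here is a minimal NDB sketch; the Foo model and its created property are hypothetical stand-ins, not the asker's real schema:

from google.appengine.ext import ndb

class Foo(ndb.Model):  # hypothetical model, for illustration only
    created = ndb.DateTimeProperty(auto_now_add=True)

def latest_foos(ancestor_key, n=20):
    # fetch(n) asks the datastore for at most n results in one batch,
    # instead of paging through the whole result set with repeated
    # datastore_v3.Next calls.
    return Foo.query(ancestor=ancestor_key).order(-Foo.created).fetch(n)

NDB also caches the results of key.get() in its context cache and in memcache, so repeated gets of the same key (as in get_foo above) can avoid datastore calls, and hence Next calls, entirely.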

Best way to store redis keys

I am using Redis to store some information and detect changes in that information over time (for example, think users and locations). What is the value of using a longer or shorter key name? Using a longer key is clearer, but is there much memory or performance cost to using a longer key name?
Here are examples:
SET L:123456 "<name> <latitude> <longitude> ..."
HSET U:987654321 loc 123456 time <epoch>
or
SET loc:{123456} "<name> <latitude> <longitude> ..."
HSET user:{U987654321} loc 123456 time <epoch>
It all depends on how you are going to use it.
If every byte counts, for example when you have to pay for each kB transferred to a cloud service, you can calculate the costs. The maths is simple; a byte is a byte 'on the wire'. Inside redis, for larger values it is equally simple. For smaller values, Redis does some memory optimization.
In your HSET example, you split out the members, which only makes sense if you need them separated from each other most of the time. A better approach -might- be: HSET user:data 987654321 '{"loc": "123456", "time": "2014-01-01T13:00:00"}'. Separate keys/members 'cost' a lot more, performance-wise, than longer strings. You can even put a whole table or dataset in one member if it's only going to be used as one complete semi-static entity.
Speed and Size: There is a notable difference between keys and values.
Keys:
Shorter is generally more memory efficient as well as speed efficient. If you use a redis Sorted Set you can even use 'numbers' as keys (sorted set 'members' plus 'scores'). I say 'numbers' because a score is technically a float64, but to be used as an ID it has to be between -999999999999999 and 999999999999999 including (that's 15 digits), without any fractional part. This can be really helpful, since Redis does fast and scalable O(log(n)) on-the-fly sorting of Sorted Sets (using skiplists, simplified).
Values:
The MsgPack format (uncompressed) takes up the least space, especially if you store the definitions once and the values many times. JSON is a bit less memory efficient, but is of course such a common IPC format that it should not be left out. Raw strings, character-separated, fixed length (ugh), whatever your desire, it's possible to use. You can always compress your data before storing it in Redis. So far memory efficiency. When it comes to speed, it's less simple. If you want to use Lua server-side scripting (which you should), you can't do anything with compressed data. JSON and MsgPack can be deserialized, but only 'as a whole'. Which is fine in most scenarios. Most flexible is storing separate values (for example as members of a HSET), but this comes at a price as well (most of the time: too high a price). You can also combine all these. What we use most: a prefix of two or three delimiter-separated values, followed by a MsgPack payload.
My general advice is: start with using only HSET's and ZSET's, don't split out data that belongs together, use descriptive PascalCased names for your keys between 10-25 chars, use ':' if you need delimiters in your keys (namespaces), serialize as JSON (for simplicity, but code for easy switching to MsgPack), use Lua scripting (even if you don't know Lua, the subset you use in Redis is tiny).
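As a rough illustration of that advice (the key names UserLocation and UserLastSeen are invented for this sketch, not taken from the question):

HSET UserLocation 123456 '{"name":"foo","lat":52.37,"lon":4.89}'
HGET UserLocation 123456
ZADD UserLastSeen 1388581200 123456
ZRANGEBYSCORE UserLastSeen 1388500000 +inf

One hash holds the JSON-serialized location data per user id, and one sorted set keyed by an epoch score gives cheap time-range queries over the same ids.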
I wouldn't worry about it too much in the startup phase of your project, you can always change it later on and do some A/B comparisons as soon as you have some interpolatable data.
Hope this helps, TW
Now that Redis v3.2 is almost here, you should consider switching to the new geo hashing functionality: http://redis.io/commands/geoadd

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
~2500 labeled data vectors/instances (transformed video frames of individuals; <20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test on multiple different data sets (in case I had randomly hit upon the most magical train/test pair)
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, bSV=64 to nSV=72, bSV=0.)
((weird)) using the default RBF kernel (vice linear -- i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care: make sure you're not repeating data (say, two frames of the same video falling in different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
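For what it's worth, here is a rough sketch of such a (C, gamma) search using scikit-learn rather than the LibSVM command-line tools; the GroupKFold grouping is an assumption on my part, standing in for the per-person/per-video split mentioned in the first point:

from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

# Exponential grid over C and gamma, as commonly recommended for RBF SVMs.
param_grid = {
    "C":     [2**k for k in range(-5, 16, 2)],
    "gamma": [2**k for k in range(-15, 4, 2)],
}

# groups should hold one id per person/video so that frames of the same
# individual never appear in both the training and validation folds.
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=GroupKFold(n_splits=5), scoring="accuracy")
# search.fit(X, y, groups=groups)  # X, y, groups come from your own data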
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart (a Python sketch follows this list). This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
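In case it helps, here is a possible Python version of test #1, using scikit-learn instead of Matlab's classregtree or R's rpart; "train.libsvm" is a placeholder path for a LibSVM-format training file:

from sklearn.datasets import load_svmlight_file
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the LibSVM-format data (placeholder filename).
X, y = load_svmlight_file("train.libsvm")

# A shallow tree: if it already reaches ~100% training accuracy, one or a few
# features are giving a (suspiciously) perfect separation.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("training accuracy:", tree.score(X, y))
print(export_text(tree))  # shows which features the splits use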
