CouchDB Views: How much processing is acceptable in map reduce?

I've been toying around with Map Reduce with CouchDB. Some of the examples show some possibly heavy logic within the map reduce functions. In one particular case, they were performing for loops within map.
Is map reduce run on every single possible document before it emits your selected documents?
If so, I would think that means that running any kind of iterative processing within the map reduce functions would increase processing burden by an order of magnitude, at least.
Basically it boils down to the following question: how much logic can be performed within map reduce before it becomes an unreasonably expensive query?

Lots of expensive processing is acceptable in CouchDB map-reduce.
CouchDB views (map-reduce) are more like CREATE INDEX than they are SELECT FROM.
Specifically, CouchDB guarantees that a map function runs only once per document, ever. (Well, actually, once per document change.) That is what incremental map-reduce means.
Therefore, suppose you had 10,000 documents and they take 1 second each to process (which is way higher than I have ever seen). That is 10,000 seconds, or 2.8 hours, to completely build the view. However, once the view is complete, querying any row (?key=...) or row slice (?startkey=...&endkey=...) takes the same time as querying for documents directly. Lookup time is O(log n), where n is the document count.
In other words, even if it takes 1 second per document to execute the map, it will take a few milliseconds to fetch the result. (Of course, the view must build first, since it is actually an index.)
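For example, here is a minimal sketch (in Python, using the requests library, with a hypothetical "orders" database and "by_date" view) of querying a row slice of an already-built view; the map function is not re-run at query time:

import json
import requests

# Hypothetical database, design doc and view names - adjust to your setup.
COUCH = "http://localhost:5984"
DB = "orders"
VIEW = "_design/reports/_view/by_date"

# Query a slice of the prebuilt index. This only reads index rows;
# the map function is NOT re-executed over documents at query time.
params = {
    "startkey": json.dumps("2023-01-01"),
    "endkey": json.dumps("2023-01-31"),
}
resp = requests.get("%s/%s/%s" % (COUCH, DB, VIEW), params=params)
for row in resp.json().get("rows", []):
    print(row["key"], row["value"])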

Querying the database is a separate activity from running map/reduce over a document. Therefore the query cost is not affected by the complexity of your map/reduce functions.
In CouchDB you are querying an index: a copy of your data in a format optimized for query speed. A query is not like a table scan in SQL; it does not loop through records.
So how do you build this index? Through the map function. The map function emits a key and a value, and the key is put into the index. The more complicated map functions you mention may loop and emit many keys and values. CouchDB is smart and only runs the map function over a document when it needs to, usually when the document is created, updated, or deleted. This is why it is incremental map/reduce.
So, as you might see, a complicated map function can affect create, update, and delete speed. But again, CouchDB is smart in that you can specify how stale the data may be when you query the index.
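To make that concrete, here is a hedged sketch (the database name, view name, and document fields are hypothetical) of a design document storing such a map function. The JavaScript map source may loop over a document and emit several rows, but CouchDB only re-runs it for documents that changed since the view was last built:

import requests

# Hypothetical design document for an "orders" database; the map function
# is JavaScript source stored as a string inside the JSON document.
design_doc = {
    "_id": "_design/reports",
    "views": {
        "by_tag": {
            "map": """function (doc) {
                // one document can emit many index rows
                (doc.tags || []).forEach(function (tag) { emit(tag, 1); });
            }""",
            "reduce": "_sum",
        }
    },
}
requests.put("http://localhost:5984/orders/_design/reports", json=design_doc)

At query time, older CouchDB releases accept a staleness hint such as ?stale=ok so a read does not have to wait for the index to catch up with recent writes.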

Related

Is the QLDB SUM() function optimized?

We want to query our QLDB Ledger using the SUM() function (with a WHERE clause referring only to indexed fields).
e.g.,
select count(*), sum(NonIndexedField) from myTable where IndexedField1 = "foo" and IndexedField2 = "bar";
Is this a good or bad query pattern? The guidance at this page talks about how the COUNT() function is not optimized, and that makes me suspicious about the SUM() method as well.
Animesh,
Overall, your query follows a good pattern because it depends primarily on an equality comparison against an index (IndexedField1). However, if there are many thousands of documents with IndexedField1 = "foo", your query may run slowly or even timeout. If there are fewer documents with IndexedField2 = "bar" than IndexedField1 = "foo", then put the IndexedField2 comparison first in the WHERE clause. QLDB may try to use both indexes, but the left-most one is the most important.
Besides performance, the other problem you might run into if there are a lot of documents to scan through after the index hit is OCC exceptions. If another process inserts/updates/deletes a document with an IndexedField1 value of "foo" or an IndexedField2 value of "bar", then your aggregation query will get an OCC exception. This is generally not a problem because the driver will just retry, but it can become a perceived performance problem if the frequency of collision is high and that frequency increases with the number of documents in your transaction scope. We recommend keeping your transaction scope as small as you can.
Generally speaking, QLDB is optimized for writes and targeted fetches of small amounts of data using an equality comparison against an indexed value. It's great for validation reads that inform a transaction decision (does my account have enough balance, does this record already exist, give me customer record identified by unique ID 'X', etc.). As the number of documents in your transaction scope widens, the potential for OCCs and transaction timeouts increases. We generally recommend you stream data from QLDB into a secondary database to support OLAP, reporting, ad-hoc queries, etc.
So the real answer is that your query looks pretty good, but you should consider what your data set might look like in production now, next year, etc. and test for that to make sure you're good to go.
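As a rough illustration only (assuming the Python pyqldb driver and the hypothetical ledger/table/field names below), the reordered query could be issued like this; execute_lambda retries the whole transaction for you when an OCC conflict occurs:

from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="my-ledger")  # hypothetical ledger name

def sum_for(txn):
    # put the more selective indexed comparison first, as suggested above
    cursor = txn.execute_statement(
        "SELECT COUNT(*) AS c, SUM(NonIndexedField) AS s FROM myTable "
        "WHERE IndexedField2 = ? AND IndexedField1 = ?",
        "bar", "foo")
    return next(iter(cursor))

result = driver.execute_lambda(sum_for)
print(result["c"], result["s"])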
In my personal experience using QLDB for almost a year now, you can never rely on it to perform any kind of non-indexed selection directly on the database.
Even when a field is indexed, if it has low cardinality (meaning the index has many repeated values, as @DanB said) the query will perform poorly.
The best approach is either to select what you need into the application (if viable) and perform the sum/count there, or to use the journal export to work over all records of the database.
This second option may be tricky, though, because you must rely on inserts and updates that happened in a timeframe to get the information you want. Meaning you may need to export a HUGE cluster of files, if you need to go over all records available. And then process what you want. As you can see, this is, by all means, a non-live operation.
In short, never think of QLDB as a relational or NoSQL database and things will work out better. It's a ledger. Period.
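If it helps, a hedged sketch of the first option above (pull the narrow, indexed slice into the application and aggregate there); the pyqldb driver, ledger name, and field names are assumptions:

from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="my-ledger")  # hypothetical ledger name

def fetch_values(txn):
    # keep the transaction scope small: one indexed equality comparison
    cursor = txn.execute_statement(
        "SELECT NonIndexedField FROM myTable WHERE IndexedField1 = ?", "foo")
    return [row["NonIndexedField"] for row in cursor]

values = driver.execute_lambda(fetch_values)
print(len(values), sum(values))   # count and sum computed in the application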

Using QuerySplitter in Google Datastore to load chunks of a known size

I'd like to load lots of data from a Google Datastore table. For performance, I'd like to run, in parallel, a few queries that each load a lot of objects. Cursors are not suitable for parallel execution.
QuerySplitter is. However, for QuerySplitter you have to tell it how many splits you want, whereas what I care about is loading a certain number of objects. The number is chosen for the needs of my application: large, but not too large, say 800 objects. It's OK if the number of objects returned by each query is only very roughly the same; nothing worse would happen than different threads running for different amounts of time.
How do I do this? I could query all objects keys-only in order to count them, and divide by 800. Is there a better way?
Querying all your entities (even keys-only) might not scale so well, but you could run your query (or queries) periodically and save the counts in the datastore or memcache, depending on how frequently you need to run your job.
However, to count all the entities of a given kind you can use the Datastore Statistics API, which should be a lot quicker. I don't know how frequently the stats are updated, but it's probably the same as the stats in the console.
If you need more frequent counts, or counts for filtered queries, you might consider sharded counters. Since you only need an approximate number, you could update them asynchronously on each new put.
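For example, a hedged sketch of the statistics idea on the Python runtime (the kind name is hypothetical; the 800-entity target comes from the question; KindStat is maintained by the built-in Datastore statistics system and can lag behind the live data):

from google.appengine.ext.db import stats

TARGET_CHUNK = 800   # desired number of objects per split (from the question)
KIND = 'MyKind'      # hypothetical kind name

# Reading KindStat is cheap compared to a keys-only scan, but the count
# may be hours old.
kind_stat = stats.KindStat.all().filter('kind_name =', KIND).get()
if kind_stat:
    num_splits = max(1, int(kind_stat.count // TARGET_CHUNK))
    # pass num_splits to QuerySplitter (or to your parallel workers)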

Improve throughput of ndb query over large data

I am trying to perform some data processing in a GAE application over data that is stored in the Datastore. The bottleneck is the throughput at which the query returns entities, and I wonder how to improve the query's performance.
What I do in general:
everything works in a task queue, so we have plenty of time (10 minute deadline).
I run a query over the ndb entities in order to select which entities need to be processed.
as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
the stored data is going to be large (say 500K-1M entities) and there is a chance that the 10 minutes deadline is not enough. Therefore, when the task is reaching the taskqueue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped.
The problem is the rate in which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward
while has_more and theres_more_time:
    # pass the cursor from the previous page back in where the '...' is
    entities, cursor, more = query.fetch_page(1000, ...)
    send_to_process_queue(entities)
    has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C)
The overall performance remains the same as above. I tried various numbers of segments (from 2 to 10) in order to have 2-10 parallel fetch_page_async() calls. In all cases, the overall time remained the same. The more parallel fetch_page_async() calls are made, the longer it takes for each one to complete. I also tried 20 parallel fetches and it got worse. Changing the page size or the frontend instance class did not have any impact either.
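A minimal sketch of this segmented, parallel variant might look like the following (the Item model and its segment property are placeholders, not the question's actual schema):

from google.appengine.ext import ndb

# Hypothetical model with an integer 'segment' property used to partition
# the data into N roughly equal slices (one query per slice).
class Item(ndb.Model):
    segment = ndb.IntegerProperty()

NUM_SEGMENTS = 4
PAGE_SIZE = 1000

# Fire one asynchronous page fetch per segment, then wait for all of them.
futures = [
    Item.query(Item.segment == s).fetch_page_async(PAGE_SIZE)
    for s in range(NUM_SEGMENTS)
]
ndb.Future.wait_all(futures)

for fut in futures:
    entities, cursor, more = fut.get_result()
    # hand each batch off for further processing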
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not unsuitable at all), as the instance may run out of memory, plus I don't get a cursor in case I need to spawn another task (in fact I won't even have the ability to do so; the task will simply exceed the deadline). I tried this out of curiosity in order to see how it performs, and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly 60K entities per minute. That is approx. 3 times faster than fetch_page(). I wonder why this happens.
Use query.iter() in a single loop.
This is much like the first approach. It will make use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page(), but much slower than fetch().
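A rough sketch of this iterator variant, assuming a hypothetical Item model and helper functions (send_to_process_queue, approaching_deadline, spawn_continuation_task) standing in for the question's own plumbing:

from google.appengine.ext import ndb

class Item(ndb.Model):   # placeholder model
    pass

# produce_cursors=True lets us ask the iterator for a resumption point, so
# the loop can hand off to a fresh task before the 10-minute deadline.
it = Item.query().iter(batch_size=1000, produce_cursors=True)

batch = []
for entity in it:
    batch.append(entity)
    if len(batch) >= 1000:
        send_to_process_queue(batch)     # hypothetical helper (see question)
        batch = []
    if approaching_deadline():           # hypothetical deadline check
        cursor = it.cursor_after()       # resume from here in the next task
        spawn_continuation_task(cursor)  # hypothetical helper
        break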
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is why fetch(), fetch_page() and iter() behave so differently, and how to make either fetch_page() or iter() perform as well as fetch(). Another critical question is whether these throughputs (20-60K entities per minute, depending on the API call) are the best we can do in GAE.
I'm aware of the MapReduce API but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries and I don't want to scan all the Datastore entities (it would be too costly and slow; the query may return only a few results). Last but not least, I have to stick to GAE. Resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?
In case anyone is interested, I was able to significantly increase the throughput of the data processing by re-designing the component - it was suggested that I change the data models but that was not possible.
First, I segmented the data and then processed each data segment in a separate taskqueue.Task instead of calling multiple fetch_page_async from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially utilizing only a single Fx instance. To achieve parallelization of the tasks, I moved the component to a specific GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option.
With this design, I was able to process 20,000 entities in total within 4-5 seconds (instead of 40-60 seconds!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We'll have to fine-tune the type and number of basic instances we need.
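A hedged sketch of the enqueueing step described above (the module name, version, and handler URL are assumptions; the 'instance.version.module' target format routes a task to a specific addressable basic-scaling instance):

from google.appengine.api import taskqueue

NUM_SEGMENTS = 5           # one segment per B4 instance, as described above
MODULE = 'data-module'     # hypothetical basic-scaling module
VERSION = 'v1'             # hypothetical module version

for segment in range(NUM_SEGMENTS):
    taskqueue.add(
        url='/process_segment',
        params={'segment': segment},
        # route the task to a specific addressable instance so the segments
        # run in parallel on separate basic-scaling instances
        target='%d.%s.%s' % (segment, VERSION, MODULE),
    )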
The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like the Approach C in the other linked question).
Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
Design a solution from the output perspective - what the simplest query is that produces the required results, then what the entity structure is to support such a query, then what work is needed to create and maintain such an entity structure from the current data.

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized view entity), and re-enqueue itself if it times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
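To illustrate the scattering idea (this is only a sketch, not Brett's actual code): derive the work_index from a hash of the sequence number so consecutive enqueues land on lexically distant index rows:

import hashlib

def work_index_for(sequence_number):
    # hash the incrementing sequence number and use a short hex prefix, so
    # consecutive work items get widely separated index values instead of a
    # monotonically increasing (hotspot-prone) timestamp or counter
    digest = hashlib.sha1(str(sequence_number).encode()).hexdigest()
    return '%s-%010d' % (digest[:8], sequence_number)

print(work_index_for(1001))   # e.g. '...-0000001001'
print(work_index_for(1002))   # a very different prefix for the next item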
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I get 60 to 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
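A minimal sketch of that flat-file idea, assuming fixed-size records of a timestamp plus a few ints and a once-per-minute offset index (the file names and record layout are illustrative):

import struct
import time

RECORD = struct.Struct('<qiii')   # timestamp plus three example int fields

class ReadingLog(object):
    def __init__(self, data_path='readings.dat', index_path='readings.idx'):
        self.data = open(data_path, 'ab')
        self.index = open(index_path, 'a')
        self.last_minute = None

    def append(self, a, b, c):
        now = int(time.time())
        minute = now // 60
        if minute != self.last_minute:
            # once per minute, record (timestamp, byte offset) so a reader
            # can seek near the start of any time range and scan forward
            self.index.write('%d %d\n' % (now, self.data.tell()))
            self.index.flush()
            self.last_minute = minute
        self.data.write(RECORD.pack(now, a, b, c))

log = ReadingLog()
log.append(1, 2, 3)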
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
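As a quick, hedged illustration of "test with a realistic key distribution" (SQLite is not one of the engines mentioned; it is used here only because it ships with Python), here is a comparison of sequentially increasing vs. shuffled keys inserted into an indexed table:

import random
import sqlite3
import time

def time_inserts(keys):
    # insert the given keys into a fresh table with an indexed primary key
    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE t (k INTEGER PRIMARY KEY, v INTEGER)')
    start = time.time()
    con.executemany('INSERT INTO t VALUES (?, ?)', ((k, 0) for k in keys))
    con.commit()
    return time.time() - start

n = 200000
sequential = list(range(n))
shuffled = list(range(n))
random.shuffle(shuffled)

print('sequential keys: %.2fs' % time_inserts(sequential))
print('shuffled keys:   %.2fs' % time_inserts(shuffled))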
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
The type of indexes applied also influences insertion performance - for example, with a SQL Server clustered index, update I/O is used for data distribution as well as index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project, the best advice is to measure with real datasets (skewed page distribution, tearing, etc.).
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.
