What are the key differences between streaming and normal search in terms of internal implementation? A normal search would also work in a distributed manner, right? How does streaming improve performance? The documentation is not helping.
Distributed search issues requests, calculates results, delivers results and handles merging. Each step is processed fully before continuing to the next step. This works well enough when the amount of data is small. For larger requests, such as delivering millions of documents, this requires huge memory buffers. It also means that the caller has to wait until the very last step (delivery of the result to the caller) before the result can be processed.
With streaming, all this is ongoing: Calculation, delivery and merging happens at the same time, with a fixed upper memory overhead. You can ask for 10K results or you can ask for 10 billion, the only difference is how long it takes. As every part of the process is active at the same time, including delivery to the caller, this also means that the first result data will be delivered to the caller very quickly.
Internally, streaming is basically paging through the search result. Each page (10K documents, if I remember correctly) is passed along to the stream as soon as it has been calculated. Ignoring optimizations, the same behaviour could be emulated from the outside with deep paging and a custom merger.
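Ignoring optimizations again, the paging half of that emulation can be sketched in Python. The fetch_page callback, page size and fake backend below are hypothetical stand-ins for real Solr requests; a cross-shard merger could be layered on top with e.g. heapq.merge:

```python
PAGE_SIZE = 3  # real Solr streaming reportedly uses much larger pages (~10K docs)

def stream_results(fetch_page, page_size=PAGE_SIZE):
    """Emulate streaming: fetch one page at a time and yield documents
    immediately, so memory stays bounded by a single page."""
    cursor = None
    while True:
        docs, cursor = fetch_page(cursor, page_size)
        if not docs:
            break
        for doc in docs:
            yield doc  # delivered to the caller right away

# A fake backend standing in for a Solr shard (hypothetical, for illustration).
DATA = [{"id": i} for i in range(10)]

def fake_fetch_page(cursor, size):
    start = int(cursor) if cursor else 0
    return DATA[start:start + size], str(start + size)

print([d["id"] for d in stream_results(fake_fetch_page)])  # all ten ids, page by page
```

The point of the generator is that the caller starts receiving documents after the first page is computed, regardless of how many pages follow.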
Let's say there is a DynamoDB key with a value of 0, and there is a process that repeatedly reads from this key using eventually consistent reads. While these reads are occurring, a second process sets the value of that key to 1.
Is it ever possible for the read process to ever read a 0 after it first reads a 1? Is it possible in DynamoDB's eventual consistency model for a client to successfully read a key's fully up-to-date value, but then read a stale value on a subsequent request?
Eventually, the write will be fully propagated and the read process will only read 1 values, but I'm unsure if it's possible for the reads to go 'backward in time' while the propagation is occurring.
The property you are looking for is known as monotonic reads, see for example the definition in https://jepsen.io/consistency/models/monotonic-reads.
Obviously, DynamoDB's strongly consistent read (ConsistentRead=true) is also monotonic, but you rightly asked about DynamoDB's eventually consistent read mode.
@Charles in his response gave a link, https://www.youtube.com/watch?v=yvBR71D0nAQ&t=706s, to a nice official talk by Amazon on how eventually-consistent reads work. The talk explains that DynamoDB replicates written data to three copies, but a write completes when two out of the three copies (including one designated as the "leader") have been updated. It is possible that the third copy will take some time (usually a very short time) to get updated.
The video goes on to explain that an eventually consistent read goes to one of the three replicas at random.
So in that short amount of time where the third replica has old data, a request might randomly go to one of the updated nodes and return new data, and then another request slightly later might go by random to the non-updated replica and return old data. This means that the "monotonic read" guarantee is not provided.
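That race can be illustrated with a small simulation. The replica list and read function below are a toy model of the behaviour described in the talk, not DynamoDB's actual code:

```python
import random

random.seed(7)

# Three replicas; the write reached the leader and one follower, but the
# third replica still holds the old value (values are hypothetical).
replicas = [1, 1, 0]  # two updated copies, one stale copy

def eventually_consistent_read():
    """Each read goes to a replica chosen at random, per the talk."""
    return random.choice(replicas)

violations = 0
for _ in range(10_000):
    first, second = eventually_consistent_read(), eventually_consistent_read()
    if first == 1 and second == 0:  # read new data, then old data
        violations += 1

print(violations > 0)  # True: the monotonic-read guarantee does not hold
```

Roughly 2 in 9 read pairs hit the new-then-stale ordering in this model, which is exactly the "backward in time" behaviour the question asks about.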
To summarize, I believe that DynamoDB does not provide the monotonic read guarantee if you use eventually consistent reads. You can use strongly-consistent reads to get it, of course.
Unfortunately I can't find an official document which claims this. It would also be nice to test this in practice, similar to how the paper http://www.aifb.kit.edu/images/1/17/How_soon_is_eventual.pdf tested whether Amazon S3 (not DynamoDB) guaranteed monotonic reads, and discovered that it did not by actually observing monotonic-read violations.
One of the implementation details which may make it hard to see these monotonic-read violations in practice is how Amazon handles requests from the same process (which you said is your case). When the same process sends several requests in sequence, it may (but also may not) use the same HTTP connection to do so, and Amazon's internal load balancers may (but also may not) decide to send those requests to the same backend replica - despite the statement in the video that each request is sent to a random replica. If this happens, it may be hard to see monotonic-read violations in practice - but they may still happen if the load balancer changes its mind, or the client library opens another connection, and so on, so you still can't trust the monotonic-read property to hold.
Yes it is possible. Requests are stateless so a second read from the same client is just as likely as any other request to see slightly stale data. If that’s an issue, choose strong consistency.
You will (probably) not ever get the old data after getting the new data.
First off, there's no warning in the docs about repeated reads returning stale data, just that a read after a write may return stale data.
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
But more importantly, every item in DDB is stored in three storage nodes. A write to DDB doesn't return a 200 - Success until that data is written to 2 of 3 storage nodes. Thus, it's only if your read is serviced by the third node that you'd see stale data. Once that third node is updated, every node has the latest.
See Amazon DynamoDB Under the Hood
EDIT
@Nadav's answer points out that it's at least theoretically possible; AWS certainly doesn't seem to guarantee monotonic reads. But I believe the reality depends on your application architecture.
Most languages, nodejs being an exception, will use persistent HTTP/HTTPS connections by default to the DDB request router, especially given how expensive it is to open a TLS connection. I suspect, though I can't find any documents confirming it, that there's at least some level of stickiness from the request router to a storage node. @Nadav discusses this possibility. But only AWS knows for sure, and they haven't said.
Assuming that belief is correct
curl in a shell script loop - more likely to see the old data again
loop in C# using a single connection - less likely
The other thing to consider is that in the normal course of things, the third storage node is "only milliseconds behind".
Ironically, if the request router truly picks a storage node at random, a non-persistent connection is then less likely to see old data again given the extra time it takes to establish the connection.
If you absolutely need monotonic reads, then you'd need to use strongly consistent reads.
Another option might be to stick DynamoDB Accelerator (DAX) in front of your DDB, especially if you're retrieving the key with GetItem(). As I read how it works, it does seem to imply monotonic reads, especially if you've written through DAX, though it does not come right out and say so. Even if you've written around DAX, reading from it should still be monotonic; it's just that there will be more latency until you start seeing the new data.
I'm in the design stage of my App DB, running on Firebase.
I wonder what is the best approach to store my Data.
I know that Firebase only returns the first layer of the document unless you explicitly query for sub-collections, and I want to reduce extra reads as much as possible to save $.
In short, my DB holds devices, and each device can have messages (call-backs) that the server needs to send back to it, one at a time, in FIFO order.
The user sees the waiting call-backs when watching a device in the App.
My first thought was to hold a collection of messages for each device, but this approach will require two reads for each call-back in each device when the server wants to know what is the next call-back to send.
My second approach is to hold the call-backs in an array in each device document, but I'm not sure if it is the right approach; it sounds messy to me for some reason.
My third option is to hold a collection of call-backs with a deviceID field in each call-back. The drawback I see in this approach is that I need to perform some kind of "join" when viewing a device and its call-backs in the App, and the search time for the next call-back increases, although it remains reasonable (log(n) time).
If we talk numbers theoretically the system can store tens of thousands of devices and each device can have 10-100 call-backs in line, and each call-back can be called every 15 seconds or more.
My first thought was to hold a collection of messages for each device, but this approach will require two reads for each call-back in each device when the server wants to know what is the next call-back to send.
If you need to query the messages that correspond to a device, then storing them in a collection is a good option to go with, because you can perform both simple and compound queries.
My second approach is to hold the call-backs in an array in each device document, but I'm not sure if it is the right approach; it sounds messy to me for some reason.
If you only need to list them, and not query them, then that's a really good idea in my opinion - but only if the size of the document can stay below the maximum limit. There are limits on how much data you can put into a document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When storing text you can fit quite a lot, but as your array gets bigger, be careful about this limitation.
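A rough way to sanity-check the array approach is to estimate how many call-backs fit under that limit. The call-back shape below is hypothetical, and JSON byte size is only a rough proxy for Firestore's own document-size calculation (which counts field names and values):

```python
import json

MAX_DOC_BYTES = 1_048_576  # Firestore's 1 MiB document limit

# Hypothetical call-back payload; adjust the fields to your real sizes.
callback = {"cmd": "reboot", "args": "x" * 100, "created": 1700000000}

per_item = len(json.dumps(callback).encode("utf-8"))
max_items = MAX_DOC_BYTES // per_item  # rough upper bound of array entries
print(per_item, max_items)
```

With 10-100 call-backs per device, this estimate suggests the array approach stays far below the limit, but the math is worth redoing if call-back payloads grow.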
My third option is to hold a collection of call-backs with a deviceID field in each call-back. The drawback I see in this approach is that I need to perform some kind of "join" when viewing a device and its call-backs in the App, and the search time for the next call-back increases, although it remains reasonable (log(n) time).
When talking about collections again, we get back to the first question. Remember, we usually structure a Firestore database according to the queries that we want to perform. So it's up to you to decide if this approach is best for you.
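As a sketch of how the third option's query pattern looks, here is an in-memory stand-in. Field names are hypothetical; the real Firestore query would be an indexed .where('deviceID', '==', ...).order_by('created').limit(1), not a scan:

```python
# In-memory stand-in for a 'callbacks' collection (fields are hypothetical).
callbacks = [
    {"deviceID": "dev-1", "cmd": "reboot", "created": 3},
    {"deviceID": "dev-2", "cmd": "ping",   "created": 1},
    {"deviceID": "dev-1", "cmd": "update", "created": 2},
]

def next_callback(device_id):
    """Return the oldest pending call-back for a device (FIFO order).
    Firestore would serve this from an index rather than scanning."""
    matching = [c for c in callbacks if c["deviceID"] == device_id]
    return min(matching, key=lambda c: c["created"], default=None)

print(next_callback("dev-1")["cmd"])  # update
```

The same ordered, limit-1 query also answers "what does the server send next", which is why the read cost per call-back stays at a single document.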
Always keep in mind that you are billed according to the number of documents a query returns.
I am using a BroadcastProcessFunction to perform simple pattern matching, broadcasting around 60 patterns. Once the processing completes, the memory usage does not come down. I am using the garbage-collection setting env.java.opts = "-XX:+UseG1GC" in my Flink configuration file, but it is not helping either. The CPU percentage does come down after the data processing completes. I am checkpointing every 2 minutes and my state backend is filesystem. Below are screenshots of memory and CPU usage.
I don't see anything surprising or problematic in the graphs you have shared. After ingesting the patterns, each instance of your BroadcastProcessFunction will be holding onto a copy of all of the patterns -- so that will consume some memory.
If I understand correctly, it sounds like the situation is that as data is processed for matching against those patterns, the memory continues to increase until the pods crash with out-of-memory errors. Various factors might explain this:
If your patterns involve matching a sequence of events over time, then your pattern matching engine has to keep state for each partial match. If there's no timeout clause to ensure that partial matches are eventually cleaned up, this could lead to a combinatorial explosion.
If you are doing key-partitioned processing and your keyspace is unbounded, you may be holding onto state for stale keys.
The filesystem state backend has considerable overhead. You may have underestimated how much memory it needs.
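The first point can be illustrated with a toy model in plain Python. This is not Flink's API; in Flink CEP the equivalent cleanup comes from adding a within() clause to the pattern:

```python
class PartialMatchStore:
    """Toy model of pattern-matching state: without a timeout, partial
    matches accumulate forever; with one, stale entries get evicted."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.partial = {}  # match_id -> timestamp of the first event

    def start_match(self, match_id, now):
        self.partial[match_id] = now

    def evict_expired(self, now):
        expired = [m for m, t in self.partial.items()
                   if now - t > self.timeout_s]
        for m in expired:
            del self.partial[m]
        return len(expired)

store = PartialMatchStore(timeout_s=60)
store.start_match("a", now=0)    # partial match that never completes
store.start_match("b", now=100)  # recent partial match, still in flight
print(store.evict_expired(now=120))  # 1 -> "a" is dropped, "b" is kept
```

Without the eviction step, every event that could begin a match pins state indefinitely, which is one way memory keeps growing after the broadcast patterns are loaded.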
I'm looking into using Solr for a use case that will require some deep paging, thinking an upper bound of about 100k total results split into 1k pages from a collection of ~10 million records. I quickly discovered why using start & num_rows is a bad idea for a result set that size and came across cursorMark in the process. Articles I've found about cursorMark suggest a relatively constant time for record access regardless of position in the set which seems perfect for my case.
The question I had, though, is: is there any kind of performance impact going down this route? Is there any performance difference in terms of memory/CPU usage when using cursorMark to deep page into result sets of 1k, 10k, 100k, 1 million records, assuming I return 1000 at a time?
In theory it gets a little bit faster as you page down. In reality the difference is so small that you won't notice it.
A standard non-cursor search uses a little queue to hold the top-X results. Every match is added to that queue, pushing out poorer matches if the queue is full.
A cursor search also uses a queue of size X. Every match whose sort value is beyond the previous cursor mark is added to that queue, pushing out poorer matches if the queue is full. So as you page deeper, there are a bit fewer inserts.
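That queue behaviour can be sketched like this - a simplified model with plain float sort values, whereas real Solr sorts on document scores plus a tie-breaking unique key:

```python
import heapq

def cursor_page(scores, page_size, cursor_mark=None):
    """Collect the next page: keep only matches whose sort value is past
    the previous cursor mark, in a bounded min-heap of size page_size."""
    heap = []  # min-heap holding the current top page_size values
    for s in scores:
        if cursor_mark is not None and s >= cursor_mark:
            continue  # already delivered on an earlier page
        if len(heap) < page_size:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)  # push out a poorer match
    page = sorted(heap, reverse=True)       # best first
    next_mark = page[-1] if page else None  # becomes the new cursor mark
    return page, next_mark

scores = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.2]
page1, mark = cursor_page(scores, 3)       # [0.9, 0.8, 0.7]
page2, _ = cursor_page(scores, 3, mark)    # [0.4, 0.3, 0.2]
print(page1, page2)
```

Memory stays bounded by page_size no matter how deep you page, which is the property that makes cursorMark's access time roughly constant.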
There are some very illustrative graphs of cursor performance at https://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
I am trying to perform some data processing in a GAE application over data that is stored in the Datastore. The bottleneck point is the throughput in which the query returns entities and I wonder how to improve the query's performance.
What I do in general:
everything works in a task queue, so we have plenty of time (10 minute deadline).
I run a query over the ndb entities in order to select which entities need to be processed.
as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
the stored data is going to be large (say 500K-1M entities) and there is a chance that the 10 minutes deadline is not enough. Therefore, when the task is reaching the taskqueue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped.
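The resume-from-cursor pattern can be modelled like this. fetch_page below is a stub standing in for ndb's query.fetch_page(page_size, start_cursor=...); in real ndb you would serialize the cursor with cursor.urlsafe() before handing it to the spawned task:

```python
# Toy dataset standing in for the Datastore entities.
DATA = list(range(25))

def fetch_page(page_size, start_cursor=None):
    """Stub for ndb's fetch_page: returns (entities, cursor, more)."""
    start = start_cursor or 0
    page = DATA[start:start + page_size]
    cursor = start + page_size
    return page, cursor, cursor < len(DATA)

def run_task(start_cursor=None, budget_pages=2, page_size=10):
    """Process up to budget_pages, then return the cursor so a new task
    can be spawned to continue from where this one stopped."""
    processed, cursor, more = [], start_cursor, True
    for _ in range(budget_pages):
        page, cursor, more = fetch_page(page_size, cursor)
        processed.extend(page)
        if not more:
            break
    return processed, (cursor if more else None)

first, resume = run_task()        # two pages, then the "deadline" hits
second, done = run_task(resume)   # the spawned task finishes the rest
print(len(first), len(second), done)  # 20 5 None
```

A returned cursor of None signals that the query is exhausted and no follow-up task is needed.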
The problem is the rate in which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward
while has_more and theres_more_time:
    entities, cursor, more = query.fetch_page(1000, ...)
    send_to_process_queue(entities)
    has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C)
The overall performance remains the same as above. I tried various numbers of segments (from 2 to 10) in order to have 2-10 parallel fetch_page_async() calls. In all cases, the overall time remained the same. The more parallel fetch_page_async() calls, the longer each one takes to complete. I also tried 20 parallel fetches and it got worse. Changing the page size or the frontend instance class did not have any impact either.
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not unsuitable altogether), as the instance may run out of memory; plus, I don't get a cursor in case I need to spawn another task (in fact, I won't even have the ability to do so; the task will simply exceed the deadline). I tried this out of curiosity to see how it performs, and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly 60K entities per minute. That is approx. 3 times faster than fetch_page(). I wonder why this happens.
Use query.iter() in a single loop.
This is much like the first approach. It makes use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page(), but much slower than fetch().
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is: why do fetch(), fetch_page() and iter() behave so differently, and how can I make either fetch_page() or iter() perform as well as fetch()? Another critical question is whether these throughputs (20-60K entities per minute, depending on the API call) are the best we can do in GAE.
I'm aware of the MapReduce API, but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries, and I don't want to scan all the Datastore entities (that would be too costly and slow - the query may return only a few results). Last, but not least, I have to stick with GAE. Resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?
In case anyone is interested, I was able to significantly increase the throughput of the data processing by re-designing the component - it was suggested that I change the data models but that was not possible.
First, I segmented the data and then processed each data segment in a separate taskqueue.Task instead of calling multiple fetch_page_async from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially utilizing only a single Fx instance. To achieve parallelization of the tasks, I moved the component to a specific GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option.
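The segmentation and task targeting can be sketched as follows. enqueue_segment_tasks is a stand-in for calling taskqueue.add(..., target=...), and the instance names are hypothetical:

```python
def segment_key_range(start, end, num_segments):
    """Split [start, end) into num_segments contiguous ranges."""
    step = (end - start + num_segments - 1) // num_segments
    return [(s, min(s + step, end)) for s in range(start, end, step)]

def enqueue_segment_tasks(segments, instances):
    """Stand-in for taskqueue.add(..., target=...): pin each segment's
    task to a specific basic-scaling instance so they run in parallel."""
    tasks = []
    for i, (lo, hi) in enumerate(segments):
        target = instances[i % len(instances)]  # e.g. '0.mymodule'
        tasks.append({"target": target, "range": (lo, hi)})
    return tasks

segments = segment_key_range(0, 20_000, 5)
tasks = enqueue_segment_tasks(segments, [f"{i}.mymodule" for i in range(5)])
print(len(tasks), tasks[0])  # 5 {'target': '0.mymodule', 'range': (0, 4000)}
```

Each task then runs its own query restricted to its segment, so the five B4 instances fetch in parallel instead of one instance paging sequentially.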
With this design, I was able to process 20,000 entities in total within 4-5 seconds (instead of 40-60 seconds!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We'll have to fine-tune the type and number of basic instances we need.
The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like the Approach C in the other linked question).
Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
Design a solution from the output perspective - what the simplest query is that produces the required results, then what the entity structure is to support such a query, then what work is needed to create and maintain such an entity structure from the current data.