Azure Search Updating Individual Documents

We are using Azure Search for various scenarios. We often need to update an individual document with changes that users make, and we need those changes to become visible in our indexes as soon as possible so that the stale window is as short as reasonably possible.
What is the best strategy to handle this? We know that batch updates are the way to go, but we need a more immediate reflection of the changes.
Once a document is updated, how long does it take for the index to reflect the change?
Many thanks

If the updates are not very frequent, you can simply update the Azure Search document immediately (that is, using batches of size 1). On the other hand, if the updates are extremely frequent and you notice a high rate of failures with single-document batches, you will need to build some sort of "collector" mechanism to batch up updates. My recommendation would be to do the simple thing first: try single-document batches, and add batching logic if necessary.
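A minimal sketch of such a collector mechanism in Python, with hypothetical names; the `flush_fn` callback stands in for the actual Azure Search batch upload (e.g. posting a batch of mergeOrUpload actions):

```python
import time

class UpdateCollector:
    """Accumulates per-document updates and flushes them in batches
    when either a size or an age threshold is reached."""

    def __init__(self, flush_fn, max_batch=100, max_wait_s=1.0):
        self.flush_fn = flush_fn        # wraps the actual index call
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pending = {}               # key -> latest document version
        self.first_add = None

    def add(self, key, doc):
        # Later updates to the same key overwrite earlier ones, so each
        # flush sends only the newest version of each document.
        if not self.pending:
            self.first_add = time.monotonic()
        self.pending[key] = doc
        if (len(self.pending) >= self.max_batch or
                time.monotonic() - self.first_add >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(list(self.pending.values()))
            self.pending.clear()
```

With `max_batch=1` this degenerates to the simple single-document strategy, so the same code path works for both regimes.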
Updated or newly indexed documents are reflected in the search results after a short delay, usually ranging from a few milliseconds to 1-2 seconds. The delay depends on the service topology and indexing load.

Related

How can I use change detection when indexing a SQL view with an Azure Search indexer?

We have Azure Search targeting a table now, but we need to change it to target a view instead. After some reading, I realized we cannot use integrated change tracking to do incremental indexer builds. Does that mean the indexer needs to be fully rebuilt each time? The view contains several million rows, and each rebuild takes around half an hour. Questions:
Is there a better way to do this to minimize the data latency?
During the indexer rebuild, would it impact the search calls?
Thanks.
You can use the high water mark change detection policy with a SQL view. This ensures that when your indexer runs, only rows that have changed according to some high water mark column are indexed.
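For reference, the policy is part of the data source definition. A sketch of the REST payload, built here as a Python dict; the data source name and the `LastUpdatedUtc` column are hypothetical, and the column must increase monotonically on every row change:

```python
# Data source definition for PUT/POST against the Azure Search
# data sources endpoint, pointing at a SQL view with a high water
# mark change detection policy.
datasource = {
    "name": "my-view-datasource",                       # hypothetical name
    "type": "azuresql",
    "credentials": {"connectionString": "<connection string>"},
    "container": {"name": "dbo.MyView"},                # the SQL view
    "dataChangeDetectionPolicy": {
        "@odata.type":
            "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "LastUpdatedUtc",    # rowversion-style column
    },
}
```

On each run the indexer remembers the highest value it saw in that column and only reads rows above it next time.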
During the indexer rebuild, would it impact the search calls?
Maybe. Indexing does consume some resources and may impact search latency. It depends on your pricing tier, topology (number of replicas and partitions), and workload. If you use a change detection policy and the number of changed rows indexed during each run of the indexer is relatively small, it probably won't have much of an impact.
To minimize data latency when using a view, you can use the high water mark change detection policy. But:
Real-time data synchronization isn't possible with an indexer. An indexer can reindex your table at most every five minutes. If data updates need to be reflected in the index sooner, we recommend pushing updated rows directly.
Change detection will not reload the data for you. It just keeps track of the records updated after the last run. You will have to set up a scheduler to reload the data, or, if you want it to be real time, you can use the search service to upload new data directly to the index. But that approach has quota limits.
If you need to upload a large set of records, change tracking will do it efficiently. Instead of using a scheduler, you can also run the indexer on demand using the Run Indexer API. This will reload all updated data for you.
You can track the status of an indexer run using the Get Indexer Status API.

Best practices for counting very large and changing datasets

This is not an App Engine question per se, though our application runs in Python on App Engine using NDB against the datastore. The question is about doing work on large datasets in a distributed system.
We have a growing dataset that we need to calculate statistics against (counts, sums, etc.). We have systems in place that successfully do this in a differential manner, so that the statistics are maintained transactionally as things change... but there are cases when we want to blow away our statistics and recalculate them from scratch, and/or run validation routines to check the counts/sums we have been maintaining differentially.
The question is, generally: what are some best practices for building statistics against a very large dataset that is constantly changing in a distributed system?
Let's say we kicked off a large MapReduce job to sum a particular field on a million entities... and while that job was running, several new entities came in, several were deleted, and several other summed properties changed. What are some of the best known/accepted/successful approaches to making sure those additions/deletions/changes make it into the overall sum?
The way I do this is that I don't query all instances and run my operation over all of them every time. I have a separate entity group which holds these statistics in a single attribute. Whenever I create or update an instance, I update the value of this attribute accordingly, and when I delete an instance I also update this value accordingly.
The best way to make sure that any updates do update the statistics entity group is to use hooks that will run automatically every time you put or delete the instance.
Hope that helps.
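A minimal, storage-agnostic sketch of that differential approach; in actual NDB code the stats updates would live in hooks such as `_post_put_hook` / `_post_delete_hook`, and here a plain dict stands in for the datastore:

```python
class Stats:
    """Running statistics maintained differentially instead of
    recounting all entities on every change."""
    def __init__(self):
        self.count = 0
        self.total = 0

class Store:
    def __init__(self):
        self.items = {}      # stands in for the datastore
        self.stats = Stats()

    def put(self, key, value):
        # On create, add the full value; on update, apply only the delta.
        old = self.items.get(key)
        if old is None:
            self.stats.count += 1
            self.stats.total += value
        else:
            self.stats.total += value - old
        self.items[key] = value

    def delete(self, key):
        old = self.items.pop(key, None)
        if old is not None:
            self.stats.count -= 1
            self.stats.total -= old
```

The key property is that every mutation path routes through the same two methods, so the statistics can never silently drift unless a write bypasses the hooks.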
If you're able to meet several conditions:
track each individual MapReduce sub-job
determine the individual MapReduce sub-jobs for which results would be affected by a transactional update
ensure such affected MapReduce sub-jobs don't run simultaneously with the transactional update affecting them (might be ensured by the transaction itself?)
determine in a transactional update, for each interfering MapReduce sub-job, if the sub-job has already completed
Then you could generate and apply differential stats updates for each interfering sub-job which has completed (maybe apply them after completion of the entire large MapReduce job?). The sub-jobs which haven't executed yet don't need such differential stats, as the content will already be updated by the time the sub-job executes on it.
You might need to treat the interferences for the additions, deletions and plain changes in a transactional update separately.
Alternatively, you could store all MapReduce sub-jobs' partial results, track which of them are affected by transactional updates (if any), and, at the end of the large MapReduce job, check if any update happened while the job was running. If so, re-run just the affected sub-jobs to get updated partial results for them and re-assemble the partial results into the final result. Repeat until no more updates happen while the most recent partial MapReduce re-run is in progress. More or less rsync-style, like copying/moving huge live partitions with minimal downtime.
You could even feed relevant impact info from the transactional updates to the mapper (a slightly smarter one), to let the mapper itself evaluate the impact on the potentially affected maps and propagate the info accordingly to get the affected sub-jobs re-run, as the updates come in :)
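The re-run-until-stable alternative can be sketched like this, with `drain_updates` standing in for whatever mechanism records interfering transactional updates while the job runs:

```python
def converge_sum(partitions, drain_updates):
    """Sum all partitions, re-running only affected partial sums until a
    pass completes with no interfering updates (rsync-style convergence).

    `partitions` maps a partition key to a list of values;
    `drain_updates()` returns the (partition, index, new_value) updates
    that arrived since the last call, or an empty list."""
    partials = {p: sum(rows) for p, rows in partitions.items()}
    while True:
        updates = drain_updates()
        if not updates:
            return sum(partials.values())
        affected = set()
        for part, idx, new_value in updates:
            partitions[part][idx] = new_value
            affected.add(part)
        for part in affected:            # re-run only the affected sub-jobs
            partials[part] = sum(partitions[part])
```

Since each iteration only recomputes the partitions that actually changed, the cost of a pass shrinks as the update rate drops, which is what lets the loop terminate on a live dataset.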

How much RealTime is Elasticsearch, Solr and DSE realtime search?

For the last couple of weeks, I have been working with Elasticsearch and Solr, trying to do OLTP processing in real time. What strikes me is that they claim (especially ES) to be real time, yet the meaning of "real time" looks fuzzy to me.
If we go deep into it, both ES and Solr define a refresh rate or a soft-commit rate, after which newly indexed documents become available for search, effectively providing only near-real-time capabilities.
It looks like "real-time search" is either a marketing statement, or the term is made fuzzy by contrasting real-time search with batch or analytical processing.
Am I correct, or correct me if I am wrong: is real-time search possible in a typical OLTP system, where every transaction has search visibility of the latest document?
Elasticsearch is near real time for search. It is real time for operations like Create, Update, Delete and Get.
By default, the refresh interval is 1 second. In some use cases, that can appear to be real time. For example, I was working for a French government service where we produced statistics per day, so for our use case it was effectively real time from our perspective.
For logs for example, 1 second is enough in most use cases.
You can modify this default value but it comes with a cost.
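For example, the refresh interval is a dynamic index setting; a sketch of the request body for `PUT /<index>/_settings` (the 30-second value is illustrative, and lengthening it trades freshness for indexing throughput):

```python
import json

# Body for PUT /<index>/_settings. "1s" is the default; larger values
# make newly indexed documents visible later but reduce refresh cost.
settings = {"index": {"refresh_interval": "30s"}}
body = json.dumps(settings)
```

Going the other way also has a cost: recent Elasticsearch versions let a write request pass `?refresh=wait_for` (or `?refresh=true`) so the call does not return until the change is searchable, at the price of slower writes.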
If you really need real time, then you probably want to use a SQL database.
My 2 cents.
Yes, DSE Search is indeed near real time and has not yet achieved the mythical goal of absolute zero latency. But even traditional "real" real time is not real time once you factor in the time to do the actual database update, plus the fact that a lot of traditional database updates are batch-oriented; even if the actual update operation is not batched, there is likely to be some human process that delays the start of the database update from the original source of a data change.
Also keep in mind that the latency of a database update needs to include maintaining the required (tunable) consistency for replicating data updates in the cluster.
Rather than push you back towards SQL if you want real time, I would challenge you to fully justify the true latency requirements of the app. With complex distributed applications you need to be prepared for occasional resource outages, such as network delays, so it is usually much better to design a modern distributed application to be flexible and asynchronous than to use a traditional, synchronous, fragile (think HealthCare.gov) app architecture that improperly depends on a perception of zero-latency distributed operations.
Finally, we are working on enhancements to reduce the actual latency of database updates, coupled with ongoing improvements in hardware performance that further shrink the update latency window.
But ultimately, all computing real-time measures will have some non-zero latency and modern distributed apps must be designed for at least some degree of decoupling between database updates and absolute dependency on those updates.
Worst case scenario, apps that need to synchronize with database updates may need to implement a polling strategy to wait for the update to complete.
Elasticsearch has real-time features for CRUD operations. On GET operations, it checks the transaction log for any uncommitted changes and returns the most recent version of the document.
The Percolator feature enables real-time matching in search as well. It allows you to register queries (percolations) that are evaluated at indexing time to report which of those predefined queries a new document matches.
This workflow looks like this:
Register specific query (percolation) in Elasticsearch
Index new content (passing a flag to trigger percolation)
The response to the indexing operation will contain the matched percolations
A very good blog with live example that explains the Percolator concept:
http://blog.qbox.io/elasticsesarch-percolator
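The blog describes the older percolator API; in current Elasticsearch versions the same workflow uses a `percolator` field type plus a `percolate` query at search time. A sketch of the request bodies (the index name `alerts` and field names are hypothetical):

```python
# Mapping for an index that stores registered queries
# (PUT /alerts), e.g. for alerting on incoming log messages.
mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "message": {"type": "text"},   # field the queries refer to
        }
    }
}

# Register a query by indexing it as a document (PUT /alerts/_doc/1).
registered_query = {"query": {"match": {"message": "error"}}}

# Find which registered queries match a new document
# (GET /alerts/_search); each hit is a matching stored query.
percolate_search = {
    "query": {
        "percolate": {
            "field": "query",
            "document": {"message": "disk error on node 3"},
        }
    }
}
```

The roles are inverted compared to normal search: the queries are indexed once, and each incoming document is matched against them, which is what makes the matching effectively real time.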

How can I combine similar tasks to reduce total workload?

I use App Engine, but the following problem could very well occur in any server application:
My application uses memcache to cache both large (~50 KB) and small (~0.5 KB) JSON documents which aggregate information which is expensive to refresh from the datastore. These JSON documents can change often, but the changes are sparse in the document (i.e., one item out of hundreds may change at a time). Currently, the application invalidates an entire document if something changes, and then will lazily re-create it later when it needs it. However, I want to move to a more efficient design which updates whatever particular value changed in the JSON document directly from the cache.
One particular concern is contention from multiple tasks / request handlers updating the same document, but I have ways to detect this issue and mitigate it. However, my main concern is that it's possible that there could be rapid changes to a set of documents within a small period of time coming from different request handlers, and I don't want to have to edit the JSON document in the cache separately for each one. For example, it's possible that 10 small changes affecting the same set of 20 documents of 50 KB each could be triggered in less than a minute.
So this is my problem: What would be an effective solution to combine these changes together? In my old solution, although it is expensive to re-create an entire document when a small item changes, the benefit at least is that it does it lazily when it needs it (which could be a while later). However, to update the JSON document with a small change seems to require that it be done immediately (not lazily). That is, unless I come up with a complex solution that lazily applies a set of changes to the document later on. I'm hoping for something efficient but not too complicated.
Thanks.
Pull queue. Everyone using GAE should watch this video:
http://www.youtube.com/watch?v=AM0ZPO7-lcE
When a call comes in, update memcache and do an async add to your pull queue. You could likely run a process that handles thousands of updates each minute without a lot of overhead (i.e., instance issues). You still have an issue should memcache get purged prior to your updates, but that is not too hard to work around. HTH. -stevep
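The coalescing step the batch worker performs after leasing a group of tasks from the pull queue can be sketched as follows (the `(doc_id, field, value)` task shape is hypothetical):

```python
from collections import defaultdict

def coalesce(tasks):
    """Collapse queued patch tasks so each document is edited once.

    Each task is (doc_id, field, value). Tasks leased in one batch are
    grouped by document, with later values for the same field winning,
    so ten rapid changes to one document become a single edit."""
    patches = defaultdict(dict)
    for doc_id, field, value in tasks:
        patches[doc_id][field] = value
    return dict(patches)
```

The worker then applies one merged patch per document, which is exactly the "10 small changes to the same 20 documents" case from the question reduced to 20 edits.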

Efficiently record and store page view counts in the database?

Sites like StackOverflow count, save, and display the view counts for pages. How do you do it efficiently? Let's take view counts for StackOverflow questions as the example. I see the following options.
Option 1. When the app receives a request for a question, increment the count in the question table. This is very inefficient! The vast majority of queries are read-only, but you're tacking on an update to each one.
Option 2. Maintain some kind of cache that maps new view counts to question IDs. When the app receives a question request, increment the cached view count for the question ID. You're caching the marginal increase in views. So far, so good. Now you need to periodically flush the counts in the cache. That's the second part of the problem. You could use a second thread or some kind of scheduling component. This really is a separate question and partly depends on your server platform (I'm using Java). Or, rather than using a separate thread, after a certain number of counts stored in the cache, you could do an update within the request thread that reached the threshold. The update functionality could be encapsulated in the cache, giving the cache some IQ points.
I like the idea of a cache that gets flushed when a threshold is reached. I'm curious to know what others have done and if there is a better way.
I agree that writing each page load to the database is inefficient. My approach is to cache the requests, then commit them once an hour. Each time I update the cache, I compare the current time to LastWriteTime, and if more than an hour has passed, I write. It's an ASP.NET app, so I do a final commit in the application's shutdown method. The latter is not guaranteed to run, but that is considered an acceptable loss.
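A sketch of option 2 combining both flush triggers, a count threshold and an age limit; `write_fn` stands in for the batched database update:

```python
import time
from collections import Counter

class ViewCounter:
    """In-memory view-count cache, flushed when either a count
    threshold or a time limit is reached."""

    def __init__(self, write_fn, threshold=100, max_age_s=3600.0):
        self.write_fn = write_fn      # performs the batched DB UPDATEs
        self.threshold = threshold
        self.max_age_s = max_age_s
        self.counts = Counter()       # question id -> marginal views
        self.last_write = time.monotonic()

    def hit(self, question_id):
        self.counts[question_id] += 1
        if (sum(self.counts.values()) >= self.threshold or
                time.monotonic() - self.last_write >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.counts:
            self.write_fn(dict(self.counts))   # one batched write
            self.counts.clear()
        self.last_write = time.monotonic()
```

Calling `flush()` from the application's shutdown hook covers the "final commit" case; counts still in the cache when the process dies uncleanly are the acceptable loss described above.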