What are some strategies for updating volatile data in Solr?

What are some strategies for updating volatile data in Solr? Imagine if you needed to model YouTube video data in a Solr index: how would you keep the "views" data fresh without swamping Solr in updates?
I would imagine that storing the "views" data in a different data store (something like MongoDB or Redis) that is better at handling rapid updates would be the best idea.
But what is the best way to update the index periodically with that data? Would a delta-import make sense in this context? What does a delta-import do to Solr in terms of performance for running queries?

First you need to define "fresh".
Is "fresh" 1ms? If so, by the time the value (the rendered html) gets to the browser, it's not fresh anymore, due to network latency. Does that really matter? For the vast majority of cases, no, true real-time results are not needed.
A more common limit is 1s. In that case, Solr can deal with that with RankingAlgorithm (a plugin) or soft commits (currently available in Solr 4.0 trunk only).
"Delta-import" is a term from DataImportHandler that doesn't have much intrinsic meaning. From the point of view of a Solr server, there are only document additions; it doesn't matter where they come from or whether a set of documents represents the "whole" dataset.
If you want an item indexed within 1s of its creation/modification, then do just that: add it to Solr right after it's created/modified (for example with a hook in your DAL). This should be done asynchronously, using RA or soft commits.
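
As a rough illustration of "add it right after it's modified", here is a minimal Python sketch that builds (but does not send) an update request carrying Solr's commitWithin parameter. The host, the core name "videos" and the document fields are assumptions for the example; whether commitWithin results in a soft or hard commit depends on your Solr version and configuration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_add_request(solr_url, core, docs, commit_within_ms=1000):
    """Build a POST that adds docs and asks Solr to make them
    searchable within commit_within_ms milliseconds."""
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{solr_url}/{core}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core and document, for illustration only.
add_request = build_add_request(
    "http://localhost:8983/solr",
    "videos",
    [{"id": "v1", "title": "Cat video"}],
)
```

In a real application the request would be sent with urllib.request.urlopen from a background worker or queue consumer, so the code path that modified the item does not wait on Solr.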

You might be interested in so-called "near-realtime search", or NRT, now available on Solr's trunk, which is designed to deal with exactly this problem. See http://wiki.apache.org/solr/NearRealtimeSearch for more info and links.

How about using the external file field?
This lets you maintain data outside of your index in a separate file, which you can refresh periodically without any changes to the index.
For fast-changing data such as downloads, views, or rank, this can be a good option.
More info: http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
This has some limitations, so you would need to check depending upon your needs.
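
A minimal Python sketch of the periodic refresh, assuming a field named "views" and document keys like "video1" (both made up for the example). ExternalFileField reads a file named external_<fieldname> from the index data directory, with one key=value line per document; the temporary directory below only stands in for that data directory.

```python
import os
import tempfile


def render_external_file(values):
    """Render {doc_key: number} as the key=value lines ExternalFileField reads."""
    return "".join(f"{key}={value}\n" for key, value in sorted(values.items()))


def write_external_file(data_dir, field_name, values):
    """Write external_<field_name> via a temp file and rename,
    so a reader never sees a half-written file."""
    target = os.path.join(data_dir, f"external_{field_name}")
    fd, tmp_path = tempfile.mkstemp(dir=data_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(render_external_file(values))
    os.replace(tmp_path, target)
    return target


# The view counts would come from the fast store (Redis, MongoDB, ...).
with tempfile.TemporaryDirectory() as data_dir:
    path = write_external_file(data_dir, "views", {"video2": 15, "video1": 1024})
    with open(path, encoding="utf-8") as handle:
        written = handle.read()
```

When Solr picks up the new file depends on the version and configuration (typically when a new searcher is opened), so check the documentation for your release.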

Related

Database cache ( redis, memcache ) usage, query vs. items

I'm wondering what the preferred way is to cache elements from a database with an in-memory cache, like redis or memcache. The context is that I have a table of items which are being accessed by an API, frequently ( millions of times per second ) as real-time stats. In general, the API is just looking for items in a given range of time, with a certain secondary id. The same data is likely to be hit many times. It seems like you could do it in a few ways:
Cache the entire query.
Meaning, the entire data string resulting from the real query to the Database would get stored in the cache, with a minimal query as the key. The advantage is that for frequently used queries, there is just a single access to get the entire set of results back. But any slightly different query needs to be redone and cached.
Cache the items in the query.
Meaning, each item returned from the real query gets stored individually in the cache, with a searchable id as the key. The advantage is that for slightly different queries, you don't need to run a full query against the DB again, just elements that are not currently cached.
Mirror the entire database
Meaning, each item is put into the cache as soon as it gets created/updated in the DB. The cache is always assumed to be up to date, and so all queries can just run on the cache directly.
It seems like these approaches might be better or worse in certain circumstances, but are there some pitfalls here that make some completely undesirable? Or just clearly better in this use-case?
Thanks for any advice!

Option #3, mirroring the database, is not a good option. Keep in mind that most in-memory systems like Redis don't have a query language; retrieval is based on keys. So it is not a good idea to replicate the data, especially if the data is relational.
You should use a combination of #1 and #2. Redis is key-based, so you will have to design the keys according to your query criteria. I would suggest building a library that works on the concept of an ETag. In Redis, save the ETag and the query response. The library should pass the ETag to the backend logic, which re-runs the query only if the ETag doesn't match. If the ETag matches, the backend does not re-run the query, and the library takes the cached response from Redis and sends it back to the client.
Refer
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag for concept.
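
The ETag idea can be sketched in a few lines of Python. This is only one possible reading of the suggestion: a plain dict stands in for Redis, and the key format, the version tokens "v1"/"v2" and the function names are all hypothetical.

```python
def fetch_with_etag(cache, key, current_etag, run_query):
    """Return (response, was_cached).

    The expensive query is re-run only when the cached ETag differs
    from the current one reported by the backend."""
    entry = cache.get(key)
    if entry is not None and entry["etag"] == current_etag:
        return entry["response"], True
    response = run_query()
    cache[key] = {"etag": current_etag, "response": response}
    return response, False


cache = {}   # stand-in for Redis
calls = []   # records how often the real query runs


def expensive_query():
    calls.append(1)
    return ["item-1", "item-2"]


key = "items:secondary=42:from=100:to=200"
first = fetch_with_etag(cache, key, "v1", expensive_query)
second = fetch_with_etag(cache, key, "v1", expensive_query)
third = fetch_with_etag(cache, key, "v2", expensive_query)
```

The sketch assumes the backend can produce the current ETag cheaply (for example from a last-modified timestamp or a version counter); if computing the ETag costs as much as the query, the scheme gains nothing.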

possible issues in indexing of large documents in solr/lucene

I am trying to index a large amount of data in Solr/Lucene. Since it is a legacy system, and for some other reasons, I have to do it via a C++ layer. Before doing that I wanted to optimize the process, so I searched for advice and found the following:
Indexing in batches: this helps when indexing fails partway through, because I can restart with the remaining batches.
buffer lookup
indexer concurrency
I found the last two terms while looking into other issues, but I don't fully understand them.
Can anyone help me understand these two items, and any other issue that may arise?

I'm not sure what you mean when you're mentioning "Buffer Lookup" - usually this is the case of allowing a server to have a decent in-memory cache, where as many queries as possible can be answered without having to recalculate the intersection between documents and which documents are contained in a certain set for each query. For Solr this is configured using the different *cache-settings. The requirements will be different for most applications, depending on query load, field definitions, etc. Performing a commit (making documents visible in the index) usually expires caches, as the cache might no longer be valid.
Indexer Concurrency allows a server to insert documents into the actual index from many threads at the same time, without locking between the threads. Lucene made concurrent indexing possible back in 2011 (for Lucene 4.0), and allows faster and more efficient updates of the index. Whether this matters depends on your application.
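
The batching idea from the question (restart from the failed batch rather than from the beginning) could look roughly like this in Python. The send_batch callback is a placeholder for whatever actually posts documents to Solr, and the failure is simulated.

```python
def batches(items, size):
    """Yield (batch_number, slice) pairs of at most `size` items."""
    for start in range(0, len(items), size):
        yield start // size, items[start:start + size]


def index_in_batches(docs, size, send_batch, start_batch=0):
    """Send docs batch by batch.

    Returns None when everything was sent, or the number of the batch
    that failed so the caller can resume from there."""
    for number, batch in batches(docs, size):
        if number < start_batch:
            continue  # already indexed in a previous run
        try:
            send_batch(batch)
        except Exception:
            return number
    return None


docs = [{"id": i} for i in range(10)]
sent = []
state = {"fail_once": True}


def flaky_send(batch):
    # Simulate one transient failure on the batch that starts with id 4.
    if state["fail_once"] and batch[0]["id"] == 4:
        state["fail_once"] = False
        raise IOError("simulated failure")
    sent.extend(doc["id"] for doc in batch)


resume_at = index_in_batches(docs, 4, flaky_send)
finished = index_in_batches(docs, 4, flaky_send, start_batch=resume_at)
```

Persisting the returned batch number (to a file or database) is what makes the restart possible after a crash; the sketch keeps it in a variable for brevity.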

When to definitely use SOLR over Lucene in a Sitecore 7 build?

My client does not have the budget to setup and maintain a SOLR server to use in their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar and the code will be the same, and a SOLR server can be swapped in later.
The site build has
faceted search page
listing components on landing and on other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components not including media library items. Are there any concerns about simply using Lucene?
The main question is, when, during your architecture or design phase do you know that you should definitely choose SOLR over Lucene? What are the major signs that lead you recommend that?

I think if you are dealing with a customer on a limited budget then Lucene will work perfectly well and perform excellently for the scale of things you are doing. All the things you mention are fully supported by the implementation in Lucene.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - I'd say 50 thousand upwards - Lucene is happy with these sorts of numbers but Solr has improved query caching and is designed for these large numbers of items.
The resilience of the search tier is of maximum business importance (ie the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing of the search tier in other application is important (non Sitecore) - Solr is a search application so can be accessed over HTTP with XML/JSON etc which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say, if you want to swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process is as simple as possible. Worth noting a few points here:
While your LINQ queries will stay the same your configuration will be slightly different and will need attention to port across.
The understanding of how Solr works as an application and how the schema works is important to know but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. I'm sure there are more points here and others can chime in if they think of them. Hope this helps :)

Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.

I would recommend planning an escape plan from Lucene as early as you start thinking about multiple CDs and here is why:
A) Each server has to maintain its own index copy:
Any unexpected restart might cause a few documents not to be added to the index on the one box, making indexes different from server to server.
That would lead to the same page being rendered differently by different CDs
Each server must perform index updates - use CPU & disk space; response rate drops after publish operation is over =/
According to security guide, CDs should have Sitecore Shell UI removed, so index cannot be easily rebuilt from Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation does roughly following:
Create an array with size equal to total number of documents in the index
If document matches search, set flag in the array
While this works like a charm for low sized indexes (~10K elements), huge performance degradation is produced once the volume of content grows.
The allocated array ends in Large Object Heap that is not compacted by default, thereby gets fragmented fast.
Scenario:
Perform search for 100K documents -> huge array created in memory
Perform one more search in another thread -> one more huge array created
Update index -> now 100K + 10 documents
The first operation was completed; LOH has space for 100K array
Search triggered again -> 100K+10 array is to be created; freed memory 'hole' is not large enough, so more RAM is requested.
w3wp.exe process keeps on consuming more and more RAM
This is the common case for Analytics Aggregation as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) Last Lucene.NET release was done 5 years ago.
Whereas SOLR is actively being developed.
The sooner you make the switch to SOLR, the easier it will be.

How much RealTime is Elasticsearch, Solr and DSE realtime search?

For the last couple of weeks I have been working with Elasticsearch and Solr, trying to do OLTP processing in real time. Both (especially ES) claim to be real time, but the meaning of "real time" looks rather fuzzy to me.
If we look deeper, both ES and Solr define a refresh rate or a soft-commit rate, after which newly indexed documents become available for search, effectively providing only near-real-time capabilities.
It looks like "real-time search" is either a marketing statement, or the term is made fuzzy by contrasting real-time search with batch or analytical processing.
Am I correct? Or, if I am wrong, is real-time search possible in a typical OLTP system, where every transaction has search visibility of the latest document?

Elasticsearch is a Near Real Time search engine for search. Elasticsearch is Real Time for operations like Create, Update, Delete and Get.
By default, refresh is 1 second. In some use cases, it could appear as real time. For example, I was working for a french gov service and we were producing statistics per day. So for our use case, it was somehow real time from our perspective.
For logs for example, 1 second is enough in most use cases.
You can modify this default value but it comes with a cost.
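
For reference, the refresh interval is changed through the index settings API. The Python sketch below only builds the request; the host, the index name "logs" and the value "5s" are assumptions for the example.

```python
import json
from urllib.request import Request


def build_refresh_interval_request(es_url, index, interval):
    """Build a PUT to /<index>/_settings that changes index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return Request(
        f"{es_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index: refresh every 5 seconds instead of every second.
settings_request = build_refresh_interval_request("http://localhost:9200", "logs", "5s")
```

A longer interval generally means fewer refreshes and less indexing overhead, at the price of documents taking longer to become searchable; a shorter one has the opposite trade-off.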
If you really need real time, then you probably want to use a SQL database.
My 2 cents.

Yes, DSE Search is indeed Near real-time and has not yet achieved the mythical goal of absolute zero latency. But... even traditional Real real-time is not real-time once you factor in the time to do the actual database update, plus the fact that a lot of traditional database updates are batch-oriented, or even if the actual update operation is not batched, there is likely to be some human process that delays the start of the database update from the original source of a data change.
Also keep in mind that the latency of a database update needs to include maintaining the required (tunable) consistency for replicating data updates in the cluster.
Rather than push you back towards SQL if you want real-time, I would challenge you to fully justify the true latency requirements of the app. For example, with complex distributed applications you need to be prepared for occasional resource outages, such as network delays, so that it is usually much better to design a modern distributed application to be a lot more flexible and asynchronous than a traditional, synchronous, fragile (think HealthCare.gov) app architecture that improperly depends on a perception of zero-latency distributed operations.
Finally, we are working on enhancements to reduce the actual latency of database updates, coupled with ongoing improvements in hardware performance that further shrink the update latency window.
But ultimately, all computing real-time measures will have some non-zero latency and modern distributed apps must be designed for at least some degree of decoupling between database updates and absolute dependency on those updates.
Worst case scenario, apps that need to synchronize with database updates may need to implement a polling strategy to wait for the update to complete.

ElasticSearch has real time features for CRUD operations. On GET operations, it checks the Transaction log, to look for any uncommitted changes and return the most relevant document.
The Percolator feature enables realtime in search queries as well. It allows you to register queries (percolation), that will be used at indexing time to return matching documents to those predefined queries.
This workflow looks like this:
Register specific query (percolation) in Elasticsearch
Index new content (passing a flag to trigger percolation)
The response to the indexing operation will contain the matched percolations
A very good blog with live example that explains the Percolator concept:
http://blog.qbox.io/elasticsesarch-percolator
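
To make the workflow concrete, here is a toy in-memory model of percolation in Python. It is not the Elasticsearch API: the Percolator class, the query names and the predicates are invented purely to show the "register queries first, match documents as they arrive" inversion.

```python
class Percolator:
    """Toy model: stored queries are matched against each new document."""

    def __init__(self):
        self.queries = {}

    def register(self, name, predicate):
        # Step 1: register a query before any matching document exists.
        self.queries[name] = predicate

    def index(self, document):
        # Steps 2 and 3: "index" the document and report which
        # registered queries it matches.
        return sorted(
            name for name, predicate in self.queries.items() if predicate(document)
        )


percolator = Percolator()
percolator.register("solr-mentions", lambda doc: "solr" in doc["body"].lower())
percolator.register("long-posts", lambda doc: len(doc["body"]) > 40)

matches = percolator.index({"body": "Solr soft commits are cheap"})
```

In the real feature the registered queries are ordinary search queries rather than Python functions, and the exact request format differs between Elasticsearch versions.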

Handling large number of ids in Solr

I need to perform an "online" search in Solr, i.e. a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the IDs of online users in a table, and I send all online user IDs in the Solr request, like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the list of IDs becomes large, Solr takes too much time to resolve it, and we need to transfer a large request over the network.
One solution could be to use a join in Solr, but the online data changes regularly and I can't index data that often (say every 5-10 minutes; it should be at least an hour).
The other solution I can think of is firing this query internally from Solr based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.

With Solr4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record, and just have &fq=online:true on your query. That reduces the overhead involved in sending 5000 IDs over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set the commitWithin on the update. It's worth a shot, anyway.
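
A small Python sketch of what those two requests might carry. It assumes Solr 4 atomic updates are usable for the "online" field (which has its own requirements, such as stored fields); the core name "users", the field names and the 2000 ms window are made up, and nothing is actually sent.

```python
import json
from urllib.parse import urlencode


def online_status_update(user_id, online):
    """Atomic-update payload that sets only the 'online' field of one user."""
    return [{"id": user_id, "online": {"set": online}}]


def update_url(solr_url, core, commit_within_ms):
    """Update endpoint with commitWithin, so Solr batches the commits itself."""
    return f"{solr_url}/{core}/update?" + urlencode({"commitWithin": commit_within_ms})


def search_params(criteria):
    """Query string that restricts any criteria to users currently online."""
    return urlencode({"q": criteria, "fq": "online:true"})


login_payload = json.dumps(online_status_update("user-17", True))
login_url = update_url("http://localhost:8983/solr", "users", 2000)
query_string = search_params("city:Paris")
```

If atomic updates are not an option, the same effect can be had by re-sending the whole user document with the flag changed.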

We worked around this issue by implementing Sharding of the data.
Basically, without going heavily into code detail:
Write your own indexing code
use consistent hashing to decide which ID goes to which Solr server
index each user data to the relevant shard (it can be a several machines)
make sure you have redundancy
Query Solr shards
Do sharded queries in Solr using the shards parameter
Start an EmbeddedSolr and use it to do a sharded query
Solr will query all the shards and merge the results, it also provides timeouts if you need to limit the query time for each shard
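
The "consistent hashing" step can be sketched in Python as a hash ring with virtual nodes. This is a generic illustration, not the code used in the project described here; the server names and the number of virtual nodes are arbitrary.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping document IDs to Solr servers."""

    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, doc_id):
        # First ring point clockwise from the ID's hash, wrapping around.
        index = bisect.bisect(self._points, self._hash(str(doc_id))) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["solr-a", "solr-b"])
bigger_ring = HashRing(["solr-a", "solr-b", "solr-c"])

# IDs whose owner changes after adding a server.
moved = [i for i in range(1000) if ring.server_for(i) != bigger_ring.server_for(i)]
```

The point of the ring over a plain hash(id) % n is that adding a server relocates only the IDs that now belong to the new server, instead of reshuffling almost everything.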
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited for searches on indexes that are constantly changing, and if you mainly search by IDs then a search engine is not needed.
For our project we basically implemented all the index building, load balancing and query engine ourselves and use Solr mostly as storage. But we started using Solr when sharding was flaky and not performant; I am not sure what the state of it is today.
Last note, if I was building this system today from scratch without all the work we did over the past 4 years I would advise using a cache to store all the users that are currently online (say memcached or redis) and at request time I would simply iterate over all of them and filter out according to the criteria. The filtering by criteria can be cached independently and updated incrementally, also iterating over 5000 records is not necessarily very time consuming if the matching logic is very simple.
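
The cache-based alternative could be as simple as the Python sketch below. A set and a dict stand in for the cache (memcached or Redis) and the user store; the profile fields and the criteria function are invented for the example.

```python
def online_matches(online_ids, profiles, criteria):
    """Iterate over the currently online users and keep those whose
    profile satisfies the criteria."""
    return sorted(
        user_id
        for user_id in online_ids
        if user_id in profiles and criteria(profiles[user_id])
    )


# Stand-ins for the cached "online" set and the user records.
online_ids = {"u1", "u3", "u4"}
profiles = {
    "u1": {"city": "Paris", "age": 31},
    "u2": {"city": "Paris", "age": 27},
    "u3": {"city": "Berlin", "age": 45},
    "u4": {"city": "Paris", "age": 52},
}

parisians_online = online_matches(online_ids, profiles, lambda p: p["city"] == "Paris")
```

With a few thousand online users and a cheap predicate, a linear scan like this is usually fast enough, which is the point being made above.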

Any robust solution will involve bringing your data close to SOLR (in batch) and using it internally, NOT sending a very large request at search time, which needs to be low-latency.
You should develop your own filter; The filter will cache the online users data once in a while (say, every minute). If the data changes VERY frequently, consider implementing PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/

"one solution can be use of join in solr but online data change regularly and i cant index data everytime(say 5-10 min, it should be at-least an hr)"
I think you could very well use Solr joins, but after a little bit of improvisation.
The solution I propose is as follows:
You can have 2 Indexes (Solr Cores)
1. Primary Index (The one you have now)
2. Secondary Index with only two fields , "ID" and "IS_ONLINE"
You could now update the Secondary Index frequently (in the order of seconds) and keep it in sync with the table you have, for storing online users.
NOTE: This secondary Index even if updated frequently, would not degrade any performance provided we do the necessary tweaks like usage of appropriate queries during delta-import, etc.
You could now perform a Solr join on the ID field on these two Indexes to achieve what you want. Here is the link on how to perform Solr Joins between Indexes/ Solr Cores.
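
A minimal Python sketch of what the joined query might look like, using Solr's {!join} query parser with fromIndex. The core name "online_status" and the primary core's key field "id" are assumptions; ID and IS_ONLINE are the secondary-index fields named above. Cross-core joins also have their own constraints (for example, both cores living in the same Solr instance), so treat this as a starting point.

```python
from urllib.parse import parse_qs, urlencode


def online_join_params(criteria, from_core="online_status"):
    """Query string: search the primary core by `criteria`, filtered to
    documents whose id appears as an online ID in the secondary core."""
    join = f"{{!join from=ID to=id fromIndex={from_core}}}IS_ONLINE:true"
    return urlencode({"q": criteria, "fq": join})


params = online_join_params("city:Paris")
decoded = parse_qs(params)  # decode again to show the raw filter query
```

The filter stays the same size no matter how many users are online, which is what removes the 5000-ID request from the search path.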
