I'm using Solr to handle search over a very large set of documents, and I'm starting to have performance issues with complex queries that use facets and filters.
This is the Solr query used to fetch the data:
Full Solr request: http://host/solr/discovery/select?q=&fq=domain%3Acom+OR+host%3Acom+OR+public_suffix%3Acom&fq=crawl_date%3A%5B2000-01-01T00%3A00%3A00Z+TO+2000-12-31T23%3A59%3A59Z%5D&fq=%7B%21tag%3Dcrawl_year%7Dcrawl_year%3A%282000%29&fq=%7B%21tag%3Dpublic_suffix%7Dpublic_suffix%3A%28com%29&start=0&rows=10&sort=score+desc&fl=%2Cscore&hl=true&hl.fragsize=200&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E&hl.snippets=10&hl.fl=content&hl.mergeContiguous=false&hl.maxAnalyzedChars=100000&hl.usePhraseHighlighter=true&facet=true&facet.mincount=1&facet.limit=11&facet.field=%7B%21ex%3Dcrawl_year%7Dcrawl_year&facet.field=%7B%21ex%3Ddomain%7Ddomain&facet.field=%7B%21ex%3Dpublic_suffix%7Dpublic_suffix&facet.field=%7B%21ex%3Dcontent_language%7Dcontent_language&facet.field=%7B%21ex%3Dcontent_type_norm%7Dcontent_type_norm&shards=shard1
When this query runs locally against about 50,000 documents, it takes about 10 seconds, but when I try it on the host with 200 million documents it takes about 4 minutes. Naturally I expect it to take longer on the host, but I wonder if anyone has had the same issue and was able to get faster results. Note that I'm using two shards.
Looking forward to your responses.
You're doing a number of complicated things at once: date ranges, highlighting, faceting, and distributed search (non-SolrCloud, by the looks of it).
Still, 10 seconds for a 50k-doc index seems really slow to me. Try selectively removing aspects of your search to see if you can isolate which part is slowing things down and then focus on that. I'd expect that you can find simpler queries that are fast, even if they match a lot of documents.
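For example, a stripped-down variant of the same request (illustrative only - just the year filter, no highlighting, faceting, or second shard):

http://host/solr/discovery/select?q=*:*&fq=crawl_year:2000&rows=10

If that comes back fast, add highlighting, then faceting, then the extra shard back one at a time and watch where the time goes.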
Either way, check out https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
There are a lot of useful tips there, but the #1 performance issue is usually not having enough memory, especially for large indexes.
Check how many segments you have in your Solr index:
the more segments, the worse the query response time.
If you have not set mergeFactor in your solrconfig.xml, you will probably end up with close to 40 segments, which is bad for query response time.
Set your mergeFactor accordingly;
if no new documents are going to be added, set it to 2.
mergeFactor
The mergeFactor roughly determines the number of segments.
The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.
For example, if you set mergeFactor to 10, a new segment will be created on disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments at each index size.
These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section).
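For instance, a minimal sketch assuming the classic pre-Solr-4 solrconfig.xml layout (the values shown are the common defaults, not a recommendation):

<mainIndex>
  <!-- build up to 10 equal-size segments before merging them into one -->
  <mergeFactor>10</mergeFactor>
  <!-- flush a new segment to disk every 1000 buffered documents -->
  <maxBufferedDocs>1000</maxBufferedDocs>
</mainIndex>

Newer Solr versions configure merging via a mergePolicy in the indexConfig section instead, but the tradeoffs below are the same.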
mergeFactor Tradeoffs
High value merge factor (e.g., 25):
Pro: Generally improves indexing speed
Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
Pro: Smaller number of index files, which speeds up searching.
Con: More segment merges slow down indexing.
Related
I'm looking into using Solr for a use case that will require some deep paging: an upper bound of about 100k total results, fetched in pages of 1,000, from a collection of ~10 million records. I quickly discovered why using start & rows is a bad idea for a result set that size, and came across cursorMark in the process. Articles I've found about cursorMark suggest roughly constant-time record access regardless of position in the set, which seems perfect for my case.
The question I have, though: is there any kind of performance impact to going down this route? Is there any difference in memory/CPU usage when using cursorMark to page deep into result sets of 1k, 10k, 100k, or 1 million records, assuming I return 1,000 at a time?
In theory it gets a little bit faster as you page deeper. In reality the difference is so small that you won't notice it.
A standard non-cursor search uses a small priority queue to hold the top-X results. Every match is added to that queue, pushing out poorer matches once the queue is full.
A cursor search also uses a queue of size X, but a match is only added if its sort value falls beyond the previous cursor mark, again pushing out poorer matches once the queue is full. So as you page deeper, there are slightly fewer inserts.
There are some very illustrative graphs of cursor performance at https://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
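For reference, here's a minimal SolrJ sketch of the cursorMark loop. The collection URL and the uniqueKey field name (id) are assumptions, and the Builder-style client is SolrJ 6+ (older versions use HttpSolrServer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
SolrQuery q = new SolrQuery("*:*");
q.setRows(1000);
// cursorMark requires a fully deterministic sort, so include the uniqueKey as a tie-breaker
q.setSort(SolrQuery.SortClause.asc("id"));

String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*" = start of the result set
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solr.query(q);
    // ... process rsp.getResults(), up to 1000 docs per page ...
    String next = rsp.getNextCursorMark();
    done = cursorMark.equals(next); // getting the same mark back means no more results
    cursorMark = next;
}
solr.close();

Because each page is fetched relative to the previous cursor mark, the server never has to score and skip everything before your offset, which is exactly where start=N deep paging burns CPU and memory.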
I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing.
The data in the collection is an amalgamation of more than 1,000 customers' records. The number of documents per customer is around 100,000 on average.
That being said, I'm trying to get a handle on the growing number of deleted documents. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce them to a manageable size.
I have been thinking of splitting the data into multiple cores, one per customer. This would let me manage the smaller collections easily and create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem?
Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million
Thanks
I had a similar sort of issue, with multiple customers and a large amount of indexed data.
I implemented it with version 3.4 by creating a separate core per customer,
i.e. one core per customer. Creating a core is a way of splitting the data, much as we do with sharding:
you are splitting the large indexed data into smaller segments,
so whatever search happens runs against a smaller indexed segment, and the response time is faster.
I have almost 700 cores created as of now and it's running fine for me.
So far I have not faced any issues with managing the cores.
I would suggest going with a combination of cores and sharding.
It will help you achieve the following:
A different configuration for each core, with different behavior, and that will not have an impact on other cores.
You can perform actions like update, load, etc. on each core independently.
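If it helps, cores can also be created programmatically through the CoreAdmin API. A minimal SolrJ sketch (the core name and instanceDir are hypothetical, the instanceDir must already contain a conf/ directory, and the Builder-style client is SolrJ 6+; older versions use HttpSolrServer):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
// creates a core named "customer_123" from an existing instance directory
CoreAdminRequest.createCore("customer_123", "/var/solr/customer_123", solr);
solr.close();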
I have been using PostgreSQL for full-text search, matching a list of articles against documents containing a particular word. The performance degraded as the number of rows grew: PostgreSQL's built-in full-text support made searches faster at first, but they slowed down over time as the articles accumulated.
I am just starting to implement search with Solr. Going through various resources on the net, I found that it can do much more than searching and gives me finer control over my results.
Solr seems to use an inverted index. Wouldn't performance degrade over time if many documents (over 1 million) contain a search term being queried by the user? Also, if I limit the results via pagination, wouldn't Solr need to load all of the 1 million+ documents while calculating their scores and only then limit the results, which would hurt performance when many documents share the same word?
Is there a way to sort the index by score in the first place, which would avoid loading the documents later?
Lucene has been designed to solve all the problems you mention. Apart from the inverted index, there are also postings lists, docvalues, the separation of indexed and stored values, and so on. In particular, scoring does not require loading documents: matches are scored straight from the inverted index, and only the top hits on the page you actually return have their stored fields fetched.
And then you have Solr on top of that to add even more goodies.
And 1 million documents is an introductory-level problem for Lucene/Solr; it is routinely tested by indexing a full Wikipedia dump.
If you feel you actually need to understand how it works, rather than just be reassured, check the books on Lucene, including the old ones. Also check the Lucene Javadocs - they often carry additional information.
We have had a 3-node DSE Solr cluster running and recently added a new core. After about a week of running fine, all of the Solr nodes are now OOMing: they fill up both the JVM heap (set at 8 GB) and the system memory, and they are also constantly flushing memtables to disk.
The cluster is DSE 3.2.5 with RF=3.
Here is the solrconfig from the new core:
http://pastie.org/8973780
How big is your Solr index relative to the amount of system memory available for the OS to cache file system pages? Basically, your Solr index needs to fit in the OS file system cache (the amount of system memory left after DSE has started but before it has processed any significant amount of data).
Also, how many Solr documents (Cassandra rows) and how many fields (Cassandra columns) are populated on each node? There is no hard limit, but 40 to 100 million documents per node is a good guideline for an upper limit.
And how much system memory and how much JVM heap are available if you restart DSE, before you start putting load on the server?
For RF=N, where N is the total number of nodes in the cluster (or at least in the search data center), all of the data will be stored on every node, which is okay for smaller datasets but not for larger ones.
More generally, for RF=n each node will hold X/N*n rows or documents, where X is the total number of rows or documents across all column families in the data center. X/N*n is the number you should try to keep below 100 million. That's not a hard limit - some datasets and hardware might be able to handle substantially more, and some might not even be able to hold that much. You'll have to discover the number that works best for your own app, but the 40 to 100 million range is a good start.
In short, the safest estimate is to keep X/N*n under 40 million for Solr nodes; 100 million may be fine for some datasets and beefier hardware.
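To make the arithmetic concrete (numbers purely illustrative): with X = 120 million documents in a 3-node search data center, RF=3 puts all 120 million on every node (120M/3*3), far above the guideline, while RF=1 puts 40 million per node (120M/3*1), right at the safe end of the range.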
As far as tuning goes, one common source of heavy heap use is heavy use of Solr facets and filter queries.
One technique is to use DocValues fields for facets, since DocValues can be stored off-heap.
Filter queries can be marked as cache=false to save heap memory.
Also, the various Solr caches can be reduced in size or even set to zero. That's in solrconfig.xml.
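A sketch of those three knobs (the field and cache values here are illustrative; the first snippet belongs in schema.xml, the last in solrconfig.xml):

<!-- schema.xml: facet on a docValues field so its values can live off-heap -->
<field name="category" type="string" indexed="true" stored="false" docValues="true"/>

<!-- per request: fq={!cache=false}crawl_year:2000 skips the filterCache for that filter -->

<!-- solrconfig.xml: shrink the filterCache, or set size="0" to disable it -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>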
I have a Solr index which hosts 4 million documents and whose size is 65 GB. When I browse my index using the web UI, everything is fast. But my real queries, which are made of about 2,000 terms (all from the same field), are way too slow.
To increase the speed of my Solr queries, I first copied the index into RAM, which makes things much faster, but I still need it to be faster.
I also created a multi-threaded version of my query using Java 7's RecursiveTask, in which I basically halve the number of query terms until it falls below a threshold, then aggregate the results of the sub-queries to build the final response. It makes things faster, but it creates other kinds of problems.
Here is the code I use for the multi-term query:
MultiPhraseQuery query = new MultiPhraseQuery(); // Lucene 4.x API; Lucene 6+ uses MultiPhraseQuery.Builder
query.add(queryTerms); // queryTerms is a Term[]; all terms share one position (OR semantics at that position)
TopDocs tops = searcher.search(query, rows); // searcher is an IndexSearcher, rows the number of hits to return
ScoreDoc[] scoreDocs = tops.scoreDocs;
Does anyone have some suggestions to improve the speed?
Thank you
I believe that 2,000 terms is too many for a single query; you may have to refactor your design.
That said, one way to scale is to use SolrCloud with many replicas in order to improve the query response time of your index.
Also, do not forget the stored="false" option on the field definition, which can make the index much smaller.
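For example, a schema.xml field definition along these lines (the field name and type here are hypothetical) keeps the field searchable without storing its raw value:

<field name="content" type="text_general" indexed="true" stored="false"/>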