Index performance for a large number of documents in Lucene/Solr

I have been using PostgreSQL full text search to match a list of articles against documents containing a particular word. PostgreSQL's full text support made searches faster at first, but performance degraded as the number of rows grew and searches slowed down as the articles increased.
I am just starting to implement search with Solr. Going through various resources on the net, I found that it can do much more than searching and gives me finer control over the results.
Solr seems to use an inverted index. Wouldn't performance degrade over time if many documents (over 1 million) contain the search term being queried by the user? Also, if I am limiting the results via pagination for the searched term, wouldn't it need to load all of the 1 million+ documents while calculating their scores and only then limit the results, which would hurt performance when many documents contain the same word?
Is there a way to sort the index by score in the first place, which would avoid loading the documents later?

Lucene has been designed to solve all the problems you mentioned. Apart from the inverted index, there are also postings lists, docvalues, the separation of indexed and stored values, and so on.
And then you have Solr on top of that to add even more goodies.
And 1 million documents is an introductory-level problem for Lucene/Solr. It is routinely tested by indexing a Wikipedia dump.
If you feel you actually need to understand how it works, rather than just be reassured about this, check books on Lucene, including the old ones. Also check Lucene Javadocs - they often have additional information.
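To make the top-N point concrete: when you search, Lucene walks the postings for the query term and keeps only a small priority queue of the best-scoring hits; it never loads all 1 million+ matching documents, and stored fields are fetched only for the page you actually display. A minimal sketch against the raw Lucene Java API (the index path and field names below are placeholders, not something from your setup):

```java
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TopNSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {   // placeholder path
            IndexSearcher searcher = new IndexSearcher(reader);

            // Scoring iterates the postings list for "body:solr", but only a
            // bounded priority queue of the 10 best hits is kept in memory.
            TopDocs top = searcher.search(new TermQuery(new Term("body", "solr")), 10);

            // Stored fields are loaded only for the handful of hits on this page.
            for (ScoreDoc hit : top.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(hit.score + "  " + doc.get("title"));
            }
        }
    }
}
```

Pagination (start/rows in Solr) works the same way: the priority queue is sized start+rows, not the total number of matches.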

Related

Solr performance issues

I'm using Solr to handle search on a very large set of documents, and I'm starting to have performance issues with complex queries that use facets and filters.
This is a Solr query used to get some data:
solr full request : http://host/solr/discovery/select?q=&fq=domain%3Acom+OR+host%3Acom+OR+public_suffix%3Acom&fq=crawl_date%3A%5B2000-01-01T00%3A00%3A00Z+TO+2000-12-31T23%3A59%3A59Z%5D&fq=%7B%21tag%3Dcrawl_year%7Dcrawl_year%3A%282000%29&fq=%7B%21tag%3Dpublic_suffix%7Dpublic_suffix%3A%28com%29&start=0&rows=10&sort=score+desc&fl=%2Cscore&hl=true&hl.fragsize=200&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E&hl.snippets=10&hl.fl=content&hl.mergeContiguous=false&hl.maxAnalyzedChars=100000&hl.usePhraseHighlighter=true&facet=true&facet.mincount=1&facet.limit=11&facet.field=%7B%21ex%3Dcrawl_year%7Dcrawl_year&facet.field=%7B%21ex%3Ddomain%7Ddomain&facet.field=%7B%21ex%3Dpublic_suffix%7Dpublic_suffix&facet.field=%7B%21ex%3Dcontent_language%7Dcontent_language&facet.field=%7B%21ex%3Dcontent_type_norm%7Dcontent_type_norm&shards=shard1"
When this query is run locally against about 50,000 documents, it takes about 10 seconds, but when I try it on the host with 200 million documents it takes about 4 minutes. I know it will naturally take much longer on the host, but I wonder if anyone has had the same issue and was able to get faster results, knowing that I'm using two shards.
Waiting for your responses.
You're doing a number of complicated things at once: date ranges, highlighting, faceting, and distributed search (non-SolrCloud, by the looks of it).
Still, 10 seconds for a 50k-doc index seems really slow to me. Try selectively removing aspects of your search to see if you can isolate which part is slowing things down and then focus on that. I'd expect that you can find simpler queries that are fast, even if they match a lot of documents.
Either way, check out https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
There are a lot of useful tips there, but the #1 performance issue is usually not having enough memory, especially for large indexes.
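One hedged way to do that elimination from code, using SolrJ against the collection URL from your request (the fields and parameters below just mirror the original query; adjust to your schema). Solr's debug=timing output is another way to see per-component times.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class IsolateSlowness {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://host/solr/discovery").build()) {
            // Start from the bare query: just the filters, no facets, no highlighting.
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("domain:com OR host:com OR public_suffix:com");
            q.addFilterQuery("crawl_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z]");
            q.setRows(10);
            System.out.println("filters only: " + solr.query(q).getQTime() + " ms");

            // Re-add one feature at a time and compare QTime to find the expensive part.
            q.setFacet(true);
            q.addFacetField("crawl_year", "domain", "public_suffix");
            System.out.println("with facets: " + solr.query(q).getQTime() + " ms");

            q.setHighlight(true);
            q.addHighlightField("content");
            System.out.println("with highlighting: " + solr.query(q).getQTime() + " ms");
        }
    }
}
```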
Check how many segments you have in Solr: the more segments, the worse the query response time.
If you have not set the merge factor in your solrconfig.xml, you will probably end up with close to 40 segments, which is bad for query response time.
Set your merge factor accordingly. If no new documents are to be added, set it to 2.
mergeFactor
The mergeFactor roughly determines the number of segments.
The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.
For example, if you set mergeFactor to 10, a new segment will be created on disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments at each index size.
These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section); a raw-Lucene equivalent is sketched after the tradeoffs below.
mergeFactor Tradeoffs
High value merge factor (e.g., 25):
Pro: Generally improves indexing speed
Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
Pro: Smaller number of index files, which speeds up searching.
Con: More segment merges slow down indexing.
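For reference, here is roughly what the same advice looks like against the raw Lucene API, assuming a log-based merge policy (Solr's mergeFactor maps to LogMergePolicy's merge factor; newer Lucene/Solr versions default to TieredMergePolicy, where the comparable knobs are segmentsPerTier and maxMergeAtOnce). The index path is a placeholder:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class MergeTuning {
    public static void main(String[] args) throws Exception {
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(10); // the same knob <mergeFactor> controls in solrconfig.xml

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setMergePolicy(mergePolicy);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), config)) {  // placeholder path
            // ... add or update documents here ...

            // If the index will receive no more documents, collapsing it down to a
            // single segment gives the best query response time (Solr's <optimize/>).
            writer.forceMerge(1);
        }
    }
}
```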

Best Practice to Combine both DB and Lucene Search

I am developing an advanced search engine using .NET where users can build their query based on several fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To Modified Date
Owner
Location
Other Metadata
I am using Lucene to index the document content and the corresponding IDs. However, the other metadata resides in an MS SQL DB (to avoid enlarging the index and having to update the index on every modification of the metadata).
How can I perform the search?
When a user searches for a term:
Narrow down the search results according to the criteria selected by the user by looking them up in the SQL DB.
Return the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned from the advanced search web service.
Then get the relevant metadata for the document IDs (returned from Lucene) by looking again in the DB.
As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to be displayed in the grid.
Questions:
How can I overcome this situation? I thought of searching Lucene first, but this has a drawback once the indexed documents reach 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)
Another issue is passing the IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs, and what is the alternative solution?
I welcome any idea, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing a large BooleanQuery containing 100,000s of ID clauses in your app (you'll need to first override the default limit of 1024 clauses to even measure this effect; see the sketch after this list)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually about 10-25 records)
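For illustration, the step-2 overhead looks roughly like this in Lucene terms (Java syntax shown; Lucene.NET mirrors it closely, and exact class names vary by version - classic versions cap clauses via a static on BooleanQuery, newer ones via IndexSearcher.setMaxClauseCount). The field names and the helper are hypothetical:

```java
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class IdBoundedQuery {
    // Hypothetical helper: builds the "bounded" full-text query from IDs returned by SQL.
    static BooleanQuery build(String keyword, List<String> allowedIds) {
        // The default clause limit is 1024; with 100,000s of IDs it must be raised
        // before the query can even be built and run.
        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);

        BooleanQuery.Builder ids = new BooleanQuery.Builder();
        for (String id : allowedIds) {
            ids.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
        }

        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("content", keyword)), BooleanClause.Occur.MUST)
                .add(ids.build(), BooleanClause.Occur.FILTER)
                .build();
    }
}
```

Every one of those ID clauses has to be looked up in the term dictionary and OR'd together at query time, which is exactly the cost the approach below avoids.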
There are two assumptions you may be making that are worth reconsidering:
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index mostly has to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each Lucene document will internally reference its owner through a symbol-table lookup (4-byte integer * 2MM docs + 500 strings = < 8MB additional).
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: With your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
So, to summarize the best practice (sketched in code after this list):
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
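A minimal sketch of that flow, indexing the filterable metadata next to the content (Java Lucene syntax; Lucene.NET mirrors these classes closely, and the field names and date encoding are illustrative assumptions):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MetadataInIndex {
    // Index content plus the metadata you filter by; MS-SQL stays the system of record.
    static Document toLuceneDoc(String id, String content, String owner, long modifiedEpochDay) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));        // returned to look up details in SQL
        doc.add(new TextField("content", content, Field.Store.NO)); // analyzed, not stored
        doc.add(new StringField("owner", owner, Field.Store.NO));   // exact-match filter
        doc.add(new LongPoint("modified", modifiedEpochDay));       // numeric range filter
        return doc;
    }

    // One query does the keyword match and all of the metadata filtering in a single pass.
    static Query searchQuery(String keyword, String owner, long fromDay, long toDay) {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("content", keyword)), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("owner", owner)), BooleanClause.Occur.FILTER)
                .add(LongPoint.newRangeQuery("modified", fromDay, toDay), BooleanClause.Occur.FILTER)
                .build();
    }
}
```

The query returns only IDs, so MS-SQL is touched once per request, to materialize the page being displayed.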
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.

Solr - Use of Cache with Billions of Records

We have Solr storing 3 billion records across 23 machines, each machine has 4 shards, and only 230 million documents have a field like aliasName. Currently queryCache, documentCache, and filterCache are disabled.
Problem: a query like (q=aliasName:[* TO *] AND firstname:ash AND lastName:Coburn) returns the matching documents in 4.3 seconds. Basically, we want only those matching firstname and lastname records where aliasName is not empty.
I am thinking of enabling the filter query fq=aliasName:[* TO *], but I am not sure it will make things faster, since firstname and lastname differ in almost every query. How much memory should we allocate for the filter cache to perform well? It should not impact the other existing queries like q=firstname:ash AND lastName:something.
Please don't worry about I/O operations, as we are using flash drives.
I'd really appreciate a reply if you have worked on a similar issue and can suggest the best solution.
According to the Solr documentation:
filterCache
This cache stores unordered sets of document IDs that match the key (usually queries)
URL: https://wiki.apache.org/solr/SolrCaching#filterCache
So I think it comes down to two things:
What is the percentage of documents that have aliasName populated? In my opinion, if most documents have this field populated, then the filter cache might be useless. But if only a small percentage of documents have it, the filter cache will have a huge performance impact and use less memory.
What kind of ID are you using? I assume the documentation refers to Lucene document IDs rather than Solr IDs, but maybe smaller Solr IDs could result in a smaller cache size as well (I am not sure).
In the end you will have to run a trial and see how it goes; maybe try it on a couple of nodes first and see if there is a performance improvement.
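If you do try it, the change is just splitting the request so the reusable, user-independent restriction sits in fq, where its bitset can be cached and shared across all name searches, while the per-request part stays in q. A sketch with SolrJ (client setup omitted; field names taken from your example):

```java
import org.apache.solr.client.solrj.SolrQuery;

public class AliasNameFilter {
    public static void main(String[] args) {
        // q carries the part that changes with every request ...
        SolrQuery q = new SolrQuery("firstname:ash AND lastName:Coburn");
        // ... while fq carries the restriction that is identical for every request,
        // so a single filterCache entry can serve all of them once it is warmed.
        q.addFilterQuery("aliasName:[* TO *]");
        q.setRows(10);
        System.out.println(q); // prints the q/fq/rows parameters that will be sent
    }
}
```

Since only 230 million of the 3 billion documents have aliasName, this is the "small percentage" case the answer above says benefits most from caching the filter.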

Cloudant Search Index Query Limit

Why are results from search index queries limited to 200 rows, whereas standard view queries seem to have no limit?
Fundamentally because we hold a 200-item array in memory as we stream over all hits, preserving the top 200 scoring hits. A standard view just streams all rows between a start and end point. The intent of a search is typically to find the needle in a haystack, so you don't generally fetch thousands of results (compare with Google: who clicks through to page 500?). If you don't find what you want, you refine your search and look again.
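Not Cloudant's actual code, but a small sketch of why a fixed top-K bound keeps memory flat while streaming over any number of hits (a min-heap of at most `limit` entries, here just scores):

```java
import java.util.PriorityQueue;

public class TopKCollector {
    // Illustrative only: keeps the best `limit` scores while streaming over every
    // hit, so memory stays O(limit) no matter how many documents match.
    static double[] topScores(Iterable<Double> scoresOfAllHits, int limit) {
        PriorityQueue<Double> best = new PriorityQueue<>(limit); // min-heap: weakest kept score on top
        for (double score : scoresOfAllHits) {
            if (best.size() < limit) {
                best.add(score);
            } else if (score > best.peek()) {
                best.poll();        // evict the weakest of the current top `limit`
                best.add(score);
            }
        }
        return best.stream().mapToDouble(Double::doubleValue).toArray();
    }
}
```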
There are cases when retrieving all matches makes sense (and we can stream this in the order we find them, so there's no RAM issue). That's a feature we can (and should) add, but it's not currently available.
It's also worth noting that the _view API (aka "mapreduce") is fundamentally different from search because of the ordering of results on disk. Materialized views are persisted in CouchDB B+ trees, so they are essentially sorted by key. That allows for efficient range queries (start/end key) and makes limit/paging trivial. However, it also means that you have to order the view rows on disk, which restricts the types of boolean queries you can perform against the materialized views. That's where search helps, but Bob (aka "The Lucene Expert") notes the limitations.

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 txns per day), each record around 100 KB to 200 KB in size. The format of the data is going to be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use-cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on few common attributes but there may be some arbitrary queries for certain operational use cases.
I researched ways to search non-relational data and so far found two approaches being used by most technologies:
1) Build an index (Solr/CloudSearch, etc.)
2) Run a Map Reduce job (Hive/Hbase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB, something like an Oracle query; it is okay to be slow, but when we get the data, we should have everything that matched the query returned, or at least be told that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure if it is reliable - the index may be stale. (Is there a way to know the index was stale when we run the search so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3? Something similar to the indexes on Oracle DBs.)
The MR job seems to always be reliable (as it fetches data from S3 for each query); is that assumption right? Is there any way to speed up such a query - maybe partition the data in S3 and run multiple MR jobs based on each partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker and it's still doing fine a year later.
I can't speak to the map reduce software, though.
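For what it's worth, the commit step is just an explicit call after adding documents; until it runs, new documents simply aren't visible yet. A hedged SolrJ sketch (the URL, core name, and fields are placeholders):

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexAndCommit {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "txn-0001");
            doc.addField("status", "COMPLETED");
            solr.add(doc);

            // Nothing is searchable until a commit opens a new searcher,
            // so "stale" only means "added but not yet committed".
            solr.commit();

            // optimize() (forceMerge) is optional and heavy; run it rarely, if at all.
            // solr.optimize();
        }
    }
}
```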
You should think about having one Solr core per week or month, for instance; this way older cores will be read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding; a single core will not be enough forever.
