Solr - Use of Cache with Billions of Records

We have Solr storing 3 billion records across 23 machines, with 4 shards per machine, and only 230 million of those documents have a value in a field such as aliasName. Currently the queryCache, documentCache, and filterCache are all disabled.
Problem: a query like (q=aliasName:[* TO *] AND firstname:ash AND lastName:Coburn) returns the matching documents in 4.3 seconds. Basically, we want only those firstname/lastName matches where aliasName is not empty.
I am thinking of enabling a filter query, fq=aliasName:[* TO *], but I'm not sure it will make things faster, since firstname and lastName are different in almost every query. How much memory should we allocate for the filter cache to perform well? It should not impact other existing queries like q=firstname:ash AND lastName:something.
Please don't worry about I/O operations, as we are using flash drives.
I'd really appreciate a reply if you have worked on a similar issue and can suggest the best solution.

According to the Solr documentation:
filterCache
This cache stores unordered sets of document IDs that match the key (usually queries)
URL: https://wiki.apache.org/solr/SolrCaching#filterCache
So I think it comes down to two things:
What percentage of your documents have a populated aliasName? In my opinion, if most documents have this field populated, the filter cache might be useless. But if it is only a small percentage of documents, the filter cache will have a huge performance impact while using less memory.
What kind of ID are you using? I assume the documentation refers to Lucene document IDs rather than Solr IDs, but maybe smaller Solr IDs would result in a smaller cache size as well (I am not sure).
In the end you will have to run a trial and see how it goes; maybe try it on a couple of nodes first and see whether there is a performance improvement.
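For reference, a minimal sketch of what that trial could look like; the cache sizes below are placeholders to tune, not recommendations. The query is rewritten so the stable aliasName clause moves into an fq, while the always-changing name clauses stay in q:

q=firstname:ash AND lastName:Coburn&fq=aliasName:[* TO *]

<!-- solrconfig.xml, <query> section: enable the filterCache so the doc set
     produced by fq=aliasName:[* TO *] is computed once and then reused -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>

Since the fq is identical for every such query while q keeps changing, only that one cached entry is needed for it, no matter how the names vary.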

Related

Best Practice to Combine both DB and Lucene Search

I am developing an advanced search engine using .NET where users can build their query based on several fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To Modified Date
Owner
Location
Other Metadata
I am using Lucene to index the document content and the corresponding IDs. The other metadata, however, resides in an MS SQL DB (to avoid enlarging the index and having to update it on every modification of the metadata).
How can I perform the search?
When a user searches for a term:
Narrow down the search results according to the criteria selected by the user, by looking them up in the SQL DB.
Return the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned from the advanced-search web service.
Then get the relevant metadata for the document IDs (returned from Lucene) by looking in the DB again.
As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to display in the grid.
Questions:
How can I overcome this situation? I thought about searching Lucene first, but this has a drawback once the number of indexed documents reaches 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)
Another issue is passing IDs to the Lucene search service: how effective is passing hundreds of thousands of IDs, and what is the alternative solution?
I welcome any ideas, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating and executing a large BooleanQuery containing 100,000s of ID clauses in your app (you'll first need to override the default limit of 1024 clauses just to measure this effect)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually about 10-25 records)
There are two assumptions you may be making that would be worth reconsidering:
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: this is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to the text search.
Also, the size of your index mostly depends on the cardinality of each field. For example, if you have only 500 unique owner names, only those 500 strings will be stored, and each Lucene document will internally reference its owner through a symbol-table lookup (4-byte integer * 2MM docs + 500 strings = less than 8MB additional).
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: with your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene versus querying MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
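As a rough illustration only (the field names and values below are made up for the example, and assume the dates are indexed in a sortable form), the whole filtered search can be expressed as a single Lucene query instead of a DB round trip followed by a huge ID list:

content:"quarterly report" AND owner:jsmith AND modified:[20130101 TO 20131231]

Lucene returns just the matching IDs, and MS-SQL is then consulted once, only for the 10-25 rows actually being displayed.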
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing your application's memory.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.

Index performance for large # documents in Lucene

I have been using PostgreSQL for full-text search, matching a list of articles against documents containing a particular word. I had been using PostgreSQL's built-in full-text search support, which made searches faster at first, but performance degraded as the number of articles grew.
I am just starting to implement Solr for searching. Going through various resources on the net, I found that it can do much more than searching and gives me finer control over the results.
Solr seems to use an inverted index. Wouldn't performance degrade over time if many documents (over 1 million) contain the search term being queried by the user? Also, if I am limiting the results via pagination, wouldn't it need to load all of the 1 million+ matching documents first to calculate their scores and only then limit the results, which would hurt performance when many documents contain the same word?
Is there a way to sort the index by score in the first place, which would avoid loading the documents later?
Lucene has been designed to solve all the problems you mention. Apart from the inverted index, there are also postings lists, doc values, the separation of indexed and stored values, and so on.
And then you have Solr on top of that to add even more goodies.
And 1 million documents is an entry-level problem for Lucene/Solr; it is routinely tested by indexing a Wikipedia dump.
If you feel you actually need to understand how it works, rather than just be reassured, check the books on Lucene, including the older ones. Also check the Lucene Javadocs - they often contain additional information.

Solr: How can I improve the performance of a filter query (for a specific value, not a range query) on a numeric field?

I have an index with something like 60-100 million documents. We almost always query these documents (in addition to other filter queries, field queries, etc.) on a foreign-key ID, to scope the query to a specific parent object.
So, for example: /solr/q=*:*&fq=parent_id_s:42
Yes, that _s means this is currently a solr.StrField field type.
My question is: should I change this to a TrieIntField? Would that speed up performance? And if so, what would be the ideal precisionStep and positionIncrementGap values, given that I know that I will always be querying for a single specific value, and that the cardinality of that parent_id is in the 10,000-100,000 (maximum) order of magnitude?
Edit for additional detail (from a comment on an answer below):
The way our system is used, it turns out that we end up using the same fq for many queries in a row. When the cache is populated, the system runs blazing fast, but when the cache gets dumped because of a commit, this query (even a test case with ONLY this fq) can take up to 20 seconds. So I'm trying to figure out how to speed up that initial query that populates the cache.
Second edit:
I apologize; after further testing it turns out that the poor performance above only happens when facet fields are also being returned (e.g. &facet=true&facet.field=resolved_facet_facet). With a dozen or so of these fields, the query sometimes takes 20-30 seconds, but only with a fresh searcher; it's instant once the cache is populated. So maybe my problem is the facet fields, not the parent_id field.
A TrieIntField with a precisionStep is optimized for range queries. As you're only searching for a specific value, your current field type is already a good fit.
Have you looked at autowarming queries? These run whenever a new IndexSearcher is created (on startup, or after an index commit, for example), so that it becomes available with some of its caches already in place. Depending on your requirements, you can also set the useColdSearcher flag to false, so that a new searcher is only used once its caches have been warmed. For more details have a look here: https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig#QuerySettingsinSolrConfig-Query-RelatedListeners
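A minimal sketch of such a warming listener in solrconfig.xml; the fq and facet field are taken from the question above, and which queries are actually worth warming (and whether warming one representative parent_id helps at all) is something to verify for your setup:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- pre-populate filter and field/facet caches right after each commit -->
    <lst>
      <str name="q">*:*</str>
      <str name="fq">parent_id_s:42</str>
      <str name="facet">true</str>
      <str name="facet.field">resolved_facet_facet</str>
    </lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>

The filterCache's own autowarmCount also re-executes the most recently used filter queries against a new searcher, which can cover the changing parent_id values without listing them explicitly.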
It sounds like you probably aren't getting much benefit from the caching of result sets from the filter. One of the more important features of filters is that they cache their result sets: the first run of a given filter takes longer while the cache entry is built, but subsequent uses of the same filter are much faster.
With the cardinality you've described, you are probably just wasting cycles, and polluting the filter cache, by building cache entries that are never reused. You can turn off caching for a filter query like this:
/solr/q=*:*&fq={!cache=false}parent_id_s:42
I also think the filter query doesn't help in this case.
q=parent_id_s:42 queries the index for the term parent_id_s:42 and gets back a set of document IDs. Since the postings (document IDs) are indexed by term, and assuming you have enough memory to hold them (either in the JVM heap or the OS cache), this lookup should be pretty fast.
Assuming the filter cache is already warmed up and you have a 100% hit ratio, which of the following is faster?
q=parent_id_s:42
fq=parent_id_s:42
I think they are very close, but I could be wrong. Does anyone know? Has anyone run a performance test for this?

Max length for Solr Delete by query

I am working on improving our reindexing process. We have custom logic to figure out which documents have been modified and need to be reindexed, so at the end I can generate a delete query along the lines of "delete all documents where fieldId is in this list".
So instead of deleting and re-adding 50k documents every time, we only re-index a tiny percentage of them.
Now I am thinking about the edge case where our list of fieldIds is extremely large, say 30-40,000 IDs. In that case, is there an upper limit on request length that I should worry about? Or would it cause negative performance effects and make the situation worse instead of better?
I read some articles on Google advising to make it a POST request instead.
I am using the latest SolrNet build, which is built on Solr 4.0.
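For reference, the POST form I read about puts the query in the request body rather than in the URL, so URL-length limits would no longer apply; something like this (the fieldId values are placeholders):

POST /solr/update?commit=true
Content-Type: text/xml

<delete>
  <query>fieldId:(101 OR 102 OR 103)</query>
</delete>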
I would revisit that logic, because deleting the documents and then re-indexing them is not the best solution. Firstly, it is an expensive operation; secondly, your index will be empty or incomplete for a while until you re-index the documents, which means that if you query the index in the middle of the operation you could get zero or partial results.
I would advise simply indexing again with the same document ID (the uniqueKey defined in Solr's schema.xml). Solr is smart enough to overwrite a document that is indexed with the same ID, so you don't have to worry about the hassle of deleting old documents. You might also 'optimize' the index from time to time to physically get rid of 'deleted' documents.
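A minimal sketch of that overwrite-by-uniqueKey update (the field names and values are placeholders; "id" stands in for whatever your schema's uniqueKey is):

POST /solr/update?commit=true
Content-Type: text/xml

<add>
  <doc>
    <!-- same uniqueKey value as the existing document, so Solr replaces it -->
    <field name="id">doc-42</field>
    <field name="title">Updated title</field>
  </doc>
</add>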

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 transactions per day), with each record around 100 KB to 200 KB in size. The data format will be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on a few common attributes, but there may be arbitrary queries for certain operational use cases.
I researched ways to search non-relational data and so far found two approaches used by most technologies:
1) Build an index (Solr/CloudSearch, etc.)
2) Run a Map Reduce job (Hive/HBase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB - something like an Oracle query; it is okay to be slow, but when we get the data, we should get everything that matched the query, or at least be told that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure if it is reliable - the index may be stale. (Is there a way to know the index was stale when we search, so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3, similar to the indexes on Oracle DBs?)
The MR job seems to always be reliable (as it fetches the data from S3 for each query) - is that assumption right? Is there any way to speed up such a query, maybe by partitioning the data in S3 and running multiple MR jobs over the partitions?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left that job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker, and it's still doing fine a year later.
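If staleness is the main worry, the commit settings in solrconfig.xml put an explicit bound on it. A minimal sketch, with placeholder intervals (autoSoftCommit assumes Solr 4.x):

<!-- added documents become durable within 60s and visible to searches within 5s -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>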
I can't speak to the map reduce software, though.
You should think about having one Solr core per week or month, for instance; that way older cores become read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day indefinitely, you need either that or Solr sharding; a single core will not be enough forever.
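As a sketch of that per-period approach, with SolrCloud's Collections API a new collection per month could be created like this (the name, shard count, and replication factor are placeholders; a plain CoreAdmin CREATE works similarly for non-cloud setups):

/solr/admin/collections?action=CREATE&name=txns_2013_06&numShards=4&replicationFactor=2

Queries for a given date range then only need to hit the collections covering that range, and the older, read-only ones can be optimized once and left alone.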
