Which Solr QueryParser is the fastest for a simple query?

I have a query like sku:(123 456 ... 999) with 9000 SKUs in it. The sku field is a "string" type. Which QueryParser should I use to get maximum performance from Solr?
Solr version is 7.2

If these SKUs are defined as the uniqueKey for your document, you can use the Realtime Get endpoint and bypass almost everything in Solr. That'll probably be the most performant way of handling it, but it changes the expected behavior slightly (non-committed documents are returned as well, for example).
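For example, a Realtime Get request for those SKUs could look like this (the core name is a placeholder; the ids parameter takes comma-separated uniqueKey values):
/solr/mycore/get?ids=123,456,999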
Otherwise the performance difference will probably be negligible, so go with the standard Lucene query parser. If you want to optimize it further, it's probably better to look at the query profile (i.e. if it's the same set of 9000 SKUs being requested - index a tag for those SKUs instead and query for that).
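For example, with a hypothetical sku_group field indexed with a marker value on those 9000 documents, the whole list collapses to:
fq=sku_group:promo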
In all cases this can differ based on your document profile and your server's performance, so the strategy is usually to test it for your specific use case and get timing information for your own infrastructure.

Related

Will a searchable field give me better performance in Azure Cognitive Search?

We use Azure Search, and there are some collection fields (up to 40 or 50 items each), for example:
CacheId:["1","2","1a"].
Then we may have a query like: for items belonging to CacheId 1 or 2, retrieve the facets for the field "Category".
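For reference, the current collection-field query might look something like this in the REST API (index name and api-version omitted; search.in expresses the OR over the collection values):
search=*&$filter=CacheId/any(c: search.in(c, '1,2'))&facet=Category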
The index has around 500k documents, and we sometimes see slowdowns or throttling when it is busy.
I am wondering if we can change this CacheId field from a Collection to a space-separated string (e.g. "1 2 1a") and then use the standard analyser for the field.
After that, I can run query such as:
search=CacheId:2b 1&searchMode=any
This will give all the documents that have CacheId 2b or 1, and then I add the facet to the query.
However, I couldn't find any documentation on whether this approach will be any quicker compared to the current Collection field.
Does anyone have more knowledge on this? Will it make things better, worse or no difference at all?
Azure Search has some documentation on how to analyze, monitor, and improve query performance. You could use those resources to try and optimize your current queries first.
If no optimizations can be made, your best bet will be to test the performance of both setups using your production queries. I'm doubtful that moving from a collection to a string will improve performance, especially if following the best practices mentioned in the linked docs, but you can gather data through testing to be sure.

How to help my Solr engine to understand related terms?

I have a big list of related terms (not synonyms) that I would like my Solr engine to take into account when searching. For example:
Database --> PostgreSQL, Oracle, Derby, MySQL, MSSQL, RabbitMQ, MongoDB
For this kind of list, I would like Solr to take into account that a user searching for "postgresql configuration" might also want results related to "RabbitMQ" or "Oracle", but not as absolute synonyms - just to boost results that contain these keywords/terms.
What is the best approach to implement such connection? Thanks!
You've already discovered that these are synonyms - and that you want to use that metainformation as a boost (which is a good idea).
The key is then to define a field that does what you want - in addition to your regular field. Most of these cases are implemented by having a second field that does the "less accurate" version of the field, and applying a lower boost to matches in that field compared to the accurate version.
You define both fields - one with synonyms (for example content_synonyms) and one without (content) - and then add a copyField instruction from the content field (this means that Solr will take anything submitted to the content field and "copy" it as the source text for the content_synonyms field as well).
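A minimal schema.xml sketch of that setup - the field names, the text_related type, and the related_terms.txt file are placeholders to adapt:
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="content_synonyms" type="text_related" indexed="true" stored="false"/>
<copyField source="content" dest="content_synonyms"/>
<fieldType name="text_related" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- related_terms.txt holds lines such as: postgresql, oracle, derby, mysql, mssql, rabbitmq, mongodb -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="related_terms.txt" expand="true"/>
  </analyzer>
</fieldType>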
Using edismax you can then use qf to query both fields and give a higher weight to the exact content field: qf=content^10 content_synonyms will score hits in content 10x higher than hits in content_synonyms, in effect using the synonym field for boosting content.
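For example, assuming the two fields above, a request could look like this (the weight 10 is just a starting point):
q=postgresql configuration&defType=edismax&qf=content^10 content_synonyms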
The exact weights will have to be adjusted to fit your use case, document profile and query profile.

Lucene and SQL Server - best practice

I am pretty new to Lucene, so would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents.
Q1) In this case, in order to do keyword searches on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and the other in the Lucene index)? That could be an issue, since we have a massive amount of documents (about 100GB). Is it inevitable?
Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for the tag search? If so, how do I do it?
Thanks,
Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here for a brief introduction. A typical implementation is to index anything you wish to be able to search on, store only a unique identifier in the Lucene index, and pull any records found by a search from the database based on that ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and only query the database in order to fetch the full document.
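A rough sketch of that pattern using the same Lucene 3.x-era API as the tag example below (writer, sqlPrimaryKey and documentText stand in for your own objects):
Document document = new Document();
// Store the SQL Server primary key so hits can be resolved back to the database.
document.add(new Field("id", sqlPrimaryKey, Field.Store.YES, Field.Index.NOT_ANALYZED));
// Index the full text for searching, but don't store it - this keeps the index lean.
document.add(new Field("content", documentText, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(document);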
As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though: Lucene stores the inverted index used for searching entirely separately from stored data. I'd therefore recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space, since indexed-only values tend to be very space-efficient in most cases.
Lucene can certainly implement a tag search. The simplest way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:
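// Each call adds another value to the multi-valued "tags" field (Lucene 3.x API).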
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
You could then simply add a required term to any query to search only within a particular tag. For instance, if you were to search for "some stuff", but only with the tag "forkids", you could write a query like:
some stuff +tags:forkids
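A sketch of executing that query and resolving hits back to SQL Server, assuming an analyzer and an IndexSearcher named searcher are already set up (same Lucene 3.x API as above):
QueryParser parser = new QueryParser(Version.LUCENE_35, "content", analyzer);
Query query = parser.parse("some stuff +tags:forkids");
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc hit : hits.scoreDocs) {
    // Fetch the full record from SQL Server using the stored ID.
    String id = searcher.doc(hit.doc).get("id");
}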
Documents can also be stored in Lucene; you can retrieve and reference them using the document ID.
I would suggest using Solr (http://lucene.apache.org/solr/) on top of Lucene; it is more user-friendly and has multiValued fields (for the tags) available by default.
http://wiki.apache.org/solr/SchemaXml

Apply Solr filter query to only part of the search results

I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of say "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query, applying the first filter to get the first set of topDocs and then generating a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data into two different groups by marking a specific field during indexing, and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the two-query solution's performance, or investigating some kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a varying boost so that all the results for type:digital come out on top and the rest of the documents follow; no separate queries needed. The boost can be changed per type value.
The other approach is not to display the results for types other than digital, but still display the facets for the other types, with their counts, so users know whether other types exist for the search term. You can check out tagging and excluding filters. Both approaches are sketched below.
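Rough illustrations of both - the edismax boost query and the tagged/excluded filter (the weight 100 and the tag name are placeholders):
q=camera&defType=edismax&bq=type:digital^100
q=camera&fq={!tag=t}type:digital&facet=true&facet.field={!ex=t}type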
Result grouping might give you what you want. Just group by that parameter and specify a sufficiently large top number of documents in each group.
But I would test whether its performance is any better than the two queries, since the documentation mentions performance in its limitations section.
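For example (group.limit is a placeholder for however many documents you need per group):
q=camera&group=true&group.field=type&group.limit=10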

Create a Solr Index using Lucene IndexWriter

I need to index vast amounts of content in extremely short order. I have tried various techniques with SolrNet/Solr using threading and the TPL, but the speeds leave a lot to be desired, so I am considering a move to using the Lucene.NET IndexWriter to create the index (preliminarily I see almost an order of magnitude of speed improvement). Any "gotchas" to be aware of?
I am not too sure if:
1. Trie-based numeric range queries would continue to be available via Solr (I am using NumericFields in Lucene)?
2. Faceting etc. would continue to be available?
Anything else I need to watch out for?
Please see Scaling Lucene and Solr for advice on improving run times.
If you decide to go with Lucene:
You need a unique id field for the index to be a valid Solr index.
The schema must match the Solr schema.
The Lucene version must be the same as in Solr.
I think the range query and faceting will be available, as long as you index the respective fields according to the requirements in Solr, and use the same analyzers.
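A minimal Java sketch of those constraints (the Lucene.NET calls are analogous); the directory path, field names, and precisionStep are placeholders that must mirror the target schema.xml:
// The Lucene version and analyzer must match what the Solr core expects.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/solr/data/index")), config);
Document doc = new Document();
// The uniqueKey field declared in schema.xml is required for a valid Solr index.
doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
// Use the same precisionStep as the trie type in schema.xml so Solr's
// numeric range queries and faceting keep working against this field.
doc.add(new NumericField("price", 8, Field.Store.YES, true).setIntValue(42));
writer.addDocument(doc);
writer.close();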
