I have a solr server and query it using solrj. Suppose I want to search for results within only specific documents, and I have the ID's of the documents I want to look in. How do I configure the query to only return results from a specified list of documents?
List<String> documentList = ...; // collection of Strings of the
                                 // ID's of the documents I want
                                 // to look for
this.query = new SolrQuery();
this.query.setFields("id", "score");
this.query.addSort("score", SolrQuery.ORDER.desc);
this.query.addSort("id", SolrQuery.ORDER.desc);
this.query.setQuery(searchString);
What do I need to do to make it so that all of the documents returned by the query are documents whose ID is in the list of acceptable documents?
I've not used SolrJ much, but it should be as easy as adding a filter query with the document list (I'm assuming you don't want a document's acceptability to affect the score), e.g.:
String filterQuery = "id:(1 OR 2 OR 3)";
this.query.addFilterQuery(filterQuery);
So you'll want to convert documentList into a string delimited by OR (and yes, I believe it does have to be uppercase).
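For example, a minimal sketch of building that filter query from documentList (assuming the IDs contain no characters that need escaping; otherwise run each one through ClientUtils.escapeQueryChars first):

String filterQuery = "id:(" + String.join(" OR ", documentList) + ")";
this.query.addFilterQuery(filterQuery);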
If the number of acceptable documents is really large, then you'll have to raise the maximum number of boolean clauses allowed in a query (maxBooleanClauses in solrconfig.xml; the default is 1024, but I've used 32768 with no problems).
I am using SolrJ (with Solr 7) and my index features some fields for the document contents, named content_eng, content_ita, ...
It also features a field with the full path to the document (processed by a StandardTokenizer and a WordDelimiterGraphFilter).
The user is able to search in the content_xyz fields thanks to these lines:
final SolrQuery query = new SolrQuery();
query.setQuery(searchedText);
query.set("qf",searchFields); // searchFields is a generated String which looks like "content_eng content_ita" (field names separated by space)
Now the user needs to be able to specify some words contained in the path (namely some subdirectories), so I added a filter query:
query.addFilterQuery("full_path_split:" + searchedPath);
If searchedPath contains a single word from the document path, the document is correctly returned; however, if searchedPath contains several words from the path, the document is not returned. To sum it up, the fq only works if searchedPath contains a single word.
For example doc1 is in /home/user/dir1/doc1.txt
If I search for all documents (* in searchedText) that are in the user dir (fq=full_path_split%3Adir), doc1.txt is returned.
If I do the same search but for documents that are in user and dir1 (fq=full_path_split%3Auser+dir1), doc1.txt is not returned. I think this is because the fq is parsed as "+full_path_split:user +text:dir1", as debug=query shows. I don't know where text comes from; it may be a default field.
So is it possible to use a filter query with several words to fulfill my needs?
Any help appreciated,
Your suspicion is correct - the _text_:dir1 part comes from you not providing a field name, and the default field name being used instead.
You can work around this by using the more general edismax (or the older dismax) parser as you're doing in your main query with qf:
fq={!type=edismax qf='full_path_split'}user dir1
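In SolrJ, that filter can be added directly; a sketch, reusing the field name and variables from the question:

query.addFilterQuery("{!type=edismax qf='full_path_split'}" + searchedPath);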
I would like to create an output based on the field-names of my Solr index objects.
What I have are objects like this e.g.:
{
  "Id": "ID12345678",
  "GroupKey": "Beta",
  "PricePackage": 5796.0,
  "PriceCoupon": 5316.0,
  "PriceMin": 5316.0
}
The Price* fields may vary from object to object; some objects might have more than three of them, some fewer, but they are always prefixed with Price.
How can I query Solr to get a list with all field-names prefixed by Price?
I've looked into filters, facets but could not find any clue on how to do this, as all examples - e.g. regex facet - are in regard to the field-value, not the field-name itself. Or at least I could not adapt it to that.
You can get a comma separated list of all existing field names if you query for 0 documents and use the csv response writer (wt parameter) to generate the field name list.
For example, if you request /solr/collection/select?q=*:*&wt=csv&rows=0 you get a list of all field names. If you only want fields prefixed with Price, you can also add the field list parameter (fl) to limit the fields.
So the request to /solr/collection/select?q=*:*&wt=csv&rows=0&fl=Price* should return the following response:
PricePackage,PriceCoupon,PriceMin
With this solution you get all fields existing including dynamic fields.
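If you need the list from SolrJ, here is a hedged sketch: NoOpResponseParser returns the raw response body instead of parsing it, so you can read the CSV header line directly (the collection URL is an assumption):

SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/collection").build();
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "*:*");
params.set("rows", 0);      // header only, no documents
params.set("wt", "csv");
params.set("fl", "Price*"); // limit the header to Price-prefixed fields
QueryRequest request = new QueryRequest(params);
request.setResponseParser(new NoOpResponseParser("csv"));
String fieldNames = (String) client.request(request).get("response");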
I'm looking for a solution for my very long query strings, which are returning a 414 HTTP response. Some queries can reach 10,000 chars or more. I could look at raising how many chars Apache/Jetty allows, but I'd rather not let anyone post 10,000 chars to my webserver.
Is there a way in solr where I can save a large query string in a document and use it in a filtered query?
select?q=*:*&fq=id:123 - this would return a whole document, but is there a way to use the value of one of document 123's fields in the query?
The field queryValue in document with the id of 123 would be Intersects((LONGSTRING))
So is there a way to do something like select?q=*:*&fq=foo:{id:123.queryValue}, which would be the same as select?q=*:*&fq=foo:Intersects((LONGSTRING))?
Two possibilities:
Joining
You can use the Join query parser to fetch the result from one collection / core and use that to filter results in a different core, but there are several limitations that will be relevant when you're talking larger installations and data sizes. You'll have to experiment to see if this works for your use case.
The Join Query Parser
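The general join syntax looks like this (a hedged illustration; the queries core and the field names are hypothetical). Note that the join parser matches on field values; it does not execute the stored string as a query, which is one of the limitations mentioned above:

fq={!join from=queryValue to=foo fromIndex=queries}id:123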
Hashing
As long as you're only doing exact matches, hash the string on the client side when indexing and when querying. Exactly how you do this will depend on your language of choice. For Python you'd get the hash of the long string using hashlib, and by using sha256 you'll get a resulting string that you can use for indexing and querying; it's 64 characters in the hex form, 44 if you're using base64.
Example:
>>> import hashlib
>>> hashlib.sha256(b"long_query_string_here").hexdigest()
'19c9288c069c47667e2b33767c3973aefde5a2b52d477e183bb54b9330253f1e'
You would then store the 19c92... value in Solr, and apply the same transformation to the value you're querying for.
fq=hashed_id:19c9288c069c47667e2b33767c3973aefde5a2b52d477e183bb54b9330253f1e
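The same transformation in Java, since the rest of this thread is SolrJ (a sketch using the standard MessageDigest API; getInstance can throw NoSuchAlgorithmException):

MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest("long_query_string_here".getBytes(StandardCharsets.UTF_8));
StringBuilder hex = new StringBuilder();
for (byte b : hash) hex.append(String.format("%02x", b));
// hex.toString() -> 19c9288c069c47667e2b33767c3973aefde5a2b52d477e183bb54b9330253f1e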
There might be alternatives worth trying before the literal solution you seek:
You can POST the query to Solr instead of using GET. There is no URL length limit on that (see the sketch after this list)
If you are sending a long list of ids in an OR construct, there are alternative query parsers that make it more efficient (e.g. TermsQueryParser)
If you have constant (or semi-constant) query parameters, you could factor them out into defaults on request handlers (in solrconfig.xml). You can create as many request handlers as you want and defaults can be overridden, so this effectively allows you to pre-define classes/types of queries.
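A minimal SolrJ sketch combining the first two points (the id field and an existing client are assumptions):

SolrQuery query = new SolrQuery("*:*");
// TermsQueryParser: efficient for long, unscored ID lists
query.addFilterQuery("{!terms f=id}" + String.join(",", ids));
// POST the request so the long filter never hits a URL length limit
QueryResponse rsp = client.query(query, SolrRequest.METHOD.POST);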
I am trying to compare the contents of documents using solr. I do this by simply using the entire document contents as a query. This works until the documents get large. A document can contain as many as 15k words or more. This results in a max boolean clause exception which has a default value of 1024. Now I could of course increase this value, but even if I increase it to 5k then it will remain impossible to compare documents with large contents.
Is Lucene even suitable for such tasks? And if so, what should I do to accomplish said requirements. If not, what would be an alternative way of comparing the contents of one document with other documents?
I think MoreLikeThis is what you're looking for. MoreLikeThis prunes a document's contents down to its highest-frequency terms and searches with just those, which gets around the high number of terms (and improves performance). If you are searching for documents similar to an external source:
MoreLikeThis mlt = new MoreLikeThis(indexReader);
Query query = mlt.like(someReader, "contents");
TopDocs hits = indexSearcher.search(query, 10); // the old Hits class was removed from Lucene
Or if searching for a document already in the index:
MoreLikeThis mlt = new MoreLikeThis(indexReader);
Query query = mlt.like(documentNumber); // internal Lucene document number
TopDocs hits = indexSearcher.search(query, 10);
Solr also includes a MoreLikeThis handler.
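A hedged example of calling that handler, assuming it is registered at /mlt in solrconfig.xml and the content field is named contents:

/solr/collection/mlt?q=id:doc1&mlt.fl=contents&mlt.mintf=1&mlt.mindf=1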
I want to perform a comparison based on one field in a Solr document before it is sent to the Writer or to the user. I want to have the final result object, probably SolrDocumentList, so that I can loop through all the SolrDocument objects and perform a field to field comparison. For instance, if my search returns 10 documents and 5 documents have myfield="myValue", my final list should contain 6 documents with only one document having myfield="myValue", the other 4 documents should be discarded, regardless of what the other fields' contents are.
Is there any plugin for this?
If not, where should I place my code?
You can use Result Grouping / Field Collapsing. Try something like this: &q=solr+memory&group=true&group.field=manu_exact&group.main=true
More documentation here: http://wiki.apache.org/solr/FieldCollapsing
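In SolrJ, the same parameters can be set like this (a sketch; myfield stands in for your field). With group.main=true you get back a flat document list containing the most relevant document from each group, which matches the one-document-per-value requirement:

SolrQuery query = new SolrQuery("solr memory");
query.set(GroupParams.GROUP, true);            // group=true
query.set(GroupParams.GROUP_FIELD, "myfield"); // group.field=myfield
query.set(GroupParams.GROUP_MAIN, true);       // group.main=true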