I'm looking for a solution for very long query strings that return a 414 HTTP response. Some queries can reach 10,000 chars. I could raise how many chars Apache/Jetty allows, but I'd rather not let anyone post 10,000-character requests to my web server.
Is there a way in solr where I can save a large query string in a document and use it in a filtered query?
select?q=*:*&fq=id:123 - this would return the whole document, but is there a way to use the value of a field from document 123 in the query?
The field queryValue in document with the id of 123 would be Intersects((LONGSTRING))
So is there a way to do something like select?q=*:*&fq=foo:{id:123.queryValue}
this would be the same as select?q=*:*&fq=foo:Intersects((LONGSTRING))?
Two possibilities:
Joining
You can use the Join query parser to fetch the result from one collection / core and use that to filter results in a different core, but there are several limitations that will be relevant when you're talking larger installations and data sizes. You'll have to experiment to see if this works for your use case.
The Join Query Parser
Hashing
As long as you're only doing exact matches, hash the string on the client side both when indexing and when querying. Exactly how you do this depends on your language of choice. In Python you'd get the hash of the long string using hashlib; with SHA-256, the resulting string you index and query is 64 characters in hex form, or 44 in base64.
Example:
>>> import hashlib
>>> hashlib.sha256(b"long_query_string_here").hexdigest()
'19c9288c069c47667e2b33767c3973aefde5a2b52d477e183bb54b9330253f1e'
You would then store the 19c92... value in Solr, and apply the same transformation to the value you're querying for.
fq=hashed_id:19c9288c069c47667e2b33767c3973aefde5a2b52d477e183bb54b9330253f1e
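A minimal sketch of the same transformation on both sides, in Python (the field name hashed_id and the placeholder query string are assumptions):

```python
import hashlib

def hash_term(value: str) -> str:
    """Reduce an arbitrarily long string to a fixed-length, exact-match token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# At index time, store the hash alongside the document:
long_query = "long_query_string_here"  # stands in for the real 10,000-char value
doc = {"id": "123", "hashed_id": hash_term(long_query)}

# At query time, hash the incoming value the same way and filter on it:
fq = f"hashed_id:{hash_term(long_query)}"
print(fq)
```

Note that this only works for exact matches - any change to the long string produces a completely different hash.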
There might be alternatives worth trying before implementing the literal solution you seek:
You can POST the query to Solr instead of using GET. There is no URL length limit on that
If you are sending a long list of ids combined with OR, there are alternative query parsers that make this more efficient (e.g. TermsQueryParser)
If you have constant (or semi-constant) query parameters, you could factor them out into defaults on request handlers (in solrconfig.xml). You can create as many request handlers as you want, and defaults can be overridden, so this effectively allows you to pre-define classes/types of queries.
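As an illustrative sketch (the handler name and default values here are made up), such a pre-configured request handler in solrconfig.xml could look like:

```xml
<!-- solrconfig.xml: a handler with pre-defined query parameters -->
<requestHandler name="/select-geo" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^2 body</str>
    <str name="rows">20</str>
  </lst>
</requestHandler>
```

Clients can still override any of these defaults per request.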
Related
I have facets I want to use for filtering, which I assume is the most common use of facets.
The filter in the UI is through a multi-select dropdown. However, the text labels in the facets are quite long, and when selecting multiple, I end up with very long strings to use in the filter. Therefore I want to use keys/ids for each facet text. But how do I get a key out of a facet, and not just the value?
--- Facet example ---
Id | Value
--------------------------------------------------------------------------
1 | This is a very long facet text with many characters, including æøå.
2 | And there are other texts, also with / and & and more æ, ø and å.
If I had an id in the facet, retrieved from the index, where the facet would be a complex type with key and value, then I could use that when selected in the UI and do a filter on the id instead of the long text(s).
Ideas, input?
Thanks!
Unfortunately, there isn't a concept of "complex" facets, which is what you are requesting.
Facets only return the text and the count indicating how often it occurs in the source documents.
When you complain about very long strings in the filter, is it because you are running into request size limits? Have you considered POST vs GET when making your query?
Have you considered using search.in if your search term cardinality is quite high (as described here)?
Facets in general are not meant to have extremely long values, as they serve the purpose of quick filtering/hierarchical navigation for the end user. Although technically you can make any field facetable, fields which represent full text, or which have high cardinality, usually should not be used as facets.
One possible workaround would be to have another field in your index which applies some fixed-length hash to your text field (this should be quite possible using the push API; we don't have this facility via indexers). Then, once you get back a list of facets, you can apply the same hash function on the client side (UI) and query against that "new" field with the generated (presumably short) fixed-length string.
Can someone explain, with an example, how a Solr function query is used?
I could not find any concrete example showing how results differ with and without function queries.
I want an example URL and what it shows in the response result.
A function query is a query that invokes a function on one (or more) of the fields available. You add a function query if the value you have in a field has to be processed to get the value you want - just as you'd do in a mathematical sense.
Showing "the difference between a query with function queries and without" isn't really possible, as they don't do the same thing. You pick one (or both) depending on what you need.
An adapted example from the reference manual: let's imagine we have a set of documents that describe users, and these users have two fields - mails_read and mails_received. To get anyone who has read less than 50% of their mails, we can apply a filter query as a function (with the frange query parser; fq means filter query - the frange is what makes it a function query):
fq={!frange l=0 u=0.5}div(mails_read,mails_received)
Otherwise we'd be limited to retrieving those who had read a specific range of emails or received a specific range of emails - or we'd have to index a value that kept mails_read / mails_received updated each time we updated the document (which is a perfectly valid strategy, and usually more efficient).
Another example is to use a function query for boosting documents, and the most common one is to boost by recency (i.e. that a more recent document receives a larger boost):
bf=recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1)
This applies the recip function to the difference (expressed in milliseconds) between the mydatefield field and the current hour.
recip: Performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
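The definition above is easy to check numerically. A small Python sketch (the constant 3.16e-11 is roughly 1 / milliseconds-per-year) shows why this particular boost halves for a document that is a year old:

```python
def recip(x: float, m: float, a: float, b: float) -> float:
    """Solr's recip(x,m,a,b) = a / (m*x + b)."""
    return a / (m * x + b)

MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # about 3.15e10 ms

# A document dated right now (x = 0) gets the full boost:
print(recip(0, 3.16e-11, 1, 1))  # 1.0

# A one-year-old document gets roughly half the boost,
# since m * MS_PER_YEAR is approximately 1:
print(recip(MS_PER_YEAR, 3.16e-11, 1, 1))
```

The further back the date, the smaller the boost, with a smooth rather than abrupt decay.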
Yet another fine use case is to use the special _val_ field - if you query against this magic field with a function, the value returned by the function will be used as the score of the document (instead of affecting it through boosting or limiting the resulting set of documents as a query).
_val_:"div(popularity, price)"
.. would give the score of the document based on the result of the division (what the values represent is up to you).
While working with search definition which looks like
search music{
document music{
field title type string {
indexing: summary | attribute | index
}
}
}
If I use my own tokenization logic by developing a document processor (saving the processed tokens in the Processing context), how do I store the tokens in the base index, and how are they mapped back to the original content of the field at query time? Is this solved by a ProcessingEndPoint? If yes, how?
First, you should almost certainly drop "attribute" for this field - "attribute" means the text will be stored in a forward store in memory in addition to creating an index for searching. That may be useful for structured data for sorting, grouping and ranking, but not for a free-text field.
Unnecessary details:
You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing is stored as annotations over the text, which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html
The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization at the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
However, there is a much simpler way to do this: You can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html: Create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed both on the document processing and query side.
The reason for doing this is usually to provide linguistics for languages other than English. If you do this, please consider contributing your linguistics code back to Vespa.
I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the url.stream GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?
Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it with a (dedicated) SearchHandler.
I've already done something similar, and it is quite easy if you look at the original implementation of the MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend using your own parameter names in your implementation to prevent collisions with other components.
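Short of writing a custom component, the Rocchio-style behaviour can also be approximated on the client side. The sketch below is only an illustration: it computes toy TF-IDF statistics over the input set itself (a real implementation would use the index's term statistics), and the field name text, the tokenization, and the scoring details are all assumptions:

```python
import math
from collections import Counter

def top_terms(docs, n=5):
    """For each input document, pick its n highest TF-IDF terms.
    IDF is computed over the input set only - a toy stand-in for index stats."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scored = {
            term: (count / len(tokens)) * math.log(len(docs) / df[term])
            for term, count in tf.items()
        }
        result.append(sorted(scored.items(), key=lambda kv: -kv[1])[:n])
    return result

def build_query(per_doc_terms, field="text"):
    """Combine every document's top terms into one boosted OR query,
    so each input document contributes terms regardless of its length."""
    clauses = [
        f"{field}:{term}^{weight:.2f}"
        for terms in per_doc_terms
        for term, weight in terms
        if weight > 0
    ]
    return " OR ".join(clauses)

docs = [
    "solr search relevance feedback",
    "vespa search ranking",
    "relevance ranking with rocchio",
]
print(build_query(top_terms(docs, n=2)))
```

The resulting string can be sent as a regular q parameter; extracting terms per document avoids the bias against short documents described above.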
Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard Solr behavior that, in my case, tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important, as the hardware I run Solr on is not too powerful, and boosting at query time fails with an OutOfMemoryError. (Even if I could work around that by increasing memory for Java, I prefer to be on the safe side and build the index as efficiently as possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know, there's no way of resolving this only on the Solr server side at the moment.
If you're using the regular XML-based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document, depending on the length of the text field.
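A sketch of what that client-side generation could look like, using the legacy XML update format's document-level boost attribute. The 1/sqrt(length) formula (which ranks shorter texts higher) is an assumption, not something from the question:

```python
import math
from xml.sax.saxutils import escape

def doc_xml(doc_id: str, text: str) -> str:
    """Build a Solr XML update <doc> with a length-based boost.
    Assumption: boost = 1/sqrt(len(text)), so shorter text scores higher."""
    boost = 1.0 / math.sqrt(max(len(text), 1))
    return (
        f'<doc boost="{boost:.4f}">'
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="text">{escape(text)}</field>'
        "</doc>"
    )

print(doc_xml("1", "ABCD"))
print(doc_xml("2", "ABCDEBCE"))
```

With this formula, ABCD (boost 0.5) outranks ABCDEBCE (boost roughly 0.35), matching the ordering asked for in the question.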
You can look at the DIH Special Commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems to be no $fieldBoost command.
For your case, though, if you are using DefaultSimilarity, shorter fields already score higher than longer fields in the score calculation.
You can certainly implement your own Similarity class with changed TF (term frequency) and lengthNorm calculations to suit your needs.