Is there a way to dereference parameters across different facets in Solr? - solr

I have a Solr JSON facet query which calculates a metric for the present year. I want to enhance this query to calculate this exact metric for the previous year as well and then calculate the ratio of increase/decrease.
Here is the JSON facet query that I have written so far -
json.facet={
"thisYear": {
"type": "terms",
"field": "<some-value>",
"domain": {
"filter": "<query to identify this year's document>"
},
"facet": {
"thisYearFacet": "<the metric calculated for the present year>"
}
},
"lastYear": {
"type": "terms",
"field": "<some-value>",
"domain": {
"filter": "<query to identify last year's document>"
},
"facet": {
"lastYearFacet": "<the metric calculated for the last year>"
}
},
// Here is where I am facing trouble!
"compare": {
"type": "func",
"func": "avg(div($thisYearFacet,$lastYearFacet))"
}
}
But running the above query throws the following error -
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.search.SyntaxError"],
"msg":"org.apache.solr.search.SyntaxError: Missing param thisYearFacet while parsing function 'avg(div($thisYearFacet,$lastYearFacet))'",
"code":400}}
Is there a way to make the calculated variables "thisYearFacet" and "lastYearFacet" accessible in the "compare" facet?
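The "Missing param" message shows that `$thisYearFacet` is being looked up as a request parameter rather than as the result of a sibling facet. One fallback, sketched here on the assumption that doing the division client-side is acceptable, is to drop the "compare" facet and combine the two bucket lists after the response comes back. The function name and the sample numbers below are made up for illustration; only the `facets` / `buckets` / `val` layout follows the JSON facet response format.

```python
def year_over_year_ratios(response):
    """Map each bucket value to thisYearFacet / lastYearFacet (None if undefined)."""
    facets = response.get("facets", {})
    this_year = {
        b["val"]: b.get("thisYearFacet")
        for b in facets.get("thisYear", {}).get("buckets", [])
    }
    last_year = {
        b["val"]: b.get("lastYearFacet")
        for b in facets.get("lastYear", {}).get("buckets", [])
    }
    ratios = {}
    for val, current in this_year.items():
        previous = last_year.get(val)
        # No ratio when either side is missing or last year's metric is zero.
        if current is None or not previous:
            ratios[val] = None
        else:
            ratios[val] = current / previous
    return ratios


# Hypothetical response, trimmed to the parts the function reads.
sample = {
    "facets": {
        "count": 5,
        "thisYear": {"buckets": [
            {"val": "a", "count": 3, "thisYearFacet": 30.0},
            {"val": "b", "count": 2, "thisYearFacet": 10.0},
        ]},
        "lastYear": {"buckets": [
            {"val": "a", "count": 4, "lastYearFacet": 20.0},
        ]},
    }
}
print(year_over_year_ratios(sample))
```

Buckets that exist in only one of the two years get `None` instead of a ratio, so the caller can decide how to display them.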

Related

Solr json facet query, unique for particular values

I have the following query. For the two company ids, I would also like to get the unique rows (unique_internal_plays and unique_external_plays). Is that possible?
{
"facet":{
"unique_viewers" : "unique(uuid)",
"internal_plays": {
"type": "query",
"q": "company:100"
},
"external_plays": {
"type": "query",
"q": "-company:100"
},
"unique_internal_plays": {
"type": "query",
"q": "company:100"
},
"unique_external_plays": {
"type": "query",
"q": "-company:100"
}
}
}
For any facet in the JSON Facet API you can further divide the given facet into nested facets. If you combine this with a stats facet (an aggregate facet), you can get the unique count for a field in that specific bucket:
"internal_plays": {
"type": "query",
"q": "company:100",
"facet": {
"unique_viewers": "unique(uuid)"
}
}
This will create a nested facet under the facet query, effectively giving you a way to further pivot/run statistics across the set for the matching documents.
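As a rough sketch of how this applies to both company filters from the question, the request body can be built in Python so the nested `unique(uuid)` sub-facet is not repeated by hand. The helper name `query_facet_with_unique` is made up for this example; the facet names and queries come from the question.

```python
import json


def query_facet_with_unique(q, field="uuid"):
    """Build a query facet carrying a nested unique-count sub-facet."""
    return {
        "type": "query",
        "q": q,
        "facet": {"unique_viewers": f"unique({field})"},
    }


json_facet = {
    # Unique count over the whole result set.
    "unique_viewers": "unique(uuid)",
    # Per-bucket counts, each with its own nested unique count.
    "internal_plays": query_facet_with_unique("company:100"),
    "external_plays": query_facet_with_unique("-company:100"),
}

print(json.dumps({"facet": json_facet}, indent=2))
```

With this shape the separate unique_internal_plays and unique_external_plays facets are no longer needed, since each query bucket reports its own unique count next to its document count.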

Manipulate field value of copy-field in Apache Solr

I have a simple string "PART_NUMBER" value as a field in Solr. I would like to add an additional field which places that value in a URL field. To do this, I created a new field type, a field, and a copy field:
"add-field-type": {
"name": "endpoint_url",
"class": "solr.TextField",
"positionIncrementGap": "100",
"analyzer": {
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
},
"filters": [
{
"class": "solr.PatternReplaceFilterFactory",
"pattern": "([\\s\\S]*)",
"replacement": "http://myurl/$1.jpg"
}
]
}
},
"add-field": {
"name": "URL",
"type": "endpoint_url",
"stored": true,
"indexed": true
},
"add-copy-field":{ "source":"PART_NUMBER", "dest":"URL" }
As some of you probably guessed, my query output looks like
{
"id": "1",
"PART_NUMBER": "ABCD1234",
"URL": "ABCD1234",
"_version_": 1645658574812086272
}
This happens because the endpoint_url field type only modifies the indexed value, not the stored one. Indeed, when doing my analysis, I get
http://myurl/ABCD1234.jpg
My question: Is there any way to apply a tokenizer or filter and feed the result back into the field value? I would prefer this output when returning the result:
{
"id": "1",
"PART_NUMBER": "ABCD1234",
"URL": "http://myurl/ABCD1234.jpg",
"_version_": 1645658574812086272
}
Is this possible to do in Solr?
The solution was posted here:
Custom Solr analyzers not being used during indexing
I need to use an Update Request Processor to change the field value before analysis. The process is described here:
https://lucene.apache.org/solr/guide/8_1/update-request-processors.html
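For comparison, here is a minimal sketch of the same idea done on the client before the document is sent to Solr. This is not the update-processor approach itself, just an illustration of why the value has to be changed before it is stored: analysis only affects the indexed tokens. The helper name `with_url` and its parameters are hypothetical; the field names and URL pattern come from the question.

```python
def with_url(doc, source="PART_NUMBER", dest="URL",
             template="http://myurl/{}.jpg"):
    """Return a copy of doc with dest derived from source, if source exists."""
    enriched = dict(doc)
    if source in enriched:
        enriched[dest] = template.format(enriched[source])
    return enriched


doc = {"id": "1", "PART_NUMBER": "ABCD1234"}
print(with_url(doc))
```

If the URL is built this way, the copy field and the PatternReplaceFilterFactory are presumably no longer needed for this purpose, since the stored value already holds the full URL.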

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever. This means the mappings, norms, and analyzed/not_analyzed settings are all at their defaults. I have a very small data set of two items for experimentation purposes. The items have several fields, but I query only one, which is a multi-valued (array of strings) field. The doc looks like this:
{
"_index": "index_profile",
"_type": "items",
"_id": "ega",
"_version": 1,
"found": true,
"_source": {
"clicked": [
"ega"
],
"profile_topics": [
"Twitter",
"Entertainment",
"ESPN",
"Comedy",
"University of Rhode Island",
"Humor",
"Basketball",
"Sports",
"Movies",
"SnapChat",
"Celebrities",
"Rite Aid",
"Education",
"Television",
"Country Music",
"Seattle",
"Beer",
"Hip Hop",
"Actors",
"David Cameron",
... // other topics
],
"id": "ega"
}
}
A sample query is:
GET /index_profile/items/_search
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"profile_topics": [
"Basketball"
]
}
}]
}
}
}
Again there are only two items and the one listed should match the query because the profile_topics field matches with the "Basketball" term. The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays and are to have no norms and no analyzer so profile_topics are not stemmed or tokenized since all values should be treated as tokens (even the spaces). Not sure this would solve the problem but it is how I treat similar data on Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue. If so, how do I solve this so that even with two items the query will return ega? If possible, I'd like to solve this index-wide or type-wide rather than per field.
The value Basketball (with a capital B) in a terms query is not analyzed, so it is looked up in the Elasticsearch index exactly as written.
You say you have the defaults. If so, indexing Basketball under profile_topics field means that the actual term in the index will be basketball (with lowercase b) which is the result of the standard analyzer. So, either you set profile_topics as not_analyzed or you search for basketball and not Basketball.
Read this about terms.
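The mismatch can be reproduced with a toy model. The sketch below is only a rough stand-in for the standard analyzer (split on non-word characters, then lowercase) and is not Elasticsearch code; all function names are made up for the illustration.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(values, analyzed=True):
    """Collect the terms that would end up in the index for one field."""
    terms = set()
    for value in values:
        terms.update(standard_like_analyze(value) if analyzed else [value])
    return terms


def terms_query_matches(index_terms, wanted):
    """A terms query compares its values to index terms without analysis."""
    return any(term in index_terms for term in wanted)


topics = ["Basketball", "Country Music"]
analyzed_index = build_index(topics, analyzed=True)
raw_index = build_index(topics, analyzed=False)

print(terms_query_matches(analyzed_index, ["Basketball"]))  # False
print(terms_query_matches(analyzed_index, ["basketball"]))  # True
print(terms_query_matches(raw_index, ["Country Music"]))    # True
```

In the not_analyzed case the whole value, spaces included, is a single term, which is the behavior the question describes wanting from Solr.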
Regarding setting all the fields to not_analyzed, you could do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field and make only this subfield not_analyzed. The original/parent field still holds the analyzed version of the same text, in case you need the analyzed field in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
"template": "your_indices_name-*",
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"omit_norms": true
},
"dynamic_templates": [
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}

Elasticsearch score results based partly on Popularity

I'm using Elasticsearch for this project but a Solr solution might be appropriate too. In the query I'd like to include a portion of a should clause that will return results even if none of the other terms can. This will be used for document popularity. I'll periodically calculate reading popularity and add a float field to each doc with a numeric value.
The idea is to return docs based on terms but when that fails, return popular docs ranked by popularity. These should be ordered by term match scores or magnitude of popularity score.
I realize that I could quantize the popularity and treat it like a tag ("hottest", "hotter", "hot"...), but I would like to use a numeric field since the ranking is well defined.
Here is the current form of my data (from fetch by id):
GET /index/docs/ipad
returns a sample object
{
"_index": "index",
"_type": "docs",
"_id": "doc1",
"_version": 1,
"found": true,
"_source": {
"category": ["tablets", "electronics"],
"text": ["buy", "an", "ipad"],
"popularity": 0.95347457,
"id": "doc1"
}
}
Current query format
POST /index/docs/_search
{
"size": 10,
"query": {
"bool": {
"should": [
{"terms": {"text": ["ipad"]}}
],
"must": [
{"terms": {"category": ["electronics"]}}
]
}
}
}
This may seem an odd query format but these are structured objects, not free form text.
Can I add popularity to this query so that it returns items ranked by popularity magnitude along with those returned by the should terms? I'd boost the actual terms above the popularity so they'd be favored.
Note I do not want to boost by popularity, I want to return popular if the rest of the query returns nothing.
One approach I can think of is wrapping a match_all filter in a constant_score query and sorting on score followed by popularity.
Example:
{
"size": 10,
"query": {
"bool": {
"should": [
{
"terms": {
"text": [
"ipad"
]
}
},
{
"constant_score": {
"filter": {
"match_all": {}
},
"boost": 0
}
}
],
"must": [
{
"terms": {
"category": [
"electronics"
]
}
}
],
"minimum_should_match": 1
}
},
"sort": [
{
"_score": {
"order": "desc"
}
},
{
"popularity": {
"order": "desc",
"unmapped_type": "double"
}
}
]
}
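To make the intended ordering concrete, here is a small Python model of the two-level sort: score first, then popularity as the tie-breaker, both descending. The hit data is invented for the example; documents matched only by the zero-boost constant_score clause are modeled with a score of 0.

```python
def rank(hits):
    """Order hits by score (descending), then by popularity (descending)."""
    return sorted(
        hits,
        key=lambda hit: (-hit["score"], -hit.get("popularity", 0.0)),
    )


# Hypothetical hits: doc1 matched the "text" terms, the others did not.
hits = [
    {"id": "doc3", "score": 0.0, "popularity": 0.5},
    {"id": "doc1", "score": 1.2, "popularity": 0.1},
    {"id": "doc2", "score": 0.0, "popularity": 0.9},
]

print([hit["id"] for hit in rank(hits)])  # ['doc1', 'doc2', 'doc3']
```

Term matches come first regardless of popularity, and the remaining documents fall back to popularity order, which is the behavior the question asks for.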
You want to look into the function score query and a decay function for this.
Here's a gentle intro: https://www.found.no/foundation/function-scoring/

Elasticsearch is Aggregating by "Partial Term" instead of "Entire Term"

I'm currently trying to do something fancy in elasticsearch...and it ALMOST works.
Use case: I have to limit the number of results per a certain field to (x) results.
Example: In a result set of restaurants I only want to return two locations per restaurant name. If I search Mexican Food, then I should get (x) Taco Bell hits, (x) Del Taco Hits and (x) El Torito Hits.
The Problem: My aggregation is currently only matching partials of the term.
For Instance: If I try to match company_name, it will create one bucket for taco and another bucket for bell, so Taco Bell might show up in 2 buckets, resulting in (x) * 2 results for that company.
I find it hard to believe that this is the desired behavior. Is there a way to aggregate by the entire search term?
Here's my current aggregation JSON:
"aggs": {
"by_company": {
"terms": {
"field": "company_name"
},
"aggs": {
"first_hit": {
"top_hits": {"size":1, "from": 0}
}
}
}
}
Your help, as always, is greatly appreciated!
Yes. If your "company_name" is just a regular string with the standard analyzer, or whatever analyzer you are using for "company_name" splits the name, then that is the explanation. ES stores "terms", not the entire text, unless you tell it to.
Assuming your current analyzer for that field does just what I described above, then you need another - let's call it "raw" - field that should mirror your company_name field but it should store the company name as is.
This is what I mean:
{
"mappings": {
"test": {
"properties": {
...,
"company_name": {
"type": "multi_field",
"fields": {
"company_name": {
"type": "string" #and whatever you currently have in your mapping for `company_name`
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
And in your query, you'll do it like this:
"aggs": {
"by_company": {
"terms": {
"field": "company_name.raw"
},
"aggs": {
"first_hit": {
"top_hits": {"size":1, "from": 0}
}
}
}
}
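The difference between the two fields can be illustrated with a toy bucket count. This is a simplified model rather than Elasticsearch code: whitespace splitting plus lowercasing stands in for the analyzer, and the sample names are taken from the question.

```python
from collections import Counter


def terms_buckets(names, analyzed):
    """Count documents per term, the way a terms aggregation buckets them."""
    counts = Counter()
    for name in names:
        # An analyzed field contributes one term per word; a raw field
        # contributes the whole name as a single term.
        keys = set(name.lower().split()) if analyzed else {name}
        counts.update(keys)
    return dict(counts)


names = ["Taco Bell", "Taco Bell", "Del Taco"]

print(terms_buckets(names, analyzed=True))
print(terms_buckets(names, analyzed=False))
```

On the analyzed field, "taco" becomes a bucket shared by both companies, while the raw field yields exactly one bucket per company name.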
