Manipulate field value of copy-field in Apache Solr

I have a simple string field, "PART_NUMBER", in Solr. I would like to add a second field that holds a URL built from that value. To do this, I created a new field type, a field, and a copy field:
"add-field-type": {
"name": "endpoint_url",
"class": "solr.TextField",
"positionIncrementGap": "100",
"analyzer": {
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
},
"filters": [
{
"class": "solr.PatternReplaceFilterFactory",
"pattern": "([\\s\\S]*)",
"replacement": "http://myurl/$1.jpg"
}
]
}
},
"add-field": {
"name": "URL",
"type": "endpoint_url",
"stored": true,
"indexed": true
},
"add-copy-field":{ "source":"PART_NUMBER", "dest":"URL" }
As some of you probably guessed, my query output looks like this:
{
"id": "1",
"PART_NUMBER": "ABCD1234",
"URL": "ABCD1234",
"_version_": 1645658574812086272
}
This is because the endpoint_url field type only modifies the indexed terms; the stored value that is returned in results is left as it was. Indeed, when I run the analysis, I get:
http://myurl/ABCD1234.jpg
My question: is there any way to apply a tokenizer or filter and feed the result back into the field value? I would prefer this output when returning the result:
{
"id": "1",
"PART_NUMBER": "ABCD1234",
"URL": "http://myurl/ABCD1234.jpg",
"_version_": 1645658574812086272
}
Is this possible to do in Solr?

The solution was posted in this question:
Custom Solr analyzers not being used during indexing
I need to use an update request processor to change the field value before analysis, because analysis only affects what is indexed and not what is stored. The process is described here:
https://lucene.apache.org/solr/guide/8_1/update-request-processors.html
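The effect such a processor needs to have can be sketched outside Solr. This is a minimal Python model and not Solr's API: the helper name add_url_field and its template argument are hypothetical. It only illustrates that the URL has to be written into the document itself before indexing, so that the stored value is already the full URL. The same transformation could also be applied client-side before the document is sent.

```python
def add_url_field(doc, source="PART_NUMBER", dest="URL",
                  template="http://myurl/{}.jpg"):
    """Return a copy of doc whose dest field holds the URL built from source.

    Hypothetical helper: it mimics what an update processor would do to the
    document before analysis, so the stored value is already the full URL.
    """
    out = dict(doc)  # leave the caller's document untouched
    if source in out:
        out[dest] = template.format(out[source])
    return out


doc = {"id": "1", "PART_NUMBER": "ABCD1234"}
processed = add_url_field(doc)
print(processed["URL"])
```

Documents without a PART_NUMBER pass through unchanged, which matches how a copy field simply has nothing to copy.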

Is it possible to apply a solr document int field value as boost value if a specific field is matched?

Ex.
"docs": [
{
"id": "f37914",
"index_id": "some_index",
"field_1": [
{
"Some value",
"boost": 20.
}
]
},
]
If 'field_1' is matched, then boost by corresponding 'boost' field.
Boost what? The document? The specific field? You can do either.
Either way, the way to do it is to use function queries:
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
For example, if you want to boost the document (and assuming the score should be 0 when the value doesn't match), you can do something like this:
q=_val_:"if(query($q1), field(boost), 0)"&q1=field_1:"Some Value"
_val_ is just a hook into Solr's function queries. query is true if q1 matches, field is a simple function that returns the value of the named field, and if joins the two together.
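What that expression computes per document can be modelled in a few lines of Python. This is only an illustration of the logic, not how Solr evaluates it: it assumes boost is a single numeric value stored at document level, and the second document is made up for contrast.

```python
def conditional_boost_score(doc, field, wanted, boost_field="boost"):
    """Model of if(query($q1), field(boost), 0) for one document."""
    matched = wanted in doc.get(field, [])  # stands in for query($q1)
    return float(doc.get(boost_field, 0)) if matched else 0.0


docs = [
    {"id": "f37914", "field_1": ["Some value"], "boost": 20},
    {"id": "a00001", "field_1": ["Other value"], "boost": 50},  # made-up doc
]
scores = {d["id"]: conditional_boost_score(d, "field_1", "Some value")
          for d in docs}
print(scores)
```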
What I ended up doing was using Lucene payloads and the DelimitedPayloadTokenFilter feature that is new in Solr 6.6.
First I created a terms field with the following configuration:
{
"add-field-type": {
"name": "terms",
"stored": "true",
"class": "solr.TextField",
"positionIncrementGap": "100",
"indexAnalyzer": {
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
},
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.DelimitedPayloadTokenFilterFactory",
"encoder": "float",
"delimiter": "|"
}
]
},
"queryAnalyzer": {
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
},
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.SynonymGraphFilterFactory",
"ignoreCase": "true",
"expand": "false",
"tokenizerFactory": "solr.KeywordTokenizerFactory",
"synonyms": "synonyms.txt"
}
]
}
},
"add-field" : {
"name":"terms",
"type":"terms",
"stored": "true",
"multiValued": "true"
}
}
I indexed my documents like so:
[
{
"id" : "1",
"terms" : [
"some term|10.0",
"another term|60.0"
]
}
,
{
"id" : "2",
"terms" : [
"some term|11.0",
"another term|21.0"
]
}
]
I then used Solr's payload_score query parser to match on a term, read the attached boost payload, and apply it to the relevancy score (f names the payload field, which would be terms for the schema above):
/solr/payloads/select?indent=on&wt=json&q={!payload_score%20f=ai_terms_wtih_synm_3%20v=$payload_term%20func=max}&fl=id,score&payload_term=some+term
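The idea behind the payload approach can be sketched as a small Python model. It is a simplification under stated assumptions, not Solr's implementation: every value is assumed to contain exactly one delimiter, matching is a plain lowercase comparison (no synonyms), and the function names are hypothetical.

```python
def parse_payload_token(raw, delimiter="|"):
    """Split 'some term|10.0' into a lowercased term and a float payload."""
    term, _, payload = raw.partition(delimiter)
    return term.lower(), float(payload)


def payload_score(doc, field, query_term, func=max):
    """Apply func to the payloads of values matching query_term, else 0.0."""
    payloads = [payload
                for term, payload in map(parse_payload_token, doc.get(field, []))
                if term == query_term.lower()]
    return func(payloads) if payloads else 0.0


docs = [
    {"id": "1", "terms": ["some term|10.0", "another term|60.0"]},
    {"id": "2", "terms": ["some term|11.0", "another term|21.0"]},
]
scores = {d["id"]: payload_score(d, "terms", "some term") for d in docs}
print(scores)
```

With func=max, a document that carried the same term several times would be scored by its largest payload.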

"There is no index available for this selector" despite the fact I made one

In my data, I have two fields that I want to use as an index together. They are sensorid (any string) and timestamp (yyyy-mm-dd hh:mm:ss).
So I made an index for these two using the Cloudant index generator. This was created successfully and it appears as a design document.
{
"index": {
"fields": [
{
"name": "sensorid",
"type": "string"
},
{
"name": "timestamp",
"type": "string"
}
]
},
"type": "text"
}
However, when I try to make the following query to find all documents with a timestamp newer than some value, I am told there is no index available for the selector:
{
"selector": {
"timestamp": {
"$gt": "2015-10-13 16:00:00"
}
},
"fields": [
"_id",
"_rev"
],
"sort": [
{
"_id": "asc"
}
]
}
What have I done wrong?
It seems to me that Cloudant Query only allows sorting on fields that are part of the selector.
Therefore your selector should include the _id field and look like this:
"selector":{
"_id":{
"$gt":0
},
"timestamp":{
"$gt":"2015-10-13 16:00:00"
}
}
I hope this works for you!
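Assuming the observation above holds, the adjustment can be applied mechanically to a query body before it is sent. The sketch below is plain Python with a hypothetical helper name; it copies the query and adds a {"$gt": 0} clause, as in the answer, for every sort field that the selector does not already mention.

```python
import copy


def with_sort_fields_in_selector(query):
    """Return a copy of query whose selector mentions every sort field."""
    result = copy.deepcopy(query)
    for entry in result.get("sort", []):
        # A sort entry is either a bare field name or {field: direction}.
        field = entry if isinstance(entry, str) else next(iter(entry))
        result["selector"].setdefault(field, {"$gt": 0})
    return result


query = {
    "selector": {"timestamp": {"$gt": "2015-10-13 16:00:00"}},
    "fields": ["_id", "_rev"],
    "sort": [{"_id": "asc"}],
}
fixed = with_sort_fields_in_selector(query)
print(fixed["selector"])
```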

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever. This means the mappings, norms, and analyzed/not_analyzed settings are all the default config. I have a very small data set of two items for experimentation purposes. The items have several fields, but I query on only one, which is a multi-valued field (an array of strings). The doc looks like this:
{
"_index": "index_profile",
"_type": "items",
"_id": "ega",
"_version": 1,
"found": true,
"_source": {
"clicked": [
"ega"
],
"profile_topics": [
"Twitter",
"Entertainment",
"ESPN",
"Comedy",
"University of Rhode Island",
"Humor",
"Basketball",
"Sports",
"Movies",
"SnapChat",
"Celebrities",
"Rite Aid",
"Education",
"Television",
"Country Music",
"Seattle",
"Beer",
"Hip Hop",
"Actors",
"David Cameron",
... // other topics
],
"id": "ega"
}
}
A sample query is:
GET /index_profile/items/_search
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"profile_topics": [
"Basketball"
]
}
}]
}
}
}
Again there are only two items and the one listed should match the query because the profile_topics field matches with the "Basketball" term. The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays and are to have no norms and no analyzer so profile_topics are not stemmed or tokenized since all values should be treated as tokens (even the spaces). Not sure this would solve the problem but it is how I treat similar data on Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue. If so, how do I solve this so that even with two items the query will return ega? If possible I'd like to solve this index-wide or type-wide rather than per field.
The value Basketball (with a capital B) in a terms query is not analyzed, so it is looked up in the Elasticsearch index exactly as written.
You say you have the defaults. If so, indexing Basketball under the profile_topics field means that the actual term in the index is basketball (with a lowercase b), which is what the standard analyzer produces. So either set profile_topics to not_analyzed, or search for basketball rather than Basketball.
Read this about terms.
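The mismatch can be reproduced with a toy inverted index. This is a rough Python model, not Elasticsearch: the analyzer below only splits on non-word characters and lowercases, which is enough to show why the verbatim lookup misses.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the default analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(docs, field):
    """Map each analyzed token to the ids of the documents containing it."""
    index = {}
    for doc_id, source in docs.items():
        for value in source.get(field, []):
            for token in standard_like_analyze(value):
                index.setdefault(token, set()).add(doc_id)
    return index


def terms_query(index, terms):
    """Look each term up verbatim, with no analysis, as a terms query does."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


docs = {"ega": {"profile_topics": ["Basketball", "Hip Hop"]}}
index = build_inverted_index(docs, "profile_topics")
print(terms_query(index, ["Basketball"]), terms_query(index, ["basketball"]))
```

Note that "Hip Hop" ends up as the two tokens hip and hop, so a multi-word topic cannot be matched verbatim on an analyzed field either.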
As for setting all the fields to not_analyzed, you could do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field and make only that subfield not_analyzed. The parent field still holds the analyzed version of the same text, in case you want to use it in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
"template": "your_indices_name-*",
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"omit_norms": true
},
"dynamic_templates": [
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}
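Once a template like this is in place and the data has been reindexed, the exact-match query would target the .raw subfield instead of the parent field. A minimal sketch of building that request body in Python (the helper name is hypothetical, and the body mirrors the query from the question):

```python
def raw_terms_query(field, values, size=10):
    """Build a terms query against the not_analyzed .raw subfield of field."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [{"terms": {field + ".raw": list(values)}}]
            }
        },
    }


query = raw_terms_query("profile_topics", ["Basketball", "Hip Hop"])
print(query)
```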

Elasticsearch score results based partly on Popularity

I'm using Elasticsearch for this project but a Solr solution might be appropriate too. In the query I'd like to include a portion of a should clause that will return results even if none of the other terms can. This will be used for document popularity. I'll periodically calculate reading popularity and add a float field to each doc with a numeric value.
The idea is to return docs based on terms but when that fails, return popular docs ranked by popularity. These should be ordered by term match scores or magnitude of popularity score.
I realize that I could quantize the popularity and treat it like a tag "hottest", "hotter", "hot"... but would like to use numeric field since the ranking is well defined.
Here is the current form of my data (from fetch by id):
GET /index/docs/ipad
returns a sample object
{
"_index": "index",
"_type": "docs",
"_id": "doc1",
"_version": 1,
"found": true,
"_source": {
"category": ["tablets", "electronics"],
"text": ["buy", "an", "ipad"],
"popularity": 0.95347457,
"id": "doc1"
}
}
Current query format
POST /index/docs/_search
{
"size": 10,
"query": {
"bool": {
"should": [
{"terms": {"text": ["ipad"]}}
],
"must": [
{"terms": {"category": ["electronics"]}}
]
}
}
}
This may seem an odd query format but these are structured objects, not free form text.
Can I add popularity to this query so that it returns items ranked by popularity magnitude along with those returned by the should terms? I'd boost the actual terms above the popularity so they'd be favored.
Note I do not want to boost by popularity, I want to return popular if the rest of the query returns nothing.
One approach I can think of is to wrap a match_all filter in a constant_score query with a boost of 0, so that every document in the category matches without gaining score, and then sort on score followed by popularity.
Example:
{
"size": 10,
"query": {
"bool": {
"should": [
{
"terms": {
"text": [
"ipad"
]
}
},
{
"constant_score": {
"filter": {
"match_all": {}
},
"boost": 0
}
}
],
"must": [
{
"terms": {
"category": [
"electronics"
]
}
}
],
"minimum_should_match": 1
}
},
"sort": [
{
"_score": {
"order": "desc"
}
},
{
"popularity": {
"order": "desc",
"unmapped_type": "double"
}
}
]
}
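The ordering this query is aiming for can be modelled in Python: score first, popularity as the tie-breaker, both descending. The hits and their scores below are made up purely for illustration.

```python
def rank(hits):
    """Order hits by _score descending, then by popularity descending."""
    return sorted(hits, key=lambda h: (-h["_score"], -h.get("popularity", 0.0)))


hits = [
    {"_id": "doc2", "_score": 0.0, "popularity": 0.40},
    {"_id": "doc1", "_score": 1.0, "popularity": 0.10},  # matched "ipad"
    {"_id": "doc3", "_score": 0.0, "popularity": 0.80},
]
order = [h["_id"] for h in rank(hits)]
print(order)
```

Documents that matched a term come first regardless of popularity; the rest fall back to popularity order.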
You want to look into the function score query and a decay function for this.
Here's a gentle intro: https://www.found.no/foundation/function-scoring/

CouchDB View With OR Condition

I have two kinds of documents in CouchDB with the following JSON structure:
1.
{
"_id": "4a91f3e8-616a-431d-8199-ace00055763d",
"_rev": "2-9105188217acd506251c98cd4566e788",
"Vehicle": {
"type": "STRING",
"name": "Vehicle",
"value": "12345"
},
"Start": {
"type": "DATE",
"name": "Start",
"value": "2014-09-10T11:19:00.000Z"
}
}
2.
{
"_id": "4a91f3e8-616a-431d-8199-ace00055763d",
"_rev": "2-9105188217acd506251c98cd4566e788",
"Equipment": {
"type": "STRING",
"name": "Equipment",
"value": "12345"
},
"Start": {
"type": "DATE",
"name": "Start",
"value": "2014-09-10T11:19:00.000Z"
}
}
I want to make one view that finds all documents where doc.Vehicle.value=12345 OR doc.Equipment.value=12345.
How can I write a view that returns both kinds of documents?
Thanks in advance.
Just emit both values from your view (yes, a map function may emit several key-value pairs for the same doc):
function(doc){
if (doc.Equipment) {
emit(doc.Equipment.value, null)
}
if (doc.Vehicle) {
emit(doc.Vehicle.value, null)
}
}
And request them by the same key:
http://localhost:5984/db/_design/ddoc/_view/by_equip_value?key="12345"
See also the Guide to Views for more info about CouchDB views.
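To see why a single key lookup covers both cases, here is a small Python simulation of that map function and of a key query. It is only a model of the view's behaviour, not CouchDB code, and the sample documents and ids are made up.

```python
def map_doc(doc):
    """Python rendering of the map function: one (key, value) per emit."""
    if "Equipment" in doc:
        yield doc["Equipment"]["value"], None
    if "Vehicle" in doc:
        yield doc["Vehicle"]["value"], None


def build_view(docs):
    """Run the map over all docs and return rows sorted by key, then id."""
    rows = [{"id": doc["_id"], "key": key, "value": value}
            for doc in docs
            for key, value in map_doc(doc)]
    return sorted(rows, key=lambda row: (row["key"], row["id"]))


def query_by_key(rows, key):
    """Equivalent of requesting the view with ?key=..."""
    return [row for row in rows if row["key"] == key]


docs = [
    {"_id": "vehicle-doc", "Vehicle": {"value": "12345"}},
    {"_id": "equipment-doc", "Equipment": {"value": "12345"}},
    {"_id": "other-doc", "Vehicle": {"value": "99999"}},
]
rows = build_view(docs)
matches = [row["id"] for row in query_by_key(rows, "12345")]
print(matches)
```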
With Kxepal's version you cannot tell the type of a result from the view alone ("12345" can be either a Vehicle or an Equipment). You only see it if you use include_docs=true and look inside the doc, or make a second query with the ids of the results.
If you want to see the type (or query by type), you need to extend the view:
..
if(doc.Equipment) {
emit (doc.Equipment.value,doc.Equipment.name);
}
if(doc.Vehicle) {
emit(doc.Vehicle.value,doc.Vehicle.name);
}
Here, the name is the value of each result row.
You can also restrict the results in the query itself if you put the name first in the key:
if(doc.Equipment) {
emit([doc.Equipment.name,doc.Equipment.value],null);
}
if(doc.Vehicle) {
emit ([doc.Vehicle.name,doc.Vehicle.value],null);
}
Your query for Vehicles:
/viewname?startkey=["Vehicle"]&endkey=["Vehicle",{}]
Equipment:
/viewname?startkey=["Equipment"]&endkey=["Equipment",{}]
Here, the name is the first item of each result row's key array.
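Because startkey and endkey must be valid JSON and need URL encoding, it can be less error-prone to build the query string in code. A minimal Python sketch follows; the view URL is a placeholder and the helper name is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def view_range_url(view_url, name):
    """Build a view URL selecting every key whose first element is name."""
    params = {
        "startkey": json.dumps([name]),
        # {} sorts after other values in CouchDB, so it closes the range.
        "endkey": json.dumps([name, {}]),
    }
    return view_url + "?" + urlencode(params)


# Placeholder view location, not taken from the question.
url = view_range_url(
    "http://localhost:5984/db/_design/ddoc/_view/by_name_and_value", "Vehicle")

# Decode the query string again to check that both keys are valid JSON.
decoded = {key: json.loads(values[0])
           for key, values in parse_qs(urlsplit(url).query).items()}
print(url)
```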
Maybe this will help : http://de.slideshare.net/okurow/couchdb-mapreduce-13321353
BTW, a better solution would be to store the type and the value as top-level fields:
{
"_id": "4a91f3e8-616a-431d-8199-ace00055763d",
"_rev": "2-9105188217acd506251c98cd4566e788",
"type": "Vehicle",
"value":"12345",
"Start": {
"type": "DATE",
"name": "Start", // ? maybe also obsolete, because already inside "Start" Element
"value": "2014-09-10T11:19:00.000Z"
}
}
{
"_id": "4a91f3e8-616a-431d-8199-ace00055763d",
"_rev": "2-9105188217acd506251c98cd4566e788",
"type": "Equipment",
"value":"12345",
"Start": {
"type": "DATE",
"name": "Start", // ? maybe also obsolete, because already inside "Start" Element
"value": "2014-09-10T11:19:00.000Z"
}
}
In this case a single emit is enough:
emit([doc.type,doc.value],null)
