I'm trying to add tags to user-provided text in order to automagically classify the article.
It works pretty well except for words with spaces.
For instance, I want to add the "clothes" tag when the user types the following words in that order: "tee shirt" or "tee shirts".
The sentence "my tee shirt is blue" should bring a result since "tee shirt" appears in that exact order, but neither "tee my shirt" nor "my shirt" should return a result.
I have a dedicated "tags" core to do that.
I create an empty core with
/opt/solr/bin/solr create -c "tags"
and update the core schema using curl
curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field-type" : { "name":"myShingleTextField", "class":"solr.TextField", "positionIncrementGap":"100", "analyzer" : { "tokenizer":{ "class":"solr.StandardTokenizerFactory" }, "filters":[ { "class":"solr.LowerCaseFilterFactory" }, { "class":"solr.ShingleFilterFactory", "maxShingleSize":"3", "outputUnigrams":"true" } ]}} }' http://localhost:8983/solr/tags/schema
curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field": { "name":"keywords", "type":"myShingleTextField", "multiValued":true, "indexed":true, "stored":true, "required":true, "docValues":false } }' http://localhost:8983/solr/tags/schema
curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field": { "name":"results", "type":"string", "multiValued":true, "indexed":true, "stored":true, "required":true, "docValues":true } }' http://localhost:8983/solr/tags/schema
I then /update it with the following (simplified) document:
{ "add": { "doc": { "keywords": ["tee shirt", "tee shirts"], "results": ["clothes"] } }, "commit": { } }
I finally do my query:
/select?defType=dismax&q=tee%20my%20shirt&qf=keywords
It returns a document while I don't want one ("my" sits between "tee" and "shirt").
Maybe it's a tokenizer issue, or maybe a dismax query is not what I need.
I tried escaping quotes and spaces, setting the mm parameter to 2 (which kind of works but prevents single words from matching), and other tweaks that didn't help.
Any idea?
Thanks to @MatsLindth's hints, I came up with a solution.
First, I needed to replace the spaces in multi-word keywords with '_' and to use a KeywordTokenizerFactory tokenizer to store them verbatim.
On the query side, I had to set the "tokenSeparator" parameter to '_'.
So my custom field type definition command is now :
curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field-type" : { "name":"myShingleTextField", "class":"solr.TextField", "positionIncrementGap":"100", "indexAnalyzer": { "tokenizer":{ "class":"solr.KeywordTokenizerFactory" } }, "queryAnalyzer" : { "tokenizer":{ "class":"solr.StandardTokenizerFactory" }, "filters":[ { "class":"solr.LowerCaseFilterFactory" }, { "class":"solr.ShingleFilterFactory", "maxShingleSize":"3", "outputUnigrams":"true", "tokenSeparator":"_" } ]}} }' http://localhost:8983/solr/tags/schema
My update command is now:
{ "add": { "doc": { "keywords": ["tee_shirt", "tee_shirts"], "results": ["clothes"] } }, "commit": { } }
So when I do:
/select?defType=dismax&q=tee%20my%20shirt&qf=keywords
the query analysis goes:
"tee my shirt"
→ "tee" "my" "shirt" (StandardTokenizerFactory + LowerCaseFilterFactory)
→ "tee" "tee_my" "tee_my_shirt" "my" "my_shirt" "shirt" (ShingleFilterFactory)
Nothing matches, as expected, but querying "my tee shirt" produces the shingle "tee_shirt", which obviously matches the indexed "tee_shirt", yay!
Thanks again to @MatsLindth and the Solr analysis page!
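The query-side analysis can be sketched in plain Python (the `shingles` helper is my own approximation, not Solr code):

```python
def shingles(text, max_size=3, sep="_"):
    """Rough approximation of StandardTokenizer + LowerCaseFilter +
    ShingleFilter (outputUnigrams=true, tokenSeparator='_')."""
    tokens = text.lower().split()
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out

# The index side (KeywordTokenizer) stores the keywords verbatim:
indexed = {"tee_shirt", "tee_shirts"}

# "tee my shirt" never produces the shingle "tee_shirt", so no match:
print(indexed & set(shingles("tee my shirt")))  # set()

# "my tee shirt" does produce it, so the document matches:
print(indexed & set(shingles("my tee shirt")))  # {'tee_shirt'}
```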
I want to receive this kind of JSON message data from Pub/Sub:
{
  "message": {
    "data": {
      "from": "no-reply@example.com",
      "to": "user@example.com",
      "subject": "test",
      "body": "test"
    }
  }
}
And parse its data to use for another service.
private parseMessage(message: Message) {
  try {
    const decoded = Buffer.from(message.data.toString(), 'base64').toString().trim();
    return JSON.parse(decoded);
  } catch (err) {
    throw new BadRequestException('Parse Error: ' + message);
  }
}
But when I run the API I get this error:
SyntaxError: Unexpected token � in JSON at position 0
at JSON.parse (<anonymous>)
at EventController.parseMessage (../myapp/src/api/posts/posts.controller.ts:44:18)
response: {
statusCode: 400,
message: 'Parse Error: [object Object]',
error: 'Bad Request'
},
status: 400
It seems this POST isn't right:
curl -X 'POST' \
'http://localhost:3000/posts' \
-H 'Content-Type: application/json' \
-d '{
  "message": {
    "data": {
      "from": "no-reply@example.com",
      "to": "user@example.com",
      "subject": "test",
      "body": "test"
    }
  }
}'
Then how do I make fake Pub/Sub message data?
I think you need to encode your data into Base64.
Base64 encoding exists so that binary data can be stored and transferred over media designed to handle ASCII; this ensures the data remains intact, without modification, during transport.
You can also refer to this GCP public documentation.
E.g., from the doc:
# 'world' base64-encoded is 'd29ybGQ='
curl localhost:8080 \
-X POST \
-H "Content-Type: application/json" \
-d '{
"context": {
"eventId":"1144231683168617",
"timestamp":"2020-05-06T07:33:34.556Z",
"eventType":"google.pubsub.topic.publish",
"resource":{
"service":"pubsub.googleapis.com",
"name":"projects/sample-project/topics/gcf-test",
"type":"type.googleapis.com/google.pubsub.v1.PubsubMessage"
}
},
"data": {
"@type": "type.googleapis.com/google.pubsub.v1.PubsubMessage",
"attributes": {
"attr1":"attr1-value"
},
"data": "d29ybGQ="
}
}'
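A sketch of the encoding step in Python, using the plain payload from the question (the minimal `{"message": {"data": ...}}` shape here is mine, a simplification of the full example above):

```python
import base64
import json

# The plain payload we want delivered (from the question).
payload = {
    "from": "no-reply@example.com",
    "to": "user@example.com",
    "subject": "test",
    "body": "test",
}

# Pub/Sub carries the payload base64-encoded inside message.data.
encoded = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
fake_message = {"message": {"data": encoded}}
print(json.dumps(fake_message))

# A receiver reverses the steps: base64-decode, then JSON-parse.
decoded = json.loads(base64.b64decode(fake_message["message"]["data"]))
assert decoded == payload
```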
Imitating: https://blog.vespa.ai/billion-scale-knn/
Command line:
curl -s -d '{"yql":"select * from user where {\"targetHits\":10}nearestNeighbor(approximate, q_binary_code);","ranking.features.query(q_binary_code)":[1,2,3,4,5,6,7,8,9,10],"hits":10}' -H "Content-Type: application/json" -X POST http://localhost:8080/search/ | jq .
Error message:
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 0
},
"errors": [
{
"code": 4,
"summary": "Invalid query parameter",
"source": "content",
"message": "Expected a tensor value of 'query(q_binary_code)' but has [1,2,3,4,5,6,7,8,9,10]"
}
]
}
}
Question: how do I pass q_binary_code?
With recent Vespa versions, you can define the query tensor in the schema; it must be defined as an input in the rank profile:
schema code {
    document code {
        field id type int {..}
        field binary_code type tensor<int8>(b[16]) {..}
    }
    rank-profile coarse-ranking {
        inputs {
            query(q_binary_code) tensor<int8>(b[16])
        }
        num-threads-per-search: 12
        first-phase { expression { closeness(field, binary_code) } }
    }
}
You must also name the rank profile in the query request:
curl -s -d '{"yql":"select * from user where {\"targetHits\":10}nearestNeighbor(binary_code, q_binary_code);","ranking.features.query(q_binary_code)":[1,2,3,4,5,6,7,8,9,10],"hits":10, "ranking": "coarse-ranking"}' -H "Content-Type: application/json" -X POST http://localhost:8080/search/ | jq .
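One detail worth noting: the input is declared as tensor<int8>(b[16]), so the literal value should supply exactly 16 numbers. A plain-Python sketch of building the request body (the endpoint, field, and profile names come from the answer above):

```python
import json

# q_binary_code is declared as tensor<int8>(b[16]) in the rank profile,
# so the short-form literal must contain exactly 16 values.
q_binary_code = list(range(16))
assert len(q_binary_code) == 16

body = {
    "yql": 'select * from user where {"targetHits":10}'
           'nearestNeighbor(binary_code, q_binary_code);',
    "ranking.features.query(q_binary_code)": q_binary_code,
    "ranking": "coarse-ranking",  # names the profile that declares the input
    "hits": 10,
}
print(json.dumps(body, indent=2))
# POST this JSON to http://localhost:8080/search/ with
# Content-Type: application/json (e.g. with the curl command above).
```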
I am using Solr 8.6.1, started in solrcloud mode.
The field type is
{
"add-field-type" : {
"name":"articleTitle",
"positionIncrementGap":100,
"multiValued":false,
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{ "class":"solr.StandardTokenizerFactory" },
"filters":[
{ "class":"solr.LowerCaseFilterFactory" },
{ "class":"solr.ManagedStopFilterFactory", "managed":"english" },
{ "class":"solr.ManagedSynonymGraphFilterFactory", "managed":"english" },
{ "class":"solr.FlattenGraphFilterFactory" },
{ "class":"solr.PorterStemFilterFactory" }
]
},
"queryAnalyzer":{
"tokenizer":{ "class":"solr.StandardTokenizerFactory" },
"filters":[
{ "class":"solr.LowerCaseFilterFactory" },
{ "class":"solr.ManagedStopFilterFactory", "managed":"english" },
{ "class":"solr.ManagedSynonymGraphFilterFactory", "managed":"english" },
{ "class":"solr.PorterStemFilterFactory" }
]
}
}
}
After I add a document
{
"id": 100,
"articleTitle": "Best smartphone"
}
I update the synonyms list via the API
curl -X PUT -H 'Content-type:application/json' --data-binary '["iphone", "smartphone"]' "http://localhost:8983/solr/articles/schema/analysis/synonyms/english"
and reload the collection via the API
http://localhost:8983/solr/admin/collections?action=RELOAD&name=articles
However, when I search, the document doesn't show up.
http://localhost:8983/solr/articles/select?q=articleTitle:iphone
No results are returned, although I expected the added document to be found.
It works only if I first update the synonyms list and add the document to the collection afterwards.
How to configure Solr to find the documents by synonyms if the synonyms are changed after documents are created?
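One mechanism to keep in mind here: index-time analysis is baked into the index when a document is added, and changing an index-time synonym filter never re-analyzes existing documents, so they must be reindexed. A plain-Python sketch of that behavior (my own simplification, not Solr internals):

```python
# Index-time synonym expansion: applied only when a document is analyzed.
def index_doc(text, synonyms):
    tokens = set(text.lower().split())
    for token in list(tokens):
        tokens |= synonyms.get(token, set())
    return tokens

# Document added BEFORE the synonym list contained iphone <-> smartphone:
old_index = index_doc("Best smartphone", {})
print("iphone" in old_index)   # False -> articleTitle:iphone finds nothing

# The same document reindexed AFTER the synonym update:
syns = {"smartphone": {"iphone"}, "iphone": {"smartphone"}}
new_index = index_doc("Best smartphone", syns)
print("iphone" in new_index)   # True -> reindexing makes the query match
```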
Using equiv() on an empty table throws a strange error in vespa.ai 7.99.22:
Could not add an item of type WORD_ALTERNATIVES: Equiv can only have word/int/phrase as children
Definition:
search post {
    document post {
        field description type string {
            indexing: summary | index
            stemming: multiple
        }
    }
    fieldset text {
        fields: description
    }
}
Query (no rows in table post):
curl -s -H "Content-Type: application/json" \
  --data '{"yql" : "select * from post where text contains equiv(\"Q123\",\"Q456\");"}' \
  http://localhost:8080/search/ | jq .
Result:
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 0
},
"errors": [
{
"code": 4,
"summary": "Invalid query parameter",
"source": "content",
"message": "Could not add an item of type WORD_ALTERNATIVES: Equiv can only have word/int/phrase as children"
}
]
}
}
What is the issue?
Using stemming:multiple leads to a WordAlternativesItem which is not a permitted child of EquivItem, so this combination is not supported.
However, we believe this is unnecessarily restrictive. I'll lift this restriction now; please try again in the next version, which should be out on Monday (2019-09-16) if the winds are favourable.
I am trying to join 2 Elasticsearch indices by using a terms filter lookup. I referred to http://www.elasticsearch.org/blog/terms-filter-lookup/ and http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html. These examples look up values in an array field like "followers" : ["1", "3"], and the join works fine for data of that shape.
My requirement is to join on a field inside an array of objects. When I extend the above example to an array of objects, my query fails.
Following is the sample data:
PUT /users/user/2
{
  "followers" : [
    {
      "userId":"1",
      "username":"abc",
      "location":"xyz"
    },
    {
      "userId":"3",
      "username":"def",
      "location":"xyz"
    }
  ]
}
PUT /tweets/tweet/1 {
"user" : "2"
}
PUT /tweets/tweet/2 {
"user" : "1"
}
I am now trying to find tweets that are created by followers of user 2
POST /tweets/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "user" : {
            "index" : "users",
            "type" : "user",
            "id" : "2",
            "path" : "followers.userId"
          },
          "_cache_key" : "user_2_friends"
        }
      }
    }
  }
}
My search results are 0 for the above query. I tried 2 other approaches as well: 1) declaring the followers object as a nested object in the mapping and using a "nested" query, and 2) adding a match query on followers.userId after giving the path as "followers". Neither yielded results.
Does the terms filter lookup support arrays of objects? Any pointers to solving my problem would be of great help.
What you're trying to do worked for me, unless I'm missing something. What version of Elasticsearch are you using? I'm using 1.3.4.
So I created both indices and added the docs you have listed:
curl -XPUT "http://localhost:9200/users"
curl -XPUT "http://localhost:9200/users/user/2" -d '
{
"followers" : [
{
"userId":"1",
"username":"abc",
"location":"xyz"
},
{
"userId":"3",
"username":"def",
"location":"xyz"
}
]
}'
curl -XPUT "http://localhost:9200/tweets"
curl -XPUT "http://localhost:9200/tweets/tweet/1" -d '
{
"user" : "2"
}'
curl -XPUT "http://localhost:9200/tweets/tweet/2" -d '
{
"user" : "1"
}'
then ran your search query:
curl -XPOST "http://localhost:9200/tweets/_search" -d '
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "user": {
            "index": "users",
            "type": "user",
            "id": "2",
            "path": "followers.userId"
          },
          "_cache_key": "user_2_friends"
        }
      }
    }
  }
}'
and got back this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "tweets",
"_type": "tweet",
"_id": "2",
"_score": 1,
"_source": {
"user": "1"
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/4a2a2d77d0b6f4502ff6c5022b268acfa65ee6d2
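Conceptually, the terms lookup collects every value found at the given path in the lookup document, descending into arrays of objects along the way. A plain-Python sketch of that extraction (the `values_at_path` helper is mine, not an Elasticsearch API):

```python
def values_at_path(doc, path):
    """Collect every value at a dotted path, descending into lists of
    objects the way the terms-lookup "path" parameter does."""
    nodes = [doc]
    for key in path.split("."):
        found = []
        for node in nodes:
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    found.append(item[key])
        nodes = found
    # flatten any lists left at the leaf level
    flat = []
    for node in nodes:
        flat.extend(node if isinstance(node, list) else [node])
    return flat

user_doc = {
    "followers": [
        {"userId": "1", "username": "abc", "location": "xyz"},
        {"userId": "3", "username": "def", "location": "xyz"},
    ]
}

terms = values_at_path(user_doc, "followers.userId")
print(terms)  # ['1', '3'] -> any tweet whose "user" is in this list matches
```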
Clear the indices if you have any
curl -XDELETE "http://example.com:9200/currencylookup/"
curl -XDELETE "http://example.com:9200/currency/"
Create the lookup table
curl -XPUT http://example.com:9200/currencylookup/type/2 -d '
{ "conv" : [
{ "currency":"usd","username":"abc", "location":"USA" },
{ "currency":"inr", "username":"def", "location":"India" },
{ "currency":"IDR", "username":"def", "location":"Indonesia" }]
}'
Let's put in some dummy docs
curl -XPUT "http://example.com:9200/currency/type/USA" -d '{ "amount":"100", "currency":"usd", "location":"USA" }'
curl -XPUT "http://example.com:9200/currency/type/JPY" -d '{ "amount":"50", "currency":"JPY", "location":"JAPAN" }'
curl -XPUT "http://example.com:9200/currency/type/INR" -d '{ "amount":"50", "currency":"inr", "location":"INDIA" }'
curl -XPUT "http://example.com:9200/currency/type/IDR" -d '{ "amount":"30", "currency" : "IDR", "location": "Indonesia" }'
Time to check the output
curl http://example.com:9200/currency/_search?pretty -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "currency" : {
            "index" : "currencylookup",
            "type" : "type",
            "id" : "2",
            "path" : "conv.currency"
          },
          "_cache_key" : "currencyexchange"
        }
      }
    }
  }
}'
Results
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "currency",
"_type" : "type",
"_id" : "INR",
"_score" : 1.0,
"_source":{ "amount":"50", "currency":"inr", "location":"INDIA" }
}, {
"_index" : "currency",
"_type" : "type",
"_id" : "USA",
"_score" : 1.0,
"_source":{ "amount":"100", "currency":"usd", "location":"USA" }
} ]
}
}
Conclusion
Capital letters are the culprit here.
You can see that 'IDR' is in caps, so the match failed for it; 'JPY' is not in the lookup at all, and even if it were, it would not have matched because it is in caps.
Cross-matched values must be lowercase letters or numbers, e.g.:
abc
1abc