I'm trying to figure out why the following queries produce vastly different results. I'm told a fuzzy query is almost never a good idea (per this document: Found-fuzzy), so I'm trying to use a match query with a fuzziness parameter instead. The two produce extremely different results, and I'm not sure of the best way to do this.
My example is a movie title containing 'batman'. The user, however, types 'bat man' (with a space). It would make sense for a fuzzy query to find 'batman'. It would also find other variations like 'spider man', but for now I guess that's acceptable. (Not really, but...)
So far the fuzzy search is actually returning more relevant results than the match query below. Any ideas?
--fuzzy:
{
  "query": {
    "bool": {
      "should": [
        {
          "fuzzy": {
            "title": {
              "value": "bat man",
              "boost": 4
            }
          }
        }
      ],
      "minimum_number_should_match": 1
    }
  }
}
--match:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "bat man",
              "boost": 4
            }
          }
        }
      ],
      "minimum_number_should_match": 1
    }
  }
}
EDIT
I'm adding examples of what gets returned.
First, nothing is returned by the match query, even with a high fuzziness value added (fuzziness: 5).
But I do get several 'batman'-related titles using the fuzzy query, such as 'Batman' and 'Batman Returns'.
It gets even stranger when I run multiple fuzzy searches on 'bat man'... if I search my 'starring' field in addition to the title field ('starring' contains lists of actors), I get 'Jason Bateman' as well as the title 'Batman'.
{
  "_index": "store24",
  "_type": "searchdata",
  "_id": "081227987909",
  "_score": 4.600759,
  "fields": {
    "title": [
      "Batman"
    ]
  }
},
{
  "_index": "store24",
  "_type": "searchdata",
  "_id": "883929053353",
  "_score": 4.1418676,
  "fields": {
    "title": [
      "Batman Forever"
    ]
  }
},
{
  "_index": "store24",
  "_type": "searchdata",
  "_id": "883929331789",
  "_score": 3.5298011,
  "fields": {
    "title": [
      "Batman Returns"
    ]
  }
}
BEST SO FAR (STILL NOT GREAT)
What I've found works best so far is to combine both queries. This seems redundant, but I can't yet make one behave like the other. So this seems to be better:
"should": [
  {
    "fuzzy": {
      "title": {
        "boost": 6.0,
        "min_similarity": 1.0,
        "value": "batman"
      }
    }
  },
  {
    "match": {
      "title": {
        "query": "batman",
        "boost": 6.0,
        "fuzziness": 1
      }
    }
  }
]
Elasticsearch analyzes documents and converts them into terms, which are what is actually searched (not the documents themselves). The key difference between the two query types is that the fuzzy query is a term-level query: it does not analyze the query text, while the match query does. Consider the example:
A match query for 'bat man' is first analyzed into the tokens 'bat' and 'man', and each token is matched against the index (fuzzily, if a fuzziness is set). Neither 'bat' nor 'man' is within a small edit distance of the indexed term 'batman', so nothing matches. The fuzzy query, by contrast, treats 'bat man' as one unanalyzed term and looks for indexed terms within the allowed edit distance: 'batman' is one edit away (delete the space), and so is 'bateman' (substitute the space with an 'e'), which is exactly why Jason Bateman showed up when the 'starring' field was searched.
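Fuzzy matching in Lucene is based on Levenshtein edit distance, so this behavior can be sanity-checked outside Elasticsearch with a small script (a sketch, not ES code):

```python
# Plain Levenshtein edit distance (insert / delete / substitute),
# the measure Lucene's fuzzy matching is built on.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# The whole string 'bat man' is 1 edit from the indexed term 'batman'
# (drop the space) and 1 edit from 'bateman' (space -> 'e') ...
print(levenshtein("bat man", "batman"))   # 1
print(levenshtein("bat man", "bateman"))  # 1
# ... but the individual tokens a match query searches with are
# 3 edits away, outside any permitted fuzziness.
print(levenshtein("bat", "batman"))       # 3
print(levenshtein("man", "batman"))       # 3
```

This is why the unanalyzed fuzzy query finds both Batman and Jason Bateman, while the analyzed match query finds neither.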
More detailed information on how text fields are analyzed when searching can be found at http://exploringelasticsearch.com/searching_data.html#sec-searching-analysis:
When a search is performed on an analyzed field, the query itself is
analyzed, matching it up to the documents which are analyzed when
added to the database. Reducing words to these short tokens normalizes
the text allowing for fast efficient lookups. Whether you’re searching
for "rollerblading" in any form, internally we’re just looking for
"rollerblad".
Related
I'm having trouble creating a Solr query that pulls out the right documents, and I'm starting to wonder whether what I'm trying to do is even possible.
I'm currently on Solr 8.9, using a managed schema, and every field uses a wildcard dynamic field.
Firstly, here is what the document looks like
(names changed to redact internal business language):
{
  "id": "COUNTY:1",
  "county_name_s": "Hertfordshire",
  "coordinates_s": {
    "id": "COUNTY:1COORDINATES:!",
    "lat_s": "54.238948",
    "long_s": "54.238948"
  },
  "cities": [
    {
      "id": "COUNTY:1CITY:1",
      "city_name_s": "St Albans",
      "size": {
        "id": "COUNTY:1CITY:1SIZE:1",
        "sq_ft_s": "100",
        "sq_meters_s": "5879"
      }
    },
    {
      "id": "COUNTY:1CITY:2",
      "city_name_s": "Watford",
      "size": {
        "id": "COUNTY:1CITY:2SIZE:2",
        "sq_ft_s": "150",
        "sq_meters_s": "10000"
      }
    }
  ],
  "mayor": {
    "title_s": "Mrs.",
    "first_name_s": "Sheila",
    "last_name_s": "Smith"
  }
}
And what I want to return:
{
  "id": "COUNTY:1",
  "county_name_s": "Hertfordshire",
  "coordinates": {
    "id": "COUNTY:1COORDINATES:!",
    "lat_s": "54.238948",
    "long_s": "54.238948"
  },
  "cities": [
    {
      "id": "COUNTY:1CITY:1",
      "city_name_s": "St Albans",
      "size": {
        "id": "COUNTY:1CITY:1SIZE:1",
        "sq_ft_s": "100",
        "sq_meters_s": "5879"
      }
    }
  ],
  "mayor": {
    "title_s": "Mrs.",
    "first_name_s": "Sheila",
    "last_name_s": "Smith"
  }
}
Basically my goal is to return more or less the entire document, but with one of the cities filtered out. For example, the condition for the city would be city_name_s:"St Albans". That is, I want the parent and all of its children, but if a child is in that array (i.e. the cities array), then the given field (city_name_s) must equal my defined value, or we don't want that child.
Things I've tried:
I've basically tried two approaches:
I've played around with {!child} and {!parent} to get the result I want. So far I can only get results at the city level, or the entire thing as if the filter were not there at the county level.
I've tried changing the value of the childFilter option, with things like:
city_name_s:"St Albans" OR (*:* NOT city_name_s:[* TO *]) to try to say 'if the field exists, it should have this value'.
Anyhow, I'm starting to run out of ideas; I've been hacking away at this for the past couple of days without really getting any closer.
Thanks in advance for any help; I'm bashing my head against the wall, so any suggestions are more than welcome :)
I had a similar issue in Solr 9.0.0, and this solved it for me: Apache Solr Filter on Child Documents
In your case, just add fl=*,[child childFilter=city_name_s:"St Albans"]
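Spelled out as a full request, that might look like the sketch below (the collection name counties and the host are assumptions; --data-urlencode takes care of the brackets and quotes in the transformer):

```shell
# Fetch the county document; fl returns every parent field plus only the
# child documents whose city_name_s matches the childFilter.
curl 'http://localhost:8983/solr/counties/select' \
  --data-urlencode 'q=county_name_s:Hertfordshire' \
  --data-urlencode 'fl=*,[child childFilter=city_name_s:"St Albans"]'
```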
I'm trying to standartize a property in a json-ld document. A simple example:
json-ld
{
  "@context": {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/"
  },
  "@graph": [
    {
      "@id": "1",
      "rdfs:label": "A title"
    },
    {
      "@id": "2",
      "dcterms:title": "Another title"
    }
  ]
}
frame (failing attempt)
{
  "type": "array",
  "items": {
    "title": ["rdfs:label", "dcterms:title"]
  }
}
This produces an empty graph, instead of this:
desired output
[
  {
    "title": "A title"
  },
  {
    "title": "Another title"
  }
]
The documentation at https://json-ld.org/primer/latest/#framing seems to be a work in progress, and there are really not many examples or tutorials covering JSON-LD framing.
Playground example
Framing is used to shape the data in a JSON-LD document, using an example frame document which is used to both match the flattened data and show an example of how the resulting data should be shaped
https://json-ld.org/spec/latest/json-ld-framing/#framing
That being said, re-shaping data does not mean you can change the semantics. rdfs:label and dcterms:title are different things in the source data and will remain different things in the result; you cannot merge them into a single "title" property that expands to only one URI (which one would it be?). If that were allowed, the result would have different semantics than the source, but framing is only meant to change the structure.
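What a frame's context can do is give each property a shorter alias while keeping the two IRIs distinct. A sketch of such a context (untested; the alias names label and title are illustrative):

```json
{
  "@context": {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/",
    "label": "rdfs:label",
    "title": "dcterms:title"
  }
}
```

Framing with a context like this should yield nodes carrying either a label or a title key, depending on which source property each node has, rather than one merged property.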
Here is a nested JSON object that I have to save in Solr and then retrieve as-is, with the ability to query on attributes at any level:
{
  "number": "19940852773",
  "details": [
    {
      "number": "19940852773",
      "pId": 70972062,
      "bReviewReduction": 66.7000
    },
    {
      "number": "19940852773",
      "pId": 70972063,
      "bReviewReduction": 0.0000
    }
  ],
  "line_details": [
    {
      "number": "19940852773",
      "paymentId": 70972062,
      "paymentDetailId": 14972918
    },
    {
      "number": "19940852773",
      "paymentId": 70972905,
      "paymentDetailId": 14973428
    },
    {
      "number": "19940852773",
      "paymentId": 70972905,
      "paymentDetailId": 14973429,
      "numOfServiceLines": 21
    }
  ]
}
Solr doesn't support indexing this content as-is, but you can use the custom JSON indexing features to transform it into an indexable form. Submitting the document above will give you a document / set of documents containing a number of childDocument entries, and you'll have to handle this through the child/parent transformations when retrieving and querying.
Usually it's a better strategy to flatten the document structure into something suitable for answering the queries you want Solr to answer, instead of trying to keep the raw data structure and special-casing every query to handle it.
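For reference, the custom JSON indexing mentioned above is driven by the split and f parameters of the /update/json/docs handler. A sketch of what submitting the document could look like (the collection name payments and the file name payment.json are assumptions):

```shell
# Split the incoming JSON at the root and at the two child arrays, so each
# entry in details and line_details becomes its own (child) document;
# f=/** maps every JSON path to a field of the same name.
curl 'http://localhost:8983/solr/payments/update/json/docs?split=/|/details|/line_details&f=/**&commit=true' \
  -H 'Content-Type: application/json' \
  --data-binary @payment.json
```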
In my instance of Solr 4.10.3 I would like to index JSONs with a nested structure.
Example:
{
  "id": "myDoc",
  "title": "myTitle",
  "nestedDoc": {
    "name": "test name",
    "nestedAttribute": {
      "attr1": "attr1Val"
    }
  }
}
I am able to store it correctly through the admin interface:
/solr/#/mySchema/documents
and I'm also able to search and retrieve the document.
The problem I'm facing is that when I get the response document from my Solr search, I cannot see the nested attributes. I only see:
{
  "id": "myDoc",
  "title": "myTitle"
}
Is there a way to include ALL the nested fields in the returned documents?
I tried with "fl=[child parentFilter=title:myTitle]", but it's not working (ChildDocTransformerFactory, from https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents). Is that the right way to do it, or is there another way?
I'm using Solr 4.10.3!
To get the whole nested structure returned, you do indeed need to use the ChildDocTransformerFactory. However, you first need to index your documents properly.
If you just pass your structure as-is, Solr will index the parts as separate documents and won't know that they're actually connected. If you want to be able to correctly query nested documents, you'll have to pre-process your data structure as described in this post, or try using (and modifying as needed) a pre-processing script. Unfortunately, up to and including the latest Solr 6.0, there's no nice and smooth solution for indexing and returning nested document structures, so everything is done through workarounds.
Particularly in your case, you'll need to transform your document structure into this:
{
  "type": "parentDoc",
  "id": "myDoc",
  "title": "myTitle",
  "_childDocuments_": [
    {
      "type": "nestedDoc",
      "name": "test name",
      "_childDocuments_": [
        {
          "type": "nestedAttribute",
          "attr1": "attr1Val"
        }
      ]
    }
  ]
}
Then, the following ChildDocTransformerFactory query will return all subdocuments (by the way, although the docs say it's available since Solr 4.9, I've actually only seen it working in Solr 5.3... so you'll need to test):
q=title:myTitle&fl=*,[child parentFilter=type:parentDoc limit=50]
Note that although it returns all nested documents, the returned document structure will be flattened (alas!), i.e., you'll get:
{
  "type": "parentDoc",
  "id": "myDoc",
  "title": "myTitle",
  "_childDocuments_": [
    {
      "type": "nestedDoc",
      "name": "test name"
    },
    {
      "type": "nestedAttribute",
      "attr1": "attr1Val"
    }
  ]
}
Probably not really what you expected, but... this is unfortunate Solr behavior that should be fixed in a future release.
You can put
q={!parent which=}
and in the fl field: fl=*,[child parentFilter=title:myTitle]
It will give you all the parent fields and the child fields of title:myTitle.
Is it possible to determine which part of the text, in which of the indexed text fields, matches the query?
No; as far as I know, and as far as I can tell from the Jira, no such feature currently exists. You can, of course, attempt to highlight the matching parts of the text yourself, but that requires implementing the highlighting yourself, and also implementing the stemming according to the rules applied by MongoDB.
The whole feature is somewhat complicated, even just consuming it, as can be seen from the corresponding Elasticsearch documentation.
Refer to the MongoDB doc on highlighting:
db.fruit.aggregate([
  {
    $searchBeta: {
      "search": {
        "path": "description",
        "query": ["variety", "bunch"]
      },
      "highlight": {
        "path": "description"
      }
    }
  },
  {
    $project: {
      "description": 1,
      "_id": 0,
      "highlights": { "$meta": "searchHighlights" }
    }
  }
])
I'm afraid that solution applies only to MongoDB Atlas at the moment, @LF00.