Lucene/Solr: Store offset information for certain keywords - solr

We are using Solr to store documents with keywords; each keyword is associated with a span within the document.
The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.
For example, this document
Bill studied The Bill of Rights last summer.
could be accompanied by the following keywords (with offsets in parentheses):
William Brown (0:4)
legal term (13:31)
summer 2011 (32:43)
(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)
I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.
But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.
I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...
EDIT: edited to make clear that Bill can refer to different people/things in different documents.
EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).

Two Q Monte
Solution Pros:
Annotations logically stored with source docs
No knowledge of highlighter implementation or custom Java highlighter development required
Since all customization happens outside of Solr, this solution should be forward-compatible to future Solr versions.
Solution Cons:
Requires two queries to be run
Requires code in your search client to merge results from one query into the other.
With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...
curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
{
"id": "123",
"text" : "Bill studied The Bill of Rights last summer.",
"content_type": "source",
"_childDocuments_": [
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
},
{
"id": "123-2",
"content_type": "source_annotation",
"annotation": "legal term",
"start_offset": 13,
"end_offset": 31
},
{
"id": "123-3",
"content_type": "source_annotation",
"annotation": "summer 2011",
"start_offset": 32,
"end_offset": 43
}
]
}
]'
... using block join to query the annotations.
1) Annotation Query: http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
}
]
}
Store these results in your code so that you can fold in the annotation offsets after the next query returns.
2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123
(id:123 discovered in Annotation Query gets ORed into second query)
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123",
"content_type": "source",
"text": "Bill studied The Bill of Rights last summer."
}
],
"highlighting":{}
}
Note: In this example no highlighting information is returned, because the search terms didn't match the text of any content_type:source document (document 123 is returned only because of the id clause). However, we have the explicit annotations and offsets from the first query!
Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
More block join info on Yonik's blog here.
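The client-side merge step could look roughly like the sketch below. It is only an illustration: the function name apply_annotations and the <em> markers are my own choices, and it assumes the annotation spans in a document do not overlap.

```python
def apply_annotations(text, annotations, open_tag="<em>", close_tag="</em>"):
    """Wrap each annotated span of `text` in highlight markers.

    `annotations` are the child docs from the annotation query, each with
    `start_offset` and `end_offset`. Spans are assumed not to overlap.
    """
    result = text
    # Work from the last span to the first so that inserting markers
    # does not shift the offsets of spans still to be processed.
    for ann in sorted(annotations, key=lambda a: a["start_offset"], reverse=True):
        start, end = ann["start_offset"], ann["end_offset"]
        result = result[:start] + open_tag + result[start:end] + close_tag + result[end:]
    return result


# Results of the two queries, reduced to the fields used here.
source_doc = {"id": "123", "text": "Bill studied The Bill of Rights last summer."}
annotation_docs = [{"id": "123-1", "start_offset": 0, "end_offset": 4}]

highlighted = apply_annotations(source_doc["text"], annotation_docs)
print(highlighted)  # <em>Bill</em> studied The Bill of Rights last summer.
```

Only the first Bill is marked, because the marker positions come from the stored offsets rather than from matching the string.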

By default Solr stores the start/end position of each token once it is tokenized, for instance by the StandardTokenizer. This info is encoded in the underlying index. The use case that you described here sounds a lot like the SynonymFilterFactory.
When you define a synonym using the SynonymFilterFactory, stating for instance that foo is equivalent to bar, the bar term is added to the token stream generated when the text is tokenized, and it will have the same offset information as the original token. So, for instance, if your text is "foo is awesome", the term foo will have the offset information (start=0, end=3), and a new token bar (start=0, end=3) will be added to your index (assuming that you're using the SynonymFilterFactory at index time):
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
Once the SynonymFilterFactory is applied:
       bar
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
So if you fire a query using foo, the document will match, but if you use bar as your query the document will also match, since a bar token is added by the SynonymFilterFactory.
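To make the offset sharing concrete, here is a toy model in plain Python. It is not how Lucene implements its analysis chain; the whitespace tokenizer and the function names are assumptions made purely for illustration.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and record (term, start, end) for each token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def add_synonyms(tokens, synonyms):
    """Emit each token, followed by its synonyms at the same offsets."""
    out = []
    for term, start, end in tokens:
        out.append((term, start, end))
        for syn in synonyms.get(term, []):
            # The synonym reuses the offsets of the original token.
            out.append((syn, start, end))
    return out


tokens = add_synonyms(tokenize_with_offsets("foo is awesome"), {"foo": ["bar"]})
for term, start, end in tokens:
    print(term, start, end)
```

Both foo and bar end up with start=0 and end=3, which is why a match on either term points back to the same span of the original text.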
In your particular case, you're trying to accomplish multi-term synonyms, which is a difficult problem; you may need something more than Solr's default synonym filter. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts should provide additional information and the caveats of each approach.
Do you need to fetch the stored offsets for some later processing?

Related

How can I store and search through large documents with MongoDB?

Well. Here's the DB schema/architecture problem.
Currently in our project we use MongoDB. We have one DB with one collection. Overall there are almost 4 billion documents in that collection (this number is constant). Each document has a unique specific ID and there is a lot of different information related to this ID (that's why MongoDB was chosen: the data is totally different from document to document, so schemaless is perfect).
{
"_id": ObjectID("5c619e81aeeb3aa0163acf02"),
"our_id": 1552322211,
"field_1": "Here is some information",
"field_a": 133,
"field_c": 561232,
"field_b": {
"field_0": 1,
"field_z": [45, 11, 36]
}
}
The purpose of that collection is to store a lot of data that is easy to update (some data is updated every day, some once a month) and to search over different fields to retrieve the ID. We also store the "history" of each field (and we should be able to search over the history as well). So once these updates over time were turned on, we ran into MongoDB's 16 MB maximum document size.
We've tried several workarounds (like splitting the document), but all of them involve either a $group or a $lookup stage in the aggregation (grouping by ID, see the example below), and neither can use indexes, which makes a search over several fields EXTREMELY slow.
{
"_id": ObjectID("5c619e81aeeb3aa0163acd12"),
"our_id": 1552322211,
"field_1": "Here is some information",
"field_a": 133
}
{
"_id": ObjectID("5c619e81aeeb3aa0163acd11"),
"our_id": 1552322211,
"field_c": 561232,
"field_b": {
"field_0": 1,
"field_z": [45, 11, 36]
}
}
Also we can't use $match stage before those, because the search can include logical operators (like field_1 = 'a' && field_c != 320, where field_1 is from one document and field_c is from another, so the search must be done after grouping/joining documents together) + the logical expression can be VERY complex.
So are there any tricky workarounds? If not, which other databases would you suggest moving to?
Kind regards.
Okay, so after some time spent testing different approaches, I finally ended up using Elasticsearch, because there is no way to perform the requested searches through MongoDB in an adequate amount of time.

Does it make sense to model a Solr document if search does not allow to specify the attributes?

I want to provide a search feature in my site where the user is able to search by text only, without specifying the attributes.
For example, instead of allowing the user to search by "author=George Martin" he will simply query "George Martin".
I would like to know if there is any advantage in a document model like this one:
{
"id": 1,
"title": "Game of Thrones",
"author": "George R. R. Martin",
"published": "August, 1996"
}
Compared to:
{
"id": 1,
"data": [
"Game of Thrones",
"George R. R. Martin",
"August, 1996"
]
}
If I'm not going to use "author:value" in the Solr API, I should get the same results, right?
The first version will allow you to assign different weights to the different fields. I.e. a hit in the title might be more important than a hit in the author field - or vice versa.
Using the edismax handler (defType=edismax) and query fields (qf=title author published) will give you the same behavior as your second example, but will retain the structure of the document.
As the fields are put into the qf parameter, there is no need for the user to explicitly tell Solr which fields she wants to search.
To give the fields different weights, assign a weight to each field in the qf list: qf=title^5 author^2 published will give a hit in title five times the weight of a hit in published - i.e. "The Hunt for Red October" will be more important than something published in October.
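As a rough sketch of how a search client might assemble those parameters, assuming it builds the request query string itself (the helper name build_edismax_params is made up for this example):

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_weights):
    """Build Solr request parameters for a free-text edismax search.

    `field_weights` maps field name to boost; a boost of 1 is left implicit.
    """
    qf = " ".join(
        field if weight == 1 else f"{field}^{weight}"
        for field, weight in field_weights.items()
    )
    return {"q": user_query, "defType": "edismax", "qf": qf}


params = build_edismax_params(
    "George Martin", {"title": 5, "author": 2, "published": 1}
)
query_string = urlencode(params)  # ready to append to the select URL
print(params["qf"])  # title^5 author^2 published
```

The user only ever supplies the free-text query; the field list and the boosts stay on the application side.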

Ibm Watson Conversation Fuzzy Matching update causing issue with existing entities

Since its latest update, the fuzzy matching feature of IBM Watson Conversation is matching words incorrectly. E.g. "what" is getting picked up as the entity "chatbot", even though there is no synonym in the chatbot entity that is even close to "what".
My question: is there a way to exclude words from fuzzy matching while keeping it ON for the entity? Or is there any other solution to tackle this problem?
Thanks
I assume you have an entity value in chatbot for 'chat bot', and it's getting a partial match on 'chat', then fuzzy matching from 'chat' to 'what' because there is only one character of difference and it could be a spelling error.
You can turn fuzzy matching off, but you cannot currently blacklist any specific words. You can also try to protect yourself through your dialog design, in that you're only looking for @chatbot at certain points, so it shouldn't interrupt very often.
I know what you mean, we need to use fuzzy matching, but it sometimes creates more trouble. We have had a number of words picked up and reported as something different. The method we use to remove some of the issues, is to view the confidence value that's given for the incorrect spelling "what" .. and then using this as an additional condition.
i.e. if "what" reports a confidence value of 0.6 then set your condition to be 0.7 .. entities['chatbot']?.confidence > 0.7
Fuzzy logic can be switched on or off for each individual "class" of entities, i.e. 'chatbot' in the example above or 'city' in many of the doc examples.
I don't believe you can set one global condition that checks all entities for their confidence value, so you do need to check the confidence at the class level, as shown above.
Also at present you cannot blacklist individual words to stop the fuzzy logic from checking them, like 'what' in your example.
Yes, you can definitely examine the confidence value. One concern I have about that is that you have no idea how many entities you are receiving, so you will have to write some fairly complex logic; but if you only have one entity, it's pretty simple. When we detect entities, we return this:
"entities": [
{
"entity": "appliance",
"location": [
23,
29
],
"value": "wipers",
"confidence": 1
},
{
"entity": "appliance",
"location": [
11,
18
],
"value": "lights",
"confidence": 0.87
}
]
So to access the confidence of an entity you would use entities[0].confidence > 0.x in your dialog trigger.
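If you would rather do the check in your own application code than in the dialog, one option is to filter the returned entities list by confidence. The snippet below is only a sketch in Python; the function name and the 0.9 threshold are arbitrary choices for the example, and it is not dialog condition syntax.

```python
def confident_entities(entities, name, threshold):
    """Keep only detections of entity `name` whose confidence exceeds `threshold`."""
    return [
        e for e in entities
        if e["entity"] == name and e["confidence"] > threshold
    ]


# The "entities" array from the response shown above.
entities = [
    {"entity": "appliance", "location": [23, 29], "value": "wipers", "confidence": 1},
    {"entity": "appliance", "location": [11, 18], "value": "lights", "confidence": 0.87},
]

kept = confident_entities(entities, "appliance", 0.9)
print([e["value"] for e in kept])  # ['wipers']
```

This handles any number of returned entities, so you do not need to know in advance how many detections come back.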

Azure Search synonyms not reflecting in results

The synonyms don't seem to function in Azure Search
I updated my synonyms map with the following payload
{
"name" : "synonymmap1",
"format" : "solr",
"synonyms" :
"Bob, Bobby,Bobby\n
Bill, William, Billy\n
Harold, Harry\n
Elizabeth, Beth\n
Michael,Mike\n
Robert, Rob\n"
}
Then when I examined the synonymMap, I see this
{
"@odata.context":
"https://athenasearchdev.search.windows.net/$metadata#synonymmaps",
"value": [
{
"@odata.etag": "\"0x8D4E7F3C1A9404D\"",
"name": "synonymmap1",
"format": "solr",
"synonyms": "Bob, Bobby,Bobby\n\r\n Bill, William, Billy\n\r\n Harold, Harry\n\r\n Elizabeth, Beth,Liza, Elize\n\r\n Michael,Mike\n\r\n Robert, Rob\n\r\n"
}
]
}
However, the synonyms don't seem to function. E.g. the results for a search on Mike and a search on Michael are not identical.
I understand this is a preview feature, but wanted help on the following
a) once defined as synonyms, should we not expect exact same results and search scores across all synonym variations
b) Can these synonyms apply at a column level (e.g. first name alone and not address), or is it always across the document?
c) if we have a large set of synonyms (over 1000)- does it lead to performance impact?
I am Nate from Azure Search. To answer the questions first :
a) Yes, you should. If "Bill" and "William" are defined as synonyms, searching on either should yield the same result.
b) It's always at the column level. You use the field/column property called 'synonymMaps' to specify which synonym maps to use. Please see "Setting the synonym map in the index definition" in https://azure.microsoft.com/en-us/blog/azure-search-synonyms-public-preview/ for more information.
c) Do you mean over 1000 synonyms for a word, or 1000 synonym rules in the synonym map? The former definitely impacts performance, because the search query will expand to 1000s of terms. In fact, you can't define more than 50 synonyms in a rule. The latter, 1000s of rules in a synonym map, shouldn't impact performance unless the rules are constantly updated.
Regarding your comment that synonyms don't function: based on your questions, I was wondering whether the synonyms feature was enabled in the index definition. Could you check that? If it still doesn't function, feel free to drop me an email at nateko@microsoft.com.
The extraneous newline characters you see in the retrieved synonym map may have been inserted by the HTTP client you were using at the time of uploading. Some HTTP clients, Fiddler and Postman for example, insert a newline character at each line ending automatically so you don't have to do it yourself.
Thanks,
Nate
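One way to sidestep client-inserted line breaks is to build the synonyms string in code and join the rules with a single "\n". The sketch below is only an illustration: the helper name is made up, and it just produces the JSON body; sending it to the service is left out.

```python
import json


def build_synonym_map(name, rules):
    """Return a synonym map payload with exactly one newline between rules."""
    cleaned = []
    for rule in rules:
        # Trim whitespace around each term and drop empty or repeated terms.
        terms = [t.strip() for t in rule.split(",") if t.strip()]
        cleaned.append(", ".join(dict.fromkeys(terms)))
    return {"name": name, "format": "solr", "synonyms": "\n".join(cleaned)}


synonym_map = build_synonym_map(
    "synonymmap1",
    [
        "Bob, Bobby,Bobby",
        "Bill, William, Billy",
        "Harold, Harry",
        "Elizabeth, Beth",
        "Michael,Mike",
        "Robert, Rob",
    ],
)
payload = json.dumps(synonym_map)  # newlines are escaped as \n in the JSON body
print(payload)
```

Because the string is serialized by json.dumps, no raw line breaks or carriage returns end up inside the "synonyms" value.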

solr - How can I exclude from the result the multivalue fields that don't match my query?

I have some indexed documents like the one below:
{
"doc_desc": "Indexing Child Documents in JSON",
"doc_id": "379",
"image_id": [
"28086# ho hum... this is page 1 of chapter 1",
"28087# more text... this is page 2 of chapter 1",
"28088# more text... this is page 3 of chapter 1"
]
}
When I query for "ho hum", I need the returned document to be something like:
{
"doc_desc": "Indexing Child Documents in JSON",
"doc_id": "379",
"image_id": [
"28086# ho hum... this is page 1 of chapter 1"
]
}
That way I can know exactly which page has the words I was searching for. How can I do that?
In other words... How can I exclude from the result the multivalue fields that don't match my query?
Note: I am using solr-4.10.2 and a data import (db-data-config.xml) from my SQL Server database.
You can't, at least not without a lot of manual tinkering.
Two possible solutions are to index each page as a separate document, or to use the Block Join feature of Solr. The first option is probably the quickest to implement.
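For the first option, the reshaping could be sketched like this in Python. The output field names (id, page_text) are hypothetical, and the sketch assumes every image_id value has the form "<image id># <page text>" as in the question; with a data import the same split would have to happen in the import configuration instead.

```python
def split_into_page_docs(doc):
    """Turn one multivalued document into one Solr document per page."""
    pages = []
    for value in doc["image_id"]:
        # Everything before the first '#' is the image id, the rest is page text.
        image_id, _, page_text = value.partition("#")
        pages.append({
            "id": f'{doc["doc_id"]}-{image_id}',  # hypothetical unique key
            "doc_id": doc["doc_id"],
            "doc_desc": doc["doc_desc"],
            "image_id": image_id,
            "page_text": page_text.strip(),
        })
    return pages


doc = {
    "doc_desc": "Indexing Child Documents in JSON",
    "doc_id": "379",
    "image_id": [
        "28086# ho hum... this is page 1 of chapter 1",
        "28087# more text... this is page 2 of chapter 1",
        "28088# more text... this is page 3 of chapter 1",
    ],
}

page_docs = split_into_page_docs(doc)
print(page_docs[0]["image_id"], page_docs[0]["page_text"])
```

A query for "ho hum" against page_text then returns only the page document for image 28086, and doc_id tells you which parent it belongs to.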

Resources