How can I store and search through large documents with MongoDB?

Well. Here's the DB schema/architecture problem.
Currently in our project we use MongoDB. We have one DB with one collection. Overall there are almost 4 billion documents in that collection (the number is constant). Each document has a unique ID, and there is a lot of varied information related to this ID (that's why MongoDB was chosen: the data is totally heterogeneous, so schemaless is perfect).
{
"_id": ObjectID("5c619e81aeeb3aa0163acf02"),
"our_id": 1552322211,
"field_1": "Here is some information",
"field_a": 133,
"field_c": 561232,
"field_b": {
"field_0": 1,
"field_z": [45, 11, 36]
}
}
The purpose of the collection is to store a lot of data that is easy to update (some data is updated every day, some once a month) and to search over different fields to retrieve the ID. We also store the "history" of each field (and we need to be able to search over the history as well). So when updates over time were turned on, we ran into MongoDB's 16 MB maximum document size limit.
We've tried several workarounds (like splitting the document), but all of them require either a $group or a $lookup stage in the aggregation (grouping by ID, see the example below), and neither can use indexes, which makes a search over several fields EXTREMELY slow.
{
"_id": ObjectID("5c619e81aeeb3aa0163acd12"),
"our_id": 1552322211,
"field_1": "Here is some information",
"field_a": 133
}
{
"_id": ObjectID("5c619e81aeeb3aa0163acd11"),
"our_id": 1552322211,
"field_c": 561232,
"field_b": {
"field_0": 1,
"field_z": [45, 11, 36]
}
}
We also can't put a $match stage before those, because the search can include logical operators (like field_1 = 'a' && field_c != 320, where field_1 is from one document and field_c is from another, so the search must be done after grouping/joining the documents together), and the logical expression can be VERY complex.
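For illustration, the workaround described above might look roughly like the following pipeline. This is a sketch, not the asker's actual code: the helper name build_pipeline is hypothetical, and merging the split documents with $mergeObjects inside $group is an assumption about how the grouping was done. It only shows why the filter ends up after the grouping stage.

```python
def build_pipeline(match_after_group):
    """Return an aggregation pipeline that reassembles split documents
    by our_id and only then applies the cross-document filter."""
    return [
        # Merge all partial documents that share the same our_id.
        {"$group": {"_id": "$our_id", "doc": {"$mergeObjects": "$$ROOT"}}},
        # The filter references fields from different partial documents,
        # so it has to run on the merged result.
        {"$match": match_after_group},
    ]


# field_1 = 'a' && field_c != 320, expressed against the merged document
pipeline = build_pipeline({
    "doc.field_1": "a",
    "doc.field_c": {"$ne": 320},
})
```

With pymongo, this list would be passed to collection.aggregate(pipeline). The $group stage still has to read every partial document before the $match can discard anything.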
So are there any tricky workarounds? If not, which other databases would you suggest moving to?
Kind regards.

Okay, so after some time spent testing different approaches, I finally ended up using Elasticsearch, because there is no way to perform the requested searches in MongoDB in a reasonable amount of time.

Related

Is the order of multi-value fields reliable in Solr?

I have an indexed multi-value field. If I add a parallel multi-value field, can I rely on both keeping the same order?
Consider this CSV and separator |:
ID,Name,Number
988,Sixth|Second|Third,6|2|3
989,Fifth|Fourth|First,5|4|1
If I get the records by id (not search), can I be sure that the arrays of two fields are always in the original matching order?
{
"doc":
{
"ID":988,
"Name":["Sixth",
"Second",
"Third"],
"Number":[6,
2,
3]}}
Yes, the order is deterministic and stable. You can safely assume that the sequence of multivalued fields is kept intact. The post detailing this on the mailing list has since disappeared.
There is no guarantee about the ordering of the fields themselves, i.e. "Number" can come before "Name", but within a field (the multivalued part) the values will be returned in the same order as they were indexed.
We've been running applications on Solr since 2008 that depend on this behavior and it has never been an issue.
If there are many fields where you need to know that [0] in one field corresponds to [0] in another field, etc., it might be more useful to add a stored-only (not indexed) JSON representation of the structure as a field, and just index the other fields (without storing them), to make the application-level code simpler.
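A minimal sketch of both ideas, assuming the document shape from the question. The helper names and the Pairs_json field are hypothetical; the point is only that pairing by index works when the order is preserved, and that a stored JSON field makes the pairing explicit.

```python
import json


def pair_by_index(doc):
    """Pair Name[i] with Number[i], relying on the preserved order."""
    return list(zip(doc["Name"], doc["Number"]))


def with_stored_pairs(doc_id, names, numbers):
    """Build a document that also carries the pairs as one JSON string.

    Pairs_json is a hypothetical stored-only field name.
    """
    pairs = [{"name": n, "number": x} for n, x in zip(names, numbers)]
    return {
        "ID": doc_id,
        "Name": names,
        "Number": numbers,
        "Pairs_json": json.dumps(pairs),
    }


doc = with_stored_pairs(988, ["Sixth", "Second", "Third"], [6, 2, 3])
pairs = pair_by_index(doc)
decoded = json.loads(doc["Pairs_json"])
```

With the JSON field, the client no longer depends on two separate arrays staying aligned; it decodes one value and gets the pairs directly.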

Carrot: different clusters for the same query

When I issue the same match-all query (*:*) repeatedly, I get different clusters and scores every time. What could be the reason?
First try:
label: "В Минске"
score: 52.79549568196028
Second try:
label: "В Минске"
"score": 54.74385944060893
Third try:
label: "В Минске"
"score": 48.884082925408734
Document ids inside the clusters are also different. The clusters themselves change: in one query response I get a cluster "тысячами евро", in the subsequent one it is gone, but a new cluster appears: "Тысячами Долларов".
Is there some Carrot2 parameter that could make the clusters stable for a given query? Could it be desiredClusterCountBase?
The Solr index is the same for all cases. Algorithm used: org.carrot2.clustering.lingo.LingoClusteringAlgorithm with StopWordLabelFilter.enabled=false and clustering.rows=1000.
It looks like I found the reason:
The index contained a duplicate of each document, with only one difference: one copy had a publication date, the other did not.
At the same time, my date filter did not work correctly, because publication dates were incorrectly stamped on each document, and the ranking function with reciprocal rank could return different documents each time for the top 1000 (this part is hard to debug without looking into the Solr source code).
The clustering module would therefore get slightly different sets of documents, so the clusters would change. However, one could see that the most prominent clusters (by size) were still stable; only their scores were changing. Less prominent clusters could be replaced by other less prominent clusters between requests.
I don't know whether this is still a bug, but removing all documents from the index and putting them back with the correct publication date has solved the issue.

IBM Watson Conversation Fuzzy Matching update causing issue with existing entities

The fuzzy matching feature of IBM Watson Conversation has been matching words incorrectly since its latest update. E.g. "what" is picked up as the entity "chatbot", even though no synonym in the chatbot entity is even close to "what".
My question: is there a way to exclude words from fuzzy matching while keeping it ON for the entity? Or is there any other solution to this problem?
Thanks
I assume you have an entity in chatbot for 'chat bot', and it's getting a partial match on chat, and then doing a fuzzy match from 'chat' to 'what' because it's only a one-character difference and could be a spelling error.
You can turn fuzzy matching off, but you cannot currently blacklist any specific words. You can also try to protect yourself through your dialog design, so that you're only looking for #chatbot at certain points and it shouldn't interrupt very often.
I know what you mean: we need to use fuzzy matching, but it sometimes creates more trouble. We have had a number of words picked up and reported as something different. The method we use to remove some of the issues is to look at the confidence value that is given for the incorrect match "what", and then use this as an additional condition.
i.e. if "what" reports a confidence value of 0.6 then set your condition to be 0.7 .. entities['chatbot']?.confidence > 0.7
Fuzzy logic can be switched on or off for each individual "class" of entities, i.e. 'chatbot' in the example above or 'city' in many of the doc examples.
I don't believe you can set one global condition that checks all entities for their confidence value, so you do need to check the confidence at the class level, as shown above.
Also at present you cannot blacklist individual words to stop the fuzzy logic from checking them, like 'what' in your example.
Yes, you can definitely examine the confidence value. One concern I have about that is that you have no idea how many entities you are receiving, so you will have to write some fairly complex logic, but if you only have one entity, it's pretty simple. When we detect entities, we return this:
"entities": [
{
"entity": "appliance",
"location": [
23,
29
],
"value": "wipers",
"confidence": 1
},
{
"entity": "appliance",
"location": [
11,
18
],
"value": "lights",
"confidence": 0.87
}
]
So to access the confidence of an entity, you would use entities[0].confidence > 0.x in your dialog trigger.
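If the number of entities is not known in advance, the same check can also be done in application code on the response. A rough sketch, assuming the response shape shown above; the function name and the 0.9 threshold are illustrative only.

```python
def confident_entities(response, entity_name, threshold):
    """Return values of `entity_name` whose confidence exceeds threshold."""
    return [
        e["value"]
        for e in response.get("entities", [])
        if e["entity"] == entity_name and e.get("confidence", 0) > threshold
    ]


response = {
    "entities": [
        {"entity": "appliance", "location": [23, 29],
         "value": "wipers", "confidence": 1},
        {"entity": "appliance", "location": [11, 18],
         "value": "lights", "confidence": 0.87},
    ]
}

kept = confident_entities(response, "appliance", 0.9)
```

This handles any number of detected entities, so no per-index conditions are needed.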

Lucene/Solr: Store offset information for certain keywords

We are using Solr to store documents with keywords; each keyword is associated with a span within the document.
The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.
For example, this document
Bill studied The Bill of Rights last summer.
could be accompanied by the following keywords (with offsets in parentheses):
William Brown (0:4)
legal term (13:31)
summer 2011 (32:43)
(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)
I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.
But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.
I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...
EDIT: edited to make clear that Bill can refer to different people/things in different documents.
EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).
Two Q Monte
Solution Pros:
Annotations logically stored with source docs
No knowledge of highlighter implementation or custom Java highlighter development required
Since all customization happens outside of Solr, this solution should be forward-compatible to future Solr versions.
Solution Cons:
Requires two queries to be run
Requires code in your search client to merge results from one query into the other.
With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...
curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
{
"id": "123",
"text" : "Bill studied The Bill of Rights last summer.",
"content_type": "source",
"_childDocuments_": [
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
},
{
"id": "123-2",
"content_type": "source_annotation",
"annotation": "legal term",
"start_offset": 13,
"end_offset": 31
},
{
"id": "123-3",
"content_type": "source_annotation",
"annotation": "summer 2011",
"start_offset": 32,
"end_offset": 43
}
]
}
]
... using block join to query the annotations.
1) Annotation Query: http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
}
]
}
Store these results in your code so that you can fold in the annotation offsets after the next query returns.
2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123
(id:123 discovered in Annotation Query gets ORed into second query)
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123",
"content_type": "source",
"text": "Bill studied The Bill of Rights last summer."
}
],
"highlighting":{}
}
Note: In this example there is no highlighting information returned because the search terms didn't match any content_type:source documents. However we have the explicit annotations and offsets from the first query!
Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
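The client-side merge could be sketched as follows. This is only an illustration: the function name and the <em> markers are assumptions, offsets are treated as start-inclusive and end-exclusive as in the example annotations, and overlapping annotations are not handled.

```python
def insert_highlights(text, annotations, pre="<em>", post="</em>"):
    """Wrap each annotated span of `text` in highlight markers."""
    # Work from the rightmost span to the left so that earlier offsets
    # stay valid after markers are inserted.
    ordered = sorted(annotations, key=lambda a: a["start_offset"], reverse=True)
    for a in ordered:
        start, end = a["start_offset"], a["end_offset"]
        text = text[:start] + pre + text[start:end] + post + text[end:]
    return text


# Results of the two queries, reduced to the fields that matter here.
source = {"id": "123", "text": "Bill studied The Bill of Rights last summer."}
annotations = [{"id": "123-1", "start_offset": 0, "end_offset": 4}]

highlighted = insert_highlights(source["text"], annotations)
```

Only the first Bill is wrapped, because the markers come from the stored offsets and not from a text match.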
More block join info on Yonik's blog here.
Solr tracks the start/end offsets of each token when the text is tokenized, for instance by the StandardTokenizer. This information can be encoded in the underlying index (for example via term vectors with offsets). The use case you describe sounds a lot like the SynonymFilterFactory.
When you define a synonym using the SynonymFilterFactory, stating for instance that foo is equivalent to bar (the rule foo, bar), the bar term is added to the token stream generated when the text is tokenized, and it will have the same offset information as the original token. So, for instance, if your text is "foo is awesome", the term foo will have the offset information (start=0, end=3), and a new token bar (start=0, end=3) will be added to your index (assuming that you're using the SynonymFilterFactory at index time):
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
Once the SynonymFilterFactory is applied:
       bar
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
So if you fire a query using foo, the document will match, but if you use bar as your query the document will also match since a bar token is added by the SynonymFilterFactory
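To make the offset bookkeeping concrete, here is a simplified sketch in Python. It is not how Solr implements the filter: the whitespace tokenizer and both function names are stand-ins, and the only point is that an injected synonym reuses the offsets of the token it was derived from.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace; return (term, start, end) with exclusive end."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def add_synonyms(tokens, synonyms):
    """Emit each token plus its synonyms at the same offsets."""
    out = []
    for term, start, end in tokens:
        out.append((term, start, end))
        for syn in synonyms.get(term, []):
            out.append((syn, start, end))
    return out


tokens = tokenize_with_offsets("foo is awesome")
expanded = add_synonyms(tokens, {"foo": ["bar"]})
```

A highlighter that works from these offsets would mark the same span of text whether the query term was foo or bar.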
In your particular case, you're trying to accomplish multi-term synonyms, which is a difficult problem; you may need something more than Solr's default synonym filter. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts should provide additional information and the caveats of each approach.
Do you need to fetch the stored offsets for some later processing?

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in Solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then querying for documents with a specific value aggregation in a specific time range, i.e. "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (ie "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. this works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
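Applied to the timestamp/value pairs in the question, the first approach could look like the sketch below. It assumes each pair is stored as one string in a multi-valued field (the field name pairs is hypothetical) and that the aggregation happens in the client after fetching the documents, which is not the same as having Solr evaluate sum(value) itself.

```python
from datetime import date


def encode_pair(day, value):
    """Encode one timestamp/value pair as a single field value."""
    return f"{day.isoformat()},{value}"


def decode_pair(encoded):
    """Inverse of encode_pair."""
    day, value = encoded.split(",")
    return date.fromisoformat(day), int(value)


def sum_in_range(doc, start, end):
    """Sum the values whose date falls in [start, end)."""
    total = 0
    for encoded in doc["pairs"]:
        day, value = decode_pair(encoded)
        if start <= day < end:
            total += value
    return total


docs = [
    {"id": 100, "pairs": ["2011-05-01,20", "2011-08-23,200", "2011-08-30,1000"]},
    {"id": 200, "pairs": ["2011-04-23,10", "2011-04-24,100"]},
]
start, end = date(2011, 8, 1), date(2011, 9, 1)
matching = [d["id"] for d in docs if sum_in_range(d, start, end) > 0]
```

The trade-off is the one named in the quoted post: clients must know how to decode the values, and every candidate document has to be fetched before it can be filtered.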
