I've got a MongoDB collection where a particular string may appear in any of a number of fields:
{"_id":1, "field1": "foo", "field2": "bar", "field3": "baz", "otherfield": "stuff"},
{"_id":2, "field1": "bar", "field2": "baz", "field3": "foo", "otherfield": "morestuff"},
{"_id":3, "field1": "baz", "field2": "foo", "field3": "bar", "otherfield": "you get the idea"}
I need to query so that I am returned all records where any one of a set of fields is equal to any value in an array ... basically, if I have ["foo","bar"] I need it to match if either of those strings is in field1 or field2 (but not in any other field).
Obviously I can do this with a series of separate queries:
db.collection.find({"field1":{"$in":["foo","bar"]}})
db.collection.find({"field2":{"$in":["foo","bar"]}})
etc., and I've also built one very large $or query that concatenates them all together, but that seems far too inefficient (my actual collection needs to match any of 15 strings that can occur in any of 9 fields) ... I'm still new to NoSQL DBs, though, and am not sure of the best paradigm to use here. Any help is greatly appreciated.
Try:
db.collection.find(
// Find documents matching any of these values
{$or:[
{"field1":{"$in":["foo","bar"]}},
{"field2":{"$in":["foo","bar"]}}
]}
)
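At the asker's actual scale (15 strings across 9 fields), the $or clause can also be built programmatically rather than written out by hand. A minimal mongo shell sketch, with hypothetical field names:

var values = ["foo", "bar"];                 // up to 15 strings
var fields = ["field1", "field2", "field3"]; // up to 9 fields
// Build one {field: {$in: values}} clause per field
var clauses = fields.map(function (f) {
    var clause = {};
    clause[f] = { "$in": values };
    return clause;
});
db.collection.find({ "$or": clauses });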
Found another answer through poring over the documentation that seems to hit a sweet spot -- text indexes.
db.collection.ensureIndex({"field1":"text","field2":"text"})
db.collection.runCommand("text",{search:"foo bar"})
When I run my actual query with many more strings and fields (and about 100,000 records), the $or/$in approach takes 620 milliseconds while the text index takes 131 milliseconds. The one drawback is that it returns a different type of document as a result; luckily the actual documents are a parameter of each result object.
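For what it's worth, the runCommand("text", ...) form is the old MongoDB 2.4 text command; on 2.6 and later the same text index is queried through find with the $text operator, which returns ordinary documents rather than a wrapper result. A minimal sketch, assuming the same index as above (createIndex is the modern name for ensureIndex):

db.collection.createIndex({ "field1": "text", "field2": "text" })
// Matches documents containing "foo" OR "bar" in field1 or field2
db.collection.find({ "$text": { "$search": "foo bar" } })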
Thanks to those who took the time to make suggestions.
I would collect all the relevant fields into one field (e.g. collected) by adding their values like
"foo:field1",
"bar:field2",
"baz:field3",
"stuff:otherfield",
"bar:field1",
"baz:field2"
...
into that field.
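A minimal mongo shell sketch of populating that field (field names and the value:field format as above; assumes a shell with updateOne, i.e. MongoDB 3.2+):

var fields = ["field1", "field2", "field3", "otherfield"];
db.collection.find().forEach(function (doc) {
    var collected = [];
    fields.forEach(function (f) {
        if (doc[f] !== undefined) {
            collected.push(doc[f] + ":" + f); // e.g. "foo:field1"
        }
    });
    db.collection.updateOne({ _id: doc._id }, { "$set": { collected: collected } });
});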
If you search for bar existing in any field you can use:
db.collection.find( { collected: { $regex: "^bar" } }, ... );
Your example in the question would look like:
db.collection.find( { collected: { $all: [ "foo:field1", "foo:field2", "bar:field1", "bar:field2" ] } }, ... );
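Since the regex is anchored at the start (^bar), an ordinary ascending index on collected lets MongoDB use an index range scan instead of a full collection scan; a one-line sketch:

db.collection.createIndex({ "collected": 1 })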
I have the following data in my index:
{
"name" : "The 100",
"lists" : [
"2c8540ee-85df-4f1a-b35f-00124e1d3c4a;Bellamy",
"2c8540ee-85df-4f1a-b35f-00155c40f11c;Pike",
"2c8540ee-85df-4f1a-b35f-00155c02e581;Clark"
]
}
I have to get all the documents where lists has "Pike" in it.
Though a full-string match works with any, I couldn't get a contains-style match to work:
$filter=lists/any(t: t eq '2c8540ee-85df-4f1a-b35f-00155c40f11c;Pike')
However, I am not sure how to search with only "Pike":
$filter=lists/any(t: t eq 'Pike')
I guess eq looks for a full match; is there any way, with the given data structure, to make this query work?
Currently the lists field has only the filterable property, not the searchable property.
The eq operator looks for exact, case-sensitive matches. That's why it doesn't match 'Pike'. You need to structure your index such that terms like 'Pike' can be easily found. You can accomplish this in one of two ways:
Separate the GUIDs from the names when you index documents. So instead of indexing "2c8540ee-85df-4f1a-b35f-00155c40f11c;Pike" as a single string, you could index them as separate strings in the same array, or perhaps in two different collection fields (one for GUIDs and one for names) if you need to correlate them by position.
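For illustration, a hypothetical reshaping along those lines, with assumed field names listIds and listNames (two parallel, filterable Collection(Edm.String) fields):

{
  "name" : "The 100",
  "listIds" : [ "2c8540ee-85df-4f1a-b35f-00124e1d3c4a", "2c8540ee-85df-4f1a-b35f-00155c40f11c", "2c8540ee-85df-4f1a-b35f-00155c02e581" ],
  "listNames" : [ "Bellamy", "Pike", "Clark" ]
}

The original filter then becomes $filter=listNames/any(t: t eq 'Pike').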
If the field is searchable, you can use the new search.ismatch function in your filter. Assuming the field is using the standard analyzer, full-text search will word-break on the semicolons, so you should be able to search just for "Pike" and get a match. The syntax would look like this: $filter=search.ismatch('Pike', 'lists') (If looking for "Pike" is all your filter does, you can just use the search and searchFields parameters to the Search API instead of $filter.) If the "lists" field is not already searchable, you will need to either add a new field and re-index the "lists" values, or re-create your index from scratch with the new field definition.
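For reference, a sketch of the search/searchFields variant against the REST API (service name, index name, and api-version are placeholders; the api-key header is omitted):

POST https://<service>.search.windows.net/indexes/<index>/docs/search?api-version=2019-05-06
{
  "search": "Pike",
  "searchFields": "lists"
}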
Update
There is a new approach to solve this type of problem that's available in API versions 2019-05-06 and above. You can now use complex types to represent structured data, including in collections. For the original example, you could structure the data like this:
{
"name" : "The 100",
"lists" : [
{ "id": "2c8540ee-85df-4f1a-b35f-00124e1d3c4a", "name": "Bellamy" },
{ "id": "2c8540ee-85df-4f1a-b35f-00155c40f11c", "name": "Pike" },
{ "id": "2c8540ee-85df-4f1a-b35f-00155c02e581", "name": "Clark" }
]
}
And then directly query for the name sub-field like this:
$filter=lists/any(l: l/name eq 'Pike')
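For completeness, a sketch of how the lists field might be defined in the index schema (the attribute choices are illustrative):

{
  "name": "lists",
  "type": "Collection(Edm.ComplexType)",
  "fields": [
    { "name": "id", "type": "Edm.String", "filterable": true },
    { "name": "name", "type": "Edm.String", "filterable": true }
  ]
}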
See the Azure Search documentation on complex types for details.
I'm locked in a trap with Elasticsearch, trying to sort hits by the size of a sub-property (an array).
I applied the following query body:
'{
"query": {
"match_all": {}
},
"sort": {
"_script": {
"type": "number",
"script": "doc[\"myarray\"].values.size()",
"order": "desc"
}
}
}'
However, since arrays are not a distinct type in an Elasticsearch mapping (they are supported out of the box), I get an error telling me that myarray isn't in the mapping (which is expected...).
Any idea?
Thanks!
The best and recommended way of doing this is to index, in the same document, an additional field that holds the array size as a number, since this is known at indexing time, and then just sort on that field.
The difficulty of this apparently simple task is that an array in Elasticsearch is considered a flat data structure and everything is just "combined". If you are also using an analyzer on this field that will potentially split the values into terms, do you count the number of unique terms or the values that you indexed initially?
For example, let's say myarray is like ["abc 123", "abc", "123", "abc abc"]. Do you count the comma separated values (4 values in total) or the unique terms (abc and 123 so only 2 values in total)?
The correct and most efficient way of doing this is by indexing the length itself in the documents:
{ "myarray":["abc 123", "abc", "123", "abc abc"], "myarray_length":4 }
If you want the array's size computed at query time, then you have to enable dynamic scripting in Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/modules-scripting.html (choose your version).
If you use AWS to host ES, please do read this:
https://kirankoduru.github.io/elasticsearch/moving-from-aws-elasticsearch-service.html
And andrei-stefan is correct: you can make use of a multi-field type in your mapping.
I created a JSON index in Cloudant on _id like so:
{
"index": {
"fields": [ "_id"]
},
"ddoc": "mydesigndoc",
"type": "json",
"name": "myindex"
}
First off, unless I specified the index name in the query, Cloudant somehow could not differentiate between the index I created and the default text-based index on _id (if that is truly the case, then I believe this is a bug).
I ran the following query against the _find endpoint of my db:
{
"selector": {
"_id": {
"$nin":["v1","v2"]
}
},
"fields":["_id", "field1", "field2"],
"use_index": "mydesigndoc/myindex"
}
The result was this error:
{"error":"no_usable_index","reason":"There is no index available for this selector."}
If I change "$nin":["v1","v2"] to "$eq":"v1" then it works fine, but that is not the query I am after.
So in order to get what I want, I had to add "_id": {"$gt":null} to my selector, which now looks like:
{
"selector": {
"_id": {
"$nin":["v1","v2"],
"$gt":null
}
},
"fields":["_id", "field1", "field2"],
"use_index": "mydesigndoc/myindex"
}
Why does this happen? It seems to occur only if I use the _id field in the selector.
What are the ramifications of adding "_id": {"$gt":null} to my selector? Is this going to scan the entire table rather than use the index?
I would appreciate any help, thank you
Cloudant Query can use Cloudant's pre-existing primary index for selection and range querying without you having to create your own index on the _id field.
Unfortunately, the index doesn't really help when using the $nin operator - Cloudant would have to scan the entire database to check for documents which are not in your list, so the index doesn't get you any further forward.
By changing the operator to $eq you are playing to the strengths of the index which can be used to locate the record you need quickly and efficiently.
In short, the query you are attempting is inefficient. If your query were more complex, e.g. the equivalent of WHERE colour='red' AND _id NOT IN ['a','b'], then a Cloudant index on colour could be used to reduce the data set to a reasonable level before doing the $nin operation on the remaining data.
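A sketch of that combined case as a _find request body (the colour field, its values, and the colour-index name are hypothetical; assumes a JSON index exists on colour):

{
  "selector": {
    "colour": "red",
    "_id": { "$nin": ["a", "b"] }
  },
  "use_index": "colour-index"
}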
I want to be able to match a whole phrase using a full text index, but I can't seem to work out how to do it. The Lucene Query Parser syntax states that:
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
But when I specify the following selector, it returns all records with either "sign" or "design" in the name, whereas I would expect it to return only those with "sign design".
POST https://foo.cloudant.com/remote/_find
{"selector":{"$text":"\"SIGN DESIGN\""}}
My index is defined as follows:
db.index({
name: 'subbies_text',
type: 'text',
index: {},
})
Alternatively, is it possible to do a substring match on a field in a JSON index?
You are using the index API to create the index, correct?
Would you please try creating this design document?
{ "_id": '_design/library',
"indexes": {
"subbies_text": {
"analyzer": {
"name":'standard'
},
"index": "function(doc) { index('XXX', doc.YYY); }"
}
}
}
(However, change "XXX" and "YYY" to your desired index field name and your document field name, respectively.)
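Once the design document is saved, a phrase query against it would look something like this (the database name remote is taken from the question; XXX is whatever index field name you used above, and the phrase is URL-encoded):

curl 'https://foo.cloudant.com/remote/_design/library/_search/subbies_text?q=XXX:%22SIGN%20DESIGN%22'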
If you know the maximum number of words to allow, you can make a searchable index with a map-reduce view. I think it is not ideal, but just for posterity:
You can emit() every consecutive pair of words that you see. So, for example, given the phrase "The quick brown fox", you would emit ["the","quick"], ["quick","brown"], ["brown","fox"]. I think this can be nice and simple, but it's really only appropriate for small amounts of data. The index will likely grow too large.
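A minimal sketch of such a map function, assuming the text to index lives in doc.name:

function (doc) {
  if (!doc.name) return;
  var words = doc.name.toLowerCase().split(/\s+/);
  for (var i = 0; i < words.length - 1; i++) {
    emit([words[i], words[i + 1]], null); // one view row per consecutive word pair
  }
}

Querying the view with key=["sign","design"] would then return the documents containing that word pair.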
If you want to use Cloudant Search, you should create a search index first, just like JasonSmith said. Then you can use this search index to do the specific queries.
Suppose you have a document which has "SIGN DESIGN" in its name field.
1. If you want to query the whole phrase, you can query like this:
curl 'https://<username>:<password>@<username>.cloudant.com/db/_design/<design_doc>/_search/<searchname>?q=name:%22SIGN%20DESIGN%22' | jq .
2. If you want to query a substring (prefix) match, you can query like this:
curl 'https://<username>:<password>@<username>.cloudant.com/db/_design/<design_doc>/_search/<searchname>?q=name:SI*' | jq .
I have a non-trivial SOLR query, which already involves a filter query and facet calculations over multiple fields. One of the facet fields is a multi-valued integer field that is used to store categories. There are many possible categories and new ones are created dynamically, so using multiple fields is not an option.
What I want to do, is to restrict facet calculation over this field to a certain set of integers (= categories). So for example I want to calculate facets of this field, but only taking categories 3,7,9 and 15 into account. All other values in that field should be ignored.
How do I do that? Is there some built-in functionality which can be used to solve this? Or do I have to write a custom search component?
Facet parameters such as facet.prefix can be overridden for each field specified by the facet.field parameter – you do it by adding a per-field parameter of the form f.<field_name>.facet.prefix.
I don't know about any way to define the facet base that should be different from the result, but one can use the facet.query to explicitly define each facet filter, e.g.:
facet.query={!key=3}category:3&facet.query={!key=7}category:7&facet.query={!key=9}category:9&facet.query={!key=15}category:15
Given the solr schema/data from this gist, the results will have something like this:
"facet_counts": {
"facet_queries": {
"3": 1,
"7": 1,
"9": 0,
"15": 0
},
"facet_fields": {
"category": [
"2",
2,
"1",
1,
"3",
1,
"7",
1,
"8",
1
]
},
"facet_dates": {},
"facet_ranges": {}
}
Thus giving the needed facet result.
I have some doubts about performance here (especially when there are more than 4 categories and the initial query returns a lot of results), so it is better to do some benchmarking before using this in production.
Not exactly the answer to my own question, but the solution we are using now: the numbers I want to filter on form distinct groups, so we can prefix each id with a group id like this:
1.3
1.8
1.9
2.4
2.5
2.11
...
Having the data like this in Solr, we can use facet.prefix to facet over only a single group: http://wiki.apache.org/solr/SimpleFacetParameters#facet.prefix
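For illustration, a request restricting the facet to group 1 might look like this (the category field name is borrowed from the earlier example; the values would need to be stored as strings for the prefix to apply):

/select?q=*:*&rows=0&facet=true&facet.field=category&facet.prefix=1.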