Query specifically indexed value in multivalued field - solr

I have a multivalued field which is filled by an array of strings. Now I want to find all documents that have, say, foo as the second (!) string in this field. Is this possible?
If it is not, what would be your recommendation to achieve this?

For Solr, you can use an UpdateRequestProcessor to copy the field and prefix each value with its position, so you end up with values like 2_91. You can use a StatelessScriptUpdateProcessor for that.
Alternatively, you could send this information as multiple fields and use a dynamic field definition to map them.
Basically, for both Solr and ES, the underlying Lucene index stores a multivalued field as one long stream of tokens, with a large position gap between the last token of one value and the first token of the next. So querying by absolute position requires some sort of hack. Runtime hacks (e.g. the Elasticsearch script filter in the other answer) are expensive at query time; content-modifying hacks (e.g. the URP in this example) cost extra disk space or a more complex schema.
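As an illustrative sketch (not a drop-in config), a StatelessScriptUpdateProcessorFactory script in JavaScript could look like the following; the field names phone_no and phone_no_positional are hypothetical placeholders for your schema:

function processAdd(cmd) {
  var doc = cmd.solrDoc; // the incoming SolrInputDocument
  var values = doc.getFieldValues("phone_no"); // hypothetical multivalued source field
  if (values !== null) {
    var it = values.iterator();
    var pos = 1;
    while (it.hasNext()) {
      // index e.g. "2_91" so position and value can be matched together
      doc.addField("phone_no_positional", pos + "_" + it.next());
      pos++;
    }
  }
}

Finding documents whose second value is 91 then becomes a plain term query such as q=phone_no_positional:2_91.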

In Elasticsearch, you can achieve this using a script filter. Here is a sample.
Consider a mapping for phone_no:
{
  "index": {
    "mappings": {
      "type": {
        "properties": {
          "phone_no": {
            "type": "string"
          }
        }
      }
    }
  }
}
Put a first document:
POST index/type
{
  "phone_no": ["91", "92210"]
}
and a second one:
POST index/type
{
  "phone_no": ["92210", "91"]
}
So, if you want to find documents whose second value equals 91, here is the query:
POST index/type/_search
{
  "filter": {
    "script": {
      "script": "_source.phone_no[1].equals(val)",
      "params": {
        "val": "91"
      }
    }
  }
}
where val is user-defined.
Note that the script above does no bounds checking: if phone_no has fewer than two elements, it may throw an exception, so modify the script for your needs; a guarded variant is sketched below.
Thanks, hope this helps!
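A guarded variant of the same filter might look like this (a sketch; the size() check assumes the old dynamic scripting language used above):

POST index/type/_search
{
  "filter": {
    "script": {
      "script": "_source.phone_no.size() > 1 && _source.phone_no[1].equals(val)",
      "params": {
        "val": "91"
      }
    }
  }
}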

Related

A way to bring facet results together based on common id

I'm doing a MongoDB aggregation with two facets. Each facet is a different operation performed on the same collection, and each facet's results have two fields per object: the id and the operation result. I want to combine the two facets' results based on the common id.
The desired result is like this:
[
  {
    "id": "1",
    "bind": "xxx",
    "pres": "xxx"
  },
  {
    "id": "2",
    ......
  }
]
I would like missing values to be zero, or to be left out if that is supported.
I've started with
const combined_agg = [
  {
    "$facet": {
      "bind": opp_bind,
      "pres": opp_pres
    }
  }
];
where opp_bind and opp_pres are the variables for the two operations. The above gives me:
[
  {
    "bind": [
      {"binding": 6, "id": "xxxx"},
      ....
    ],
    "pres": [
      {"presenting": 4, "id": "xxxx"},
      ....
    ]
  }
]
From here, I am running into trouble.
I have tried to concatenate the arrays with
{
  "$project": {"result": {"$concatArrays": ["$bind", "$pres"]}}
}
which gives me one object with one large array. I tried to $unwind that large array so the objects are at the root, but unwind only gives me the first 20 items of the array.
I tried using $group within the result array, but that gives me an id field with an array of all the ids and two other fields with arrays of their values.
{
  "$group": {
    "_id": "$result.id",
    "fields": {
      "$push": {"bind": "$result.bind", "pres": "$result.pres"}
    }
  }
}
I don't know how to separate them out so I can recombine them. I also saw some similar problems solved using $map, but I couldn't wrap my head around it.
I was able to figure out how to do it. I used $lookup with a pipeline to get the right format.
The $lookup added the result to every object of the original query. Then I used $project with $filter to find the correct value from the second query. Then I used $addFields with $arrayElemAt to get the value I wanted, along with another $project to keep only the values I needed. It wasn't very pretty, though.
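For comparison, here is a sketch of a merge done entirely inside the original $facet pipeline, using $concatArrays, $unwind, and $group with $mergeObjects; it assumes the element shapes shown above ({id, binding} and {id, presenting}) and is an illustration, not the author's exact pipeline:

const combined_agg = [
  { "$facet": { "bind": opp_bind, "pres": opp_pres } },
  // one flat array of {id, binding} and {id, presenting} objects
  { "$project": { "result": { "$concatArrays": ["$bind", "$pres"] } } },
  { "$unwind": "$result" },
  // fold the partial objects that share an id into one document
  { "$group": { "_id": "$result.id", "merged": { "$mergeObjects": "$result" } } },
  // default whichever side was missing for this id to zero
  { "$addFields": {
      "merged.binding": { "$ifNull": ["$merged.binding", 0] },
      "merged.presenting": { "$ifNull": ["$merged.presenting", 0] }
  } },
  { "$replaceRoot": { "newRoot": "$merged" } }
];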

Return first n elements from an array in elasticsearch query

I have an array field in my documents, named IP, which contains over 10000 IPs as elements.
For example:
IP: ["192.168.a:A", "192.168.a:B", "192.168.a:C", "192.168.A:b" ...........]
Now I make a search query with some filters and get results, but the problem is that the result size is huge because of the above field.
I want to fetch only N IPs from the array, let's say only 10; order doesn't matter.
So how do I do that?
Update:
Apart from the IP field there are other fields too, and I applied the filters on those fields, not on IP. I want the whole document that satisfies the filters; I just want to limit the number of elements in the single IP field. (Let me know if there is any other way apart from using a script, too.)
This kind of request could solve your problem:
GET ips/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "truncate_ip": {
      "script": {
        "source": """
          // copy at most the first 10 entries, guarding against
          // documents that hold fewer than 10 IPs
          def ips = params['_source']['IP'];
          def n = ips.size() < 10 ? ips.size() : 10;
          def trunc_ip = new ArrayList();
          for (int i = 0; i < n; ++i) {
            trunc_ip.add(ips[i]);
          }
          return trunc_ip;
        """
      }
    }
  }
}
You can use script_fields to generate a new field from existing fields in Elasticsearch. Details are added as comments.
GET indexName/_search
{
  "_source": {
    "excludes": "ips" // <======= exclude the IP field from _source (change the name based on your document)
  },
  "query": {
    "match_all": {} // <========== define relevant filters
  },
  "script_fields": {
    "limited_ips": { // <========= add a new scripted field
      "script": {
        "source": "params['_source'].ips.stream().limit(2).collect(Collectors.toList())" // <==== replace 2 with the number of IPs you want in the result
      }
    }
  }
}
Note:
If you remove _source, then only the scripted field will be part of the result.
Apart from accessing the value of the field, the rest of the syntax is Java. Change it as it suits you.
For anything other than analyzed text fields, use doc['fieldName'] to access the field within a script; it is faster. See the excerpt from the ES docs below:
By far the fastest, most efficient way to access a field value from a script is to use the doc['field_name'] syntax, which retrieves the field value from doc values. Doc values are a columnar field value store, enabled by default on all fields except for analyzed text fields.
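A minimal illustration of that doc-values access (status is a hypothetical keyword field; analyzed text fields and whole arrays from _source still need the params['_source'] route used above):

GET indexName/_search
{
  "script_fields": {
    "status_from_doc_values": {
      "script": {
        "source": "doc['status'].value" // read from doc values instead of parsing _source
      }
    }
  }
}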
By default ES returns only 10 matching results, so I am not sure what your search query is and what exactly you want to restrict:
the number of elements in a single IP field, or
the number of documents matching your search.
Please clarify the above and provide your search query to help further.

How can I use a Term Query to search within this array?

I have an index where certain properties of my document can be array-valued IDs, i.e.
POST temp/_doc
{
  "test": ["5T8QLHIBB_kDC9Ugho68", "5T8QLHIBB_kDC9Ugho69"]
}
Now, when I want to search documents containing a certain ID, a term query like this one:
POST temp/_search
{
  "query": {
    "term": {
      "test": "5T8QLHIBB_kDC9Ugho68"
    }
  }
}
gives no results. If I try it with simpler values (e.g. "foo" and "bar" instead of the IDs), it works. How can I solve this?
In this case, instructing Elasticsearch to use the keyword part of the field works:
POST temp/_search
{
  "query": {
    "term": {
      "test.keyword": "5T8QLHIBB_kDC9Ugho68"
    }
  }
}
gives one hit.
I could have avoided this situation by specifying the mapping upfront like this, instead of letting Elasticsearch guess it based on the document I posted; by default, dynamic mapping indexes a string as a text field (with a .keyword subfield) rather than a keyword field.
POST temp/_mapping
{
  "properties": {
    "test": {
      "type": "keyword"
    }
  }
}
I'd add to your answer that there's no 'complexity' distinction between the string hey and 5T8QLHIBB_kDC9Ugho68 -- they get analyzed equally. The term query failed on the text field because the standard analyzer lowercases tokens at index time, so the exact-case term 5T8QLHIBB_kDC9Ugho68 no longer matches, while all-lowercase values like foo are unaffected.
As to the array aspect which may be confusing at first -- there is no dedicated array data type in ES. You should define the mapping for the array values as strings (or keywords).
Side note: it's recommended not to intermix different types in one particular array -- but you're good to go with ["5T8QLHIBB_kDC9Ugho68", "5T8QLHIBB_kDC9Ugho69"]
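For reference, the mapping that dynamic mapping guesses for such a string array typically looks like this (shown as an assumption, since the thread never prints it), which is why test.keyword exists at all:

{
  "properties": {
    "test": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}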

Using CouchDB-lucene how can I index an array of objects (not values)

Hello everyone and thanks in advance for any ideas, suggestions or answers.
First, the environment: I am using CouchDB (currently developing on 1.0.2) and couchdb-lucene 0.7. Obviously, I am using couchdb-lucene ("c-l" hereafter) to provide full-text searching within couchdb.
Second, let me provide everyone with an example couchdb document:
{
  "_id": "5580c781345e4c65b0e75a220232acf5",
  "_rev": "2-bf2921c3173163a18dc1797d9a0c8364",
  "$type": "resource",
  "$versionids": [
    "5580c781345e4c65b0e75a220232acf5-0",
    "5580c781345e4c65b0e75a220232acf5-1"
  ],
  "$usagerights": [
    {
      "group-administrators": 31
    },
    {
      "group-users": 3
    }
  ],
  "$currentversionid": "5580c781345e4c65b0e75a220232acf5-1",
  "$tags": [
    "Tag1",
    "Tag2"
  ],
  "$created": "/Date(1314973405895-0500)/",
  "$creator": "administrator",
  "$modified": "/Date(1314973405895-0500)/",
  "$modifier": "administrator",
  "$checkedoutat": "/Date(1314975155766-0500)/",
  "$checkedoutto": "administrator",
  "$lastcommit": "/Date(1314973405895-0500)/",
  "$lastcommitter": "administrator",
  "$title": "Test resource"
}
Third, let me explain what I want to do. I am trying to figure out how to index the $usagerights property. I use the word index loosely, because I really do not care about being able to search it; I simply want to store it so that it is returned with the search results. The property is an array of JSON objects, and each object in the array will always have a single property.
Based on my understanding of couchdb-lucene, I need to reduce this array to a comma-separated string. I would expect something like "group-administrators:31,group-users:3" as the final output.
Thus, my question is essentially: how can I reduce the $usagerights JSON array above to a comma-separated string of key:value pairs within the CouchDB design document as used by couchdb-lucene?
A previous question I posted regarding indexing of tagging in a similar situation, provided for reference: How-to index arrays (tags) in CouchDB using couchdb-lucene
Finally, if you need any additional details, please just post a comment and I will provide it.
Maybe I am missing something, but the only difference I see from your previous question is that you should iterate over the objects. The code should then be:
function(doc) {
  var result = new Document(), usage, right;
  for (var i in doc.$usagerights) {
    usage = doc.$usagerights[i];
    for (right in usage) {
      result.add(right + ":" + usage[right]);
    }
  }
  return result;
}
There's no requirement to convert to a comma-separated list of values (I'd be intrigued to know where you picked up that idea).
If you simply want the $usagerights item returned with your results, do this:
ret.add(JSON.stringify(doc.$usagerights),
        {"index": "no", "store": "yes", "field": "usagerights"});
Lucene stores strings, not JSON, so you'll need to JSON.parse the string on query.
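On the query side, that parse might look like this sketch, assuming the usual couchdb-lucene response shape with stored fields under rows[i].fields:

// searchResponseBody is the raw couchdb-lucene search response
var response = JSON.parse(searchResponseBody);
response.rows.forEach(function(row) {
  // undo the JSON.stringify done at index time
  var usagerights = JSON.parse(row.fields.usagerights);
  // usagerights is again an array like [{"group-administrators": 31}, ...]
});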

mongodb - retrieve array subset

What seemed a simple task turned out to be a challenge for me.
I have the following mongodb structure:
{
  (...)
  "services": {
    "TCP80": {
      "data": [{
        "status": 1,
        "delay": 3.87,
        "ts": 1308056460
      }, {
        "status": 1,
        "delay": 2.83,
        "ts": 1308058080
      }, {
        "status": 1,
        "delay": 5.77,
        "ts": 1308060720
      }]
    }
  }
}
Now, the following query returns the whole document:
{ 'services.TCP80.data.ts': {$gt: 1308067020} }
I wonder: is it possible to receive only those "data" array entries matching the $gt criteria (a kind of shrunken doc)?
I was considering MapReduce, but could not locate even a single example of how to pass external arguments (the timestamp) to the Map() function. (This feature was added in 1.1.4: https://jira.mongodb.org/browse/SERVER-401)
Also, there's always the alternative of writing a stored JavaScript function, but since we speak of large quantities of data, db locks can't be tolerated here.
Most likely I'll have to redesign the structure to something 1-level deep, like:
{
  status: 1, delay: 3.87, ts: 1308056460, service: "TCP80"
}, {
  status: 1, delay: 2.83, ts: 1308058080, service: "TCP80"
}, {
  status: 1, delay: 5.77, ts: 1308060720, service: "TCP80"
}
but the DB would grow dramatically, since "service" is only one of many options that would be appended to each document.
Please advise!
Thanks in advance.
In version 2.1 with the aggregation framework you are now able to do this:
1: db.test.aggregate(
2: {$match : {}},
3: {$unwind: "$services.TCP80.data"},
4: {$match: {"services.TCP80.data.ts": {$gte: 1308060720}}}
5: );
You can use custom criteria in line 2 to filter the parent documents. If you don't want to filter them, just leave line 2 out.
This is not currently supported. By default you will always receive the whole document/array unless you use field restrictions or the $slice operator. Currently these tools do not allow filtering the array elements based on the search criteria.
You should watch this request for a way to do this: https://jira.mongodb.org/browse/SERVER-828
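To illustrate what those tools can and cannot do: $slice trims the array by position only, not by your $gt criteria (a sketch against the structure above, with 3 as an arbitrary count):

// returns each matching document with only the first 3 "data" entries,
// regardless of their ts values
db.test.find(
  { "services.TCP80.data.ts": { $gt: 1308067020 } },
  { "services.TCP80.data": { $slice: 3 } }
);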
I'm attempting to do something similar. I tried your suggestion of using the group function, but I couldn't keep the embedded documents separate, or I was doing something incorrectly.
I needed to pull/get a subset of embedded documents by ID. Here's how I did it using Map/Reduce:
db.parent.mapReduce(
  function(parent_id, child_ids) {
    if (this._id == parent_id)
      emit(this._id, {children: this.children, ids: child_ids});
  },
  function(key, values) {
    var toReturn = [];
    values[0].children.forEach(function(child) {
      // keep only the embedded documents whose id was asked for
      if (values[0].ids.indexOf(child._id.toString()) != -1)
        toReturn.push(child);
    });
    return {children: toReturn};
  },
  {
    mapparams: [
      "4d93b112c68c993eae000001", // example parent id
      ["4d97963ec68c99528d000007", "4debbfd5c68c991bba000014"] // example embedded children ids
    ]
  }
).find()
I've abstracted my collection name to 'parent' and its embedded documents to 'children'. I pass in two parameters: the parent document ID and an array of the embedded document IDs that I want to retrieve from the parent. Those parameters are passed in as the third parameter to the mapReduce function.
In the map function I find the parent document in the collection (which I'm pretty sure uses the _id index) and emit its id and children to the reduce function.
In the reduce function, I take the passed-in document and loop through each of the children, collecting the ones with the desired IDs. Looping through all the children is not ideal, but I don't know of another way to find an embedded document by ID.
I also assume in the reduce function that there is only one document emitted, since I'm searching by ID. If you expect more than one parent_id to match, then you will have to loop through the values array in the reduce function.
I hope this helps someone out there, as I googled everywhere with no results. Hopefully we'll see a built in feature soon from MongoDB, but until then I have to use this.
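A note for readers: besides the mapparams shown above, the documented way to pass external values into map/reduce is the scope option, which exposes them as global variables. A sketch against the TCP80 structure from the question, assuming a collection named test:

var minTs = 1308067020;
db.test.mapReduce(
  function() {
    // minTs comes from the scope option below
    var matching = this.services.TCP80.data.filter(function(d) {
      return d.ts > minTs;
    });
    emit(this._id, { data: matching });
  },
  function(key, values) {
    return values[0]; // one emit per _id, nothing to merge
  },
  {
    scope: { minTs: minTs },
    out: { inline: 1 }
  }
);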
Fadi, as for "keeping embedded documents separate": group should handle this with no issues.
function getServiceData(collection, criteria) {
  var res = db[collection].group({
    cond: criteria,
    initial: {vals: [], globalVar: 0},
    reduce: function(doc, out) {
      if (out.globalVar % 2 == 0)
        out.vals.push(doc.whatever.kind.and.depth);
      out.globalVar++;
    },
    finalize: function(out) {
      if (out.vals.length == 0)
        out.vals = 'sorry, no data';
      return out.vals;
    }
  });
  return res[0];
}
