What is the sequence of fields in a result document - Solr

Does Solr maintain the sequence of fields (dynamic fields) in a result document, i.e. the same sequence that was used when indexing the document?
For example, consider the following record being indexed:
School_txt, Class_txt, Section_txt
When I get this document back as a result, will the sequence of fields be maintained, or can it be arbitrary, like Class_txt, School_txt, Section_txt?
If it can be arbitrary, how can I preserve the sequence of fields?

Yes, the sequence of fields is maintained (at least with 4.9.0) for each document. This is also true for multiValued fields, where the values are returned in the same sequence as they were added (which is useful if you want to merge two fields into a separate value later). Here's an example where I rotated the field sequence while indexing:
{
  "id": "1",
  "School_txt": "School",
  "Class_txt": "Class",
  "Section_txt": "Section1",
  "_version_": 1473987528354693000
},
{
  "id": "2",
  "Class_txt": "School2",
  "Section_txt": "Class2",
  "School_txt": "Section2",
  "_version_": 1473987528356790300
},
{
  "id": "3",
  "Section_txt": "School3",
  "School_txt": "Class3",
  "Class_txt": "Section3",
  "_version_": 1473987528356790300
}
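For reference, a minimal way to index documents in a chosen field order is to POST them to the JSON update handler. A sketch, assuming a collection named mycollection (the collection name and commit parameter are illustrative, not from the original post):

curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[
        { "id": "1", "School_txt": "School",  "Class_txt": "Class",    "Section_txt": "Section1" },
        { "id": "2", "Class_txt":  "School2", "Section_txt": "Class2", "School_txt": "Section2" }
      ]'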

MongoDB filter a sub-array of Objects

The document structure has a round collection, which embeds an array of hole objects, one per hole played/scored.
The structure looks like this (there are more fields, but this summarises it):
{
  "_id": { "$oid": "60701a691c071256e4f0d0d6" },
  "schema": { "$numberDecimal": "1.0" },
  "playerName": "T Woods",
  "comp": {
    "id": { "$oid": "607019361c071256e4f0d0d5" },
    "name": "US Open",
    "tees": "Pro Tees",
    "roundNo": { "$numberInt": "1" },
    "scoringMethod": "Stableford"
  },
  "holes": [
    {
      "holeNo": { "$numberInt": "1" },
      "holePar": { "$numberInt": "4" },
      "holeSI": { "$numberInt": "3" },
      "holeGross": { "$numberInt": "4" },
      "holeStrokes": { "$numberInt": "1" },
      "holeNett": { "$numberInt": "3" },
      "holeGrossPoints": { "$numberInt": "2" },
      "holeNettPoints": { "$numberInt": "3" }
    }
  ]
}
In the Atlas web UI the document displays similarly (note there are 9 holes in this particular round of golf; the example was limited to 3 for brevity).
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf (i.e. a birdie on par 3 or better).
Being new to MongoDB and NoSQL constructs, I am stuck with this. Reading around the aggregation pipeline framework, I have tried to break down the stages I will need as:
Filter by the comp.id and comp.roundNo
Filter this result with any hole within the holes array of Objects
Maybe I have approached this wrong, and should filter or structure this pipeline differently?
So far, using the Atlas web UI, I can apply these filters individually as:
{
  "comp.id": ObjectId("607019361c071256e4f0d0d5"),
  "comp.roundNo": 2
}
And:
{ "holes.0.holeGross": 2 }
But I have 2 problems:
For the second filter query, I have hard-coded the array index to get this value. I would need to search across all the sub-elements of every document that matches this comp.id && comp.roundNo.
How do I combine these? I'm presuming this is where the aggregation comes in, as well as enumerating across the whole array (as above).
I note in particular that it is the extra ".0." part of the second query that I am not seeing in the various other online postings trying to do the same thing. Is my data structure incorrect? Do I need the [0]...[17] objects for an 18-hole round of golf?
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf
If that is the goal, a simple $lte search inside the holes array, like the following, would do:
db.collection.find({ "holes.holeGross": { $lte: 2 } })
You simply have to omit the array index (such as 0) from the property path in order to search every element of the array.
https://mongoplayground.net/p/KhZLnj9mJe5
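To combine this with the competition filters from the question, a minimal sketch using the comp.id and comp.roundNo values shown above:

db.collection.find({
  "comp.id": ObjectId("607019361c071256e4f0d0d5"),
  "comp.roundNo": 2,
  "holes.holeGross": { "$lte": 2 }
})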

How to get the total word count per document in Solr?

I would like to retrieve some summary statistics from the text documents I have indexed in Solr. In particular, the word count per document.
For example, I have the following three documents indexed:
{
  "id": "1",
  "text": ["This is the text in document 1"]
},
{
  "id": "2",
  "text": ["some text in document 2"]
},
{
  "id": "3",
  "text": ["and document 3"]
}
I would like to get the total number of words for each individual document:
"1",7,
"2",5,
"3",3,
What query can I use to get such a result?
I am new to Solr and I am aware that I can use facets to get the count of the individual words over all documents using something like:
http://localhost:8983/solr/corename/select?q=*&facet=true&facet.field=text&facet.mincount=1
But how to get the total word count per document is not clear to me.
I appreciate your help!
If you do a faceted search over id and an inner facet over text, the inner facet count will give the number of words in the document with that id. Note that the text field type must be text_general or something equivalent (i.e. tokenized).
If you only want to count "distinct" words per document id, it is actually much easier:
{
  "query": "*:*",
  "facet": {
    "document": {
      "type": "terms",
      "field": "id",
      "facet": {
        "wordCount": "unique(text)"
      }
    }
  }
}
This gives the distinct word count per document. The following gives all words and their counts per document, but it is up to you to sum them to get the total amount (it is also an expensive call):
{
  "query": "*:*",
  "facet": {
    "document": {
      "type": "terms",
      "field": "id",
      "facet": {
        "wordCount": {
          "type": "terms",
          "field": "text",
          "limit": -1
        }
      }
    }
  }
}
@MatsLindth's comment is something to consider too: Solr and you might not agree on what counts as a "word". The tokenizer is configurable to a point, but depending on your needs it might not be very easy.
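Either request can be sent to Solr's JSON Facet API; a minimal sketch, assuming a core named corename as in the question's URL:

curl http://localhost:8983/solr/corename/query -d '{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "document": {
      "type": "terms",
      "field": "id",
      "facet": { "wordCount": "unique(text)" }
    }
  }
}'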

Change dynamic field title in Solr select query

I am using Solr to retrieve results from a Cassandra table.
Table structure:
CREATE TABLE mytable (
  field1 uuid,
  field2 text,
  bfield blob,
  custmdata_ map<text, text>,
  PRIMARY KEY (field1)
);
Table content:
INSERT INTO mytable (field1, field2, custmdata_) VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'test1', { 'custmdata_data1' : 'data1value', 'custmdata_data2' : 'data2value' });
INSERT INTO mytable (field1, field2, custmdata_) VALUES (e26690db-dd54-4b61-b002-d3c07125f359, 'test2', { 'custmdata_data5' : 'data5value', 'custmdata_data1' : 'mydata1value' });
I am able to retrieve the results using a Solr query:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "field1": "62c36092-82a1-3a00-93d1-46196ee77204",
        "field2": "test1",
        "custmdata_data1": "data1value",
        "custmdata_data2": "data2value"
      },
      {
        "field1": "e26690db-dd54-4b61-b002-d3c07125f359",
        "field2": "test2",
        "custmdata_data5": "data5value",
        "custmdata_data1": "mydata1value"
      }
    ]
  }
}
Is there any way to specify the field name in the result so that I can retrieve the dynamic fields without the field name prefix? I need a result like this:
{
  "responseHeader": {
    "status": 0,
    "QTime": 1
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "field1": "62c36092-82a1-3a00-93d1-46196ee77204",
        "field2": "test1",
        "data1": "data1value",
        "data2": "data2value"
      },
      {
        "field1": "e26690db-dd54-4b61-b002-d3c07125f359",
        "field2": "test2",
        "data5": "data5value",
        "data1": "mydata1value"
      }
    ]
  }
}
Update:
From the DataStax documentation, I found that:
Avoid or limit the use of dynamic fields. Lucene allocates memory for each unique field (column) name, so if you have a row with columns A, B, C, and another row with B, D, E, Lucene allocates 5 chunks of memory. For millions of rows, the heap is unwieldy.
So is there a better way to achieve dynamic field based filtering in Solr? What I need is to filter against custom fields that may vary for each insert.
Instead of calling your dynamic field custmdata_, call it data and that should get rid of the bit you don't want.
Otherwise, removing the prepended dynamic field label is not supported, but you can rename returned fields with fl: https://wiki.apache.org/solr/CommonQueryParameters#Field_alias
If you're worried about having too many dynamic fields, try to work around it with some C* collection type, if your scenario lends itself to that.
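For the fl approach, a minimal sketch of the field alias syntax (displayName:fieldName), using the field names from the question; note that every alias has to be listed explicitly:

q=*:*&fl=field1,field2,data1:custmdata_data1,data2:custmdata_data2,data5:custmdata_data5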

Need help in Reading Matching Array Strings from MongoDB

I am using the Mongo C APIs to implement a DB interface for a project.
MongoDB Document:
{
  _id,
  ...  // some fields
  ...
  Array
}
Query Used to Populate Values in the Array:
BCON_APPEND (&query, "$push", "{",
             "Array", "{",
                 "Key", key,
                 "Value", (char *) value,
             "}",
             "}");
Populated Array inside MongoDB:
"Array" : [ { "Key": "key1", "Value" : "string1" },
{ "Key": "key2", "Value" : "string2" },
{ "Key": "key3", "Value" : "string3" },
]
Query used to find the matching row in the array:
BCON_APPEND (&query, "Array", "{",
             "$elemMatch", "{",
                 "Key", key,
             "}",
             "}");
This query returns the complete document that contains a matching key in the array, which is fine.
Problem:
I am reading each field of the document returned by this query, one by one.
When I encounter the Array field in the document, my requirement is to get ONLY the matched row.
I tried to read the Array as follows:
uint32_t *document_len;
const uint8_t **document;
bson_iter_recurse (&iter, &sub_iter))
{
while (bson_iter_next (&sub_iter))
{
bson_iter_document (sub_iter,
&document_len,
document)
// Suppose my "Key" was: "key2"
// How to get the matching row: { "Key": "key2", "Value" : "string2" } as a String here ?
// Also, I want to receive ONLY matching row -- & NOT all 3 rows
}
}
I am not able to read the string values from this Array, and I am also not able to get only the matching row rather than all 3 rows.
[Note: if I put this trace in the while() loop above:
while (bson_iter_next (&sub_iter))
{
    printf ("Found key \"%s\" in sub document.\n",
            bson_iter_key (&sub_iter));
}
I get 3 prints:
Found key 0 in sub document
Found key 1 in sub document
Found key 2 in sub document
So it is clear that I am getting all values from the array, NOT only the matching one, and I cannot retrieve the actual strings from the array.]
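For what it's worth, one way to have the server return only the matching element (instead of filtering client-side) is an $elemMatch projection passed through the find options. A minimal sketch with mongoc; the collection handle and the key variable are illustrative, and query is the $elemMatch filter built above:

/* Projection { "Array": { "$elemMatch": { "Key": key } } } asks the server
 * to return only the first array element whose Key matches. */
bson_t *opts = BCON_NEW ("projection", "{",
                             "Array", "{",
                                 "$elemMatch", "{", "Key", BCON_UTF8 (key), "}",
                             "}",
                         "}");
mongoc_cursor_t *cursor =
    mongoc_collection_find_with_opts (collection, query, opts, NULL);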
References:
Mongo C APIs https://api.mongodb.org/c/current/
libbson https://api.mongodb.org/libbson/current/bson_iter_t.html
Please help.

MongoDB embedded vs array sub document performance

Given the below competing schemas, with up to 100,000 friends, I’m interested in finding the most efficient for my needs.
Doc1 (Index on user_id)
{
  "_id" : "…",
  "user_id" : "1",
  "friends" : {
    "2" : {
      "id" : "2",
      "mutuals" : 3
    },
    "3" : {
      "id" : "3",
      "mutuals" : "1"
    },
    "4" : {
      "id" : "4",
      "mutuals" : "5"
    }
  }
}
Doc2 (Compound multi key index on user_id & friends.id)
{
  "_id" : "…",
  "user_id" : "1",
  "friends" : [
    {
      "id" : "2",
      "mutuals" : 3
    },
    {
      "id" : "3",
      "mutuals" : "1"
    },
    {
      "id" : "4",
      "mutuals" : "5"
    }
  ]
}
I can’t seem to find any information on the efficiency of sub-field retrieval. I know that Mongo implements data internally as BSON, so I’m wondering whether that means a projection lookup is a binary-search O(log n)?
Specifically, given a user_id to find whether a friend with friend_id exists, how would the two different queries on each schema compare? (Assuming the above indexes) Note that it doesn’t really matter what’s returned, only that not null is returned if the friend exists.
Doc1col.find({user_id : "…"}, {"friends.friend_id" : 1})
Doc2col.find({user_id : "…", "friends.id" : "friend_id"}, {"_id":1})
Also of interest is how the $set modifier works. For schema 1, given the query Doc1col.update({user_id : "…"}, {"$set" : {"friends.friend_id.mutuals" : 5}}), how does the lookup on friends.friend_id work? Is this an O(log n) operation (where n is the number of friends)?
For schema 2, how would the query Doc2col.update({user_id : "…", "friends.id" : "friend_id"}, {"$set" : {"friends.$.mutuals" : 5}}) compare to that of the above?
Doc1 is preferable if your primary requirement is to present data to the UI in a nice manageable package. It's simple to filter only the desired data using a projection: {}, {"friends.2" : 1}.
Doc2 is your strongest match, since your use case does not care about what's returned ("Note that it doesn’t really matter what’s returned") and indexing will speed up the fetch.
On top of that, Doc2 permits the much cleaner syntax
db.doc2.findOne({ user_id: 1, "friends.id": 2 })
versus
db.doc1.findOne({ $and : [{ user_id: 1 }, { "friends.2" : { $exists: true } }] })
On a final note, however, one can create a sparse index on Doc1 (and use $exists), but with your possibility of 100,000 friends -- where each friend would need its own sparse index -- that becomes absurd. This is as opposed to a reasonable number of entries, say demographics: gender [male, female], age groups [0-10, 11-16, 25-30, ...] or more important things [gin, whisky, vodka, ...].
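For completeness, the compound multikey index assumed by the Doc2 queries above can be created like this (a sketch, matching the answer's collection naming):

db.doc2.createIndex({ user_id: 1, "friends.id": 1 })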
