I have a dataset which has a boolean attribute, like:
{
  name: 'Steven',
  isQualified: true
}
And I want to count both sides of the partition, that is, how many documents are qualified and how many are not. What's the best way to do this with a single RethinkDB query?
Here's an example with underscore.js, but it relies on querying all the documents and processing them in my app:
results = _.partition(data, 'isQualified').map(_.iteratee('length'))
At the moment I have this, but it feels inefficient and I'm assuming/hoping there's a better way to do it.
r.expr({
  'qualified': r.table('Candidate').filter({isQualified: true}).count(),
  'unqualified': r.table('Candidate').filter({isQualified: false}).count()
})
How could I improve this and make it more DRY?
Create an index on isQualified
r.table('Candidate').indexCreate("isQualified");
Then use it to count
r.expr({
  'qualified': r.table('Candidate').getAll(true, {index: "isQualified"}).count(),
  'unqualified': r.table('Candidate').getAll(false, {index: "isQualified"}).count()
})
This is way faster as the server doesn't have to iterate through all the documents, but just has to traverse a B-tree.
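If you also want to keep it DRY on the client side, one option (just a sketch, assuming the official RethinkDB JavaScript driver and an open connection conn) is to wrap the repeated index lookup in a small helper:
// Sketch: a helper that avoids repeating the table/index boilerplate.
// Assumes the RethinkDB JavaScript driver and an open connection `conn`.
var candidates = r.table('Candidate');
function countByQualified(value) {
  return candidates.getAll(value, { index: 'isQualified' }).count();
}
r.expr({
  qualified: countByQualified(true),
  unqualified: countByQualified(false)
}).run(conn).then(function (counts) {
  console.log(counts); // e.g. { qualified: 120, unqualified: 35 }
});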
I am a beginner with MongoDB and I made a database with the following structure: some metadata fields and one field that contains the experiment results (a vector of integers with ~150,000 values).
status = db.DataTest.insert_one(
    {
        "person_num": num,
        "life_cycle": cycle,
        "other_metadata": meta_data,
        "results_of_experiment": big_array
    }
)
I inserted something like 7500 of those documents. They occupy about 8GB, and find operations are really slow.
I don't need to search by the experiment results; I only need the option to retrieve them from the DB as a chunk of data.
Is there a better way to store the experiment results in the DB?
Is GridFS relevant to this case, and would it be too complicated?
Based on your comments, the most common query is
db.DataTest.find( { "life_cycle": { $gt: 800 } }).limit(5)
Without an index on the life_cycle field, MongoDB is forced to do a collection scan. That is, fetch & evaluate all documents in your collection one by one. In a large collection, this will take a long time.
MongoDB does not create indexes automatically (apart from the default index on _id). You would have to observe your most common queries and create indexes to support those queries. As far as I know, there is no automatic creation of such indexes in any database software; SQL, NoSQL, or otherwise.
Database indexing is a deep subject and cannot be explained in a short answer.
Having said that, if you create an index on the life_cycle field, it should improve your query times but only for the query you posted above. Other query types would likely require different indexes. You can do so in the mongo shell:
db.DataTest.createIndex({life_cycle: 1})
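If you want to confirm the index is actually being used, you can inspect the query plan with explain(); with the index in place, the winning plan should show an IXSCAN stage instead of a COLLSCAN:
// Check the query plan for the query above; after creating the index,
// the winning plan should use an IXSCAN on life_cycle rather than a COLLSCAN.
db.DataTest.find( { "life_cycle": { $gt: 800 } }).limit(5).explain("executionStats")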
I encourage you to read these pages to understand more about indexing in MongoDB:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/
I'm new to MongoDB and trying to design the way I store my data so I can do the kinds of queries I want to. Say I have a document that looks like:
{
  "foo": ["foo1", "foo2", "foo3"],
  "bar": "baz"
}
Where the array "foo" is always of length 3, and the order of the items is meaningful. I would like to be able to make a query that searches for all documents where "foo2" == something. Essentially I want to treat "foo" like any old array and be able to index into it in a search, so something like "foo"[1] == something.
Does MongoDB support this? Would it be more correct to store my data like this:
{
  "foo": {
    "foo1": "val1",
    "foo2": "val2",
    "foo3": "val3"
  },
  "bar": "baz"
}
instead? Thanks.
The schema you have asked about is fine.
To insert at a specific index of the array:
Use the $position operator.
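A minimal sketch ($position is used together with $push and $each; the users collection and the filter are only placeholders):
// Insert "newFoo" at index 1 of the foo array ($position requires $each).
// The collection name and filter below are only illustrative.
db.users.update(
  { "bar": "baz" },
  { $push: { "foo": { $each: ["newFoo"], $position: 1 } } }
)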
To query at a specific index location:
Use dot notation with the numeric index (key.index). As in:
db.users.find({"foo.1":"foo2"})
I'm trying to write the most optimal query to find all of the documents that do not have a specific field. Is there any better way to do this than the examples I have listed below?
// Get the ids of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).pluck("id")
// Get a count of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).count()
Right now, these queries take about 300-400ms on a table with ~40k documents, which seems rather slow. Furthermore, in this specific case, the "location" attribute contains latitude/longitude and has a geospatial index.
Is there any way to make this faster? Thanks!
A naive suggestion
You could use the hasFields method along with the not method to filter out unwanted documents:
r.db("mydb").table("mytable")
.filter(function (row) {
return row.hasFields({ location: true }).not()
})
This might or might not be faster, but it's worth trying.
Using a secondary index
Ideally, you'd want to make location a secondary index and then use getAll or between, since queries that use indexes are always faster. A way you could work around that is to give every row in your table a false value for location if it doesn't have one. Then you would create a secondary index for location. Finally, you can query the table using getAll as much as you want!
Adding a location property to all fields without a location
For that, you'd first need to set location: false on all rows without a location. You could do this as follows:
r.db("mydb").table("mytable")
.filter(function (row) {
return row.hasFields({ location: true }).not()
})
.update({
location: false
})
After this, you would need to find a way to insert location: false every time you add a document without a location.
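A minimal sketch of that in application code (newDocument is a hypothetical placeholder for whatever document you are about to insert):
// Default the field in application code before inserting.
// `newDocument` is a hypothetical placeholder, not part of the original question.
if (!('location' in newDocument)) {
  newDocument.location = false;
}
r.db("mydb").table("mytable").insert(newDocument)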
Create secondary index for the table
Now that all documents have a location field, we can create a secondary index for location.
r.db("mydb").table("mytable")
.indexCreate('location')
Keep in mind that you only have to add the { location: false } values and create the index once.
Use getAll
Now we can just use getAll to query documents using the location index.
r.db("mydb").table("mytable")
.getAll(false, { index: 'location' })
This will probably be faster than the query above.
Using a secondary index (function)
You can also create a secondary index as a function. Basically, you create a function and then query the results of that function using getAll. This is probably easier and more straightforward than what I proposed before.
Create the index
Here it is:
r.db("mydb").table("mytable")
.indexCreate('has_location',
function(x) { return x.hasFields('location');
})
Use getAll.
Here it is:
r.db("mydb").table("mytable")
.getAll(false, { index: 'has_location' })
The following document records a conversation between Milhouse and Bart. I would like to insert a new message with the right num (the next in the example would be 3) in a single operation. Is that possible?
{ user_a:"Bart",
user_b:"Milhouse",
conversation:{
last_msg:2,
messages:[
{ from:"Bart",
msg:"Hello"
num:1
},
{ from:"Milhouse",
msg:"Wanna go out ?"
num:2
}
]
}
}
In MongoDB, arrays keep their order, so by adding a num attribute you're only creating more data for something you could accomplish without the additional field. Just use the position in the array instead. Grabbing the Xth message in the array will be faster than searching for { num: X }.
If you want to keep num anyway, I don't think there's an easy way to add it besides doing a find() on conversation.last_msg before you insert the new subdocument and then incrementing last_msg.
Depending on what you need the ordering for, you might consider including a timestamp in your subdocument; it's commonly kept in conversation records anyway and may provide other useful information.
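For example, a sketch of appending a message with a timestamp and then reading a message back by its array position (the conversations collection name and the sent_at field are assumptions, not part of your document):
// Append a new message with a timestamp instead of a num field.
db.conversations.update(
  { user_a: "Bart", user_b: "Milhouse" },
  { $push: { "conversation.messages": { from: "Bart", msg: "Sure!", sent_at: new Date() } } }
)
// Read the second message (array index 1) by position using a $slice projection.
db.conversations.find(
  { user_a: "Bart", user_b: "Milhouse" },
  { "conversation.messages": { $slice: [1, 1] } }
)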
Also, I haven't used it, but there's a Mongoose plugin that may or may not be able to do what you want: https://npmjs.org/package/mongoose-auto-increment
You can't create an auto-increment field, but you can use functions to generate and manage sequences:
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
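That tutorial boils down to keeping a separate counters collection and atomically incrementing it with findAndModify. A rough sketch (the counters collection and the sequence name are only illustrative):
// Rough sketch of the counters-collection pattern from the linked tutorial.
function getNextSequence(name) {
  var counter = db.counters.findAndModify({
    query: { _id: name },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
  });
  return counter.seq;
}
// Then use it when appending a message, e.g. num: getNextSequence("bart_milhouse_msgs")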
I would recommend using a timestamp rather than a numerical value. By using a timestamp, you keep the ordering of the subdocuments and can make other use of it.
The documentation for the IN query operation states that those queries are implemented as a big OR'ed equality query:
qry = Article.query(Article.tags.IN(['python', 'ruby', 'php']))
is equivalent to:
qry = Article.query(ndb.OR(Article.tags == 'python',
                           Article.tags == 'ruby',
                           Article.tags == 'php'))
I am currently modelling some entities for a GAE project and plan on using these membership queries with a lot of possible values:
qry = Player.query(Player.facebook_id.IN(list_of_facebook_ids))
where list_of_facebook_ids could have thousands of items.
Will this type of query perform well with thousands of possible values in the list? If not, what would be the recommended approach for modelling this?
This won't work with thousands of values (in fact I bet it starts degrading with more than 10 values). The only alternative I can think of is some form of precomputation. You'll have to change your schema.
One way you can do it is to create a new model, FacebookPlayer, which acts as an index keyed by facebook_id. You would update it whenever you add a new player. It looks something like this:
# Keyed by facebook_id; the kind name must match the ndb.Key('FacebookPlayer', ...) calls below.
class FacebookPlayer(ndb.Model):
    player = ndb.KeyProperty(kind='Player', required=True)
Now you can avoid queries altogether. You can do this:
# Build keys from facebook ids.
facebook_id_keys = []
for facebook_id in list_of_facebook_ids:
    facebook_id_keys.append(ndb.Key('FacebookPlayer', facebook_id))

keysOfUsersMatchedByFacebookId = []
for facebook_player in ndb.get_multi(facebook_id_keys):
    if facebook_player:
        keysOfUsersMatchedByFacebookId.append(facebook_player.player)

usersMatchedByFacebookId = ndb.get_multi(keysOfUsersMatchedByFacebookId)
If list_of_facebook_ids is thousands of items, you should do this in batches.