What performance considerations are there for storing and querying around 100 million documents, each consisting of around 500,000 floating-point, integer, and (short) string attributes, in a single MongoDB collection? What kind of hardware would be required? Is there a better solution than MongoDB for this?
I expect that most queries will involve retrieving only a few hundred attributes from a few hundred documents per query. The kind of query that I want to optimize for would be like:
db.theobjects.find( { object_id: { $in: [ "12345", "4567", "45637", ..., "object_idn" ] } }, { attr1: 1, attr2: 1, ..., attrn: 1 } )
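For a query of this shape, the usual starting point would be an index on the lookup field. A minimal sketch, using the collection and field names from the example above:

db.theobjects.createIndex( { object_id: 1 } )

With that index the $in lookup turns into a set of index seeks, and the projection limits what is sent back per document. Note, though, that a document with ~500,000 attributes still has to be read in full on the server before the projection is applied, so document size remains a dominant cost.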
I am planning to create a Collection of "users" in MongoDB consisting of documents like this:
{
    "name": "Example",
    "age": 100,
    "data": [
        {
            "key1": value1,
            "key2": value2,
            "key3": value3
        },
        {
            "key1": value1,
            "key2": value2,
            "key3": value3
        },
        {
            "key1": value1,
            "key2": value2,
            "key3": value3
        }
    ]
}
with the "data"-array containing tens of thousands of entries. All entries have the same structure with identical keys. Does MongoDB really store the Strings I use as keys for every single element in the array? As the values are small numbers, the repetetive keys would consume more storage space than the values.
I am uncertain whether to use a schemafull oder schemaless db for this. The elements in "data" are added one by one (1 every second) and in series (no random writes). If I want to do something with the "data", I always need all data-entries of a user at the same time. If I would use a schemafull database, I would have a giant table with the entries of "data" associating them with a userId. It feels a bit wrong to write thousands of entries in individual rows and always accessing them together. Is there a good way to basically do a relation-free schemaful table inside of MongoDB documents? Or a schemaful database to store thousands of entries grouped together?
Do you have ideas or suggestions?
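One common way to avoid paying for the repeated keys (a sketch only; the column-wise layout and the db.users collection name are assumptions, not something MongoDB requires) is to store the entries as parallel arrays, so each key string is stored once per document instead of once per entry:

{
    "name": "Example",
    "age": 100,
    "key1": [value1, value1, value1],
    "key2": [value2, value2, value2],
    "key3": [value3, value3, value3]
}

Appending one entry per second is then a single update, e.g. db.users.updateOne({ name: "Example" }, { $push: { key1: v1, key2: v2, key3: v3 } }), and reading all of a user's data is still one document fetch.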
I'm working on an API and need fast queries for my MongoDB. I have objects like:
{
    "_id": "123456",
    "values": [
        {
            "value": "A",
            "begin": 0,
            "end": 1
        },
        {
            "value": "B",
            "begin": 1,
            "end": 2
        },
        {
            "value": "C",
            "begin": 3,
            "end": 7
        }
    ],
    "name": "test"
}
I have about 6k documents in my database, and I want to unwind and group the values so I can count how many distinct values I have. After the unwind operation I have about 32 million documents.
I tried several variations of the aggregation, but every time it is slow; it always takes over 20 seconds.
I wanted to try indexes, but I can't create an index that makes it any faster. I tried it with the MongoDB Compass app, but I think I'm doing something wrong while creating the index.
Hope someone can help me.
Greetings
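For reference, the unwind-and-group pipeline described above would look roughly like this (a sketch; db.mycollection is a hypothetical collection name, the field names come from the example document):

db.mycollection.aggregate([
    { $unwind: "$values" },
    { $group: { _id: "$values.value", count: { $sum: 1 } } }
], { allowDiskUse: true })

Because $unwind and $group have to touch every document, an ordinary index on values.value usually does not speed this up; the time goes into scanning and regrouping the ~32 million unwound entries.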
Well. Here's the DB schema/architecture problem.
Currently in our project we use MongoDB. We have one DB with one collection. Overall there are almost 4 billion documents in that collection (the number stays roughly constant). Each document has a unique ID and a lot of different information related to that ID (that's why MongoDB was chosen: the data varies a lot, so schemaless is a good fit).
{
    "_id": ObjectId("5c619e81aeeb3aa0163acf02"),
    "our_id": 1552322211,
    "field_1": "Here is some information",
    "field_a": 133,
    "field_c": 561232,
    "field_b": {
        "field_0": 1,
        "field_z": [45, 11, 36]
    }
}
The purpose of that collection is to store a lot of data that is easy to update (some data is updated every day, some once a month) and to search over different fields to retrieve the ID. We also store the "history" of each field (and we need to be able to search over that history as well). So when these over-time updates were turned on, we ran into MongoDB's 16 MB maximum document size limit.
We've tried several workarounds (like splitting the document), but all of them involve either a $group or a $lookup stage in the aggregation (grouping by id, see the example and the sketch after it below), and neither can use indexes, which makes searching over several fields extremely slow.
{
    "_id": ObjectId("5c619e81aeeb3aa0163acd12"),
    "our_id": 1552322211,
    "field_1": "Here is some information",
    "field_a": 133
}
{
    "_id": ObjectId("5c619e81aeeb3aa0163acd11"),
    "our_id": 1552322211,
    "field_c": 561232,
    "field_b": {
        "field_0": 1,
        "field_z": [45, 11, 36]
    }
}
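A sketch of the recombine-then-filter pipeline that this splitting forces on us (our approximation; db.collection stands in for the real collection name, and $mergeObjects assumes MongoDB 3.6+):

db.collection.aggregate([
    { $group: { _id: "$our_id", doc: { $mergeObjects: "$$ROOT" } } },
    { $replaceRoot: { newRoot: "$doc" } },
    { $match: { field_1: "a", field_c: { $ne: 320 } } }
])

The $match can only run after $group has already processed every document, so no index on field_1 or field_c is ever used.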
Also, we can't put a $match stage before those, because the search can include logical operators (like field_1 = 'a' && field_c != 320, where field_1 is in one document and field_c is in another, so the search must happen after the documents have been grouped/joined together), and the logical expression can be very complex.
So are there any tricky workarounds? If not, what other DBs can you suggest moving to?
Kind regards.
Okay, so after some time spent testing different approaches, I finally ended up using Elasticsearch, because there is no way to perform the requested searches through MongoDB in an adequate amount of time.
I've got a MongoDB collection where a particular string may appear in any of a number of fields:
{"_id":1, "field1": "foo", "field2": "bar", "field3": "baz", "otherfield": "stuff"},
{"_id":2, "field1": "bar", "field2": "baz", "field3": "foo", "otherfield": "morestuff"},
{"_id":3, "field1": "baz", "field2": "foo", "field3": "bar", "otherfield": "you get the idea"}
I need a query that returns all records where any one of a set of fields is equal to any value in an array ... basically, if I have ["foo","bar"] I need it to match if either of those strings is in field1 or field2 (but not in any other field).
Obviously I can do this with a series of separate queries:
db.collection.find({"field1":{"$in":["foo","bar"]}})
db.collection.find({"field2":{"$in":["foo","bar"]}})
etc., and I've also built a very large $or query that concatenates them all together, but it seems far too inefficient (my actual collection needs to match any of 15 strings that can occur in any of 9 fields) ... but I'm still new to NoSQL databases and am not sure of the best approach to use here. Any help is greatly appreciated.
Try:
db.collection.find(
    // Find documents matching any of these values
    { $or: [
        { "field1": { "$in": [ "foo", "bar" ] } },
        { "field2": { "$in": [ "foo", "bar" ] } }
    ] }
)
Also refer to this question.
Found another answer by poring over the documentation that seems to hit a sweet spot: text indexes.
db.collection.ensureIndex({"field1":"text","field2":"text"})
db.collection.runCommand("text", { search: "foo bar" })
When I run my actual query with many more strings and fields (and about 100,000 records), the $or/$in approach takes 620 milliseconds while the text index takes 131 milliseconds. The one drawback is that it returns a different type of result document; luckily the actual documents are included as a field of each result object.
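On current MongoDB versions the standalone text command has been removed and ensureIndex is deprecated; a roughly equivalent query today (a sketch using the same fields as above) would be:

db.collection.createIndex( { "field1": "text", "field2": "text" } )
db.collection.find( { $text: { $search: "foo bar" } } )

A space-separated $search string ORs the terms together, and unlike the old command this returns ordinary documents straight from find().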
Thanks to those who took the time to make suggestions.
I would collect all the relevant fields into one field (e.g. collected) by adding their values like
"foo:field1",
"bar:field2",
"baz:field3",
"stuff:otherfield",
"bar:field1",
"baz:field2"
...
into that field.
If you search for bar existing in any field you can use:
db.collection.find( { collected: { $regex: "^bar" } }, ... );
Your example in the question would look like:
db.collection.find( { collected: { $all: [ "foo:field1", "foo:field2", "bar:field1", "bar:field2" ] } }, ... );
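One way to keep such a collected field up to date (a sketch, assuming MongoDB 4.2+ pipeline-style updates; the field names come from the example documents) is to derive it from the other fields in a single update:

db.collection.updateMany(
    { },
    [ { $set: { collected: [
        { $concat: [ "$field1", ":field1" ] },
        { $concat: [ "$field2", ":field2" ] },
        { $concat: [ "$field3", ":field3" ] },
        { $concat: [ "$otherfield", ":otherfield" ] }
    ] } } ]
)

With a multikey index on collected (db.collection.createIndex({ collected: 1 })), the $all lookup above and the anchored ^bar regex can both use the index.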
I have a non-trivial Solr query, which already involves a filter query and facet calculations over multiple fields. One of the facet fields is a multi-valued integer field that is used to store categories. There are many possible categories and new ones are created dynamically, so using separate fields is not an option.
What I want to do, is to restrict facet calculation over this field to a certain set of integers (= categories). So for example I want to calculate facets of this field, but only taking categories 3,7,9 and 15 into account. All other values in that field should be ignored.
How do I do that? Is there some built-in functionality that can be used to solve this, or do I have to write a custom search component?
The facet.prefix parameter can be set per field for each field specified by the facet.field parameter; you do this by adding a per-field override like f.field_name.facet.prefix.
I don't know of any way to define a facet base that differs from the result set, but you can use facet.query to explicitly define each facet filter, e.g.:
facet.query={!key=3}category:3&facet.query={!key=7}category:7&facet.query={!key=9}category:9&facet.query={!key=15}category:15
Given the Solr schema/data from this gist, the results will contain something like this:
"facet_counts": {
"facet_queries": {
"3": 1,
"7": 1,
"9": 0,
"15": 0
},
"facet_fields": {
"category": [
"2",
2,
"1",
1,
"3",
1,
"7",
1,
"8",
1
]
},
"facet_dates": {},
"facet_ranges": {}
}
Thus giving the needed facet result.
I have some doubts about performance here (especially with more than 4 categories, and if the initial query returns a lot of results), so it is better to do some benchmarking before using this in production.
Not exactly the answer to my own question, but the solution we are using now: the numbers I want to filter on form distinct groups, so we can prefix each id with a group id like this:
1.3
1.8
1.9
2.4
2.5
2.11
...
Having the data like this in Solr, we can use facet.prefix to facet over only a single group: http://wiki.apache.org/solr/SimpleFacetParameters#facet.prefix
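For example (a sketch; the request handler and the assumption that the prefixed ids are indexed in the category field are ours), faceting over group 1 only would be:

/select?q=*:*&rows=0&facet=true&facet.field=category&facet.prefix=1.

The trailing dot in the prefix keeps group 1 from also matching ids of a hypothetical group 11.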