indexing large array in mongoDB - arrays

according to mongoDB documentation, it's not recommended to create multikey index for large arrays, so what is the alternative option for that?
I want to notify my app users whenever one of their contacts also start using the app, so I have to upload and manage the contacts list of each user.
we are using mongoDB with replica set of master with two secondaries machines.
does mongo can handle multikey indexing for array with hundreds of values?
hundreds of contacts for hundreds thousands of users can be very hard to mange.
the multikey solution looks like that:
{
customerId: "id1",
contacts: ["aaa", "aab", "aac", .... "zzz"]
}
index: createIndex({ contacts: 1 }).
another solution is to save each contacts in it's own document and save all the app users that related to him:
{
phone: "aaa",
contacts: ["id1", "id2", "id3"]
},
{
phone: "aab",
contacts: ["id1"]
},
{
phone: "aac",
contacts: ["id1"]
},
......
{
phone: "zzz",
contacts: ["id1"]
}
index: createIndex( { phone: 1 } )
both have poor performance on writing when uploading the contacts list:
the first on calculate huge index, and the second for updating lots of documents concurrent.
Is there a better way to do it?
I'm using a replica set with two secondaries machines, does shard key could help?
Thanks

To index a field that holds an array value, MongoDB creates an index key for each element in the array. These multikey indexes support efficient queries
against array fields.
So if i were you, my data model would be like this :
{
customerId: "id1",
contacts: ["_idx", "_idy", "_idw", .... "_idz"]
}
And then create your index on the contacts. MongoDB creates by default indexes on ids. So you will have to create new documents for the non app users, just try to to add a field, like "app_user" : true/false.
For index performance, you could make it build in the background without any issues, and for replica sets, this is how it's done.
For the sharding, it won't help you, because you won't even be able to shard anything, since you have one primary node in your cluster. Sharding needs at least 2 sets of primary Mongo instances, so in your case, you could add a fourth server, then have two replica sets, of one primary and one secondary, then shard them, and tranform your system into 2 replicated shards.
Once this is achieved, it will obviously balance the loads between the 2 shards, eventhough a hundred documents isn't really much to deal with for MongoDB.
On the other hand if you're going to go for sharding, you will need more setup, for config servers if you're using Mongodb 3.4 or higher.

Related

Performance issue when querying time-based objects

I'm currently working on a mongoDB collection containing documents that looks like the following :
{ startTime : Date, endTime: Date, source: String, metaData: {}}
And my usecase is to retrieve all documents that is included within a queried time frame, such as my query looks like this :
db.myCollection.find(
{
$and: [
{"source": aSource},
{"startTime" : {$lte: timeFrame.end}},
{"endTime" : {$gte: timeFrame.start}}
]
}
).sort({ "startTime" : 1 })
With an index defined as the following :
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (multiple hundreds of ms on a local database) as soon as the number of document per source increase.
Using mongo explain shows me that i'm efficiently using this index (only found documents are scanned, otherwise only index-access is made), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of this index.
In addition to that, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything i'm missing that could help makes those queries faster, or am I condemned to retrieve all the documents belonging to a given source as the best way to go ? I see that mongo now provides some time-series features, could that bring any help in regard of my problem ?

Managing huge mongoDB collection

I am working on a project with collection that grow fast. the collection holds data on email tracking and email status(respond, bounced, spam,...) i merged them together to improve query performance. i have complex queries with $lookup and aggregate. my question is there any solution to manage the collection without archiving data. the collection can grow by million a day. in certain point even sharding wont help. how the big companies solve this problem.
schema example
{
ownerId: mongoId,
threadId: 1234,
messageId: 23455,
subjectLine: hello,
emailState: 1,
created: timestamp,
update: timestamp,
count: 3,
opens: [
{
userAgent: firefox,
ip: 1.2.4,
created: timestamp
}
]
}

What is the best approach for storing two list of same type in mongoDB?

I'm wondering in terms of database design what is the best approach between storing reference id, or embedded document even if it's means that multiple document can appears more than once.
Let's say I have that kind of model for the moment :
Collection User :
{
name: String,
types : List<Type>
sharedTypes: List<Type>
}
If I use the embedded model and don't use another collection it may result in duplicate object Type. For example, user A create Type aa and user B create Type bb. When they share each other they type it will result in :
{
name: UserA,
types : [{name: aa}]
sharedTypes: [{name:bb}]
},
{
name: UserB,
types : [{name: bb}]
sharedTypes: [{name:aa}]
}
Which results in duplication, so I guess it's pretty bad design. Should I use another approach like creating collection Type and store referenceId ?
Collection Type :
{
id: String
name: String
}
Which will still result in duplication but not one whole document, I guess it's better.
{
name: UserA,
types : ["randomString1"]
sharedTypes: ["randomString2"]
},
{
name: UserA,
types : ["randomString2"]
sharedTypes: ["randomString1"]
}
And the last one approach and maybe the best is to store from the collection types like this.
Collection User :
{
id: String
name: String
}
Collection Type :
{
id: String
name: String,
createdBy: String (id of user),
sharedWith: List<String> (ids of user)
}
What is the best approach between this 3.
I'm doing query like, I got one group of user, so for each user, I want the type created and the type people shared with me.
Broadly, the decision to embed vs. use a reference ID comes down to this:
Do you need to easily preserve the referential integrity of the joined data at point in time, meaning you want to ensure that the state of the joined data is "permanently associated" with the parent data? Then embedding is a good idea. This is also a good practice in the "insert only" design paradigm. Very often other requirements like immutability, hashing/checksum, security, and archiving make the embedded approach easier to manage in the long run because version / createDate management is vastly simplified.
Do you need the fastest, most quick-hit scalability? Then embed and ensure indexes are appropriately constructed. An indexed lookup followed by the extraction of a rich shape with arbitrarily complex embedded data is a very high performance operation.
(Opposite) Do you want to ensure that updates to joined data are quickly and immediately reflected in a join with parents? Then use a reference ID and the $lookup function to bring the data together.
Does the joined data grow essentially without bound, like transactions against an account? This is likely better handled through a reference ID to a separate transaction collection and joined with $lookup.

How do i properly design a data model for Recipe / Ingredient using MongoDB

Recently i have designed a database model or ERD using Hackalode.
So the problem I'm currently facing is that base on my current design, i can't query it correctly as I wanted. I studied ERD with MYSQL and do know that Mongo doesn't work the same
The idea was simple, I want a recipe that has a array list of ingredients, and the ingredients are from separate collection.
The recipe also consist of measurement of the ingredient ie. (1 tbps sugar)
Can also query from list of ingredients and find the recipe that contains the ingredients
I wanted this collections to be in Many to Many relationship and the recipe can use the ingredients that are already in the database.
I just don't know how to query the data
I have tried a lot of ways by using $elemMatch and populate and all i get is empty array list as a result.
Im expecting two types of query where i can query by name of ingredients or by the recipe
My expectation result would be like this
[{
id: ...,
name: ....,
description: ...,
macros: [...],
ingredients: [
{
id,
amount: ....,
unit: ....
ingredient: {
id: ....,
name: ....
}
}
}, { ... }]
But instead of getting
[]
Imho, your design is utterly wrong. You over normalized your data. I would do something much simpler and use embedding. The reasoning behind that is that you define your use cases first and then you model your data to answer the question arising from your use cases in the most efficient way.
Assumed use cases
As a user, I want a list of all recipes.
As a user, I want a list of all recipes by ingredient.
As a designer, I want to be able to show a list of all ingredients.
As a user, I want to be able to link to recipes for compound ingredients, should it be present on the site.
Surely, this is just a small excerpt, but it is sufficient for this example.
How to answer the questions
Ok, the first one is extremely simple:
db.recipes.find()[.limit()[.skip()]]
Now, how could we find by ingredient? Simple answer: do a text index on ingredient names (and probably some other fields, as you can only have one text index per collection. Then, the query is equally simple:
db.recipes.find({$text:{$search:"ingredient name"}})
"Hey, wait a moment! How do I get a list of all ingredients?" Let us assume we want a simple list of ingredients, with a number on how often they are actually used:
db.recipes.aggregate([
// We want all ingredients as single values
{$unwind:"$Ingredients"},
// We want the response to be "Ingredient"
{$project:{_id:0,"Ingredient":"$Ingredients.Name"}
// We count the occurrence of each ingredient
// in the recipes
{$group:{_id:"$Ingredient",count:{$sum:1}}}
])
This would actually be sufficient, unless you have a database of gazillions of recipes. In that case, you might want to have a deep look into incremental map/reduce instead of an aggregation. Hint: You should add a timestamp to the recipes to be able to use incremental map/reduce.
If you have a couple of hundred K to a couple of million recipes, you can also add an $out stage to preaggregate your data.
On measurements
Imho, it makes no sense to have defined measurements. There are teaspoons, tablespoons, metric and imperial measurements, groupings like "dozen" or specifications like "clove". Which you really do not want to convert to each other or even set to a limited number of measurements. How many ounces is a clove of garlic? ;)
Bottom line: Make it a free text field, maybe with some autocomplete suggestions.
Revised data model
Recipe
{
_id: new ObjectId(),
Name: "Surf & Turf Kebap",
Ingredients: [
{
Name: "Flunk Steak",
Measurement: "200 g"
},
{
Name: "Prawns",
Measurement: "300g",
Note: "Fresh ones!"
},
{
Name: "Garlic Oil",
Measurement: "1 Tablespoon",
Link: "/recipes/5c2cc4acd98df737db7c5401"
}
]
}
And the example of the text index:
db.recipes.createIndex({Name:"text","Ingredients.Name":"text"})
The theory behind it
A recipe is you basic data structure, as your application is supposed to store and provide them, potentially based on certain criteria. Ingredients and measurements (to the extend where it makes sense) can easily be derived from the recipes. So why bother to store ingredients and measurements independently. It only makes your data model unnecessarily complicated, while not providing any advantage.
hth

Solr request: SQL-like JOIN, GROUP BY, SUM(), WHERE SUM()

I'm new to Solr and I have the following problem:
I have those documents:
category:contract:
{
"contract_id_s": "contract-ENG-00001",
"title_s": "contract title",
"ref_easy_s": "REFAAA",
"commitment_id_s": "ENG-00001",
},
category:commitment:
{
"commitment_id_s": "ENG-00001",
"title_s": "commitment title",
"status_s": "Validated",
"date_changed_status_s": "2015-09-30",
"date_status_initiated_s": "2015-09-27",
"date_status_confirmed_s": "2015-09-28",
"date_status_validated_s": "2015-09-30",
},
category:commitment AND sub_category_s:commitment_project:
{
"id": "ENG-00001_AAA",
"commitment_id_s": "ENG-00001",
"project_id_s": "AAA",
"project_name_s": "project name",
"project_amount_asked_s": "2000",
"project_amount_validated_s": "2100"
},
{
"id": "ENG-00001_AAA2",
"commitment_id_s": "ENG-00001",
"project_id_s": "AAA",
"project_name_s": "project name",
"project_amount_asked_s": "1000",
"project_amount_validated_s": "1200"
},
For each commitment, there could be a contract.
For each commitment, there could be some payments.
Here is what I want to do:
- by default, only select commitment that have at least :
. one sub_category_s:commitment_project with a project_amount_validated_s value.
. one contract.
- if filtered on amounts, only select in this list, commitments with the SUM of project_amount_validated_s > amount_min AND < amount_max.
I don't know what is the best practice in terms of performance?
- Requesting the ids of the commitments then requesting the details for them?
- Is there a way to JOIN the contract informations in this request?
- Or the best practice is to request each document one by one?
The problem is that I don't want to request useless data (performance, bandwidth).
There are some tools available to you in the form of:
Solr's Block Join Query Parser (which allows for simple parent/child
queries).
Solr Facets (which allow for aggregrations (e.g. sum of payments) ... with recent support for faceting on parent/child fields).
The Solr Expand Component (which recently allows parent information to be expanded from a child block join query).
However, I'm not certain you can do everything you're hoping in one query (using with these pieces). And even if you can, stitching them together doesn't even come close the the simplicity of the SELECT...JOIN...GROUP BY...HAVING SQL query you're hoping to replicate. (Unless you want to try out the Solr 6 developer snapshot with parallel SQL support)
BUT If this is your only use-case, AND Solr is not your primary datastore, I'd strongly recommend modeling your Solr data to fit your use-case.
E.g. Start simple, denormalize, and only include the fields in your datamodel needed for search:
Only one type of record: commitment
Fields
commitment_id_s
title_s
status_s
date_changed_status_s
date_status_initiated_s
date_status_confirmed_s
date_status_validated_s
total_payments_asked (numeric sum of project_amount_asked from DB)
total_payments_validated (numeric sum of project_amount_validated from DB)
project_names (multiValued list of searchable project names)
contract_names (multiValued list of searchable contract names)
Then your query just needs a filter:
total_payments_validated:[<amount_min>TO<amount_max>]
to enforce your default criteria.
Once your search has identified the commitment IDs matching the Solr query, then go back and query the source database for any additional information needed for display (project details, contract details, dates, etc...)
Ok, I've found a solution by using !join.
For instance, in PHP:
[
'q' => "{!join from=id to=service_id score=none}uri:\\$serviceUri* AND -deleted:true",
'fq' => "{!cache=false}category:monthly_volume AND type:\"$type\" AND timestamp:[$strDateStart TO $strDateEnd]",
'alt' => 'json',
'max-results' => 1000,
'sort' => 'timestamp ASC',
'statsFields' => 'stats.field=value&stats.facet=timestamp',
]
Or with URL request:
http://localhost:8983/solr/fluks-admin/select?q={!join+from=id+to=sector_id+score=none}{!join+from=uri+to=service+score=none}uri:/test-en/service-en*+AND+-deleted:true&fq={!cache=false}category:indicator+AND+timestamp:[201608+TO+201610]+AND+type:("-3"+OR+2+OR+3)+AND+-deleted:true&wt=json&indent=true&json.facet={sum_timestamp:{terms:{limit:-1, field:timestamp, facet:{sum_type:{terms:{limit:-1, field:type, facet:{sum_vol_value:"sum(vol_value)"}}}}}}}

Resources