Managing a huge MongoDB collection - database

I am working on a project with a collection that grows fast. The collection holds data on email tracking and email status (responded, bounced, spam, ...). I merged them together to improve query performance. I have complex queries with $lookup and aggregate. My question: is there any solution to manage the collection without archiving data? The collection can grow by millions of documents a day. At a certain point even sharding won't help. How do the big companies solve this problem?
schema example
{
  ownerId: mongoId,
  threadId: 1234,
  messageId: 23455,
  subjectLine: "hello",
  emailState: 1,
  created: timestamp,
  update: timestamp,
  count: 3,
  opens: [
    {
      userAgent: "firefox",
      ip: "1.2.4",
      created: timestamp
    }
  ]
}
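There is no silver bullet, but one pattern large deployments commonly use alongside sharding is time-partitioned collections: writes go to a per-month collection, queries fan out only to the months they touch, and a whole old month can be dropped or moved to cold storage in one operation instead of deleting millions of documents. A minimal sketch of the routing logic (the `tracking` prefix and monthly granularity are assumptions, not anything from the question):

```javascript
// Route a document to a per-month collection, e.g. "tracking_2023_04".
function collectionFor(date, prefix = "tracking") {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${prefix}_${y}_${m}`;
}

// List every monthly collection a [start, end] query must touch, so a
// time-bounded query only fans out to the relevant partitions.
function collectionsForRange(start, end, prefix = "tracking") {
  const names = [];
  const cur = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), 1));
  while (cur <= end) {
    names.push(collectionFor(cur, prefix));
    cur.setUTCMonth(cur.getUTCMonth() + 1);
  }
  return names;
}
```

The trade-off is that cross-partition $lookup and aggregations need application-side merging, which is why this is usually combined with keeping only hot data (e.g. the last few months) in the partition set.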

Related

Performance issue when querying time-based objects

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime : Date, endTime: Date, source: String, metaData: {}}
And my use case is to retrieve all documents included within a queried time frame, so my query looks like this:
db.myCollection.find(
  {
    $and: [
      { "source": aSource },
      { "startTime": { $lte: timeFrame.end } },
      { "endTime": { $gte: timeFrame.start } }
    ]
  }
).sort({ "startTime": 1 })
With an index defined as the following :
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (multiple hundreds of ms on a local database) as soon as the number of documents per source increases.
Using Mongo's explain shows that I'm using this index efficiently (only matching documents are scanned, otherwise only index access is made), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
In addition to that, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could help make those queries faster, or am I condemned to retrieving all the documents belonging to a given source as the best way to go? I see that Mongo now provides some time-series features; could that help with my problem?
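One trick that often helps with interval-overlap queries like the one above, assuming you can put an upper bound on how long any interval lasts (24 hours is a hypothetical value, not from the question), is to also bound startTime from below: any document starting earlier than timeFrame.start minus the maximum span cannot overlap the frame, so the { source, startTime } index prefix scans only a narrow range instead of the whole source's history. A plain-JS sketch of the tightened filter and its semantics:

```javascript
const MAX_SPAN_MS = 24 * 60 * 60 * 1000; // assumed upper bound on endTime - startTime

// The tightened filter: the extra $gte on startTime keeps the index range
// narrow without excluding any document that can still overlap the frame.
function overlapFilter(source, frame) {
  return {
    source: source,
    startTime: { $lte: frame.end, $gte: new Date(frame.start.getTime() - MAX_SPAN_MS) },
    endTime: { $gte: frame.start },
  };
}

// Plain-JS equivalent, to sanity-check which documents the filter matches.
function matches(doc, frame) {
  return doc.startTime <= frame.end &&
         doc.startTime >= new Date(frame.start.getTime() - MAX_SPAN_MS) &&
         doc.endTime >= frame.start;
}
```

The correctness argument is that if startTime < frame.start - MAX_SPAN, then endTime < startTime + MAX_SPAN < frame.start, so such a document could not overlap anyway.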

Database schema design for stock market financial data

I'm figuring out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Finding symbols by current values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. query all dividend.yield between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol, and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example for a data for one symbol.
{
  "symbol": "AAPL",
  "info": {
    "company_name": "Apple Inc.",
    "description": "some long text",
    "website": "http://apple.com",
    "logo_url": "http://apple.com"
  },
  "quotes": {
    "open": 111,
    "close": 321,
    "high": 111,
    "low": 100
  },
  "dividends": {
    "amountPaid": 0.5,
    "exDate": "2020-01-01",
    "yieldOnCost": 10,
    "growth": { value: 111, pct_chg: 10 }, /* some fields could be more attributes than just k/v */
    "yield": 123
  },
  "fundamentals": {
    "num_employees": 123213213,
    "shares": 123123123123,
    ....
  }
}
What approach would you take for storing this data?
Based upon the info (the sample data and the use cases) you posted, I think storing the historical data as a separate collection sounds fine.
Some of the important factors that affect the database design (or data model) are the amount of data and the kind of queries, specifically the most important queries you plan to perform on the data. Assuming that the JSON data you posted (for a stock symbol) can be used to perform the first two queries, you can start with the idea of storing the historical data as a separate collection. The historical data document for a symbol can cover a year or a range of years, depending upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store a maximum of 16 MB of data.
For reference, see MongoDB Data Model Design.
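One concrete shape for that separate historical collection, following the answer's idea of one document per symbol per range of time, is a yearly bucket that daily entries get appended to. This is a sketch; the field names and the per-year granularity are assumptions:

```javascript
// One bucket document per symbol per year; a chart query for 2010-2020 then
// becomes find({ symbol: "AAPL", year: { $gte: 2010, $lte: 2020 } }).
function makeYearBucket(symbol, year) {
  return { _id: symbol + ":" + year, symbol: symbol, year: year, days: [] };
}

// Append one trading day's quotes/dividends to the bucket.
function appendDay(bucket, date, quotes, dividends) {
  bucket.days.push({ date: date, quotes: quotes, dividends: dividends });
  return bucket;
}
```

One bucket per year stays far below the 16 MB document limit (roughly 250 trading days of small sub-documents), while a decade-long chart query touches only about 10 documents per symbol.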
Stock market data by itself is huge. Keep it all in one place per company, otherwise you'll have a mess sooner or later.
In your example above: logos are '.png' files etc., not HTML pages.
Your "quotes" section will get way too big ... keep each quote at the top level ... that's the nice thing with Mongo. Each quotes document should have a date ;) associated with it ... use a date format Mongo understands, not a string.

Schema for chat application using MongoDB

I am trying to build a schema for a chat application in MongoDB. I have 2 types of user models - Producer and Consumer. A Producer and a Consumer can have conversations with each other. My ultimate goal is to fetch all the conversations for any producer or consumer and show them in a list, just like all the messaging apps (e.g. Facebook) do.
Here is the schema I have come up with:
Producer: {
  _id: 123,
  name: "Sam"
}
Consumer: {
  _id: 456,
  name: "Mark"
}
Conversation: {
  _id: 321,
  producerId: 123,
  consumerId: 456,
  lastMessageId: 1111,
  lastMessageDate: 7/7/2018
}
Message: {
  _id: 1111,
  conversationId: 321,
  body: "Hi"
}
Now I want to fetch, say, all the conversations of Sam. I want to show them in a list just like Facebook does, grouping them by Consumer and sorting according to time.
I think I need to do following queries for this:
1) Get all Conversations where producerId is 123, sorted by lastMessageDate.
I can then show the list of all Conversations.
2) If I want to know all the messages in a conversation, I query Message and get all messages where conversationId is 321.
Now, for each new message, I also need to update the conversation with the new messageId and date every time. Is this the right way to proceed, and is it optimal considering the number of queries involved? Is there a better way I can proceed with this? Any help would be highly appreciated.
Design:
I wouldn't say it's bad. Depending on the case you've described, it's actually pretty good. Such denormalization of last message date and ID is great, especially if you plan a view with a list of all conversations - you have the last message date in the same query. Maybe go even one step further and add last message text, if it's applicable in this view.
You can read more on pros and cons of denormalization (and schema modeling in general) on the MongoDB blog (parts 1, 2 and 3). It's not that fresh but not outdated.
Also, if such multi-document updates scare you with possible inconsistencies, MongoDB v4 has you covered with transactions.
Querying:
On one hand, you can issue multiple queries, and that's not bad at all (especially when a few of them are easily cacheable, like the producer or consumer data). On the other hand, you can use aggregations to fetch all these things at once if needed.
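The single-aggregation option could look like the pipeline below, built here as a plain JS object using the field names from the question's schema (the "messages" collection name is an assumption):

```javascript
// Pipeline for "all of this producer's conversations, newest first,
// each joined with its last message".
function conversationListPipeline(producerId) {
  return [
    { $match: { producerId: producerId } },
    { $sort: { lastMessageDate: -1 } },
    { $lookup: {
        from: "messages",           // assumed collection name
        localField: "lastMessageId",
        foreignField: "_id",
        as: "lastMessage",
    } },
    { $unwind: "$lastMessage" },
  ];
}
```

Run as db.conversations.aggregate(conversationListPipeline(123)); a compound index on { producerId: 1, lastMessageDate: -1 } would let the $match and $sort stages use the index directly.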

Indexing a large array in MongoDB

According to the MongoDB documentation, it's not recommended to create a multikey index for large arrays, so what is the alternative?
I want to notify my app users whenever one of their contacts also starts using the app, so I have to upload and manage the contacts list of each user.
We are using MongoDB with a replica set of a master with two secondary machines.
Can Mongo handle multikey indexing for an array with hundreds of values?
Hundreds of contacts for hundreds of thousands of users can be very hard to manage.
The multikey solution looks like this:
{
  customerId: "id1",
  contacts: ["aaa", "aab", "aac", .... "zzz"]
}
index: createIndex({ contacts: 1 }).
Another solution is to save each contact in its own document and store all the app users related to it:
{
  phone: "aaa",
  contacts: ["id1", "id2", "id3"]
},
{
  phone: "aab",
  contacts: ["id1"]
},
{
  phone: "aac",
  contacts: ["id1"]
},
......
{
  phone: "zzz",
  contacts: ["id1"]
}
index: createIndex( { phone: 1 } )
Both have poor write performance when uploading the contacts list:
the first computes a huge index, and the second updates lots of documents concurrently.
Is there a better way to do it?
I'm using a replica set with two secondary machines; could a shard key help?
Thanks
To index a field that holds an array value, MongoDB creates an index key for each element in the array. These multikey indexes support efficient queries against array fields.
So if I were you, my data model would be like this:
{
  customerId: "id1",
  contacts: ["_idx", "_idy", "_idw", .... "_idz"]
}
And then create your index on the contacts. MongoDB creates indexes on _id by default. So you will have to create new documents for the non-app users; just add a field like "app_user": true/false.
For index performance, you could have it built in the background without any issues; for replica sets, this is how it's done.
For the sharding, it won't help you, because you won't even be able to shard anything, since you have one primary node in your cluster. Sharding needs at least 2 sets of primary Mongo instances, so in your case you could add a fourth server, then have two replica sets of one primary and one secondary each, then shard them, and transform your system into 2 replicated shards.
Once this is achieved, it will obviously balance the load between the 2 shards, even though a hundred documents isn't really much for MongoDB to deal with.
On the other hand, if you do go for sharding, you will need more setup for config servers if you're using MongoDB 3.4 or higher.
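On the question's second model (one document per phone number), the slow part, updating lots of documents, can at least be batched into a single bulkWrite of upserts instead of hundreds of round trips. A sketch of building those operations, written as plain JS so the shapes can be checked:

```javascript
// One upsert per phone number: add the uploading user's id to that phone's
// "contacts" array, creating the document if it doesn't exist yet. The
// result would be passed to something like db.phones.bulkWrite(ops).
function contactUpserts(userId, phones) {
  return phones.map(function (phone) {
    return {
      updateOne: {
        filter: { phone: phone },
        update: { $addToSet: { contacts: userId } },
        upsert: true,
      },
    };
  });
}
```

With { ordered: false }, the server can apply the upserts in parallel, which softens the concurrent-update cost the question worries about.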

Lack of rich querying functionality in NoSQL?

Every time I contemplate using NoSQL for a solution I always get hung up on the lack of rich querying functionality. It may very well be my lack of understanding of NoSQL. It also might be due to the fact that I'm very comfortable with SQL. From my understanding, NoSQL really lends itself well to simple-schema scenarios (so it's probably not going to work well for a relational database where you have 50+ tables). Even for trivial scenarios I always seem to want rich query functionality. Let's take a recipe database as a trivial example.
While the schema is, no doubt, trivial, you would definitely want rich querying ability. You would probably want to search by the following (and more):
Title
Tag
Category
ID
Likes
User who created recipe
Create date
Rating
Dietary restrictions
You would also want to combine these criteria into any combination you wanted. While I know most NoSQL solutions have secondary indexes, doesn't this type of querying ability severely limit how many scenarios NoSQL is relevant for? I usually need this rich querying ability. Another good example would be a bug-tracking application.
I don't think you want to kick off a map-reduce job every time someone wants to search the database (I think this would be analogous to doing table scans most of the time in a traditional relational model). So I would assume there would be a lot of queries where you would have to loop through each entity and look for the criteria you wanted to search for (which would probably be slow). I understand you can run nightly map-reduce jobs to either analyze the data or normalize it into a typical relational database structure for reports.
Now I can see it being useful for scenarios where you would most likely always have to read all the data anyway. Think of a web server log, or maybe an IoT type of app where you're collecting massive amounts of data (like sensor collection) and doing nightly analysis.
So is my understanding of NoSQL off, or is there a limit to the number of scenarios it works well with?
I think the issue you are encountering is that you are approaching noSQL with the same design mindset that you would use with SQL. You mentioned "rich querying" several times. To me, that points toward design flaws (using only reference IDs / trying to define relationships). A significant concept in noSQL is that data can be repeated (and often should be). Your recipe example is actually a great use case for noSQL. Here's how I would approach it using 3 of the models you mention (for simplicity's sake):
Recipe = {
  _id: a001,
  name: "Burger",
  ingredients: [
    {
      _id: b001,
      name: "Beef"
    },
    {
      _id: b002,
      name: "Cheese"
    }
  ],
  createdBy: {
    _id: c001,
    firstName: "John",
    lastName: "Doe"
  }
}
Person = {
  _id: c001,
  firstName: "John",
  lastName: "Doe",
  email: "jd@email.com",
  preferences: {
    emailNotifications: true
  }
}
Ingredient = {
  _id: b001,
  name: "Beef",
  brand: "Agri-co",
  shelfLife: "3 days",
  calories: 300
};
The reason I designed it this way is expressly for the purpose of its existence (assuming it's something like allrecipes.com). When searching/filtering recipes, you can filter by the author, but their email preferences are irrelevant. Similarly, the shelf life and brand of the ingredient are irrelevant. The schema is designed for the specific use case, not just because your data needs to be saved. Now here are a few of your mentioned queries (mongo):
db.recipes.find({name: "Burger"});
db.recipes.find({ "ingredients.name": { $nin: ["Cheese", "Milk"] } }) // dietary restrictions
Your rich querying concerns have now been reduced to single queries in a single collection.
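And combining criteria stays a single find too. Here is a hypothetical compound filter (the category and rating fields are illustrative assumptions, not part of the models above), with a plain-JS check of what it matches:

```javascript
// "category burgers, rating >= 90, containing no Milk" in one filter object.
const filter = {
  category: "burgers",                    // hypothetical field
  rating: { $gte: 90 },
  "ingredients.name": { $ne: "Milk" },
};

// In-memory equivalent, to sanity-check the filter's semantics.
function matchesFilter(recipe) {
  return recipe.category === "burgers" &&
         recipe.rating >= 90 &&
         recipe.ingredients.every(i => i.name !== "Milk");
}
```

A compound index covering the fields you filter on most often (e.g. { category: 1, rating: -1 }) keeps such queries indexed rather than scanned.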
The downside of this design is slower write speed. You need more logic on the backend, with the potential for more programmer error. The write speed is also slower than SQL due to accessing the various models to grab relevant information. That being said, how often is it viewed vs. how often is it written/edited? (this was my comment on reading trumping writing) The other major downside is the necessity of foresight. The relationship between an ingredient and a recipe doesn't change forms. But the information your application requires might. Editing a noSQL model tends to be more difficult than editing a SQL table.
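Most of that extra backend logic is fan-out on writes. For example, when a person's name changes, both the canonical document and every embedded copy have to be patched; a sketch of the updates involved (the "people" and "recipes" collection names are assumptions):

```javascript
// Build the update set for a rename: one update for the canonical Person doc,
// one updateMany for every Recipe embedding that person as createdBy.
function renameUpdates(personId, firstName, lastName) {
  return [
    { coll: "people",  filter: { _id: personId },
      update: { $set: { firstName: firstName, lastName: lastName } } },
    { coll: "recipes", filter: { "createdBy._id": personId },
      update: { $set: { "createdBy.firstName": firstName,
                        "createdBy.lastName": lastName } } },
  ];
}
```

Each entry maps to a db[coll].updateMany(filter, update) call; wrapping the two in a transaction (MongoDB 4.0+) keeps the copies consistent.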
Here's one other contrived example using the same models to emphasize my point about purposeful design. Assume your new site is about famous chefs instead of a recipe database:
Person = {
  _id: c001,
  firstName: "Paula",
  lastName: "Deen",
  recipeCount: 15,
  commonIngredients: [
    {
      _id: b001,
      name: "Butter",
      count: 15
    },
    {
      _id: b002,
      name: "Salted Butter",
      count: 15
    }
  ],
  favoriteRecipes: [
    {
      _id: a001,
      name: "Fried Butter",
      calories: "3000"
    }
  ]
};
Recipe = {
  _id: a001,
  name: "Fried Butter",
  ingredients: [
    {
      _id: b001,
      name: "Butter"
    }
  ],
  directions: "Fry butter. Eat.",
  calories: "3000",
  rating: 99,
  createdBy: {
    _id: c001,
    firstName: "Paula",
    lastName: "Deen"
  }
};
Ingredient = {
  _id: b001,
  name: "Butter",
  brand: "Butterfields",
  shelfLife: "1 month"
};
Both of these designs use the same information, but they are modeled for the specific reason you bothered gathering the information. Now, you have the requisite information for a chef list page and typical sorting/filtering. You can navigate from there to a recipe page and have that info available.
Design for the use case, not to model relationships.
