Lack of rich querying functionality in NoSQL?

Every time I contemplate using NoSQL for a solution, I get hung up on the lack of rich querying functionality. It may very well be my lack of understanding of NoSQL, or the fact that I'm very comfortable with SQL. From my understanding, NoSQL lends itself well to simple schema scenarios (so it's probably not going to work well where a relational design would have 50+ tables). Yet even for trivial scenarios I always seem to want rich query functionality. Let's take a recipe database as a trivial example.
While the schema is, no doubt, trivial, you would definitely want rich querying ability. You would probably want to search by the following (and more):
Title
Tag
Category
ID
Likes
User who created the recipe
Create date
Rating
Dietary restrictions
You would also want to combine these criteria in any combination. While I know most NoSQL solutions have secondary indexes, doesn't this need for querying ability severely limit the number of solutions NoSQL is relevant for? I usually need this rich querying ability. Another good example would be a bug-tracking application.
I don't think you want to kick off a map-reduce job every time someone wants to search the database (I think this would be analogous to doing table scans most of the time in a traditional relational model). So I would assume there would be a lot of queries where you would have to loop through each entity and check the criteria you wanted to search for (which would probably be slow). I understand you can run nightly map-reduce jobs to either analyze the data or normalize it into a typical relational structure for reports.
Now I can see it being useful for scenarios where you would most likely have to read all the data anyway. Think of a web server log, or an IoT app where you're collecting massive amounts of data (like sensor readings) and doing nightly analysis.
So is my understanding of NoSQL off, or is there a limit to the number of scenarios it works well with?

I think the issue you are encountering is that you are approaching NoSQL with the same design mindset you would use with SQL. You mentioned "rich querying" several times. To me, that points towards design flaws (using only reference ids / trying to define relationships). A significant concept in NoSQL is that data can be repeated (and often should be). Your recipe example is actually a great use case for NoSQL. Here's how I would approach it using three of the models you mention (for simplicity's sake):
Recipe = {
  _id: "a001",
  name: "Burger",
  ingredients: [
    {
      _id: "b001",
      name: "Beef"
    },
    {
      _id: "b002",
      name: "Cheese"
    }
  ],
  createdBy: {
    _id: "c001",
    firstName: "John",
    lastName: "Doe"
  }
}
Person = {
  _id: "c001",
  firstName: "John",
  lastName: "Doe",
  email: "jd@email.com",
  preferences: {
    emailNotifications: true
  }
}
Ingredient = {
  _id: "b001",
  name: "Beef",
  brand: "Agri-co",
  shelfLife: "3 days",
  calories: 300
};
The reason I designed it this way is expressly for the purpose of its existence (assuming it's something like allrecipes.com). When searching/filtering recipes, you can filter by the author, but their email preferences are irrelevant. Similarly, the shelf life and brand of the ingredient are irrelevant. The schema is designed for the specific use case, not just because your data needs to be saved. Now here are a few of your mentioned queries (Mongo):
db.recipes.find({name: "Burger"});
db.recipes.find({ "ingredients.name": { $nin: ["Cheese", "Milk"] } }) // dietary restrictions, matching on the embedded name field
Your rich querying concerns have now been reduced to single queries in a single collection.
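Combining criteria stays a single query too. A minimal sketch, assuming the documents also carry category, rating, and createDate fields (not shown in the models above):
db.recipes.find({
  category: "Dinner",
  rating: { $gte: 4 },
  "createdBy._id": "c001"
}).sort({ createDate: -1 });
// a compound secondary index to back it:
db.recipes.createIndex({ category: 1, rating: 1, createDate: -1 });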
The downside of this design is slower writes. You need more logic on the backend, with the potential for more programmer error, and writes are slower than in SQL because you touch the various models to grab the relevant information. That said, how often is a recipe viewed vs. how often is it written/edited? (This was my comment about reading trumping writing.) The other major downside is the necessity of foresight. The relationship between an ingredient and a recipe doesn't change form, but the information your application requires might, and editing a NoSQL model tends to be more difficult than editing a SQL table.
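As a sketch of that extra backend logic, assuming collections named persons and recipes: if an author's name changes, the copies denormalized into recipes must be updated as well:
db.persons.updateOne(
  { _id: "c001" },
  { $set: { firstName: "Jane" } }
);
// fan-out: keep the embedded copies in sync
db.recipes.updateMany(
  { "createdBy._id": "c001" },
  { $set: { "createdBy.firstName": "Jane" } }
);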
Here's one other contrived example using the same models to emphasize my point about purposeful design. Assume your new site is about famous chefs instead of a recipe database:
Person = {
  _id: "c001",
  firstName: "Paula",
  lastName: "Deen",
  recipeCount: 15,
  commonIngredients: [
    {
      _id: "b001",
      name: "Butter",
      count: 15
    },
    {
      _id: "b002",
      name: "Salted Butter",
      count: 15
    }
  ],
  favoriteRecipes: [
    {
      _id: "a001",
      name: "Fried Butter",
      calories: "3000"
    }
  ]
};
Recipe = {
  _id: "a001",
  name: "Fried Butter",
  ingredients: [
    {
      _id: "b001",
      name: "Butter"
    }
  ],
  directions: "Fry butter. Eat.",
  calories: "3000",
  rating: 99,
  createdBy: {
    _id: "c001",
    firstName: "Paula",
    lastName: "Deen"
  }
};
Ingredient = {
  _id: "b001",
  name: "Butter",
  brand: "Butterfields",
  shelfLife: "1 month"
};
Both of these designs use the same information, but each is modeled for the specific reason you gathered that information. Now you have the requisite information for a chef list page and typical sorting/filtering, and you can navigate from there to a recipe page and have that info available.
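For instance, the chef list page could be served by a single query over this model; a sketch, sorting by the precomputed recipeCount:
db.persons.find(
  {},
  { firstName: 1, lastName: 1, recipeCount: 1, commonIngredients: 1 }
).sort({ recipeCount: -1 });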
Design for the use case, not to model relationships.

Related

Persistence of a 1:m relation that doesn't have an individual entity but depends on the root aggregate/entity + DDD

Let's say we have a cafe entity/AR, and it has regular opening hours plus some explicit opening and closing hours as attributes (VO).
Each cafe can have multiple sets of opening and closing hours, including regular hours.
let Cafe = {
  guid: "askdlaksldklaskl12l3k1l2klaskd",
  name: "My Cafe",
  location: "New York",
  coordinate: ["74.2344", "77.12332"],
  star_rating: "4",
  images: [
    "someurl", "someother url"
  ],
  is247: false,
  opening_hours: {
    regular_hours: [
      {
        weekday: 1,
        start: "10:00 am",
        closing: "6:00 pm"
      },
      {
        weekday: 2,
        start: "10:00 am",
        closing: "6:00 pm"
      },
      {
        weekday: 3,
        start: "10:00 am",
        closing: "6:00 pm"
      }
    ],
    explicit_closing: [
      {
        start: "2022/07/12 3:00",
        end: "2022/07/15 6:00"
      },
      {
        start: "2022/08/10 3:00",
        end: "2022/08/13 6:00"
      }
    ],
    explicit_opening: [
      {
        start: "2022/09/12 3:00",
        end: "2022/09/15 6:00"
      }
    ]
  }
}
Here the cafe has a guid and is the aggregate root; the opening/closing/regular hours don't need such a guid (they can have some DB-unique id, though).
I have made opening_hours a value object in the domain layer when storing it to persistence.
The persistence storage logic depends upon what kind of database we are using.
If we were to use MongoDB (a NoSQL document DB), we can map the cafe entity directly and store it as-is. For MySQL, do we need to create different tables for opening hours, or for any other VO that holds a collection with a 1:n mapping?
For MySQL (RDBMS), do we need to normalise the database and make different tables for relational VOs, or keep one table per entity? (It makes more sense to keep normalised data in an RDBMS; otherwise its use is the same as NoSQL.)
Let's say that for querying we need to fetch all the cafes which are open within a 10 km radius based on the coordinate value.
Though the decision of DB selection and its logic should be independent of the domain layer in DDD, I need some direction on how to approach these decisions.
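For reference, a minimal sketch of that radius query in MongoDB, assuming the coordinate were stored as a GeoJSON Point in a (hypothetical) location_point field rather than the string pair above:
db.cafes.createIndex({ location_point: "2dsphere" });
db.cafes.find({
  location_point: {
    $near: {
      // GeoJSON order is [longitude, latitude]
      $geometry: { type: "Point", coordinates: [74.2344, 77.12332] },
      $maxDistance: 10000 // metres
    }
  }
});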
Part of being a value object is that any copy of it can be substituted for any other. So in that sense, it's OK to have an opening_hours table and have entities refer to that object via a foreign key. If doing this, you'll need to make sure that you never do an UPDATE on the opening_hours table (just INSERTs and maybe DELETEs) and enforce a uniqueness constraint on (start, closing).
The UPDATE prohibition arises because:
value objects are immutable
updating a value object that is incorporated into multiple instances of an aggregate (which normalizing implies: there's no way to ensure a given value object is in only one aggregate instance without either breaking an aggregate boundary or denormalizing) means breaking an aggregate boundary, because you'd be updating one aggregate instance as part of updating another
For a similar reason, an ON DELETE/UPDATE CASCADE would also be frowned upon in DDD.
There's an uneasy relationship in practice between DDD and relational DB normalization: the various normal forms (even, arguably, 1NF) tend to entail breaking aggregate boundaries or encoding domain logic in the infrastructure layer (e.g. uniqueness constraints). The latter doesn't necessarily cause a problem, but encoding that logic in the infrastructure means either duplicating it in the domain layer/ring/whatever of your system (duplication of logic is at least questionable, and opens the door to "my domain logic's saying let's go, but my infrastructure's saying no" situations) or having your domain layer not capture some important part of the domain rules. Accordingly, there's a tendency among at least some DDD practitioners not to go that high into the *NFs, even though one might point out that doing so means paying the price of supporting higher NFs without getting the benefit (in which case those practitioners might be better off using NoSQL/key-value stores, perhaps ones with stronger consistency guarantees than the common ones, but still NoSQL).

Data Dictionary for NoSQL Database

Let's say I'm using Firebase Cloud Firestore, which is a NoSQL database. An example of the database content is as follows:
- customers
  - customer_01
    - name: "abc"
    - email: "abc@gmail.com"
  - customer_02
    - name: "bcd"
    - email: "bcd@gmail.com"
- items
  - item_01
    - name: "book"
    - price: 22.50
  - item_02
    - name: "pencil"
    - price: 1.50
How do I create a data dictionary for a NoSQL database?
I would say that data dictionaries are not something that makes a lot of sense for NoSQL databases. Due to the flexibility on offer, the number of fields may vary from document to document, which makes the data hard to pin down "predictably" enough to store in a dictionary.
If you really need one for project or client requirements, and your data is in a predictable enough format, you could keep it simple and build a dictionary using the name, type, and description of each field (you can't really limit the size of fields directly other than in your code; Firestore won't do it for you). You could also note the dictionary's missing information with an explanation similar to this one.
So, for the data you share, I would do the following (the descriptions are illustrative):
customers.name   string   customer's display name
customers.email  string   customer's email address
items.name       string   item name
items.price      number   item price

How to print the count of array elements along with another variable in MongoDB

I have a data collection which contains records in the following format.
{
  "_id": 22,
  "title": "Hibernate in Action",
  "isbn": "193239415X",
  "pageCount": 400,
  "publishedDate": ISODate("2004-08-01T07:00:00Z"),
  "thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/bauer.jpg",
  "shortDescription": "\"2005 Best Java Book!\" -- Java Developer's Journal",
  "longDescription": "Hibernate practically exploded on the Java scene. Why is this open-source tool so popular Because it automates a tedious task: persisting your Java objects to a relational database. The inevitable mismatch between your object-oriented code and the relational database requires you to write code that maps one to the other. This code is often complex, tedious and costly to develop. Hibernate does the mapping for you. Not only that, Hibernate makes it easy. Positioned as a layer between your application and your database, Hibernate takes care of loading and saving of objects. Hibernate applications are cheaper, more portable, and more resilient to change. And they perform better than anything you are likely to develop yourself. Hibernate in Action carefully explains the concepts you need, then gets you going. It builds on a single example to show you how to use Hibernate in practice, how to deal with concurrency and transactions, how to efficiently retrieve objects and use caching. The authors created Hibernate and they field questions from the Hibernate community every day - they know how to make Hibernate sing. Knowledge and insight seep out of every pore of this book.",
  "status": "PUBLISH",
  "authors": ["Christian Bauer", "Gavin King"],
  "categories": ["Java"]
}
I want to print the title and the author count for records where the number of authors is greater than 4.
I used the following command to extract records which have more than 4 authors.
db.books.find({authors:{$exists:true},$where:'this.authors.length>4'},{_id:0,title:1});
But I was unable to print the number of authors along with the title. I tried the following command too, but it gave only the title list.
db.books.find({authors:{$exists:true},$where:'this.authors.length>4'},{_id:0,title:1,'this.authors.length':1});
Could you please help me to print the number of authors here along with the title?
You can use the aggregation framework's $project with $size to reshape your data, and then $match to apply the filtering condition:
db.collection.aggregate([
  {
    $project: {
      title: 1,
      // $ifNull guards against documents without an authors field,
      // since $size errors on missing/non-array values
      authorsCount: { $size: { $ifNull: ["$authors", []] } }
    }
  },
  {
    $match: {
      authorsCount: { $gt: 4 }
    }
  }
])
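If you'd rather stay in find(), here is an alternative sketch that avoids $where (assuming MongoDB 3.6+ for $expr, and 4.4+ for the aggregation expression in the projection):
db.books.find(
  { authors: { $exists: true }, $expr: { $gt: [{ $size: "$authors" }, 4] } },
  { _id: 0, title: 1, authorsCount: { $size: "$authors" } }
)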

How do I properly design a data model for Recipe / Ingredient using MongoDB

Recently I designed a database model / ERD using Hackolade.
The problem I'm currently facing is that, based on my current design, I can't query it the way I want. I studied ERD with MySQL and know that Mongo doesn't work the same way.
The idea was simple: I want a recipe that has an array list of ingredients, and the ingredients come from a separate collection.
The recipe also consists of measurements of the ingredients, i.e. (1 tbsp sugar).
I can also query from a list of ingredients and find the recipes that contain those ingredients.
I wanted these collections to be in a many-to-many relationship, and a recipe can use ingredients that are already in the database.
I just don't know how to query the data.
I have tried a lot of ways, using $elemMatch and populate, and all I get is an empty array as a result.
I'm expecting two types of query, where I can query by the name of the ingredients or by the recipe.
My expected result would be like this:
[{
  id: ...,
  name: ...,
  description: ...,
  macros: [...],
  ingredients: [
    {
      id,
      amount: ...,
      unit: ...,
      ingredient: {
        id: ...,
        name: ...
      }
    }
  ]
}, { ... }]
But instead, I'm getting:
[]
Imho, your design is utterly wrong: you over-normalized your data. I would do something much simpler and use embedding. The reasoning is that you define your use cases first and then model your data to answer the questions arising from those use cases in the most efficient way.
Assumed use cases
As a user, I want a list of all recipes.
As a user, I want a list of all recipes by ingredient.
As a designer, I want to be able to show a list of all ingredients.
As a user, I want to be able to link to recipes for compound ingredients, should it be present on the site.
Surely, this is just a small excerpt, but it is sufficient for this example.
How to answer the questions
Ok, the first one is extremely simple:
db.recipes.find()[.limit()[.skip()]]
Now, how could we find by ingredient? Simple answer: create a text index on the ingredient names (and probably some other fields, as you can only have one text index per collection). Then the query is equally simple:
db.recipes.find({$text:{$search:"ingredient name"}})
"Hey, wait a moment! How do I get a list of all ingredients?" Let us assume we want a simple list of ingredients, with a number on how often they are actually used:
db.recipes.aggregate([
  // We want all ingredients as single values
  { $unwind: "$Ingredients" },
  // We want the response to be "Ingredient"
  { $project: { _id: 0, Ingredient: "$Ingredients.Name" } },
  // We count the occurrence of each ingredient
  // in the recipes
  { $group: { _id: "$Ingredient", count: { $sum: 1 } } }
])
This would actually be sufficient, unless you have a database of gazillions of recipes. In that case, you might want to take a deep look at incremental map-reduce instead of aggregation. Hint: you should add a timestamp to the recipes to be able to use incremental map-reduce.
If you have a couple hundred thousand to a couple million recipes, you can also add an $out stage to pre-aggregate your data.
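A minimal sketch of that pre-aggregation, writing the counts into a (hypothetical) ingredient_counts collection:
db.recipes.aggregate([
  { $unwind: "$Ingredients" },
  { $group: { _id: "$Ingredients.Name", count: { $sum: 1 } } },
  // persist the result so reads don't re-run the aggregation
  { $out: "ingredient_counts" }
])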
On measurements
Imho, it makes no sense to have predefined measurements. There are teaspoons, tablespoons, metric and imperial measurements, groupings like "dozen", and specifications like "clove", which you really do not want to convert into each other or reduce to a limited set of measurements. How many ounces is a clove of garlic? ;)
Bottom line: Make it a free text field, maybe with some autocomplete suggestions.
Revised data model
Recipe
{
  _id: new ObjectId(),
  Name: "Surf & Turf Kebap",
  Ingredients: [
    {
      Name: "Flank Steak",
      Measurement: "200 g"
    },
    {
      Name: "Prawns",
      Measurement: "300 g",
      Note: "Fresh ones!"
    },
    {
      Name: "Garlic Oil",
      Measurement: "1 Tablespoon",
      Link: "/recipes/5c2cc4acd98df737db7c5401"
    }
  ]
}
And the example of the text index:
db.recipes.createIndex({Name:"text","Ingredients.Name":"text"})
The theory behind it
A recipe is your basic data structure, as your application is supposed to store and provide recipes, potentially based on certain criteria. Ingredients and measurements (to the extent it makes sense) can easily be derived from the recipes. So why bother storing ingredients and measurements independently? It only makes your data model unnecessarily complicated, while not providing any advantage.
hth

indexing large array in mongoDB

According to the MongoDB documentation, it's not recommended to create a multikey index over large arrays, so what is the alternative?
I want to notify my app users whenever one of their contacts also starts using the app, so I have to upload and manage the contact list of each user.
We are using MongoDB with a replica set: a master with two secondary machines.
Can Mongo handle multikey indexing for arrays with hundreds of values?
Hundreds of contacts for hundreds of thousands of users can be very hard to manage.
The multikey solution looks like this:
{
  customerId: "id1",
  contacts: ["aaa", "aab", "aac", .... "zzz"]
}
index: createIndex({ contacts: 1 }).
Another solution is to save each contact in its own document, and store all the app users related to it:
{
  phone: "aaa",
  contacts: ["id1", "id2", "id3"]
},
{
  phone: "aab",
  contacts: ["id1"]
},
{
  phone: "aac",
  contacts: ["id1"]
},
......
{
  phone: "zzz",
  contacts: ["id1"]
}
index: createIndex( { phone: 1 } )
Both have poor write performance when uploading the contact list: the first has to compute a huge index, and the second has to update lots of documents concurrently.
Is there a better way to do it?
I'm using a replica set with two secondary machines; could a shard key help?
Thanks
To index a field that holds an array value, MongoDB creates an index key for each element in the array. These multikey indexes support efficient queries against array fields.
So if I were you, my data model would be like this:
{
  customerId: "id1",
  contacts: ["_idx", "_idy", "_idw", .... "_idz"]
}
And then create your index on the contacts. MongoDB creates indexes on ids by default. You will have to create new documents for the non-app users; just add a field like "app_user": true/false.
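A minimal sketch of the resulting flow, with collection and field names assumed: when the person behind "_idnew" installs the app, flag their document and look up everyone who should be notified:
db.customers.createIndex({ contacts: 1 });
// mark the newly joined contact as an app user
db.customers.updateOne({ _id: "_idnew" }, { $set: { app_user: true } });
// everyone holding "_idnew" in their contact list gets a notification
db.customers.find({ contacts: "_idnew" });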
For index performance, you can have the index built in the background without any issues; for replica sets, that is how it's done.
As for sharding, it won't help you here, because you can't shard anything yet: you have a single primary node in your cluster. Sharding needs at least two sets of primary Mongo instances, so in your case you could add a fourth server, then have two replica sets of one primary and one secondary each, then shard them and transform your system into two replicated shards.
Once this is achieved, it will balance the load between the two shards, even though a hundred documents isn't really much for MongoDB to deal with.
On the other hand, if you're going to go for sharding, you will need more setup, such as config servers, if you're using MongoDB 3.4 or higher.
