Data Dictionary for NoSQL Database

Let's say I'm using Firebase Cloud Firestore, which is a NoSQL database. An example of the database content is as follows:
- customers
  - customer_01
    - name: "abc"
    - email: "abc@gmail.com"
  - customer_02
    - name: "bcd"
    - email: "bcd@gmail.com"
- items
  - item_01
    - name: "book"
    - price: 22.50
  - item_02
    - name: "pencil"
    - price: 1.50
How do I create a data dictionary for a NoSQL database?

I would say that data dictionaries are not something that makes a lot of sense for NoSQL databases. Because of the flexibility they offer, the number of fields may vary from document to document, which makes it difficult to keep the data "predictable" enough to describe in a dictionary.
If you really need one for project or client requirements, and your data is in a predictable enough format, you could keep it simple and just build a dictionary using the name, type, and description of each field, since you can't really limit the size of fields directly in Firestore, only in your code. You could also note the dictionary's lack of further constraints with an explanation similar to this one.
So, for the data you share, I would do the following:
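A minimal sketch of such a dictionary for the sample data, with the types inferred from the example values:

Collection | Field | Type   | Description
customers  | name  | string | Customer's display name
customers  | email | string | Customer's e-mail address
items      | name  | string | Item name
items      | price | number | Item price, e.g. 22.50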

Related

What is the best approach for storing two lists of the same type in MongoDB?

I'm wondering, in terms of database design, what the best approach is between storing a reference id or an embedded document, even if that means the same document can appear more than once.
Let's say I have this kind of model for the moment:
Collection User:
{
  name: String,
  types: List<Type>,
  sharedTypes: List<Type>
}
If I use the embedded model and don't use another collection, it may result in duplicate Type objects. For example, user A creates Type aa and user B creates Type bb. When they share their types with each other, it results in:
{
  name: UserA,
  types: [{name: aa}],
  sharedTypes: [{name: bb}]
},
{
  name: UserB,
  types: [{name: bb}],
  sharedTypes: [{name: aa}]
}
This results in duplication, so I guess it's a pretty bad design. Should I use another approach, like creating a Type collection and storing reference IDs?
Collection Type:
{
  id: String,
  name: String
}
This will still result in duplication, but of IDs rather than whole documents, so I guess it's better.
{
  name: UserA,
  types: ["randomString1"],
  sharedTypes: ["randomString2"]
},
{
  name: UserB,
  types: ["randomString2"],
  sharedTypes: ["randomString1"]
}
And the last approach, maybe the best one, is to store everything from the Type collection side, like this:
Collection User:
{
  id: String,
  name: String
}
Collection Type:
{
  id: String,
  name: String,
  createdBy: String (id of user),
  sharedWith: List<String> (ids of users)
}
What is the best approach among these 3?
My queries look like this: I have one group of users, and for each user I want the types they created and the types other people shared with them.
Broadly, the decision to embed vs. use a reference ID comes down to this:
Do you need to easily preserve the referential integrity of the joined data at a point in time, meaning you want to ensure that the state of the joined data is "permanently associated" with the parent data? Then embedding is a good idea. This is also a good practice in the "insert only" design paradigm. Very often other requirements like immutability, hashing/checksums, security, and archiving make the embedded approach easier to manage in the long run, because version / createDate management is vastly simplified.
Do you need the fastest, most quick-hit scalability? Then embed and ensure indexes are appropriately constructed. An indexed lookup followed by the extraction of a rich shape with arbitrarily complex embedded data is a very high performance operation.
(Opposite) Do you want to ensure that updates to joined data are quickly and immediately reflected in a join with the parents? Then use a reference ID and the $lookup stage to bring the data together (see the sketch below).
Does the joined data grow essentially without bound, like transactions against an account? This is likely better handled through a reference ID to a separate transaction collection and joined with $lookup.
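As an illustration of the reference-ID route applied to your third design, here is a minimal sketch (the collection names users and types, and the use of _id, are assumptions) that, for one user, returns the types they created and the types shared with them:

// Sketch only: assumes a "users" collection and a "types" collection shaped
// like the third design (createdBy holds a user id, sharedWith a list of user ids).
db.users.aggregate([
  { $match: { _id: "userA" } },
  // types this user created
  { $lookup: { from: "types", localField: "_id", foreignField: "createdBy", as: "types" } },
  // types other users shared with this user ($lookup matches values inside the sharedWith array)
  { $lookup: { from: "types", localField: "_id", foreignField: "sharedWith", as: "sharedTypes" } }
])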

How do I properly design a data model for Recipe / Ingredient using MongoDB

Recently I designed a database model, or ERD, using Hackalode.
The problem I'm currently facing is that, based on my current design, I can't query it the way I want. I studied ERD with MySQL and I know Mongo doesn't work the same way.
The idea was simple: I want a recipe that has an array list of ingredients, and the ingredients come from a separate collection.
The recipe also consists of a measurement for each ingredient, i.e. (1 tbsp sugar).
It should also be possible to query from a list of ingredients and find the recipes that contain those ingredients.
I wanted these collections to be in a many-to-many relationship, and a recipe should be able to use ingredients that are already in the database.
I just don't know how to query the data.
I have tried a lot of ways using $elemMatch and populate, and all I get is an empty array as a result.
I'm expecting two types of queries, where I can query by ingredient name or by recipe.
My expected result would be like this:
[{
  id: ...,
  name: ....,
  description: ...,
  macros: [...],
  ingredients: [
    {
      id,
      amount: ....,
      unit: ....,
      ingredient: {
        id: ....,
        name: ....
      }
    }
  ]
}, { ... }]
But instead, all I'm getting is
[]
Imho, your design is utterly wrong. You over-normalized your data. I would do something much simpler and use embedding. The reasoning behind that is that you define your use cases first, and then you model your data to answer the questions arising from those use cases in the most efficient way.
Assumed use cases
As a user, I want a list of all recipes.
As a user, I want a list of all recipes by ingredient.
As a designer, I want to be able to show a list of all ingredients.
As a user, I want to be able to link to recipes for compound ingredients, should it be present on the site.
Surely, this is just a small excerpt, but it is sufficient for this example.
How to answer the questions
Ok, the first one is extremely simple:
db.recipes.find()[.limit()[.skip()]]
Now, how could we find by ingredient? Simple answer: create a text index on the ingredient names (and probably some other fields, as you can only have one text index per collection). Then the query is equally simple:
db.recipes.find({$text:{$search:"ingredient name"}})
"Hey, wait a moment! How do I get a list of all ingredients?" Let us assume we want a simple list of ingredients, with a number on how often they are actually used:
db.recipes.aggregate([
  // We want all ingredients as single values
  {$unwind: "$Ingredients"},
  // We want the response field to be "Ingredient"
  {$project: {_id: 0, "Ingredient": "$Ingredients.Name"}},
  // We count the occurrence of each ingredient
  // in the recipes
  {$group: {_id: "$Ingredient", count: {$sum: 1}}}
])
This would actually be sufficient, unless you have a database of gazillions of recipes. In that case, you might want to have a deep look into incremental map/reduce instead of an aggregation. Hint: You should add a timestamp to the recipes to be able to use incremental map/reduce.
If you have a couple of hundred K to a couple of million recipes, you can also add an $out stage to preaggregate your data.
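A sketch of that preaggregation, simply appending $out to the pipeline above (the output collection name is an assumption):

db.recipes.aggregate([
  {$unwind: "$Ingredients"},
  {$project: {_id: 0, "Ingredient": "$Ingredients.Name"}},
  {$group: {_id: "$Ingredient", count: {$sum: 1}}},
  // Write the results to a collection instead of returning a cursor
  {$out: "ingredientCounts"}
])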
On measurements
Imho, it makes no sense to have predefined measurements. There are teaspoons, tablespoons, metric and imperial measurements, groupings like "dozen", and specifications like "clove". You really do not want to convert these to each other, or restrict yourself to a limited set of measurements. How many ounces is a clove of garlic? ;)
Bottom line: Make it a free text field, maybe with some autocomplete suggestions.
Revised data model
Recipe
{
  _id: new ObjectId(),
  Name: "Surf & Turf Kebap",
  Ingredients: [
    {
      Name: "Flank Steak",
      Measurement: "200 g"
    },
    {
      Name: "Prawns",
      Measurement: "300 g",
      Note: "Fresh ones!"
    },
    {
      Name: "Garlic Oil",
      Measurement: "1 Tablespoon",
      Link: "/recipes/5c2cc4acd98df737db7c5401"
    }
  ]
}
And the example of the text index:
db.recipes.createIndex({Name:"text","Ingredients.Name":"text"})
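As for the autocomplete suggestions mentioned under "On measurements", the distinct measurement strings already stored in the recipes could feed them; a sketch against the revised model:

// Returns every distinct measurement string used so far, e.g. ["200 g", "300 g", "1 Tablespoon"]
db.recipes.distinct("Ingredients.Measurement")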
The theory behind it
A recipe is your basic data structure, as your application is supposed to store and provide recipes, potentially based on certain criteria. Ingredients and measurements (to the extent where it makes sense) can easily be derived from the recipes. So why bother storing ingredients and measurements independently? It only makes your data model unnecessarily complicated, while not providing any advantage.
hth

Neo4j or MongoDB for relative attributes/relations

I wish to build a database of objects with various types of relations between them. It will be easier for me to explain by giving an example.
I wish to have a set of objects, each is described by a unique name and a set of attributes (say, height, weight, colour, etc.), but instead of values, these attributes may contain values which are relative to other objects. For example, I might have two objects, A and B, where A has height 1 and weight "weight of B + 2", and B has height "height of A + 3" and weight 4.
Some objects may have completely different attributes; for example, object C may represent a box, and objects A and B will be related to C by relations like "appears x times in C".
Queries may include "what is the height of A/B?" or "what is the total weight of the objects appearing in C, counted with multiplicity?"
I am a bit familiar with MongoDB, and fond of its simplicity. I have heard of Neo4j, but never tried working with it. From its description, it sounds more suitable for my needs (but I can't tell whether it is capable of the task). But is MongoDB, with its simplicity, suitable as well? Or perhaps a different database engine?
I am not sure it matters, but I plan to use Python as the engine which processes the queries and their outputs.
Either can do this. I tend to prefer neo4j, but either way could work.
In neo4j, you'd create a graph consisting of a node (A) and its "base" (B). You could then connect them like this:
(A:Node { weight: "base+2" })-[:base]->(B:Node { weight: 2 })
Note that modeling in this way would make it possible to change the base relationship to point to another node without changing anything about A. The downside is that you'd need a mini calculator to expand expressions like "base+2", which is easy but in any case extra work.
Interpreting your question another way, you're in a situation where you'd probably want a trigger; that's how graphs handle this (neo4j supports triggers). If parsing that expression "base+2" at read time isn't what you want, and you want to actually set the value on A to be b.weight + 2, then you want a trigger. This would let you define some other function to be run when the graph gets updated in a certain way. In this case, when someone inserts a new :base relationship in the graph, you might check the base value (the endpoint of the relationship), add 2 to its weight, and set that new property value on the source of the relationship.
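On the MongoDB side of "either can do this", a similar read-time approach is possible; a minimal sketch (collection and field names are assumptions) stores a reference to the base object plus an offset and computes the effective value with $lookup:

db.objects.insertMany([
  { _id: "B", weight: 2 },
  { _id: "A", baseId: "B", weightOffset: 2 }   // weight of A = weight of its base + 2
])
db.objects.aggregate([
  { $match: { _id: "A" } },
  // Pull in the referenced base object
  { $lookup: { from: "objects", localField: "baseId", foreignField: "_id", as: "base" } },
  { $unwind: "$base" },
  // Compute the effective weight at read time
  { $project: { weight: { $add: ["$base.weight", "$weightOffset"] } } }
])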
Yes, you can use either DBMS.
To help you decide, this is an example of how to support your use cases in neo4j.
To create your sample data:
CREATE
(a:Foo {id: 'A'}), (b:Foo {id: 'B'}), (c:Box {id: 123}),
(h1:Height {value: 1}), (w4:Weight {value: 4}),
(a)-[:HAS_HEIGHT {factor: 1, offset: 0}]->(h1),
(a)-[:HAS_WEIGHT {factor: 1, offset: 2}]->(w4),
(b)-[:HAS_WEIGHT {factor: 1, offset: 0}]->(w4),
(b)-[:HAS_HEIGHT {factor: 1, offset: 3}]->(h1),
(c)-[:CONTAINS {count: 5}]->(a),
(c)-[:CONTAINS {count: 2}]->(b);
"A" and "B" are represented by Foo nodes, and "C" by a Box node. Since a given height or weight can be referenced by multiple nodes, this example data model uses shared Weight and Height nodes. The HAS_HEIGHT and HAS_WEIGHT relationships have factor and offset properties to allow adjustment of the height or weight for a particular Foo node.
To query "What is the height of A":
MATCH (:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height)
RETURN ra.factor * ha.value + ra.offset AS height;
To query "What is the ratio of the heights of A and B":
MATCH
(:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height),
(:Foo {id: 'B'})-[rb:HAS_HEIGHT]->(hb:Height)
RETURN
TOFLOAT(ra.factor * ha.value + ra.offset) /
(rb.factor * hb.value + rb.offset) AS ratio;
Note: TOFLOAT() is used above to make sure integer division, which would truncate, is never used.
To query "What is the total weight of objects appearing in C":
MATCH (:Box {id: 123})-[c:CONTAINS]->()-[rx:HAS_WEIGHT]->(wx:Weight)
RETURN SUM(c.count * (rx.factor * wx.value + rx.offset));
I have not used Mongo, and decided not to after studying it. So filter my opinion with that in mind; Mongo users may find my comments easy to counter. Mongo is not a true graph database: the user must create and manage the relationships. In Neo4j, relationships are "native" and robust.
There is a head-to-head comparison at this site:
https://db-engines.com/en/system/MongoDB%3bNeo4j
see also: https://stackoverflow.com/questions/10688745/database-for-efficient-large-scale-graph-traversal
There is a distinction between NoSQL (e.g., Mongo) and a true graph database. Many seem to assume that if it is not SQL, then it's a graph database. This is not true. Most NoSQL databases do not store relationships. The free book describes this in chapter 2.
Personally, I'm sold on Neo4j. It makes relationships, traversing graphs, and collecting lists along the path easy and powerful.

Lack of rich querying functionality in NoSQL?

Every time I contemplate using NoSQL for a solution, I always get hung up on the lack of rich querying functionality. It may very well be my lack of understanding of NoSQL. It also might be because I'm very comfortable with SQL. From my understanding, NoSQL really lends itself well to simple-schema scenarios (so it's probably not going to work well for something that, in a relational database, would have 50+ tables). Even for trivial scenarios I always seem to want rich query functionality. Let's take a recipe database as a trivial example.
While the schema is, no doubt, trivial, you would definitely want rich querying ability. You would probably want to search by the following (and more):
Title
Tag
Category
id
Likes
User who created recipe
create date
rating
dietary restrictions
You would also want to combine these criteria in any combination you wanted. While I know most NoSQL solutions have secondary indexes, doesn't this need for querying ability severely limit how many scenarios NoSQL is relevant for? I usually need this rich querying ability. Another good example would be a bug-tracking application.
I don't think you want to kick off a map-reduce job every time someone wants to search the database (I think this would be analogous to doing table scans most of the time in a traditional relational model). So I would assume there would be a lot of queries where you would have to loop through each entity and check the criteria you wanted to search for (which would probably be slow). I understand you can run nightly map-reduce jobs to either analyze the data or maybe normalize it into a typical relational structure for reports.
Now I can see it being useful for scenarios where you would most likely always have to read all the data anyway. Think of a web server log, or maybe an IoT type of app where you're collecting massive amounts of data (like sensor readings) and doing nightly analysis.
So is my understanding of NoSQL off, or is there a limit to the number of scenarios that it works well with?
I think the issue you are encountering is that you are approaching noSQL with the same design mindset that you would with SQL. You mentioned "rich querying" several times. To me, that points towards design flaws (using only reference ids/trying to define relationships). A significant concept in noSQL is that data can be repeated (and often should be). Your recipe example is actually a great use case for noSQL. Here's how I would approach it using 3 of the models you mention (for simplicity's sake):
Recipe = {
  _id: a001,
  name: "Burger",
  ingredients: [
    {
      _id: b001,
      name: "Beef"
    },
    {
      _id: b002,
      name: "Cheese"
    }
  ],
  createdBy: {
    _id: c001,
    firstName: "John",
    lastName: "Doe"
  }
}

Person = {
  _id: c001,
  firstName: "John",
  lastName: "Doe",
  email: "jd@email.com",
  preferences: {
    emailNotifications: true
  }
}

Ingredient = {
  _id: b001,
  name: "Beef",
  brand: "Agri-co",
  shelfLife: "3 days",
  calories: 300
};
The reason I designed it this way is expressly for the purpose of its existence (assuming it's something like allrecipes.com). When searching/filtering recipes, you can filter by the author, but the author's email preferences are irrelevant. Similarly, the shelf life and brand of an ingredient are irrelevant. The schema is designed for the specific use case, not just because your data needs to be saved. Now here are a few of your mentioned queries (mongo):
db.recipes.find({name: "Burger"});
db.recipes.find({"ingredients.name": { $nin: ["Cheese", "Milk"] }}); // dietary restrictions
Your rich querying concerns have now been reduced to single queries in a single collection.
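And criteria combine the same way; a sketch of one combined filter over the model above (the values are hypothetical):

db.recipes.find({
  "createdBy.lastName": "Doe",             // recipes by a given user
  "ingredients.name": { $nin: ["Milk"] }   // honoring a dietary restriction
})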
The downside of this design is slower write speed. You need more logic on the backend, with the potential for more programmer error. The write speed is also slower than SQL due to accessing the various models to grab relevant information. That being said, how often is it viewed vs. how often is it written/edited? (this was my comment on reading trumping writing) The other major downside is the necessity of foresight. The relationship between an ingredient and a recipe doesn't change forms. But the information your application requires might. Editing a noSQL model tends to be more difficult than editing a SQL table.
Here's one other contrived example using the same models to emphasize my point about purposeful design. Assume your new site is on famous chefs instead of a recipe database:
Person = {
  _id: c001,
  firstName: "Paula",
  lastName: "Deen",
  recipeCount: 15,
  commonIngredients: [
    {
      _id: b001,
      name: "Butter",
      count: 15
    },
    {
      _id: b002,
      name: "Salted Butter",
      count: 15
    }
  ],
  favoriteRecipes: [
    {
      _id: a001,
      name: "Fried Butter",
      calories: "3000"
    }
  ]
};

Recipe = {
  _id: a001,
  name: "Fried Butter",
  ingredients: [
    {
      _id: b001,
      name: "Butter"
    }
  ],
  directions: "Fry butter. Eat.",
  calories: "3000",
  rating: 99,
  createdBy: {
    _id: c001,
    firstName: "Paula",
    lastName: "Deen"
  }
};

Ingredient = {
  _id: b001,
  name: "Butter",
  brand: "Butterfields",
  shelfLife: "1 month"
};
Both of these designs use the same information, but they are modeled for the specific reason you bothered gathering the information. Now, you have the requisite information for a chef list page and typical sorting/filtering. You can navigate from there to a recipe page and have that info available.
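For instance, the chef list page could then be served by something as simple as this sketch (the collection name persons is an assumption):

// Project just the list-page fields and sort by number of recipes
db.persons.find({}, { firstName: 1, lastName: 1, recipeCount: 1 }).sort({ recipeCount: -1 })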
Design for the use case, not to model relationships.

Representing a play in a relational DB

I run a project that deals with plays, mostly Shakespeare. Right now, for storage, we just parse them out into a JSON array formatted like this:
[0: {"title": "Hamlet", "author": ["Shakespeare", "William"], "noActs": 5}, | info
1: [null, ["Scene 1 Setting", "Stage direction", ["character", ["char line 1", "2"]] | act 1
...]
and save them to a file.
We'd now like to move to a relational database for a variety of reasons (foremost, currently, search), but have no idea how to represent things.
I'm looking for an outline of the best way to do this.
The topic is 'normalization'.
Begin by identifying your main classifications, such as PLAY, AUTHOR, ACT, SCENE.
Then add attributes - specifically, a primary key like play_id, act_id, etc.
Then add still more attributes, like NAME or other identifying information.
Then, finally, add relationships between these objects by creating more tables, like PLAY_AUTHOR, which includes PLAY_ID and AUTHOR_ID (a rough sketch follows below).
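Putting those steps together, a rough first-pass sketch of the tables (column names are assumptions drawn from the JSON above, not a definitive schema):
- PLAY(play_id, title, no_acts)
- AUTHOR(author_id, last_name, first_name)
- PLAY_AUTHOR(play_id, author_id)
- ACT(act_id, play_id, act_number)
- SCENE(scene_id, act_id, setting)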
Not much of a theater goer myself, but this may give you some ideas should you choose a relational DB.
