What is it called when a key in a key-value pair loses its meaning? - theory

Let's say I have a feed of data in the form of key-value pairs. This feed comes from multiple sources which provide additional fields and also reuse similar fields to mean different things.
An issue occurs when you try to extract meaning from a data set made from both sources using these similar fields.
What is this phenomenon called?
Example from source A:
{ name: 'building A', ... }
Example from source A-firewall-events:
{ name: 'rejected', ip: '...', ... }

The key, name, has no semantic association outside of its context, which in your example is A and A-firewall-events. There is no loss of data or loss of meaning. Unless you append the source, the key name will not have any inherent meaning. There exists no terminology to describe this, as it is solely the result of poor software engineering practices.
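For illustration, here is a minimal Python sketch of that suggestion - prefix each key with its source when merging the feeds (the record contents are just the ones from the example above):
# Minimal sketch: disambiguate reused keys by prefixing them with their source.
def namespace_keys(source, record):
    return {f"{source}.{key}": value for key, value in record.items()}

merged = {}
merged.update(namespace_keys("A", {"name": "building A"}))
merged.update(namespace_keys("A-firewall-events", {"name": "rejected", "ip": "..."}))
# merged now contains "A.name" and "A-firewall-events.name" as distinct keys.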

Related

Data structure for set inclusion queries

What I've done so far: I have a map of sets (sort of database), where each set is a collection of strings (different features measured by a doctor). It looks like this:
{{"temperature", "blood pressure"}: Model1, {"temperature", "weight"}: Model2, {"temperature", "blood pressure", "weight"}: Model3}
Each set maps to an ML model that is used for that particular set of measurements. Sets may have different numbers of features and overlapping features (e.g. "temperature" is frequent).
Task: the doctor makes some measurements, e.g. someone measured only {"temperature", "weight"}. I have to check which sets from my database are subsets of this set, so I know which models can be used with this data; for this example, Model2 is available. It's okay if a model does not require all measured features - I only require that the model does not need more features than those measured. I need a data structure to make such queries efficiently.
Data: it's not organized in any way yet, and I'm also not bound to a particular language (I prefer Python, since the rest of the application is in it, but it's not required). I can modify it in any way that's required, e.g. identify a model by a string ID, or throw this into some relational/non-relational database.
Question: what data structure / database type / data organization will be effective for such queries? I'm open to implementing data structures myself, as well as using SQL, MongoDB or any other solution.
The data structure that you want is a trie. That is, order the features in a canonical order (alphabetical works, frequency of feature in your models is somewhat more efficient) and put them into a nested structure (models, further_lookups) like this:
([], {
    'blood pressure': ([], {
        'temperature': ([Model1], {
            'weight': ([Model3], {})
        }),
    }),
    'temperature': ([], {
        'weight': ([Model2], {})
    }),
})
Now, given a particular set of fields, you navigate along the path you have and collect all of the models you run across: [] for the start, [] after 'temperature', and [Model2] after 'weight'.
Note that you have to try both using and not using any particular field. So if you had 'blood pressure' as well then you'd need to both try searching for models with 'blood pressure' and without. This is easily done with recursion. It theoretically can take exponential time in the number of features that you have, but is unlikely to do so in practice.
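For what it's worth, here is a minimal Python sketch of that recursive lookup, assuming each node is a (models, further_lookups) pair as in the structure above, with model names as plain strings for the example:
# Minimal sketch of the recursive lookup over the (models, further_lookups) trie.
def find_models(node, features):
    models, children = node
    found = list(models)
    # For every remaining feature, either skip it (continue the loop) or
    # follow it down the trie (recurse on the rest of the features).
    for i, feature in enumerate(features):
        if feature in children:
            found.extend(find_models(children[feature], features[i + 1:]))
    return found

trie = ([], {
    'blood pressure': ([], {'temperature': (['Model1'], {'weight': (['Model3'], {})})}),
    'temperature': ([], {'weight': (['Model2'], {})}),
})
# Features must be passed in the same canonical (alphabetical) order used to build the trie.
print(find_models(trie, ['temperature', 'weight']))                    # ['Model2']
print(find_models(trie, ['blood pressure', 'temperature', 'weight']))  # ['Model1', 'Model3', 'Model2']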
I don't have recommendations for a good implementation of a trie in a data store.

What is the best approach for storing two lists of the same type in MongoDB?

In terms of database design, I'm wondering what the best approach is between storing a reference ID or an embedded document, even if it means that the same document can appear more than once.
Let's say I have this kind of model for the moment:
Collection User :
{
name: String,
types: List<Type>,
sharedTypes: List<Type>
}
If I use the embedded model and don't use another collection, it may result in duplicate Type objects. For example, user A creates Type aa and user B creates Type bb. When they share their types with each other, it will result in:
{
name: UserA,
types : [{name: aa}]
sharedTypes: [{name:bb}]
},
{
name: UserB,
types : [{name: bb}]
sharedTypes: [{name:aa}]
}
This results in duplication, so I guess it's pretty bad design. Should I use another approach, like creating a Type collection and storing reference IDs?
Collection Type :
{
id: String
name: String
}
This will still result in duplication, but only of the ID rather than the whole document, so I guess it's better.
{
name: UserA,
types : ["randomString1"]
sharedTypes: ["randomString2"]
},
{
name: UserB,
types : ["randomString2"]
sharedTypes: ["randomString1"]
}
And the last approach, maybe the best one, is to store everything on the Type collection side, like this:
Collection User :
{
id: String
name: String
}
Collection Type :
{
id: String
name: String,
createdBy: String (id of user),
sharedWith: List<String> (ids of user)
}
What is the best approach among these three?
My queries look like this: I have a group of users, and for each user I want the types they created and the types other people shared with them.
Broadly, the decision to embed vs. use a reference ID comes down to this:
Do you need to easily preserve the referential integrity of the joined data at a point in time, meaning you want to ensure that the state of the joined data is "permanently associated" with the parent data? Then embedding is a good idea. This is also a good practice in the "insert only" design paradigm. Very often, other requirements like immutability, hashing/checksum, security, and archiving make the embedded approach easier to manage in the long run, because version / createDate management is vastly simplified.
Do you need the fastest, most quick-hit scalability? Then embed and ensure indexes are appropriately constructed. An indexed lookup followed by the extraction of a rich shape with arbitrarily complex embedded data is a very high performance operation.
(Opposite) Do you want to ensure that updates to joined data are quickly and immediately reflected in a join with parents? Then use a reference ID and the $lookup function to bring the data together.
Does the joined data grow essentially without bound, like transactions against an account? This is likely better handled through a reference ID to a separate transaction collection and joined with $lookup.
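For illustration, here is a rough pymongo sketch of the reference-ID route using the third model from the question. The database name, collection names, and the group filter are assumptions:
# Rough sketch of the third model: a Type collection with createdBy / sharedWith
# holding user ids. Names ("app", "users", "types", "groupId") are assumptions.
from pymongo import MongoClient

db = MongoClient()["app"]

def types_for_user(user_id):
    # Types the user created, plus types other people shared with them.
    return list(db.types.find({"$or": [{"createdBy": user_id},
                                       {"sharedWith": user_id}]}))

# The same join expressed from the users side with $lookup:
pipeline = [
    {"$match": {"groupId": "some-group"}},  # hypothetical group field
    {"$lookup": {"from": "types", "localField": "_id",
                 "foreignField": "createdBy", "as": "createdTypes"}},
    {"$lookup": {"from": "types", "localField": "_id",
                 "foreignField": "sharedWith", "as": "sharedTypes"}},
]
users_with_types = list(db.users.aggregate(pipeline))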

How do I properly design a data model for Recipe / Ingredient using MongoDB?

Recently I designed a database model, or ERD, using Hackalode.
The problem I'm currently facing is that, based on my current design, I can't query it the way I want. I studied ERD with MySQL and I know that Mongo doesn't work the same way.
The idea was simple: I want a recipe that has an array of ingredients, and the ingredients come from a separate collection.
The recipe also contains the measurement of each ingredient, i.e. (1 tbsp sugar).
I should also be able to query by a list of ingredients and find the recipes that contain them.
I wanted these collections to be in a many-to-many relationship, and a recipe should be able to use ingredients that are already in the database.
I just don't know how to query the data.
I have tried a lot of approaches using $elemMatch and populate, and all I get is an empty array as a result.
I'm expecting two types of query, where I can query by the name of the ingredients or by the recipe.
My expectation result would be like this
[{
  id: ...,
  name: ....,
  description: ...,
  macros: [...],
  ingredients: [
    {
      id,
      amount: ....,
      unit: ....,
      ingredient: {
        id: ....,
        name: ....
      }
    }
  ]
}, { ... }]
But instead, all I get is:
[]
Imho, your design is utterly wrong. You over-normalized your data. I would do something much simpler and use embedding. The reasoning behind that is that you define your use cases first and then model your data to answer the questions arising from your use cases in the most efficient way.
Assumed use cases
As a user, I want a list of all recipes.
As a user, I want a list of all recipes by ingredient.
As a designer, I want to be able to show a list of all ingredients.
As a user, I want to be able to link to recipes for compound ingredients, should it be present on the site.
Surely, this is just a small excerpt, but it is sufficient for this example.
How to answer the questions
Ok, the first one is extremely simple:
db.recipes.find()[.limit()[.skip()]]
Now, how could we find by ingredient? Simple answer: create a text index on the ingredient names (and probably some other fields, as you can only have one text index per collection). Then, the query is equally simple:
db.recipes.find({$text:{$search:"ingredient name"}})
"Hey, wait a moment! How do I get a list of all ingredients?" Let us assume we want a simple list of ingredients, with a number on how often they are actually used:
db.recipes.aggregate([
  // We want all ingredients as single values
  {$unwind: "$Ingredients"},
  // We want the response to be "Ingredient"
  {$project: {_id: 0, "Ingredient": "$Ingredients.Name"}},
  // We count the occurrence of each ingredient
  // in the recipes
  {$group: {_id: "$Ingredient", count: {$sum: 1}}}
])
This would actually be sufficient, unless you have a database of gazillions of recipes. In that case, you might want to have a deep look into incremental map/reduce instead of an aggregation. Hint: You should add a timestamp to the recipes to be able to use incremental map/reduce.
If you have a couple of hundred K to a couple of million recipes, you can also add an $out stage to preaggregate your data.
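For illustration, a rough pymongo version of that pre-aggregation with an $out stage appended; the database and target collection names are assumptions:
# Rough sketch: materialize the ingredient counts into a separate collection.
# Database and collection names ("cookbook", "ingredient_counts") are assumptions.
from pymongo import MongoClient

db = MongoClient()["cookbook"]
db.recipes.aggregate([
    {"$unwind": "$Ingredients"},
    {"$project": {"_id": 0, "Ingredient": "$Ingredients.Name"}},
    {"$group": {"_id": "$Ingredient", "count": {"$sum": 1}}},
    {"$out": "ingredient_counts"},  # write the result out instead of returning it
])
# Later reads hit the small pre-aggregated collection instead of scanning all recipes:
top_ingredients = db.ingredient_counts.find().sort("count", -1).limit(10)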
On measurements
Imho, it makes no sense to have predefined measurements. There are teaspoons, tablespoons, metric and imperial measurements, groupings like "dozen" and specifications like "clove", which you really do not want to convert into each other or restrict to a limited set. How many ounces is a clove of garlic? ;)
Bottom line: Make it a free text field, maybe with some autocomplete suggestions.
Revised data model
Recipe
{
  _id: new ObjectId(),
  Name: "Surf & Turf Kebap",
  Ingredients: [
    {
      Name: "Flunk Steak",
      Measurement: "200 g"
    },
    {
      Name: "Prawns",
      Measurement: "300 g",
      Note: "Fresh ones!"
    },
    {
      Name: "Garlic Oil",
      Measurement: "1 Tablespoon",
      Link: "/recipes/5c2cc4acd98df737db7c5401"
    }
  ]
}
And the example of the text index:
db.recipes.createIndex({Name:"text","Ingredients.Name":"text"})
The theory behind it
A recipe is your basic data structure, as your application is supposed to store and provide recipes, potentially based on certain criteria. Ingredients and measurements (to the extent where it makes sense) can easily be derived from the recipes. So why bother to store ingredients and measurements independently? It only makes your data model unnecessarily complicated, while not providing any advantage.
hth

How much trust should I put in the validity of data retrieved from the database?

Another way to ask my question is: "Should I keep the data types coming from the database as simple and raw as I would ask for them from my REST endpoint?"
Imagine this case class that I want to store in the database as a row:
case class Product(id: UUID,name: String, price: BigInt)
It clearly isn't and shouldn't be what it says it is, because the type signatures of name and price are a lie.
So what we do is create custom data types that better represent what things are, such as (for the sake of simplicity, imagine our only concern is the price data type):
case class Price(value: BigInt) {
require(value > BigInt(0))
}
object Price {
def validate(amount: BigInt): Either[String,Price] =
Try(Price(amount)).toOption.toRight("invalid.price")
}
//As a result my Product class is now:
case class Product(id: UUID,name: String,price: Price)
So now the process of taking user input for product data would look like this:
//this class would be parsed from i.e a form:
case class ProductInputData(name: String, price: BigInt)
def create(input: ProductInputData) = {
for {
validPrice <- Price.validate(input.price)
} yield productsRepo.insert(
Product(id = UUID.randomUUID,name = input.name,price = ???)
)
}
Look at the triple question marks (???). This is my main point of concern from an entire application architecture perspective. If I had the ability to store a column as Price in the database (for example, Slick supports these custom data types), then that means I have the option to store the price as either price: BigInt = validPrice.value or price: Price = validPrice.
I see so many pros and cons in both of these decisions and I can't decide.
Here are the arguments that I see supporting each choice:
Store data as simple database types (i.e. BigInt), because:
Performance: a simple assertion of x > 0 on the creation of Price is trivial, but imagine you want to validate a custom Email type with a complex regex - it would be detrimental upon retrieval of collections.
Tolerance against corruption: if a BigInt is inserted as a negative value, it wouldn't explode in your face every time your application tried to simply read the column and throw it out onto the user interface. It would, however, cause problems if it got retrieved and then involved in some domain layer processing, such as a purchase.
Store data as its domain-rich type (i.e. Price), because:
No implicit reasoning and trust: other methods elsewhere in the system would need the price to be valid. For example:
//two terrible variations of a calculateDiscount method:
//this version simply trusts that price is already valid and came from db:
def calculateDiscount(price: BigInt): BigInt = {
//apply some positive coefficient to price and hopefully get a positive
//number from it and if it's not positive because price is not positive then
//it'll explode in your face.
}
//this version is even worse. It does retain function totality and purity,
//but the unforgivable culture it encourages is the kind of defensive and
//paranoid programming that causes every developer to write guard
//expressions performing duplicated validation all over!
def calculateDiscount(price: BigInt): Option[BigInt] = {
if (price <= BigInt(0))
None
else
Some{
//Do safe processing
}
}
//ideally you want it to look like this:
def calculateDiscount(price: Price): Price
No constant conversion of domain types to simple types and vice versa: for representation, storage, the domain layer and so on, you simply have one representation in the system to rule them all.
The source of all this mess that I see is the database. If data were coming from the user, it'd be easy: you simply never trust it to be valid. You ask for simple data types, cast them to domain types with validation, and then proceed. But not the DB. Does modern layered architecture address this issue in some definitive, or at least mitigating, way?
Protect the integrity of the database. Just as you would protect the integrity of the internal state of an object.
Trust the database. It doesn't make sense to check and re-check what has already been checked going in.
Use domain objects for as long as you can. Wait till the very last moment to give them up (raw JDBC code or right before the data is rendered).
Don't tolerate corrupt data. If the data is corrupt, the application should crash. Otherwise it's likely to produce more corrupt data.
The overhead of the require call when retrieving from the DB is negligible. If you really think it's an issue, provide 2 constructors, one for the data coming from the user (performs validation) and one that assumes the data is good (meant to be used by the database code).
I love exceptions when they point to a bug (data corruption because of insufficient validation on the way in).
That said, I regularly leave requires in code to help catch bugs in more complex validation (maybe data coming from multiple tables combined in some invalid way). The system still crashes (as it should), but I get a better error message.

Cassandra frozen keyword meaning

What's the meaning of the frozen keyword in Cassandra?
I'm trying to read this documentation page: Using a user-defined type, but their explanation for the frozen keyword (which they use in their examples) is not clear enough for me:
To support future capabilities, a column definition of a user-defined
or tuple type requires the frozen keyword. Cassandra serializes a
frozen value having multiple components into a single value. For
examples and usage information, see "Using a user-defined type",
"Tuple type", and Collection type.
I haven't found any other definition or a clear explanation for it on the net.
In Cassandra, if you define a UDT or collection as frozen, you can't update the UDT's or collection's individual items; you have to reinsert the full value.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob; the entire value must be overwritten.
Source : https://docs.datastax.com/en/cql/3.1/cql/cql_reference/collection_type_r.html
#Alon : "Long story short: frozen = immutable"
For a user-defined type, I have noticed that you can update the frozen set by adding or removing set elements. For example, let's say we create a table as follows.
create type rule_option_udt(display text, code text);
create TABLE rules (key text, options set<frozen<rule_option_udt>>, primary key (key));
As per the definition, we expect that we won't be able to use + or - operations on the options field. But you can use add or remove operations as with a normal set.
insert into rules (key, options) values ('fruits', {{display: 'Apple', code: 'apple'}}) ;
update rules set options = options + {{display: 'Orange', code: 'orange'}} where key = 'fruits';
The above update is valid, but you can't do the same with simple types such as text. I am not sure whether this is expected behaviour or not, but this is how it works for me.
In addition to this, frozen<set<udt>> and set<frozen<udt>> behave in the same manner. While describing the table, you can see both are shown as set<frozen<udt>>.

Resources