I'm writing a DB migration using Laravel and I'm wondering if I can set a column as not nullable based on the value of another column.
For example:
If User.owns_cat === true
Then User.cat_name->nullable(false)
I know I can handle this via validation rules later but would like to have this rule at the DB level.
What you really need is support for conditional functional dependencies and association rules.
Association Rules (AR)
c => f(A)
where c is a logically determinable condition which, if met, implies that the column set A fulfills some property, where
A = (A1, A2, ..., An)
Conditional Functional Dependencies (CFD)
c => A -> B
where c is a logically determinable condition which, if met, implies that the column set A determines the values of the column set B, where
A = (A1, A2, ..., An)
and
B = (B1, B2, ..., Bn)
Problem
RDBMSs do not tend to support ARs or CFDs out of the box, at least that was the case the last time I checked. So, since the underlying system probably does not support the feature you need, Laravel migrations will probably not achieve it either.
Solutions
The problem with deriving the schema from application code is that this task is not really appropriate for the ORM level of the tech stack, because:
you use some schema generator you can only indirectly influence
more complex problems are very difficult to solve via such a system
the schema change generator might have bugs, which unnecessarily adds a new level of worry to your problems
So, what you can do:
you can implement application-level functionality that checks some condition and enforces the rule: if the rule would be violated, it either throws an error or sets a default; if the rule is satisfied, it performs the write
you can implement a trigger in the database which, before insert/update, checks and enforces the rules you need
schema solution
periodically running data fix
Trigger
You would create a trigger which checks the value of column2 and, if your condition is met, checks whether column1 would be null. If so, you might choose to throw an error or set a default.
Advantage: Your data consistency will be maintained even if write operations happen outside your application.
Disadvantage: This will be quite difficult to maintain if the rules change frequently, and the trigger does not have access to your application-level resources.
Whether this approach is ideal for you depends on your needs.
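As an illustration, here is a minimal sketch of such a trigger in MySQL syntax, using the hypothetical users.owns_cat and users.cat_name columns from the question and choosing the "throw an error" variant; an analogous BEFORE UPDATE trigger would be needed to cover updates:
DELIMITER //
CREATE TRIGGER users_cat_name_check
BEFORE INSERT ON users
FOR EACH ROW
BEGIN
    -- enforce: owns_cat = TRUE implies cat_name IS NOT NULL
    IF NEW.owns_cat = TRUE AND NEW.cat_name IS NULL THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'cat_name is required when owns_cat is true';
    END IF;
END//
DELIMITER ;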
Schema Solution
So, suppose that if c1 has some value, then c2 must not be null. You can push this into the schema: move the conditionally required data into a separate table in which the column is mandatory, and tie it back to the original table with a foreign key. When your condition on c1 is met, ensure that a corresponding row exists in the new table.
Advantage: You use only resources the RDBMS provides you out of the box.
Disadvantage: This approach is counter-intuitive if your condition for c2 is complex.
The applicability of this approach depends on the complexity of your condition.
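As a sketch of one possible reading of this, with the hypothetical names from the question: owning a cat is represented by the existence of a row in a child table in which the cat's name is mandatory, so the rule holds by construction:
CREATE TABLE user_cats (
    user_id  BIGINT PRIMARY KEY,
    cat_name VARCHAR(255) NOT NULL,  -- mandatory here, no condition needed
    FOREIGN KEY (user_id) REFERENCES users(id)
);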
Periodically Running Data Fix
You can allow temporary inconsistencies with your rules and resolve them automatically, on a periodic basis.
Advantages: The solution might be a single request to the database server, which would execute a stored procedure or something similar, performing the tasks quickly.
Disadvantages: You cannot rely on your rules being enforced for every record at every moment.
This is applicable if you do not worry about your rule being temporarily breached as long as it ends up being correct fairly quickly.
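A minimal sketch of such a periodic fix, again with the hypothetical columns from the question and choosing the "set a default" variant:
-- run periodically (cron job, scheduled database event, ...)
UPDATE users
SET    cat_name = 'unknown'
WHERE  owns_cat = TRUE
  AND  cat_name IS NULL;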
Application-Level Support
You can implement a class which has a set method. That set method receives an entity and a function that returns a boolean. If the function returns true, the value is written; otherwise it either throws an exception or sets a default.
I know the concept of SCD-2 and I'm trying to improve my skills with it by doing some practice.
I have the following scenario/experiment:
I call a REST API daily to extract information about companies.
In my initial load to the DB everything is new, so everything is very easy.
The next day I call the same REST API, which may return the same companies, but some of them may (or may not) have changes (e.g., they changed the size, the profits, the location, ...).
I know SCD-2 might be really simple if the REST API returned only changed records, but in this case it may also return records without changes.
In this scenario, how do people detect whether a company's data has changed in order to apply SCD-2? Do they compare all the fields?
Is there any example out there that I can see?
There is no standard SCD-2, nor even a single concept of it. It is a general term for a large number of possible approaches. The only way forward is to practice and see what is suitable for your use case.
In any case you must identify the natural key of the dimension and the set of attributes for which you want to keep history.
You may of course make it more complex by deciding to use your own surrogate key.
You mentioned that there are two main types of interface for the process:
• You periodically get a full set of the dimension data
• You get the “changes only” (aka delta interface)
Paradoxically the former is much simpler to handle than the latter.
First of all, in the full dimensional snapshot the natural key is unique, contrary to the delta interface (where you may get several changes for one entity).
Additionally, with a delta interface you have to handle late delivery of changes, or even changes delivered in the wrong order.
The next important decision is whether you expect deletes to occur. This is again trivial with the full interface; for the delta interface you must define some convention for how this information is passed.
A connected question is whether a previously deleted entity can be reused (i.e. reappear in the data).
If you support delete/reuse you'll have to think about how to show them in your dimension table.
In any case you will need some additional columns in the dimension to cover the historical information.
Some implementations use a change_timestamp; others use a validity interval with valid_from and valid_to.
Still other implementations claim that an additional sequence number is required, so you avoid the trap of several changes with an identical timestamp.
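A minimal sketch of a dimension table combining these options (hypothetical names, with company_id as the natural key plus a surrogate key, a validity interval and a version number):
CREATE TABLE dim_company (
    company_sk  BIGINT PRIMARY KEY,    -- surrogate key
    company_id  VARCHAR(50)  NOT NULL, -- natural key from the API
    size        VARCHAR(20),
    profits     DECIMAL(18,2),
    location    VARCHAR(100),
    version     INT NOT NULL,          -- avoids the identical-timestamp trap
    valid_from  TIMESTAMP NOT NULL,
    valid_to    TIMESTAMP              -- NULL for the current version
);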
So you see that before you look for a particular implementation you need to carefully decide on the options above. For example, the full and the delta interface lead to completely different implementations.
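For the full-snapshot interface, the change detection the question asks about boils down to comparing the tracked attributes of today's snapshot with the current dimension rows. A sketch in PostgreSQL syntax (hypothetical staging table; row-wise IS DISTINCT FROM is used so that NULLs compare sanely):
-- companies whose current version differs from today's snapshot
-- and therefore need a new SCD-2 version
SELECT s.company_id
FROM   staging_companies s
JOIN   dim_company d
       ON  d.company_id = s.company_id
       AND d.valid_to IS NULL
WHERE  (s.size, s.profits, s.location)
       IS DISTINCT FROM
       (d.size, d.profits, d.location);
An alternative is to store a hash of the tracked attributes in the dimension and compare hashes instead of long column lists.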
The picture below shows a sequence diagram for two clients storing values into a key-value datastore:
The problem I'm trying to solve is how to prevent overwriting keys. The way the applications (Client_A and Client_B) prevent this is by checking whether the key exists before storing. The issue is that if both clients manage to get the same "does not exist" result, either of the two clients could overwrite the other's value.
What strategy can be used to prevent this from happening in a database client design?
A "key-value store", as it's usually defined, doesn't store duplicate keys at all. If two clients write to the same key, then only one value is stored -- the one from which ever client wrote "latest".
In order to reliably update values in a consistent way (where the new value depends on the old value associated with a key, or even whether or not there was an old value), your key-value store needs to support some kinds of atomic operations other than simple get and set.
Memcache, for example, supports atomic compare-and-set operations that will only set a value if it hasn't been set by someone else since you read it. Amazon's DynamoDB supports atomic transactions, atomic counters, etc.
START TRANSACTION;
SELECT ... FOR UPDATE;
-- take action depending on the result
UPDATE ...;
COMMIT;
The "transaction" makes the pair. SELECT and UPDATE, "atomic".
Write this sort of code for any situation where another connection can sneak in and mess up what you are doing.
Note: The code written here uses MySQL's InnoDB syntax and semantics; adjust accordingly for other brands.
I'm currently working on generalizing a platform which handles incoming payments, potentially keeps some of the money for offsetting reasons, and then pays out the rest. So far, the system has been made to work with a single type of incoming payment, but we're now generalizing it to handle different types. Each payment type has its own quirks, but a lot of similarities, so we have decided to extract all behavior that varies between each type of "money" into database tables. The tables basically store user configurations, such as algorithms (basically different code fragments, i.e., the name of code to be executed), booleans and passed parameters.
We must store the history of previously configured values for a particular type of money, since configurations may change over time. Ultimately there is a finite number of configuration points in the code.
Solutions
It seems as though there are basically three overall approaches for storing "user" configurations: as an EAV table in the RDBMS, as explicit columns for each configuration point, or via JSON/XML/some other text format.
EAV
EAV has its obvious strengths in making everything very general, and thus easy to extract from the database. The largest concern is data integrity: every configuration parameter must be specified for every type of money, and ensuring that becomes more difficult with EAV. Keeping the history is straightforward: simply add a version/timestamp column, and select the latest version for a particular type of money.
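A sketch of such an EAV configuration table with a version column (hypothetical, Oracle-flavoured names); every parameter is re-inserted for each new version, and the latest configuration for a type of money is the row set with the highest version:
CREATE TABLE money_type_config (
    money_type  VARCHAR2(50)   NOT NULL,
    param_name  VARCHAR2(100)  NOT NULL,
    param_value VARCHAR2(4000),
    version     NUMBER         NOT NULL,
    PRIMARY KEY (money_type, param_name, version)
);
-- current configuration for one type of money
SELECT c.param_name, c.param_value
FROM   money_type_config c
WHERE  c.money_type = :money_type
  AND  c.version = (SELECT MAX(version)
                    FROM   money_type_config
                    WHERE  money_type = c.money_type);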
Explicit columns
Explicit columns, i.e., one column for every configuration point, make it much simpler to ensure that all configuration points have been defined. Since eventually no more configuration points are added, this seems preferable to EAV. Keeping a history becomes more troublesome, though. Oracle comes with audit functionality, but it does not seem to lend itself easily to being presented in a GUI. A separate log table can be used, but it is cumbersome to maintain, especially if this is done with triggers: every addition of a column (i.e., a new configuration point) means the trigger has to be regenerated.
An alternative to the trigger is to have a versioning column in the configuration tables. Then a view can be used on top of the configuration table, selecting the latest row for each type of money. However, adding a new column becomes troublesome: if it is non-nullable, old rows (i.e. those that have become history) must be updated with potentially bogus values. This can obviously be avoided by relaxing the non-nullable constraint, but then data integrity becomes much worse, as the values must still be specified.
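For completeness, a sketch of that view variant over an explicit-column table (hypothetical table and column names), exposing only the most recent row per type of money:
CREATE OR REPLACE VIEW current_money_config AS
SELECT money_type, algorithm_name, some_flag, some_parameter  -- one column per configuration point
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY money_type
                              ORDER BY version DESC) AS rn
    FROM   money_config c
)
WHERE  rn = 1;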
I haven't looked too much into the JSON/XML approach, but it seems to have many of the same problems as EAV.
My question: is there a standardized way of approaching this? It seems that most people who deal with user configurations don't have to keep an exact history in tables as well. One could simply use the migration SQL files that are created, but configurations may be changed from a GUI as well.
I am working on an application that someone else wrote and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, let's say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It brings in a heavy dependence on the code to manage and maintain references, fetch the necessary values, etc. In a situation where you now need to create additional functionality, you would rely on copy-pasting the mapping (or importing it, etc.), which is more likely to cause issues.
It's similar to why DB constraints should be in the DB rather than in the program/application that's accessing it: any maintenance or new application needs to replicate all the behaviour and rules. Having things this way has side-effects similar to the ones I've mentioned in another answer.
Good reasons to have a lookup table:
Since DBs naturally support these kinds of relations, it is obvious to use them.
Without a lookup table, queries first have to be assembled in code to translate between Type/SubType text and IDs, instead of handling that as part of the where/having clause of the query that is actually executed.
Speed/Performance - with the right indexes and table structures, you'd benefit from this (and reduce the code complexity that manages it).
You don't need to update your code to add a new Type or SubType, or to edit/delete them (see the sketch after this list).
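A sketch of what that lookup table could look like for the example schema (hypothetical table and column names), so queries can return human-readable results without consulting the configuration file:
CREATE TABLE QuestionSubType (
    Id   INT PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);
ALTER TABLE Question
    ADD CONSTRAINT fk_question_subtype
    FOREIGN KEY (SubTypeId) REFERENCES QuestionSubType(Id);
-- SubType text now comes from the database, not from a config file
SELECT q.Id, q.Text, st.Name AS SubTypeName
FROM   Question q
JOIN   QuestionSubType st ON st.Id = q.SubTypeId;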
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code, there may not be a strict TypeID-to-SubTypeID relation and that logic was handled in code rather than in the DB. Again, can be managed using 'flag' values or NULLs if possible. Those specific cases could be handled by designing the DB right and then working around a unique/odd situation in code instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs technical challenge: The original designer may not have had a proper understanding of databases and may have been a programmer who did that application (or was made to do it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.
As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and app to begin with (e.g., if the DB isn't going to be accessed by other clients) this is probably a minor concern, particularly if the set of SubTypeIDs is small and seldom changes.
I heard that decision tables in relational database have been researched a lot in academia. I also know that business rules engines use decision tables and that many BPMS use them as well.
I was wondering if people today use decision tables within their relational databases?
A decision table is a cluster of conditions and actions. A condition can be simple enough that you can represent it with a simple "match a column against this value" string. Or a condition could be hellishly complex. An action, similarly, could be as simple as "move this value to a column". Or the action could involve multiple parts or steps or -- well -- anything.
A CASE expression in a SELECT or WHERE clause is a decision table. This is the simplest example of a decision table "in" a relational database.
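As a trivial illustration (made-up column names): each WHEN branch is a condition and the value it yields is the action.
SELECT order_id,
       CASE
           WHEN order_total >= 100 THEN order_total * 0.10  -- condition -> action
           ELSE 0
       END AS discount
FROM   orders;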
You can have a "transformation" table with columns that have old-value and replacement-value. You can then write a small piece of code like the following.
def decision_table( aRow ):
    # 'connection' is assumed to be an already-open DB-API connection
    result = connection.execute(
        "SELECT replacement_value FROM transformation WHERE old_value = ?",
        aRow['somecolumn'] )
    replacement = result.fetchone()
    aRow['anotherColumn'] = replacement['replacement_value']
Each row of the decision table has a "match this old_value" and "move this replacement_value" kind of definition.
The "condition" parts of a decision table have to be evaluated somewhere. Your application is where this will happen. You will fetch the condition values from the database. You'll use those values in some function(s) to see if the rule is true.
The "action" parts of a decision table have to be executed somewhere; again, your application does stuff. You'll fetch action values from the database. You'll use those values to insert, update or delete other values.
Decision tables are used all the time; they've always been around in relational databases. Each table requires a highly customized data model. It also requires a unique condition function and action procedure.
It doesn't generalize well. If you want, you could store XML in the database and invoke some rules engine to interpret and execute the BPEL rules. In this case, the rules engine does the condition and action processing.
If you want, you could store Python (or Tcl or something else) in the database. Seriously. You'd write the conditions and actions in Python. You'd fetch it from the database and run the Python code fragment.
Lots of choices. None of them "academic". Indeed, the basic condition-action stuff is done all the time.
Whether or not to put decision tables in a database depends on a number of other questions.
Will your conditions be calculated inside the RDBMS or elsewhere? If the data used for evaluating these conditions is available there, and a suitable method for evaluating them inside the RDBMS can be devised, it is probably a good idea. Maybe your actions also happen inside your database, which would make it even more attractive.
Your conditions, and even the execution of your actions, might live outside the RDBMS, but you could still keep the connections between combinations of conditions and actions on the inside. Probably because most of your other data is there, and all you have is a web server sitting on top of it.
I can think of two ways to model this, depending on how many conditions you have (and whether they are binary), and what the capacity for columns per table is.
Let's say you have 6 binary conditions; this means you have 2^6 = 64 possible combinations. Then you could have one column for every combination, and one row for every action.
Or you could have 16 conditions, which means a huge number of combinations (actually 65536) and thus a ridiculous number of columns. Better, then, to have a column for each condition and a column for each action, and 65536 rows describing what to do in each possible situation. Each row would represent a situation and what to do in that situation. The only datatype you use would be bool. You could also pack these bools into bitmasked integers.
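A sketch of that second shape (hypothetical names; a column per condition, a column for the action, one row per situation), shown with only three conditions to keep it short:
CREATE TABLE decision_table (
    cond1  BOOLEAN NOT NULL,
    cond2  BOOLEAN NOT NULL,
    cond3  BOOLEAN NOT NULL,
    action VARCHAR(100) NOT NULL,
    PRIMARY KEY (cond1, cond2, cond3)   -- exactly one row per situation
);
-- look up what to do in the current situation
SELECT action
FROM   decision_table
WHERE  cond1 = TRUE AND cond2 = FALSE AND cond3 = TRUE;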
Actually, bigger decision tables are best avoided. Divide and rule: using more tables is a much better approach. A subject matter expert will usually get tired if asked to give opinions on too many conditions.
The strength of the decision table is really in the modelling stage where the developer and the subject matter expert can find out if every possible situation is mapped, and no blind spots can exist.
I would look into using an object database rather than a traditional RDBMS (Relational Database Management System). Object databases are designed to be fast at handling hierarchical relationships between objects, whereas in an RDBMS you have to represent these relationships across multiple table rows, or even multiple tables, so your queries (tree traversals) will be slow.