If you discover a fact after the fact, how do you Datomic?

My understanding of EAVT is that T has to be when the fact is inserted into Datomic. Often in my work facts can be inserted into the system months after they occurred. Clearly I can add an "at" attribute to my schema, but this seems to defeat much of the value of Datomic. Are there patterns or techniques for conveniently handling this temporal disconnect?
The main problem I want to avoid is:
t=1: I receive a fact that at t=0 x=5
t=3: I receive a fact that at t=2 x=6
t=5: I receive a fact that at t=4 x=7
What was x at t=2.5?
To answer this question I think I have to query the entire history of x and traverse a custom "at" field, or do some sort of binary as-of search. Neither seems very appealing.

In principle, the :db/txInstant is a record of when a fact became known to the system. If the fact that became known was "when some event happened," I see no problem with adding an attribute for that knowledge, e.g. :person/birthday or :historical-event/date-time.
The only time I avoid adding date attributes is when "when this became known to the system" and "when this occurred" are the same by definition. For example, "when did the user create this todo item" can be defined as the :db/txInstant for when the todo item entered the database.

Tim Pote is correct. The distinction between domain time (when something happened in the 'real world') and system time (when Datomic found out about it) is an important one, and modeling domain time explicitly is definitely a good approach when it may diverge from the system time.
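To make that concrete, here is a minimal sketch of the explicit-domain-time pattern (the :reading/* attribute names are hypothetical, not anything Datomic defines): each observation is its own entity carrying both the value and the domain time, so "what was x at t=2.5?" becomes an ordinary query, with no history walk and no binary as-of search.

    ;; Sketch only; :reading/value and :reading/at are made-up attributes.
    ;; :reading/at holds domain time (when the fact was true in the world),
    ;; independent of :db/txInstant (when Datomic learned of it).
    (require '[datomic.api :as d])

    (defn value-at
      "Value as of domain time t: the observation with the latest
       :reading/at that is <= t."
      [db t]
      (->> (d/q '[:find ?at ?v
                  :in $ ?t
                  :where
                  [?e :reading/at ?at]
                  [?e :reading/value ?v]
                  [(<= ?at ?t)]]
                db t)
           (sort-by first)
           last
           second))

With the question's timeline, the transaction at t=3 asserts an entity with :reading/at of t=2 and :reading/value 6, so calling value-at with the instant for t=2.5 answers 6 no matter how late the fact arrived.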

Related

Is it a good practice to save "dependent value" in the database?

We are developing a game that has heroes and soldiers. All of them can level up, but different algorithms are used to compute their attributes for the next level.
For heroes, the attributes are fixed values configured by designers. That is, every level-3 mage has exactly the same attributes.
For other soldiers, the attributes are generated by a formula in a script, with some random noise. That is, two rabbits of the same level are not identical.
During development we implemented the soldier part first. Since soldier attribute values are generated, we store each soldier's attributes in the database. As development went on, we did the same for heroes. But is that good practice? We can get all of a hero's attributes from the designers' XML file as long as level and class are provided, so saving all attributes to the database seems like a kind of repetition, and we are taught to avoid repetition, aren't we?
What's more, the hero unlocks some bag grids every 5 levels as it levels up; is it good to save that unlock information to the database?
A good rule of thumb from experience: if you store a value in the database, it should take priority over any default values. If that is the case, the design is correct. If not, it's a redundancy, which should be avoided. How would future developers react if they noticed that the values in the game differ from those stored in the database? The first expected reaction would be: oh, there's an error, we need to fix it.
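Applying that rule here, a sketch with invented table and column names: the soldiers' rolled values are facts in their own right and must be stored, while the heroes' values are a pure function of (class, level) in the designers' XML, so stored copies would be exactly the kind of redundancy to avoid.

    -- Hypothetical schema sketch.
    -- Hero attributes are derivable from (class, level) via the design
    -- config, so only the reference is stored.
    CREATE TABLE hero (
        hero_id INTEGER     PRIMARY KEY,
        class   VARCHAR(20) NOT NULL,
        level   INTEGER     NOT NULL
        -- attack, defense, ... looked up from the designers' XML at runtime
    );

    -- Soldier attributes include random noise rolled at creation time,
    -- so the rolled values themselves must be persisted.
    CREATE TABLE soldier (
        soldier_id INTEGER     PRIMARY KEY,
        class      VARCHAR(20) NOT NULL,
        level      INTEGER     NOT NULL,
        attack     INTEGER     NOT NULL,
        defense    INTEGER     NOT NULL
    );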

One field or two to represent "current step" and "finished or not" state?

Sorry if this question is too silly or neurotic... But I can't figure it out by myself. So I want to see how others deal with it.
My question is:
I want to write a program that shows the progress of some long-running task, so I need to record which state it is currently in so that someone can check it at any time. There are two methods:
Use two fields to represent the progress state: step and is_finished.
Use just one field: step. For example, if the task needs 5 steps, then 6 means finished. (0 means not started?)
Comparing the two methods:
Two fields:
Seems clearer. And most importantly, logically speaking, step and finished-or-not are two separate concepts? I'm not sure about this.
When the task is finished, we set the is_finished field to true (or 1, as you like). But what do we do with the step field then? Increment it, or leave it untouched because it no longer has any meaning?
One field:
Simple and space-saving, but not very intuitive. For example, we can't tell what 6 really means just by looking at this field, because it may represent either the finished state or a middle step; it takes other information, e.g. the total number of steps, to decide. And this meaning is potentially unstable if the total number of steps changes (the is_finished field in the two-field method would not be affected by this).
So how would you deal with it? Thanks!
UPDATE:
I forgot some points that may be useful in the original post:
The story is: we provide a web-based service for customers (the service is time-limited, e.g. a 1-year term). After a customer purchases it, our deployment program prepares hardware (a virtual machine) and deploys some software, which takes some time to finish. We want to show the customer progress information, and when deployment is finished the customer should be informed.
Database design:
It needs a usage-state field to represent running normally, running but overdue (expired), and stopped. What confuses me is whether it should also include the not-yet-deployed and deploying states.
The progress info should include some other data, e.g. the start time, so we can tell how much time has elapsed since the start. But this info does not need to be persistent, because we won't care about it once deployment is finished. So I decided to store the progress info in a separate (temporary) table. Then I realized another field is needed in a more persistent table to tell whether things are done. Can we combine that into the usage-state field mentioned above?
I like the one-field approach better, for the following reasons:
(Assuming you want to search on steps) you can "cover" all steps using only one simple index.
Should you ever want to attach some additional information to each of the steps, the one-field approach can easily accommodate a FOREIGN KEY towards a new table containing that information.
Requires slightly less storage space. Storage is cheap these days, but that's not the point - caching and network performance is.
Two-field approach:
(Assuming you want to search on steps) might require a "fatter" composite index or even two indexes (which takes space, lowers the cache effectiveness and incurs maintenance cost for INSERT/UPDATE/DELETE operations).
Requires a CHECK to defend the database from "impossible" combinations (see the sketch after this list). Funny enough, some DBMSes don't enforce CHECKs (I'm looking at you, MySQL).
Requires slightly more storage space (and therefore slightly less of it fits into cache, takes up slightly more network bandwidth etc.).
NOTE: Should you choose to use NULLs, that could have "interesting" consequences under certain DBMSes (for example, Oracle doesn't index NULLs).
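A sketch of that CHECK, with invented names and a 5-step process assumed (and note that MySQL parsed but ignored CHECK constraints before version 8.0.16):

    -- Hypothetical two-field table; the CHECKs rule out "impossible"
    -- rows such as is_finished = TRUE while step < 5.
    CREATE TABLE progress_two_field (
        task_id     INTEGER PRIMARY KEY,
        step        INTEGER NOT NULL DEFAULT 0,
        is_finished BOOLEAN NOT NULL DEFAULT FALSE,
        CONSTRAINT chk_step_range CHECK (step BETWEEN 0 AND 5),
        CONSTRAINT chk_finished   CHECK (NOT is_finished OR step = 5)
    );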
For example, we don't know what 6 really means
That doesn't really matter, as long as the client application knows what it means.
Design the database for applications, not humans.
And potentially this meaning is not very stable if the total steps will change
True, but you have the same problem with the two-field approach as well, if a new step is added in the "middle" of existing steps.
Either UPDATE the table accordingly,
or never change the step values. For example, if the step 5 is the last one, then newly added step 6 is considered earlier despite having greater value - your application (or the additional table I mentioned) will know the order of steps, even if their values are not ordered. If you really want "order by value" without resorting to UPDATE, make the steps: 10, 20, 30 etc, so you can insert new steps in the gaps (the old BASIC line number trick).
It remains a matter of taste, but I would suggest the second option: a single int field, step. On inserting a new record, initialize the value of step to 0, which would indicate "not started yet". Any positive integer value would obviously denote the current step. As soon as the trajectory is completed, I would set step to NULL. As you correctly stated, this method does require solid documentation, but I think it is not too confusing.
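A sketch of that convention with invented names (0 = not started, a positive value = the current step, NULL = finished):

    -- Hypothetical single-field variant.
    CREATE TABLE progress_one_field (
        task_id INTEGER PRIMARY KEY,
        step    INTEGER DEFAULT 0    -- 0 = not started; NULL once finished
    );

    -- All unfinished tasks, most advanced first. (Per the NOTE above,
    -- NULL behavior varies by DBMS.)
    SELECT task_id, step
    FROM   progress_one_field
    WHERE  step IS NOT NULL
    ORDER  BY step DESC;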

Low normal form database design on a relational DBMS

A friend of mine is creating an application after having designed the database herself. However, she just put everything in two 1NF tables, which have a few functional dependencies on non-PK columns. Indeed, she added the PKs only after I suggested it. A 3rd+ NF design of her system might require 7 to 15 tables.
I pointed out that there would be a couple of possible update anomalies. However, she said she has stored procedures for insertions and updates, so the update anomalies will never happen, since she will force the users/applications to insert/update the data via the stored procedures.
Is there any other good reason to persuade her to design the system with higher normal form? Or is her solution good enough practically?
She is probably unskilled and unaware of it. It has already been demonstrated at large that there is no way to help such people.
She probably thinks that she can get all that code in those SP's right, without making a single mistake. That is more than just likely a ludicrous overestimation of her own abilities.
I advise you to let her simply make her mistakes. For YOU, it will save the time of trying and the frustration of not succeeding in getting the message through, and for HER, it will give her the opportunity to learn from her own mistakes (the only option left, since she's apparently not really willing to learn from other people's mistakes), and it will save her the frustration of having to undergo your later coming 'round with that glorious "I told you so.".
But if you really insist on giving it a try, then you might point out to her that:
her 1NF tables are actually views (materialized views at that) on the 3/BC NF tables that she is willingly not implementing.
the updates that her end-user probably expects to be doing, are most likely, however, updates to those 7-15 3/BC NF tables. (I say this because you indicated as her sole reason for the lower NF that she wants to avoid such a "huge" number of tables - the cynicism in "huge" is intentional.)
but the updates that the DBMS is expecting, are updates to the 1NF tables (which are views).
Therefore, her problem boils down to "distilling the appropriate updates to 3NF tables from updates that are specified as updates to views on those tables".
Therefore, getting all her SP code to be correct, is a problem of how to do view updating.
And guess what, after more than 40 years of research on the relational model, by thousands of researchers whose intellectual abilities exceed hers probably by orders of magnitude, it is exactly this problem of view updating that still stands unsolved ... (although granted, her problem is not "view updating in general", but "view updating in some given specific situation". Nonetheless, I very much doubt that she has the skill to spot all the intricacies involved.)
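For what it's worth, the update anomalies the question mentions are easy to make concrete with an invented example; every stored procedure then carries the burden of keeping each duplicated fact in sync forever:

    -- Hypothetical wide 1NF table: customer_name depends on customer_id
    -- (a non-PK column), so it is repeated on every order row.
    CREATE TABLE orders_1nf (
        order_id      INTEGER       PRIMARY KEY,
        customer_id   INTEGER       NOT NULL,
        customer_name VARCHAR(50)   NOT NULL,  -- FD: customer_id -> customer_name
        amount        DECIMAL(10,2) NOT NULL
    );

    -- Renaming one customer means updating every copy; any code path that
    -- skips this UPDATE leaves the table contradicting itself.
    UPDATE orders_1nf
    SET    customer_name = 'Acme Ltd'
    WHERE  customer_id = 42;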

Alternatives to Entity-Attribute-Value (EAV)?

Our database is designed based on the EAV (Entity-Attribute-Value) model. Those who have worked with EAV models know all the crap that comes with it for the sake of flexibility.
I asked my client about the reasons for using the EAV model (flexibility), and their response was: their entities change over time. So today they may have a table with a few attributes, but in a month's time a few new attributes may be added, or an existing attribute may be renamed. They need to produce reports that go back to any point in time and query the data based on the shape of the entities at that point.
I understand this is not feasible with a conventional relational model, but I personally see EAV as an anti-pattern. Are there any alternative models that enable us to capture the time dimension in changes to the entities and instances?
Cheers,
Mosh
There is a difference between EAV done faithfully and EAV done badly, just as there is between 5NF done by skilled people and 5NF done by the clueless.
Sixth Normal Form is the Irreducible Normal Form (no further Normalisation is possible). It eliminates many of the problems that are common, such as the Null Problem, and provides the ultimate method of identifying missing values. It is the academically and technically robust NF. There are no products to support it, and it is not commonly used. To be implemented properly and consistently, it requires a metadata catalogue to be implemented. Of course, the SQL required to navigate it becomes even more cumbersome (SQL already being cumbersome re joins), but this is easily overcome by automating the production of the SQL from the metadata.
EAV is a partial set or a subset of 6NF. The problem is, usually it is done for a purpose (to allow columns to be added without having to make DDL changes), and by people who are not aware of the 6NF, and who do not implement metadata. The point is, 6NF and EAV as principles and concepts offer substantial benefits, and performance increases; but commonly it is not implemented properly, and the benefits are not realised. Quite a few EAV implementations are disasters, not because EAV is bad, but because the implementation is poor.
Eg. Some people think that the SQL required to construct the 3NF rows from the 6NF/EAV database is complex: no, it is cumbersome but not complex. More important, an ordinary SQL VIEW can be provided, so that all users and report tools see only the straight 3NF VIEW, and the 6NF/EAV issues are transparent to them. Last, the SQL required can be automated, so the labour cost that many people endure is quite unnecessary.
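A minimal sketch of that VIEW, with invented tables: one narrow table per attribute (the 6NF shape), re-assembled into a familiar row so that users and report tools never see the underlying structure.

    -- Hypothetical 6NF-style storage: one table per attribute.
    CREATE TABLE customer       (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE customer_name  (customer_id INTEGER PRIMARY KEY
                                     REFERENCES customer (customer_id),
                                 name        VARCHAR(50) NOT NULL);
    CREATE TABLE customer_email (customer_id INTEGER PRIMARY KEY
                                     REFERENCES customer (customer_id),
                                 email       VARCHAR(50) NOT NULL);

    -- The straight 3NF face shown to users and report tools.
    CREATE VIEW customer_3nf AS
    SELECT c.customer_id, n.name, e.email
    FROM   customer c
    LEFT JOIN customer_name  n ON n.customer_id = c.customer_id
    LEFT JOIN customer_email e ON e.customer_id = c.customer_id;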
So the answer really is, Sixth Normal Form, being the father of EAV, and a purer form, is the replacement for it. The Caveat is, ensure it is done properly. I have one large 6NF db, and it suffers none of the problems people post about, it performs beautifully, the customer is very happy (no further work is a sign of complete functional satisfaction).
I have already posted a very detailed answer to another question which applies to your question as well, which you may be interested in.
Other EAV Question
Regardless of the kind of relational model you use, tracking field name changes requires a lot of metadata which you must keep track of in either transaction logs or audit tables. Unfortunately, querying either of those for the state at a particular date is very complicated. If your client only requires the state at a particular date, however, meaning the entire state, not just with respect to name changes, you can duplicate the database, roll back the transaction log to the particular time required, and run your queries on the new instance. If entities added after the specified date need to show up in the query with the old field names, however, you have a very large engineering problem ahead of you. In that case, with the information you provided in your question, I would suggest either negotiating alternatives with the client or getting more information about the use of the reports to find alternative solutions.
You could move to a document based datastore, but that still wouldn't solve the problem in the second case. Sorry this isn't really an answer, but having worked through similar situations, the client likely needs a more realistic reporting solution or a number of other investors willing to front the capital for the engineering.
When this problem came up for us, we kept the db schema constant and implemented an entity mapping factory based on a timestamp. In the end, the client continually changed requirements (on a weekly to monthly basis) as to how aggregate fields were calculated and were never fully satisfied.
To add to the answers from @NickLarsen and @PerformanceDBA:
If you need to track historical changes to things like field name, you may want to look into something like Slowly Changing Dimensions. It appears to me like you are using the EAV to model dynamic dimensional models (probably lookup lists).
The simplest (and probably least efficient) way of achieving this would be to include an "as of" date field on the EAV tables, and whenever a change occurs, insert a new record (instead of updating an existing record) with the current date. This means that you need to alter your queries to always include or look for an "as of" date, or default to "now" if none is provided. Your base entity that joins to the EAV objects would then have to query "top 1" from the EAV table where the "as of" date is less than or equal to the 'last updated' date of the row, ordered by "as of" descending. Worst case scenario, if you need to track the most recent change to a given row where both the name (stored in the 'attribute' table) and the value have changed, you would chain this logic to the value table using the 'last modified' date of the row to find the appropriate value for that particular date.
This obviously has the potential to generate LARGE amounts of data if there are a lot of changes. That's why this approach is referred to as "slowly" changing. It's intended for dimensional values that may change, but not very often. To help with query performance, indexes on the "as of" and "last modified" fields should help.
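In SQL the lookup just described might read as follows (invented names; the row-limiting clause varies by DBMS, e.g. LIMIT or TOP instead of FETCH FIRST):

    -- One attribute of one entity as it stood on the target date:
    -- the newest row whose as_of does not exceed that date.
    SELECT value
    FROM   entity_attribute_value
    WHERE  entity_id    = :entity_id
    AND    attribute_id = :attribute_id
    AND    as_of       <= :target_date
    ORDER  BY as_of DESC
    FETCH FIRST 1 ROW ONLY;

    -- The supporting index suggested above.
    CREATE INDEX ix_eav_as_of
        ON entity_attribute_value (entity_id, attribute_id, as_of);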
If your client needs such flexibility, then a relational database might not be the right match.
Consider MongoDB where JSON structures are stored. You can add or not add fields without limitations. You can even use nesting.
Create a new table for each version of the entity definition,
and one additional table that tells you which table corresponds to which version.
The query system should be updated as well.
I think creating a script that generates the tables and queries is your best shot.
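A sketch of the bookkeeping that approach needs (all names invented): the generator script records every physical table it creates, so queries and reports can locate the shape that was in force at a given date.

    -- Hypothetical catalog mapping entity-definition versions to the
    -- generated tables that hold their rows.
    CREATE TABLE entity_version_catalog (
        entity_name VARCHAR(50) NOT NULL,
        version     INTEGER     NOT NULL,
        table_name  VARCHAR(63) NOT NULL,   -- e.g. 'customer_v3'
        valid_from  DATE        NOT NULL,
        PRIMARY KEY (entity_name, version)
    );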

Best Practices: Storing a workflow state of an item in a database? [closed]

I have a question about best practices regarding how one should approach storing complex workflow states for processing tasks in a database. I've been looking online to no avail, so I figured I'd ask the community what they thought was best.
This question comes out of the same "BoxItem" example I gave in a prior question. This "BoxItem" is being tracked in my system as various tasks are performed on it. The task may take place over several days and with human interaction, so the state of the BoxItem must be persisted. Who did the task (if applicable), and when the task was done must also be tracked.
At first, I approached this by adding three fields to the "BoxItems" table for every human-interactive task that needed to be done:
IsTaskNameComplete
DateTaskNameComplete
UserTaskNameComplete
This worked when the workflow was simple... but now that it has grown into a complex process (> 10 possible human interactions in the flow... about half of which are optional and may or may not be done for the BoxItem, which resulted in me beginning to add "DoTaskName" fields as well for those optional tasks), I've found that what should've been a simple table now has 40 or so fields devoted entirely to retaining this state information.
I find myself asking if there isn't a better way to do it... but I'm at a loss.
My first thought was to make a generic "BoxItemTasks" table which defined the tasks that may be done on a given box, but I still would need to save the Date and User information individually, so it didn't really help.
My second thought was that perhaps it didn't matter, and I shouldn't worry if this table has 40 or more fields devoted to state retaining... and maybe I'm just being paranoid. But it feels like that's a lot of information to retain.
Anyways, I'm at a loss as far as what a third option might be, or if one of the two options above is actually reasonable. I can see this workflow potentially getting even more complex in the future, and for each new task I'm going to need to add 3-4 fields just to support the tracking of it... it feels like it's spiraling out of control.
What would you do in this situation?
I should note that this is maintenance of an existing system, one that was built without an ORM, so I can't just leave it up to the ORM to take care of it.
EDIT:
Kev, are you talking about doing something like this:
BoxItems
    (PK) BoxItemID
    (Other irrelevant stuff)

BoxItemActions
    (PK) BoxItemID
    (PK) BoxItemTaskID
    IsCompleted
    DateCompleted
    UserCompleted

BoxItemTasks
    (PK) TaskType
    Description (if even necessary)
Hmm... that would work... it would represent a need to change how I currently approach doing SQL Queries to see which items are in what state, but in the long term something like this looks like it would work better (without having to make a fundamental design change like the Serialization idea represents... though if I had the time, I'd like to do it that way I think.).
So is this what you were mentioning, Kev, or am I off on it?
EDIT: Ah, I see your idea as well with the "Last Action" to determine the current state... I like it! I think that might work for me... I might have to change it up a little bit (because at some point tasks happen concurrently), but the idea seems like a good one!
EDIT FINAL: So in summation, if anyone else is looking this up in the future with the same question... it sounds like the serialization approach would be useful if your system has the information pre-loaded into some interface where it's queryable (i.e. not directly calling the database itself, as the ad-hoc system I'm working on does), but if you don't have that, the additional tables idea seems like it should work well! Thank you all for your responses!
If I'm understanding correctly, I would add the BoxItemTasks table (just an enumeration table, right?), then a BoxItemActions table with foreign keys to BoxItems and to BoxItemTasks for what type of task it is. If you want to make it so that a particular task can only be performed once on a particular box item, just make the (Items + Tasks) pair of columns be the primary key of BoxItemActions.
(You laid it out much better than I did, and kudos for correctly interpreting what I was saying. What you wrote is exactly what I was picturing.)
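In DDL form, the layout being agreed on here might look like the following sketch (column types are guesses; the table and column names come from the posts above):

    CREATE TABLE BoxItemTasks (
        TaskType    INTEGER PRIMARY KEY,
        Description VARCHAR(100)            -- if even necessary
    );

    CREATE TABLE BoxItemActions (
        BoxItemID     INTEGER   NOT NULL REFERENCES BoxItems (BoxItemID),
        BoxItemTaskID INTEGER   NOT NULL REFERENCES BoxItemTasks (TaskType),
        IsCompleted   BOOLEAN   NOT NULL DEFAULT FALSE,
        DateCompleted TIMESTAMP,
        UserCompleted VARCHAR(50),
        -- the (item, task) pair as PK: each task at most once per box item
        PRIMARY KEY (BoxItemID, BoxItemTaskID)
    );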
As for determining the current state, you could write a trigger on BoxItemActions that updates a single column BoxItems.LastAction. For concurrent actions, your trigger could just have special cases to decide which action takes recency.
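A PostgreSQL-flavored sketch of that trigger (PostgreSQL 11+ syntax; other DBMSes differ, and the concurrent-action special cases are left as a stub):

    CREATE OR REPLACE FUNCTION set_last_action() RETURNS trigger AS $$
    BEGIN
        -- Special cases deciding which concurrent action "wins" go here.
        UPDATE BoxItems
        SET    LastAction = NEW.BoxItemTaskID
        WHERE  BoxItemID  = NEW.BoxItemID;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_box_item_last_action
    AFTER INSERT ON BoxItemActions
    FOR EACH ROW EXECUTE FUNCTION set_last_action();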
As the previous answer suggested, I would break your table into several.
BoxItemActions, containing the list of actions that the workflow needs to go through, created each time a BoxItem is created. In this table, you can track the detailed dates/times/users of when each task was completed.
With this type of application, knowing where the Box is to go next can get quite tricky, so having a 'map' of the remaining steps for the Box will prove quite helpful. As well, this table can grow like crazy, hundreds of rows per box, and it will still be very easy to query.
It also makes it possible to have 'different paths' that can easily be changed. A master data table of 'paths' through the workflow is one solution, where as each box is created, the user has to select which 'path' the box will follow. Or you could set it up so that when the user creates the box, they select the tasks required for this particular box. It depends on your business problem.
How about a hybrid of the serialization and database models? Have an XML document that serves as your master workflow document, containing a node for each step, with attributes and elements that detail its name, order in the process, conditions for whether it's optional or not, etc. Most importantly, each step node can have a unique step id.
Then in your database you have a simple two table structure. The BoxItems table stores your basic BoxItem data. Then a BoxItemActions table much like in the solution you marked as the answer.
It's essentially similar to the solution accepted as the answer, but instead of a BoxItemTasks table to store the master list of tasks, you use an XML document that allows for some more flexibility for the actual workflow definition.
For what it's worth, in BizTalk they "dehydrate" long-running message patterns (workflows and the like) by binary serializing them to the database.
I think I would serialize the Workflow object to XML and store in the database with an ID column. It may be more difficult to report on, but it sounds like it may work in your case.
