I'm trying to figure out how to implement this relationship in ColdFusion. Also, if anyone knows the name for this kind of relationship, I'd be curious to know it.
I'm trying to create the brown table.
Recreating the table from the values is not the problem; the problem I've been stuck on for a couple of days now is how to create an editing environment.
I'm thinking that I should have a table with all the Tenants and TenantValues (the TenantValues that match the TenantID I'm editing), including the empty values as well (the green table).
Any other suggestions?
The name of this relationship is Entity-Attribute-Value (EAV). In your case Tenant, TenantVariable, and TenantValues are the entity, attribute, and value tables, respectively. EAV is an attempt to allow for the runtime definition of entities and, in my experience, is most often found backing content management systems. It has been referred to as an anti-pattern database model because you lose certain RDBMS advantages while gaining disadvantages such as having to lock several tables on delete or save. Often a suitable persistence alternative is a NoSQL solution such as CouchDB.
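For reference, a minimal sketch of those three tables in SQL (column names are illustrative, not taken from your schema):

-- entity table: one row per tenant
CREATE TABLE Tenant (
    TenantID INT PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL
);

-- attribute table: one row per runtime-defined variable
CREATE TABLE TenantVariable (
    VariableID INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL
);

-- value table: one row per (entity, attribute) pair that has a value
CREATE TABLE TenantValues (
    TenantID   INT NOT NULL REFERENCES Tenant (TenantID),
    VariableID INT NOT NULL REFERENCES TenantVariable (VariableID),
    Value      VARCHAR(255),
    PRIMARY KEY (TenantID, VariableID)
);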
As for edits, the paradigm I typically see is deleting all the value records for a given ID, inserting the new ones inside a loop, and then updating the entity table record. Do this inside of a transaction to ensure consistency. The upside of this approach is that it's much easier to implement than a delta-detection algorithm. Another option is the MERGE statement, if your database supports it.
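A sketch of that pattern in SQL (the @-prefixed names stand for query parameters, e.g. cfqueryparam values; table names are assumed):

BEGIN TRANSACTION;

-- wipe all existing values for the tenant being edited
DELETE FROM TenantValues WHERE TenantID = @TenantID;

-- re-insert one row per submitted variable
-- (in ColdFusion this INSERT runs inside a loop over the form fields)
INSERT INTO TenantValues (TenantID, VariableID, Value)
VALUES (@TenantID, @VariableID, @Value);

-- finally, update the entity record itself
UPDATE Tenant SET Name = @Name WHERE TenantID = @TenantID;

COMMIT TRANSACTION;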
You may want to consider an RDF Triple Store for this problem. It's an alternative to Relational DBs that's particularly good for sparse categorical data. The data is represented as triples - directed graph edges consisting of a subject, an object, and the predicate that describes the property connecting them:
(subject) (predicate) (object)
Some example triples from your data set would look something like:
<Apple> rdf:type <Red_Fruit>
<Apple> hasWeight "1"^^xsd:integer
RDF triple stores provide the SPARQL query language to retrieve data from your store much like you would use SQL.
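If it helps to map this onto familiar ground: a triple store is conceptually close to a single three-column table, and a SPARQL query corresponds to self-joins over it. A hypothetical SQL analogue of the triples above:

CREATE TABLE Triples (
    Subject   VARCHAR(100),
    Predicate VARCHAR(100),
    Object    VARCHAR(100)
);

-- "find the weight of every red fruit": two triple patterns
-- sharing a subject become a self-join
SELECT t2.Subject, t2.Object AS Weight
FROM Triples t1
JOIN Triples t2 ON t2.Subject = t1.Subject
WHERE t1.Predicate = 'rdf:type'  AND t1.Object = 'Red_Fruit'
  AND t2.Predicate = 'hasWeight';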
I have an application that requires a database containing a set of products where each product can have a set of tables. The end-user should be able to add new products and define new tables for a product. So each table has a set of columns that are specified by the user. The user can then fill the tables with rows of data. Each table belongs to exactly one product.
The end-user should also be able to view the tables as they were at a specific point in time (at a certain transaction).
How would I go about making a schema for this in Datomic so that querying it would be as efficient as possible?
I would go with 4 entity types: products, tables, columns, and rows.
The relationship between products and tables is best handled by a :table/product to-one ref attribute, but a :product/tables to-many component ref attribute could also work (the latter does not enforce the one-to-many relationship).
Likewise, I would use either a :column/table or :table/columns attribute. I would also have a :column/name string attribute and maybe a :column/type enumerated attribute.
The hardest part is to model rows.
One tempting solution is to just create an attribute per column. I actually think that's a bad idea: Datomic attributes are not intended for such dynamic use. In particular, schema attributes are stored in a cache on the Peer that's not meant to grow big. (I may be wrong about this, so it'd be nice if someone on the Datomic team could confirm.)
Instead, I would have a few dozen reusable :row/cell-0, :row/cell-1, :row/cell-2, etc. 'cell position' attributes that are shared across all tables. Each actual column would be mapped to a position at creation time by a to-one :column/position attribute.
If the rows can have several data types, it's a bit more difficult; you'd basically have to make an attribute for each (type, position) pair.
Each row then basically consists of a :row/table attribute and the cell position attributes above.
Here's a Datalog query that would let you read the whole table:
[:find ?row ?column-name ?val
 :in $ ?table
 :where
 [?column :column/table ?table]
 [?row :row/table ?table]
 [?row ?pos ?val]
 [?column :column/position ?pos]
 [?column :column/name ?column-name]]
Note that all of the above is only useful if you want to query the table with Datalog directly against your Datomic db. It can also be completely fine to serialize your tables and store them as blobs, especially if they're small; later, you pull out the blob and deserialize it, and then you can query it with Datalog too. And if tables are too coarse for this use, maybe you can do it with rows.
I've been looking into Datomic, and it looks really interesting. But while there seems to be very good information on how Datomic works technically, I have not seen much on how one should think about data modeling.
What are some best practices for data modeling in Datomic? Are there any good resources on the subject?
Caveat Lector
As Datomic is new and my experience with it is limited, this answer shouldn't be considered best practices in any way. Take this instead as an intro to Datomic for those with a relational background and a hankering for a more productive data store.
Getting Started
In Datomic, you model your domain data as Entities that possess Values for Attributes. Because a reference to another Entity can be the Value of an Attribute, you can model Relationships between Entities simply.
At first look, this isn't all that different from the way data is modeled in a traditional relational database. In SQL, table rows are Entities and a table's columns name Attributes that have Values. A Relationship is represented by a foreign key Value in one table row referencing the primary key Value of another table row.
This similarity is nice because you can just sketch out your traditional ER diagrams when modeling your domain. You can rely on relationships just like you would in a SQL database, but don't need to mess around with foreign keys since that's handled for you. Writes in Datomic are transactional and your reads are consistent. So you can separate your data into entities at whatever granularity feels right, relying on joins to provide the bigger picture. That's a convenience you lose with many NoSQL stores, where it's common to have BIG, denormalized entities to achieve some useful level of atomicity during updates.
At this point, you're off to a good start. But Datomic is much more flexible than a SQL database.
Taking Advantage
Time is inherently part of all Datomic data, so there is no need to specifically include the history of your data as part of your data model. This is probably the most talked about aspect of Datomic.
In Datomic, your schema is not rigidly defined in the "rectangular shape" required by SQL. That is, an entity[1] can have whatever attributes it needs to satisfy your model. An entity need not have NULL or default values for attributes that don't apply to it. And you can add attributes to a particular, individual entity as you see fit.
So you can change the shape of individual entities over the course of time to be responsive to changes in your domain (or changes to your understanding of the domain). So what? This is not unlike Document Stores like MongoDB and CouchDB.
The difference is that with Datomic you can enact schema changes atomically over all affected entities. Meaning that you can issue a transaction to update the shape of all entities, based upon arbitrary domain logic, written in your language[2], that will execute without affecting readers until committed. I'm not aware of anything close to this sort of power in either the relational or document store spaces.
Your entities are not rigidly defined as "living in a single table" either. You decide what defines the "type" of an entity in Datomic. You could choose to be explicit and mandate that every entity in your model will have a :table attribute that connotes what "type" it is. Or your entities can conform to any number of "types" simply by satisfying the attribute requirements of each type.
For example, your model could mandate that:
A Person requires attributes :name, :ssn, :dob
An Employee requires :name, :title, :salary
A Resident requires :name, :address
A Member requires :id, :plan, :expiration
Which means an entity like me:
{:name "Brian" :ssn 123-45-6789 :dob 1976-09-15
:address "400 South State St, Chicago, IL 60605"
:id 42 :plan "Basic" :expiration 2012-05-01}
can be inferred to be a Person, a Resident and a Member but NOT an Employee.
Datomic queries are expressed in Datalog and can incorporate rules expressed in your own language, referencing data and resources that are not stored in Datomic. You can store Database Functions as first-class values inside of Datomic. These resemble Stored Procedures in SQL, but can be manipulated as values inside of a transaction and are also written in your language. Both of these features let you express queries and updates in a more domain-centric way.
Finally, the impedance mismatch between the OO and relational worlds has always frustrated me. Using a functional, data-centric language (Clojure) helps with that, but Datomic looks to provide a durable data store that doesn't require mental gymnastics to bridge from code to storage.
As an example, an entity fetched from Datomic looks and acts like a Clojure (or Java) map. It can be passed up to higher levels of an application without translation into an object instance or general data structure. Traversing that entity's relationships will fetch the related entities from Datomic lazily, with the guarantee that they will be consistent with the original query, even in the face of concurrent updates. And those entities will appear to be plain old maps nested inside the first entity.
This makes data modeling more natural and much, much less of a fight in my opinion.
Potential Pitfalls
Conflicting attributes
The example above illustrates a potential pitfall in your model. What if you later decide that :id is also an attribute of an Employee? The solution is to organize your attributes into namespaces. So you would have both :member/id and :employee/id. Doing this ahead of time helps avoid conflict later on.
An attribute's definition can't be changed (yet)
Once you've defined an attribute in your Datomic database as a particular type (indexed or not, unique, etc.), you can't change that later. We're talking ALTER TABLE ALTER COLUMN in SQL parlance here. For now, you can create a replacement attribute with the right definition and move your existing data over.
This may sound terrible, but it's not. Because transactions are serialized, you can submit one that creates the new attribute, copies your data to it, resolves conflicts, and removes the old attribute. It will run without interference from other transactions and can take advantage of domain-specific logic in your native language to do its thing. It's essentially what an RDBMS is doing behind the scenes when you issue an ALTER TABLE, except you define the rules.
Don't be "a kid in a candy store"
Flexible schema doesn't mean no data model. I'd advise some upfront planning to model things in a sane way, same as you would for any other data store. Leverage Datomic's flexibility down the road when you have to, not just because you can.
Avoid storing large, constantly changing data
Datomic isn't a good data store for BLOBs or very large data that changes constantly, because it keeps a historical record of previous values and there isn't a method to purge older versions (yet). This kind of thing is almost always a better fit for an object store like S3. Update: there is now a way to disable history on a per-attribute basis. Update: there is now also a way to excise data; however, storing references to external objects rather than the objects themselves may still be the best approach to handling BLOBs. Compare this strategy with using byte arrays.
Resources
Datomic mailing list
IRC channel #datomic on Freenode
Notes
1. I mean entity in the row sense, not the table sense, which is more properly described as entity-type.
2. My understanding is that Java and Clojure are currently supported, but it is possible that other JVM languages could be supported in the future.
A very nice answer from bkirkbri. I want to make some additions:
If you store many entities of similar, but not equal "types" or schemas, use a type keyword in the schema, like
[:db/add #db/id[:db.part/user] :db/ident :article.type/animal]
[:db/add #db/id[:db.part/user] :db/ident :article.type/weapon]
[:db/add #db/id[:db.part/user] :db/ident :article.type/candy]
{:db/id #db/id[:db.part/db]
:db/ident :article/type
:db/valueType :db.type/ref
:db/cardinality :db.cardinality/one
:db/doc "The type of article"
:db.install/_attribute :db.part/db}
When you read them, get the entity ids from a query, use datomic.api/entity with each eid, and parse them by multimethods dispatching on type if necessary, since it's hard to write one good query for all attributes in a more complex schema.
After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
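Roughly, the construct described above looks like this (a sketch; SQL Server syntax assumed, names are illustrative):

-- lookup table holding the dropdown option strings
CREATE TABLE Primitives (
    PrimitiveId INT IDENTITY PRIMARY KEY,
    Value       VARCHAR(50) NOT NULL UNIQUE
);

-- table of record stores only the numeric identifiers
CREATE TABLE Records (
    RecordId       INT IDENTITY PRIMARY KEY,
    ReviewPeriodId INT NOT NULL REFERENCES Primitives (PrimitiveId)
    -- ...roughly 30 more columns like this one
);

-- the view translates raw IDs back into friendly text
CREATE VIEW RecordsFriendly AS
SELECT r.RecordId, p.Value AS ReviewPeriod
FROM Records r
JOIN Primitives p ON p.PrimitiveId = r.ReviewPeriodId;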
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
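In code terms, the save step would do something like this for each string property (a sketch; SQL Server syntax assumed):

DECLARE @Value VARCHAR(50) = 'Monthly';  -- one property value from the business object
DECLARE @Id INT;

-- look the string up; insert it if it's new
SELECT @Id = PrimitiveId FROM Primitives WHERE Value = @Value;
IF @Id IS NULL
BEGIN
    INSERT INTO Primitives (Value) VALUES (@Value);
    SET @Id = SCOPE_IDENTITY();
END;

-- @Id then gets stored in the table of record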
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgment to do what you think is best.
I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
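To make the distinction concrete, here are the two choices side by side (a sketch; the two CREATE TABLE statements are alternatives, not companions):

-- surrogate key: a meaningless id stands in for the text
CREATE TABLE ReviewPeriods (
    ReviewPeriodId INT PRIMARY KEY,
    Name           VARCHAR(20) NOT NULL UNIQUE
);
-- referencing tables store: ReviewPeriodId INT REFERENCES ReviewPeriods (ReviewPeriodId)

-- natural key: the text itself is the key
CREATE TABLE ReviewPeriods (
    Name VARCHAR(20) PRIMARY KEY  -- 'Monthly', 'Semi-Annually', 'Yearly'
);
-- referencing tables store: ReviewPeriod VARCHAR(20) REFERENCES ReviewPeriods (Name)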
This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.
Please first read my previous question: T-SQL finding of exactly same values in referenced table
The main purpose of this question is to find out whether this approach to storing data is effective.
Maybe it would be better to get rid of the PropertyValues table and use an additional PropertyValues nvarchar(max) column in the Entities table instead. For example, instead of this table:
EntityId  PropertyId  PropertyValue
1         4           Val4
1         5           Val5
1         6           Val6
I could store the data in the PropertyValues column as "4:Val4;5:Val5;6:Val6".
As an alternative, I could store XML in the PropertyValues column.
What do you think is the best approach here?
[ADDED]
Please keep in mind:
The set of properties must be customizable
Objects will have dozens of properties (approximately 20 to 120), and the database will contain thousands of objects
[ADDED]
Data in the PropertyValues table will change very often. Actually, I store configured products. For example, an admin configures that clothes have the attributes "type", "size", "color", "buttons type", "label type", "label location", etc., and users then select values for these attributes in the system. So PropertyValues data cannot be effectively cached.
You will hate yourself later if you implement a solution using multi-value attributes (i.e. "4:Val4;5:Val5;6:Val6").
XML is marginally better because there are XQuery functions to help you pull out and parse the values. But the XML type is implemented as a CLR type in SQL Server and it can get extremely slow to work with.
The best solution to this problem is the one you already have. Use the sql_variant type for the column if it could hold any number of data types. Ideally you'd refactor this into multiple tables / entities so that the data type can be something more concrete.
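A sketch of that shape with sql_variant (a Properties table is assumed alongside your Entities table):

CREATE TABLE PropertyValues (
    EntityId      INT NOT NULL REFERENCES Entities (EntityId),
    PropertyId    INT NOT NULL REFERENCES Properties (PropertyId),
    PropertyValue SQL_VARIANT,   -- holds int, nvarchar, datetime, etc.
    PRIMARY KEY (EntityId, PropertyId)
);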
I work on a similar project (a web-shop generator). Every product has attributes and every attribute has a set of values, all in separate tables. On top of this there are translations in several languages, so there are additional tables for attribute and value translations.
Why did we choose this solution? Because every client needs a database with the same schema, and this schema is very elastic.
So what about this solution? As always, "it depends" -))
Storage. If a value is used often across different products (e.g. clothes, where the attribute "size" and its values repeat often), your attribute/value tables will stay small. If values are mostly unique rather than repeated (e.g. values for a "page count" attribute for books), you will end up with a rather big values table where every value is linked to a single product.
Speed. This schema is not the weakest part of the project, because this data changes rarely. And remember that you can always denormalize the schema into a DW-like solution, and use caching if the database part turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
So the answer to your question is not simple. If you need an elastic schema with unknown attributes and values, you should use separate tables. I suggest forgetting about storing values in CSV strings; it is better to store them as XML (typed and indexed).
UPDATE
I think that PropertyValues will not change often compared with user orders. But if in doubt, you can use denormalized tables or indexed views to speed things up. In any case, changing XML/CSV across a large number of rows will perform poorly, so the "separate table" solution looks good.
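For instance, an indexed view in SQL Server that materializes the joined rows of the separate-table design (a sketch; table names assumed):

CREATE VIEW dbo.EntityProperties
WITH SCHEMABINDING
AS
SELECT e.EntityId, p.PropertyId, v.PropertyValue
FROM dbo.Entities e
JOIN dbo.PropertyValues v ON v.EntityId = e.EntityId
JOIN dbo.Properties p ON p.PropertyId = v.PropertyId;
GO

-- the unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_EntityProperties
    ON dbo.EntityProperties (EntityId, PropertyId);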
The SQL Customer Advisory Team (CAT) has a whitepaper written just for you: Best Practices for Semantic Data Modeling for Performance and Scalability. It goes through the common pitfalls of EAV modeling and recommends how to design a scalable EAV solution.
We are working on a mapping application that uses the Google Maps API to display points on a map. All points are currently fetched from a MySQL database (holding some 5M+ records). Currently all entities are stored in separate tables with attributes representing individual properties.
This presents the following problems:
1. Every time there's a new property we have to make changes in the database, the application code, and the front-end. This is all fine, but some properties have to be added for all entities, and that's when it becomes a nightmare to go through 50+ different tables and add new properties.
2. There's no way to find all entities which share a given property, e.g. no way to find all schools/colleges or universities that have a geography dept (without querying schools, unis, and colleges separately).
3. Removing a property is equally painful.
4. No standards for defining properties in individual tables; the same property can exist with a different name or data type in another table.
5. No way to link or group points based on their properties (somewhat related to point 2).
We are thinking of redesigning the whole database, but without a DBA's help and lacking professional DB design experience we are really struggling.
Another problem we're facing with the new design is that there are a lot of shared attributes/properties between entities.
For example:
An entity called "university" has 100+ attributes. Other entities (e.g. hospitals, banks, etc.) share quite a few attributes with universities, for example ATM machines, parking, cafeteria, etc.
We don't really want to have properties in a separate table (and then link them back to entities with foreign keys) as it will require adding/removing them manually. Also, generalizing properties will result in groups containing 50+ attributes, and not all records (i.e. entities) require those properties.
So with keeping that in mind here's what we are thinking about the new design:
Have separate tables for each entity containing some basic info, e.g. id, name, etc.
Have two tables, attribute type and attribute, to store property information.
Link each entity (or a table if you like) to attributes using a many-to-many relation.
Store addresses in a different table called Addresses and link entities to it via foreign keys.
We think this will allow us to be more flexible when adding, removing or querying on attributes.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given university we might have a query with 20+ joins to fetch all related attributes in a single row.
We desperately need to know some opinions or possible flaws in this design approach.
Thanks for your time.
It's hard to truly critique your approach from a generalized question without more specific examples. If you'd like some more in-depth analysis, try whipping up an ER diagram.
If your data model is changing so much that you're constantly adding/removing properties and many of these properties overlap, you might be better off using EAV.
Otherwise, if you want to maintain a relational approach but are finding a lot of overlap with properties, you can analyze the entities and look for abstractions that link to them.
Ex) My DB has Puppies, Kittens, and Walruses, all with hasFur and furColor attributes. Remove those attributes from the 3 tables and create a FurryAnimal table that links to each of those 3, as sketched below.
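A sketch of that refactoring (names from the example above):

-- shared attributes move into one abstraction table
CREATE TABLE FurryAnimal (
    FurryAnimalId INT PRIMARY KEY,
    hasFur        BIT NOT NULL,
    furColor      VARCHAR(30)
);

-- each concrete table links to it instead of repeating the columns
CREATE TABLE Puppies (
    PuppyId       INT PRIMARY KEY,
    FurryAnimalId INT REFERENCES FurryAnimal (FurryAnimalId)
    -- ...puppy-specific columns
);
-- Kittens and Walruses link the same way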
Of course, the simplest answer is to not touch the data model. Instead, create Views on the underlying tables that you can use to address (5), (4), and (2).
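For example, a view can present one uniform property over tables that spell it differently (illustrative names):

-- same property, different column names, unified by a view
CREATE VIEW PlacesWithParking AS
SELECT 'university' AS EntityType, UniversityId AS EntityId, HasParking
FROM Universities
UNION ALL
SELECT 'hospital', HospitalId, ParkingAvailable
FROM Hospitals;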
(1) cannot be an issue: there should be one place where your objects are defined, and everything else is generated/derived from that. Just refactor your code until this is the case.
(2) is solved by having a metamodel, where you describe which properties live where. This is probably needed for (1) too.
You might want to avoid the problem entirely by programming this in Smalltalk with Seaside on a GemStone object-oriented database. Then you can just have objects with collections and you don't need so many joins.