How to design a database for an unknown amount of 'meta'-data - database

I want to store certain items in the database with a variable number of properties.
For example:
An item can have 'url' and 'pdf' properties, but others do not and instead have 'image' and 'location' properties.
So the problem is that some items have only a few properties and others have a lot.
How would you design this database? How would you make it searchable and performant?
What would the schema look like?
Thanks!

What you are after has a name - Entity Attribute Value (EAV). It is "a data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very vast, but the number that will actually apply to a given entity is relatively modest."

If you are not necessarily tied to SQL, a triple store is designed for precisely this task. Most are designed to be queried with the SPARQL query language.

That sounds like a perfect job for a document database.

Start with your object (item) and create a table for items. Your item can have one, many, or no attributes at all, right? So set up a table of attributes with unique IDs. Now set up a junction table that holds item IDs and attribute IDs (both can repeat across rows):
Item
ItemID
ItemDescription
...
Attributes
AttributeID
AttributeDescription
...
ItemAttributes
rowID
ItemID
AttributeID
Now when you want to query you can simply join the tables and filter however you desire...
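A minimal T-SQL sketch of that layout (names follow the outline above; column types are illustrative):

CREATE TABLE Item (
    ItemID int PRIMARY KEY,
    ItemDescription varchar(500)
);

CREATE TABLE Attributes (
    AttributeID int PRIMARY KEY,
    AttributeDescription varchar(500)
);

CREATE TABLE ItemAttributes (
    rowID int PRIMARY KEY,
    ItemID int REFERENCES Item(ItemID),
    AttributeID int REFERENCES Attributes(AttributeID)
);

-- Every item that carries the (hypothetical) 'pdf' attribute
SELECT i.ItemID, i.ItemDescription
FROM Item i
JOIN ItemAttributes ia ON ia.ItemID = i.ItemID
JOIN Attributes a ON a.AttributeID = ia.AttributeID
WHERE a.AttributeDescription = 'pdf';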

The Entity Attribute Value (EAV) model is very flexible. The semantic web and its query language SPARQL are based on EAV too. But some people don't like it because there is a performance penalty with this model.
Start with doing some high-load performance tests on your database. Don't do them when you are done coding, because then it is too late.
edit: Focus on the speed of your SELECT statements. Users expect quick results when they search.

I have designed tables like this in the past to have the following fields:
id
type
subtype
value
And then I would have another table that would define the types and subtypes used, and possibly give the datatype for that type and subtype combination so that you could programmatically enforce it.
It's not pretty, and you don't want to do it unless you have to. But it's the best way I have found when you do.
update: even if you leave subtype blank, I find it's a good thing to have, because too often you want to subcategorize something that already exists. For example, you create type: address; now you need mailing address, billing address, and physical address.
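A rough sketch of that pair of tables (assuming an extra item_id column to link each row back to its owning entity; all names and types are illustrative):

CREATE TABLE ItemProperty (
    id int PRIMARY KEY,
    item_id int NOT NULL,           -- assumed link to the owning entity
    type varchar(100) NOT NULL,     -- e.g. 'address'
    subtype varchar(100) NULL,      -- e.g. 'mailing', 'billing', 'physical'
    value varchar(500) NOT NULL
);

CREATE TABLE PropertyDefinition (
    type varchar(100) NOT NULL,
    subtype varchar(100) NOT NULL,
    datatype varchar(50) NOT NULL,  -- enforced programmatically, not by the engine
    PRIMARY KEY (type, subtype)
);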

For this kind of scenario I use the XML-type column in MS SQL 2005...
You get all the advantages of XML + SQL; that is, you can use an XPath expression as part of an SQL statement.
It's a feature of MS SQL 2005; I am not sure which other RDBMSs support this.
I am not sure what the implications are performance-wise.
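For example, something like this works in SQL Server 2005 and later (the Properties column and the element names are hypothetical):

CREATE TABLE Items (
    ItemID int PRIMARY KEY,
    Properties xml NULL
);

-- Items that have a 'url' property at all
SELECT ItemID
FROM Items
WHERE Properties.exist('/props/url') = 1;

-- Pull the value out with an XQuery expression
SELECT ItemID,
       Properties.value('(/props/url)[1]', 'varchar(500)') AS Url
FROM Items;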

Create a properties table with the following fields:
item_id int (or whatever the ID type is in the item table)
property_name varchar(500)
property_value varchar(500)
Set a foreign key between item_id and the item's id field, and you're done.
That's how you do a many-to-one relationship in SQL.
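A sketch in T-SQL, assuming the item table is called item with an int id:

CREATE TABLE properties (
    item_id int NOT NULL REFERENCES item(id),
    property_name varchar(500) NOT NULL,
    property_value varchar(500) NOT NULL
);

-- All properties of one item
SELECT property_name, property_value
FROM properties
WHERE item_id = 42;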

Looks like an "items" table with primary key "item_id", plus a "properties" table with primary key "property_id" and a foreign key "item_id" referencing the "items" table. "properties" will have columns "name" and "value", both of type varchar.
Performant? Don't know.

Related

Database theory: best way to have a "flags" table which could apply to many entities?

I'm building a data model for a new project I'm developing. Part of this data model involves entities having "flags".
A flag is simply a boolean value - either an entity has the flag or it does not have the flag. To that end I have a table simply called "flags" that has an ID, a string name, and a description. (An example of a flag might be "is active" or "should be displayed" or "belongs to group".)
So for example, any user in my users table could have none, one, or many flags. So I create a userFlags bridge table with user ID and flag ID. If the table contains a row for the given flag ID and user ID, that user has that flag.
Ok, so now I add another entity - say "section". Each section can also have flags. So I create a sectionFlags table to accommodate this.
Now I have another entity - "content", so again, "contentFlags".
And so on.
My final data model has basically two tables per entity, one to hold the entity and one for flags.
While this certainly works, it seems like there may be a better way to design my model, so I don't have to have so many bridge tables. One idea I had was a master "hasFlags" table with flag ID, item ID and item type. The item type could be an enumerated field only accepting values corresponding to known entities. The only problem there is that my foreign key for the entity will not work because each "item ID" could refer to a different entity. (I have actually used this technique in other data models, and while it certainly works, you lose referential integrity as well as things like cascade updates.)
Or, perhaps my data model is fine as-is and that's just the nature of the beast.
Any more experienced DB devs care to chime in?
The many-to-many relationships are one way to do it (and possibly faster than what I'm about to suggest because they can use integer key indexes).
The other way to do this is with a polymorphic relationship.
Your entity-to-flag table needs two columns as well as the foreign key link to the flag table:
other_key integer not null
other_type varchar(...) not null
And in those fields you store the foreign key of the relation in the integer and the type of the relation in the varchar. Full-on ORMs that support this sometimes store the class name of the foreign relation in the type column, to aid with object loading.
The downside here is that the integer can't be a true foreign key as it will contain duplicates from many tables. It also makes your querying a bit more interesting per-join than the many-to-many tables, but it does allow you to generalise your join in code.
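A sketch of such a polymorphic table (names are illustrative, and note that other_key cannot be declared as a real foreign key):

CREATE TABLE entity_flags (
    flag_id int NOT NULL REFERENCES flags(id),
    other_key int NOT NULL,            -- id in whichever table other_type names
    other_type varchar(50) NOT NULL    -- e.g. 'user', 'section', 'content'
);

-- All flags set on user 7
SELECT f.name
FROM flags f
JOIN entity_flags ef ON ef.flag_id = f.id
WHERE ef.other_type = 'user' AND ef.other_key = 7;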

NoSQL DBMS to query and intersect data by sparse properties

I am in a research phase for a project, where the subject is to identify/select objects (e.g. email address or phone number) by querying for any number of sparsely populated properties associated with each of the objects.
First, I was thinking of Cassandra, with something like:
CREATE TABLE data (
    property text,
    property_value text,
    email_id int,
    PRIMARY KEY (property, property_value)
) WITH COMPACT STORAGE;
Where it is then easy to retrieve the email_id for a given property value.
But the need is to query the data by multiple properties and values. I know it is possible to do it client-side by intersecting, but with possibly millions of rows to intersect, it does not seem very efficient to me.
What is the right approach and technology to execute this kind of queries?
Even if C* has good support for sparse data tables (you can add columns dynamically), it seems to me that your query model doesn't fit well. This could be a good fit for relational databases instead.
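For illustration, a relational database can intersect several property/value pairs in one statement over an EAV-shaped table like the one above (the property names here are made up):

SELECT email_id
FROM data
WHERE (property = 'city' AND property_value = 'Berlin')
   OR (property = 'plan' AND property_value = 'premium')
GROUP BY email_id
HAVING COUNT(DISTINCT property) = 2;  -- must match both pairs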

What is a good approach in creating a non-NoSQL, relational multi-schema database?

Consider a situation where the schema of a database table may change, that is, the fields, number of fields, and types of those fields may vary based on, say a client ID.
Take, for example, a Users table. Typically we would represent this in a horizontal table with the following fields:
FirstName
LastName
Age
However, as I mentioned, each client may have different requirements.
I was thinking that to represent a multi-schema approach to Users in a relational database like SQL Server, this would be done with two tables:
UsersFieldNames - {FieldNameId, ClientId, FieldName, FieldType}
UsersValues - {UserValueId, FieldNameId, FieldValue}
To retrieve the data (using Entity Framework DB First), I'm thinking of a pivot, and the use of something like LINQ Extensions - Pivot Extensions may be useful.
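For a known set of fields, the pivot can also be written with conditional aggregation. A sketch, assuming UsersValues additionally carries a UserId column to group values by user (FirstName/LastName are example field names):

SELECT uv.UserId,
       MAX(CASE WHEN fn.FieldName = 'FirstName' THEN uv.FieldValue END) AS FirstName,
       MAX(CASE WHEN fn.FieldName = 'LastName'  THEN uv.FieldValue END) AS LastName
FROM UsersValues uv
JOIN UsersFieldNames fn ON fn.FieldNameId = uv.FieldNameId
GROUP BY uv.UserId;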
I would like to know of any other approaches that would satisfy this requirement.
I'm asking this question out of my own curiosity, as I recall similar conversations coming up in the past, and in relation to this question that was posed.
Thanks.
While I think a NoSQL database would work best for this, I once tried something like this.
Have a table named something like METATABLES:
METATABLE = {table_name, field_name}
and another,
ACTUAL_DATA = {table_name, field_name, actual_data_id, float_value, string_value, double_value, varchar_value}
In ACTUAL_DATA, the fields table_name and field_name would be foreign keys pointing to METATABLES. In METATABLES you define the specific fields each client requires. The ACTUAL_DATA table holds the actual values of those fields, stored in the appropriate value column depending on the data type (if the field value is a string, it would be stored in the string_value field).
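A rough sketch of those two tables (column types are guesses):

CREATE TABLE METATABLES (
    table_name varchar(100) NOT NULL,
    field_name varchar(100) NOT NULL,
    PRIMARY KEY (table_name, field_name)
);

CREATE TABLE ACTUAL_DATA (
    actual_data_id int NOT NULL,
    table_name varchar(100) NOT NULL,
    field_name varchar(100) NOT NULL,
    float_value real NULL,
    string_value nvarchar(500) NULL,
    double_value float NULL,          -- T-SQL float is double precision
    varchar_value varchar(500) NULL,
    FOREIGN KEY (table_name, field_name)
        REFERENCES METATABLES (table_name, field_name)
);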
This approach is probably not the most efficient, though. Hope it helps.
I think it would be a mistake to have the schema vary. It is typically something you want to be standard.
In this case you may have users that have different attributes. In the user table you store attributes that are common across all users:
USER {id(primary key), username, first, last, DOB, etc...}
Note: Age is something that should not be stored; it should be calculated.
Then you could have a USER_ATTRIBUTE table:
{userId,key,value}
So users can have multiple attributes that are unrelated to one another without the schema changing.
Changing the schema often breaks the application.

Best approach to store data whose attributes can vary

Please read my previous question first: T-SQL finding of exactly same values in referenced table
The main purpose of this question is to find out whether this approach to storing the data is effective.
Maybe it would be better to get rid of the PropertyValues table and use an additional PropertyValues nvarchar(max) column in the Entities table instead. For example, instead of the table
EntityId  PropertyId  PropertyValue
1         4           Val4
1         5           Val5
1         6           Val6
I could store such data in the PropertyValues column as: "4:Val4;5:Val5;6:Val6"
As an alternative, I could store XML in the PropertyValues column....
What do you think about the best approach here?
[ADDED]
Please, keep in mind:
The set of properties must be customizable
Objects will have dozens of properties (approximately 20 to 120), and the database will contain thousands of objects
[ADDED]
Data in the PropertyValues table will change very often. Actually, I store configured products. For example, an admin configures that clothes have the attributes "type", "size", "color", "buttons type", "label type", "label location", etc. Users will then select values for these attributes in the system. So, the PropertyValues data cannot be effectively cached.
You will hate yourself later if you implement a solution using multi-value attributes (i.e. 4:Val4;5:Val5;6:Val6).
XML is marginally better because there are XQuery functions to help you pull out and parse the values. But the XML type is implemented as a CLR type in SQL Server and it can get extremely slow to work with.
The best solution to this problem is one like you have. Use the sql_variant type for the column if it could be any number of data types. Ideally you'd refactor this into multiple tables / entities so that the data type can be something more concrete.
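For example, with sql_variant the column keeps the underlying type of each value (a sketch; the table shape follows the question):

CREATE TABLE PropertyValues (
    EntityId int NOT NULL,
    PropertyId int NOT NULL,
    PropertyValue sql_variant NULL,
    PRIMARY KEY (EntityId, PropertyId)
);

-- SQL_VARIANT_PROPERTY reports the base type actually stored in each row
SELECT EntityId, PropertyId,
       SQL_VARIANT_PROPERTY(PropertyValue, 'BaseType') AS BaseType
FROM PropertyValues;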
I work on a similar project (a web-shop generator). Every product has attributes and every attribute has a set of values; these live in separate tables. And for all of this there are translations in several languages (so there are additional tables for attribute and value translations).
Why did we choose such a solution? Because for every client there should be a database with the same schema, and this schema is very elastic.
So what about this solution? As always, "it depends" -))
Storage. If your values will be reused often across products, e.g. clothes where the attribute "size" and its values repeat often, your attribute/value tables will stay small. Meanwhile, if values are mostly unique rather than repeated (e.g. values for a "page count" attribute on books), you will get a fairly big values table where every value is linked to a single product.
Speed. This schema is not the weakest part of the project, because this data changes rarely. And remember that you can always denormalize the schema into a DW-like solution. You can also use caching if this part of the database turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
So the answer to your question is not simple. If you need an elastic schema with unknown attributes and values, you should use separate tables. And avoid storing values in CSV strings; if you must serialize, storing them as XML (typed and indexed) is better.
UPDATE
I think that PropertyValues will not change often, compared with user orders. But if you have doubts, you should use denormalized tables or indexed views to speed things up. Anyway, changing XML/CSV over a large number of rows will perform poorly, so the "separate table" solution looks good.
The SQL Customer Advisory Team (CAT) has a whitepaper written just for you: Best Practices for Semantic Data Modeling for Performance and Scalability. It goes through the common pitfalls of EAV modeling and recommends how to design a scalable EAV solution.

Single column/primary key only table for referential integrity?

Maybe I'm going about this wrong, but I'm working on a database design for one of my projects.
I have an entity with a classification column which groups entities into convenient categories for the user. These classifications are predefined and unchangeable by the user (at least that's the current design).
I'm trying to decide if I should have an 'EntityClassification' table which contains simply an 'Id' column as the primary key with no other information, in order to have an enforced relationship Entity:Classification -> EntityClassification:Id.
I don't plan to have a name/description column in EntityClassification, since my current thought is that I'll need to support localization of these predefined names, which will be done with static string-table-like resource files downloaded to the client based on their country/language. There really isn't any other data associated with this EntityClassification that I would want, and a table seems like it might be overkill?
Is this common/recommended practice for this type of problem? We're using SQL Server 2008, which doesn't have an enum datatype, and an enum seems to be really what I'm trying to achieve.
You should have the table, with name and description, not only for end-user display but also for internal documentation, so that when the users say 'my query based on this classification doesn't work!' someone hired in the future will know which ID they're talking about.
Do you just want to ensure that the values in Entity:Classification are restricted to your predetermined list? If so, a check constraint might be what you need.
Such constraints aren't as flexible as foreign keys: to alter the checked values you have to drop and recreate the constraint. But you say there are no plans to change the values, so that shouldn't matter.
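For example (the classification values are placeholders):

ALTER TABLE Entity
ADD CONSTRAINT CK_Entity_Classification
CHECK (Classification IN (1, 2, 3));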
