SQL Server Dynamic Columns Problem

I use a table GadgetData to store the properties of gadgets in my application. These gadgets are basically a kind of custom control, and about 80% of their properties are common: height, width, color, type, etc. Each gadget type also has a set of properties unique to it. All of this data has to be stored in the database. Currently I am storing only the common properties. What design approach should I use to store this kind of data, where the columns are dynamic?
Create a table with the common properties as columns and add an extra column of type text to store all the unique properties of each gadget type in XML format.
Create a table with all possible columns across all of the gadget types.
Create a separate table for each type of gadget.
Any other better way you recommend?
(Note: The number of gadget types could grow even beyond 100 and …)

Option 3 is a very normalized option, but it will come back and bite you if you have to query across multiple types: every SELECT gets another join whenever a new type is added. A maintenance nightmare.
Option 2 (a sparse table) will have a lot of NULL values and take up extra space. The table definition will also need updating whenever another type is added. Not as bad, but still painful.
I use option 1 in production (with an xml column instead of text). It lets me serialize any type derived from my common type, extracting the common properties and leaving the unique ones in the XmlProperties column. This can be done in the application or in the database (e.g. in a stored procedure).
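A rough sketch of that layout, with made-up column names and a sample XQuery read (the Properties/MaxVolume path is purely illustrative):

-- Common properties as typed columns; per-type extras in one xml column.
CREATE TABLE GadgetData (
    GadgetId      int IDENTITY PRIMARY KEY,
    GadgetType    varchar(50) NOT NULL,
    Height        int NOT NULL,
    Width         int NOT NULL,
    Color         varchar(20) NULL,
    XmlProperties xml NULL   -- the unique, per-type properties
);

-- Pull a type-specific property back out with XQuery when needed.
SELECT GadgetId,
       XmlProperties.value('(/Properties/MaxVolume)[1]', 'int') AS MaxVolume
FROM   GadgetData
WHERE  GadgetType = 'Speaker';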

Your options:
1. A good one. You could even enforce a schema, etc.
2. You cannot make those columns NOT NULL, so you will lose some data integrity there.
3. Good enough as long as you never search across more than one type of gadget, but option 1 is better.
4. I would use an ORM.
Notes:
If you would like to keep your database 'relational' but are not afraid to use ORM tools, then I would use one. In that case you can store the data (almost) as you want, yet have it handled properly, as long as you map it correctly.
See:
Single Table Inheritance
Concrete Table Inheritance
If you need an SQL-only solution then, depending on your RDBMS, I would probably use an XML column to store all the data that is specific to the gadget type: you get validation and can extend easily with new attributes. You can then keep everything in one table, search quickly on all the common attributes, and still search one gadget type's attributes fairly easily as well.

If all gadget types have many common mandatory properties that can be stored in one table and only a few optional properties, you had better use the first approach: you get the best of a relational schema and ease your life with XML. And don't forget to bind the XML column to an XML Schema collection: you'll get full indexing and XQuery capabilities.
If the gadget types have very different descriptions and share only 1-3 common columns among 5 or more different sets of properties, use the 3rd approach.
But given 100+ types of gadgets, I'd use the 1st approach: it has flexibility backed by good performance, and it is easy to support and develop further.
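The schema-collection binding suggested above looks roughly like this in SQL Server; the collection name and element definition are illustrative, and they assume the GadgetData table sketched earlier:

CREATE XML SCHEMA COLLECTION GadgetPropertiesSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Properties">
     <xs:complexType>
       <xs:sequence>
         <xs:element name="MaxVolume" type="xs:int" minOccurs="0"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
 </xs:schema>';

-- Binding the column rejects invalid XML and enables typed XQuery paths.
ALTER TABLE GadgetData
    ALTER COLUMN XmlProperties xml (GadgetPropertiesSchema);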

Depending on how different the "Gadgets" are, I wouldn't like option 2: there would be a lot of NULLs floating around, which could get bad if you had a column that was mandatory for one gadget but not even used by another.
I would only go with option 3 if the set of gadget types changes infrequently, since adding one would require altering the database each time.
The unmentioned option is to store the gadgets with a child table that holds each gadget's unique values. But this would require a fair amount of work to return a gadget's details, or multiple database calls.
That leaves option 1, except I would use SQL Server's xml type instead of text; you can then use XQuery within your stored procedures.

Related

Ideal data type / structure / model for storing device data with different parameters / attributes in Snowflake

We are in the process of designing a dimensional data model in Snowflake to store data from different devices (from solar and wind plants) for reporting/analytical purposes. The data currently resides in InfluxDB as time-series data. One of the challenges in designing the target DB model is that different devices emit data for different parameters (even though the devices share a superset of parameters, it can vary, and chances are that new parameters will be added to the superset).
One of the key asks is to have no development effort (coding) when new parameters/devices are added, so the model and design need the flexibility to store the data accordingly, driven by configuration. The options are the following:
Create wide fact tables with all the superset parameters and store NULLs for devices that do not send the data.
Pros: Less data volume compared to #2.
Cons: a) Some effort will be needed when new parameters are added.
b) Depending on the reporting tool (which will mostly be custom built, not a BI tool), selecting data for different parameters might not be as straightforward as a WHERE clause on the needed parameters.
Create narrow fact tables: the parameter becomes a dimension table alongside the other dimensions, referenced by an ID column, and the value sits in a single column.
Pros: a) No effort / schema changes when new parameters are added.
b) Ease of selecting and filtering data based on the selected parameters.
Cons: a) Data volume: there are thousands of devices with multiple parameters each, so it will reach approximately 90M records per day (~1 GB; the base data itself is huge, and the unpivot would increase the record count dramatically).
b) Performance considerations due to the increased data volume, especially while querying.
Use Snowflake's support for semi-structured data. The OBJECT data type seems a good fit: the parameter name and value can be stored as a key-value pair.
Pros: a) No effort / schema changes when new parameters are added.
b) Data volume is not increased.
c) Ease of selecting and filtering data using the functions provided by SQL. Is this true? Based on the documentation the querying looks straightforward, especially for the OBJECT data type, but this needs confirmation.
Cons: a) Performance considerations due to the use of semi-structured data types. The documentation mentions that the VARIANT data type stores data in columnar format wherever possible (data remains JSON where it cannot be converted), but there is no mention of the OBJECT data type and how it is handled, so I want to understand whether or not this will have a considerable performance impact.
So, considering the above, what would be the ideal way to store this kind of data, where the structure changes dynamically across devices?
Option 3 is my favorite for laziness, cost, and performance reasons:
Snowflake uses the same storage ideas for OBJECTs and VARIANTs: storage will be optimized for columnar access as long as your underlying object/variant is well suited to it. This means good performance and compression.
Object/variant will need the least maintenance when adding new parameters.
But option 1 has some advantages too:
It's a good idea for governability to know all your columns and their purposes.
3rd party tools understand navigating columns much better than figuring out objects.
Then you could have a great mix of 3+1:
Store everything as object/variant.
Create a view that parses the object and names columns for 3rd party tools.
When new fields are added, you will just need to update the definition of the view.
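A minimal sketch of that 3+1 mix; the table, column, and parameter names are made up:

-- Raw fact table: one OBJECT column holds all device parameters.
CREATE TABLE device_readings (
    device_id  NUMBER,
    reading_ts TIMESTAMP_NTZ,
    params     OBJECT
);

-- A view exposes named, typed columns for reporting tools; only this
-- definition needs updating when a new parameter appears.
CREATE OR REPLACE VIEW device_readings_v AS
SELECT device_id,
       reading_ts,
       params['power_kw']::FLOAT   AS power_kw,
       params['wind_speed']::FLOAT AS wind_speed
FROM   device_readings;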

Database performance: Using one entity/table with the max. possible properties or split to different entities/tables?

I need to design some database tables, but I'm not sure about the performance impact. In my case it's more about read performance than about saving the data.
The situation
With the help of pattern recognition I am finding out how many values of a certain object need to be saved in my PostgreSQL database.
Among other, let's say fixed, properties, the only difference is whether 1, 2, or 3 values of the same type need to be saved.
Currently I have 3 entities/tables which differ only in having 1, 2, or 3 non-nullable properties of the same type.
For example:
class EntityTestOne {          // maps to TableOne
    // ... other (same) properties
    String optionOne;
}
class EntityTestTwo {          // maps to TableTwo
    // ... other (same) properties
    String optionOne;
    String optionTwo;
}
class EntityTestThree {        // maps to TableThree
    // ... other (same) properties
    String optionOne;
    String optionTwo;
    String optionThree;
}
I expect to have several million records in production, and I'm wondering what the performance impact of this variant could be, and what the alternatives are.
Alternatives
Other options which come into my mind:
Use only one entity class or table with 3 options (optionTwo and optionThree would then be nullable). With millions of expected records plus caching, I'm asking myself: isn't it a kind of 'waste' to save millions of NULL values in at least two layers (the database itself and Hibernate)? In another answer I read yesterday that saving a NULL value in PostgreSQL needs only 1 bit, which I don't think is that much if we're talking about several million records that can contain some nullable properties (link).
Create another entity/table and use a collection (List or Set) relationship instead.
For example:
class EntityOption {
    String value;
}
class EntityTest {
    // ... other (same) properties
    List<EntityOption> options;
}
If I use this relationship: what would give better performance when creating new records, creating new EntityOptions for every new EntityTest, or doing a lookup first and referencing an existing EntityOption if one exists? And what about the read performance when fetching them later, given the joins that will then be needed?
Compared to the variant with one plain entity with three options, I can imagine it could be slightly slower...
As I'm not that strong in database design or in working with Hibernate, I'm interested in the pros and cons of these approaches, and whether there are further alternatives.
I would even like to ask whether PostgreSQL is the right choice for this, or whether I should think about using another (free) database.
Thanks!
The case is pretty clear in my opinion: If you have an upper limit of three properties per object, use a single table with nullable attributes.
A NULL value does not take up any space in the database. For every row, PostgreSQL stores a bitmap that records which attributes are NULL. This bitmap is always stored, except when all attributes are not nullable. See the documentation for details.
So don't worry about storage space in this case.
Using three different tables, or storing the attributes in a separate table, will probably lead to UNIONs or JOINs in your queries, which will make the queries more complicated and slow.
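A sketch of that single table in PostgreSQL, borrowing the names from the question (the id column is an assumption):

-- One table instead of three; optionTwo and optionThree are simply
-- NULL for objects that do not have them.
CREATE TABLE entity_test (
    id           bigserial PRIMARY KEY,
    -- ... other (same) properties ...
    option_one   text NOT NULL,
    option_two   text,
    option_three text
);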
There are many inheritance strategies for creating entity classes. I think you should go with the single-table strategy, where there is a discriminator column (managed by Hibernate itself); all common fields are used by each entity, while the type-specific fields are used by the specific entity and remain NULL for the others.
This will give improved read performance.
For your reference:
http://www.thejavageek.com/2014/05/14/jpa-single-table-inheritance-example/
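At the SQL level, that strategy amounts to adding a discriminator column to the single table sketched above; DTYPE is Hibernate's default name for it:

ALTER TABLE entity_test
    ADD COLUMN dtype varchar(31) NOT NULL DEFAULT 'EntityTestOne';
-- Hibernate writes the concrete subclass name into dtype and filters
-- on it automatically when you query a specific entity type.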

Database storage design of large amounts of heterogeneous data

Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often there are hundreds of different kinds of items, often with very different data structures behind them - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you face a dilemma: how would you represent that in, say, a typical SQL database?
Some attempts at a solution that I have seen, none of which I find satisfying:
Binary serialization of the items; the database just holds an ID and a blob.
Pros: takes like 10 seconds to implement.
Cons: basically sacrifices every database feature, hard to maintain, near impossible to refactor.
A table per item type.
Pros: clean, flexible.
Cons: with a wide variety come hundreds of tables, and every search for an item has to query them all, since SQL doesn't have the concept of a table/type 'reference'.
One table with a lot of fields that aren't used by every item.
Pros: takes like 10 seconds to implement, still searchable.
Cons: wastes space, hurts performance, and it is hard to tell from the database which fields are in use.
A few tables with a few 'base profiles' for storage, where similar items get thrown together and use the same fields for different data.
Pros: I've got nothing.
Cons: wastes space, hurts performance, and it is hard to tell from the database which fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
-- Attribute types here are placeholders.
CREATE TABLE PRODUCT (
    id   int PRIMARY KEY,
    type varchar(20),
    att1 varchar(50)
);
CREATE TABLE PRODUCT_X (
    id   int PRIMARY KEY REFERENCES PRODUCT (id),
    att2 varchar(50),
    att3 varchar(50)
);
CREATE TABLE PRODUCT_Y (
    id   int PRIMARY KEY REFERENCES PRODUCT (id),
    att4 varchar(50),
    att5 varchar(50)
);
For attributes that you don't need to search/sort/analyze, use a blob or XML.
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; you can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogeneous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
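A sketch of the kind + JSONB pattern in Postgres; the table name, kinds, and keys are invented for illustration:

-- "kind" declares which mini-schema to expect in the JSONB payload.
CREATE TABLE item (
    id    bigserial PRIMARY KEY,
    kind  text NOT NULL,
    extra jsonb NOT NULL DEFAULT '{}'
);

-- A GIN index supports containment queries on the payload.
CREATE INDEX item_extra_idx ON item USING gin (extra);

-- Find all books with more than 100 pages.
SELECT id
FROM   item
WHERE  kind = 'book'
AND    (extra ->> 'pages')::int > 100;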
Yes, it is a pain to design database formats like this. I'm designing a notification system and ran into the same problem. My notification system is, however, less complex than yours: the data it holds is at most IDs and usernames. My current solution is a mix of 1 and 3: I serialize the data that differs between notifications, and use a column for the 2 usernames (some may have 2, some 1). I shy away from method 2 because I hate that design, but it's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMSs: it sounds like a non-relational store (especially a key/value one) may be a better fit for this data, especially if item 1 and item 2 differ from each other a lot.
I'm sure this has been asked here a million times before, but in addition to the options discussed in your question, you can look at the EAV schema, which is very flexible but has its own set of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically, all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however: conceptually, what does it really mean to accurately query things that are unstructured?
First of all, do you actually need the concurrency, scalability, and ACID transactions of a real database? Unless you are building an MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this in a DB? You can define the components independently in their own tables, and then for the entities (each in their own table as well) you would add a "Components" column holding an array of IDs referencing those components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, you can set the corresponding flag in each entity's bag of components based on the components being loaded, so you know which components the entity has, and they then become queryable.
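One relational approximation of this, using a junction table instead of an ID array so the keys stay enforceable (all names are invented):

CREATE TABLE entity (
    entity_id bigserial PRIMARY KEY
);

CREATE TABLE component (
    component_id bigserial PRIMARY KEY,
    kind         text NOT NULL,   -- 'Position', 'Collectable', ...
    data         jsonb NOT NULL   -- the component's custom fields
);

CREATE TABLE entity_component (
    entity_id    bigint REFERENCES entity,
    component_id bigint REFERENCES component,
    PRIMARY KEY (entity_id, component_id)
);

-- "Which entities are collectable?" becomes a plain join.
SELECT e.entity_id
FROM   entity e
JOIN   entity_component ec USING (entity_id)
JOIN   component c USING (component_id)
WHERE  c.kind = 'Collectable';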
Here's an interesting read about component-based entity systems.

Database design - should I use 30 columns or 1 column with all data in the form of JSON/XML?

I am doing a project which needs to store 30 distinct fields for a piece of business logic, which will later be used to generate a report for each.
The 30 distinct fields are not written at one time; the business logic has many transactions, which go like:
Transaction 1, update field 1-4
Transaction 2, update field 3,5,9
Transaction 3, update field 8,12, 20-30
...
...
N.B. Each transaction (all belonging to one piece of business logic) updates an arbitrary number of fields, in no particular order.
I am wondering which database design would be best:
Have 30 columns in the Postgres database representing those 30 distinct fields.
Have the 30 fields stored in the form of XML or JSON in just one column of the Postgres database.
Which one is better, 1 or 2?
If I choose 1:
I know it is easier from a programming perspective, because this way I don't need to read the whole XML/JSON, update only a few fields, and write it back to the database; I can just update the few columns I need for each transaction.
If I choose 2:
I could potentially reuse the table generically for something else, since what's inside the blob column is only XML. But is it wrong to use a generic table to store something totally unrelated in business logic, just because it has a blob column storing XML? This does have the potential to save the effort of creating a few new tables. But is this kind of generic table reuse wrong in an RDBMS?
Also, by choosing 2 it seems I would be able to handle potential changes, like changing certain fields or adding more fields; at least it seems I wouldn't need to change the database table. But I would still need to change the C++ and C# code to handle the change internally, so I'm not sure this is any advantage.
I am not experienced enough in database design to make this decision. Any input is appreciated.
N.B. There is a good chance I won't need to index or search on those 30 columns for now; a primary key will be created on an extra column if I choose 2. But I am not sure whether I will later be required to search on any of those columns/fields.
Basically, all my fields are predefined from the requirements documents; they are generally simple fields like:
field1: value (max len 10)
field2: value (max len 20)
...
field20: value (max len 2)
No nested fields. Is it worth creating 20 columns, one for each of those fields (some are strings like date/time, some are plain strings, some are integers, etc.)?
And for 2:
Is putting different business logic in a shared table a bad design idea, if it is only put in a shared table because the records share the same structure? E.g. they all have a date/time column, a primary key, and an XML column with different business logic inside. This way we save some effort in creating new tables... Is that saving worth it?
Always store your XML/JSON fields as separate fields in a relational database. Doing so keeps your database normalized, allows the database to do its thing with queries/indices etc., and saves other developers the headache of deciphering your XML/JSON field.
It will be more work up front to extract the fields from the XML/JSON, and perhaps to maintain it if fields need to be added, but once you create a class or classes to do so, that hurdle is eliminated, and it will more than make up for the cryptic blob field.
In general it's wise to split the JSON or XML document out and store it as individual columns. This gives you the ability to set up constraints on the columns for validation and checking, to index columns, to use appropriate data types for each field, and generally use the power of the database.
Mapping it to/from objects isn't generally too hard, as there are numerous tools for this. For example, Java offers JAXB and JPA.
The main time when splitting it out isn't such a great idea is when you don't know in advance what the fields of the JSON or XML document will be or how many of them there will be. In this case you really only have two choices - to use an EAV-like data model, or store the document directly as a database field.
In this case (and this case only) I would consider storing the document in the database directly. PostgreSQL's SQL/XML support means you can still create expression indexes on xpath expressions, and you can use triggers for some validation.
This isn't a good option, it's just that EAV is usually an even worse option.
If the document is "flat" - i.e. a single level of keys and values, with no nesting - then consider storing it as hstore instead, as the hstore data type is a lot more powerful.
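A minimal hstore sketch (table and key names invented); note that hstore values are always text, which suits a flat document of short fields:

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE record (
    id     bigserial PRIMARY KEY,
    fields hstore NOT NULL
);

-- A GIN index supports key and containment lookups.
CREATE INDEX record_fields_idx ON record USING gin (fields);

INSERT INTO record (fields)
VALUES ('field1 => "abc", field2 => "2014-05-01"');

SELECT id FROM record WHERE fields -> 'field1' = 'abc';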
(1) is more standard, for good reasons. For one thing, it enables the database to do the heavy lifting on things like search and indexing.

Stringly typed values table in SQL, is there a better way to do this? (We're using MSSQL.)

We have a table layout with property names in one table, values in a second table, and items in a third. (Yes, we're re-implementing tables in SQL.)
We join all three to get a value of a property for a specific item.
Unfortunately the values can have multiple data types: double, varchar, bit, etc. Currently the consensus is to stringly type all the values and store the type name in a column next to the value.
tblValues
DataTypeName nvarchar
Is there a better, cleaner way to do this?
Clarifications:
Our requirements state that we must be able to add new "attributes" at run time without modifying the DB schema.
I would prefer not to use EAV, but that is the direction we are headed right now.
This system currently exists in SQL Server using a traditional DB design, but I can't see a way to fulfill our requirement of not modifying the DB schema without moving to EAV.
There are really only two patterns for implementing an 'EAV model' (assuming that's what you want to do):
Implement it as you've described, where you explicitly store the property value type along with the value, and use that to convert the string values stored into the appropriate 'native' types in the application(s) that access the DB.
Add a separate column for each possible datatype you might store as a property value. You could also include a column that indicates the property value type, but it wouldn't be strictly necessary.
Solution 1 is a simpler design, but it incurs the overhead of converting the string values stored in the table into the appropriate data type as needed.
Solution 2 has the benefit of storing values as the appropriate native type, but it will necessarily require more (though not necessarily much more) space. This may be moot if there aren't a lot of rows in this table. You may want to add a check constraint that allows only one non-NULL value across the different value columns; or, if you're including a type column (so as to avoid checking for non-NULL values in the different value columns), prevent mismatches between the value stored in the type column and which value column contains the non-NULL value.
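A sketch of solution 2, keeping the question's tblValues name; the property columns and the chosen types are assumptions:

CREATE TABLE tblValues (
    ItemId      int           NOT NULL,
    PropertyId  int           NOT NULL,
    ValueString nvarchar(400) NULL,
    ValueFloat  float         NULL,
    ValueBit    bit           NULL,
    CONSTRAINT PK_tblValues PRIMARY KEY (ItemId, PropertyId),
    -- Exactly one of the typed columns may hold the value.
    CONSTRAINT CK_tblValues_OneValue CHECK (
        CASE WHEN ValueString IS NOT NULL THEN 1 ELSE 0 END
      + CASE WHEN ValueFloat  IS NOT NULL THEN 1 ELSE 0 END
      + CASE WHEN ValueBit    IS NOT NULL THEN 1 ELSE 0 END = 1
    )
);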
As HLGEM states in her answer, this is less preferred than a standard relational design, but I'm more sympathetic to the use of EAV model table designs for data such as application option settings.
Well, don't do that! You lose all the value of having data types if you do. You can't properly constrain the values (and you will, I guarantee it, eventually get bad data), and you have to cast them back to the proper type to use them in mathematical or date calculations. All in all, a performance loser.
Your whole design will not scale well. Read up on why you don't want to use EAV tables in a relational database. It is not only generally slower but unusually difficult to query, especially for reporting.
Perhaps a NoSQL database would better suit your needs, or else a proper relational design and NOT an EAV design. Is it really too hard to figure out what fields each table actually needs, or are your developers just lazy? Are you sacrificing performance for flexibility - a flexibility that most users will hate, especially when it means bad performance? Have you ever tried to use a database designed that way to do anything?
