Best practices for analyzing/reporting database with 'flexible' schema - sql-server

I am given a task to create views (Excel, websites, etc. not database 'view') for a SQL Server table with 'flexible' schema like below:
Session(guid) | Key(int) | Value(string)
My first thought is to create a series of 'standard' relational data tables/views that speak to the analysis/reporting requests. They can be either new tables updated by a daemon service who transforms data on a schedule, or just a series of views with deeply nested queries. Then, use SSAS, SSRS and other established ways to do the analysis and reporting. But I'm totally uncertain if that's the right line of thinking.
So my questions are:
Is there a terminology for this kind of 'flexible' schema so that I can search for related information?
Do my thoughts make sense or they're totally off?
If my thoughts make sense, should I create views with deep queries or new tables + data transform service?

I would start with an SSAS cube to expose all the values , presuming you can get some descriptive info from the key. The cube might have one measure (count) and three dimensions for each of your attributes.
This cube would have little value for end users (too confusing), but I would use it to validate whether any particular data is actually usable before proceeding. I think this is important because usually this data structure masks weak data validation and integrity in the source system.
Once a subject has been validated I would build physical tables via SSIS in preference to views - I find them easier to test and tune.

Finally found the terminology - it's called entity-attribute-value pattern (EAV) and there are a lot of discussions and resources around it.

Related

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I cleanup a database with no relationships, foreign keys, and not a whole lot of structure?
I'm an amateur to SQL, and I've inherited a database that is complete mess. We have no sort of referential integrity, and there's not a whole lot of logic to how tables are working.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adapt.
My question how do I take my mess of a database and make something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up from "isolated" tables, I would advise you to start by aggregating all that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or adhoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions and the transactions (the metrics) will be the facts. For example, without knowing your model I'm guessing that the immediate entities would include Customer, Site, Part and transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc...
More information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
When your model is designed (and implemented in a database like SQL Server) you'll then be loading data into the model, by extracting it from its different source systems/databases and transforming it from the current structure into the structure defined by the model, namely by using an ETL tool like MS Integration Services. For example, your Customer data may be scattered across the "sales", "customer" and "site", so you want to aggregate all that data and load it into a single Customer dimension table. It's when doing this ETL that you should check your data for the problems you already mentioned, loading correct rows into you data model and discarding incorrect rows into a file/log where they can later be checked and corrected. (multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and
context on the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model is closely matches business users terminology and mental models.
use an ETL tool to extract the data from its current source, process it (e.g. check for data quality problems, join data from different sources) and load it into the dimensional model and check it for problems. This will get you close to having an automated data integration job/pipeline with quality checks you deem fit for the data.

Design database based on EAV or XML for objects with variable features in SQL Server?

I want to make a database that can store any king of objects and for each classes of objects different features.
Giving some of the questions i asked on different forums the solution is http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Xml with some kind of validation before storage.
Can you please give me an alternative to the ones above or some advantages or examples that would help decide which of the two methods is the best one in my case?
Thanks
UPDATE 1 :
Is your db read or write intensive?
will be both -> auction engine
Will you ever conceivably move off SQL Server and onto another platform?
I won't move it, I will use a WCF Service to expose functionality to mobile devices.
How do you plan to surface your data to the application?
Entity Framework for DAL and WCF Service Layer for Bussiness
Will people connect to your data through means other than those you control?
No
While #marc_s is correct in his cautions, there unarguably are situations where the relational model is just not flexible enough. For quite a number of years now, I've been working with a database that is straightforwardly relational for the largest part, but has a small EAV part. This is because users can invent new properties any time for observation purposes in trials.
Admittedly, it is awkward wrt querying and reporting, to name a few, but no other strategy would suffice here. We use stored procedures with T-Sql's pivot to offer flattened data structures for reporting and grids with dynamic columns for display. Once the infrastructure stands it's pretty comfortable altogether.
We never considered using XML data because it wasn't there yet and, apart from its common limitations, it has some drawbacks in our context:
The EAV data is queried heavily. A development team needs more than standard sql knowledge because of the special syntax. Indexing is possible but "there is a cost associated with maintaining the index during data modification" (as per MSDN).
The XML datatype is far less accessible than regular tables and fields when it comes to data processing and reporting.
Hardly ever do users fetch all attribute values of an entity, but the whole XML would have to be crunched anyway.
And, not unimportant: XML datatype is not (yet) supported by Entity Framework.
So, to conclude, I would go for a design that is relational as much as possible but EAV where necessary. Auction items could have a number of fixed fields and EAV's for the flexible data.
I will use my answer from another question:
EAV:
Storage. If your value will be used often for different products, e.g. clothes where attribute "size" and values of sizes will be repeated often, your attribute/values tables will be smaller. Meanwhile, if values will be rather unique that repeatable (e.g. values for attribute "page count" for books), you will get a big enough table with values, where every value will be linked to one product.
Speed. This scheme is not weakest part of project, because here data will be changed rarely. And remember that you always can denormalize database scheme to prepare DW-like solution. You can use caching if database part will be slow too.
Elasticity This is the strongest part of solution. You can easily add/remove attributes and values and ever to move values from one attribute to another!
XML storage is more like NoSQL: you will abdicate database functionality and you wisely prepare your solution to:
Do not lose data integrity.
Do not rewrite all database functionality in application (it is senseless)
I think there is way too much context missing for anyone to add any kind of valid comment to the discussion.
Is your db read or write intensive?
Will you ever conceivably move off SQL Server and onto another platform?
How do you plan to surface your data to the application?
Will people connect to your data through means other than those you control?
First do not go either route unless the structure truly cannot be known in advance. Using EAV or XML because you don't want to actually define the requirements will result in an unmaintainable mess and a badly performing mess at that. Usually at least 90+% (a conservative estimate based on my own experience) of the fields can be known in advance and should be in ordinary relational tables. Only use special techiniques for structures that can't be known in advance. I can't stress this strongly enough. EAV tables look simple but are actually very hard to query especially for complex reporting queries. Sure it is easy to get data into them, but very very difficult to get the data back out.
If you truly need to go the EAV route, consider using a nosql database for that part of the application and a relational database for the rest. Nosql databases simply handle EAV better.

Can you provide some advice on setting up my database?

I'm working on a MUD (Multi User Dungeon) in Python and am just now getting around to the point where I need to add some rooms, enemies, items, etc. I could hardcode all this in, but it seems like this is more of a job for a database.
However, I've never really done any work with databases before so I was wondering if you have any advice on how to set this up?
What format should I store the data in?
I was thinking of storing a Dictionary object in the database for each entity. In htis way, I could then simply add new attributes to the database on the fly without altering the columns of the database. Does that sound reasonable?
Should I store all the information in the same database but in different tables or different entities (enemies and rooms) in different databases.
I know this will be a can of worms, but what are some suggestions for a good database? Is MySQL a good choice?
1) There's almost never any reason to have data for the same application in different databases. Not unless you're a Fortune500 size company (OK, i'm exaggregating).
2) Store the info in different tables.
As an example:
T1: Rooms
T2: Room common properties (aplicable to every room), with a row per **room*
T3: Room unique properties (applicable to minority of rooms, with a row per property per room - thos makes it easy to add custom properties without adding new columns
T4: Room-Room connections
Having T2 AND T3 is important as it allows you to combine efficiency and speed of row-per-room idea where it's applicable with flexibility/maintanability/space saving of attribute-per-entity-per-row (or Object/attribute/value as IIRC it's called in fancy terms) schema
Good discussion is here
3) Implementation wise, try to write something re-usable, e.g. have generic "Get_room" methods, which underneath access the DB -= ideally via transact SQL or ANSI SQL so you can survive changing of DB back-end fairly painlessly.
For initial work, you can use SQLite. Cheap, easy and SQL compatible (the best property of all). Install is pretty much nothing, DB management can be done by freeware tools or even FireFox plugin IIRC (all of FireFox 3 data stores - history, bookmarks, places, etc... - are all SQLite databases).
For later, either MySQL or Postgres (I don't do either one professionally so can't recommend one). IIRC at some point Sybase had free personal db server as well, but no idea if that's still the case.
This technique is called entity-attribute-value model. It's normally preferred to have DB schema that reflects the structure of the objects, and update the schema when your object structure changes. Such strict schema is easier to query and it's easier to make sure that the data is correct on the database level.
One database with multiple tables is the way to do.
If you want a database server, I've recommend PostgreSQL. MySQL has some advantages, like easy replication, but PostgreSQL is generally nicer to work with. If you want something smaller that works directly with the application, SQLite is a good embedded database.
Storing an entire object (serialized/encoded) as a value in the database is bad for querying - I am sure that some queries in your mud will NOT need to know 100% of attributes, or may retrieve a list of object by a value of attributes.
it seems like this is more of a job
for a database
True, although 'database' doesn't have to mean 'relational database'. Most existing MUDs store all data in memory, and read it in from flat-file saved in a plain-text data format. I'm not necessarily recommending this route, just pointing out that a traditional database is by no means necessary. If you do want to go the relational route, recent versions of Python come with sqlite which is a lightweight embedded relational database with good SQL support.
Using relational databases with your code can be awkward. Any change to a game logic class can require a parallel change to the database, and changes to the code that read and write to the database. For this reason good planning will help you a lot, but it's hard to plan a good database schema without experience. At least get your entity classes planned first, then build a database schema around it. Reading up on normalizing a database and understanding the principles there will help.
You may want to use an 'object-relational mapper' which can simplify a lot of this for you. Examples in Python include SQLObject, SQLAlchemy, and Autumn. These hide a lot of the complexities for you, but as a result can hide some of the important details too. I'd recommend using the database directly until you are more familiar with it, and consider using an ORM in the future.
I was thinking of storing a Dictionary
object in the database for each
entity. In htis way, I could then
simply add new attributes to the
database on the fly without altering
the columns of the database. Does that
sound reasonable?
Unfortunately not - if you do that, you waste 99% of the capabilities of the database and are effectively using it as a glorified data store. However, if you don't need aforementioned database capabilities, this is a valid route if you use the right tool for the job. The standard shelve module is well worth looking at for this purpose.
Should I store all the information in
the same database but in different
tables or different entities (enemies
and rooms) in different databases.
One database. One table in the database per entity type. That's the typical approach when using a relational database (eg. MySQL, SQL Server, SQLite, etc).
I know this will be a can of worms,
but what are some suggestions for a
good database? Is MySQL a good choice?
I would advise sticking with sqlite until you're more familiar with SQL. Otherwise, MySQL is a reasonable choice for a free game database, as is PostGreSQL.
One database. Each database table should refer to an actual data object.
For instance, create a table for all items, all creatures, all character classes, all treasures, etc.
Spend some time now and figure out how objects will relate to each other, as this will affect your database structure. For example, can a character have more than one character class? Can monsters have character classes? Can monsters carry items? Can rooms have more than one monster?
It seems pedantic, but you'll save yourself a whole lot of trouble early by figuring out what database objects "belong" to which other database objects.

How would you design your database to allow user-defined schema

If you have to create an application like - let's say a blog application, creating the database schema is relatively simple. You have to create some tables, tblPosts, tblAttachments, tblCommets, tblBlaBla… and that's it (ok, i know, that's a bit simplified but you understand what i mean).
What if you have an application where you want to allow users to define parts of the schema at runtime. Let's say you want to build an application where users can log any kind of data. One user wants to log his working hours (startTime, endTime, project Id, description), the next wants to collect cooking recipes, others maybe stock quotes, the weekly weight of their babies, monthly expenses they spent for food, the results of their favorite football teams or whatever stuff you can think about.
How would you design a database to hold all that very very different kind of data? Would you create a generic schema that can hold all kind of data, would you create new tables reflecting the user data schema or do you have another great idea to do that?
If it's important: I have to use SQL Server / Entity Framework
Let's try again.
If you want them to be able to create their own schema, then why not build the schema using, oh, I dunno, the CREATE TABLE statment. You have a full boat, full functional, powerful database that can do amazing things like define schemas and store data. Why not use it?
If you were just going to do some ad-hoc properties, then sure.
But if it's "carte blanche, they can do whatever they want", then let them.
Do they have to know SQL? Umm, no. That's your UIs task. Your job as a tool and application designer is to hide the implementation from the user. So present lists of fields, lines and arrows if you want relationships, etc. Whatever.
Folks have been making "end user", "simple" database tools for years.
"What if they want to add a column?" Then add a column, databases do that, most good ones at least. If not, create the new table, copy the old data, drop the old one.
"What if they want to delete a column?" See above. If yours can't remove columns, then remove it from the logical view of the user so it looks like it's deleted.
"What if they have eleventy zillion rows of data?" Then they have a eleventy zillion rows of data and operations take eleventy zillion times longer than if they had 1 row of data. If they have eleventy zillion rows of data, they probably shouldn't be using your system for this anyway.
The fascination of "Implementing databases on databases" eludes me.
"I have Oracle here, how can I offer less features and make is slower for the user??"
Gee, I wonder.
There's no way you can predict how complex their data requirements will be. Entity-Attribute-Value is one typical solution many programmers use, but it might be be sufficient, for instance if the user's data would conventionally be modeled with multiple tables.
I'd serialize the user's custom data as XML or YAML or JSON or similar semi-structured format, and save it in a text BLOB.
You can even create inverted indexes so you can look up specific values among the attributes in your BLOB. See http://bret.appspot.com/entry/how-friendfeed-uses-mysql (the technique works in any RDBMS, not just MySQL).
Also consider using a document store such as Solr or MongoDB. These technologies do not need to conform to relational database conventions. You can add new attributes to any document at runtime, without needing to redefine the schema. But it's a tradeoff -- having no schema means your app can't depend on documents/rows being similar throughout the collection.
I'm a critic of the Entity-Attribute-Value anti-pattern.
I've written about EAV problems in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Here's an SO answer where I list some problems with Entity-Attribute-Value: "Product table, many kinds of products, each product has many parameters."
Here's a blog I posted the other day with some more discussion of EAV problems: "EAV FAIL."
And be sure to read this blog "Bad CaRMa" about how attempting to make a fully flexible database nearly destroyed a company.
I would go for a Hybrid Entity-Attribute-Value model, so like Antony's reply, you have EAV tables, but you also have default columns (and class properties) which will always exist.
Here's a great article on what you're in for :)
As an additional comment, I knocked up a prototype for this approach using Linq2Sql in a few days, and it was a workable solution. Given that you've mentioned Entity Framework, I'd take a look at version 4 and their POCO support, since this would be a good way to inject a hybrid EAV model without polluting your EF schema.
On the surface, a schema-less or document-oriented database such as CouchDB or SimpleDB for the custom user data sounds ideal. But I guess that doesn't help much if you can't use anything but SQL and EF.
I'm not familiar with the Entity Framework, but I would lean towards the Entity-Attribute-Value (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model) database model.
So, rather than creating tables and columns on the fly, your app would create attributes (or collections of attributes) and then your end users would complete the values.
But, as I said, I don't know what the Entity Framework is supposed to do for you, and it may not let you take this approach.
Not as a critical comment, but it may help save some of your time to point out that this is one of those Don Quixote Holy Grail type issues. There's an eternal quest for probably over 50 years to make a user-friendly database design interface.
The only quasi-successful ones that have gained any significant traction that I can think of are 1. Excel (and its predecessors), 2. Filemaker (the original, not its current flavor), and 3. (possibly, but doubtfully) Access. Note that the first two are limited to basically one table.
I'd be surprised if our collective conventional wisdom is going to help you break the barrier. But it would be wonderful.
Rather than re-implement sqlservers "CREATE TABLE" statement, which was done many years ago by a team of programmers who were probably better than you or I, why not work on exposing SQLSERVER in a limited way to the users -- let them create thier own schema in a limited way and leverage the power of SQLServer to do it properly.
I would just give them a copy of SQL Server Management Studio, and say, "go nuts!" Why reinvent a wheel within a wheel?
Check out this post you can do it but it's a lot of hard work :) If performance is not a concern an xml solution could work too though that is also alot of work.

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much a replication, more on accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset then the option of locally caching starts to become trickier. What options would you go for in these circumstances. I wondered about an XML service but tuinstoel (in a comment to le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED) replication is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, then a messaging solution seems the most appropriate Oracle solution. Oracle Advanced Queueing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be roll-your-own XML web services but only where the relative ease of Advanced Queueing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs Oracle will only propagate the changes made in the master table. A complete refresh of the data isn't necessary. Oracle streams also only communicate the modifications to the other side.
Buying storage is cheap, so why not local caching? Much cheaper than programming your own solutions.
An XML service doesn't help you when its database is not available so I don't understand why it would help? Oracle has many options for replication, explore them.
edit
I've build xml services. They provide interoperability between different systems with a clear interface (contract). You can build a xml service in C# and consume the service with Java. However xml services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances - It's already there. You can have propagation move messages from one instance to another when they are both up. You can process them as needed in the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.

Resources