Application of mutable database structures? - database

I'm wondering if there are any benefits one could gain by taking advantage of the possibility to alter the database's table structures on the fly and querying the DB engine for the current structure before making queries (or just using the fields that are sure to be in the table). Are there any examples of systems that use such approach?

Check out http://blog.mongodb.org/post/119945109/why-schemaless.
From my personal experience I've found the following benefits:
Easier to change
Some conceptual structures are just too much of a pain to use in a normalized set of tables. For example, at my job we have a custom internal CMS system. To represent one "article" we have around 35 tables. When we have to run queries to generate one full article it is very painful.
When that same article is represented as a document, all the information is still there, it's just considerably easier to change the object in code and then serialize and de-serialize instead of writing queries with 20 joins.
Versioning
It makes versioning so much easier. You might start with one schema for your system and later decide you need to add/remove fields. In a traditional RDBMS this can be painful to deploy. With a document database, you just insert new documents with a different schema and everything is fine. (Keep in mind document data stores give you tools to handle this!)
Faster Development
This flows out of things being easier to change. The gain is significant from what I've seen. No more spending time making tables and making sure the data types are perfect and everything is very normalized.

Related

Use Event Sourcing to get rid of relational database?

I've read some articles about CQRS and Event Sourcing recently. While the first seemed to me like a highly complex and risky workaround to fix poorly performing business and poorly designed data access layers and data models, the last seemed like a solution to many problems.
Problems to solve using Event Sourcing:
Get rid of Relational Database and Object Relational Mappers, like NHibernate and Entity Framework. Hardly anybody in the programming area wants to pay attention to such stuff like indices, table/index fragmentation or normalization, how to design relational data and how to code/configure the ORM (a science on its own).
Have Business Model and in-memory "database" united, an entity/aggregate service keeping all relevant items in memory, maintaining integrity by simply dumping the CUD events somewhere without much pain. Old items can be evicted from memory and dumped to a NoSQL (or whatever) store and used for aggregate calculations, reporting, search and, if necessary, be re-activated. If I understand right, in-memory databases like VoltDB use event dumping in a similar way, but are still relational databases, separated from the business logic.
This would also make concurrency easier: instead of locking (with possible complete system deadlocks) or optimistic locking with a general "success or fail" logic, depending on whether the data has changed meanwhile (or rather complex DB code), merge rules can be implemented in code.
History: no more pain with implementing auditing functions, cemetary tables or "deleted" marker columns, or possibly deleted data still being required.
Data Duplication/Search/Reporting: use full-text indices instead of chasing missing relational indices, create proper viewing areas, preparing the data for the user in a required format, instead of using ugly copy routines in relational databases, with triggers, followup stored procedures or even program code copying data to half a dozen different tables.
Versioning: it's a pain to get many modules running with a number of different relational database versions, each having different tables and columns and needs appropriate ORM mappings. Could be much easier in a single layer model, with the event dump accepting any object format (typically schema-less or loose-schema NoSQL documents, represented as JSON or XML). It might also be possible to upgrade old data through a "data schema change event" chain (instead of having to maintain migration scripts for relational DBs).
N-tier Business Model / Relational DB / ORM mess
An n-tier approach a decade or longer ago might have been a business layer and data access layer. In order to keep separation really strict, many relational features were omitted, to implement them in the business layer instead: relational integrity, normalization, with the DB being what I call a "trash dump": looking like a kid playing around with SQL Server Management Studio or Access. Extremly un-normalized, polymorphic references ("Foreign Key" columns referencing different source tables, identified by a "ReferenceSource" marker), abuse of same tables for different kinds of business objects and duplication of data to numerous other tables (and from there again elsewhere), because performance wasnt good and this was supposed to improve queries. ORM usage was without object references too, reduced to single object load and save operations. Loading an aggregate (a graph of entities/table rows) would iterate through the graph and send a query for every set of sub-entities.
When performance got worse and, possibly, orphaned references caused serious trouble, attempts to implement classic relational design might have been made, but it was impossible to adapt the grown system to a complete data redesign (nobody would pay for it), hardly anybody would know how to map object relations or even optimized loading in the ORM. Such attempts were limited to a few places in the design, possibly making the data model and access even harder to maintain.
CQRS on top of n-tier?
To get acceptable performance, separate SQL queries were possibly built for certain modules, bypassing the business model with it's single object iterative access. This whole structure was suddenly called a de-facto CQRS, because of separate query access (which -could- have been handled by a well-implemented, relational data model and ORM usage, as long as it wasnt supposed to be a "Big data" Google- or Stackoverflow-like workload), and the plenty of duplicated data in relational tables, made up for immediate application access.
Something better than the inappropriate table format?
OK, so I read into CQRS, and while I didn't like the use of "CQRS" as described before, the concept of an event storage instead of a relational DB looked very useful: It is unlikely to successfully enforce the introduction of the original, state-of-art, relational DB design and OR mapping, and even if, it would be extremly costly. In fact, ordinary, object-oriented programming is much more "normalized" than most DB tables, due to the need to press all into the table format or create tons of tables for object graphs/aggregates. And I agree: having to take care of search indices and defragmentation, schema management and data history tracking manually, is like stone age IT, like running Ford T models and steam locomotives besides modern day cars and electric high speed trains.
Any good Experiences?
How are the experiences about using event sourcing (not necessarily full CQRS)? Does it eliminate much of the pain with relational databases? I really look forward for a kind of in-memory database with all business logic integrated, and possibly fast enough to make separate query modules dispensable!
There's a lot going on in this question and so a specific, actionable answer is not possible, but if you're looking for one then it is...
It depends on your domain.
CQRS/ES/DDD is not appropriate for solving every single problem - it is not a silver bullet. If the domain suggests that CRUD/NTier will be good enough, then that's what you should use. All of the concerns you list in your question are infrastructural or system traits and say nothing about the very thing that should inform your choice of tool or practice; what are you trying to build?
Although CQRS, ES, and DDD are very often used together they are separate concepts that are very powerful on their own.
CQRS (Command Query Responsibility Segregation): This is a very useful pattern to design software in general. The idea is to keep things that change state (commands) from things that do not (queries). In many systems queries modify the state of the database, this makes it very difficult for developers to reason about what is going on.
Imagine doing a query to find out some information and realizing that the information changed because you queried it.
CQRS prohibits those kind of behaviors. Commands (which cannot return information) change state and Queries (which return information) cannot modify state. That way, you have certainty in which parts of the code are idempotent (and therefore can be called as much as you want with no side effect) and which parts change state.
DDD (Domain Driven Design): This is a Design methodology for the "Data Structure" of the code. It does not prescribe techniques for database access or many technical details. What it does is provide guidelines and concepts to structure data in an application in a way that makes it much more responsive to the actual user's needs. It also simplifies development (although it is more work than just slapping something together).
ES (Event Sourcing): Event sourcing is a data storage strategy which shifts data storage from state (the actual values of a piece of data at the current point in time) into transitions (the changes that have happened to a piece of data during its lifetime) which are called events.
There are several advantages of using ES.
First, it allows the business to store much more information regarding what happened before (a boon to Data Scientists). In traditional systems, a lot of information is lost to updates of the data, and unless those updates are explicitly logged, the information is gone forever. This does not happen in ES.
Second, storing all events makes debugging much more simple because now a developer can follow the processing of the data since its beginning. An update to a piece of data that happened a long time ago (and would have been rewritten by another update and lost) but corrupted processing can be identified and fixed. Furthermore, the effects of the fix can even span all calculations that happened between the wrong event and the last event. In a traditional system, this would be impossible as we are only storing the latest state only.
While it is theoretically possible to write an Event Sourced system without CQRS or DDD, it is remarkably more difficult to do so.

Design database based on EAV or XML for objects with variable features in SQL Server?

I want to make a database that can store any king of objects and for each classes of objects different features.
Giving some of the questions i asked on different forums the solution is http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Xml with some kind of validation before storage.
Can you please give me an alternative to the ones above or some advantages or examples that would help decide which of the two methods is the best one in my case?
Thanks
UPDATE 1 :
Is your db read or write intensive?
will be both -> auction engine
Will you ever conceivably move off SQL Server and onto another platform?
I won't move it, I will use a WCF Service to expose functionality to mobile devices.
How do you plan to surface your data to the application?
Entity Framework for DAL and WCF Service Layer for Bussiness
Will people connect to your data through means other than those you control?
No
While #marc_s is correct in his cautions, there unarguably are situations where the relational model is just not flexible enough. For quite a number of years now, I've been working with a database that is straightforwardly relational for the largest part, but has a small EAV part. This is because users can invent new properties any time for observation purposes in trials.
Admittedly, it is awkward wrt querying and reporting, to name a few, but no other strategy would suffice here. We use stored procedures with T-Sql's pivot to offer flattened data structures for reporting and grids with dynamic columns for display. Once the infrastructure stands it's pretty comfortable altogether.
We never considered using XML data because it wasn't there yet and, apart from its common limitations, it has some drawbacks in our context:
The EAV data is queried heavily. A development team needs more than standard sql knowledge because of the special syntax. Indexing is possible but "there is a cost associated with maintaining the index during data modification" (as per MSDN).
The XML datatype is far less accessible than regular tables and fields when it comes to data processing and reporting.
Hardly ever do users fetch all attribute values of an entity, but the whole XML would have to be crunched anyway.
And, not unimportant: XML datatype is not (yet) supported by Entity Framework.
So, to conclude, I would go for a design that is relational as much as possible but EAV where necessary. Auction items could have a number of fixed fields and EAV's for the flexible data.
I will use my answer from another question:
EAV:
Storage. If your value will be used often for different products, e.g. clothes where attribute "size" and values of sizes will be repeated often, your attribute/values tables will be smaller. Meanwhile, if values will be rather unique that repeatable (e.g. values for attribute "page count" for books), you will get a big enough table with values, where every value will be linked to one product.
Speed. This scheme is not weakest part of project, because here data will be changed rarely. And remember that you always can denormalize database scheme to prepare DW-like solution. You can use caching if database part will be slow too.
Elasticity This is the strongest part of solution. You can easily add/remove attributes and values and ever to move values from one attribute to another!
XML storage is more like NoSQL: you will abdicate database functionality and you wisely prepare your solution to:
Do not lose data integrity.
Do not rewrite all database functionality in application (it is senseless)
I think there is way too much context missing for anyone to add any kind of valid comment to the discussion.
Is your db read or write intensive?
Will you ever conceivably move off SQL Server and onto another platform?
How do you plan to surface your data to the application?
Will people connect to your data through means other than those you control?
First do not go either route unless the structure truly cannot be known in advance. Using EAV or XML because you don't want to actually define the requirements will result in an unmaintainable mess and a badly performing mess at that. Usually at least 90+% (a conservative estimate based on my own experience) of the fields can be known in advance and should be in ordinary relational tables. Only use special techiniques for structures that can't be known in advance. I can't stress this strongly enough. EAV tables look simple but are actually very hard to query especially for complex reporting queries. Sure it is easy to get data into them, but very very difficult to get the data back out.
If you truly need to go the EAV route, consider using a nosql database for that part of the application and a relational database for the rest. Nosql databases simply handle EAV better.

is EAV - Hybrid a bad database design choice

We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as good design approach for the new database. However new design should allow:
n no. of attributes/properties for any entity i.e. no of attributes for any entity are not fixed and may change on regular basis.
allow content admins to add new properties to existing entities on the fly using through admin interfaces rather than making changes in db schema all the time.
There are quite a few discussions about performance issues of EAV but if we don't go with a hybrid-EAV we end up:
having lot of empty columns (we still go and add new columns even if 99% of the data does not have those properties)
spend more time maintaining database esp. when attributes keep changing.
no way of allowing content admins to add new properties to existing entities
Anyway here's what we are thinking about the new design (basic ERD included):
Have separate tables for each entity containing some basic info that is exclusive e.g. id,name,address,contact,created,etc etc.
Have 2 tables attribute type and attribute to store properties information.
Link each entity to an attribute using a many-to-many relation.
Store addresses in different table and link to entities using foreign key.
We think this will allow us to be more flexible when adding,removing or updating on properties.
This design, however, will result in increased number of joins when fetching data e.g.to display all "attributes" for a given stadium we might have a query with 20+ joins to fetch all related attributes in a single row.
What are your thoughts on this design, and what would be your advice to improve it.
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. Postgresql has transactional DDL, that is one killer feature for automating schema changes.
Postgresql doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client side, like json or protobuf, if you don't need to touch anything specific on the database side.
Don't go out of your way to put everything into one single entity table. If they don't share any attributes in a sensible way, use multiple instantitions of the specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instatiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of ability to understand the system, versioning tools, QA procedures, staging as your code, but it will have none of the tools. In essence you will be doing programming in a very awkward non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work and has limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seducing it is. But it's something of an easy way out. An easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in magento, a cms for ecommerce. There is a lot of bad talk about EAV those days, but I challenge anyone to come up with another solution than EAV for dealing with infinite product attributes.
Sure you can go about enumerating all the columns you would need for every product in the world, but that would take you a lot of time and you would inevitably forget product attributes in the way.
So the bottom line is : use EAV for infinite stuff but don't rely on EAV for all the database's tables. Hence an hybrid EAV and relational db, when done right, is a powerful tool that could not be acomplished by only using fixed columns.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.

Database alternatives?

I was wondering the trade-offs for using databases and what the other options were? Also, what problems are not well suited for databases?
I'm concerned with Relational Databases.
The concept of database is very broad. I will make some simplifications in what I present here.
For some tasks, the most common database is the relational database. It's a database based on the relational model. The relational model assumes that you describe your data in rows, belonging to tables where each table has a given and fixed number of columns. You submit data on a "per row" basis, meaning that you have to provide a row in a single shot containing the data relative to all columns of your table. Every submitted row normally gets an identifier which is unique at the table level, sometimes at the database level. You can create relationships between entities in the relational database, for example by saying that a given cell in your table must refer to another table's row, so to preserve the so called "referential integrity".
This model works fine, but it's not the only one out there. In some cases, data are better organized as a tree. The filesystem is a hierarchical database. starts at a root, and everything goes under this root, in a tree like structure. Another model is the key/value pair. Sleepycat BDB is basically a store of key/value entities.
LDAP is another database which has two advantages: stores rather generic data, it's distributed by design, and it's optimized for reading.
Graph databases and triplestores allow you to store a graph and perform isomorphism search. This is typically needed if you have a very generic dataset that can encompass a broad level of description of your entities, so broad that is basically unknown. This is in clear opposition to the relational model, where you create your tables with a very precise set of columns, and you know what each column is going to contain.
Some relational column-based databases exist as well. Instead of submitting data by row, you submit them by whole column.
So, to answer your question: a database is a method to store data. Technically, even a text file is a database, although not a particularly nice one. The choice of the model behind your database is mostly relative to what is the typical needs of your application.
Setting the answer as CW as I am probably saying something strictly not correct. Feel free to edit.
This is a rather broad question, but databases are well suited for managing relational data. Alternatives would almost always imply to design your own data storage and retrieval engine, which for most standard/small applications is not worth the effort.
A typical scenario that is not well suited for a database is the storage of large amounts of data which are organized as a relatively small amount of logical files, in this case a simple filesystem-like system can be enough.
Don't forget to take a look at NOSQL databases. It's pretty new technology and well suited for stuff that doesn't fit/scale in a relational database.
Use a database if you have data to store and query.
Technically, most things are suited for databases. Computers are made to process data and databases are made to store them.
The only thing to consider is cost. Cost of deployment, cost of maintenance, time investment, but it will usually be worth it.
If you only need to store very simple data, flat files would be an alternative (text files).
Note: you used the generic term 'database', but there are many many different types and implementations of these.
For search applications, full-text search engines (some of which are integrated to traditional DBMSes, but some of which are not), can be a good alternative, allowing both more features (various linguistic awareness, ability to have semi-structured data, ranking...) as well as better performance.
Also, I've seen applications where configuration data is stored in the database, and while this makes sense in some cases, using plain text files (or YAML, XML and such) and loading the underlying objects during initialization, may be preferable, due to the self-contained nature of such alternative, and to the ease of modifying and replicating such files.
A flat log file, can be a good alternative to logging to DBMS, depending on usage of course.
This said, in the last 10 years or so, the DBMS Systems, in general, have added many features, to help them handle different forms of data and different search capabilities (ex: FullText search a fore mentioned, XML, Smart storage/handling of BLOBs, powerful user-defined functions, etc.) which render them more versatile, and hence a fairly ubiquitous service. Their strength remain mainly with relational data however.

Defining the database schema in the application or in the database?

I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
Bear in mind that applications and databases tend to live in a M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to handle things modeling your Domain data. The database is also the biggest source of slow down in applications and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because really the database is simply a place to persist information from the application, it should never lead the design. It's not the only place to persist information as well. We have users that want to work from flat files or the database for instance. They could also use XML files. So by starting with your domain objects and then generating tables (or flat file or XML schema or whatever) from there will lead to a much better design in the end.
While this may depend on you using an object-oriented language, using an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you up to, and including generating the database creation scripts.
In reference to performance, performance should be one of the last things you look at in an application, it should never drive the design. After you get a good schema up and running based on your domain you can always make tweaks to improve its performance.
Alot depends on your skill level with the specific database product that you're going to use. Think of it as the difference between a "manual" and "automatic" transmission car. ORMs provide you with that "automatic" transmission, just start designing your classes, and let the ORM worry about getting it stored into the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI "persistence ignorant", they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction, up towards the application space, or down towards the physical database. But remember, only YOU know if it's more advantageous to use a view instead of a table, should you normalize or de-normalize a table, what non-clustered index(es) make sense with this table, is a natural or surrogate key more appropriate for this table, etc... Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing, you really need to seperate the application design from the database design. They are almost never the same. How important is that data? Could another application be designed to use that data? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way. Since it's a perfect example of the DRY principle.
Having said that however, getting away with it is always going to be hard to pull off since you're practically choosing to give up most database specific optimizations. (more so, with querying, but it still applies to schemas (indexes, etc)).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares :)
My 2 Cents
Design each based on their own requirements as much as possible. Trying to keep them in too rigid sync is a good illustration of increased coupling/decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).

Resources