I love the flexible schema capabilities of CouchDB and MongoDB, but I also love the relational 'join' capability of SQL Server. What I really want is the ability to have tables such as PERSON, COMPANY and ORDER that are basically 'open-schema' where each table has an ID but the rest of the columns are defined json-style {ID:12,firstname:"Pete",surname:"smith",height:"180"}, but where I can efficiently join PERSON to COMPANY either directly or via a many-to-many xref table. Does anyone know if SQL Server has any plans to incorporate 'open schema' in SQL, or whether Mongo or Couch have plans to support efficient joining? Thanks very much.
CouchDB offers a number of ways to establish relationships between your various documents/entities. Check out this article on the wiki to get started.
The tendency, when coming from a relational background, is to continue using the same terminology and mindset whenever you try to solve problems. It's very important to understand that NoSQL solutions are very different, otherwise they have no real purpose for existing. You should really seek to understand how these various NoSQL solutions work so you can compare them with your application's requirements to see if it's an appropriate fit.
MongoDB = NoSQL = No Joins - never ever.
If you need JOINs due to your data model or project requirements: stay with a RDBMS.
Alternatives in MongoDB:
denormalization
using embedded documents
multiple queries
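To make those alternatives concrete, here is a minimal sketch using plain Python dictionaries as stand-ins for collections. The collection names, field names and the `person_with_company` helper are all made up for illustration; a real application would go through a database driver instead.

```python
# Hypothetical in-memory "collections"; a real application would use a driver.
companies = {
    1: {"_id": 1, "name": "Acme"},
}

# Alternative 1: denormalization with an embedded document. The company data
# is copied into the person document, so a single lookup returns everything.
person_embedded = {
    "_id": 12,
    "firstname": "Pete",
    "surname": "smith",
    "company": {"_id": 1, "name": "Acme"},
}

# Alternative 2: store only a reference and resolve it with a second query.
people = {
    12: {"_id": 12, "firstname": "Pete", "surname": "smith", "company_id": 1},
}


def person_with_company(person_id):
    """Application-side 'join': two lookups instead of one JOIN."""
    person = dict(people[person_id])            # first query
    company = companies[person["company_id"]]   # second query
    person["company"] = company
    return person


joined = person_with_company(12)
print(joined["firstname"], "works at", joined["company"]["name"])
```

The embedded form reads faster but has to be updated everywhere the company name is copied; the reference form keeps one copy at the cost of an extra round trip.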
Although it would be inefficient to query at scale, the XML data type would technically let you store whatever structure you want, and that structure can vary by row.
Not that I'm aware of, but it's not that hard to roll your own EAV; it's only 3 tables after all :)
Entity stores the associated table name.
Attribute stores the column name, data type and whether it's nullable.
Value contains one nullable column for each required data type.
Entity 1..* Attribute 1..* Values
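As a rough sketch of those three tables, the following uses SQLite through Python's `sqlite3` module rather than SQL Server, purely so it runs anywhere. The column names, the `attribute_value` table name and the `row_id` column (which row of the virtual table a value belongs to) are my own assumptions, not part of the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id  INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL UNIQUE          -- the associated table name
);
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    entity_id    INTEGER NOT NULL REFERENCES entity(entity_id),
    column_name  TEXT NOT NULL,
    data_type    TEXT NOT NULL,
    is_nullable  INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE attribute_value (
    value_id     INTEGER PRIMARY KEY,
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    row_id       INTEGER NOT NULL,           -- assumed: id of the virtual row
    text_value   TEXT,                       -- one nullable column per data type
    int_value    INTEGER,
    real_value   REAL
);
""")

conn.execute("INSERT INTO entity (entity_id, table_name) VALUES (1, 'PERSON')")
conn.executemany(
    "INSERT INTO attribute (attribute_id, entity_id, column_name, data_type, is_nullable) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "firstname", "text", 0), (2, 1, "height", "int", 1)],
)
conn.executemany(
    "INSERT INTO attribute_value (attribute_id, row_id, text_value, int_value) "
    "VALUES (?, ?, ?, ?)",
    [(1, 12, "Pete", None), (2, 12, None, 180)],
)

# Reassemble virtual row 12 of PERSON from its attribute/value pairs.
rows = conn.execute("""
    SELECT a.column_name, COALESCE(v.text_value, v.int_value, v.real_value)
    FROM attribute_value AS v
    JOIN attribute AS a ON a.attribute_id = v.attribute_id
    JOIN entity AS e ON e.entity_id = a.entity_id
    WHERE e.table_name = 'PERSON' AND v.row_id = 12
    ORDER BY a.column_name
""").fetchall()
person = dict(rows)
print(person)
```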
Assuming you're using .NET, define your EAV interfaces, create some POCO's and let Entity Framework or your ORM of choice wire up the associations for you. LINQ works great for this sort of operation.
It also allows you to create a hybrid model, where parts of the schema are known but you still want flexibility for custom data. If you design your domain model with this in mind (i.e. use the EAV interfaces in your model) the EAV can be baked in to the EF data context (or whatever) to automate the loading of attributes and their values for each entity. Your EF entity just needs to know which table entity it belongs to.
Of course it's not the perfect solution, as you're (potentially) trading performance for flexibility. Depending on the amount of data you want to persist and the performance requirements, it may be more suited to models where most of the schema is known and a smaller percentage is unknown. YMMV.
I am building a software platform for mobile electronic data collection. It should support any type of data. For example, the government might use it for a population survey; a manufacturing company might use it to evaluate plant condition at their factories; a research organization might use it for clinical trials, etc.
As such, the software is powered by a database, with standard relational design for the metadata and entity attribute value for the actual data. Client software then reads the metadata and renders the appropriate user interface, complete with rules, validations, skip logic and so on. I believe the choice of EAV is a good one owing to the diversity of data that might be collected, but ...
Once the data is submitted from the mobile clients to the customer's server, the EAV model is no longer useful because the customer expects just his set of (usually very few) tables, for visualization and processing.
I have considered two options for pivoting the data.
1) Pivot the data immediately it is submitted to the server (via a JSON web service) and save it straightaway into a relational model.
2) Save the data in a similar schema on the server but have a background process that pivots it periodically and saves it in a relational model.
The first alternative seems more efficient as pivoting one record at a time is obviously quicker and less CPU intensive. The disadvantage is that if the metadata is changed, this process needs to adapt immediately by changing the relational model for the data accordingly. Depending on the extent of the changes, this can take some time. Worse, if it fails for any reason, upload requests might start being declined. If using the second approach, such failure would not "break" anything as urgent.
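For what it's worth, the first option (pivot each record as it arrives) can be sketched roughly as below. This is only an illustration under assumptions: the `metadata` dictionary, the `survey` form and the `pivot_submission` function are hypothetical names, and SQLite stands in for whatever database the customer's server runs.

```python
import sqlite3

# Hypothetical metadata: the attributes the 'survey' form currently defines.
metadata = {"survey": ["respondent", "age", "district"]}

conn = sqlite3.connect(":memory:")
cols = ", ".join(f'"{c}" TEXT' for c in metadata["survey"])
conn.execute(f"CREATE TABLE survey (record_id INTEGER PRIMARY KEY, {cols})")


def pivot_submission(form, record_id, pairs):
    """Turn the (attribute, value) pairs of one record into one relational row."""
    known = metadata[form]
    row = {name: None for name in known}
    for attribute, value in pairs:
        if attribute not in row:
            # The relational table lags behind the metadata: this is the
            # failure mode where uploads would start being declined.
            raise ValueError(f"unknown attribute {attribute!r}")
        row[attribute] = value
    column_list = ", ".join(f'"{c}"' for c in known)
    placeholders = ", ".join("?" for _ in known)
    conn.execute(
        f'INSERT INTO "{form}" (record_id, {column_list}) VALUES (?, {placeholders})',
        [record_id, *(row[c] for c in known)],
    )
    return row


pivot_submission("survey", 1, [("respondent", "A. Person"), ("age", "34")])
stored = conn.execute("SELECT respondent, age, district FROM survey").fetchone()
print(stored)
```

With the second option the same function could run from a background job over stored EAV rows, so an unknown attribute would delay the pivot rather than reject the upload.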
Are there other potential pitfalls I might be missing or design considerations I should make? What are some good reasons to do it one way or the other? Are there other alternatives I should be exploring to solve this problem?
Just define a straightforward relational schema of tables for their data using DDL. EAV is just an encoding of a proper schema & its metadata. Which, of course, the DBMS can't understand so you lose practically all the benefits of a DBMS. The only possible reason to use EAV is when tables are not known at compile time and DDL isn't fast enough or able to hold enough tables.
The EAV requests are just textual rearrangements of the DDL requests. (EAV configuration is typically a table for multiple entity-attribute-value requests given a table and key column(s) of the entities having virtual tables.) Moreover one only has to write a single interface easily implemented to map EAV configuration-then-updates to whichever of the two implementations one chooses. (It is better to use a pure relational interface and hide the chosen implementation but the nature of interfaces to SQL DBMSes, namely SQL, makes that difficult. Ie it would be easy if one is using a relational API rather than SQL.)
The EAV configuration without such an interface is only simpler if you don't declare the appropriate constraints or transactions on the virtual per-entity tables. Also every EAV version update or query must reconstruct the virtual tables then embed those expressions in the DDL version's update or query. (Only in the case of simply inserting or deleting or retrieving a single triple is the EAV DML as simple.)
Only if you showed that creating & deleting new tables was infeasible and the corresponding horrible integrity-&-concurrency-challenged mega-joining table-and-metadata-encoded-in-table EAV information-equivalent design was feasible should you even think of using EAV.
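To illustrate the DDL route, here is a small sketch in which a metadata change is translated into `ALTER TABLE` instead of new EAV rows. The table, the `add_attribute` helper and the allowed type list are invented for the example, and SQLite is used only because it is easy to run; other DBMSes have their own DDL rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plant_inspection (id INTEGER PRIMARY KEY, site TEXT NOT NULL)"
)

ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL"}


def add_attribute(table, column, sql_type):
    """Hypothetical helper: a new attribute becomes a real, typed column."""
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("invalid identifier")
    if sql_type not in ALLOWED_TYPES:
        raise ValueError("unsupported type")
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {sql_type}')


add_attribute("plant_inspection", "boiler_pressure", "REAL")
conn.execute(
    "INSERT INTO plant_inspection (site, boiler_pressure) VALUES ('North', 4.2)"
)

# The DBMS itself now knows about the column, so ordinary queries work.
columns = [row[1] for row in conn.execute("PRAGMA table_info(plant_inspection)")]
high = conn.execute(
    "SELECT site FROM plant_inspection WHERE boiler_pressure > 4"
).fetchall()
print(columns, high)
```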
I want to make a database that can store any kind of object, with different features for each class of object.
Judging by the answers to questions I asked on different forums, the solution is either http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Xml, with some kind of validation before storage.
Can you please give me an alternative to the ones above or some advantages or examples that would help decide which of the two methods is the best one in my case?
Thanks
UPDATE 1 :
Is your db read or write intensive?
will be both -> auction engine
Will you ever conceivably move off SQL Server and onto another platform?
I won't move it, I will use a WCF Service to expose functionality to mobile devices.
How do you plan to surface your data to the application?
Entity Framework for DAL and WCF Service Layer for Business
Will people connect to your data through means other than those you control?
No
While @marc_s is correct in his cautions, there unarguably are situations where the relational model is just not flexible enough. For quite a number of years now, I've been working with a database that is straightforwardly relational for the largest part, but has a small EAV part. This is because users can invent new properties any time for observation purposes in trials.
Admittedly, it is awkward with respect to querying and reporting, to name a few, but no other strategy would suffice here. We use stored procedures with T-SQL's PIVOT to offer flattened data structures for reporting and grids with dynamic columns for display. Once the infrastructure stands it's pretty comfortable altogether.
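The flattening step can be sketched without T-SQL. The example below is not the stored procedure described above; it swaps in conditional aggregation (`MAX(CASE ...)`) on SQLite, which gives the same kind of pivoted result, and the `observation` table and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (trial_id INTEGER, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?)",
    [(1, "height", "180"), (1, "colour", "green"), (2, "height", "175")],
)

# Users can invent properties at any time, so the column list is discovered
# at run time rather than hard-coded.
properties = [
    r[0]
    for r in conn.execute("SELECT DISTINCT property FROM observation ORDER BY property")
]

# Property names are passed as bound parameters; only positional aliases
# (c0, c1, ...) are generated into the SQL text.
select_cols = ", ".join(
    f"MAX(CASE WHEN property = ? THEN value END) AS c{i}"
    for i, _ in enumerate(properties)
)
sql = (
    f"SELECT trial_id, {select_cols} FROM observation "
    "GROUP BY trial_id ORDER BY trial_id"
)
flat = conn.execute(sql, properties).fetchall()

print(["trial_id", *properties])
for row in flat:
    print(row)
```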
We never considered using XML data because it wasn't there yet and, apart from its common limitations, it has some drawbacks in our context:
The EAV data is queried heavily. A development team needs more than standard sql knowledge because of the special syntax. Indexing is possible but "there is a cost associated with maintaining the index during data modification" (as per MSDN).
The XML datatype is far less accessible than regular tables and fields when it comes to data processing and reporting.
Hardly ever do users fetch all attribute values of an entity, but the whole XML would have to be crunched anyway.
And, not unimportant: XML datatype is not (yet) supported by Entity Framework.
So, to conclude, I would go for a design that is relational as much as possible but EAV where necessary. Auction items could have a number of fixed fields and EAV's for the flexible data.
I will use my answer from another question:
EAV:
Storage. If your values will be used often for different products, e.g. clothes where the attribute "size" and its values will be repeated often, your attribute/value tables will be smaller. Meanwhile, if values will be unique rather than repeatable (e.g. values for the attribute "page count" for books), you will get a fairly big table of values, where every value is linked to one product.
Speed. This schema is not the weakest part of the project, because here data will be changed rarely. And remember that you can always denormalize the database schema to prepare a DW-like solution. You can also use caching if the database part turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
XML storage is more like NoSQL: you give up database functionality, so you must design your solution carefully so that you:
Do not lose data integrity.
Do not rewrite all database functionality in application (it is senseless)
I think there is way too much context missing for anyone to add any kind of valid comment to the discussion.
Is your db read or write intensive?
Will you ever conceivably move off SQL Server and onto another platform?
How do you plan to surface your data to the application?
Will people connect to your data through means other than those you control?
First, do not go either route unless the structure truly cannot be known in advance. Using EAV or XML because you don't want to actually define the requirements will result in an unmaintainable mess, and a badly performing mess at that. Usually at least 90+% (a conservative estimate based on my own experience) of the fields can be known in advance and should be in ordinary relational tables. Only use special techniques for structures that can't be known in advance. I can't stress this strongly enough. EAV tables look simple but are actually very hard to query, especially for complex reporting queries. Sure, it is easy to get data into them, but very, very difficult to get the data back out.
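To show what "hard to get the data back out" looks like, here is a small comparison on SQLite. The `person` and `person_eav` tables are invented for the example. The same two-condition filter is a plain `WHERE` on the relational table, but needs one self-join per attribute plus a cast on the EAV table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT, height INTEGER);
CREATE TABLE person_eav (entity_id INTEGER, attribute TEXT, value TEXT);
INSERT INTO person VALUES (1, 'smith', 180), (2, 'jones', 165);
INSERT INTO person_eav VALUES
    (1, 'surname', 'smith'), (1, 'height', '180'),
    (2, 'surname', 'jones'), (2, 'height', '165');
""")

# Ordinary relational table: both conditions in one WHERE clause.
relational = conn.execute(
    "SELECT id FROM person WHERE surname = 'smith' AND height > 170"
).fetchall()

# EAV table: every attribute in the filter needs its own join, and values
# stored as text have to be cast before a numeric comparison.
eav = conn.execute("""
    SELECT s.entity_id
    FROM person_eav AS s
    JOIN person_eav AS h ON h.entity_id = s.entity_id
    WHERE s.attribute = 'surname' AND s.value = 'smith'
      AND h.attribute = 'height'  AND CAST(h.value AS INTEGER) > 170
""").fetchall()

print(relational, eav)
```

A reporting query over ten attributes would need ten such joins, which is where the complexity comes from.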
If you truly need to go the EAV route, consider using a nosql database for that part of the application and a relational database for the rest. Nosql databases simply handle EAV better.
My application has a complex schema for the domain entity. It is required to use SQL Server 2008. The complexities are as follows:
Domain Entity is Hierarchical: The data structure is a tree; it is nested to many levels. Few nodes in the tree are repeatable (multi-valued). For example, the entity can have unlimited addresses (home, billing, shipping, office, etc.)
Domain Entity is Expandable: The schema may expand (not shrink) in future.
Designing such a schema directly as related SQL Server tables is quite challenging. If designing it is not, querying it surely will be.
I am thinking of using XML type to store the domain entity records. However I have following queries:
Due to peculiar reporting needs, each field should be query-able (within and across entity records). This applies to even the fields that are added in future to the schema.
While using XML type, since I lose the structure, what is the best Data Access Layer I can design?
Can I use Entity Framework effectively in this situation?
Any best practices recommended?
One piece of advice: DO NOT DO IT. Seriously. You are already on a slippery slope; better to learn to use databases.
The "Domain Entity" you define here will be large, which means that querying it will be a challenge. Unlimited addresses means you have to be prepared for 100,000 or more. Anyone stupid enough to ask for the XML document will get a bad surprise, as will the server.
You also lose a lot of tooling left and right, from ORMs to reporting tools, simply because you abuse the XML support the database has (which is intended for storing documents, not for acting as a pseudo-database).
Your queries:
Due to peculiar reporting needs, each field should be query-able (within and across entity records). This applies to even the fields that are added in future to the schema.
In the English language, this is not a query, you know. It is also not possible.
While using XML type, since I lose the structure, what is the best Data Access Layer I can design?
Start writing SQL. By hand. Or develop your own. You are way outside what people use XML for, so there is no predefined tooling support.
Can I use Entity Framework effectively in this situation?
Obviously no.
Any best practices recommended?
Yes, learn to use SQL Server properly. This is NOT a good approach.
I'm working on an abstraction layer for this:
http://rogeralsing.com/2011/02/28/linq-to-sqlxml/
Code is available on https://github.com/rogeralsing/linq-to-sqlxml
You can query and select/project entities from Sql server XML columns.
We are using it for evolving entity schemas while keeping old versions intact.
That being said, we only use it for special cases and go with O/R mapping as the default approach.
In all honesty, whilst I see @TomTom's point, it depends on whether it is just ONE XML document or not. With 2008, you can set up XML schemas and map them to an XML field.
Contrary to TomTom's answer, you can query an XML data field like you would normally. Check the following SO answer for more information: https://stackoverflow.com/questions/966441/xml-query-in-sql-server-2008
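As a language-neutral illustration of querying one field across several XML records, here is a sketch using Python's `xml.etree.ElementTree`. It is not SQL Server syntax (the xml data type has its own XQuery-based methods, which are not shown here), and the record layout and the `ids_with_city` helper are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity records, one XML document per row id.
records = {
    1: "<entity><name>Alpha</name>"
       "<address type='home'><city>Oslo</city></address>"
       "<address type='billing'><city>Bergen</city></address></entity>",
    2: "<entity><name>Beta</name>"
       "<address type='home'><city>Bergen</city></address></entity>",
}


def ids_with_city(city):
    """Return ids of records that have at least one address in the given city."""
    matches = []
    for record_id, xml_text in records.items():
        root = ET.fromstring(xml_text)  # the whole document is parsed each time
        cities = [node.text for node in root.findall("./address/city")]
        if city in cities:
            matches.append(record_id)
    return matches


print(ids_with_city("Bergen"))
print(ids_with_city("Oslo"))
```

Note that every record is parsed in full for each query, which is the cost an XML index on the server is meant to reduce.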
You can use the Entity Framework (my knowledge is a bit short on this) by writing some sprocs to query your data, then calling the sproc from code and casting the result to an XDocument. Not the prettiest way of doing it, but it should work. Note: there might be another way of doing this, but that's as far as my knowledge of EF goes; perhaps add a tag for EF to the question?
I guess you need to come back to us and state whether you need to query one XML document (in which case a relational DB would possibly be better, as suggested by @TomTom) or multiple documents (in which case I would use SQL Server to do the work; chances are you'll have some way of linking these documents together anyway).
XML indexing tips can be found here
And some more info on XML in SQL 2008 here
Hth,
Stu
Did you try SisoDb? If you have any questions about it I would happily answer them. Use the contact form at http://www.sisodb.com or ping me at Twitter.
We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as good design approach for the new database. However new design should allow:
n no. of attributes/properties for any entity i.e. no of attributes for any entity are not fixed and may change on regular basis.
allow content admins to add new properties to existing entities on the fly through admin interfaces, rather than making changes in the db schema all the time.
There are quite a few discussions about performance issues of EAV but if we don't go with a hybrid-EAV we end up:
having lot of empty columns (we still go and add new columns even if 99% of the data does not have those properties)
spend more time maintaining database esp. when attributes keep changing.
no way of allowing content admins to add new properties to existing entities
Anyway here's what we are thinking about the new design (basic ERD included):
Have separate tables for each entity containing some basic info that is exclusive e.g. id,name,address,contact,created,etc etc.
Have 2 tables attribute type and attribute to store properties information.
Link each entity to an attribute using a many-to-many relation.
Store addresses in different table and link to entities using foreign key.
We think this will allow us to be more flexible when adding,removing or updating on properties.
This design, however, will result in an increased number of joins when fetching data, e.g. to display all "attributes" for a given stadium we might have a query with 20+ joins to fetch all related attributes in a single row.
What are your thoughts on this design, and what would be your advice to improve it.
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. PostgreSQL has transactional DDL, which is a killer feature for automating schema changes.
PostgreSQL doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client side, like JSON or protobuf, if you don't need to touch anything specific on the database side.
Don't go out of your way to put everything into one single entity table. If the entities don't share any attributes in a sensible way, use multiple instantiations of the specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of ability to understand the system, versioning tools, QA procedures, staging as your code, but it will have none of the tools. In essence you will be doing programming in a very awkward non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work and has limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
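The "serialize the attributes into the row" suggestion from the list above can be sketched as follows. This uses the client-side JSON variant on SQLite so that it runs anywhere; it is not the `hstore` or array variant, and the `poi` table and the `save_poi`/`load_poi` helpers are hypothetical names.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE poi ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"               # explicit column: the code refers to it
    " attrs TEXT NOT NULL DEFAULT '{}'"  # generic attributes, serialized as JSON
    ")"
)


def save_poi(poi_id, name, attrs):
    """Write all generic attributes of one entity as a single batch."""
    conn.execute(
        "INSERT OR REPLACE INTO poi (id, name, attrs) VALUES (?, ?, ?)",
        (poi_id, name, json.dumps(attrs, sort_keys=True)),
    )


def load_poi(poi_id):
    """Read the row once; no per-attribute joins are needed."""
    name, attrs = conn.execute(
        "SELECT name, attrs FROM poi WHERE id = ?", (poi_id,)
    ).fetchone()
    return {"id": poi_id, "name": name, **json.loads(attrs)}


save_poi(1, "City Stadium", {"capacity": 42000, "roof": "retractable"})
stadium = load_poi(1)
print(stadium)
```

The trade-off is the one named above: the database cannot filter or index on individual attributes stored this way unless it understands the serialized format.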
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seducing it is. But it's something of an easy way out. An easy way out of doing proper analysis and design.
I think "entity" is a bit too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model? Would they also have addresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minimum. Whenever you find yourself wanting to show a certain subset of the attributes, test for the existence of a value, or determine behaviour based on a value, you should really have it modelled as a column.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with infinite product attributes.
Sure, you can go about enumerating all the columns you would need for every product in the world, but that would take you a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for infinite stuff, but don't rely on EAV for all of the database's tables. Hence a hybrid of EAV and relational db, when done right, is a powerful tool that could not be accomplished by only using fixed columns.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.
I was wondering what the trade-offs of using databases are, and what the other options are. Also, what problems are not well suited for databases?
I'm concerned with Relational Databases.
The concept of database is very broad. I will make some simplifications in what I present here.
For some tasks, the most common database is the relational database. It's a database based on the relational model. The relational model assumes that you describe your data in rows, belonging to tables where each table has a given and fixed number of columns. You submit data on a "per row" basis, meaning that you have to provide a row in a single shot containing the data relative to all columns of your table. Every submitted row normally gets an identifier which is unique at the table level, sometimes at the database level. You can create relationships between entities in the relational database, for example by saying that a given cell in your table must refer to another table's row, so to preserve the so called "referential integrity".
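A small sketch of referential integrity in practice, using SQLite through Python's `sqlite3` module. The `company` and `person` tables are invented for the example, and note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific opt-in
conn.executescript("""
CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person (
    id         INTEGER PRIMARY KEY,
    firstname  TEXT NOT NULL,
    company_id INTEGER NOT NULL REFERENCES company(id)
);
INSERT INTO company (id, name) VALUES (1, 'Acme');
INSERT INTO person (id, firstname, company_id) VALUES (12, 'Pete', 1);
""")

# A row pointing at a company that does not exist is refused by the database.
try:
    conn.execute(
        "INSERT INTO person (id, firstname, company_id) VALUES (13, 'Ann', 99)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Because the relationship is declared, a join can follow it reliably.
joined = conn.execute(
    "SELECT p.firstname, c.name FROM person AS p "
    "JOIN company AS c ON c.id = p.company_id"
).fetchall()
print(rejected, joined)
```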
This model works fine, but it's not the only one out there. In some cases, data are better organized as a tree. The filesystem is a hierarchical database: it starts at a root, and everything goes under this root, in a tree-like structure. Another model is the key/value pair. Sleepycat BDB is basically a store of key/value entities.
LDAP is another database, with several advantages: it stores rather generic data, it's distributed by design, and it's optimized for reading.
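To give a feel for the key/value model, here is a minimal sketch using Python's standard `dbm.dumb` module. It is not Sleepycat BDB, only a simple store with the same basic shape (opaque keys mapped to opaque values); the `person:12` key naming scheme is an assumption for the example.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "people")

    # Write: the store knows nothing about the structure of the values.
    with dbm.dumb.open(path, "c") as store:
        store[b"person:12"] = b"Pete smith"
        store[b"person:13"] = b"Ann jones"

    # Read: lookup is by exact key only; there are no columns and no joins.
    with dbm.dumb.open(path, "r") as store:
        value = store[b"person:12"]
        has_missing = b"person:99" in store

print(value, has_missing)
```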
Graph databases and triplestores allow you to store a graph and perform isomorphism search. This is typically needed if you have a very generic dataset that can encompass a broad level of description of your entities, so broad that is basically unknown. This is in clear opposition to the relational model, where you create your tables with a very precise set of columns, and you know what each column is going to contain.
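A toy version of the triple model, for illustration only. It does simple pattern matching over (subject, predicate, object) triples held in a Python set, not the full graph isomorphism search a real graph database or triplestore offers; the data and the `match` helper are made up.

```python
# Each fact is a (subject, predicate, object) triple; no schema is declared.
triples = {
    ("pete", "works_for", "acme"),
    ("ann", "works_for", "acme"),
    ("acme", "located_in", "oslo"),
    ("pete", "height", "180"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# One-hop query: who works for acme?
employees = [s for s, _, _ in match(predicate="works_for", obj="acme")]

# Two-hop query: who works for a company located in oslo?
in_oslo = sorted(
    person
    for company, _, _ in match(predicate="located_in", obj="oslo")
    for person, _, _ in match(predicate="works_for", obj=company)
)
print(employees, in_oslo)
```

New kinds of facts can be added without changing any table definition, which is the contrast with the relational model drawn above.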
Some relational column-based databases exist as well. Instead of submitting data by row, you submit them by whole column.
So, to answer your question: a database is a method to store data. Technically, even a text file is a database, although not a particularly nice one. The choice of the model behind your database mostly depends on the typical needs of your application.
Setting the answer as CW as I am probably saying something strictly not correct. Feel free to edit.
This is a rather broad question, but databases are well suited for managing relational data. Alternatives would almost always imply designing your own data storage and retrieval engine, which for most standard/small applications is not worth the effort.
A typical scenario that is not well suited for a database is the storage of large amounts of data organized as a relatively small number of logical files; in this case a simple filesystem-like system can be enough.
Don't forget to take a look at NOSQL databases. It's pretty new technology and well suited for stuff that doesn't fit/scale in a relational database.
Use a database if you have data to store and query.
Technically, most things are suited for databases. Computers are made to process data and databases are made to store them.
The only thing to consider is cost. Cost of deployment, cost of maintenance, time investment, but it will usually be worth it.
If you only need to store very simple data, flat files would be an alternative (text files).
Note: you used the generic term 'database', but there are many many different types and implementations of these.
For search applications, full-text search engines (some of which are integrated into traditional DBMSes, but some of which are not) can be a good alternative, allowing both more features (various kinds of linguistic awareness, the ability to handle semi-structured data, ranking...) and better performance.
Also, I've seen applications where configuration data is stored in the database. While this makes sense in some cases, using plain text files (or YAML, XML and such) and loading the underlying objects during initialization may be preferable, because such files are self-contained and easy to modify and replicate.
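As a minimal sketch of the plain-text configuration approach, the following parses an INI-style text with Python's standard `configparser` at start-up and turns it into a typed object. The section and key names and the `DatabaseSettings` class are assumptions for the example; in a real application the text would come from a file next to the program.

```python
import configparser
from dataclasses import dataclass

# Stand-in for the contents of a configuration file shipped with the app.
CONFIG_TEXT = """
[database]
host = localhost
port = 5432
"""


@dataclass
class DatabaseSettings:
    host: str
    port: int


def load_settings(text):
    """Parse the configuration once, during initialization."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["database"]
    return DatabaseSettings(host=section["host"], port=section.getint("port"))


settings = load_settings(CONFIG_TEXT)
print(settings)
```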
A flat log file can be a good alternative to logging to a DBMS, depending on usage of course.
That said, in the last 10 years or so, DBMSes in general have added many features to help them handle different forms of data and different search capabilities (e.g. the aforementioned full-text search, XML, smart storage/handling of BLOBs, powerful user-defined functions, etc.), which render them more versatile, and hence a fairly ubiquitous service. Their strength remains mainly with relational data, however.