Horizontal Database and Vertical Database

I am working on a social networking site with a family tree that is GEDCOM-compliant. We need to decide whether to use a horizontal or a vertical database structure for user profiles, so I would like to know when to use a horizontal database structure and when to use a vertical one.
I found some answers for shopping sites, where the fields are not known in advance: there, a vertical database structure should be used. But I am not sure what applies to a family tree site. Should I use vertical or horizontal?

I assume you use a relational database like MySQL, MS SQL Server, SQLite, PostgreSQL or Oracle for storage?
GEDCOM is a standard for information exchange, so you know how many columns you will have. The standard may be extended with new properties in the future, but probably not by many, and you can easily extend a table with a few new columns.
I would use a 'horizontal' table and not an entity-attribute-value system (vertical table). Vertical table systems tend to be slow: they can't be properly indexed and they confuse the query optimizer.
It becomes a different story when your users can define new properties in their profiles themselves, like eye colo(u)r or favourite colo(u)r. How flexible do you want those profiles to be?
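As a rough sketch of the difference (table and column names here are hypothetical, loosely modelled on GEDCOM individual records):

```sql
-- Horizontal: one row per person, one indexable column per property.
CREATE TABLE person (
    person_id   INT PRIMARY KEY,
    given_name  VARCHAR(100),
    surname     VARCHAR(100),
    sex         CHAR(1),            -- GEDCOM SEX: 'M', 'F' or 'U'
    birth_date  DATE,
    birth_place VARCHAR(150),
    death_date  DATE,
    death_place VARCHAR(150)
);

-- Vertical (EAV): the same data as generic name/value rows, e.g.
-- (1, 'birth_place', 'Leiden'). Flexible, but every query must
-- self-join or pivot, which defeats indexing and the optimizer.
CREATE TABLE person_attribute (
    person_id  INT          NOT NULL,
    attr_name  VARCHAR(50)  NOT NULL,
    attr_value VARCHAR(255),
    PRIMARY KEY (person_id, attr_name)
);
```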

Vertical databases are great for data warehousing and read-only reporting. Normally you regenerate them overnight. Their write performance is usually very bad, but SELECTs can be 10-100 times faster.
A typical scenario for using a vertical database is OLAP reporting, where you create a (daily) snapshot of the data and then run queries against it. Most of the benefit comes from queries that request only a relatively small number of fields, e.g. when you select only a handful of fields from a wide, large table. Such a query against millions of records (e.g. calculating SUM/COUNT/AVG) will take only a second or two.
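As an illustration (the table and columns here are hypothetical), a typical column-store-friendly query touches just a few columns of a wide snapshot table:

```sql
-- A column store only has to scan the three referenced columns,
-- not the entire wide row, so this stays fast over millions of rows.
SELECT   region,
         COUNT(*)    AS orders,
         SUM(amount) AS revenue,
         AVG(amount) AS avg_order_value
FROM     sales_snapshot             -- regenerated nightly from the OLTP system
WHERE    order_date >= '2009-01-01'
GROUP BY region;
```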
Your case doesn't seem to be a good candidate for a vertical database.

I agree with tuinstoel: a vertical table/EAV system is not only slow but also sometimes very complex. It is sometimes necessary to write your own API methods for dealing with those tables, so that developers only work through those methods and the complexity stays hidden.
So if you don't need to add more fields, stay with a horizontal table. You may need additional tables if you are going to support multilingual content, but I would advise sticking with horizontal tables even then.
I am also developing a site involving user profiles, and I am using horizontal tables; if support for different languages is required in the future, I will change only the fields where language matters.
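A minimal sketch of that approach, assuming a hypothetical profile table where only the biography text is language-sensitive:

```sql
-- Horizontal profile table: fixed, indexable columns.
CREATE TABLE user_profile (
    user_id     INT PRIMARY KEY,
    birth_date  DATE,
    birth_place VARCHAR(100)
);

-- Only the language-sensitive field moves to a side table,
-- one row per (user, language) pair.
CREATE TABLE user_profile_translation (
    user_id   INT     NOT NULL,
    lang_code CHAR(2) NOT NULL,     -- e.g. 'en', 'de'
    biography TEXT,
    PRIMARY KEY (user_id, lang_code),
    FOREIGN KEY (user_id) REFERENCES user_profile (user_id)
);
```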

Related

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from an RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, a geographic location that pertains to the rest of the searchable record, and a title and description (which are free-text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server we have other tables for user configuration, keyed by user ID, I take it that I could then create new ES types in the user indices for configuration. Is this a recommended pattern? I would rather not have two database systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalization, that is fine, but I assume that since you already have a star schema, denormalized data is not an option, because you don't want to go and update millions of documents each time, for example, a location name changes (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . With nested objects, the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may make it suboptimal, because the document will be huge and, again, a change to the location name will require updating the entire document. So this is better, but still not optimal.
Parent-Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the user records would be kept in separate indices, similarly to a relational database. This seems to be the right modeling for what we need. The only major issue with this option is the fact that Kibana 4 does not provide ways to manipulate/aggregate documents based on a parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (this was mine), that kind of eliminates the option. If you want to benefit from Elasticsearch's speed as an engine, this seems to be the desired option for your use case.
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize that is by having a separate cluster of 3 Elasticsearch nodes behind a load balancer (all of it hosted in a cloud), and then having all the web applications connect to that cluster using the Elasticsearch API.
Hope that helps.

Which database is best to work with graphs and tree structured data?

I am planning to work with Dapper.NET for a family site.
A lot of tree-like data will be present in the structure.
Which database provides the best queries to work with cyclic/acyclic tree relations?
I want to compare the ease of use and the performance of hierarchical queries,
i.e. things like CTEs in SQL Server, CONNECT BY/START WITH in Oracle, etc.
Is Dapper the best choice as a micro-ORM for this kind of tree-structured data?
I need opinions on choosing the right database and the right micro-ORM for this.
Sorry for my bad English.
My question still stands: How much data do you expect?
But apart from that, it's not just the type of database you choose for your data; it's also the table structure. Hierarchy trees can be stored in various different ways depending on your needs.
Table structure
Particular structures may be very fast on traversal reads but slow on inserts/updates (e.g. nested sets); others (adjacency lists) are the other way around. For a 99:1 read:write ratio (the vast majority of today's applications read much more than they write) I would likely choose a modified nested set structure that has left, right, depth and parent. This gives you the best options for read scenarios.
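A minimal nested-set sketch (hypothetical table and column names): fetching a whole subtree needs no recursion, only a range condition on the left/right values:

```sql
CREATE TABLE person_tree (
    id        INT PRIMARY KEY,
    parent_id INT NULL,     -- adjacency pointer, kept for cheap parent lookups
    lft       INT NOT NULL, -- nested-set left boundary
    rgt       INT NOT NULL, -- nested-set right boundary
    depth     INT NOT NULL,
    name      VARCHAR(100)
);

-- The whole subtree under node 42 (including the node itself)
-- as a single indexable range scan.
SELECT d.*
FROM   person_tree d
JOIN   person_tree a ON d.lft BETWEEN a.lft AND a.rgt
WHERE  a.id = 42
ORDER BY d.lft;
```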
Database type
Unless you're aiming at huge amounts of data, I suggest you go with whichever SQL database you know best (MSSQL, MySQL, Oracle). But if your database will contain an enormous number of hierarchy nodes, then flirting with a specialised graph-oriented database may be a better option.
80 million nodes
If you opted for a modified nested set solution (also using negative values, so the number of updates on insert/update is halved), you'd have a hierarchy table with left, right, ID and ParentID columns, which would come to roughly a 1.2 GB table (80 million rows x four 4-byte integer columns ≈ 1.28 GB, before row overhead). And that's your top estimate, after at least two years of usage.
My suggestion
Go quick & go light - don't over-engineer by using the best possible database for your hierarchy if it turns out you don't need it after all. I'd therefore suggest you use a relational DB initially, so you can get to market quickly, even though the solution will start to struggle after some millions of records. But before your database starts to struggle (we're talking years here) you'll gain two things:
You'll see whether your product takes off in the first place (there are many genealogy services already) without having invested in learning a new technology, and because you'd be using proven and supported technology you'd get to market quickly.
If your product does succeed (and I genuinely hope it does), it will still give you enough time to learn a different storage solution and implement it; with proper code layers it shouldn't be too hard to switch storage later on when required.

Design database based on EAV or XML for objects with variable features in SQL Server?

I want to make a database that can store any kind of object, with different features for each class of objects.
Given some of the questions I asked on different forums, the solution is either the Entity-Attribute-Value model (http://en.wikipedia.org/wiki/Entity-attribute-value_model) or XML (http://en.wikipedia.org/wiki/Xml) with some kind of validation before storage.
Can you please give me an alternative to the two above, or some advantages or examples that would help me decide which of the two methods is the best one in my case?
Thanks
UPDATE 1:
Is your db read or write intensive?
It will be both - it's an auction engine.
Will you ever conceivably move off SQL Server and onto another platform?
I won't move it; I will use a WCF service to expose functionality to mobile devices.
How do you plan to surface your data to the application?
Entity Framework for the DAL and a WCF service layer for the business logic.
Will people connect to your data through means other than those you control?
No
While #marc_s is correct in his cautions, there unarguably are situations where the relational model is just not flexible enough. For quite a number of years now, I've been working with a database that is straightforwardly relational for the largest part, but has a small EAV part. This is because users can invent new properties any time for observation purposes in trials.
Admittedly, it is awkward with respect to querying and reporting, to name a few areas, but no other strategy would suffice here. We use stored procedures with T-SQL's PIVOT to offer flattened data structures for reporting, and grids with dynamic columns for display. Once the infrastructure stands, it's pretty comfortable altogether.
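A minimal sketch of that flattening step, assuming a hypothetical EAV table of trial observations and a fixed attribute list (real code would build the column list dynamically inside the stored procedure):

```sql
-- Flatten EAV rows (entity_id, attr_name, attr_value) into one
-- row per entity for reporting.
SELECT entity_id, [height], [weight], [color]
FROM   (SELECT entity_id, attr_name, attr_value
        FROM   observation_eav) AS src
PIVOT  (MAX(attr_value) FOR attr_name IN ([height], [weight], [color])) AS p;
```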
We never considered using XML data because it wasn't there yet and, apart from its common limitations, it has some drawbacks in our context:
The EAV data is queried heavily. A development team needs more than standard SQL knowledge because of the special syntax. Indexing is possible, but "there is a cost associated with maintaining the index during data modification" (as per MSDN).
The XML datatype is far less accessible than regular tables and fields when it comes to data processing and reporting.
Hardly ever do users fetch all attribute values of an entity, but the whole XML would have to be crunched anyway.
And, not unimportantly: the XML datatype is not (yet) supported by Entity Framework.
So, to conclude, I would go for a design that is relational as much as possible, but EAV where necessary. Auction items could have a number of fixed fields, plus EAV for the flexible data.
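A minimal sketch of such a hybrid (hypothetical names): the fixed, heavily queried fields stay relational, and a small EAV side table absorbs the per-category extras:

```sql
-- Fixed fields: indexable, type-safe, easy to query.
CREATE TABLE auction_item (
    item_id     INT           PRIMARY KEY,
    title       VARCHAR(200)  NOT NULL,
    start_price DECIMAL(12,2) NOT NULL,
    ends_at     DATETIME      NOT NULL
);

-- Free-form extras: one row per item/attribute pair.
CREATE TABLE auction_item_attribute (
    item_id    INT         NOT NULL,
    attr_name  VARCHAR(50) NOT NULL,   -- e.g. 'frame_size', 'author'
    attr_value VARCHAR(255),
    PRIMARY KEY (item_id, attr_name),
    FOREIGN KEY (item_id) REFERENCES auction_item (item_id)
);
```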
I will use my answer from another question:
EAV:
Storage. If a value will be used often for different products, e.g. clothes, where the attribute "size" and its values are repeated often, your attribute/value tables will be smaller. Meanwhile, if values are unique rather than repeated (e.g. values of the attribute "page count" for books), you will get a fairly big values table, where every value is linked to one product.
Speed. This scheme is not the weakest part of the project, because this data will change rarely. And remember that you can always denormalize the database schema to prepare a DW-like solution. You can also use caching if the database part turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
XML storage is more like NoSQL: you give up database functionality, so you must wisely prepare your solution to:
Not lose data integrity.
Not rewrite all the database functionality in the application (that would be senseless).
I think there is way too much context missing for anyone to add any kind of valid comment to the discussion.
Is your db read or write intensive?
Will you ever conceivably move off SQL Server and onto another platform?
How do you plan to surface your data to the application?
Will people connect to your data through means other than those you control?
First, do not go either route unless the structure truly cannot be known in advance. Using EAV or XML because you don't want to actually define the requirements will result in an unmaintainable mess, and a badly performing mess at that. Usually at least 90+% of the fields (a conservative estimate based on my own experience) can be known in advance and should be in ordinary relational tables. Only use special techniques for structures that can't be known in advance. I can't stress this strongly enough. EAV tables look simple but are actually very hard to query, especially for complex reporting queries. Sure, it is easy to get data into them, but very, very difficult to get the data back out.
If you truly need to go the EAV route, consider using a NoSQL database for that part of the application and a relational database for the rest. NoSQL databases simply handle EAV-style data better.

Is EAV-hybrid a bad database design choice?

We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as a good design approach for the new database. The new design should allow:
any number of attributes/properties for any entity, i.e. the number of attributes for an entity is not fixed and may change on a regular basis;
content admins to add new properties to existing entities on the fly through admin interfaces, rather than making changes to the db schema all the time.
There are quite a few discussions about the performance issues of EAV, but if we don't go with a hybrid EAV we end up:
having a lot of empty columns (we would still go and add new columns even if 99% of the data does not have those properties);
spending more time maintaining the database, especially when attributes keep changing;
with no way of allowing content admins to add new properties to existing entities.
Anyway, here's what we are thinking of for the new design (a basic ERD was included):
Have separate tables for each entity containing some basic info that is exclusive to it, e.g. id, name, address, contact, created, etc.
Have two tables, attribute type and attribute, to store property information.
Link each entity to its attributes using a many-to-many relation.
Store addresses in a different table and link them to entities using a foreign key.
We think this will allow us to be more flexible when adding, removing or updating properties.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all the "attributes" for a given stadium, we might have a query with 20+ joins to fetch all the related attributes in a single row (see the sketch below).
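For what it's worth, a sketch of that fetch (table names here are hypothetical approximations of the design above): if the application can accept one row per attribute instead of one wide row, the join count stays constant regardless of how many attributes exist; the 20+ joins only appear when pivoting everything into a single row:

```sql
-- One row per attribute: three joins, however many attributes there are.
SELECT e.name, at.name AS attribute, a.value
FROM   entity           e
JOIN   entity_attribute ea ON ea.entity_id = e.id          -- many-to-many link
JOIN   attribute        a  ON a.id  = ea.attribute_id
JOIN   attribute_type   at ON at.id = a.attribute_type_id
WHERE  e.id = 123;
```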
What are your thoughts on this design, and what would be your advice to improve it?
Thank you for reading.
I'm maintaining a 10-year-old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. PostgreSQL has transactional DDL; that is one killer feature for automating schema changes.
PostgreSQL doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead, in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client-side, like JSON or protobuf, if you don't need to touch anything specific on the database side (see the sketch after this list).
Don't go out of your way to put everything into one single entity table. If entities don't share any attributes in a sensible way, use multiple instantiations of the specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of ability to understand the system, versioning tools, QA procedures, staging as your code, but it will have none of the tools. In essence you will be doing programming in a very awkward non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work and has limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
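A minimal sketch of the array variant mentioned above (hypothetical names), assuming an entity's attributes are always read and written together as a batch:

```sql
-- PostgreSQL: all of an entity's EAV data in two parallel arrays,
-- one row per entity instead of one skinny row per attribute value.
CREATE TABLE entity_attrs (
    entity_id   INT    PRIMARY KEY,
    attr_ids    INT[]  NOT NULL,
    attr_values TEXT[] NOT NULL
);

-- Reading the whole batch back is a single-row fetch.
SELECT attr_ids, attr_values
FROM   entity_attrs
WHERE  entity_id = 123;
```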
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seducing it is. But it's something of an easy way out. An easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with an open-ended set of product attributes.
Sure, you could go about enumerating all the columns you would need for every product in the world, but that would take you a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for the open-ended stuff, but don't rely on EAV for all the database's tables. A hybrid of EAV and a relational db, when done right, is a powerful tool that could not be accomplished by using fixed columns alone.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.

Few database design questions relating to user content site

I am designing a user content website (kind of similar to Yelp but for a different market and with photo sharing) and had a few database questions:
1. Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is a social network, databases are usually partitioned off for scalability as the user base grows, with different sets of users served separately, so what is the best approach? I guess some data, like user accounts, can live in common tables, but will each user get their own tables for wall posts, photos, etc.? If so, then with 10 million users that means 10 million times however many tables per user? This is currently being designed in MySQL.
2. How do the user tables know what to create each time a user joins the site? I am assuming there is a system table template from which the fields are pulled in?
3. In addition to the above question, if tomorrow we modify tables or add/remove features, how do we roll the changes down to all the live user accounts/tables? I know from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table check, say every 24 hours, against the system tables for updates to its structure?
4. If the above is all true, that means we are maintaining one master set of tables with system default values, and each user gets the same values copied to their tables? Take a field like the maximum number of failed login attempts before the system locks an account: we have a system default of 5 attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security, so that means they can override the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes would have to be deployed to all user tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table that stores all preferences for all users (sketched below). Again, don't duplicate the schema for all users.
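A minimal sketch of that preferences pattern (hypothetical names), falling back to the system default when a user has not set an override:

```sql
CREATE TABLE system_default (
    pref_name  VARCHAR(50) PRIMARY KEY,  -- e.g. 'max_failed_logins'
    pref_value VARCHAR(50) NOT NULL
);

CREATE TABLE user_preference (
    user_id    INT         NOT NULL,
    pref_name  VARCHAR(50) NOT NULL,
    pref_value VARCHAR(50) NOT NULL,
    PRIMARY KEY (user_id, pref_name),
    FOREIGN KEY (pref_name) REFERENCES system_default (pref_name)
);

-- Effective setting for user 123: their override if present, else the default.
SELECT COALESCE(up.pref_value, sd.pref_value) AS effective_value
FROM   system_default sd
LEFT JOIN user_preference up
       ON up.pref_name = sd.pref_name AND up.user_id = 123
WHERE  sd.pref_name = 'max_failed_logins';
```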
Generally, the idea of creating separate tables for each entity (in this case users) is not a good one. If each table is separate, querying becomes cumbersome.
If your table is large, you should optimize it with indexes. If it gets very large, you may also want to look into partitioning.
Partitioning lets you see the table as one object even though it is logically split up; the DBMS handles most of the work and presents you with a single object. You SELECT, INSERT, UPDATE and ALTER as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by user, and instead using indexes and partitions, deals with scalability while maintaining performance. If you don't split up the tables manually, it also makes points 2, 3 and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
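Since the design above is in MySQL, here is a minimal sketch of what that partitioning could look like there (MySQL 5.1+; the table and column names are hypothetical):

```sql
-- One logical wall_post table for all users; MySQL spreads the rows
-- across 16 physical partitions by hashing the user id.
CREATE TABLE wall_post (
    post_id BIGINT   NOT NULL,
    user_id INT      NOT NULL,
    posted  DATETIME NOT NULL,
    body    TEXT,
    PRIMARY KEY (post_id, user_id)   -- the partition key must be part of the PK
)
PARTITION BY HASH (user_id)
PARTITIONS 16;
```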
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on the fly in an application; it's usually unnecessary in a properly designed schema. Schema changes are normally done at deployment time. The only time "users" get their own tables is as an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for these often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case you might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySQL, support mechanisms beyond indexes and clustered indexes that can help deal with large numbers of records. To wit, in MS SQL Server you can create a partition function on the fields of a table; MySQL 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design, modeling your domain as sensibly as possible, then adjust when you run into problems: first using the tools available within your choice of database, and only consider more drastic measures when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. In summary: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data, so to access some data you have two steps, rather than the one you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up, say, 8 shards and they all fill up, so you want to spread the data over, say, 20 shards - which means migrating data between shards.
