Referential Integrity and HBase - database

One of the first sample schemas you read about in the HBase FAQ is the Student-Course example for a many-many relationship. The schema has a Courses column in the Student table and a Students column in the Course table.
But I don't understand how in HBase you guarantee integrity between these two objects. If something were to crash after updating one table but before updating the other, we'd have a problem.
I see there is a transaction facility, but what is the cost of using this on what might be every Put? Or are there other ways to think about the problem?

We hit the same issue.
I have developed a commercial plugin for HBase that handles transactions and the relationship issues that you mention. Specifically, we utilize DataNucleus for a JDO-compliant environment. Our plugin is listed on this page http://www.datanucleus.org/products/accessplatform_3_0/datastores.html or you can go directly to our small blog http://www.inciteretail.com/?page_id=236.
We utilize JTA for our transaction service. So in your case, we would handle the relationship issue and also any inserts for index tables (Hard to have an app without index lookup and sorting!).

Without an additional log you won't be able to guarantee integrity between these two objects. HBase only has atomic updates at the row level. You could probably use that property though to create a Tx log that could recover after a failure.
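A minimal sketch of that Tx-log idea, as a toy in-memory model rather than real HBase client code. `RowStore`, `enroll` and `recover` are hypothetical names, and the only property assumed of the store is that a put to a single row is atomic. The intent is written to a log row first, both sides of the relationship are applied, and the log row is cleared last, so a recovery pass can replay anything left behind by a crash.

```python
class RowStore:
    """Stand-in for a table where each put() to a single row is atomic."""

    def __init__(self):
        self.rows = {}

    def put(self, key, columns):
        self.rows.setdefault(key, {}).update(columns)

    def delete(self, key):
        self.rows.pop(key, None)


def enroll(log, students, courses, student_id, course_id, crash_after_first=False):
    tx_id = f"{student_id}:{course_id}"
    # 1. Record the intent in one row (atomic by assumption).
    log.put(tx_id, {"student": student_id, "course": course_id})
    # 2. Apply both sides of the many-to-many relationship.
    students.put(student_id, {f"course:{course_id}": "1"})
    if crash_after_first:
        raise RuntimeError("simulated crash between the two puts")
    courses.put(course_id, {f"student:{student_id}": "1"})
    # 3. Clear the intent once both sides are written.
    log.delete(tx_id)


def recover(log, students, courses):
    """Replay every pending intent; the puts are idempotent, so replay is safe."""
    for tx_id, intent in list(log.rows.items()):
        students.put(intent["student"], {f"course:{intent['course']}": "1"})
        courses.put(intent["course"], {f"student:{intent['student']}": "1"})
        log.delete(tx_id)


log, students, courses = RowStore(), RowStore(), RowStore()
try:
    enroll(log, students, courses, "s1", "c1", crash_after_first=True)
except RuntimeError:
    pass
half_written = "c1" not in courses.rows  # only the Student side exists so far
recover(log, students, courses)
```

Readers that run between the crash and the recovery pass can still see the half-written state, so this gives eventual consistency between the two tables, not isolation.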

If you have to perform two INSERTs as a single unit of work, that means you have to use a transaction manager to preserve ACID properties. There's no other way to think about the problem that I know of.
The cost is less of a concern than referential integrity. Code it properly and don't worry about performance. Your code will be the first place to look for performance problems, not the transaction manager.

Logical relational models use two main varieties of relationships: one-to-many and
many-to-many. Relational databases model the former directly as foreign keys (whether
explicitly enforced by the database as constraints, or implicitly referenced by your
application as join columns in queries) and the latter as junction tables (additional
tables where each row represents one instance of a relationship between the two main
tables). There is no direct mapping of these in HBase, and often it comes down to
denormalizing the data.
The first thing to note is that HBase, not having any built-in joins or constraints,
has little use for explicit relationships. You can just as easily place data that is
one-to-many in nature into HBase tables. But
this is only a relationship in that some parts of the row in the former table happen to
correspond to parts of rowkeys in the latter table. HBase knows nothing of this
relationship, so it’s up to your application to do things with it (if anything).

Related

Aggregating all relations into one table SQL Server

I'm trying to design an enterprise-level database architecture. At the ERD level I have an issue.
Many of my tables have relations with each other. There may be some developments in the future, and my design should be flexible and also fast at gathering the results.
Recently I created a parent table named Node, and all of my functional tables have a one-to-one relation with this table.
(Functional tables are those that keep real-life data like Content, User, Folder, Role, ..., not those related to the application's life cycle.)
So before adding a record to any of these tables, we must add a Node to the Node table and take the new NodeId to add to the secondary table.
The Node table alone has a many-to-many relation with itself, so I designed this table to hold all of my relation concerns.
All of the other entities are like the User and are related to the Node table as shown above.
The problem is: does this design make my relational queries on the NodeAssoc table faster, or is it better to keep the relations separate?
You say:
There may be some developments in the future and my design should be flexible and also fast on gathering the results.
Flexibility and performance are two separate things, and each calls for its own approach. When you are designing a database, you have to consider database principles, and normalization is very important to keep in mind. One-to-one and many-to-many relations are not common by design. In your case you mention both, which worries me.
Advice one -> Denormalize (merge) the one-to-one tables into one table. This reduces the number of joins.
Advice two -> Introduce a bridge table for each many-to-many relation, because there could be multiple matches. Handling multiple matches means complex queries, which leads to a drop in performance.
Advice three -> Use proper indexes in order to improve performance.
Flexibility can be increased by using database views, which are stored queries. The structure of the database may change in the future, and modifying a view to match can be very fast.
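As a rough illustration of the bridge table, index and view advice, here is a small sketch using Python's built-in `sqlite3` module instead of SQL Server. The `node` and `node_assoc` tables are modelled on the question's Node and NodeAssoc; their columns, the index name and the `node_children` view are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Bridge table: one row per parent/child association between nodes.
CREATE TABLE node_assoc (
    parent_id INTEGER NOT NULL REFERENCES node(id),
    child_id  INTEGER NOT NULL REFERENCES node(id),
    PRIMARY KEY (parent_id, child_id)
);
-- The primary key covers lookups by parent; this index covers lookups by child.
CREATE INDEX idx_node_assoc_child ON node_assoc (child_id);
-- Callers query the view, so the tables underneath can change later.
CREATE VIEW node_children AS
    SELECT p.name AS parent, c.name AS child
    FROM node_assoc a
    JOIN node p ON p.id = a.parent_id
    JOIN node c ON c.id = a.child_id;
INSERT INTO node VALUES (1, 'Folder'), (2, 'Content');
INSERT INTO node_assoc VALUES (1, 2);
""")
rows = conn.execute("SELECT parent, child FROM node_children").fetchall()
```

If the schema is reorganized later, only the view definition has to change, and queries written against `node_children` keep working.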

Pivoting EAV data from the client to a Relational Model on the server

I am building a software platform for mobile electronic data collection. It should support any type of data. For example, the government might use it for a population survey; a manufacturing company might use it to evaluate plant condition at their factories; a research organization might use it for clinical trials, etc.
As such, the software is powered by a database, with standard relational design for the metadata and entity attribute value for the actual data. Client software then reads the metadata and renders the appropriate user interface, complete with rules, validations, skip logic and so on. I believe the choice of EAV is a good one owing to the diversity of data that might be collected, but ...
Once the data is submitted from the mobile clients to the customer's server, the EAV model is no longer useful because the customer expects just his set of (usually very few) tables, for visualization and processing.
I have considered two options for pivoting the data.
1) Pivot the data immediately it is submitted to the server (via a JSON web service) and save it straightaway into a relational model.
2) Save the data in a similar schema on the server but have a background process that pivots it periodically and saves it in a relational model.
The first alternative seems more efficient as pivoting one record at a time is obviously quicker and less CPU intensive. The disadvantage is that if the metadata is changed, this process needs to adapt immediately by changing the relational model for the data accordingly. Depending on the extent of the changes, this can take some time. Worse, if it fails for any reason, upload requests might start being declined. If using the second approach, such failure would not "break" anything as urgent.
Are there other potential pitfalls I might be missing or design considerations I should make? What are some good reasons to do it one way or the other? Are there other alternatives I should be exploring to solve this problem?
Just define a straightforward relational schema of tables for their data using DDL. EAV is just an encoding of a proper schema & its metadata. Which, of course, the DBMS can't understand so you lose practically all the benefits of a DBMS. The only possible reason to use EAV is when tables are not known at compile time and DDL isn't fast enough or able to hold enough tables.
The EAV requests are just textual rearrangements of the DDL requests. (EAV configuration is typically a table for multiple entity-attribute-value requests given a table and key column(s) of the entities having virtual tables.) Moreover one only has to write a single interface easily implemented to map EAV configuration-then-updates to whichever of the two implementations one chooses. (It is better to use a pure relational interface and hide the chosen implementation but the nature of interfaces to SQL DBMSes, namely SQL, makes that difficult. I.e., it would be easy if one were using a relational API rather than SQL.)
The EAV configuration without such an interface is only simpler if you don't declare the appropriate constraints or transactions on the virtual per-entity tables. Also every EAV version update or query must reconstruct the virtual tables then embed those expressions in the DDL version's update or query. (Only in the case of simply inserting or deleting or retrieving a single triple is the EAV DML as simple.)
Only if you showed that creating & deleting new tables was infeasible and the corresponding horrible integrity-&-concurrency-challenged mega-joining table-and-metadata-encoded-in-table EAV information-equivalent design was feasible should you even think of using EAV.
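The pivot discussed in the question, from entity-attribute-value triples to one row per entity, can be sketched in a few lines of plain Python. The `pivot` function, the attribute names and the sample triples are hypothetical; a real implementation would take the column list from the metadata tables.

```python
from collections import defaultdict


def pivot(triples, attributes):
    """Turn (entity, attribute, value) triples into one dict per entity."""
    rows = defaultdict(dict)
    for entity, attribute, value in triples:
        rows[entity][attribute] = value
    # Missing attributes become None, like NULL columns in a relational table.
    return [
        {"entity": entity, **{a: values.get(a) for a in attributes}}
        for entity, values in sorted(rows.items())
    ]


triples = [
    ("r1", "age", 34),
    ("r1", "district", "North"),
    ("r2", "age", 51),
]
table = pivot(triples, ["age", "district"])
```

Every query against the EAV form has to redo this reconstruction, which is the overhead the answer describes.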

Database FK Constraints vs Programmatic FK Constraints

Although I am targeting MySQL/PHP, for the sake of my questions, I'd like to just apply this generally to any relational database that is being used in conjunction with a modern programming language. Another assumption would be that the language is leveraging a modern framework, which, on some level would handle foreign key constraints implicitly or have a means to do so explicitly.
My questions:
What are the pros and cons of creating FK constraints in the database itself as opposed to managing them at the application level?
From a design standpoint, should they ever both be used together or would that cause conflict?
If they should not be used together, what is considered the "best practice" in regards to which approach to use?
Note: This is a design theory question. Because of the wide variety of technology that could be used to satisfy an implementation, I'm not really interested in any specifics regarding an implementation.
What are the pros and cons of creating FK constraints in the database itself as opposed to managing them at the application level?
In a concurrent environment, it is surprisingly difficult to implement referential integrity in the application code, such that it is both correct and with good performance.
Unless you very carefully use locking, you are open to race conditions, such as:
Imagine there is currently one row in the parent table and no corresponding rows in the child.
Transaction T1 inserts a row in the child table, but does not yet commit. It can do that since there is a corresponding row in the parent table.
Transaction T2 deletes the parent row. It can do that since there are no child rows from its perspective (T1 hasn't committed yet).
T1 and T2 commit.
At this point, you have a child row without parent (i.e. broken referential integrity).
To remedy that, you can lock the parent row from both transactions, but that's likely to be less performant compared to the highly optimized FK implemented in the DBMS itself.
On top of that, all your clients have to adhere to the same "locking protocol" (one misbehaving client is enough to corrupt the data). And the complexity rapidly rises if you have several levels of nested FKs or diamond-shaped FKs. Even if you implement referential integrity in triggers, you are only solving the "one misbehaving client" problem, but the rest remains.
Another nice thing about database-level FKs is that they usually support referential actions such as ON DELETE CASCADE. And all that is simple and self-documenting, unlike referential integrity buried inside application code.
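To make the database-level behaviour concrete, here is a small sketch using Python's built-in `sqlite3` module. SQLite stands in for whichever DBMS is actually in use, and the `parent` and `child` tables are hypothetical. It shows the database rejecting an orphan row and applying ON DELETE CASCADE without any application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL REFERENCES parent(id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.execute("INSERT INTO child (id, parent_id) VALUES (10, 1)")

# A child that points at a missing parent is rejected by the database itself.
try:
    conn.execute("INSERT INTO child (id, parent_id) VALUES (11, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the parent removes its children through the referential action.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining_children = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
```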
From a design standpoint, should they ever both be used together or would that cause conflict?
You should always use database-level FKs. You could also use application level "pre-checks" if that benefits your user experience (i.e. you don't want to wait until the actual INSERT/UPDATE/DELETE to warn the user), but you should always code as if the INSERT/UPDATE/DELETE can fail even if your application-level check has passed.
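A sketch of that combination, again with Python's `sqlite3` and hypothetical `parent` and `child` tables. The function `add_child` does a friendly pre-check first, but still treats the INSERT as something that can fail, because the parent may disappear between the check and the write.

```python
import sqlite3


def add_child(conn, child_id, parent_id):
    """Return (ok, message). The pre-check is only a courtesy to the user."""
    exists = conn.execute(
        "SELECT 1 FROM parent WHERE id = ?", (parent_id,)
    ).fetchone()
    if exists is None:
        return False, "parent does not exist (pre-check)"
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "INSERT INTO child (id, parent_id) VALUES (?, ?)",
                (child_id, parent_id),
            )
    except sqlite3.IntegrityError as exc:
        # The database constraint is the real guard, not the pre-check.
        return False, f"rejected by database: {exc}"
    return True, "inserted"


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL REFERENCES parent(id))"
)
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.commit()

ok_result = add_child(conn, 10, 1)
missing_result = add_child(conn, 11, 2)
```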
If they should not be used together, what is considered the "best practice" in regards to which approach to use?
As I stated, always use database-level FKs. Optionally, you may also use application-level FKs "on top" of them.
Just how familiar are you with database design and the foreign key concept in general? An FK is a column (or columns) in one table that identifies a row in another table. (I'm pretty sure you already know this.) So an FK constraint is something that exists in the DB, not in the application. Managing FK constraints in the application requires manual coding for functionality that is already available in the DB. So why would you want to do all that manual labor? The DB/application interaction and development also become much more difficult because of all that extra manual coding.
Best practice IMHO is to use the tools for what they were created to do. The DB takes care of FK referential integrity, and the application doesn't need to concern itself with the DB's inner functionality. However, if referential integrity is your main concern and you're, for example, using MySQL with the MyISAM engine, which doesn't support FK constraints, then you have to do some manual checking in the application (or maybe with DB triggers, which I am not familiar with). Just keep in mind that when you do all kinds of checking in the application you still have to access the DB, and thus you use more resources than are really needed if the DB could handle the referential integrity checks. (The easy solution of course would be to start using the InnoDB engine, but I'll stop here before this answer gets too product oriented.)
So some of the pros of letting the DB handle the FK constraint would be:
You don't have to think about it.
You don't have to manually code anything extra.
Application uses less resources and contains less code and thus...
... maintaining and developing both the DB and the application is a lot easier (for example the application developers don't need to understand database oriented concepts and functionalities so deeply, let the DB experts do the FK etc. thinking...).
What are the pros and cons of creating FK constraints in the database
itself as opposed to managing them at the application level?
Some of the pros of using db-enforced FKs:
Separation of schema from code.
Making application code smaller
No chance for programmer to mess with FK rules.
Forces other applications that integrate with the db to follow the fk rules.
Some of the cons of having db-enforced FKs.
Not easy to break if you have a special case
If data is not valid, errors could be thrown. The application should be coded to handle such errors gracefully (especially in batch operations).
Definition of FK with Referential integrity rules must be defined and coded carefully. You don't want to cascade delete 1000000 rows online.
They cause an implicit check, even if you don't want that check to occur because you know the parent row must exist. This has probably a trivial impact on performance. Performance is an issue when loading huge data volumes in batch loads and in OLAP/Data Warehouse systems. Special load tools are used and constraints such as database enforced FKs are usually disabled during the load.
From a design standpoint, should they ever both be used together or
would that cause conflict?
You could use them together for a reason. As I mentioned before, you may have special cases in your data that you can't define FKs for. Also, there are certain cases such as many-to-many self referencing relationships between tables that could not be handled by FKs (for some db engines at least).

ORM and Database Constraints

How compatible is ORM and existing databases that have a lot of constraints (particularly unique key constraints/unique indexes beyond primary keys) enforced within the database itself?
(Often these are preexisting databases, shared by numerous legacy applications. But good database modeling practice is to define as many constraints as possible in the database, as a double-check on the applications. Also note that the database engine I am working with does not support deferred constraint checking.)
The reason I am asking is that the ORMs I have looked into, NHibernate and Linq to SQL, don't seem to hold up very well in the presence of database unique constraints. For example, deleting a row and re-inserting one with the same business key results in a foreign key exception. (There are subtle, harder to avoid examples as well.) The ORMs observe primary key and foreign key constraints, but tend to be oblivious to unique constraints.
I understand that there are workarounds, such as the NHibernate flush method. However, I feel this is an extremely leaky abstraction and makes it hard to design the application with regard to separation of concerns. Ideally, all of the objects can be manipulated in memory by subroutines and then the main routine can take responsibility for the call to actually sync the database. This isolates the update and allows for custom logic to inspect all of the updates before they are actually submitted to the database.
Executing the commands in the correct order is non-trivial. See my question here. Nonetheless, I was expecting better support for the common cases among the popular ORMs. This seems so important for introducing an ORM into an existing environment.
What have been your experiences with using ORM technologies in light of these issues?
This is of course IMHO...
ORM in general treats databases as merely a storage medium for data and is geared towards maintaining the constraints/business logic in the "O" side and not the "R" side. I haven't seen any ORM products that make use of some of the more "hardcore" relational database concepts like alternate keys, composite unique indexes, and exclusive subtypes. In a sense, ORM makes the database a second class citizen.
Call me old fashioned, but ORM seems to be good for reading data but for writing data back to a non-trivial relational design, I've always found it falls short. I prefer to do all my updates through SQL and/or stored procedures.
Good ORMs, and NHibernate is one, will enforce referential integrity and proper order execution if the database is mapped correctly. As far as I know, none of them support check or unique constraints. Check constraints are business rules that should be enforced in the business objects. I usually only enforce critical business rules (i.e. the business would lose money and/or I would lose my job if these rules were violated) in the database using check constraints and/or triggers.
Unique constraints usually represent an alternate key. With ORMs, it's common practice to use a surrogate key (identity) as the primary key and enforce a unique constraint on the natural key. It would be challenging for an ORM to implement unique constraint checking because it would require a select and lock before every insert or update. In general, the best practice is to always perform operations in a transaction that can be rolled back if it fails and provide a meaningful error message to the user.
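That practice can be sketched with plain `sqlite3` rather than an ORM. The `product` table, its `sku` natural key and the `save_products` helper are hypothetical. The surrogate key is the primary key, the natural key carries the unique constraint, and the whole batch runs in one transaction that is rolled back on a violation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key
    " sku TEXT NOT NULL UNIQUE,"  # natural (business) key
    " name TEXT NOT NULL)"
)


def save_products(conn, products):
    """Insert all products in one transaction, or none of them."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.executemany(
                "INSERT INTO product (sku, name) VALUES (?, ?)", products
            )
    except sqlite3.IntegrityError:
        return "Could not save: a product with one of these SKUs already exists."
    return "Saved."


first = save_products(conn, [("A-1", "Widget")])
# The second batch repeats SKU A-1, so the whole batch is rolled back.
second = save_products(conn, [("B-2", "Gadget"), ("A-1", "Duplicate widget")])
count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
```

No select-and-lock is needed before the insert; the constraint does the check and the application only translates the failure into a message.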
For example, deleting a row and re-inserting one with the same business key results in a foreign key exception.
Were you trying to do this in the scope of a single ISession? I could see that being problematic.

Is it acceptable to cross between databases?

I'm not sure what this practice is actually called, so perhaps someone can edit the title to more accurately reflect my question.
Let's say we have a site that stores objects of different types. Each type of object has its own database (a database of books and assorted information with its tables, a database of CDs and information with its tables, and so on). However, all of the objects have keywords and the keywords should be uniform across all objects, regardless of type. A new database with a few tables is made to store keywords, however each object database is responsible for mapping the object ID to a keyword.
Is that a good practice?
Is there a reason to have separate databases for each type of object? You would be better off using multiple tables, and joining them. For example, you may have a table GENERIC_OBJECT which holds things that are common across all types, and then a table called BOOK_OBJECT where BOOK_OBJECT.ID = GENERIC_OBJECT.ID for a given book. Another table would be CD_OBJECT where CD_OBJECT.ID = GENERIC_OBJECT.ID for a given CD. Then things like keywords that are common across all objects would be stored in the GENERIC_OBJECT table, and things that are specific to the item would go in the item's corresponding table.
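A minimal sketch of that single-database layout, using Python's `sqlite3`. The table names GENERIC_OBJECT, BOOK_OBJECT and CD_OBJECT follow the answer. The columns, the sample rows and the separate OBJECT_KEYWORD table (used here so one object can carry several keywords) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE GENERIC_OBJECT (ID INTEGER PRIMARY KEY, TITLE TEXT NOT NULL);
CREATE TABLE BOOK_OBJECT (
    ID INTEGER PRIMARY KEY REFERENCES GENERIC_OBJECT(ID), AUTHOR TEXT);
CREATE TABLE CD_OBJECT (
    ID INTEGER PRIMARY KEY REFERENCES GENERIC_OBJECT(ID), ARTIST TEXT);
CREATE TABLE OBJECT_KEYWORD (
    OBJECT_ID INTEGER NOT NULL REFERENCES GENERIC_OBJECT(ID),
    KEYWORD TEXT NOT NULL,
    PRIMARY KEY (OBJECT_ID, KEYWORD)
);
INSERT INTO GENERIC_OBJECT VALUES (1, 'Example Book'), (2, 'Example CD');
INSERT INTO BOOK_OBJECT VALUES (1, 'Example Author');
INSERT INTO CD_OBJECT VALUES (2, 'Example Artist');
INSERT INTO OBJECT_KEYWORD VALUES (1, 'jazz'), (2, 'jazz'), (1, 'history');
""")

# One query finds every object with a keyword, regardless of its type.
jazz_titles = conn.execute(
    "SELECT g.TITLE FROM GENERIC_OBJECT g"
    " JOIN OBJECT_KEYWORD k ON k.OBJECT_ID = g.ID"
    " WHERE k.KEYWORD = ? ORDER BY g.ID",
    ("jazz",),
).fetchall()

# Type-specific details come from joining on the shared ID.
books = conn.execute(
    "SELECT g.TITLE, b.AUTHOR FROM BOOK_OBJECT b"
    " JOIN GENERIC_OBJECT g ON b.ID = g.ID"
).fetchall()
```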
By separating them into different databases, you lose:
the ability to do ACID transactions (assuming you aren't using a two-phase commit solution).
the ability to have referential integrity.
JOINs across tables.
Thomas, what you're missing in your comment responses to our concerns about referential integrity is that you can't do a foreign key across two databases. If the two tables are in one database, then you can use foreign key constraints to ensure that when you delete an object, anything that relies upon its object id is also deleted, and other similar things.
While it is possible to do joins across databases, I wouldn't generally split the data across databases just because they are of slightly different categories. Others have also mentioned the inability to use referential integrity across databases.
On the other hand, if each type of product has radically different front-end applications, or if you expect each database to become massively large, those might be reasons to consider leaving them in separate databases. (Although scaling isn't a problem for most modern databases).
Syntax example for cross-database joins:
SELECT *
FROM books b
INNER JOIN KeywordDB.dbo.Keywords k
ON b.keywordID = k.keywordID
In this example, you are performing the query from the local database that contains the books table, and you are joining to the other database. (This is a MS SQL syntax example)
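For readers without SQL Server at hand, the same idea can be tried with SQLite from Python, where a second database is attached under a schema name and then joined. This is an analogous sketch, not the MS SQL syntax above: the `keyworddb` name, the columns and the sample rows are assumptions.

```python
import sqlite3

# The main connection plays the role of the local books database.
conn = sqlite3.connect(":memory:")
# Attach a second, separate in-memory database under the name keyworddb.
conn.execute("ATTACH DATABASE ':memory:' AS keyworddb")

conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, keywordID INTEGER)"
)
conn.execute(
    "CREATE TABLE keyworddb.Keywords (keywordID INTEGER PRIMARY KEY, word TEXT)"
)
conn.execute("INSERT INTO keyworddb.Keywords VALUES (1, 'fiction')")
conn.execute("INSERT INTO books VALUES (1, 'Example Book', 1)")

# The join crosses the database boundary by qualifying the table name.
rows = conn.execute(
    "SELECT b.title, k.word FROM books b"
    " INNER JOIN keyworddb.Keywords k ON b.keywordID = k.keywordID"
).fetchall()
```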
No, it's a bad idea. By separating them into different databases, you significantly impair your ability to do JOIN queries.
It does seem a little bit too separated, but with some well-designed views it could work, especially if the views are simply lookups.
Why such separation in the first place?
As everyone has mentioned, this is, in general, not a good idea. However, to play devil's advocate, I've seen other developers do this. I'm sure there are reasons one might want to do it. If it is absolutely needed (not sure if you're asking for solutions), you might want to use some sort of synchronization to keep the data in sync: have all of the data (or whatever is needed) in both databases.
This also isn't an ideal solution, but if you must use two different database types, this might be a better way to go about it.
It could at least solve the issues that everyone has been outlining – though keep in mind that it does present a new problem… Is everything in sync?
Good luck, Frank
If the decision about whether to use one database or two is yours, I recommend going with just one database. The data in the two tables appears closely related, judging from your question. The size and complexity doesn't seem to merit splitting into two databases.
What's your DBMS? If it's Oracle, DB2, SQL Server, or even MS Access, you shouldn't have any trouble administering a single database with keyword data and object data in logically related tables.

Resources