Recommended pattern for domain entities containing data from multiple databases

I maintain an application which has many domain entities that draw data from more than one database. The way this normally works is that the entities are loaded from Database A (where most of their fields are stored). When a property corresponding to data in Database B is accessed, the entity fires off SQL to Database B to fetch the relevant data.
I'm currently using a 'roll-your-own' ORM, which is ugly, but effective (and easy to understand). I've recently started using NHibernate for entities drawn solely from Database A, but I'm wondering how I might use NHibernate for entities drawn from both Databases A and B.
The best way I can think of to do this is as follows. I continue to use an NHibernate-based class library for entities in Database A. Those entities which also need data from Database B expose all their Database B data in a single class accessed via a property. When this property is called, it invokes the appropriate repository, and the object is returned. The class library for accessing Database B would therefore need to be referenced from the class library for accessing Database A.
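Sketched out, the idea would look something like this (all class and repository names here are illustrative, not from the real application):

    // Illustrative sketch only; Customer, BillingDetails and
    // IBillingRepository are invented names.
    public class Customer                  // mapped via NHibernate, Database A
    {
        public virtual int Id { get; protected set; }
        public virtual string Name { get; set; }

        private BillingDetails _billing;

        // All Database B data hangs off one property; loading is deferred
        // until the property is first accessed.
        public virtual BillingDetails Billing
        {
            get
            {
                if (_billing == null)
                    _billing = BillingRepository.GetForCustomer(Id);
                return _billing;
            }
        }

        // Set at startup so the entity never references Database B directly.
        public static IBillingRepository BillingRepository { get; set; }
    }

    // Lives in the Database B class library, referenced from the
    // Database A class library.
    public interface IBillingRepository
    {
        BillingDetails GetForCustomer(int customerId);
    }

    public class BillingDetails
    {
        public decimal Balance { get; set; }
        public System.DateTime LastInvoiced { get; set; }
    }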
Does this make sense, and is there a more established pattern for this situation (which must be fairly common)?
Thanks
David

I don't know how well it maps to your situation, or how mature the NHibernate port of it is at this point, but you might want to look into Shards.
If it doesn't work for you as-is, it might at least supply some interesting patterns to consider.
EDIT (based on comments):
This indeed doesn't seem to map to your situation, as Shards is about horizontal splitting of data.
If you need to split vertically, you'll probably need to define multiple persistence units. Queries and transactions involving both databases will probably get interesting. I'm afraid I can't really help much with this. This question is definitely related though.
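In NHibernate terms, multiple persistence units boils down to one ISessionFactory per database, each built from its own configuration. A minimal sketch, assuming two config files named DatabaseA.cfg.xml and DatabaseB.cfg.xml (the file names are assumptions):

    using NHibernate;
    using NHibernate.Cfg;

    // One factory per database; each mapping set belongs to exactly one.
    ISessionFactory factoryA =
        new Configuration().Configure("DatabaseA.cfg.xml").BuildSessionFactory();
    ISessionFactory factoryB =
        new Configuration().Configure("DatabaseB.cfg.xml").BuildSessionFactory();

    using (ISession sessionA = factoryA.OpenSession())
    using (ISession sessionB = factoryB.OpenSession())
    {
        // Each session only sees its own database. Anything spanning both
        // (queries, transactions) has to be coordinated in application code,
        // e.g. via System.Transactions; NHibernate will not join them for you.
    }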

Related

Is it a good practice to write all database access code in one class?

For a project, I created a single class that contains all the database access code.
Is this good practice, under the assumption that this class doesn't contain any logic, or should I use several classes? If so, how should I partition my code? I use C# .NET.
Under the MVC concept, it is good practice to create one class for database access, a separate class for logic, and a separate class for your views.
You are doing well if you write a separate class for database access, under the assumption that it does not contain any logic.
In agile development there is a term for this: database encapsulation layers.
A database encapsulation layer hides the implementation details of your databases, including their physical schemas, from your business code. In effect, this layer provides your business objects with persistence services: the ability to read data from, write data to, and delete data in data sources. Ideally your business objects should know nothing about how they are persisted; it just happens. Database encapsulation layers aren't magic and they aren't academic theory; they are common practice in both large and small applications, simple and complex alike. They are an important technique that every agile software developer should be aware of and be prepared to use.
An effective database encapsulation layer will provide several benefits:
-> It reduces the coupling between your object schema and your data schema, increasing your ability to evolve either one.
-> It implements all database-related code in one place.
-> It simplifies the job of application programmers.
-> It allows application programmers to focus on the business problem and Agile DBA(s) can focus on the database.
-> It gives you a common place to implement data-oriented business rules and logic.
-> It takes advantage of specific database features, increasing application performance.
Hope this helps.
If your database is quite small, say only a couple of tables, you could write all your queries in one class. Otherwise I would suggest one class per entity/table. For example, StudentDao would only contain queries against the STUDENT table, and TeacherDao only queries against TEACHER. If you need to implement complex business logic, you may want a service class to weave StudentDao and TeacherDao together.
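A minimal C# sketch of that layout (table names, connection handling, and the Student shape are assumptions):

    using System.Data.SqlClient;

    public class Student { public int Id; public string Name; }

    // One DAO per table: StudentDao only ever touches STUDENT.
    public class StudentDao
    {
        private readonly string _connectionString;
        public StudentDao(string connectionString) { _connectionString = connectionString; }

        public Student GetById(int id)
        {
            using (var conn = new SqlConnection(_connectionString))
            using (var cmd = new SqlCommand(
                "SELECT ID, NAME FROM STUDENT WHERE ID = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", id);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    return reader.Read()
                        ? new Student { Id = reader.GetInt32(0), Name = reader.GetString(1) }
                        : null;
                }
            }
        }
    }

    // TeacherDao would mirror this against TEACHER; a service class
    // (e.g. EnrollmentService) would hold both DAOs and weave them together.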
Unless your data access is very simple, probably not.
You probably shouldn't need to write this code yourself. Take a look at some object-relational mapping (ORM) tools; NHibernate is a popular .NET solution. http://en.wikipedia.org/wiki/NHibernate
If you really do want to write it yourself, look up design patterns in this area, like the Data Transfer Object pattern. http://martinfowler.com/eaaCatalog/dataTransferObject.html
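For reference, a DTO in this context is just a behavior-free carrier of exactly the fields one round trip needs; a minimal hedged sketch (field names are illustrative):

    // No methods, no logic: just the data crossing the boundary.
    public class StudentDto
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public string TeacherName { get; set; }  // flattened from a join
    }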
Here are some suggestions for accessing a database:
1.) Always keep your database access parameters in a properties file, and use a handler class to read them. When you change your database you don't need to change your code, just the properties file (a sketch follows below).
2.) Never create a single class (a god class) which performs all the actions. Disperse your behaviour into different classes depending on intent, e.g. keep all read behaviour in one class, write behaviour in another, and so on.
3.) You can create a class which deals with connection creation and pooling.
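As a sketch of suggestion 1 in .NET terms (here the "properties file" is the standard App.config, and the connection-string name "MainDb" is an assumption):

    using System.Configuration;   // reference System.Configuration.dll
    using System.Data.SqlClient;

    // All connection details live in App.config, so switching databases
    // is a config edit, not a code change.
    public static class ConnectionHandler
    {
        public static SqlConnection Open()
        {
            string cs = ConfigurationManager
                .ConnectionStrings["MainDb"].ConnectionString;  // assumed name
            var conn = new SqlConnection(cs);
            conn.Open();
            return conn;
        }
    }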
Hope this helps.

Is EAV-hybrid a bad database design choice?

We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as a good design approach for the new database. The new design should allow:
An arbitrary number of attributes/properties for any entity, i.e. the number of attributes for any entity is not fixed and may change regularly.
Content admins can add new properties to existing entities on the fly through admin interfaces, rather than making DB schema changes all the time.
There are quite a few discussions about the performance issues of EAV, but if we don't go with a hybrid EAV we end up:
having a lot of empty columns (we still have to add new columns even though 99% of the rows don't have those properties)
spending more time maintaining the database, especially as attributes keep changing
with no way for content admins to add new properties to existing entities
Anyway here's what we are thinking about the new design (basic ERD included):
Have separate tables for each entity containing some basic info that is exclusive to it, e.g. id, name, address, contact, created, etc.
Have two tables, attribute type and attribute, to store property information.
Link each entity to its attributes using a many-to-many relation.
Store addresses in a different table and link them to entities using foreign keys.
We think this will allow us to be more flexible when adding, removing, or updating properties.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given stadium, we might need a query with 20+ joins to fetch all related attributes in a single row.
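(One mitigation we have sketched: fetch the attributes as rows in a single pass and pivot them in application code, rather than joining them into one wide row. Table and column names below are assumptions based on the ERD description, using Npgsql.)

    using System.Collections.Generic;
    using Npgsql;

    // Pull all attributes for one entity as rows (a single join chain),
    // then pivot into a dictionary in code rather than in SQL.
    public static Dictionary<string, string> LoadAttributes(
        NpgsqlConnection connection, long entityId)
    {
        const string sql = @"
            SELECT at.name, a.value
            FROM   entity_attribute ea
            JOIN   attribute a        ON a.id  = ea.attribute_id
            JOIN   attribute_type at  ON at.id = a.attribute_type_id
            WHERE  ea.entity_id = @entityId";

        var attributes = new Dictionary<string, string>();
        using (var cmd = new NpgsqlCommand(sql, connection))
        {
            cmd.Parameters.AddWithValue("entityId", entityId);
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    attributes[reader.GetString(0)] = reader.GetString(1);
        }
        return attributes;
    }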
What are your thoughts on this design, and what would be your advice to improve it?
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. Postgresql has transactional DDL, that is one killer feature for automating schema changes.
Postgresql doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client side, like json or protobuf, if you don't need to touch anything specific on the database side.
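For the attr_ids int[] / attr_values text[] option, a hedged sketch using Npgsql, which maps .NET arrays onto PostgreSQL arrays (table and column names are assumptions):

    using Npgsql;

    // Write the whole attribute batch as two parallel arrays in one row,
    // avoiding one skinny row (and its per-row overhead) per value.
    public static void SaveAttributes(
        NpgsqlConnection connection, long entityId,
        int[] attrIds, string[] attrValues)
    {
        using (var cmd = new NpgsqlCommand(
            "UPDATE entity SET attr_ids = @ids, attr_values = @vals WHERE id = @id",
            connection))
        {
            cmd.Parameters.AddWithValue("ids", attrIds);     // -> int[]
            cmd.Parameters.AddWithValue("vals", attrValues); // -> text[]
            cmd.Parameters.AddWithValue("id", entityId);
            cmd.ExecuteNonQuery();
        }
    }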
Don't go out of your way to put everything into one single entity table. If the entities don't share any attributes in a sensible way, use multiple instantiations of the specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of system understanding, versioning tools, QA procedures, and staging as your code, but it will have none of the tooling. In essence you will be programming in a very awkward, non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard, tedious work with limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seductive it is. But it's something of an easy way out; an easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with open-ended product attributes.
Sure, you could enumerate all the columns you would need for every product in the world, but that would take a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for open-ended attributes, but don't rely on EAV for all of the database's tables. A hybrid of EAV and a relational DB, done right, is a powerful tool that could not be accomplished using fixed columns alone.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.

NoSQL vs Relational Coding Styles

When building objects that make use of data stored in a RDBMS, it's normally pretty clear what you're getting back, as dictated by the tables and columns being queried. However, when dealing with NoSQL, document-based systems, it's less clear what is being retrieved.
What are common methods of keeping track of structure in which data is stored?
It depends on the driver. With the NORM driver you can "serialize" and "deserialize" an instance of an object into and out of the DB. It will throw an error when there is an extra field in the DB that isn't present in the class definition. This is the default behaviour of NORM, but they are adding the possibility to make it more flexible.
Read here: http://groups.google.com/group/norm-mongodb/browse_thread/thread/31102ec553a50e19
Not only does this depend on what database you're using, but it also depends on the language/framework you're coding with.
Most opinionated frameworks expect an ODM of some sort where you define a schema that is enforced in your models - like Rails, for example - and other frameworks let you do whatever you want, which puts you at risk of having data in multiple formats and not knowing what to do with it...
For MongoDB I've toyed with the notion of a soft schema, where every collection (table) has a document with a title of "schema" and defines the different elements and their datatypes in an embedded array called "definition." This allows me to generate dynamic scaffolds based on each collection, and can come in very handy when integrating with non-ODM platforms - in my case, Joomla.
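For illustration, a hedged sketch of such a "schema" document using the official MongoDB C# driver; the convention is entirely application-level (nothing in MongoDB enforces it), and the field layout here is an assumption:

    using MongoDB.Bson;
    using MongoDB.Driver;

    // One document per collection, titled "schema", whose "definition"
    // array names each element and its datatype. Purely a convention.
    var schemaDoc = new BsonDocument
    {
        { "title", "schema" },
        { "definition", new BsonArray
            {
                new BsonDocument { { "name", "headline" }, { "type", "string" } },
                new BsonDocument { { "name", "price"    }, { "type", "decimal" } },
                new BsonDocument { { "name", "in_stock" }, { "type", "bool" } }
            }
        }
    };

    var collection = new MongoClient("mongodb://localhost")
        .GetDatabase("shop")
        .GetCollection<BsonDocument>("products");
    collection.InsertOne(schemaDoc);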
Another approach is to store those schema definitions in a separate collection called schemas or schemata or some such.
You most certainly want to lock down some sort of schema in your code to ensure your data is in a predictable format; this is also important to address whenever your schemas change, and they invariably will.
There are also frameworks where your coding style does not change too much, like playOrm, which allows you to store relational data in a NoSQL store and perform joins. The trick is partitioning the data and Scalable SQL, so it scales just fine and you can still query your data like you did in the past.

User defined data objects - what is the best data storage strategy?

I am building a system that allows front-end users to define their own business objects. Defining a business object involves creating data fields for that business object and then relating it to other business objects in the system; fairly straightforward stuff. My question is: what is the most efficient storage strategy?
The requirements are:
Must support business objects with potentially 100+ fields (of all common data types)
The system will eventually support hundreds of thousands of business object instances
Business objects sometimes display data and aggregates from their relationships with other business objects
Users must be able to search for business objects by their data fields (and fields from related business objects)
The two possible solutions I can envisage are:
Have a dynamic schema such that when a new business object type is created a new table is created for storing instances of that object. The object's fields become columns in the storage table.
Have a fixed schema where instance data fields are stored as rows in basically a big long table.
I can see pros and cons to both approaches:
the dynamic schema allows me to index search columns
the dynamic tables are potentially limited in width by the max column size
dynamic schemas rule out / cause issues with replication
the static schema means less or even no dynamic sql generation
my guess is the static schema may perform like a dog when it comes to searching across 100,000+ objects
So what is the best solution? Is there another approach I haven't thought of?
Edit: The requirement I have been given is to build a generic system capable of supporting front-end user defined business objects. There will of course be restrictions on how these objects can be constructed and related, but the requirement itself is not up for negotiation.
My client is a service provider and requires a degree of flexibility in servicing their own clients, hence the need to create business objects.
I think your problem matches very well to a graph database like Neo4j, as it's built for the requested kind of flexibility from the beginning. It stores data as nodes and relationships/edges, and both nodes and relationships can hold arbitrary properties (in a key/value fashion). One important difference to a RDBMS is that a graph database won't need to lookup the relationships in a big long table (like in your fixed schema solution), so there should be a significant performance gain there. You can find out about language bindings for Neo4j in the wiki and read what others say about it in this stackoverflow thread. Disclaimer: I'm part of the Neo4j team.
Without much understanding of your situation...
Instead of writing a general purpose one-size-fits-all business objects system (which is the holy grail for Oracle, Microsoft, SAS, etc.), why not do it the typical way, where the requirements are gathered, and a developer designs and implements the users' business objects in an effective manner?
If your users are typical, they will create a monster, which will end up running slow, and they will hate it. Most users will view the data as an Excel sheet, and not understand relationships like: parent/child. As a result there will be some crazy objects built, and impossible-to-solve reports. You'll be forced to create scripts to manually convert many old objects to better and properly defined ones, etc...
Your requirements sound a little bit like an associative database with a front end to compose and edit entities.
I agree with KM above, unless you have a very compelling reason not to, you would be better off using a traditional approach. There are a lot of development tools and practices that allow you to build a robust and scalable system. Otherwise you will have to implement much of this yourself.
I don't know the best way to do this, because it sounds like something that has already been implemented by others. If I were asked to implement this feature, I would recommend buying a wheel instead of reinventing it.
Perhaps there are reasons you have to invent your own? If so, then you should add those reasons to the requirements you listed.
If you absolutely must be this generic, I still recommend buying a system that has been architected for this requirement. Not just the storage requirements, which are the least of the problems your customer will have; but also: how do you keep the customer from screwing up totally when given this much freedom. Some of the commercial systems already meet this challenge without going out of business because of customers messing up.
If you still need to do this on your own, then I suggest that your requirements (or perhaps those of another vendor?) must include: allow the customer to get it right, and help keep the customer from getting it wrong. You'll need some sort of UI to allow the customer to define these business objects, and the UI should validate the model that the customer builds.
I recommend a UI that works at a conceptual level. As an example, see NORMA, a Visual Studio add-in for Object-Role Modeling (the "other" ORM). Consider it as an example only, if your end users cannot afford a Visual Studio Standard license. Otherwise, you'll find that it is extensible, already produces many types of artifact (from SQL in various dialects to code), and will validate the model to see that it makes sense. End users would also be able to enter sample data that they believe should be valid, and the system will validate the data against the model.
If your customers are producing sensible (if dynamic) business objects, then the question of storage will be much simpler.
Have you thought about an XML based solution? The requirements suggested to me "Build a system that allows users to dynamically generate an XML Schema and work with XML documents based on that schema." I don't know enough about storing and querying XML documents to comment on your original question.
Another possibility might be to leverage NHibernate's ability to generate database schemas. If you can dynamically generate business objects, then you can generate XML mappings or Fluent mappings and use that to generate a normalized database schema.
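The schema-generation feature referred to here is NHibernate's SchemaExport; a minimal sketch (the config file name is an assumption):

    using NHibernate.Cfg;
    using NHibernate.Tool.hbm2ddl;

    // Build a Configuration from the dynamically generated mappings, then
    // let SchemaExport derive and (optionally) execute the matching DDL.
    var cfg = new Configuration().Configure("hibernate.cfg.xml");
    new SchemaExport(cfg).Create(
        true,   // echo the generated DDL to stdout
        true);  // execute it against the configured database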
Every user that I have ever talked to has always wanted "everything" in their project. Part of the job of gathering requirements is to guide the user, not just write down everything they say.
Your only hope is to build several template objects that they can add properties to. You could code your application to handle each of these object types, while still allowing the user to slightly modify each as necessary.
You need to inform the user upfront of the major flaws this type of design has. This will help you in the end, when it runs slow, or if they screw up and need help fixing something. I'd put this in writing.
How many possible objects would they really need? Perhaps you could set these up using your system first. I have developed several very customizable systems over the years and when the user is sitting at an empty screen, it is like a deer in the headlights.
In any event, good luck.

Marrying up consumer-defined aggregates (e.g. SQL counts) with 'pure' model objects?

What is the best practice for introducing custom (typically volatile) data into entity model classes? This may sound like bad practice at first, but it seems to be quite a common scenario. In our recent web application we have developed a proper model, and in most cases we are fine with loading model entities. But there are cases where we cannot afford to load an entire hierarchy of entities; we need to load, say, the results of a couple of SQL COUNTs, or possibly some additional information alongside (or embedded inside) the model entities. So basically, the requirements and conditions are:
It’s a web application where 99.9999999999% of all operations are read operations.
They don’t need to process or do any complicated business logic. We just need to get data quickly to HTML.
In several performance critical cases, we need to load results of SQL aggregates which don’t fit any model properties.
We need an extensible way to introduce any new custom data if needed.
How do you usually solve this issue without working around your ORM too much (for instance, by dropping down to raw data from the DB)? I'm sure this has been discussed many times, but I cannot figure out a good Google query to find anything useful.
Edit: Since I later realized the question was not very well formed, I decided to reformulate it and start a new one.
If you're just getting relational data to and from a browser, with little or no behavior in between, it sounds like you're trying to solve a relational problem with an OO paradigm.
I might be inclined to dispense with the Object Oriented approach altogether.
My team recently rewrote an application by asking "What is the simplest thing that can possibly work?" and "What is the closest language to the problem?". Our new app, replacing an OO one, ended up being 10 times smaller, faster, and cheaper.
We used SQL, stored procedures, XML libraries on the DB server, XSLT (to get the HTML), and javascript.
An OOP purist like myself would go for the Decorator pattern.
http://en.wikipedia.org/wiki/Decorator_pattern
But the thing is, some people may not need the flexibility it offers. Plus, creating new classes for each distinct operation may seem like overkill, but it provides good compile-time type checking.
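As a sketch of how that could look here (all type names invented): a decorator wraps the pure model object and carries one volatile aggregate alongside it.

    public interface IArticle
    {
        string Title { get; }
    }

    public class Article : IArticle          // the "pure" model object
    {
        public string Title { get; set; }
    }

    // Decorator: same interface as the model, plus one aggregate value.
    public class ArticleWithCommentCount : IArticle
    {
        private readonly IArticle _inner;

        public ArticleWithCommentCount(IArticle inner, int commentCount)
        {
            _inner = inner;
            CommentCount = commentCount;     // e.g. the result of a SQL COUNT(*)
        }

        public string Title { get { return _inner.Title; } }
        public int CommentCount { get; private set; }
    }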
The best practice in my view is that your application consumes data using the Domain Model pattern. The Domain Model can offer business-logic methods for doing the type of queries that make sense and are relevant to your application needs.
These can fetch "live" results that map directly to database rows and can therefore be edited and "saved."
But additionally, the Domain Model can provide methods that fetch read-only results that are too complex to be easily saved back to the database. This includes your example of grouped aggregate query results, and also includes joined query result sets, expressions as columns, etc.
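A hedged sketch of that split with NHibernate (entity, table, and column names are invented): live entities come back through the session as usual, while aggregates come back as plain read-only result objects.

    using System.Collections.Generic;
    using NHibernate;
    using NHibernate.Transform;

    public class Order { public virtual int Id { get; set; } }  // assumed mapped entity

    // Read-only result shape for an aggregate query; never saved back.
    public class SalesSummary
    {
        public string Region { get; set; }
        public decimal Total { get; set; }
    }

    public class OrderModel
    {
        private readonly ISession _session;
        public OrderModel(ISession session) { _session = session; }

        // "Live" entity: tracked by NHibernate, editable and saveable.
        public Order GetOrder(int id)
        {
            return _session.Get<Order>(id);
        }

        // Aggregate: maps to no entity, comes back as plain objects.
        public IList<SalesSummary> SalesByRegion()
        {
            return _session
                .CreateSQLQuery(
                    "SELECT Region, SUM(Total) AS Total FROM Orders GROUP BY Region")
                .SetResultTransformer(Transformers.AliasToBean<SalesSummary>())
                .List<SalesSummary>();
        }
    }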
The Domain Model pattern offers a way to decouple the OO design of an application from the design of the physical database.
