Looking for any advice from anyone who has migrated their repositories from a relational DB to a NoSQL store.
We are currently building an App using a Postgres database & ORM (SQLAlchemy). However, there is a possibility that at a later date we may need to migrate the App to an environment that currently only supports a couple of NoSQL solutions.
With that in mind, we're following the Persistence-Oriented approach to repositories covered in Vaughn Vernon's Implementing Domain-Driven Design. This results in the following API:
save(aggregate)
save_all(aggregates)
remove(aggregate)
get_by_...
Without going into detail, the ORM-specific code has been hidden away in the repository itself. The Session is only used for the short span of time when data is retrieved or updated, and is then immediately committed and closed (in the repository's methods). This means lots of merging on save, and not the most efficient use of the Session.
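def save(aggregate):
    session = session_factory()  # a new session on every call
    try:
        session.merge(aggregate)
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

def get(id):
    session = session_factory()
    try:
        aggregate = session.query_by(id)  # pseudocode for the actual lookup
        session.expunge(aggregate)  # detach so it stays usable after close
        session.commit()
        return aggregate
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()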
etc etc
The advantages:
We are limiting ourselves to updating a single Aggregate per Use Case, so the lack of fully utilising the UOW Transaction Control in the Application Service is minimal (outside of performance). Transaction Control is enabled in the repos while the aggregate is written to ensure the full aggregate is persisted.
No ORM-specific code leaks outside of the Repositories, which would need to be re-coded in the event of switching to a NoSQL DB anyway.
So if we do have to switch to a NoSQL DB, we should have a minimal amount of work to do.
However, almost everything I have read encourages Transactional Behaviour to live in the Application Service Layer. Although I believe there is a distinction here between Business Transactional and DB Transactional.
Likewise, we're taking a performance hit, in that we are asking the session factory for a session on every call to the repository. Most services contain about 3 or so calls to a repository.
So, the question to anyone who has migrated from Relational to a NoSQL DB?
Does the concept of a Unit of Work / Session mean anything in a NoSQL world?
Should we fully embrace the ORM in the meantime, and move the UOW/Session outside of the Repository into the Application Service?
If we do that, what was the level of effort to re-engineer the Application Service, if we need to migrate to a NoSQL solution in the end. (The repositories will need to be re-written in any instance).
Finally, has anyone had much experience writing an implementation-agnostic repository?
PS. I understand we could drop the ORM entirely and go pure SQL in the meantime, but we have been asked to ensure we are using an ORM.
EDIT: In this answer I focus on document DBs, based on the question's title. Of course other NoSQL stores exist with vastly different characteristics (for example graph DBs, event sourcing and others).
It should not be a problem really.
In document db's your entire aggregate should be a single document. This way you have exactly the same guarantees that you need for transactional consistency. Regardless of how many entities change within the aggregate, you're still storing a document. You will need to make sure you enforce some form of optimistic concurrency (through an etag or version or similar), and not a Unit of Work pattern, but after that your transactional requirements are covered.
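To make the optimistic concurrency point concrete, here is a minimal sketch. It assumes a hypothetical in-memory store (InMemoryDocumentStore is not a real driver) that keeps a version number next to each document and rejects a save when the version the writer loaded is no longer current. Real document databases expose this differently (etags, revision ids and so on), so treat the names as illustrative only.

```python
import copy


class ConcurrencyError(Exception):
    """Raised when a document changed since the caller loaded it."""


class InMemoryDocumentStore:
    """Hypothetical stand-in for a document database (not a real driver)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def load(self, doc_id):
        version, doc = self._docs[doc_id]
        return version, copy.deepcopy(doc)

    def save(self, doc_id, doc, expected_version):
        # A document that does not exist yet is treated as version 0.
        current_version = self._docs.get(doc_id, (0, None))[0]
        if current_version != expected_version:
            raise ConcurrencyError(
                f"{doc_id}: expected version {expected_version}, "
                f"found {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, copy.deepcopy(doc))
        return current_version + 1


store = InMemoryDocumentStore()
store.save("order-1", {"lines": []}, expected_version=0)

# Two writers load the same version of the aggregate.
v_a, order_a = store.load("order-1")
v_b, order_b = store.load("order-1")

order_a["lines"].append("apple")
store.save("order-1", order_a, expected_version=v_a)  # accepted

order_b["lines"].append("pear")
try:
    store.save("order-1", order_b, expected_version=v_b)
    conflict = False
except ConcurrencyError:
    conflict = True  # writer B has to reload and retry
```

The whole aggregate is written as one document, so the version check is the only coordination needed between concurrent writers.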
I can't really comment on whether you should fully embrace a UoW pattern now vs. rely on the ORM implementation etc. This really depends a lot on your current situation and the details of your implementation. What I can say though is that it is quite probable that you won't need to migrate from normal form (SQL) to documents all in one go. Start with a simple aggregate so that you can see what works for you and what doesn't.
I don't know if implementation-agnostic repositories exist, but that doesn't make a lot of sense to me. The whole point of a repository is encapsulating persistence, so you can't abstract it: there won't be any other responsibility allocated to them. Also, you can't assume that the repository will need to compose different models into the aggregate model: this is specific to platform, so it's not agnostic.
One final comment: I see in your question that you listed save_all(aggregates). I'm not sure what you're referring to, but at a minimum each aggregate save should be wrapped in its own transaction, otherwise this operation violates the transactional boundary characteristic of an Aggregate.
Does the concept of a Unit of Work / Session mean anything in a NoSQL
world?
Yes, it can still be an interesting concept to have. Just because you're using a NoSQL storage doesn't mean that the need for some sort of business transaction management disappears. Many NoSQL databases have drivers or third party libraries that manage change tracking. See RavenDB for instance.
Sure, if you're only ever loading one aggregate per transaction and if your NoSQL unit of storage matches an aggregate perfectly, most of a Unit of Work's features will be less important, but you'll still be facing exceptions to that rule. Besides, the part of a UoW that's relevant in any case is Commit and possibly Abort.
Should we fully embrace the ORM in the meantime, and move the
UOW/Session outside of the Repository into the Application Service?
What I recommend instead is materializing the concept of Unit of Work in a full fledged class:
class UnitOfWork {
    void Commit()
    {
        // Call your ORM persistence here
    }
}
Application Services are just the place where the Unit of Work is called, not where it is implemented.
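Since the question is about Python, the same idea can be sketched there as a context manager. This is only an illustration under assumptions: FakeSession is a hypothetical stand-in that records calls, and rename_customer is an invented application service. A real version would pass in whatever session factory the ORM provides.

```python
class UnitOfWork:
    """Owns one session for the duration of a single use case."""

    def __init__(self, session_factory):
        self._session_factory = session_factory
        self.session = None

    def __enter__(self):
        self.session = self._session_factory()
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            if exc_type is not None:
                self.session.rollback()
        finally:
            self.session.close()
        return False  # never swallow the exception

    def commit(self):
        self.session.commit()


class FakeSession:
    """Hypothetical stand-in for an ORM session; it only records calls."""

    def __init__(self):
        self.calls = []

    def merge(self, obj):
        self.calls.append("merge")

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


def rename_customer(uow, customer, new_name):
    """Application service: it calls the Unit of Work, it does not implement it."""
    with uow:
        customer["name"] = new_name
        uow.session.merge(customer)
        uow.commit()


uow = UnitOfWork(FakeSession)
rename_customer(uow, {"name": "old"}, "new")
calls = uow.session.calls
```

The application service decides when to commit, while how the commit happens stays inside the UnitOfWork class and can change with the storage.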
If we do that, what was the level of effort to re-engineer the
Application Service, if we need to migrate to a NoSQL solution in the
end. (The repositories will need to be re-written in any instance).
It depends on a lot of other parameters such as Unit of Work support by your NoSQL API or third party libraries, and similarity in shape between Aggregates and the NoSQL storage. It can range from practically no work to writing a full UoW/change tracking implementation yourself. If the latter, extracting UoW logic from the Repository to a separate class won't be the hardest part of the job.
Finally, anyone had much experience writing an implementation-agnostic
repository?
I concur with SKleanthous here - implementation agnostic repos don't make much sense IMO. You've got your repository abstractions (interfaces) which are of course agnostic but when it comes to implementations, you have to address a particular persistent storage.
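A rough sketch of that split, assuming a hypothetical Customer aggregate: the abstract repository knows nothing about storage, while each implementation is tied to one store. Here one uses the standard-library sqlite3 module and the other keeps JSON documents in a dict as a stand-in for a document DB.

```python
import json
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass


@dataclass
class Customer:
    id: str
    name: str


class CustomerRepository(ABC):
    """The agnostic part: what the domain needs, not how it is stored."""

    @abstractmethod
    def save(self, customer: Customer) -> None: ...

    @abstractmethod
    def get_by_id(self, customer_id: str) -> Customer: ...


class SqliteCustomerRepository(CustomerRepository):
    """Relational implementation: one row per aggregate."""

    def __init__(self, connection):
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, name TEXT)"
        )

    def save(self, customer):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                (customer.id, customer.name),
            )

    def get_by_id(self, customer_id):
        row = self._conn.execute(
            "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        return Customer(id=row[0], name=row[1])


class JsonDocumentCustomerRepository(CustomerRepository):
    """Document-style implementation: one JSON document per aggregate."""

    def __init__(self):
        self._documents = {}  # stand-in for a document database

    def save(self, customer):
        self._documents[customer.id] = json.dumps(asdict(customer))

    def get_by_id(self, customer_id):
        return Customer(**json.loads(self._documents[customer_id]))


repositories = [
    SqliteCustomerRepository(sqlite3.connect(":memory:")),
    JsonDocumentCustomerRepository(),
]
for repo in repositories:
    repo.save(Customer(id="c-1", name="Ada"))
loaded = [repo.get_by_id("c-1") for repo in repositories]
```

Callers depend only on CustomerRepository, so swapping the store means writing a new implementation rather than touching the application services.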
I've read some articles about CQRS and Event Sourcing recently. While the first seemed to me like a highly complex and risky workaround to fix poorly performing business and poorly designed data access layers and data models, the last seemed like a solution to many problems.
Problems to solve using Event Sourcing:
Get rid of Relational Database and Object Relational Mappers, like NHibernate and Entity Framework. Hardly anybody in the programming area wants to pay attention to such stuff like indices, table/index fragmentation or normalization, how to design relational data and how to code/configure the ORM (a science on its own).
Have Business Model and in-memory "database" united, an entity/aggregate service keeping all relevant items in memory, maintaining integrity by simply dumping the CUD events somewhere without much pain. Old items can be evicted from memory and dumped to a NoSQL (or whatever) store and used for aggregate calculations, reporting, search and, if necessary, be re-activated. If I understand right, in-memory databases like VoltDB use event dumping in a similar way, but are still relational databases, separated from the business logic.
This would also make concurrency easier: instead of locking (with possible complete system deadlocks) or optimistic locking with a general "success or fail" logic, depending on whether the data has changed meanwhile (or rather complex DB code), merge rules can be implemented in code.
History: no more pain with implementing auditing functions, cemetery tables or "deleted" marker columns, or possibly deleted data still being required.
Data Duplication/Search/Reporting: use full-text indices instead of chasing missing relational indices, create proper viewing areas, preparing the data for the user in a required format, instead of using ugly copy routines in relational databases, with triggers, followup stored procedures or even program code copying data to half a dozen different tables.
Versioning: it's a pain to get many modules running with a number of different relational database versions, each having different tables and columns and needs appropriate ORM mappings. Could be much easier in a single layer model, with the event dump accepting any object format (typically schema-less or loose-schema NoSQL documents, represented as JSON or XML). It might also be possible to upgrade old data through a "data schema change event" chain (instead of having to maintain migration scripts for relational DBs).
N-tier Business Model / Relational DB / ORM mess
An n-tier approach a decade or longer ago might have been a business layer and data access layer. In order to keep separation really strict, many relational features were omitted, to implement them in the business layer instead: relational integrity, normalization, with the DB being what I call a "trash dump": looking like a kid playing around with SQL Server Management Studio or Access. Extremely un-normalized, polymorphic references ("Foreign Key" columns referencing different source tables, identified by a "ReferenceSource" marker), abuse of same tables for different kinds of business objects and duplication of data to numerous other tables (and from there again elsewhere), because performance wasn't good and this was supposed to improve queries. ORM usage was without object references too, reduced to single object load and save operations. Loading an aggregate (a graph of entities/table rows) would iterate through the graph and send a query for every set of sub-entities.
When performance got worse and, possibly, orphaned references caused serious trouble, attempts to implement classic relational design might have been made, but it was impossible to adapt the grown system to a complete data redesign (nobody would pay for it), hardly anybody would know how to map object relations or even optimized loading in the ORM. Such attempts were limited to a few places in the design, possibly making the data model and access even harder to maintain.
CQRS on top of n-tier?
To get acceptable performance, separate SQL queries were possibly built for certain modules, bypassing the business model with its single object iterative access. This whole structure was suddenly called a de-facto CQRS, because of separate query access (which -could- have been handled by a well-implemented, relational data model and ORM usage, as long as it wasn't supposed to be a "Big data" Google- or Stackoverflow-like workload), and the plenty of duplicated data in relational tables, made up for immediate application access.
Something better than the inappropriate table format?
OK, so I read into CQRS, and while I didn't like the use of "CQRS" as described before, the concept of an event storage instead of a relational DB looked very useful: It is unlikely to successfully enforce the introduction of the original, state-of-the-art, relational DB design and OR mapping, and even if it succeeded, it would be extremely costly. In fact, ordinary, object-oriented programming is much more "normalized" than most DB tables, due to the need to press all into the table format or create tons of tables for object graphs/aggregates. And I agree: having to take care of search indices and defragmentation, schema management and data history tracking manually, is like stone age IT, like running Ford T models and steam locomotives beside modern day cars and electric high speed trains.
Any good Experiences?
What are the experiences with using event sourcing (not necessarily full CQRS)? Does it eliminate much of the pain with relational databases? I really look forward to a kind of in-memory database with all business logic integrated, and possibly fast enough to make separate query modules dispensable!
There's a lot going on in this question and so a specific, actionable answer is not possible, but if you're looking for one then it is...
It depends on your domain.
CQRS/ES/DDD is not appropriate for solving every single problem - it is not a silver bullet. If the domain suggests that CRUD/NTier will be good enough, then that's what you should use. All of the concerns you list in your question are infrastructural or system traits and say nothing about the very thing that should inform your choice of tool or practice; what are you trying to build?
Although CQRS, ES, and DDD are very often used together they are separate concepts that are very powerful on their own.
CQRS (Command Query Responsibility Segregation): This is a very useful pattern for designing software in general. The idea is to keep things that change state (commands) separate from things that do not (queries). In many systems queries modify the state of the database, which makes it very difficult for developers to reason about what is going on.
Imagine doing a query to find out some information and realizing that the information changed because you queried it.
CQRS prohibits that kind of behavior. Commands (which cannot return information) change state, and Queries (which return information) cannot modify state. That way, you know for certain which parts of the code are free of side effects (and can therefore be called as often as you want) and which parts change state.
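A very small sketch of that rule, using a hypothetical Accounts class and DepositMoney command (both names are made up for illustration): the command handler mutates state and returns nothing, and the query returns data without touching state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositMoney:
    """A command: a request to change state."""
    account_id: str
    amount: int


class Accounts:
    def __init__(self):
        self._balances = {}

    def handle(self, command: DepositMoney) -> None:
        """Command side: changes state and returns nothing."""
        if command.amount <= 0:
            raise ValueError("amount must be positive")
        current = self._balances.get(command.account_id, 0)
        self._balances[command.account_id] = current + command.amount

    def balance_of(self, account_id: str) -> int:
        """Query side: returns information and never changes state."""
        return self._balances.get(account_id, 0)


accounts = Accounts()
result = accounts.handle(DepositMoney("acc-1", 50))

# Asking twice gives the same answer because queries have no side effects.
first = accounts.balance_of("acc-1")
second = accounts.balance_of("acc-1")
unknown = accounts.balance_of("no-such-account")
```

Note that the query uses dict.get rather than inserting a default, so even asking about an unknown account leaves the state untouched.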
DDD (Domain Driven Design): This is a Design methodology for the "Data Structure" of the code. It does not prescribe techniques for database access or many technical details. What it does is provide guidelines and concepts to structure data in an application in a way that makes it much more responsive to the actual user's needs. It also simplifies development (although it is more work than just slapping something together).
ES (Event Sourcing): Event sourcing is a data storage strategy which shifts data storage from state (the actual values of a piece of data at the current point in time) into transitions (the changes that have happened to a piece of data during its lifetime) which are called events.
There are several advantages of using ES.
First, it allows the business to store much more information regarding what happened before (a boon to Data Scientists). In traditional systems, a lot of information is lost to updates of the data, and unless those updates are explicitly logged, the information is gone forever. This does not happen in ES.
Second, storing all events makes debugging much simpler because a developer can follow the processing of the data from its beginning. An update that happened a long time ago (and would have been overwritten by another update and lost) but corrupted processing can be identified and fixed. Furthermore, the effects of the fix can span all calculations that happened between the wrong event and the last event. In a traditional system this would be impossible, as we store only the latest state.
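As a toy illustration of both advantages, assume a hypothetical account whose only events are Deposited and Withdrawn. The state is never stored; it is recomputed by replaying the log, which also lets you look at any past point or recompute everything after correcting a bad event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


def apply(balance, event):
    """Fold one event into the current state."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event: {event!r}")


def replay(events, upto=None):
    """Rebuild state from the first `upto` events (all of them by default)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


log = [Deposited(100), Withdrawn(30), Deposited(5)]

current = replay(log)                  # state now
after_two_events = replay(log, upto=2)  # state at an earlier point in history

# Suppose the second event was recorded wrongly (30 instead of 3).
# Replaying a corrected log carries the fix through every later calculation.
corrected_log = [Deposited(100), Withdrawn(3), Deposited(5)]
corrected = replay(corrected_log)
```

How a correction is actually recorded (rewriting history versus appending a compensating event) is a design decision this sketch does not try to settle.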
While it is theoretically possible to write an Event Sourced system without CQRS or DDD, it is remarkably more difficult to do so.
What is the best way or recommended practice for the flow of a database-driven ASP.NET web application? I mean database first, code first, or side by side?
Your data access code won't compile without an existing database - unless you stub (or Mock) it. So probably the database comes first.
But it is a bad idea to do whole chunks of the application in isolation. Ideally you should design and build slivers of the system - database and application - hand-in-hand. These slivers should be cohesive sub-sets of functionality, probably smaller than sub-systems. Inevitably, the act of coding screens and business rules will throw up problems in the data model. So it is good to have a data modeller or DBA who is happy to work incrementally alongside the developers.
edit
Stephanie makes an extremely pertinent point:
"the core tables which are persisting
your app's data really can't be
piecemealed. Most of the data is known
at project start. It has a form, you
need to find it."
I agree that the core entities are knowable at project start, and the physical data model can be derived from that logical data model. But I don't think it is ever possible to nail down completely the structure of any table, even a core table, at the start. This is because at the start of the design/build phase all we have to go on are the Requirements, and if there's one thing history tells us about the Requirements it is that they will change.
So, new tables will be needed and some existing tables will become obsolete. There will be columns which need to be added, columns which need to be modified, columns which need to be dropped. This is why Nature gave us the ALTER TABLE statement.
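For what it's worth, here is how small such a change can be. This sketch uses Python's built-in sqlite3 module purely so that it is self-contained; the employees table and department column are made-up examples, and the exact ALTER TABLE syntax and capabilities vary between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO employees (name) VALUES ('Ada')")

# A later requirement: record each employee's department.
conn.execute(
    "ALTER TABLE employees ADD COLUMN department TEXT DEFAULT 'Unassigned'"
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
rows = conn.execute("SELECT name, department FROM employees").fetchall()
```

Existing rows pick up the default, so the data written before the requirement changed stays usable.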
I am not suggesting that we don't design our tables, or assemble them piecemeal. I am merely suggesting that when we start designing the HR sub-system we need to worry about the EMPLOYEES table and the SALARIES table. We don't need to concern ourselves with INVENTORY or ORDERS until we commence work on Sales.
We personally start with the Domain and do things side-by-side. The important part is that we implement vertical slices of the application (fully working end-to-end features), not horizontal slices (e.g. first the whole database layer, then the data access, then the services, then the presentation): we build the application incrementally and demonstrate progress with working code after each iteration.
Applications are all about features.
You don't build apps to store data,
but to provide functionality. If we
can't agree on that, the discussion is
moot of course. Software should be
developed to satisfy the needs of its
users and not of its developers.
Well I have really no understanding of the second sentence. If you think my company pays me a good salary to write code that satisfies me and not my users you're crazy. So that argument is a strawman. Back to the first.
This is a common viewpoint of application-centric people (they) vs. database-centric people (we). They see the entire point of the exercise as being to "provide features". Those are things the clients know they want and ask for. To them, the database is just persistence required for these features. And when they are done, that's it: features delivered, database is sufficient for those features. There could be an entire Rube Goldberg machine inside the database with redundant data, severe violations of normal forms, constraints enforced by the application, what have you.
think overall usability alone outweighs database design
If the design of your database is affecting your usability then the design was bad. I have no doubt that one who strives for features will leave the database in such a state that it severely hampers usability.
Data-centric people don't look at a system as a place to provide only what's been asked for, but as a repository of Intellectual Capital that can be exploited by more than whatever the Application-du-jour is. I can't begin to describe the number of cases where one team has used the database of some other team's app to enhance their app's value. Just look at all the medical research that is nothing more than the meta-analysis of existing studies. None of that is possible if you believe that only the features of your app matter and subsequent uses of your app's data do not.
A good data model isn't inviolate. Sure you'll add to it, change it when requirements change. But if you don't completely understand your data, I don't know how anyone can begin to write code.
I guess you first need to define the data model and only then start coding. You should plan everything carefully before actually writing the code.
First is a feature list.
Then, detailed spec.
Then test plan and design of all, including databases.
Then, it wouldn't matter which to implement first.
You'll probably end up doing it "side by side".
You need some data to be able to test the application, but you need the application to be able to verify that you're storing the correct data.
Do some modelling first and then build the minimum you can for one or two features. Then when these are working add the next feature and so on.
You'll need to write some database update procedures (both the code and the rules about what and when to update) as you will have to extend your tables, but you'll need those for the final system anyway as it will have to change as new requirements come along.
Having done it quite a few times, I find myself invariably doing it like so:
1. Define the problem I'm trying to solve.
2. Write out some use-cases.
3. Have my significant other or a friend tell me if this is even a problem.
4. Sketch out a few sample screens.
5. Write flow diagrams for the use cases.
6. Ask my Rubber-duck questions.
7. Use questions to refine 1-6.
8. Write out the 'nouns'. Those become my data Model.
9. Write out the actions. Those become application logic.
10. Code data Model.
11. Code Application Logic.
12. Realize I've gotten it a little wrong.
13. Repeat 10-12 as many times as needed.
14. Ask, "Have I solved the problem"?
15. If not, rinse, lather and repeat 1-15.
This is a trick question. IMO, they both come in parallel during your planning and design phase. They are so closely related that it makes sense to do them together. Just keep in mind that your database design will be almost fully developed while your code is still in its infancy (though your application logic should be almost fully mapped out in your head or on paper).
The idea is that you're designing your solution in the context of the problem. When you're planning out your solution you will be (or should be) defining your application as a set of things and actions (nouns and verbs).
For example, a very basic helpdesk program has people and tickets. People need to create tickets, update tickets, and close tickets. The nouns that require persistent storage will comprise your database, and the nouns + actions will be contained in your application.
Sometimes your table mappings and the relationship between tables will be obvious (e.g. people create tickets, ticket.creatorID = people.personID) and other times the relationship doesn't really click in your head until you start working through use cases or until you start writing your code (e.g. different people have different access levels defining what they can do. At a glance this would seem like a simple field in a table, but in practice it is better as a separate table).
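Here is one possible shape for that helpdesk example, written with Python's built-in sqlite3 so it runs on its own (the same split applies in any stack). The nouns become tables, the actions become functions, and the access level lives in its own table. Column names beyond creatorID and personID, and the sample rows, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE access_levels (
    levelID           INTEGER PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE,
    can_close_tickets INTEGER NOT NULL
);
CREATE TABLE people (
    personID INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    levelID  INTEGER NOT NULL REFERENCES access_levels(levelID)
);
CREATE TABLE tickets (
    ticketID  INTEGER PRIMARY KEY,
    creatorID INTEGER NOT NULL REFERENCES people(personID),
    title     TEXT NOT NULL,
    status    TEXT NOT NULL DEFAULT 'open'
);
INSERT INTO access_levels VALUES (1, 'customer', 0), (2, 'agent', 1);
INSERT INTO people VALUES (1, 'Pat', 1), (2, 'Sam', 2);
""")


def create_ticket(person_id, title):
    """Action: a person creates a ticket."""
    cursor = conn.execute(
        "INSERT INTO tickets (creatorID, title) VALUES (?, ?)",
        (person_id, title),
    )
    return cursor.lastrowid


def close_ticket(person_id, ticket_id):
    """Action: only people whose access level allows it may close tickets."""
    row = conn.execute(
        "SELECT a.can_close_tickets FROM people AS p "
        "JOIN access_levels AS a ON a.levelID = p.levelID "
        "WHERE p.personID = ?",
        (person_id,),
    ).fetchone()
    if row is None or not row[0]:
        raise PermissionError("this person may not close tickets")
    conn.execute(
        "UPDATE tickets SET status = 'closed' WHERE ticketID = ?", (ticket_id,)
    )


def status_of(ticket_id):
    return conn.execute(
        "SELECT status FROM tickets WHERE ticketID = ?", (ticket_id,)
    ).fetchone()[0]
```

Because the permissions sit in access_levels rather than in a column on people, adding a new kind of permission later changes one table instead of every person row.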
It occurs to me that state control in languages like C# is not well supported.
By this, I mean, it is left up to the programmer to manage the state of in-memory objects. A common use-case is that instance variables in the domain-model are copies of information residing in persistent storage (i.e. the database). Clearly this violates the single point of authority principle, and "synchronisation" has to be managed by the developer.
I envisage a system where instead of instance variables, we have simple public access/mutator methods marked with attributes that link them to the database, and where reads and writes are mediated by a framework that decides whether to hit the database. Does such a system exist?
Am I completely missing the point, or is there some truth to this idea?
If I understand correctly what you want: any OR-Mapper with Lazy Loading works this way. For example, I use Genome, where every entity is a pure proxy and you have a fair amount of control over how the OR-Mapper caches the fields.
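To show the mechanism rather than any particular product, here is a rough sketch in Python (not C#, and not how Genome itself is implemented). A descriptor named PersistedField plays the role of the attribute-marked accessor: reads go to a hypothetical DictStore only the first time, and writes go through to the store immediately. All names here are made up.

```python
class PersistedField:
    """Descriptor: lazy read-through on get, write-through on set."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        cache = instance._cache
        if self.name not in cache:
            # First access: the "framework" decides to hit the store.
            cache[self.name] = instance._store.read(instance.id, self.name)
        return cache[self.name]

    def __set__(self, instance, value):
        instance._store.write(instance.id, self.name, value)
        instance._cache[self.name] = value


class DictStore:
    """Hypothetical stand-in for the database; it counts reads."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def read(self, row_id, field):
        self.reads += 1
        return self.rows[row_id][field]

    def write(self, row_id, field, value):
        self.rows[row_id][field] = value


class Person:
    name = PersistedField()
    email = PersistedField()

    def __init__(self, store, person_id):
        self._store = store
        self._cache = {}
        self.id = person_id


store = DictStore({1: {"name": "Ada", "email": "ada@example.com"}})
person = Person(store, 1)
```

The cache is the part that reintroduces a second copy of the data, so a real mapper also needs an invalidation or refresh policy, which this sketch leaves out.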
Actually there's the concept of data prevalence (as implemented by prevayler in Java) where the in-memory objects are the single point of authority (SPA) for the data.
Also, some object databases (such as db4o) blur the lines a bit between the object representation and the "store" representation.
On the other hand, by bringing the SPA for the data inside the application, you need to handle transactions and/or data persistence by yourself. There is some work done on transactional memory systems such as JVSTM (currently in use by the information system of my old college) but it's not in widespread use.
Conversely, if the data lives in a database, you can just commit the data when everything is good (or use the support for transactions built into the database) and be sure that data isn't corrupted or lost. You trade in the SPA principle for better data reliability and transactions (and other advantages of using a separate data store).
I am building a system that allows front-end users to define their own business objects. Defining a business object involves creating data fields for that business object and then relating it to other business objects in the system - fairly straightforward stuff. My question is, what is the most efficient storage strategy?
The requirements are:
Must support business objects with potentially 100+ fields (of all common data types)
The system will eventually support hundreds of thousands of business object instances
Business objects sometimes display data and aggregates from their relationships with other business objects
Users must be able to search for business objects by their data fields (and fields from related business objects)
The two possible solutions I can envisage are:
Have a dynamic schema such that when a new business object type is created a new table is created for storing instances of that object. The object's fields become columns in the storage table.
Have a fixed schema where instance data fields are stored as rows in basically a big long table.
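To compare the two options side by side, here is a small sketch using Python's built-in sqlite3 module (chosen only so the example is self-contained). The customer type, its fields and the field_values table are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: dynamic schema. Each user-defined type gets its own table,
# and searchable fields can be indexed like any other column.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")
conn.execute("INSERT INTO customer (id, name, city) VALUES (1, 'Acme', 'Leeds')")

# Option 2: fixed schema. Every field value of every object is one row
# in a single long table.
conn.execute("""
    CREATE TABLE field_values (
        object_id   INTEGER NOT NULL,
        object_type TEXT NOT NULL,
        field_name  TEXT NOT NULL,
        value       TEXT,
        PRIMARY KEY (object_id, field_name)
    )
""")
conn.executemany(
    "INSERT INTO field_values VALUES (?, ?, ?, ?)",
    [
        (1, "customer", "name", "Acme"),
        (1, "customer", "city", "Leeds"),
    ],
)

# The same search, "names of customers in Leeds", in both designs.
dynamic_result = conn.execute(
    "SELECT name FROM customer WHERE city = ?", ("Leeds",)
).fetchall()

fixed_result = conn.execute(
    """
    SELECT n.value
    FROM field_values AS c
    JOIN field_values AS n
      ON n.object_id = c.object_id AND n.field_name = 'name'
    WHERE c.object_type = 'customer'
      AND c.field_name = 'city'
      AND c.value = ?
    """,
    ("Leeds",),
).fetchall()
```

In the fixed schema every additional field in a search or in the output costs another self-join, and all values share one text column, which is where the searching concern comes from.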
I can see pros and cons to both approaches:
the dynamic schema allows me to index search columns
the dynamic tables are potentially limited in width by the max column size
dynamic schemas rule out / cause issues with replication
the static schema means less or even no dynamic sql generation
my guess is the static schema may perform like a dog when it comes to searching across 100,000+ objects
So what is the best solution? Is there another approach I haven't thought of?
Edit: The requirement I have been given is to build a generic system capable of supporting front-end user defined business objects. There will of course be restrictions on how these objects can be constructed and related, but the requirement itself is not up for negotiation.
My client is a service provider and requires a degree of flexibility in servicing their own clients, hence the need to create business objects.
I think your problem matches very well to a graph database like Neo4j, as it's built for the requested kind of flexibility from the beginning. It stores data as nodes and relationships/edges, and both nodes and relationships can hold arbitrary properties (in a key/value fashion). One important difference from an RDBMS is that a graph database won't need to look up the relationships in a big long table (like in your fixed schema solution), so there should be a significant performance gain there. You can find out about language bindings for Neo4j in the wiki and read what others say about it in this Stack Overflow thread. Disclaimer: I'm part of the Neo4j team.
Without much understanding of your situation...
Instead of writing a general purpose one-size-fits-all business objects system (which is the holy grail for Oracle, Microsoft, SAS, etc.), why not do it the typical way, where the requirements are gathered, and a developer designs and implements the users' business objects in an effective manner?
If your users are typical, they will create a monster, which will end up running slow, and they will hate it. Most users will view the data as an Excel sheet, and not understand relationships like: parent/child. As a result there will be some crazy objects built, and impossible-to-solve reports. You'll be forced to create scripts to manually convert many old objects to better and properly defined ones, etc...
Your requirements sound a little bit like an associative database with a front end to compose and edit entities.
I agree with KM above, unless you have a very compelling reason not to, you would be better off using a traditional approach. There are a lot of development tools and practices that allow you to build a robust and scalable system. Otherwise you will have to implement much of this yourself.
I don't know the best way to do this, because it sounds like something that has already been implemented by others. If I were asked to implement this feature, I would recommend buying a wheel instead of reinventing it.
Perhaps there are reasons you have to invent your own? If so, then you should add those reasons to the requirements you listed.
If you absolutely must be this generic, I still recommend buying a system that has been architected for this requirement. Not just the storage requirements, which are the least of the problems your customer will have; but also: how do you keep the customer from screwing up totally when given this much freedom. Some of the commercial systems already meet this challenge without going out of business because of customers messing up.
If you still need to do this on your own, then I suggest that your requirements (or perhaps those of another vendor?) must include: allow the customer to get it right, and help keep the customer from getting it wrong. You'll need some sort of UI to allow the customer to define these business objects, and the UI should validate the model that the customer builds.
I recommend a UI that works at a conceptual level. As an example, see NORMA, a Visual Studio add-in for Object-Role Modeling (the "other" ORM). Consider it as an example only, if your end users cannot afford a Visual Studio Standard license. Otherwise, you'll find that it is extensible, already produces many types of artifact (from SQL in various dialects to code), and will validate the model to see that it makes sense. End users would also be able to enter sample data that they believe should be valid, and the system will validate the data against the model.
If your customers are producing sensible (if dynamic) business objects, then the question of storage will be much simpler.
Have you thought about an XML based solution? The requirements suggested to me "Build a system that allows users to dynamically generate an XML Schema and work with XML documents based on that schema." I don't know enough about storing and querying XML documents to comment on your original question.
Another possibility might be to leverage NHibernate's ability to generate database schemas. If you can dynamically generate business objects, then you can generate XML mappings or Fluent mappings and use that to generate a normalized database schema.
Every user that I have ever talked to has always wanted "everything" in their project. Part of the job of gathering requirements is to guide the user, not just write down everything they say.
Your only hope is to build several template objects that they can add properties to. You could code your application to handle each of these object types, but still allow the user to modify each one slightly as necessary.
You need to inform the user upfront of the major flaws this type of design has. This will help you in the end, when it runs slow, or if they screw up and need help fixing something. I'd put this in writing.
How many possible objects would they really need? Perhaps you could set these up using your system first. I have developed several very customizable systems over the years and when the user is sitting at an empty screen, it is like a deer in the headlights.
In any event, good luck.