What are the differences between Datomic and EventStoreDB?

I am currently working on a research project that requires us to keep a history of the data so we can access it later on. Event sourcing naturally falls into this category of data management patterns, because it allows us to replay events up to specific points in time. Kafka or RabbitMQ could probably do this job, but they do not exactly fit our needs. So I came across EventStoreDB, a more lightweight solution for event sourcing.
While diving deeper into database models that keep change history, I also stumbled across a video by Rich Hickey, who created Datomic. The concept behind Datomic sounds quite interesting, and now I want to know what the difference between these two databases is. It would be great to hear some insights from people who have worked with both technologies or know more about them.

The underlying data model is quite different.
Datomic:
Datoms are the core of Datomic: each datom describes a change to the value of a certain attribute of an entity.
EventStoreDB:
Streams and events.
Streams represent the history of entities.
Events are tuples of (type, data, metadata).
Both offer strong total-ordering guarantees.
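To make the contrast concrete, here is a minimal sketch in Python. The field names are illustrative; neither shape is the actual wire format of either product.

```python
from typing import Any, NamedTuple

class Datom(NamedTuple):
    """Datomic's atomic fact: entity-attribute-value, plus the transaction
    that asserted it and whether it was added or retracted."""
    entity: int
    attribute: str       # e.g. ":account/balance"
    value: Any
    tx: int              # transaction id -- the source of total ordering
    added: bool          # True = assertion, False = retraction

class Event(NamedTuple):
    """EventStoreDB's unit of storage: a typed payload appended to a stream."""
    stream: str          # e.g. "account-42" -- one entity's history
    type: str            # e.g. "MoneyDeposited"
    data: dict
    metadata: dict

# The same business change expressed both ways:
datom = Datom(entity=42, attribute=":account/balance", value=150, tx=1001, added=True)
event = Event(stream="account-42", type="MoneyDeposited",
              data={"amount": 50}, metadata={"user": "alice"})
```

Roughly: Datomic records facts about attributes of entities, while EventStoreDB records domain events in per-entity streams; both keep the full history ordered.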

Related

Who should save object to persistent memory

I am currently learning UML. I looked everywhere but couldn't find an answer. Should the creator of an object always be the one to save it in persistent memory via a data access object, should that be delegated to an expert, or is it better for an object to save itself in persistent memory via a data access object?
Main alternatives
UML is agnostic in this regard. It depends on your architectural choices:
A popular approach is Repository objects, which act as a kind of collection that hides the database. You insert, retrieve, update, or delete elements in the repository, and the repository takes care of the database.
Another popular approach is the Active Record: each persistent object is in charge of inserting, updating, or deleting itself in the database.
There are several other approaches. The first two are sketched below.
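A minimal sketch of both styles in Python, using SQLite; class, table, and method names are illustrative, not prescribed by either pattern.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

# --- Repository: the domain object knows nothing about persistence ---
class Order:
    def __init__(self, order_id: int, total: float):
        self.order_id = order_id
        self.total = total

class OrderRepository:
    """Acts like an in-memory collection and hides the database behind it."""
    def __init__(self, db):
        self.db = db
    def add(self, order: Order) -> None:
        self.db.execute("INSERT INTO orders VALUES (?, ?)",
                        (order.order_id, order.total))
    def get(self, order_id: int):
        row = self.db.execute("SELECT id, total FROM orders WHERE id = ?",
                              (order_id,)).fetchone()
        return Order(*row) if row else None

# --- Active Record: the object is in charge of persisting itself ---
class ActiveOrder:
    def __init__(self, db, order_id: int, total: float):
        self.db = db
        self.order_id = order_id
        self.total = total
    def save(self) -> None:
        self.db.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)",
                        (self.order_id, self.total))

OrderRepository(db).add(Order(1, 9.99))   # repository style
ActiveOrder(db, 2, 19.99).save()          # active record style
```

Note how the Repository keeps `Order` free of any database knowledge, while `ActiveOrder` carries its connection around with it; that coupling is exactly the drawback discussed below.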
How to choose?
The scenario you describe corresponds to the second one. It is a tempting approach for CRUD-oriented applications with little domain logic. But it has several drawbacks that limit its suitability:
it does not enforce proper separation of concerns (see also the Single Responsibility Principle).
it couples your system tightly to the database, and it might be difficult at a later stage to change the underlying database.
Moreover, you would also need to take care of transactional logic (i.e. either complete a set of related changes or cancel them all) and avoid having several objects in memory that correspond to the same object in the database. While not impossible, this is more difficult with active records.
How to deepen your understanding of these alternatives?
The most comprehensive book on this topic is Fowler's "Patterns of Enterprise Application Architecture". Another book worth investing in, to deepen what's behind repositories, is Evans' DDD bible. (IMHO, you will save years of on-the-job experimentation and discovery by reading both books. You can then use the saved time to deepen your skills in other innovative domains.)

Use Event Sourcing to get rid of relational database?

I've read some articles about CQRS and Event Sourcing recently. While the first seemed to me like a highly complex and risky workaround to fix poorly performing business layers and poorly designed data access layers and data models, the second seemed like a solution to many problems.
Problems to solve using Event Sourcing:
Get rid of the relational database and object-relational mappers, like NHibernate and Entity Framework. Hardly anybody in programming wants to pay attention to things like indices, table/index fragmentation or normalization, how to design relational data, and how to code/configure the ORM (a science of its own).
Have the business model and the in-memory "database" united: an entity/aggregate service keeping all relevant items in memory, maintaining integrity by simply dumping the CUD events somewhere without much pain. Old items can be evicted from memory, dumped to a NoSQL (or whatever) store, used for aggregate calculations, reporting and search, and, if necessary, re-activated. If I understand it right, in-memory databases like VoltDB use event dumping in a similar way, but they are still relational databases, separated from the business logic.
This would also make concurrency easier: instead of locking (with possible complete system deadlocks) or optimistic locking with a general "success or fail" logic depending on whether the data has changed in the meantime (or rather complex DB code), merge rules can be implemented in code.
History: no more pain implementing auditing functions, cemetery tables, or "deleted" marker columns, or with possibly deleted data still being required.
Data duplication/search/reporting: use full-text indices instead of chasing missing relational indices, and create proper viewing areas that prepare the data for the user in the required format, instead of ugly copy routines in relational databases, with triggers, follow-up stored procedures, or even program code copying data to half a dozen different tables.
Versioning: it's a pain to get many modules running with a number of different relational database versions, each having different tables and columns that need appropriate ORM mappings. It could be much easier in a single-layer model, with the event dump accepting any object format (typically schema-less or loose-schema NoSQL documents, represented as JSON or XML). It might also be possible to upgrade old data through a "data schema change event" chain, instead of having to maintain migration scripts for relational DBs (sketched below).
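For what it's worth, a hedged sketch of that last idea, often called event upcasting: old events are migrated to the current shape as they are read. The event names and fields here are invented for illustration.

```python
def upcast_v1_to_v2(event: dict) -> dict:
    """v1 stored a single 'name'; v2 splits it into first/last name."""
    first, _, last = event["data"]["name"].partition(" ")
    return {"type": event["type"], "version": 2,
            "data": {"first_name": first, "last_name": last}}

# Chain of per-version migrations, applied until the event is current.
UPCASTERS = {1: upcast_v1_to_v2}

def read_event(event: dict, current_version: int = 2) -> dict:
    while event.get("version", 1) < current_version:
        event = UPCASTERS[event.get("version", 1)](event)
    return event

old = {"type": "CustomerRegistered", "version": 1,
       "data": {"name": "Ada Lovelace"}}
print(read_event(old))
# {'type': 'CustomerRegistered', 'version': 2,
#  'data': {'first_name': 'Ada', 'last_name': 'Lovelace'}}
```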
N-tier Business Model / Relational DB / ORM mess
An n-tier approach a decade or longer ago might have been a business layer and a data access layer. In order to keep the separation really strict, many relational features were omitted and implemented in the business layer instead: relational integrity, normalization, with the DB being what I call a "trash dump": looking like a kid had played around with SQL Server Management Studio or Access. Extremely un-normalized, with polymorphic references ("foreign key" columns referencing different source tables, identified by a "ReferenceSource" marker), abuse of the same tables for different kinds of business objects, and duplication of data to numerous other tables (and from there again elsewhere), because performance wasn't good and this was supposed to improve queries. ORM usage was without object references too, reduced to single-object load and save operations. Loading an aggregate (a graph of entities/table rows) would iterate through the graph and send a query for every set of sub-entities.
When performance got worse and, possibly, orphaned references caused serious trouble, attempts to implement classic relational design might have been made, but it was impossible to adapt the grown system to a complete data redesign (nobody would pay for it), and hardly anybody would know how to map object relations, or even optimize loading, in the ORM. Such attempts were limited to a few places in the design, possibly making the data model and access even harder to maintain.
CQRS on top of n-tier?
To get acceptable performance, separate SQL queries were possibly built for certain modules, bypassing the business model with its single-object iterative access. This whole structure was suddenly called a de facto CQRS, because of the separate query access (which could have been handled by a well-implemented relational data model and ORM usage, as long as it wasn't supposed to be a "big data" Google- or Stack Overflow-like workload) and the plentiful duplicated data in relational tables, set up for immediate application access.
Something better than the inappropriate table format?
OK, so I read into CQRS, and while I didn't like the use of "CQRS" as described before, the concept of an event storage instead of a relational DB looked very useful: it is unlikely that anyone could successfully enforce the introduction of the original, state-of-the-art relational DB design and OR mapping, and even then it would be extremely costly. In fact, ordinary object-oriented programming is much more "normalized" than most DB tables, due to the need to press everything into the table format or create tons of tables for object graphs/aggregates. And I agree: having to take care of search indices, defragmentation, schema management, and data history tracking manually is like Stone Age IT, like running Ford Model Ts and steam locomotives beside modern cars and electric high-speed trains.
Any good Experiences?
What are the experiences with using event sourcing (not necessarily full CQRS)? Does it eliminate much of the pain of relational databases? I am really looking forward to a kind of in-memory database with all the business logic integrated, possibly fast enough to make separate query modules dispensable!
There's a lot going on in this question and so a specific, actionable answer is not possible, but if you're looking for one then it is...
It depends on your domain.
CQRS/ES/DDD is not appropriate for solving every single problem - it is not a silver bullet. If the domain suggests that CRUD/n-tier will be good enough, then that's what you should use. All of the concerns you list in your question are infrastructural or system traits and say nothing about the very thing that should inform your choice of tool or practice: what are you trying to build?
Although CQRS, ES, and DDD are very often used together they are separate concepts that are very powerful on their own.
CQRS (Command Query Responsibility Segregation): This is a very useful pattern for designing software in general. The idea is to keep things that change state (commands) separate from things that do not (queries). In many systems, queries modify the state of the database, and this makes it very difficult for developers to reason about what is going on.
Imagine doing a query to find out some information and realizing that the information changed because you queried it.
CQRS prohibits those kinds of behaviors. Commands (which cannot return information) change state, and queries (which return information) cannot modify state. That way, you have certainty about which parts of the code are idempotent (and can therefore be called as much as you want with no side effects) and which parts change state.
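A minimal sketch of that split, assuming a simple in-memory store; all class and method names are illustrative.

```python
class AccountCommands:
    """Commands mutate state and return nothing."""
    def __init__(self, store: dict):
        self.store = store
    def deposit(self, account_id: str, amount: int) -> None:
        self.store[account_id] = self.store.get(account_id, 0) + amount

class AccountQueries:
    """Queries return information and never mutate state."""
    def __init__(self, store: dict):
        self.store = store
    def balance(self, account_id: str) -> int:
        return self.store.get(account_id, 0)

store: dict = {}
AccountCommands(store).deposit("acc-1", 100)
print(AccountQueries(store).balance("acc-1"))  # 100 -- safe to call repeatedly
```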
DDD (Domain Driven Design): This is a Design methodology for the "Data Structure" of the code. It does not prescribe techniques for database access or many technical details. What it does is provide guidelines and concepts to structure data in an application in a way that makes it much more responsive to the actual user's needs. It also simplifies development (although it is more work than just slapping something together).
ES (Event Sourcing): Event sourcing is a data storage strategy that shifts data storage from state (the actual values of a piece of data at the current point in time) to transitions (the changes that have happened to a piece of data during its lifetime), which are called events.
There are several advantages of using ES.
First, it allows the business to store much more information regarding what happened before (a boon to data scientists). In traditional systems, a lot of information is lost through updates of the data, and unless those updates are explicitly logged, the information is gone forever. This does not happen in ES.
Second, storing all events makes debugging much simpler, because a developer can now follow the processing of the data from its beginning. An update that happened a long time ago (and would have been overwritten by a later update and lost) but corrupted processing can be identified and fixed. Furthermore, the fix can even be propagated across all calculations that happened between the wrong event and the last event. In a traditional system this would be impossible, as we only store the latest state.
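A minimal sketch of the core idea, with made-up event types, showing that the current state is just a fold over the stored transitions; replaying a corrected event list re-derives a corrected state.

```python
from functools import reduce

events = [
    {"type": "AccountOpened",  "data": {"owner": "alice"}},
    {"type": "MoneyDeposited", "data": {"amount": 100}},
    {"type": "MoneyWithdrawn", "data": {"amount": 30}},
]

def apply(state: dict, event: dict) -> dict:
    """Pure transition function: old state + event -> new state."""
    t, d = event["type"], event["data"]
    if t == "AccountOpened":
        return {"owner": d["owner"], "balance": 0}
    if t == "MoneyDeposited":
        return {**state, "balance": state["balance"] + d["amount"]}
    if t == "MoneyWithdrawn":
        return {**state, "balance": state["balance"] - d["amount"]}
    return state  # unknown events are ignored

state = reduce(apply, events, {})
print(state)  # {'owner': 'alice', 'balance': 70}
```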
While it is theoretically possible to write an Event Sourced system without CQRS or DDD, it is remarkably more difficult to do so.

At What Point Does a Form Lose its "Model-ness" and Become a Document?

I have been thinking and learning a lot about forms lately, while trying to add advanced extensions to the Spree eCommerce platform: subscriptions, events, donations, and all kinds of surveys.
Every example I have ever encountered (in blogs, in the docs, in screencasts, in source code, etc.) makes forms out of models, but they never go into anything semi-structured or unstructured (or just really dynamic). So you have forms like:
Contact Form (User Model, maybe divided into an Address Model too)
Registration Form (User Model, Account Model, Address Model, etc.)
Blog Post Form (Post Model, Tag Model, etc.)
Checkout Form (Shipping Model, Order Model, LineItem Model, etc.)
All of those make perfect sense: they are the culmination of tens of thousands, millions even, of man-hours. Tons of people have slowly abstracted those things down into nearly universal "models" that can be saved into a database table. So now we all create models for them and make database tables for them.
But there are so many other things that can't be boiled down to those specific models. Things like a survey for a specific event, with form fields such as:
Are you Pregnant?
How many kids do you have?
Have you ever been sick?
What's your fastest mile?
If we started to save those things to the database in tables, we would have hundreds and thousands of database tables, one for each set of questions, or "survey".
So my thinking is, there's got to be some point at which you stop creating specific models like "Post" and "Order" and just start making a "Form" or "Survey" model (Form ~ Survey ~ Questionnaire, to some degree).
Everything would be boiled down to these few models:
Survey
Question
Answer
ResponseSet (answers to questions in survey)
Response (specific response in response set)
And from those you could create any type of "Form" you wanted.
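For illustration, a hedged sketch of those few generic models as Python dataclasses; the field names are assumptions, and a real implementation (the Surveyor gem, say) would add answer typing, validation, and persistence on top.

```python
from dataclasses import dataclass

@dataclass
class Question:
    question_id: int
    text: str            # e.g. "How many kids do you have?"

@dataclass
class Answer:
    question_id: int
    text: str            # a predefined choice for the question, if any

@dataclass
class Survey:
    survey_id: int
    title: str
    questions: list      # list[Question]

@dataclass
class Response:
    question_id: int
    value: str           # the specific response given

@dataclass
class ResponseSet:
    survey_id: int
    respondent: str
    responses: list      # list[Response] -- one per answered question
```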
So my question is basically: when do you, in practical, day-to-day client projects, stop making forms backed by a bunch of models (a "Checkout" form is basically a form for the "Order" in Spree, but that easily requires 10 database models), and just start using Question/Answer, or Field/Input, or Key/Value? Practically?
I'm just looking for something like "when we built our online tutoring system, we didn't end up creating a bunch of SomeTutorialModel objects which extend TutorialModel, because that would've added too many tables to our database. Instead, we just used the Surveyor gem". Something along those lines :).
There's not much out there on this semi-structured type of data, but lots when you can boil it down to something super concrete.
It seems that if you used a document database like CouchDB, you'd end up being able to create all kinds of Model objects in Ruby, for example, and could get them out with some clever view tricks. But with MySQL and the like, it seems insane.
Your question is quite broad, so instead of giving direct answers I will mention these points:
1. Models often reflect the target (core) domain of the application, so the boundary between key/value data and models is a domain question.
2. AFAIK, Google, for example, uses relational databases even to store key/value data, so everything a document database stores can also be stored relationally.
3. All your questions are basically about modeling and abstraction, which is hard to explain briefly or in general.

User defined data objects - what is the best data storage strategy?

I am building a system that allows front-end users to define their own business objects. Defining a business object involves creating data fields for that business object and then relating it to other business objects in the system - fairly straightforward stuff. My question is: what is the most efficient storage strategy?
The requirements are:
Must support business objects with potentially 100+ fields (of all common data types)
The system will eventually support hundreds of thousands of business object instances
Business objects sometimes display data and aggregates from their relationships with other business objects
Users must be able to search for business objects by their data fields (and fields from related business objects)
The two possible solutions I can envisage are:
Have a dynamic schema, such that when a new business object type is created, a new table is created for storing instances of that object. The object's fields become columns in the storage table.
Have a fixed schema, where instance data fields are stored as rows in, basically, one big long table.
I can see pros and cons to both approaches:
the dynamic schema allows me to index search columns
the dynamic tables are potentially limited in width by the max column size
dynamic schemas rule out / cause issues with replication
the static schema means less or even no dynamic sql generation
my guess is the static schema may perform like a dog when it comes to searching across 100,000+ objects
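For illustration, a minimal sketch of the fixed-schema ("big long table") option, often called an entity-attribute-value (EAV) layout, using SQLite; all table and column names are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE object_types (type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE objects      (object_id INTEGER PRIMARY KEY, type_id INTEGER);
    -- the "big long table": one row per field value
    CREATE TABLE field_values (
        object_id  INTEGER,
        field_name TEXT,
        value      TEXT,   -- everything stringified (or one column per type)
        PRIMARY KEY (object_id, field_name)
    );
    CREATE INDEX ix_search ON field_values (field_name, value);
""")

# storing one user-defined "Customer" instance
db.execute("INSERT INTO objects VALUES (1, 1)")
db.executemany("INSERT INTO field_values VALUES (?, ?, ?)",
               [(1, "name", "ACME Corp"), (1, "country", "DE")])

# searching by field -- this is the query shape that gets slow at scale,
# since every field lookup is another self-join on this one table
rows = db.execute("""SELECT object_id FROM field_values
                     WHERE field_name = 'country' AND value = 'DE'""").fetchall()
print(rows)  # [(1,)]
```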
So what is the best solution? Is there another approach I haven't thought of?
Edit: The requirement I have been given is to build a generic system capable of supporting front-end user defined business objects. There will of course be restrictions on how these objects can be constructed and related, but the requirement itself is not up for negotiation.
My client is a service provider and requires a degree of flexibility in servicing their own clients, hence the need to create business objects.
I think your problem matches a graph database like Neo4j very well, as it was built for this kind of flexibility from the beginning. It stores data as nodes and relationships/edges, and both nodes and relationships can hold arbitrary properties (in a key/value fashion). One important difference from an RDBMS is that a graph database won't need to look up relationships in one big long table (as in your fixed-schema solution), so there should be a significant performance gain. You can find out about language bindings for Neo4j in the wiki and read what others say about it in this Stack Overflow thread. Disclaimer: I'm part of the Neo4j team.
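For illustration, a short sketch using Neo4j's official Python driver; the connection details, labels, and properties are placeholders, and the point is simply that nodes carry arbitrary key/value properties without a predefined schema.

```python
from neo4j import GraphDatabase

# placeholder URI and credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # a user-defined business object and a relationship to another one;
    # no table or schema had to exist beforehand
    session.run(
        "CREATE (c:Customer {name: $name, country: $country})"
        "-[:PLACED]->(:Order {total: $total})",
        name="ACME Corp", country="DE", total=99.5)

driver.close()
```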
Without much understanding of your situation...
Instead of writing a general purpose one-size-fits-all business objects system (which is the holy grail for Oracle, Microsoft, SAS, etc.), why not do it the typical way, where the requirements are gathered, and a developer designs and implements the users' business objects in an effective manner?
If your users are typical, they will create a monster, which will end up running slow, and they will hate it. Most users will view the data as an Excel sheet, and not understand relationships like: parent/child. As a result there will be some crazy objects built, and impossible-to-solve reports. You'll be forced to create scripts to manually convert many old objects to better and properly defined ones, etc...
Your requirements sound a little bit like an associative database with a front end to compose and edit entities.
I agree with KM above: unless you have a very compelling reason not to, you would be better off using a traditional approach. There are a lot of development tools and practices that allow you to build a robust and scalable system that way. Otherwise you will have to implement much of this yourself.
I don't know the best way to do this, because it sounds like something that has already been implemented by others. If I were asked to implement this feature, I would recommend buying a wheel instead of reinventing it.
Perhaps there are reasons you have to invent your own? If so, then you should add those reasons to the requirements you listed.
If you absolutely must be this generic, I still recommend buying a system that has been architected for this requirement. Not just for the storage requirements, which are the least of the problems your customer will have, but also for this: how do you keep the customer from screwing up totally when given this much freedom? Some of the commercial systems already meet this challenge without going out of business because of customers messing up.
If you still need to do this on your own, then I suggest that your requirements (or perhaps those of another vendor?) must include: allow the customer to get it right, and help keep the customer from getting it wrong. You'll need some sort of UI to allow the customer to define these business objects, and the UI should validate the model that the customer builds.
I recommend a UI that works at a conceptual level. As an example, see NORMA, a Visual Studio add-in for Object-Role Modeling (the "other" ORM). Consider it as an example only if your end users cannot afford a Visual Studio Standard license. Otherwise, you'll find that it is extensible, already produces many types of artifacts (from SQL in various dialects to code), and will validate the model to see that it makes sense. End users would also be able to enter sample data that they believe should be valid, and the system will validate the data against the model.
If your customers are producing sensible (if dynamic) business objects, then the question of storage will be much simpler.
Have you thought about an XML based solution? The requirements suggested to me "Build a system that allows users to dynamically generate an XML Schema and work with XML documents based on that schema." I don't know enough about storing and querying XML documents to comment on your original question.
Another possibility might be to leverage NHibernate's ability to generate database schemas. If you can dynamically generate business objects, then you can generate XML mappings or Fluent mappings and use that to generate a normalized database schema.
Every user that I have ever talked to has always wanted "everything" in their project. Part of the job of gathering requirements is to guide the user, not just write down everything they say.
Your only hope is to build several template objects that they can add properties to. You could code your application to handle each of these object types, but still allow the user to slightly modify each as necessary.
You need to inform the user upfront of the major flaws this type of design has. This will help you in the end, when it runs slow, or if they screw up and need help fixing something. I'd put this in writing.
How many possible objects would they really need? Perhaps you could set these up using your system first. I have developed several very customizable systems over the years and when the user is sitting at an empty screen, it is like a deer in the headlights.
In any event, good luck.

Pros/cons of document-based databases vs. relational databases

I've been trying to see if I can accomplish some requirements with a document-based database, in this case CouchDB. Two generic requirements:
CRUD of entities with some fields which have a unique index on them
ecommerce web app like eBay (better description here).
And I'm beginning to think that a document-based database isn't the best choice to address these requirements. Furthermore, I can't imagine any use for a document-based database (maybe my imagination is too limited).
Can you explain to me whether I am asking for pears from an elm (i.e. asking for the impossible) when I try to use a document-oriented database for these requirements?
You need to think about how to approach the application in a document-oriented way. If you simply try to replicate how you would model the problem in an RDBMS, then you will fail. There are also different trade-offs that you might want to make. Remember that CouchDB's design assumes you will have an active cluster of many nodes that could fail at any time. How is your app going to handle one of the database nodes disappearing from under it?
One way to think about it is to imagine you didn't have any computers, just paper documents. How would you create an efficient business process using bits of paper being passed around? How can you avoid bottlenecks? What if something goes wrong?
Another angle you should think about is eventual consistency, where you will get into a consistent state eventually but may be inconsistent for some period of time. This is anathema in RDBMS land but extremely common in the real world. The canonical transaction example is transferring money between bank accounts. How does this actually happen in the real world: through a single atomic transaction, or through different banks issuing credit and debit notices to each other? What happens when you write a cheque?
So lets look at your examples:
CRUD of entities with some fields with unique index on it.
If I understand this correctly, in CouchDB terms you want a collection of documents where some named value is guaranteed to be unique across all those documents? That case isn't generally supportable, because documents may be created on different replicas.
So we need to look at the real-world problem and see if we can model that. Do you really need them to be unique? Can your application handle multiple docs with the same value? Do you need to assign a unique identifier? Can you do that deterministically? A common scenario where this is required is where you need a unique sequential identifier. This is tough to solve in a replicated environment. In fact, if the unique id is required to be strictly sequential with respect to creation time, it's impossible if you need the id straight away. You need to relax at least one of those constraints.
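One thing CouchDB does guarantee is that a document's _id is unique within a database, so a common deterministic workaround is to encode the unique value into the _id itself. A hedged sketch over CouchDB's plain HTTP API; the URL and id scheme are assumptions, and note this still does not protect you across replicas until they sync.

```python
import requests

def create_user(email: str, doc: dict) -> bool:
    """Create a user keyed by email; returns False if that email is taken.
    CouchDB enforces _id uniqueness, so the PUT acts as the unique check."""
    resp = requests.put(f"http://localhost:5984/users/user:{email}", json=doc)
    return resp.status_code == 201   # 409 Conflict means the id already exists

create_user("alice@example.com", {"name": "Alice"})   # True on first call
create_user("alice@example.com", {"name": "Clone"})   # False -- id is taken
```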
ecommerce web app like ebay
I'm not sure what to add here as the last comment you made on that post was to say "very useful! thanks". Was there something missing from the approach outlined there that is still causing you a problem? I thought MrKurt's answer was pretty full and I added a little enhancement that would reduce contention.
Is there a need to normalize the data?
Yes: Use relational.
No: Use document.
I am in the same boat. I am loving CouchDB at the moment, and I think the whole functional style is great. But when exactly do we start to use it in earnest for applications? I mean, yes, we can all start to develop applications extremely quickly, cruft-free, with all those nasty hang-ups about normal form left by the wayside, and without using schemas. But, to coin a phrase, "we are standing on the shoulders of giants". There are good reasons to use an RDBMS, to normalise, and to use schemas. My old Oracle head is reeling thinking about data without form.
My main wow factor with CouchDB is the replication and the versioning system working in tandem.
I have been racking my brain for the last month trying to grok the storage mechanisms of CouchDB. Apparently it uses B-trees, but doesn't store data based on normal form. Does this mean that it is really, really smart, realises that bits of data are replicated, and just makes a pointer to the relevant B-tree entry?
So far I am thinking of XML documents, config files, resource files streamed to Base64 strings.
But would I use CouchDB for structural data? I don't know; any help on this is greatly appreciated.
It might be useful for storing RDF data or even free-form text.
A possibility is to have a main relational database that stores definitions of items that can be retrieved by their IDs, and a document database for the descriptions and/or specifications of those items. For example, you could have a relational database with a Products table with the following fields:
ProductID
Description
UnitPrice
LotSize
Specifications
And that Specifications field would actually contain a reference to a document with the technical specifications of the product. This way, you have the best of both worlds.
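A minimal sketch of that hybrid, with the document store stood in for by a plain dict; all table, field, and id names are illustrative.

```python
import json
import sqlite3

rel = sqlite3.connect(":memory:")
rel.execute("""CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    description TEXT,
    unit_price  REAL,
    lot_size    INTEGER,
    spec_doc_id TEXT   -- points at a document in the document store
)""")
rel.execute("INSERT INTO products VALUES (1, 'Cordless drill', 89.9, 10, 'spec:drill-1')")

# the referenced specification lives in the document database
spec_docs = {"spec:drill-1": {"voltage": "18V", "torque_settings": 20,
                              "accessories": ["charger", "case"]}}

row = rel.execute("SELECT description, spec_doc_id FROM products "
                  "WHERE product_id = 1").fetchone()
print(row[0], json.dumps(spec_docs[row[1]]))
```

The relational side keeps the uniformly structured, indexable fields; the free-form specification, which varies per product, stays schema-less in the document store.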
Document-based DBs are best suited for storing, well, documents. Lotus Notes is a common implementation, and Notes email is an example. For what you are describing (e-commerce, CRUD, etc.), relational DBs are better designed for the storage and retrieval of data items/elements that are indexed (as opposed to documents).
Re CRUD: the whole REST paradigm maps directly to CRUD (and vice versa). So if you know that you can model your requirements with resources (identifiable via URIs) and a basic set of operations (namely CRUD), you may be very near to a REST-based system, which quite a few document-oriented systems provide out of the box.
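For illustration, the conventional CRUD-to-HTTP-verb mapping, sketched against a CouchDB-style HTTP API; the URL and document are placeholders.

```python
# Create -> PUT/POST   Read -> GET   Update -> PUT   Delete -> DELETE
import requests

BASE = "http://localhost:5984/products"             # placeholder database URL
requests.put(f"{BASE}/42", json={"name": "Drill"})  # Create (updates and deletes
doc = requests.get(f"{BASE}/42").json()             # Read    additionally require
print(doc)                                          #         the doc's current _rev)
```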