which data model storage strategy? [closed] - database

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I am trying to come out with a good design for the storage of a data model. The language is python, but I guess this is fairly agnostic.
At the moment I envision three possible strategies:
Object database
The datamodel is a network of objects. When I create them, I specify them as descendant from a persistence object. Example:
# PersistentObject is the persistence base class provided by the object database
class Tyres(PersistentObject):
    def __init__(self, brand):
        self._brand = brand

class Car(PersistentObject):
    def __init__(self, model):
        self._model = model
        self._tyres = None

    def addTyres(self, tyres):
        self._tyres = tyres

    def model(self):
        return self._model
The client code is not aware of the persistence: it manipulates the object as if it were in memory, and the persistence object takes care of everything without the client code knowing. Retrieval can be done via a keyed lookup of a database object. This is the approach that the Zope Object Database (among many others) uses. The advantages are lazy retrieval and that writes touch only the objects that changed, without retrieving the ones that are untouched.
Shelving objects
The data model given above is represented in memory, but a database is then used to push or pull data as monolithic entities. For example:
car = Car("BMW")
tyres = Tyres("Bridgestone")
car.addTyres(tyres)
db.store(car)
This is what a pickle-based solution does. It is, in some sense, similar to the previous solution, with the only difference that you store the object as a single bundle and retrieve it again as a single bundle.
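A minimal sketch of the shelving strategy, using Python's standard `shelve` module (which pickles values under the hood). To stay self-contained, the car is modelled here as a plain nested dict rather than the `Car`/`Tyres` classes above, and the key `car-1` and the file name `garage` are made up for illustration. It shows the monolithic nature of the approach: changing the nested tyres has no effect on disk until the whole bundle is stored again.

```python
import os
import shelve
import tempfile


def shelving_demo():
    """Store a car as one bundle, then modify it and store it again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "garage")  # hypothetical shelf file name

        # Push the whole object graph as a single pickled bundle.
        with shelve.open(path) as db:
            db["car-1"] = {"model": "BMW", "tyres": {"brand": "Bridgestone"}}

        with shelve.open(path) as db:
            car = db["car-1"]  # pulls the entire bundle, tyres included
            car["tyres"]["brand"] = "Michelin"  # changes the in-memory copy only

            brand_before_store = db["car-1"]["tyres"]["brand"]
            db["car-1"] = car  # the whole bundle is written back
            brand_after_store = db["car-1"]["tyres"]["brand"]

    return brand_before_store, brand_after_store


if __name__ == "__main__":
    print(shelving_demo())
```

The trade-off is that even a one-field change costs a full load and a full store of the bundle, which is the main difference from the object database strategy.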
The Facade
A single database class with convenience methods. Client code never handles objects, only ids. Example:
class Database:
    def __init__(self):
        # set up the connection
        pass

    def createCar(self, model):
        # creates the car, returns a numeric key car_id
        raise NotImplementedError

    def createTyresForCar(self, car_id, brand):
        # creates the tyres for car_id, returns a numeric id tyres_id
        raise NotImplementedError

    def getCarModel(self, car_id):
        # returns the car model from the car identifier
        raise NotImplementedError

    def getTyresBrand(self, car_id, tyres_id):
        # returns the tyre brand for tyres_id in car_id.
        # if tyres_id is not in car_id, raises an error.
        # apparently redundant but it's not guaranteed that
        # tyres_id identifies uniquely the tyres in the database.
        raise NotImplementedError
This solution is rather controversial. The database class can have a lot of responsibilities, but I kind of have the feeling that this is the philosophy used in SOAP: you don't get to manipulate an object directly, you perform inquiries for object properties to a remote server. In the absence of SQL, this would likely be the interface to a relational database: db.createTable(), db.insert(), db.select(). SQL simplifies this to a very simple db interface, db.query(sql_string), at the price of parsing and executing a language (SQL). You still get to operate on the subparts of the data model you are interested in, without touching the others.
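For illustration, here is one way the facade could be backed by a real store. This is only a sketch under assumptions: it uses the standard `sqlite3` module, and the table names, column names, and the choice of `KeyError` for failed lookups are invented here, not part of the design above.

```python
import sqlite3


class Database:
    """Facade sketch: callers see only ids and plain values, never rows."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        # Hypothetical schema; the original design does not specify one.
        self._conn.executescript(
            """
            CREATE TABLE IF NOT EXISTS car (
                id INTEGER PRIMARY KEY,
                model TEXT NOT NULL
            );
            CREATE TABLE IF NOT EXISTS tyres (
                id INTEGER PRIMARY KEY,
                car_id INTEGER NOT NULL REFERENCES car(id),
                brand TEXT NOT NULL
            );
            """
        )

    def createCar(self, model):
        with self._conn:  # commits on success, rolls back on error
            cur = self._conn.execute(
                "INSERT INTO car (model) VALUES (?)", (model,)
            )
        return cur.lastrowid

    def createTyresForCar(self, car_id, brand):
        with self._conn:
            cur = self._conn.execute(
                "INSERT INTO tyres (car_id, brand) VALUES (?, ?)",
                (car_id, brand),
            )
        return cur.lastrowid

    def getCarModel(self, car_id):
        row = self._conn.execute(
            "SELECT model FROM car WHERE id = ?", (car_id,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no car with id {car_id}")
        return row[0]

    def getTyresBrand(self, car_id, tyres_id):
        # Both ids are checked, so tyres of another car are rejected.
        row = self._conn.execute(
            "SELECT brand FROM tyres WHERE id = ? AND car_id = ?",
            (tyres_id, car_id),
        ).fetchone()
        if row is None:
            raise KeyError(f"tyres {tyres_id} do not belong to car {car_id}")
        return row[0]
```

Every new entity or property adds methods to this one class, which is the growth risk of the design; on the other hand, the backend is completely hidden from the client code.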
I would like to ask your opinion about the three designs, and in particular the third. When is it a good design, if ever?
The inverted logic
This is something I've seen in the MediaWiki code. Instead of having something like
db.store(obj)
they have
obj.storeOn(db)
Edit: The example datamodel I show is a bit simple. My real aim is to create a graph-based datamodel (if anyone wants to participate in the project I would be honored). The third solution strongly encapsulates the written datamodel (as opposed to the in-memory one) and masks the backend, but what worries me is that it risks blowing up, as there's only one central class with all the methods exposed. I must be honest, I don't like the third case, but I thought of it as a possible solution, so I wanted to put it on the table. There could be good in it.
Edit 2 : added the inverted logic entry

The first design is most compatible with Domain-Driven Design. Having the persistence implementation be fully private to an object means you can use the object without regard to its relational representation. It can be helpful for an object to expose only methods that relate to its domain-specific behavior, not low-level CRUD operations. The high-level methods are the only API contract you want to offer to consumers of that object (i.e. you don't want just anyone to be able to delete the car). You can implement complex data relationships and only code them in one place.
The second design can be used with the Visitor pattern. A car object knows what parts of it need to be persisted, but it doesn't have a connection to the database. So you pass the car object to a database connection instance. Internally the db knows how to call an object it's given. Presumably the car implements some "db-callable" interface.
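A rough sketch of what such a "db-callable" interface might look like. All names here (`Persistable`, `persistence_key`, `persistent_state`, `InMemoryStore`) are hypothetical, and the store is a dict standing in for a real connection; it is a simple callback interface in the spirit of the paragraph above rather than a full Visitor implementation.

```python
from typing import Any, Dict, Optional, Protocol


class Persistable(Protocol):
    """Hypothetical 'db-callable' interface that the store relies on."""

    def persistence_key(self) -> str: ...

    def persistent_state(self) -> Dict[str, Any]: ...


class Car:
    def __init__(self, model: str) -> None:
        self._model = model
        self._tyres_brand: Optional[str] = None
        self._cache: Dict[str, Any] = {}  # transient, deliberately not persisted

    def add_tyres(self, brand: str) -> None:
        self._tyres_brand = brand

    # The car knows which parts of it need to be persisted...
    def persistence_key(self) -> str:
        return f"car:{self._model}"

    def persistent_state(self) -> Dict[str, Any]:
        return {"model": self._model, "tyres": self._tyres_brand}


class InMemoryStore:
    """Stand-in for a database connection."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}

    # ...and the store knows how to call any object it is given.
    def store(self, obj: Persistable) -> None:
        self.rows[obj.persistence_key()] = obj.persistent_state()
```

The car never holds a reference to the store, so the same object can be handed to different store implementations.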
The third design is helpful for implementing an Adapter pattern. Every database brand's API is different, and every flavor of SQL is slightly different. If you have a generic API to plain database operations, you can encapsulate those differences, and swap out a different implementation that knows how to talk to the respective brand of database.

It's hard to say, because your example is obviously contrived.
The decision needs to be made based on how often your data model will change. IMHO cars aren't often gathering new parts; so I would go with a static model in the database of all the items you wish to model, and then a table linking all those together, but it may be wrong for what you are actually doing.
I'd suggest you should talk to us about the actual data you need to model.

Related

DDD, Databases and Lists of Data

I'm at the beginning of my first "real" software project, and I'd like to start off right. The concept of DDD seems like a very clean approach which separates the various software parts, however I'm having trouble implementing this in reality.
My software is a measurement tracker and essentially stores lists of measurement data, each consisting of a timestamp and the data value.
My Domain Models
class MeasurementDM{
    string Name{get;set;}
    List<MeasurementPointDM> MeasurementPoints{get;set;}
}
class MeasurementPointDM{
    DateTime Time{get;set;}
    double Value{get;set;}
}
My Persistence Models:
class MeasurementPM{
    string Id{get;set;} //Primary key
    string Name{get;set;} //Data from DomainModel to store
}
class MeasurementPointPM{
    string Id{get;set;} //Primary Key
    string MeasurementId{get;set;} //Key of Parent measurement
}
I now have the following issues:
1) Because I want to keep my Domain Models pure, I don't want or need the Database Keys inside those classes. This is no problem when building my Domain Models from the Database, but I don't understand how to store them, as the Domain Model no longer knows the Database Id. Should I be including this in the Domain Model anyway? Should I create a Dictionary mapping Domain objects to Database ids when I retrieve them from the Database?
2) The measurement points essentially have the same Id problem as the measurements themselves. Additionally I'm not sure what the right way is to store the MeasurementPoints themselves. Above, each MeasurementPointPM knows to which MeasurementPM it belongs. When I query, I simply select MeasurementPoints based on their Measurement key. Is this a valid way to store such data? It seems like this will explode as more and more measurements are added. Would I be better off serializing my list of MeasurementPoints to a string, and storing the whole list as an nvarchar? This would make adding and removing datapoints more difficult, as I'd always need to deserialize and reserialize the whole list.
I'm having difficulty finding a good example of DDD that handles these problems, and hopefully someone out there can help me out.
My software is a measurement tracker and essentially stores lists of measurement data, each consisting of a timestamp and the data value.
You may want to have a careful think about whether you are describing a service or a database. If your primary use case is storing information that comes from somewhere else, then introducing a domain model into the mix may not make your life any better.
Domain models tend to be interesting when new information interacts with old information. So if all you have are data structures, it's going to be hard to discover a good model (because the critical element -- how the model entities change over time -- is missing).
That said....
I don't understand how to store them, as the Domain Model no longer knows the Database Id.
This isn't your fault. The literature sucks.
The most common answer is that people are allowing their models to be polluted with O/RM concerns. For instance, if you look at the Cargo entity from the Citerus sample application, you'll find these lines hidden at the bottom:
Cargo() {
    // Needed by Hibernate
}

// Auto-generated surrogate key
private Long id;
This is an indirect consequence of the fact that the "repository" pattern provides the illusion of an in-memory collection of objects that maintain their own state, when the reality under the covers is that you are copying values between memory and durable storage.
Which is to say, if you want a clean domain model, then you are going to need a separate in memory representation for your stored data, and functions to translate back and forth between the two.
Put another way, what you are running into is a violation of the Single Responsibility Principle -- if you are using the same types to model your domain that you use to manage your persistence, the result is going to be a mix of the two concerns.
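To make the "separate in-memory representation plus translation functions" idea concrete, here is a small sketch. It is written in Python rather than the question's C#, and it rests on assumptions: the row types carry `time` and `value` fields that the question's persistence models omit, point ids are derived as `<measurement id>-<index>`, and the database id is supplied by the caller (for example a repository that remembers which id it loaded each domain object from).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class MeasurementPoint:  # domain model: no database key
    time: datetime
    value: float


@dataclass
class Measurement:  # domain model: no database key
    name: str
    points: List[MeasurementPoint] = field(default_factory=list)


@dataclass
class MeasurementRow:  # persistence model
    id: str
    name: str


@dataclass
class MeasurementPointRow:  # persistence model
    id: str
    measurement_id: str
    time: str  # stored as ISO 8601 text (an assumption)
    value: float


def to_rows(
    measurement_id: str, measurement: Measurement
) -> Tuple[MeasurementRow, List[MeasurementPointRow]]:
    """Translate a domain object into rows; the id comes from outside."""
    row = MeasurementRow(id=measurement_id, name=measurement.name)
    point_rows = [
        MeasurementPointRow(
            id=f"{measurement_id}-{index}",  # hypothetical id scheme
            measurement_id=measurement_id,
            time=point.time.isoformat(),
            value=point.value,
        )
        for index, point in enumerate(measurement.points)
    ]
    return row, point_rows


def to_domain(
    row: MeasurementRow, point_rows: List[MeasurementPointRow]
) -> Measurement:
    """Translate rows back into a key-free domain object."""
    return Measurement(
        name=row.name,
        points=[
            MeasurementPoint(datetime.fromisoformat(r.time), r.value)
            for r in point_rows
        ],
    )
```

The domain classes stay free of keys; the cost is the extra mapping code and the need for something outside the domain model to keep track of ids.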
So essentially you would say that some minimal pollution of the domain model, for example an Id, is standard practice.
Less strong; I would say that it is a common practice. Fundamentally, a lot of people, particularly in the early stages of a project, don't value having a boundary between their domain model and their persistence plumbing.
Could it make sense to have every Domain Model inherit from a base class or implement an interface that forces the creation of Unique Id?
It could. There are a lot of examples on the web where domain entities extend some generic Entity or Aggregate pattern.
The really interesting questions are
What are the immediate costs and benefits of doing that?
What are the deferred costs and benefits of doing that?
In particular, does that make things easier or harder to change?

Data and storage design and modeling for filters by details

TL;DR
I have an architecture issue which boils down to filtering entities by a predefined set of common filters. The input is a set of products. Each product has details. I need to design a filtering engine so that I can (easily and quickly) resolve this task:
"Filter out collection of products with specified details"
Requirements
The user may specify any possible filtering, with support for precedence and nested filters. A bare example is (weight=X AND (color='red' OR color='green')) OR price<1000. The requests should go via HTTP / REST, but that's insignificant (it only adds the issue of translating filters from a URI to some internal model). Any comparison operator should be supported (equality, inequality, less than, etc.).
Specifics
Model
There is no fixed model definition - in fact I am free to choose one. To make it simpler I am using simple key=>value pairs for details. So at the very minimum it comes down to:
class Value extends Entity implements Arrayable
{
    protected $key;
    protected $value;
    //getters/setters for key/value here
}
for simple value for product detail and something like
class Product extends Entity implements Arrayable
{
    protected $id;
    /**
     * @var Value[]
     */
    protected $details;
    //getters/setters, more properties that are omitted
}
for the product. Now, regarding the data model, there is a first question: how to design the filtering model? I have a simple idea of implementing it as, let's say, a recursive iterator, which will be a regular tree structure built according to the incoming user request. The difficulties which I certainly need to solve here are:
Quickly building the model structure from the user request
Making the structure easy to modify
Easily translating the chosen filter data model to the chosen storage (see below)
The last point in the list above is probably the most important, as storage routines will be the most time-consuming and therefore the filter data model should fit the storage structure. That means storage always has higher priority, and if the data model cannot fit some storage design that resolves the issue, then the data model should be changed.
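One possible shape for such a tree, sketched in Python rather than PHP. Each node can both test a product's details in memory and render itself recursively as a parenthesised SQL fragment with `?` placeholders. The class names (`Cmp`, `And`, `Or`) and the SQL target are assumptions: the rendering treats each detail key as a column of a flat table, which is not the JSON storage discussed below, and the weight value `5` merely stands in for the `X` of the example.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

_OPS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


@dataclass(frozen=True)
class Cmp:
    """Leaf node: <key> <op> <value>."""
    key: str
    op: str
    value: Any

    def matches(self, details: Dict[str, Any]) -> bool:
        return self.key in details and _OPS[self.op](details[self.key], self.value)

    def to_sql(self) -> Tuple[str, List[Any]]:
        # Keys are not parameterisable, so only plain identifiers are accepted.
        if not self.key.isidentifier() or self.op not in _OPS:
            raise ValueError(f"unsupported filter: {self.key!r} {self.op!r}")
        return f"{self.key} {self.op} ?", [self.value]


def _join(keyword: str, children: Tuple[Any, ...]) -> Tuple[str, List[Any]]:
    """Render child nodes joined by AND/OR, wrapped in parentheses."""
    parts, params = [], []
    for child in children:
        sql, child_params = child.to_sql()
        parts.append(sql)
        params.extend(child_params)
    return "(" + f" {keyword} ".join(parts) + ")", params


@dataclass(frozen=True)
class And:
    children: Tuple[Any, ...]

    def matches(self, details: Dict[str, Any]) -> bool:
        return all(child.matches(details) for child in self.children)

    def to_sql(self) -> Tuple[str, List[Any]]:
        return _join("AND", self.children)


@dataclass(frozen=True)
class Or:
    children: Tuple[Any, ...]

    def matches(self, details: Dict[str, Any]) -> bool:
        return any(child.matches(details) for child in self.children)

    def to_sql(self) -> Tuple[str, List[Any]]:
        return _join("OR", self.children)


# (weight=5 AND (color='red' OR color='green')) OR price<1000
example = Or((
    And((
        Cmp("weight", "=", 5),
        Or((Cmp("color", "=", "red"), Cmp("color", "=", "green"))),
    )),
    Cmp("price", "<", 1000),
))
```

Because precedence is encoded in the tree itself, a different storage backend only needs a different rendering method on each node type, not a different parser.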
Storage
For storage I want to use NoSQL+RDBMS, for example PostgreSQL 9.4. That will allow using JSON for storing details. I do not want to use EAV in any case, which is why a pure relational DBMS isn't an option (see here why). There is one important thing: products may contain stocks, which leaves me with basically two ways:
If I design products as a single entity together with their stocks (pretty logical), then I cannot take the "storage" + "indexer" approach, because it produces outdated state: the indexer (such as Solr) needs to update and reindex the data
Design with separate entities. That means separating whatever can be cached from whatever cannot. The first part can then go to the indexer (and details can probably go there too, so we filter by them) and the non-cacheable part will go somewhere else.
And the question for the storage part would be, of course: which one to choose?
The good thing about the first approach is that the internal API is simple, and the internal structures are simple and scalable because they can then easily be abstracted from the storage layer. The bad thing is that I then need a "magic solution" which will allow using "just storage" instead of "storage+indexer". "Magic" here means somehow designing indexes or some additional data structures (I was thinking about hashing, but it isn't helpful for range queries) in the storage that will resolve filtering requests.
On the other hand, the second solution will allow using the search engine to resolve the filtering task inside itself, at the cost of a window during which the data there is outdated. And of course the data layer now needs to be implemented so that it somehow knows which part of the model goes to which storage (stocks to one storage, details to another, etc.)
Summary
What can be a proper data model to design filtering?
Which approach should be used to resolve the issue at the storage level: storage+indexer with a separate products model, or storage only with a monolithic products model? Or maybe something else?
If I go with the storage-only approach, is it possible to design the storage so that products can easily be filtered by any set of details?
If I go with the indexer, what will fit this issue better? (There is a good comparison between Solr and Sphinx here, but it's '15 now while it was made in '09, so it is surely outdated.)
Any links, related blogposts or articles are very welcome.
As a P.S.: I did a search across SO but have found only barely relevant suggestions/topics so far (for example this). I am not expecting a silver bullet here, as it always boils down to some trade-off, but the question looks very standard, so there should be good insights already. Please guide me - I tried to "ask Google" with some luck, but that was not enough yet.
P.P.S. Feel free to edit tags or redirect the question to a proper SE resource if SO is not a good place for this kind of question. And I am not asking for a language-specific solution, so if you are not using PHP it does not matter; the design has nothing to do with the language.
My preferred solution would be to split the entities - your second approach. The stable data would be held in Cassandra (or Solr or Elastic etc), while the volatile stock data would be held in (ideally) an in-memory database like Redis or Memcache that supports compare-and-swap / transactions (or Dynamo or Voldemort etc if the stock data won't fit in memory). You won't need to worry too much about the consistency of the stable data since presumably it changes rarely if ever, so you can choose a scalable but not entirely consistent database like Cassandra; meanwhile you can choose a less scalable but more consistent database for the volatile stock data.

Design approach for utilizing data from a database in VB.NET

My question began as an inquiry regarding naming a new object based on a variable, but as I wrote out my problem, I realized my solution's approach may not be appropriate, so I will ask instead for general design advice. Allow me to set forth a simple example of the type of problem I'd like to solve.
Problem:
There exists a database that holds information on each planet in the solar system. The database has multiple tables.
The program intends to access this data to display it to the user in a variety of manners. The program does not intend to manipulate the data at its source, though calculations on two fields may be performed to determine certain display characteristics.
What is the optimal method for accessing the data?
My initial approach, (read: gut feeling, no research) was to take each related row of data from the database and create an object of a custom class based on the data. So you might wind up with:
Universe.Planet.Earth.mass = 1
Universe.Planet.Earth.radiusPolar = 1
Universe.Planet.Earth.radiusEquatorial = 1
Universe.Planet.Earth.aphelion = 1.106
Universe.Planet.Earth.perihelion = .983
Universe.Planet.Mars.mass = .107
Universe.Planet.Mars.radiusPolar = .531
Universe.Planet.Mars.radiusEquatorial = .533
Universe.Planet.Mars.aphelion = 1.666
Universe.Planet.Mars.perihelion = 1.381
Then use these objects to perform whatever actions the program needed to. I ran into a lot of problems trying to implement this, so that I believe my design approach is fundamentally flawed.
Any advice on how I might approach this?
I would abstract it further. Create a Universe, and a universe has galaxies, so there should be a collection of galaxies in the universe. Each galaxy would contain space objects (planets, moons, etc). I would end up with a class that is similar to yours, except:
the object name would be a property of the class
the object would have an enumerated type (planet, moon, etc)
something describing location (distance from center of galaxy???)
I'm not a space junkie, so I don't know what you could build into a base class that would apply to all objects, but I'm guessing you do.
If there are calculations, it may be better to have SQL calculate them and make it part of a returned recordset.
What are you calculating?
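As an illustration of letting SQL do the calculation, here is a sketch using Python's standard `sqlite3` module instead of VB.NET. The table layout is invented, the values are the ones from the question, and the derived column (the semi-major axis, i.e. the mean of aphelion and perihelion) is only an assumed example, since the question does not say which two fields are combined.

```python
import sqlite3


def planets_with_semi_major_axis():
    """Return (name, semi_major_axis) rows, computed inside the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE planet (name TEXT PRIMARY KEY, aphelion REAL, perihelion REAL)"
    )
    conn.executemany(
        "INSERT INTO planet (name, aphelion, perihelion) VALUES (?, ?, ?)",
        [("Earth", 1.106, 0.983), ("Mars", 1.666, 1.381)],  # values from the question
    )
    rows = conn.execute(
        """
        SELECT name, (aphelion + perihelion) / 2.0 AS semi_major_axis
        FROM planet
        ORDER BY name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, semi_major_axis in planets_with_semi_major_axis():
        print(name, semi_major_axis)
```

The calling code receives the derived value as an ordinary column of the recordset and does not need its own copy of the formula.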

Filtering results from the data access layer in the business layer

I haven't been able to find an answer to my question so far, and I suppose I have to ask my first question some time. Here goes.
I have a Data Access Layer that's responsible for interacting with various data storage elements and returns POCOs or collections of POCOs when querying out things.
I have a Business Layer that sits on top of this and is responsible for implementing business rules on objects returned from the Data Access Layer.
For instance, I have a SQL Table of Dogs, my data access layer can return that list of dogs as a collection of Dog object. My business layer then would do things like filter out dogs below a certain age, or any other filtering or transformation that had to happen based on business rules.
My question is this. What's the best way to handle filtering objects based on related records? Let's say I want all the people who have Cats. Right now my data access layer can return all the cats, and all the people, but doesn't do any filtering for me.
I can implement the filtering via a different data access method (i.e. DAO.GetCatPeople()) but this could get complicated if I have a multitude of related properties or relationships to handle
I can return all from both sides and do the matching myself all in the business layer, which seems like a lot of extra work and not fully utilizing the sql server.
I can write a data filtration interface and if my data access layer changes this layer would have to change as well.
Are there some known best practices here I could be benefiting from?
The view I take is that there are two "reasons" why you'd access data: Data centric and Use Case centric.
Data Centric is stuff like CRUD and other common / obvious stuff that is a no brainer.
"Use Case" centric is where you define an interface and matching POCO's for a specific purpose. [It's possible I'm missing some common terminology here, but Use Case centric is what I mean]
I think both types are valid. For the use case driven ones it's going to be mostly driven by business focused use cases, but I can see edge cases where they could be more technically driven - I'd say that was ok as long as they didn't violate any business rules or pervert your domain model.
Should Cats and Dogs know about each other? If they exist within the same domain model, and have established relationships within that model - then yes of course you should be able to make queries like GetCatPeople().
As far as managing the complexity goes, rather than GetCatPeople() you could have a more generic method that took an attribute as a parameter: GetPeopleByAnimal(animal).
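A sketch of such a generic data access method, using Python's standard `sqlite3` module. The `person` and `pet` tables, their columns, and the sample rows are hypothetical; the point is that the relationship is resolved by a join inside the database rather than by matching two full result sets in the business layer.

```python
import sqlite3


def get_people_by_animal(conn, species):
    """Return the names of people owning at least one pet of the given species."""
    rows = conn.execute(
        """
        SELECT DISTINCT person.name
        FROM person
        JOIN pet ON pet.owner_id = person.id
        WHERE pet.species = ?
        ORDER BY person.name
        """,
        (species,),
    ).fetchall()
    return [name for (name,) in rows]


def make_demo_db():
    """Build a small in-memory database with made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE pet (
            id INTEGER PRIMARY KEY,
            owner_id INTEGER NOT NULL REFERENCES person(id),
            species TEXT NOT NULL
        );
        INSERT INTO person (id, name) VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO pet (owner_id, species)
            VALUES (1, 'cat'), (1, 'cat'), (2, 'dog'), (3, 'cat');
        """
    )
    return conn
```

One parameterised method covers cats, dogs, and any species added later, so the data access layer does not grow a new method per relationship value.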

Is using a table inheritance a valid way to avoid using a join table?

I've been following a mostly DDD methodology for this project, so, like any DDD'er, I created my domain model classes first. My intention is to use these POCO's as my LINQ-to-SQL entities (yes, they're not pure POCO's, but I'm ok with that). I've started creating the database schema and external mapping XML file, but I'm running into some issues with modeling the entities' relationships and associations.
An artifact represents a document. Artifacts can be associated with either a Task or a Case. The Case entity looks like this:
public class Case
{
    private EntitySet<Artifact> _Artifacts;
    public IList<Artifact> Artifacts
    {
        get
        {
            return _Artifacts;
        }
        set
        {
            _Artifacts.Assign(value);
        }
    }
    .
    .
    .
}
Since an Artifact can be associated with either a Case, or a Task, I've the option to use inheritance on the Artifact class to create CaseArtifact and TaskArtifact derived classes. The only difference between the two classes, however, would be the presence of a Case field or a Task field. In the database of course, I would have a single table, Artifact, with a type discriminator field and the CaseId and TaskId fields.
My question: is this a valid approach to solving this problem, or would creating a join table for each association (2 new tables, total) be a better approach?
I would probably go with two tables - it makes the referential integrity-PK/FKs a little simpler to handle in the database, since you won't have to have a complex constraint based on the selector column.
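A sketch of the two-join-table layout, with SQLite (through Python's `sqlite3` module) standing in for the real database. All table and column names are hypothetical. Each link table needs only ordinary foreign keys, with no constraint that depends on a discriminator column.

```python
import sqlite3


def make_schema():
    """Create the hypothetical schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(
        """
        CREATE TABLE cases (id INTEGER PRIMARY KEY);
        CREATE TABLE tasks (id INTEGER PRIMARY KEY);
        CREATE TABLE artifact (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

        -- One join table per association, each with plain foreign keys.
        CREATE TABLE case_artifact (
            case_id INTEGER NOT NULL REFERENCES cases(id),
            artifact_id INTEGER NOT NULL REFERENCES artifact(id),
            PRIMARY KEY (case_id, artifact_id)
        );
        CREATE TABLE task_artifact (
            task_id INTEGER NOT NULL REFERENCES tasks(id),
            artifact_id INTEGER NOT NULL REFERENCES artifact(id),
            PRIMARY KEY (task_id, artifact_id)
        );
        """
    )
    return conn
```

With this layout, linking an artifact to a case that does not exist is rejected by the database itself (SQLite raises `sqlite3.IntegrityError`), without any knowledge of artifact types.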
(to reply to your comment - I ran out of space so post here as an edit) My overall philosophy is that the database should be modelled with database best practices (protect your perimeter and ensure database consistency, using as much RI and constraints as possible, provide all access through SPs, log activity as necessary, control all modes of access, use triggers where necessary) and the object model should be modelled with OOP best practices to provide a powerful and consistent API. It's the job of your SPs/data-access layer to handle the impedance mismatch.
If you just persist a well-designed object model to a database, your database won't have much intrinsic value (difficult to data mine, report, warehouse, metadata vague, etc) when viewed without going through the lens of the object model - this is fine for some application, typically not for mine.
If you just mimic a well-designed database structure in your application, without providing a rich OO API, your application will be difficult to maintain and the internal structures will be awkward to deal with - typically very procedural, rigid and with a lot of code duplication.
I would consider finding commonalities in between case and task, for the lack of better word let's call it "CaseTask" and then sub-typing (inheriting) from that one. After that you attach document to the super-type.
UPDATE (after comment):
I would then consider something like this. Each document can be attached to several cases or tasks.

Resources