Managing Entity Resolution in Anchor Modeling

I've been reading about anchor modeling and really like the concept. My hope is to incorporate it into a data management framework where I consolidate multiple data sources into an anchor model, then either make that model directly available or have it feed data marts for our data scientists.
But I'm not sure how to approach entity resolution. The guidelines state no updates, only inserts, with deletes permitted only to remove erroneous data. Now let's say my source system(s) contain duplicate entities (e.g. John Smith appears more than once) and the duplicates make their way into my anchor model. What is the best way to clean this up?
My rubber duck is telling me to create an entity resolution layer on top of my anchor model that looks for these issues and corrects them. Correcting would mean merging entities in anchors and fixing the affected ties accordingly. But then I'm updating my anchor model, which is against best practices.
Or am I looking at this wrong, and entity resolution should be handled before data gets into the anchor model? Mistakes can still happen, though, and it would be nice to know I could address the issue inside the anchor model should it present itself.
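To make the idea concrete, here is the kind of insert-only "merge" my rubber duck is suggesting, sketched in SQL (the table and column names are my own invention, not anything from the anchor modeling guidelines):

-- A hypothetical customer anchor plus a "same-as" tie that records,
-- insert-only, that one anchor instance has been resolved to another.
CREATE TABLE customer_anchor (
    customer_id integer PRIMARY KEY
);

CREATE TABLE customer_sameas_tie (
    duplicate_customer_id integer NOT NULL REFERENCES customer_anchor,
    canonical_customer_id integer NOT NULL REFERENCES customer_anchor,
    resolved_at           timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY (duplicate_customer_id, canonical_customer_id)
);

-- "Merging" duplicate John Smith #2 into John Smith #1 is just an insert:
INSERT INTO customer_sameas_tie (duplicate_customer_id, canonical_customer_id)
VALUES (2, 1);

-- Downstream consumers (views, data marts) resolve duplicates by
-- following the tie instead of updating the anchor:
SELECT COALESCE(t.canonical_customer_id, a.customer_id) AS resolved_id
FROM customer_anchor a
LEFT JOIN customer_sameas_tie t ON t.duplicate_customer_id = a.customer_id;

That would keep the anchor append-only while still letting me correct duplicates after the fact, but I don't know whether it's considered good practice.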

Related

What are the front-end data modeling best practices for translating complex nested SQL associations into manageable services?

I'm reaching out to gain perspective on possible solutions to this problem. I'll be using Angular and Rails, but really this problem is a bit more abstract and doesn't need to be answered in the context of these frameworks.
What are the best practices for managing complex nested SQL associations on the front-end?
Let's say you have posts and comments and comments are nested under posts. You send your posts to the front-end as JSON with comments nested under them. Now you can display them listed under each post, great. But then questions arise:
What if you want to display recent comments as well? Your comment service would need to have comments in a normalized collection or gain access to them in a fashion that allows them to be sorted by date.
Does this mean you make a separate API call for comments sorted by date? This would duplicate comments on the front-end and require you to update them in two places instead of one (once for the posts and once for the comments, assuming comments can be edited or updated).
Do you implement some kind of front-end data normalization? Meaning, you have a caching layer that holds the nested data and then distributes the individual resources to their corresponding services?
What if you have data with varying levels of nesting? Continuing with the posts and comments example: what if your comments can be replied to, up to a depth of 10 levels?
How does this affect your data model if you've made separate API calls for posts and comments?
How does this affect your caching layer if you choose that approach?
What if we're not just talking about posts? What if you can comment on photos and other resources?
How does this affect the two options for data-modeling patterns above?
Breaking from the example, what if we were talking about recursive relationships between friended users?
My initial thoughts and hypothetical solution
My initial thought, and how I'd attack this, is with a caching layer that normalizes the data such that:
The caching layer handles any normalization necessary
The caching layer holds ONE canonical representation of each record
The services communicate with the caching layer to perform CRUD actions
The services generally don't care, nor do they need to know, how nested/complex the data model is; by the time the data reaches the services it is normalized
Recursive relationships would need to be made finite at some point; you can't just continue nesting forever.
This all sounds great, of course, but I see lots of potential pitfalls and wish to gain perspective. I'm finding it difficult to separate the abstract best practices from the concrete solutions to specific data models. I'm very interested to know how others have solved this problem and how they would go about solving it.
Thanks!
I assume you will use RESTful APIs. A caveat: I don't know Rails, but I'll suggest some general practices that you might consider.
Let's say you have one page that shows 10 posts, each with its 10 most recent comments sorted by date: make that response possible in one API call.
If there is another page that shows only 5 posts and no comments, serve it from the same API endpoint.
Make this possible with some query parameters.
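For illustration, here is one shape of query that could back such an endpoint on the database side (just a sketch, assuming PostgreSQL and simple posts/comments tables with created_at columns; all names are made up):

-- 10 most recent posts, each with its 10 most recent comments, in one round trip
SELECT p.id AS post_id, p.title, c.id AS comment_id, c.body
FROM (
    SELECT * FROM posts ORDER BY created_at DESC LIMIT 10
) p
LEFT JOIN LATERAL (
    SELECT * FROM comments
    WHERE comments.post_id = p.id
    ORDER BY comments.created_at DESC
    LIMIT 10
) c ON true;

-- Parameterizing the two LIMITs (and skipping the comments join entirely)
-- lets the same endpoint serve the 5-posts-no-comments page.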
Try to optimize your response as much as you can.
You can have multiple response types in one endpoint, in any programming language; when we're talking about APIs, that's how I do the job.
If a query takes a lot of time and runs several times, then of course you need a cache, but 10 posts per API call doesn't need caching; it shouldn't be hard on the database.
For the nesting problem you can have a mechanism like this:
To fetch 10 posts with all of their comments, I can send a query parameter saying I want to include all comments of each post,
like bar.com/api/v1/posts?include=comments
If I need only some customized data for the comments, I should be able to implement a custom include,
like bar.com/api/v1/posts?include=recent_comments
Your API layer should first try to match a custom include; if none is found, fall back to the relations of the resources.
For deeper references, like comments.publisher or recent_comments.publisher, your API layer needs to know which resource it is currently working on. You won't need this for a normal include, but custom includes should declare which model/resource they point to; that way it is possible to create an endless chain.
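On the database side, capped nesting like the 10-level reply chains mentioned earlier can be fetched in one query with a recursive CTE (a sketch, assuming PostgreSQL and a comments table with a parent_id self-reference; post id 42 is made up):

WITH RECURSIVE thread AS (
    SELECT id, parent_id, body, 1 AS depth
    FROM comments
    WHERE post_id = 42 AND parent_id IS NULL   -- top-level comments
    UNION ALL
    SELECT c.id, c.parent_id, c.body, t.depth + 1
    FROM comments c
    JOIN thread t ON c.parent_id = t.id
    WHERE t.depth < 10                         -- stop at 10 levels of replies
)
SELECT * FROM thread;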
I don't know Rails, but this pattern is easy to implement if you have a powerful ORM/ODM.
Sometimes you need to do some filtering; the same approach goes for that job too.
You can have a filter query parameter and implement some custom filters, e.g.
bar.com/api/v1/posts?include=recent_comments&filters=favorites
Or forget about all of that and do something like the following:
bar.com/api/v1/posts?transformation=PageA
this will return 10 recent posts with their 10 recent comments
bar.com/api/v1/posts?transformation=PageB
this will return only 10 recent posts
bar.com/api/v1/posts?transformation=PageC
this will return 10 recent posts with all of their comments

Is an EAV hybrid a bad database design choice?

We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as a good design approach for the new database. The new design should allow:
an arbitrary number of attributes/properties for any entity, i.e. the number of attributes per entity is not fixed and may change on a regular basis;
content admins to add new properties to existing entities on the fly through admin interfaces, rather than making changes to the db schema all the time.
There are quite a few discussions about the performance issues of EAV, but if we don't go with a hybrid EAV we end up:
having a lot of empty columns (we would still have to add new columns even though 99% of the data does not have those properties);
spending more time maintaining the database, especially when attributes keep changing;
with no way of allowing content admins to add new properties to existing entities.
Anyway, here's what we are thinking for the new design (basic ERD included):
Have separate tables for each entity containing some basic info that is exclusive to it, e.g. id, name, address, contact, created, etc.
Have 2 tables, attribute type and attribute, to store property information.
Link each entity to its attributes using a many-to-many relation.
Store addresses in a different table and link them to entities using a foreign key.
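In rough SQL, the design we're considering would look something like this (simplified, using a stadium as the example entity; names are illustrative):

-- Fixed, exclusive columns stay on the entity table
CREATE TABLE stadium (
    id      serial PRIMARY KEY,
    name    text NOT NULL,
    contact text,
    created timestamp NOT NULL DEFAULT now()
);

CREATE TABLE attribute_type (
    id   serial PRIMARY KEY,
    name text NOT NULL                -- e.g. 'integer', 'text', 'boolean'
);

CREATE TABLE attribute (
    id                serial PRIMARY KEY,
    name              text NOT NULL,
    attribute_type_id integer NOT NULL REFERENCES attribute_type
);

-- Many-to-many link between entities and attributes, carrying the value
CREATE TABLE stadium_attribute (
    stadium_id   integer NOT NULL REFERENCES stadium,
    attribute_id integer NOT NULL REFERENCES attribute,
    value        text,
    PRIMARY KEY (stadium_id, attribute_id)
);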
We think this will allow us to be more flexible when adding, removing or updating properties.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given stadium, we might have a query with 20+ joins to fetch all related attributes in a single row.
What are your thoughts on this design, and what would be your advice to improve it?
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. PostgreSQL has transactional DDL; that is a killer feature for automating schema changes.
PostgreSQL doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead, in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client-side, like JSON or protobuf, if you don't need to touch anything specific on the database side.
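For example, the batch-oriented alternatives just mentioned could look like this (a sketch for PostgreSQL; table names are hypothetical):

-- Parallel arrays: one row per entity instead of one row per attribute value
CREATE TABLE entity_attrs_arrays (
    entity_id   integer PRIMARY KEY,
    attr_ids    int[],
    attr_values text[]
);

-- hstore: a key/value map in a single column
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE entity_attrs_hstore (
    entity_id integer PRIMARY KEY,
    attrs     hstore
);

INSERT INTO entity_attrs_hstore
VALUES (1, 'color => red, capacity => 50000');

SELECT attrs -> 'capacity' FROM entity_attrs_hstore WHERE entity_id = 1;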
Don't go out of your way to put everything into one single entity table. If the entities don't share any attributes in a sensible way, use multiple instantiations of the specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of ability to understand the system, versioning tools, QA procedures, staging as your code, but it will have none of the tools. In essence you will be doing programming in a very awkward non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work and has limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seductive it is. But it's something of an easy way out: an easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with unbounded product attributes.
Sure, you could go about enumerating all the columns you would need for every product in the world, but that would take a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for the open-ended stuff, but don't rely on EAV for all of the database's tables. A hybrid of EAV and a relational db, when done right, is a powerful tool that cannot be matched by fixed columns alone.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.

Delphi DataModule Usage - Single or Multiple?

I am writing an application with various forms and their corresponding datamodules.
I wrote it so that they use each other by referencing one another in the uses clause (one in the implementation section and the other in the interface section, to avoid a circular reference).
Is this approach wrong? Why, or why not, should I use it this way?
I have to agree with Ldsandon; IMHO it's way better to have more than one datamodule in your project. If you look at it as a Model-View-Controller thingie, your DB is the Model, your forms would be the Views, and your datamodules would be the Controllers.
Personally I always have AT LEAST 2 datamodules in my project. One datamodule is used to share Actions, ImageLists, DBConnection, ... and other stuff throughout the project. Most of the time this is my main datamodule.
From there on I create a new datamodule for each 'entity' in my application. For example, if my application needs to process or display orders, customers and products, then I will have a datamodule for each and every one of those.
This way I can clearly separate functionality and easily reuse bits and pieces without having to pull in everything. If I need something related to customers, I simply use the Customers datamodule and just that.
Regards,
Stefaan
It is OK, especially if you're going to create more than one instance of the same form, each using a different instance of the related datamodule.
Just be aware of a little issue in the VCL design: if you create two instances of the same form and their datamodules, both forms will point to the same datamodule (due to the way the VCL resolves links) unless you use a little trick when creating the datamodule instance:
if FDataModule = nil then
begin
  FDataModule := TMyDataModule.Create(Self);
  FDataModule.Name := ''; // That will avoid pointing to the same datamodule
end;
A data module just for your table objects, if you only have a few db tables, would be a good first one.
And a data module just for your actions is another.
A data module just for image lists is another good one. This data module contains only image lists, and the forms that need access to image lists can use them all from this shared location.
If you have 200 table objects, maybe you want more than one data module just for db tables. Imagine an application that has 20 tables to do with invoicing and another 20 tables to do with HR. An InvoicingDataModule and an HRDataModule would be nicely kept separate if the tables inside them, and the code that works against them, don't need to know anything about each other, or even when one module has a "uses" dependency on the other in one direction, as long as that relationship is not circular. Even then, finer-grained data module modularity can be beneficial.
Beyond the looking glass: I always use Forms instead of DataModules, and I know it's not common ground. I call them DataMovules.
I use one such DataMovule per each group of tables logically related.
I use Forms instead of DataModules because both DataModules and Forms are component containers and both can effectively be used as containers for data related components.
During development:
My application is easier to develop. Forms give you the opportunity to view the data components, making it easy to develop those components.
My application is easier to debug, as you may drop in some viewer components so you can actually see the data. I usually create a tab rack, with one page per table and one data grid on every page, for browsing the data in the table.
My application is easier to test, as you may eventually manipulate the data, for example experimenting with extreme values for stress testing.
After development:
I do turn the form invisible, practically making it a DataModule, and I enjoy the very same container features that a DataModule has.
But with a bonus: the Form is still there, so I can eventually turn it visible for problem determination. I use the About Box for this.
And no, I have not experienced any noticeable penalty in application size or performance.
I don't try to break the MVC paradigm. I try to adhere to it instead. I don't mix the Forms that constitute my View with the DataMovules that constitute my Controllers. I don't consider them part of my View. They will never be used as the User Interface of my application. DataMovules just happen to be Forms. They are just convenient engineering artifacts.

Why not construct UI based on DB schema?

People seem to avoid building user interfaces that pull their information (names, field types, etc. as well as relationships) from a database; they instead hard-code forms (and tables, etc.) that have pretty much the same data names and types and things.
Am I making sense?
For instance, imagine an enumerated field in MySQL: why not just have the UI construct a drop-down list whenever it encounters an ENUM? Why put the same values in both the database and the code?
Perhaps I'm just missing something; perhaps there are projects out there that do this — sort of super-crud interfaces that can be pointed at any database and from it build a fully-functional relationally-aware user interface. Are there?
I'm possibly not quite conforming to the stackoverflow norms with this question; I shall summarise:
Can you please tell me of a project that constructs its user interface (solely) from analysis of the database schema?
Why is this not a common way to do it — surely it is good to only define data structure in one place (i.e. the database)?
Thank you, and may joyous code-love rain upon your IDE.
I'd like to point out that, last time I checked, .NET and Qt (and probably other environments) make it possible to use "database-aware widgets" (sometimes shortened to just data-aware widgets), which is probably the best pragmatic solution available. What I mean by data-aware widgets is that the widgets themselves know that they're linked directly to database fields, so you would have a combobox that knows that it's backed by an enum and fetches the possible values directly from the database at runtime, just like you suggested.
This is a really neat utility, and used well, it probably won't hurt anything. It still requires that you spend some time laying out widgets manually on a form, but then if you update the database to add a new value to that enum, you don't have to rebuild your app to see it show up in the UI.
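For instance, the ENUM case from the question can be handled at runtime by reading the column definition out of the catalog (a sketch for MySQL; the schema, table and column names are hypothetical):

-- Returns e.g. "enum('active','inactive','banned')"; the data-aware widget
-- parses this string to populate its drop-down list.
SELECT COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME   = 'person'
  AND COLUMN_NAME  = 'status';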
But the reason most usability experts will cringe when they hear your question is because programmers tend to think that, well, why not just generate the entire UI, form layout and everything, from the database? And this is where it starts to get really nasty, really fast.
Let's say you have a simple Person table, with first_name, last_name, email_address, street_address, city, state, zip, and phone_number. You want to automatically generate a UI based on these fields. How do you sort the fields? Ideally, first name and last name should be right next to one another, and it would look very silly if you had city and state before street address. So you have to add a new column to the table to specify sort order, if you go with the quickest method, or a new table mapping each field ID to its order index.
What if you want to group parts of the information separately? Then you have to add more UI-specific cruft into your database layout (to do this generically, you'll need a new table specifying which UI fields belong to which UI groupboxes). So you've only solved two problems and already your database layout has gotten twice as ugly, plus now instead of a simple O(1) layout operation when you load the UI, you've gotta do several database queries to find out what fields exist and dynamically lay them out while applying the correct widget order... and we haven't even dealt with sizing (should every field be the maximum size to fit its possible contents, or should all text fields be the same width? Wouldn't it be nice if you could say that some text fields should be one width and height, and some should be another combination? etc), or text justification, or formatting, or any other really common elementary usability requirements that will require further sacrifices from the clarity and simplicity of your database schema.
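To make that concrete, just the two fixes above already imply layout metadata along these lines (a hypothetical sketch of the creep):

-- UI layout concerns leaking into the database schema
CREATE TABLE ui_field_layout (
    table_name  varchar(64) NOT NULL,
    column_name varchar(64) NOT NULL,
    group_name  varchar(64),           -- which groupbox the field sits in
    sort_order  int NOT NULL,          -- display position within the group
    PRIMARY KEY (table_name, column_name)
);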
Most visual database editors. phpMyAdmin for instance.
Because the database structure isn't always a very good logical structure for a user to be using, especially in the case of databases that have been denormalised on purpose for efficiency reasons.
Yup, this route has already been traveled.
Simply pointing at a database will create an oversimplified UI, not giving much more than the CRUD of an Access UI. That's why Naked Objects (I'm one of its committers) builds its metamodel from a POJO domain model. This allows the UI to expose any public methods as menus; we call this "behaviourally complete".
Per the comment about the UI not being suitable for end-users, I have two points:
distinguish between power users and casual users. Most internal apps are for the former (we use Alan Cooper's term "sovereign application" for this), who understand the domain and don't want fancy UI stuff getting in the way. Most external apps, e.g. public web sites, are for the latter.
for the latter, there's nothing to prevent the autogenerated UI of a tool like Naked Objects from being replaced with a custom or semi-customized viewer. One such viewer is Scimpi, and I'm also working on an Eclipse RCP viewer that'll expose extension points. But even here, the auto-generated UI is still very valuable for the development team and business analysts for exploration and prototyping.
Hope some of the above has piqued your interest. If you want more, google around, or you might want to check out my book on domain-driven design and Naked Objects, at pragprog.com.
HTH
Dan
List of projects that implement this idea.
.NET
dotObjects
Naked Objects
TrueView
Java
Domain Object Explorer
JMatter
Naked Objects
Sanssouci
Trails
Lablz
C++
Typical Objects
Specifically to your second question: a lot of it really depends on your data model. Some models are very complicated and would lead to unintuitive user interfaces. For simple CRUD-based systems, having your UI be a front end to the database may be preferable, and in that case some of these tools would be great. However, for more complicated systems, where some db data needs to be hidden from the users, it is better if your UI doesn't mirror the db schema.
Microsoft Access has used this model for years - the database and UI development are very closely tied. You can auto-generate a form directly from a table definition with smart defaults and search built in. The model works well for developing applications with few concurrent users such as custom applications for small businesses where the amount of data stored is small.
If you are scaling to larger relational DBs with a number of concurrent users, then reliability and performance become more important, and a separately constructed UI and database make more sense. When more users are involved they often have different requirements, so decoupling the UI from the DB schema makes development more efficient.
Just a note on the Java "projects that implement this idea": Tynamo is the new version of the Trails framework.
There are many systems that build an interface for you to edit stuff directly from table information. End-user interfaces, however, must be tweaked a little bit. You may not want to reveal to the user every field in your table.
Frameworks that make good use of the MVC design pattern can let you do all kinds of things with your models, which are the preferred way to build new systems (rather than creating database tables directly).
To answer your questions specifically:
django allows you to construct forms (and a complete admin CMS) out of models.
It is a common thing to do.
Naked Objects is about one step removed from this. They base the UI on an object model, and then persist the object model.
I think you are forgetting to consider the user in your design process if you are thinking like that. Bad mistake. Users don't like it when the interface changes, and they would especially dislike an interface that changed frequently, as they then wouldn't know what to do. Further, if you generate your UI on the fly based on the database structure, what order would the objects be in? UIs need to present objects in an order that makes sense to the users, not to the database designers.
Further in a well-designed database there are fields that are not meant for the users to see. Things like numeric keys, insert date, last updated etc. You don't want to automatically expose these to the users and you certainly don't want them to have the ability to mess with the data in such fields.
Finally, if you don't think about the functionality of the page, then you aren't doing your job. A UI needs to be more than just a list of fields that can be edited. You need constraints on who can see what, checks of the data before it is inserted into the database, and business rules that need to be applied. You can't just autogenerate a lot of this (and you shouldn't, even if you could!). Design needs thought and care.
Now as to drop down lists, of course you can generate them from the database and not the code, in fact it is the better choice. Just make the query the source for your particular object, not a list generated in code.
You can do it with the help of a cool tool from a developer in the Philippines called COBALT.

Marrying up consumer-defined aggregates (e.g. SQL counts) with 'pure' model objects?

What is the best practice for introducing custom (typically volatile) data into entity model classes? This may sound like bad practice at first, but it seems to be quite a common scenario. In our recent web application we have developed a proper model, and in most cases we are fine with loading model entities. But there are cases where we cannot afford to load an entire hierarchy of entities; we need to load, say, the results of a couple of SQL COUNTs, or possibly some additional information alongside (or embedded inside) the model entities. So basically, the requirements and conditions are:
It’s a web application where 99.9999999999% of all operations are read operations.
They don’t need to process or do any complicated business logic. We just need to get data quickly to HTML.
In several performance critical cases, we need to load results of SQL aggregates which don’t fit any model properties.
We need an extensible way to introduce any new custom data if needed.
How do you usually solve this issue without working too much around your ORM (for instance, by dropping down to raw data from the db)? I'm sure this has been discussed many times, but I cannot figure out a good Google query to find anything useful.
Edit: Since I later realized the question was not very well formed, I decided to reformulate it and start a new one.
If you're just getting relational data to and from a browser, with little or no behavior in between, it sounds like you're trying to solve a relational problem with an OO paradigm.
I might be inclined to dispense with the Object Oriented approach altogether.
My team recently rewrote an application by asking "What is the simplest thing that can possibly work?" and "What is the closest language to the problem?". Our new app, replacing an OO one, ended up being 10 times smaller, faster, and cheaper.
We used SQL, stored procedures, XML libraries on the DB server, XSLT (to get the HTML), and javascript.
An OOP purist like myself would go for the Decorator pattern.
http://en.wikipedia.org/wiki/Decorator_pattern
But the thing is, some people may not need the flexibility it offers. Plus, creating new classes for each distinct operation may seem like overkill, but it provides good compile-time type checking.
The best practice in my view is that your application consumes data using the Domain Model pattern. The Domain Model can offer business-logic methods for doing the type of queries that make sense and are relevant to your application needs.
These can fetch "live" results that map directly to database rows and can therefore be edited and "saved."
But additionally, the Domain Model can provide methods that fetch read-only results that are too complex to be easily saved back to the database. This includes your example of grouped aggregate query results, and also includes joined query result sets, expressions as columns, etc.
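For instance, this is the kind of read-only, aggregate-bearing result meant here; it maps to no single entity and will never be "saved" back (hypothetical tables):

-- A grouped aggregate exposed by a Domain Model query method
SELECT author_id, COUNT(*) AS post_count, MAX(created_at) AS last_post_at
FROM posts
GROUP BY author_id;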
The Domain Model pattern offers a way to decouple the OO design of an application from the design of the physical database.
