ORM auto-creation of tables violates DBA preferences

I am using ORM to automatically create tables from model classes. I am naming the classes and their fields in a way that is natural for the application. The ORM then uses those same names for the tables and columns, and automatically generates names of other objects like constraints and sequences which are completely abstracted by the ORM.
I am not declaring how the tables, columns, etc. should be named. I leave it to the ORM to decide. I see this as good sense from the application's point of view.
The DBA for my team does not like this one bit. The DBA says that if column "B" in table "A" has a foreign key constraint to field "Y" in table "X", then the column must be named "A.X_Y" instead of "A.B". The DBA says this is the "correct" way to name foreign keys and that the ORM is therefore naming them incorrectly. Both of the ORM engines I am familiar with allow class/field names to be explicitly mapped to table/column names, so I am aware it is possible to accommodate the DBA without changing the classes.
My question is: in practice, does doing this (explicitly mapping the names) necessarily introduce an extra entity (i.e. new configuration files/sections, new code) or coupling (i.e. imports, decorators, annotations) in the application that might not otherwise exist? ORM designs vary, so I would think it is possible that the answer is different for different engines.
If the answer to this is nearly universally yes, I would consider that a good argument that the DBA should yield to the ORM in the interest of productivity, on the logic that the database exists to serve the needs of the application, not the other way around. Commentary on this point is welcome also.
Please note, I am NOT asking for specific recommendations on how classes/fields/tables/columns should be named.

ORM designs vary, so I would think it is possible that the answer is different for different engines.
You are right here. Depending on the ORM, explicitly specifying the field->column mapping might (or might not) introduce extra entities in the form of configuration entries, annotations, attributes, etc. Having said that, this is the bread and butter (or should I say the very basics) of an ORM engine, and handling these mappings usually does not add noticeable overhead.
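For example, in SQLAlchemy the mapping is one extra string argument per column, so the class keeps its natural names while the database gets the DBA's (a minimal sketch; all names are invented for illustration):

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class LineItem(Base):
        # The class keeps its natural application name; the table name
        # is whatever the DBA wants.
        __tablename__ = "ORDER_LINE_ITEM"

        id = Column("LINE_ITEM_ID", Integer, primary_key=True)
        # The attribute is "product"; the column follows the X_Y convention.
        product = Column("PRODUCT_PRODUCT_ID", Integer,
                         ForeignKey("PRODUCT.PRODUCT_ID"))
        note = Column("NOTE_TEXT", String(200))

So the "extra entity" here amounts to one literal per mapped name, living in the class definition you already have.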
I feel you definitely cannot use this as an argument against explicitly naming the tables, columns, etc. More often than not you will have another application that uses the same database. The default names generated by your application might make absolutely no sense for that application. I would let the DBA decide the right nomenclature for the database artifacts, and tweak the mapping to my domain model in my app. After all, that's where the ORM adds value.

No, explicitly mapping the names does not create an extra entity or coupling; it implements a policy and process which your DBA has decided is appropriate for their database(s). I'd contend, as well, that your logic is a bit backwards: the application exists to serve the needs of the database. The software which ANY company puts in place exists to serve / populate / present the data which lives in the database, not the other way around.
Consider this: the software that does whatever your company does will change as requirements and technology change, and get refactored as standards and complexity evolve. The data stored in the database, however, will not; even as the schema changes and evolves, the underlying data will change very little, if at all.

Coding standards exist for a reason - if I were the DBA, I might suggest that the productivity benefit of using the ORM without modification should yield in the name of maintainability.
It's a question of trade-offs: if the additional effort to name the foreign key column according to the corporate standard is minimal (and in most ORMs I've worked with, it is), I'd just do it.
If your ORM makes it super hard, I'd explain to the DBA that this is a different architecture than the one he's used to - the use of an ORM means there will rarely if ever be any manual SQL writing - it's all automagically created by the ORM, so it makes sense to relax the coding standards.
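For what it's worth, in SQLAlchemy the corporate standard for generated constraint names can be declared once, on the metadata, without touching any model class (a sketch; the convention strings here are invented, not a recommendation):

    from sqlalchemy import MetaData
    from sqlalchemy.orm import declarative_base

    # One central policy: every generated constraint/index name follows
    # the DBA's pattern instead of the engine's defaults.
    naming_convention = {
        "pk": "pk_%(table_name)s",
        "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
        "uq": "uq_%(table_name)s_%(column_0_name)s",
        "ix": "ix_%(table_name)s_%(column_0_name)s",
    }

    Base = declarative_base(metadata=MetaData(naming_convention=naming_convention))

That makes the "additional effort" a one-time configuration rather than per-table work.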

Related

Using an LDAP server as a storage base: how practical is it?

I want to learn how practical it is to use an LDAP server (say, AD) as a storage base. To be more clear: how much sense does it make to use an LDAP server instead of an RDBMS to store data?
I can guess that most of you might just say "it doesn't", but there might be some reasons that make it meaningful (especially business-wise).
A few points first:
Each table becomes a container entity and each row becomes a new entity as its child. Row entities contain attributes for the columns. So you represent your data in this way. (This should be the most meaningful representation, I think; suggestions are welcome.)
So storing data like a DB server is possible, but the lack of FK and PK (not sure about PK) support is an issue. On the other hand it supports attribute indexing (an attribute relates to a column; not sure how efficient it is). So consistency of the data is the responsibility of the application layer.
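A rough sketch of that representation using the Python ldap3 library (the DNs, object classes and attribute names are made up for illustration):

    from ldap3 import Server, Connection, ALL

    server = Server("ldap.example.com", get_info=ALL)
    conn = Connection(server, user="cn=admin,dc=example,dc=com",
                      password="secret", auto_bind=True)

    # The "table" is a container entry; each "row" is a child entry
    # whose attributes stand in for the columns.
    conn.add(
        "cn=cust-1001,ou=customers,dc=example,dc=com",
        object_class=["top", "inetOrgPerson"],
        attributes={"cn": "cust-1001", "sn": "Smith",
                    "mail": "smith@example.com"},
    )

    # There is no FK enforcement: a "reference" to another entry is just
    # a string attribute the application must keep consistent itself.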
Why would somebody do this ever?
The data that the application uses/stores closely matches the existing data in AD (users, machines, department info, etc.). (But some customization of the existing entity schema is still required, and new schema definitions are needed for data that is not closely related.)
(I think the strongest reason would be this business-related one.) Most mid-sized companies have very well configured AD servers (replicated, backed-up, etc.) but they don't have such a DB setup (you can comment on this as much as you want). Say you sell software which requires a DB setup to these companies: they must manage their DB setup; but if you say "you don't need DB setup and management; you can just use the existing AD", it sounds appealing.
Obviously there are many disadvantages to giving up a DB; feel free to mention them, but let's assume they are acceptable. (I can mention more if the question is not clear enough.)
LDAP is a terrible tool for maintaining most business data.
Think about a typical one-to-many relationship - say, customer and orders. One customer has many orders.
There is no good way to represent this data in an LDAP directory.
You could try having a mock "foreign key" by making every entry of that given object class have a "foreign key" attribute, but your referential integrity just went out the window. Cascade deletes are impossible.
You could try having a "customer" object that has "order" children. However, you've just introduced a specific hierachy - you're now tied to it.
And that's the simplest use case. Once you start getting into more complex relationships, you're basically re-inventing an RDBMS in a system explicitly designed for a different purpose. The clue's in the name - directory.
If you're storing a phonebook, then sure, use LDAP. For anything else, use a real database.
For relatively small, flexible data sets I think an LDAP solution is workable. However, an RDBMS provides a number of significant advantages:
Backup and Recovery: just about any database will provide ACID properties. And RDBMS backups are generally easy to script and provide several options (e.g. full vs. differential). I just don't know with LDAP, but I imagine these qualities are not as widespread.
Reporting: AFAIK LDAP doesn't offer a way to JOIN values easily, much less do things like calculate summations. So you would put a lot of effort into application code to reproduce those behaviors when you do need reporting (see the sketch after this list). And what application doesn't, ultimately?
Indexing: it looks like LDAP solutions have indexing, but again, it seems hit or miss. Whereas seemingly all databases out there have put some real effort into getting this right.
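On the reporting point: even a simple customer/orders rollup has to be reassembled in application code, roughly like this (a sketch using the Python ldap3 library; the object class and attribute names are invented):

    from collections import defaultdict
    from ldap3 import Server, Connection

    # Anonymous bind for brevity; a real deployment would authenticate.
    conn = Connection(Server("ldap.example.com"), auto_bind=True)

    # Pull every "order" entry, then aggregate by hand in the client,
    # i.e. work that a single SQL GROUP BY would do on the server.
    conn.search("ou=orders,dc=example,dc=com", "(objectClass=order)",
                attributes=["customerId", "orderTotal"])

    totals = defaultdict(float)
    for entry in conn.entries:
        totals[entry.customerId.value] += float(entry.orderTotal.value)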
I think any serious business system's storage should be backed up in the same fashion you believe LDAP is in most environments. If what you're really after is its flexibility in terms of representing hierarchy and the ability to define dynamic schemas, I'd suggest looking into NoSQL solutions or the Java Content Repository.
LDAP is very useful for storing that kind of information, and if you want to, you may use it. An RDBMS is just more comfortable to use with ORM systems; your persistence logic with LDAP will be much more complex.
It's also worth mentioning that this is not a standard approach, so the people who will support the project will spend more time on analysis.
I've used this approach for fun - I generate a phonebook from Active Directory - but I don't think it's a good idea to use LDAP as a store for business applications.
In short: Use the right tool for the right job.
When people see LDAP, you have already set an expectation for your system. Don't forget what the L stands for: Lightweight. LDAP was designed for accessing directories over a network.
With a “directory database” you can build a certain type of application. If you can map your data to a tree-like data structure, it will work. I surely would not want to stream videos from LDAP! You can probably hack something together, but I would prefer a streaming server.
There might be some hidden gotchas down the line if you use a tool for something it was not designed to do. So the downside is you'll have to test stuff that would otherwise have been a given.
It's not just a technical concern, either. Your operational support team might “frown” on your application, as they will have certain expectations/preconceptions based on your application's architectural nature. Imagine their surprise if you hand them a CRM system (website + files + POP email, etc.) with an LDAP server as the database to maintain.
If I were in your position, I would steer towards one of the NoSQL db solutions rather than trying to use LDAP. LDAP is fine for things like storing user and employee information, but it is terrible to interact with when you need to make changes. A NoSQL db will allow you to store your data how you want, without the RDBMS overhead you would like to avoid.
The answer is actually easy. Think of CRUD (Create, Read, Update, Delete). If your system will mostly perform reads, you can think about using LDAP, because LDAP is quick at read operations and was designed for them. If the other operations dominate, an RDBMS would be the better option.

Is it bad practice to use inheritance in the database to make it less painful to change the schema later?

I'm thinking about a new design for a database, and I want to use Table-Per-Type inheritance inside the database. I want to separate audit data (like create/modify dates, user_id, permissions, etc.) from the actual data, and use some abstract objects (tables), but I don't plan to go deeper than 3 levels of inheritance (to keep things simple and fast). The reason for this is that I want my database design to be flexible and easy to modify, without lots of impact on my code. Using inheritance and extending existing functionality that way looks like a good choice to me.
I read some posts saying that using inheritance in a database is bad practice, but I have never seen an explanation of why, or of what exact problems can occur.
I want my database design to be flexible and easy to modify, without lots of impact on my code.
First, you're just describing Object-Relational Mapping (ORM). You should update your question to specify a language, a database, and an ORM.
Second, this kind of flexibility is often an attractive nuisance.
You don't specify what database you're using, but there can be a fundamental contradiction in your desires.
A SQL database has a relatively fixed, less-than-perfectly-flexible schema so that it can be fast.
Asking for flexibility means you must now make certain kinds of relatively painful and complex changes to the database in the name of "flexibility". Table design changes that involve moving (and possibly normalizing) data can be painful and difficult because the database isn't generally designed for flexibility.
More importantly, however, there's no simple mapping from inherited attributes to database tables.
There are times when superclass attributes should be copied into multiple tables.
However, there are also times when the superclass attributes should be a separate table with an explicit join to the subclass tables.
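To make the second option concrete, in SQLAlchemy terms it corresponds to joined-table ("table per type") inheritance: the superclass gets its own table with an explicit join to each subclass table (a minimal sketch; the model and names are invented):

    from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class AuditedObject(Base):
        # Superclass attributes live in their own table...
        __tablename__ = "audited_object"
        id = Column(Integer, primary_key=True)
        created = Column(DateTime)
        type = Column(String(50))
        __mapper_args__ = {"polymorphic_identity": "audited",
                           "polymorphic_on": type}

    class Document(Base):
        # ...and each subclass table joins back via its primary key.
        __tablename__ = "document"
        id = Column(Integer, ForeignKey("audited_object.id"),
                    primary_key=True)
        title = Column(String(200))
        __mapper_args__ = {"polymorphic_identity": "document"}

The first option (copying superclass attributes into each subclass table) avoids the join but duplicates the columns; that trade-off is exactly the decision that has to be made per hierarchy.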
As you read more about ORMs, you'll see that this is a problem of non-trivial complexity. It requires thinking, and each design decision has important consequences.
There is no "blanket" solution. Your vague "And I want to use Table-Per-Type inheritance inside database" isn't a good idea or a bad idea. It's a vague idea. Each individual table and each individual inheritance decision must be made individually based on performance requirements and the expected need for flexibility.
As an answer to my own question: I have reread Martin Fowler's book Patterns of Enterprise Application Architecture on database inheritance, and it helped me clear my thoughts and choose my inheritance strategy. There is nothing bad about using inheritance in the database; as S. Lott said, it just requires thorough thinking and testing.

Is EAV-Hybrid a bad database design choice?

We have to redesign a legacy POI database, moving from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as a good design approach for the new database. The new design should allow:
an arbitrary number of attributes/properties for any entity, i.e. the number of attributes is not fixed and may change on a regular basis;
content admins to add new properties to existing entities on the fly through admin interfaces, rather than making changes to the db schema all the time.
There are quite a few discussions about the performance issues of EAV, but if we don't go with a hybrid EAV we end up:
having a lot of empty columns (we still go and add new columns even if 99% of the data does not have those properties);
spending more time maintaining the database, especially when attributes keep changing;
with no way of allowing content admins to add new properties to existing entities.
Anyway, here's what we are thinking about for the new design (basic ERD included):
Have separate tables for each entity containing the basic info that is exclusive to it, e.g. id, name, address, contact, created, etc.
Have 2 tables, attribute_type and attribute, to store property information.
Link each entity to an attribute using a many-to-many relation.
Store addresses in a different table and link them to entities using a foreign key.
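Roughly, the ERD translates to something like this (expressed in SQLAlchemy Core just for concreteness; the names are our working draft and some columns are omitted):

    from sqlalchemy import (Column, DateTime, ForeignKey, Integer,
                            MetaData, String, Table, Text)

    metadata = MetaData()

    # Basic, exclusive info stays in a conventional entity table.
    stadium = Table("stadium", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String(200)),
        Column("created", DateTime))

    # Two tables hold the dynamic property definitions and values.
    attribute_type = Table("attribute_type", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String(100)))

    attribute = Table("attribute", metadata,
        Column("id", Integer, primary_key=True),
        Column("type_id", Integer, ForeignKey("attribute_type.id")),
        Column("value", Text))

    # Many-to-many link between an entity and its attributes.
    stadium_attribute = Table("stadium_attribute", metadata,
        Column("stadium_id", Integer, ForeignKey("stadium.id"),
               primary_key=True),
        Column("attribute_id", Integer, ForeignKey("attribute.id"),
               primary_key=True))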
We think this will allow us to be more flexible when adding, removing, or updating properties.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given stadium, we might have a query with 20+ joins to fetch all the related attributes in a single row.
What are your thoughts on this design, and what would be your advice to improve it.
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. Postgresql has transactional DDL; that is one killer feature for automating schema changes.
Postgresql doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client side, like json or protobuf, if you don't need to touch anything specific on the database side.
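For illustration, the array option might look like this with SQLAlchemy's PostgreSQL dialect (a hedged sketch, not the poster's actual schema; HSTORE would be the key/value alternative):

    from sqlalchemy import Column, Integer, Text
    from sqlalchemy.dialects.postgresql import ARRAY
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Entity(Base):
        __tablename__ = "entity"
        id = Column(Integer, primary_key=True)
        # Attributes serialized into the row itself, avoiding one skinny
        # row (and its ~32 bytes of overhead) per attribute value.
        attr_ids = Column(ARRAY(Integer))
        attr_values = Column(ARRAY(Text))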
Don't go out of your way to put everything into one single entity table. If the entities don't share any attributes in a sensible way, use multiple instantiations of the specific EAV pattern you use. But do try to use the same pattern and share the accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of ability to understand the system, versioning tools, QA procedures, staging as your code, but it will have none of the tools. In essence you will be doing programming in a very awkward non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work and has limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seductive it is. But it's something of an easy way out - an easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with infinite product attributes.
Sure, you could go about enumerating all the columns you would need for every product in the world, but that would take you a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for the infinite stuff, but don't rely on EAV for all the database's tables. A hybrid of EAV and relational design, when done right, is a powerful tool whose flexibility could not be accomplished with fixed columns alone.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.

Users asking for denormalized database

I am in the early stages of developing a database-driven system and the largest part of the system revolves around an inheritance type of relationship. There is a parent entity with about 10 columns and there will be about 10 child entities inheriting from the parent. Each child entity will have about 10 columns. I thought it made sense to give the parent entity its own table and give each of the children their own tables - a table-per-subclass structure.
Today, my users requested to see the structure of the system I created. They balked at the idea of the table-per-subclass structure. They would prefer one big ~100 column table because it would be easier for them to perform their own custom queries.
Should I consider denormalizing the database for the sake of the users?
Absolutely not. You can always create a view later to show them what they want to see.
They are effectively asking for a report.
You could give them access to a view containing all the fields they require... that way you don't mess up your data model.
No. Structure the data properly, and if the users need a denormalized view of the data, create it as a VIEW in the database.
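For example, a minimal sketch (table and column names invented, issued here through SQLAlchemy just for concreteness):

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://app@localhost/appdb")

    # A flattened, read-only presentation for the users; the normalized
    # tables underneath stay exactly as designed.
    with engine.begin() as conn:
        conn.execute(text("""
            CREATE VIEW combined_report AS
            SELECT p.id, p.created, c.title, c.body
            FROM parent AS p
            JOIN child AS c ON c.parent_id = p.id
        """))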
Alternatively, consider that perhaps an RDBMS is not the appropriate storage tool for this project.
They are the users and not the programmers of the system for a reason. Provide a separate interface for their queries. Power users like this can be both helpful and a pain to deal with. Just explain that you need the database designed a certain way so you can do your job, period. Once that is accomplished, you can provide other means to make querying easier.
What do they know!? You could argue that users shouldn't even be having direct access to a database in the first place.
Doing that leaves you open to massive performance issues, just because a couple of users are running ridiculous queries.
How about if you created a VIEW in the format your users wanted while still maintaining a properly normalized table?
Aside from a lot of the technical reasons for or against your users' proposition, you need to be on the same page in communicating the consequences of various scenarios and (more importantly) the costs of those consequences. If the users are your clients and they are paying you to do a job, explain that their awful "proposed" ideas may cost them more money in development time, additional hardware resources, etc.
Hopefully you can explain it in such a way that shows your expertise and why your idea is a much better value to your users in the long run.
As everyone more or less mentioned, that way lies madness, and you can always build a view.
If you just can't get them to come around on this point, consider showing them this thread and the number of pros who weighed in saying that the users are meddling with things that they don't fully understand, and the impact will be an undermined foundation.
A big part of the developer's craft is the feel for what won't work out long term, and the rules of normalization are almost canonical in that respect. There are situations where you need to denormalize (data warehouses, etc) but this doesn't sound like one of them!
It also sounds as though you may have a particularly troubling brand of user on your hands - the amateur developer who thinks they could do your job better themselves if only they had the time. This may or may not help, but I've found that those types respond well to presentation - a few times now I've found that if I dress sharp and show a little bit of force in my personality, it helps them feel like I'm an expert and prevents a bunch of problems before they start.
I would strongly recommend coming up with an answer that doesn't involve someone running direct reports against your database. The moment that happens, your DB structure is set in stone and you can basically consider it legacy.
A view is a good start, but later on you'll probably want to structure this as an export, to decouple further. Of course, then you'll encounter someone who wants "real time" data. Proper business analysis usually reveals this to be unnecessary. Actual real time requirements are not best handled through reporting systems.
Just to be clear: I'd personally favour the table per subclass approach, but I don't think it's actually as big an issue as the direct reporting off transaction tables is going to be.
I would first opt for a view (as others have suggested) or an inline table-valued function (the benefit of this is that you can require parameters - like a date range or a customer account - which can help stop users from querying without any limits on the problem space). An inline TVF is really a parametrized view and is far closer to a view in terms of how the engine treats it than it is to a multi-statement table-valued function or a scalar function, which can perform incredibly poorly.
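A minimal sketch of such an inline TVF (T-SQL, with invented table and parameter names, issued here through SQLAlchemy just for concreteness):

    from sqlalchemy import create_engine, text

    engine = create_engine("mssql+pyodbc://user:pass@my_dsn")

    # An inline TVF is effectively a view that *requires* a date range,
    # so ad hoc queries cannot scan the whole table unbounded.
    with engine.begin() as conn:
        conn.execute(text("""
            CREATE FUNCTION dbo.OrdersInRange (@from date, @to date)
            RETURNS TABLE
            AS RETURN (
                SELECT o.id, o.customer_id, o.total, o.placed_on
                FROM dbo.orders AS o
                WHERE o.placed_on >= @from AND o.placed_on < @to
            )
        """))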
However, in some cases, this can impact production performance if the view is complex or intensive. With poorly written ad hoc user queries, it can also cause locks to persist longer or be escalated further than they would on a better built query. It is also possible for users to misinterpret an E-R data model and produce multiplied numbers in cases where there are many-to-one or many-to-many relationships. The next option might be to materialize these views with indexes or make tables and keep them updated, which gets us closer to my next option...
So, given those drawbacks of the view option and already thinking of mitigating it by starting to make copies of data, the next option I would consider is to have a separate read-only (for these users) version of the data which is structured differently. Typically, I would first look at a Kimball-style star schema. You do not need to have a full-fledged time-consistent data warehouse. Of course, that's an option, but you could simply keep a reporting model up to date with data. Star-schemas are a special form of denormalization and are particularly good for numerical reporting, and a given star should not be able to be abused by users accidentally. You can keep the star up to date in a number of ways, including triggers, scheduled jobs, etc. They can be very fast for reporting needs and run on the same production installation - perhaps on a separate instance if not just a separate database.
Although such a solution may require you to effectively more than double your storage requirements, compared with other practices it might be a really good option if you understand your data well and don't mind having two models - one for transactions and one for analysis (note that you will already start to have this logical separation anyway with the simplest first option, a view).
Some architects will often double their servers and use the SAME model with some kind of replication, in order to provide a reporting server which is indexed more heavily or differently. Such a second server doesn't impact production transactions with reporting requirements and can be kept up to date fairly easily. There will only be one model, but of course this has the same usability problems of allowing users ad hoc access to the underlying model, only without the performance effects, since they get their own playground.
There are a lot of ways to skin these cats. Good luck.
The customer is always right. However, the customer is likely to back down when you convert their requirement into dollars and cents. A 100 column table will require extra dev time to write the code that does what the database would do automatically with the proper implementation. Further, their support costs will be higher since more code means more problems and lower ease of debugging.
I'm going to play devil's advocate here and say that both solutions sound like poor approximations of the actual data. There's a reason that object-oriented programming languages don't tend to be implemented with either of these data models, and it's not because Codd's 1970 ideas about relations were the ideal system for storing and querying object-oriented data structures. :-)
Remember that SQL was originally designed as a user interface language (that's why it looks vaguely like English and not at all like other languages of that era: Algol, C, APL, Prolog). The only reasons I've heard for not exposing a SQL database to users today are security (they could take down the server!) and usability (who wants to write SQL when you can clicky clicky?), but if it's their server and they want to, then why not let them?
Given that "the largest part of the system revolves around an inheritance type of relationship", then I'd seriously consider a database that lets me represent that natively, either Postgres (if SQL is important) or a native object database (which are awesome to work with, if you don't need SQL compatibility).
Finally, remember that every engineering decision is a tradeoff. By "sticking to your guns" (as somebody else proposed), you're implicitly saying the value of your users' desires is zero. Don't ask SO for a correct answer to this, because we don't know what your users want to do with your data (or even what your data is, or who your users are). Go tell them why you want a many-tables solution, and then work out a solution with them that's acceptable to both of you.
You've implemented Class Table Inheritance and they're asking for Single Table Inheritance. Both designs are valid in certain situations.
You might want to get a copy of Martin Fowler's Patterns of Enterprise Application Architecture to read more about the advantages and disadvantages of each design. That book is a classic reference to have on your bookshelf, in any case.

Defining the database schema in the application or in the database?

I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
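For concreteness, the second approach in SQLAlchemy looks roughly like this (a minimal sketch; the model is invented):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String(100))

    # The ORM emits the CREATE TABLE statements from the class definitions.
    engine = create_engine("sqlite:///app.db")
    Base.metadata.create_all(engine)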
Bear in mind that applications and databases tend to live in a M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to handle modeling your domain data. The database is also the biggest source of slowdown in applications, and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because the database is really just a place to persist information from the application; it should never lead the design. It's not the only place to persist information, either. We have users that want to work from flat files or the database, for instance. They could also use XML files. So starting with your domain objects and then generating tables (or flat-file or XML schemas or whatever) from there will lead to a much better design in the end.
While this may depend on your using an object-oriented language, an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you, up to and including generating the database creation scripts.
In reference to performance: performance should be one of the last things you look at in an application; it should never drive the design. After you get a good schema up and running based on your domain, you can always make tweaks to improve its performance.
A lot depends on your skill level with the specific database product that you're going to use. Think of it as the difference between a "manual" and an "automatic" transmission car. ORMs provide you with that "automatic" transmission: just start designing your classes and let the ORM worry about getting them stored in the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI ("persistence ignorant"), they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction: up towards the application space, or down towards the physical database. But remember, only YOU know whether it's more advantageous to use a view instead of a table, whether you should normalize or de-normalize a table, what non-clustered index(es) make sense for a table, whether a natural or surrogate key is more appropriate, etc. Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing: you really need to separate the application design from the database design. They are almost never the same. How important is the data? Could another application be designed to use it? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way, since it's a perfect example of the DRY principle.
Having said that, getting away with it is always going to be hard to pull off, since you're practically choosing to give up most database-specific optimizations (more so with querying, but it still applies to schemas: indexes, etc.).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares. :)
My 2 Cents
Design each based on its own requirements as much as possible. Trying to keep them in too-rigid sync is a good illustration of increased coupling/decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).
