This may be more opinionated than I like, but please forgive me. I'm searching for a definitive answer.
I am using GIT, JIRA (Issue Management), Bitbucket (online GIT projects) and SourceTree (GIT GUI client) for a project that involves multiple cross-code and cross-platform segments.
My issue is specifically with how to handle database source control in relation to the applications that use said database and its objects.
For example, let's say you have a web based tool that was developed to pull data from said database using stored procedures. Would the database stored procedures be stored in the same repository as the web application?
In another example, let's say the same web application just used basic SQL queries. But, the systems that prepared the data such as a complex ETL system, helped make it happen. Would the ETL system source code be in the repository too?
(Note: I am not referring to database changes to the data types, indexes or schema. I'm referring to SQL scripts, stored procedures, SSIS packages, SSRS source and even possibly OLAP cube frameworks that are stored in the service. But of course, are not members of a DRC or CSM system for control outside of developer control.)
I hope this isn't too broad. There is just very little documentation out there on handling relational database objects in relation to an application or related systems. Databases by themselves do not seem to be that popular for DRC and CSM systems even though they are a critical part of the puzzle.
Unfortunately, there is no definite answer to this question. Where to store part of your system (or where to draw the line between systems) is highly context-dependent.
Some considerations that might help you to decide are:
Who is developing the code?
If everything is created and maintained by the same team then it might be best to store everything together
At what rate is the code changing?
If the code to show the data is changing rapidly, while the code to create the data changes only occasionally, it might be best to separate the two code-bases
How is the code deployed/run?
If the deployment and running of various parts of a system is vastly different then it could make sense to store and handle them in a different way.
To link this to your examples, in the first situation I would probably suggest to keep everything together based on considerations 1 and 3. For the second example, all considerations together would suggest moving the ETL system to a separate repository.
I installed a wordpress blog and was tinkering with the database,
I noticed they are not using any stored procedures or views. Why is this?
Or is it just not available for wordpress.org users and some premium feature for paid wordpress.com members?
Is it not advisable to use these to improve performance, considering WordPress stores almost everything except media files in the database?
Are there any resources / attempts to optimize wp database using these ?
The decision regarding where to keep transformations of / operations on data is heavily rooted in the concept of what you consider to be the central interface to the data within the application as a whole.
If you're a database programmer, you're much more likely to consider that central point to be the database. In this view, the data is the center, and the surrounding application can be thought of as just an interface on top of that data. This view makes sense when dealing with anything where data itself is key. I.e., where the data will stay put over time, and the ways in which the data is accessed, or the things which you want to do with the data will change over time. Examples which fit well into this view include: Financial systems, Healthcare records, Customer data, Phone records... pretty much anything that has a lot of ways of looking at the data, and is constantly growing.
If you're an application programmer, the data itself may be almost secondary. In this view, the data is transient. Where and how that data is stored is even less important. The MVC pattern encourages the database to be utterly replaceable, and strongly discourages putting any sort of logic related to anything other than basic data integrity into the database. There is certainly nothing about the MVC pattern or other application-centric development practices which argues specifically against stored procedures or views, but there is much less room for them to be useful. Examples which fit well into this view include: Blogs, Message-boards, Stand-alone Documents... pretty much anything that has a very simple structure, does not have complex relations, and can be divided easily into self-contained units. Anything for which "what you can do" is tied closely in concept to "what you are doing it to".
A summary of the two above-mentioned viewpoints is that there are tools for which examining data is more important (data-centric), and there are tools for which creating data is more important (application-centric).
Another way of looking at it is that Stored Procedures and Views are just interfaces on top of a database. Wordpress is also an interface on top of a database, it's just written in PHP.
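To make the "just interfaces on top of a database" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `posts` table, the `published_posts` view and the `published_posts_app` function are hypothetical names made up for illustration; this is not WordPress's real schema or code. The same rule ("only published posts") is expressed once as a view and once in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, status TEXT);
    INSERT INTO posts (title, status) VALUES
        ('Hello world', 'publish'),
        ('Unfinished idea', 'draft');

    -- Data-centric: the database itself exposes the interface.
    CREATE VIEW published_posts AS
        SELECT id, title FROM posts WHERE status = 'publish';
""")


def published_posts_app(connection):
    """Application-centric: the same rule lives in application code."""
    rows = connection.execute("SELECT id, title, status FROM posts").fetchall()
    return [(post_id, title) for post_id, title, status in rows
            if status == "publish"]


from_view = conn.execute("SELECT id, title FROM published_posts").fetchall()
from_app = published_posts_app(conn)
```

Both `from_view` and `from_app` hold the same rows; the difference is only where the rule is stored and who gets to reuse it.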
Well, I don't know their rationale for a fact, but my guess would be that since MySQL actually stores the procedures in the "mysql" database - not the WordPress database where the tables are - they did it because it can be an access issue. Let's say you have a DB server supporting multiple WP databases. All the procedures get put into the "mysql" database. So when you back up your WP database you don't get any of the procedures. You'd need to back up the mysql (system) database, and it's likely the users would not have the rights to do so in such an environment, which is the typical environment for WP installs.
Excellent answers. To add, I think that from a plugin coding side, it is easier to update just the file system and do as little database work on an as needed basis.
Especially if a plugin update doesn't install right the first time and you have to restore the files and try again, a database change would be a lot more difficult to reverse.
Does it make sense to use an OR-mapper?
I am putting this question out there on Stack Overflow because this is the best place I know of to find smart developers willing to give their assistance and opinions.
My reasoning is as follows:
1.) Where does the SQL belong?
a.) In every professional project I have worked on, security of the data has been a key requirement. Stored Procedures provide a natural gateway for controlling access and auditing.
b.) Issues with Applications in production can often be resolved between the tables and stored procedures without putting out new builds.
2.) How do I control the SQL that is generated? I am trusting parse trees to generate efficient SQL.
I have quite a bit of experience optimizing SQL in SQL-Server and Oracle, but would not feel cheated if I never had to do it again. :)
3.) What is the point of using an OR-Mapper if I am getting my data from stored procedures?
I have used the repository pattern with a homegrown generic data access layer.
If a collection needed to be cached, I cache it. I also have experience using EF on a small CRUD application and experience helping tuning an NHibernate application that was experiencing performance issues. So I am a little biased, but willing to learn.
For the past several years we have all been hearing a lot of respectable developers advocating the use of specific OR-Mappers (Entity-Framework, NHibernate, etc...).
Can anyone tell me why someone should move to an ORM for mainstream development on a major project?
edit: http://www.codinghorror.com/blog/2006/06/object-relational-mapping-is-the-vietnam-of-computer-science.html seems to have a strong discussion on this topic but it is out of date.
Yet another edit:
Everyone seems to agree that Stored Procedures are to be used for heavy-duty enterprise applications, due to their performance advantage and their ability to add programming logic nearer to the data.
I am seeing that the strongest argument in favor of OR mappers is developer productivity.
I suspect a large motivator for the ORM movement is developer preference towards remaining persistence-agnostic (don’t care if the data is in memory [unless caching] or on the database).
ORMs seem to be outstanding time-savers for local and small web applications.
Maybe the best advice I am seeing is from client09: to use an ORM setup, but use Stored Procedures for the database intensive stuff (AKA when the ORM appears to be insufficient).
I was a pro SP for many, many years and thought it was the ONLY right way to do DB development, but the last 3-4 projects I have done I completed in EF4.0 w/out SP's and the improvements in my productivity have been truly awe-inspiring - I can do things in a few lines of code now that would have taken me a day before.
I still think SP's are important for some things, (there are times when you can significantly improve performance with a well chosen SP), but for the general CRUD operations, I can't imagine ever going back.
So the short answer for me is, developer productivity is the reason to use the ORM - once you get over the learning curve anyway.
A different approach... With the rise of the NoSQL movement, you might want to try an object / document database to store your data instead. That way, you basically avoid the hell that is OR mapping. Store the data the way your application uses it, and do the transformation behind the scenes in a worker process to move it into a more relational / OLAP format for further analysis and reporting.
Stored procedures are great for encapsulating database logic in one place. I've worked on a project that used only Oracle stored procedures, and am currently on one that uses Hibernate. We found that it is very easy to develop redundant procedures, as our Java developers weren't versed in PL/SQL package dependencies.
As the DBA for the project I find that the Java developers prefer to keep everything in the Java code. You run into the occasional, "Why don't I just loop through all the Objects that just returned?" This caused a number of "Why isn't the index taking care of this?" issues.
With Hibernate your entities can contain not only their linked database properties, but can also contain any actions taken upon them.
For example, we have a Task Entity. One could Add or Modify a Task among other things. This can be modeled in the Hibernate Entity in Named Queries.
So I would say go with an ORM setup, but use procedures for the database intensive stuff.
A downside of keeping your SQL in Java is that you run the risk of developers using non-parameterized queries leaving your app open to a SQL Injection.
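The injection risk is easy to show. The answer above is about Java, but the idea is the same in any language, so here is a small sketch in Python with the standard sqlite3 module. The `tasks` table and both function names are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO tasks (owner, title) VALUES (?, ?)",
    [("alice", "Write report"), ("bob", "Review report")],
)


def find_tasks_unsafe(connection, owner):
    # String concatenation: the input becomes part of the SQL text itself.
    query = "SELECT title FROM tasks WHERE owner = '" + owner + "'"
    return connection.execute(query).fetchall()


def find_tasks_safe(connection, owner):
    # Parameterized query: the input is bound as a value, never parsed as SQL.
    return connection.execute(
        "SELECT title FROM tasks WHERE owner = ?", (owner,)
    ).fetchall()


malicious = "nobody' OR '1'='1"
leaked = find_tasks_unsafe(conn, malicious)   # matches every row
guarded = find_tasks_safe(conn, malicious)    # matches nothing
```

With the concatenated query the crafted input rewrites the WHERE clause and every task comes back; with the bound parameter it is treated as an (unmatched) owner name.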
The following is just my private opinion, so it's rather subjective.
1.) I think that one needs to differentiate between local applications and enterprise applications. For local and some web applications, direct access to the DB is okay. For enterprise applications, I feel that the better encapsulation and rights management makes stored procedures the better choice in the end.
2.) This is one of the big issues with ORMs. They are usually optimized for specific query patterns, and as long as you use those the generated SQL is typically of good quality. However, for complex operations which need to be performed close to the data to remain efficient, my feeling is that using manual SQL code is still the way to go, and in this case the code goes into SPs.
3.) Dealing with objects as data entities is also beneficial compared to direct access to "loose" datasets (even if those are typed). Deserializing a result set into an object graph is very useful, no matter whether the result set was returned by a SP or from a dynamic SQL query.
If you're using SQL Server, I invite you to have a look at my open-source bsn ModuleStore project, it's a framework for DB schema versioning and using SPs via some lightweight ORM concept (serialization and deserialization of objects when calling SPs).
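As a rough illustration of point 3 (turning a result set into typed objects regardless of where the rows came from), here is a sketch in Python. It is not how bsn ModuleStore works; `Customer` and `load_customers` are hypothetical names, and SQLite stands in for the real server. The only assumption the mapper makes is that the column names of the result set match the field names of the class.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Customer:
    id: int
    name: str
    city: str


def load_customers(connection, query, params=()):
    """Map each row to a Customer by column name.

    The query could just as well be a stored procedure call on a server
    that has them; the mapping step only looks at the result set.
    """
    cursor = connection.execute(query, params)
    columns = [description[0] for description in cursor.description]
    return [Customer(**dict(zip(columns, row))) for row in cursor]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customers (name, city) VALUES ('Ada', 'London'), ('Linus', 'Helsinki');
""")

customers = load_customers(
    conn, "SELECT id, name, city FROM customers WHERE city = ?", ("London",)
)
```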
I was thinking of starting a project that very clearly needs a persistent store. I was about to reluctantly decide on a RDBMS, when I came across an article which briefly mentions CouchDB. Seems some advancements in DB technology have happened since I last looked, so I thought I would ask here about databases before I got into it.
Here are my criteria. ( I list the criteria again at the end, so if you want to skip the explanations just scroll down. )
The project is open source and I will not be asking anything for it, so preferably the database is open source and free. Furthermore the software has to run on both Linux and Windows.
There are parts of the project that have to be in C++. The project is not large enough code wise to justify using a second language. So basically the whole thing will be C++.
This project will not have anything to do with the web, so preferably
the database will not require the detritus of a web library.
The objects I want to store fall into one of two categories: a basic object and a container object. The difference being objects which are containers will contain even more objects, ie: a parts of parts problem. I need a database that can handle such cases cleanly and efficiently.
I also expect the schema to evolve rapidly, at least initially. I also suspect that some of the old data simply will not fit into the new schemas. So I would like to keep different versions of the schema around. When possible, I would like to be able to transform data in one schema into another schema.
For the application to work the way intended, people would have to exchange large chunks of database with each other. So I would want simple ways of importing and exporting data, which I could automate to some degree.
Finally it would be nice if the database could in someway be simulated in unit tests.
Those are my requirements. I have replicated them below to make it easier for people answering.
Thank you
Non-technical requirements
1. Open source, preferably free.
2. Runs on Windows and Linux.
Technical requirements
1. Has a C++ interface.
2. Is able to handle a non-web application, preferably without REST.
3. Can handle a "parts of parts" problem fairly well.
4. Can handle multiple indexes.
5. Has some sort of concept of schema version, can handle multiple schema versions, and can migrate tables from one schema to another.
6. Should have a simple mechanism for moving data from one instance of the database to another.
7. Preferably has some mechanism for testing.
HDF5 is a binary format which behaves like a hierarchical database. It has bindings and libraries for C++ and Python (I only use the latter) and it is used to store large amounts of data, like the ones produced in certain physics and astronomy experiments.
http://www.hdfgroup.org/HDF5/
I've looked at a few NoSQL databases some time ago (I had a different requirement than you, though - I needed it to be a standalone server). The ones that I remember as particularly interesting are Redis and Kyoto Cabinet. Have a look.
BTW, you don't mention any performance requirement. If there is none, have you considered SQLite? Simple, embedded, stable, and with the flexibility of SQL after all. With prepared statements the performance penalty of SQL should not be very high.
EDIT: ooops, just noticed that you asked this more than a year ago... Well, perhaps you can tell us what you've chosen :)
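For what the SQLite suggestion could look like against two of the requirements ("parts of parts" and something testable), here is a small sketch. It is written in Python only to keep it short; the SQL is the part that matters and would be the same through SQLite's C API from C++. The `parts` table, `open_store` and `descendants` are hypothetical names, and the recursive query assumes a SQLite build with `WITH RECURSIVE` support (3.8.3 or later).

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store; the default in-memory database is handy for unit tests."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE parts (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            parent_id INTEGER REFERENCES parts(id)  -- NULL for top-level objects
        );
        CREATE INDEX parts_by_parent ON parts(parent_id);
    """)
    return conn


def descendants(conn, root_id):
    """Return (name, depth) for everything contained in root_id, at any depth."""
    return conn.execute("""
        WITH RECURSIVE subtree(id, name, depth) AS (
            SELECT id, name, 0 FROM parts WHERE id = ?
            UNION ALL
            SELECT p.id, p.name, s.depth + 1
            FROM parts AS p JOIN subtree AS s ON p.parent_id = s.id
        )
        SELECT name, depth FROM subtree WHERE depth > 0 ORDER BY depth, name
    """, (root_id,)).fetchall()


store = open_store()
store.executemany(
    "INSERT INTO parts (id, name, parent_id) VALUES (?, ?, ?)",
    [(1, "bicycle", None), (2, "wheel", 1), (3, "frame", 1),
     (4, "spoke", 2), (5, "tire", 2)],
)
```

Calling `descendants(store, 1)` walks the whole containment tree in one query, so the application does not have to issue one query per container.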
A question regarding a DB development project. The database already exist and is rather large (several TBs).
What do you use for version control in DB development?
How do you control concurrent changes to the data model by different teams?
What is your approach to unit testing in DB development?
How do you deal with the sensitive data if the DB owners do not know what is sensitive? What is your approach to the data obfuscation? What are your obfuscation techniques?
How do you work on a large DB from several locations?
Please answer one or more of the items as you see fit. Each answer will be reviewed separately. Thank you very much!
EDIT:
A related question with good answers to item 1 is here: How do you version your database schema?
For most of these, while the tools don't apply, the general processes of code development do:
Maintain a development system separate from production with enough data to get useful performance metrics when testing a new model
This system has unit tests (SQL queries, commits, aborted atomic commits, etc) written and run against it prior to every release.
There are official 'releases'
The development database is the source control system itself - in other words the database is modeled and held in the database with sign-ins and rollbacks, etc. It's non-trivial, and doesn't solve every problem, but given the lack of good VCS for databases it works.
Roll-outs (after testing, integration, etc) consist of just the new database structure going to the production site - the modeling tables are not replicated there.
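To give an idea of what a unit test for "commits and aborted atomic commits" might look like, here is a minimal sketch using Python's unittest and an in-memory SQLite database. The `accounts` table and the `transfer` function are invented for the example; a real suite would run against the development system described above rather than SQLite.

```python
import sqlite3
import unittest


def transfer(conn, src, dst, amount):
    """Move an amount between accounts as one atomic unit of work."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))


class TransferTests(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript("""
            CREATE TABLE accounts (
                id INTEGER PRIMARY KEY,
                balance INTEGER NOT NULL CHECK (balance >= 0)
            );
            INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 0);
        """)

    def balances(self):
        return dict(self.conn.execute("SELECT id, balance FROM accounts"))

    def test_commit(self):
        transfer(self.conn, 1, 2, 40)
        self.assertEqual(self.balances(), {1: 60, 2: 40})

    def test_aborted_commit_leaves_no_trace(self):
        # The debit violates the CHECK constraint, so the credit that was
        # already applied must be rolled back as well.
        with self.assertRaises(sqlite3.IntegrityError):
            transfer(self.conn, 1, 2, 150)
        self.assertEqual(self.balances(), {1: 100, 2: 0})
```

Run it with `python -m unittest` before each release, in the spirit of the list above.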
For 4, "How do you deal with the sensitive data if the DB owners do not know what is sensitive? What is your approach to the data obfuscation?"
"Sensitive until proven innocuous" is my mantra. Unless someone makes a case for not adequately protecting any data from visibility (either internal or external) then my default mode is to protect it.
Cases come up later on where we'll open data up for performance, reporting, etc. reasons, but a documented business case with the appropriate signatures is required.
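The "sensitive until proven innocuous" rule translates naturally into a default-deny allowlist. The sketch below is only an illustration of that idea in Python: `CLEARED_COLUMNS`, `mask_value` and `obfuscate_row` are hypothetical names, and a salted hash is used purely as a stand-in masking technique. Note that hashing low-entropy values (phone numbers, dates of birth) can be reversed by brute force, so treat this as a sketch of the policy, not as a vetted anonymization scheme.

```python
import hashlib

# Hypothetical allowlist: only columns with a signed-off business case go here.
CLEARED_COLUMNS = {"order_id", "order_date"}


def mask_value(value, salt="dev-copy"):
    """Replace a value with a stable, unreadable token.

    The same input always yields the same token, so joins on masked
    columns still line up in the obfuscated copy.
    """
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def obfuscate_row(row, cleared=CLEARED_COLUMNS):
    """Mask every column that has not been explicitly cleared."""
    return {
        column: value if column in cleared or value is None else mask_value(value)
        for column, value in row.items()
    }


row = {"order_id": 7, "order_date": "2011-03-01",
       "email": "a@example.com", "notes": None}
masked = obfuscate_row(row)
```

A column nobody has thought about yet (such as `email` here) is masked automatically; it only becomes visible once someone adds it to the allowlist.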
I have been developing web/desktop applications for about 6 years now. During the course of my career, I have come across applications that were heavily written in the database using stored procedures, whereas a lot of applications had only a few basic stored procedures (to read, insert, edit and delete entity records) for each entity.
I have seen people argue that if you have paid for an enterprise database, you should use its features extensively. On the other hand, a lot of "object oriented architects" told me it's an absolute crime to put anything more than necessary in the database, and that you should be able to drive the application using the methods on those classes.
Where do you think is the balance?
Thanks,
Krunal
I think it's a business logic vs. data logic thing. If there is logic that ensures the consistency of your data, put it in a stored procedure. Same for convenience functions for data retrieval/update.
Everything else should go into the code.
A friend of mine is developing a host of stored procedures for data analysis algorithms in bioinformatics. I think his approach is quite interesting, but not the right way in the long run. My main objections are maintainability and lacking adaptability.
I'm in the object oriented architects camp. It's not necessarily a crime to put code in the database, as long as you understand the caveats that go along with that. Here are some:
It's not debuggable
It's not subject to source control
Permissions on your two sets of code will be different
It will make it more difficult to track where an error in the data came from if you're accessing info in the database from both places
Anything that relates to Referential Integrity or Consistency should be in the database as a bare minimum. If it's in your application and someone wants to write an application against the database they are going to have to duplicate your code in their code to ensure that the data remains consistent.
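A small sketch of why integrity rules belong in the database: once the constraint is declared there, every client gets it for free. The example uses Python's sqlite3 module with hypothetical `customers` and `orders` tables. One SQLite-specific detail: foreign keys are only enforced when `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada');
""")


def try_insert_order(connection, customer_id):
    """Any client, old or new, hits the same rule: no orphan orders."""
    try:
        with connection:
            connection.execute(
                "INSERT INTO orders (customer_id) VALUES (?)", (customer_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


valid = try_insert_order(conn, 1)     # customer exists
orphan = try_insert_order(conn, 99)   # no such customer, rejected by the DB
```

A second application written against the same database cannot create an orphan order either, and nobody had to duplicate a check in its code.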
PLSQL for Oracle is a pretty good language for accessing the database and it can also give performance improvements. Your application can also be much 'neater' as it can treat the database stored procedures as a 'black box'.
The sprocs themselves can also be tuned and modified without you having to go near your compiled application, this is also useful if the supplier of your application has gone out of business or is unavailable.
I'm not advocating that 'everything' should be in the database, far from it. Treat each case separately and logically and you will see which makes more sense: putting it in the app or putting it in the database.
I'm coming from almost the same background and have heard the same arguments. I do understand that there are very valid reasons to put logic into the database. However, it depends on the type of application and the way it handles data which approach you should choose.
In my experience, a typical data entry app like some customer (or xyz) management will massively benefit from using an ORM layer as there are not so many different views at the data and you can reduce the boilerplate CRUD code to a minimum.
On the other hand, assume you have an application with a lot of concurrency and calculations that span a lot of tables and that has a fine-grained column-level security concept with locking and so on, you're probably better off doing stuff like that directly in the database.
As mentioned before, it also depends on the variety of views you anticipate for your data. If there are many different combinations of columns and tables that need to be presented to the user, you may also be better off just handing back different result sets rather than map your objects one-by-one to another representation.
After all, the database is good at dealing with sets, whereas OO code is good at dealing with single entities.
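The sets-versus-single-entities contrast can be sketched like this (Python with sqlite3; the `items` table and function names are made up). Both functions apply the same discount: the first pulls each row into the application and writes it back one at a time, the second states the change once and lets the database apply it to the whole set.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO items (category, price) VALUES (?, ?)",
        [("book", 100), ("book", 250), ("toy", 400)],
    )
    conn.commit()
    return conn


def discount_row_by_row(conn, category, percent):
    # Entity-at-a-time: one SELECT plus one UPDATE per matching row.
    rows = conn.execute(
        "SELECT id, price FROM items WHERE category = ?", (category,)
    ).fetchall()
    for item_id, price in rows:
        conn.execute("UPDATE items SET price = ? WHERE id = ?",
                     (price * (100 - percent) // 100, item_id))
    conn.commit()


def discount_set_based(conn, category, percent):
    # Set-at-a-time: a single statement, evaluated inside the database.
    conn.execute(
        "UPDATE items SET price = price * (100 - ?) / 100 WHERE category = ?",
        (percent, category),
    )
    conn.commit()


def prices(conn):
    return [row[0] for row in conn.execute("SELECT price FROM items ORDER BY id")]


looped, batched = make_db(), make_db()
discount_row_by_row(looped, "book", 10)
discount_set_based(batched, "book", 10)
```

The results are identical; with three rows the difference is invisible, but the row-by-row version grows by one round trip per row.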
Reading these answers, I'm quite confused by the lack of understanding of database programming. I am an Oracle PL/SQL developer, and we use source control for every bit of code that goes into the database. Many of the IDEs provide add-ins for most of the major source control products, from ClearCase to SourceSafe. The Oracle tools we use allow us to debug the code, so debugging isn't an issue. The issue is more of logic and accessibility.
As a manager of support for about 5000 users, the fewer places I have to look for the logic, the better. If I want to make sure the logic is applied for ALL applications that use the data, even business logic, I put it in the DB. If the logic is different depending on the application, they can be responsible for it.
@DannySmurf:
It's not debuggable
Depending on your server, yes, they are debuggable. This provides an example for SQL Server 2000. I'm guessing the newer ones also have this. However, the free MySQL server does not have this (as far as I know).
It's not subject to source control
Yes, it is. Kind of. Database backups should include stored procedures. Those backup files might or might not be in your version control repository. But either way, you have backups of your stored procedures.
My personal preference is to try and keep as much logic and configuration out of the database as possible. I am heavily dependent on Spring and Hibernate these days so that makes it a lot easier. I tend to use Hibernate named queries instead of stored procedures and the static configuration information in Spring application context XML files. Anything that needs to go into the database has to be loaded using a script and I keep those scripts in version control.
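A rough sketch of the "load it from scripts that live in version control" idea, in Python with sqlite3. Everything here is an assumption for illustration: the `applied_scripts` bookkeeping table, the `apply_scripts` function, and the convention that `*.sql` files are applied in filename order. It is not how Spring or Hibernate do it, and a script and its bookkeeping row are not applied atomically in this version.

```python
import sqlite3
from pathlib import Path


def apply_scripts(conn, scripts_dir):
    """Apply every not-yet-applied *.sql file from a checked-out directory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_scripts (name TEXT PRIMARY KEY)"
    )
    already = {row[0] for row in conn.execute("SELECT name FROM applied_scripts")}

    newly_applied = []
    for path in sorted(Path(scripts_dir).glob("*.sql")):
        if path.name in already:
            continue  # this script was loaded by an earlier run
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute("INSERT INTO applied_scripts (name) VALUES (?)", (path.name,))
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied
```

Usage would be something like `apply_scripts(sqlite3.connect("app.db"), "db/scripts")` as part of deployment, where both the file name and the directory are placeholders. Running it twice is harmless, because scripts that were already recorded are skipped.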
@Thomas Owens: (re source control) Yes, but that's not source control in the same sense that I can check in a .cs file (or .cpp file or whatever) and go and pick out any revision I want. To do that with database code requires a potentially significant amount of effort to either retrieve the procedure from the database and transfer it to somewhere in the source tree, or to do a database backup every time a minor change is made. In either case (and regardless of the amount of effort), it's not intuitive; and for many shops, it's not a good enough solution either. There is also the potential here for developers who may not be as studious at that as others to forget to retrieve and check in a revision. It's technically possible to put ANYTHING in source control; the disconnect here is what I would take issue with.
(re debuggable) Fair enough, though that doesn't provide much integration with the rest of the application (where the majority of the code could live). That may or may not be important.
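The "retrieve it from the database and put it in the source tree" step can at least be scripted, which takes care of some of the effort, though not of the discipline problem. Below is a sketch in Python against SQLite. SQLite has no stored procedures, so views and triggers stand in for them here; on a server with procedures you would query that server's own catalog instead. `export_definitions` and the file naming are hypothetical.

```python
import sqlite3
from pathlib import Path


def export_definitions(conn, target_dir):
    """Write each view/trigger definition to its own .sql file for check-in."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    rows = conn.execute(
        "SELECT type, name, sql FROM sqlite_master "
        "WHERE type IN ('view', 'trigger') ORDER BY type, name"
    )
    for obj_type, name, sql in rows:
        path = target / f"{obj_type}_{name}.sql"
        path.write_text(sql.strip() + ";\n", encoding="utf-8")
        written.append(path.name)
    return written
```

Run on a schedule or from a commit hook, the exported files show up as ordinary diffs in the repository, so a forgotten check-in is at least visible.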
Well, if you care about the consistency of your data, there are reasons to implement code within the database. As others have said, placing code (and/or RI/constraints) inside the database acts to enforce business logic, close to the data itself. And, it provides a common, encapsulated interface, so that your new developer doesn't accidentally create orphan records or inconsistent data.
Well, this one is difficult. As a programmer, you'll want to avoid TSQL and such "Database languages" as much as possible, because they are horrendous, difficult to debug, not extensible and there's nothing you can do with them that you won't be able to do using code on your application.
The only reasons I see for writing stored procedures are:
Your database isn't great (think how SQL Server doesn't implement LIMIT and you have to work around that using a procedure).
You want to be able to change a behaviour by changing code in just one place without re-deploying your client applications.
The client machines have big calculation-power constraints (think small embedded devices).
For most applications though, you should try to keep your code in the application where you can debug it, keep it under version control and fix it using all the tools provided to you by your language.