Database development organisation

A question regarding a DB development project. The database already exists and is rather large (several TBs).
What do you use for version control in DB development?
How do you control concurrent changes to the data model by different teams?
What is your approach to unit testing in DB development?
How do you deal with the sensitive data if the DB owners do not know what is sensitive? What is your approach to the data obfuscation? What are your obfuscation techniques?
How do you work on a large DB from several locations?
Please answer one or more of the items as you see fit. Each answer will be reviewed separately. Thank you very much!
EDIT:
A related question with good answers to point 1 is here: How do you version your database schema?

For most of these, while the tools don't apply, the general processes of code development do:
Maintain a development system separate from production with enough data to get useful performance metrics when testing a new model
This system has unit tests (SQL queries, commits, aborted atomic commits, etc.) written and run against it prior to every release; a sketch follows this list.
There are official 'releases'
The development database is the source control system itself - in other words, the database is modeled and held in the database, with check-ins and rollbacks, etc. It's non-trivial and doesn't solve every problem, but given the lack of good VCS for databases it works.
Roll-outs (after testing, integration, etc) consist of just the new database structure going to the production site - the modeling tables are not replicated there.
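To make the unit-testing point concrete, here is a minimal sketch in T-SQL of the kind of transaction-based test described above; the procedure and table names are hypothetical:

    BEGIN TRANSACTION;

    -- exercise the procedure under test (all names here are hypothetical)
    EXEC dbo.usp_AddOrder @CustomerId = 42, @Amount = 19.99;

    -- assert: exactly one new order row should exist for this customer
    IF (SELECT COUNT(*) FROM dbo.Orders WHERE CustomerId = 42) <> 1
        RAISERROR('usp_AddOrder did not insert exactly one row', 16, 1);

    -- abort the atomic commit so the development data is left untouched
    ROLLBACK TRANSACTION;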

For 4, "How do you deal with the sensitive data if the DB owners do not know what is sensitive? What is your approach to the data obfuscation?"
"Sensitive until proven innocuous" is my mantra. Unless someone makes a case for not adequately protecting any data from visibility (either internal or external) then my default mode is to protect it.
Cases come up later on where we'll open data up for performance, reporting, etc. reasons, but a documented business case with the appropriate signatures is required.
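As an illustration of the default-protect approach, here is a minimal obfuscation sketch in SQL, with hypothetical table and column names; deterministic masking keeps row counts and join keys usable while destroying the sensitive values:

    -- mask direct identifiers but keep the referential shape intact
    UPDATE dbo.Customers
    SET    FullName = 'Customer ' + CAST(CustomerId AS varchar(10)),
           Email    = 'user' + CAST(CustomerId AS varchar(10)) + '@example.com',
           Phone    = NULL,   -- no business case to keep it: drop it
           TaxId    = NULL;

A real pass would also have to consider free-text columns, audit tables and backups, which is where most leaks hide.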

How to handle database source in repositories of other applications?

This may be more opinionated than I like, but please forgive me. I'm searching for a definitive answer.
I am using Git, JIRA (issue management), Bitbucket (online Git projects) and SourceTree (Git GUI client) for a project that involves multiple cross-code and cross-platform segments.
My issue is specifically with how to handle database source control in relation to the applications that utilize said database and its objects.
For example, let's say you have a web based tool that was developed to pull data from said database using stored procedures. Would the database stored procedures be stored in the same repository as the web application?
In another example, let's say the same web application just used basic SQL queries. But the systems that prepared the data, such as a complex ETL system, helped make it happen. Would the ETL system source code be in the repository too?
(Note: I am not referring to database changes to the data types, indexes or schema. I'm referring to SQL scripts, stored procedures, SSIS packages, SSRS sources and possibly even OLAP cube frameworks that are stored on the server but, of course, are not under any version control or SCM system beyond the developers' own discipline.)
I hope this isn't too broad. There is just very little documentation out there on handling relational database objects in relation to an application or related systems. Databases by themselves do not seem to be that popular as candidates for version control and SCM systems, even though they are a critical part of the puzzle.
Unfortunately, there is no definitive answer to this question. Where to store part of your system (or where to draw the line between systems) is highly context dependent.
Some considerations that might help you to decide are:
Who is developing the code?
If everything is created and maintained by the same team, then it might be best to store everything together.
At what rate is the code changing?
If the code to show the data is changing rapidly, while the code to create the data is only changing occasionally, it might be best to separate the two code bases.
How is the code deployed/run?
If the deployment and running of various parts of a system are vastly different, then it could make sense to store and handle them in different ways.
To link this to your examples: in the first situation I would probably suggest keeping everything together, based on considerations 1 and 3. For the second example, all considerations together would suggest moving the ETL system to a separate repository.

Why WordPress does not use views or stored procedures

I installed a WordPress blog and was tinkering with the database.
I noticed they are not using any stored procedures or views. Why is this?
Or is it just not available to wordpress.org users, and some premium feature for paid wordpress.com members?
Is it not advisable to use these to improve performance, considering WordPress stores almost everything except media files in the database?
Are there any resources or attempts to optimize the WP database using these?
The decision regarding where to keep transformations of / operations on data is heavily rooted in the concept of what you consider to be the central interface to the data within the application as a whole.
If you're a database programmer, you're much more likely to consider that central point to be the database. In this view, the data is the center, and the surrounding application can be thought of as just an interface on top of that data. This view makes sense when dealing with anything where data itself is key. I.e., where the data will stay put over time, and the ways in which the data is accessed, or the things which you want to do with the data will change over time. Examples which fit well into this view include: Financial systems, Healthcare records, Customer data, Phone records... pretty much anything that has a lot of ways of looking at the data, and is constantly growing.
If you're an application programmer, the data itself may be almost secondary. In this view, the data is transient. Where and how that data is stored is even less important. The MVC pattern encourages the database to be utterly replaceable, and strongly discourages putting any sort of logic related to anything other than basic data integrity into the database. There is certainly nothing about the MVC pattern or other application-centric development practices which argues specifically against stored procedures or views, but there is much less room for them to be useful. Examples which fit well into this view include: Blogs, Message-boards, Stand-alone Documents... pretty much anything that has a very simple structure, does not have complex relations, and can be divided easily into self-contained units. Anything for which "what you can do" is tied closely in concept to "what you are doing it to".
A summary of the two above-mentioned viewpoints is that there are tools for which examining data is more important (data-centric), and there are tools for which creating data is more important (application-centric).
Another way of looking at it is that Stored Procedures and Views are just interfaces on top of a database. Wordpress is also an interface on top of a database, it's just written in PHP.
Well, I don't know their rationale for a fact, but my guess would be that since MySQL actually stores the procedures in the "mysql" database - not the WordPress database where the tables are - they did it because it can be an access issue. Let's say you have a DB server supporting multiple WP databases. All the procedures get put into the "mysql" database. So when you back up your WP database, you don't get any of the procedures. You'd need to back up the mysql (system) database, and it's likely the users would not have the rights to do so in such an environment, which is the typical environment for WP installs.
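If you want to check this on your own server, a quick look (assuming a MySQL 5.x install, where routine definitions live in the mysql system database rather than in the WordPress schema):

    -- lists routines per schema; on MySQL 5.x the definitions themselves
    -- are held in the mysql system database (mysql.proc), not here
    SELECT ROUTINE_SCHEMA, ROUTINE_NAME, ROUTINE_TYPE
    FROM   information_schema.ROUTINES
    WHERE  ROUTINE_SCHEMA = 'wordpress';  -- hypothetical schema name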
Excellent answers. To add, I think that from a plugin coding side, it is easier to update just the file system and do as little database work on an as needed basis.
Especially if a plugin update doesn't install right the first time and you have to restore the files and try again, a database change would be a lot more difficult to reverse.

Using a common database for collaborative development

Some of the people in my project seem to think that using a common development database with everyone connecting to it is the best thing. I think it isn't, and that each developer having his own database (with periodically updated data dumps) is best. Am I right or wrong? Have you encountered any problems with either of these approaches?
Disk space and CPU should be cheap enough that every developer can run their own instance of the database, with an automated build under version control. This is needed to allow developers to be bold in hacking on the database, in isolation from any other developer's concurrent hacking.
The caveat being, of course, that any changes they make to their private instance are useless to anyone else unless it can be automatically applied during the build process. So there needs to be a firm policy that application code can't depend on any database state unless that state is represented by version-controlled, unit-tested changes to the DDL.
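In practice this usually means numbered migration scripts checked in alongside the application code; a minimal sketch, assuming a hypothetical SchemaVersion bookkeeping table:

    -- 042_add_customer_email.sql (checked in next to the application code)
    ALTER TABLE Customers ADD Email varchar(255) NULL;

    -- record the migration so the automated build can skip scripts that a
    -- given database instance has already applied
    INSERT INTO SchemaVersion (VersionNumber, AppliedAt)
    VALUES (42, CURRENT_TIMESTAMP);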
For an excellent guide on the theory and practice of treating the database definition as another part of the project code, and coordinating changes and refactorings, see Refactoring Databases: Evolutionary Database Design by Scott W. Ambler and Pramod Sadalage.
I like having my own copy of the database for development, because it gives me the flexibility to rapidly change things without worrying about how it will impact others.
However, if all the developers are hacking away on their own copy of the database, it becomes more and more difficult to merge everyone's work together in the end.
I think you can get the best of both worlds by letting developers work on a local copy during day-to-day development, but each developer should probably merge their work into a common copy on a pretty regular basis. Writing a lot of unit tests helps too.
We share a single database amongst all our developers (20-odd), but we've got it structured so that everyone has their own tables.
You don't need a separate database per developer if you structure the application right. It should be configurable which database or table-prefix it uses anyway so you can easily move it between instances (unit test, system test, acceptance test, production, disaster recovery and so on).
The advantage to using a single database is that the cost of maintenance is amortized. You don't have your DBAs trying to handle a lot of databases (or, if you're a small-DB shop, you don't have every developer trying to maintain their own database when they're better utilized in developing).
Having a single point of failure is not a good thing, is it?
I prefer a single, shared database. But it's very dependent on the situation and the applications being developed.
What works for me may not work for you. Go with your gut.
If you are working with Hibernate or any hibernate-based platform you can configure your database to be created when you start your server (create-drop option). This is very useful when you are adding new attributes to your classes. If this is the case each developer must have his own copy of the DB.
If you are not changing the DB structure at all then you can use a single shared DB.
Even in this second case it is not a must. I prefer to have my own DB where I can do whatever I want. On the other hand, remember that some queries can take a lot of time, and this will affect your whole team if you are sharing a DB.

One DB per developer or not?

In a corporate development environment writing mostly administrative software, should every developer use their own database instance, or should they use a central database instance during development? What are the advantages and disadvantages of each approach? What about other environments and other products?
If you all share the same database, you might have some issues if someone makes a structural change to the database and the code is not synchronized with it.
I highly recommend one DB per developer, for the simple reason that you don't want to run a "write" test only to see someone else overwrite it right after. A simple example? You try to display products for a website. Everything works until all the products disappear. The problem? Another developer decided to play with the "Active" flag of the products to test something else. In cases like that, a transaction might not even help. End of story: you spend time debugging someone else's actions.
I highly recommend replicating the staging database to the developer database once in a while to synchronize the structure (or better, have a tool to rebuild a database from scratch).
Of course, we require scripts for changes to the database and EVERYTHING is in a Source Control.
The days when database environments had to be scarce are long gone. I'm writing this posting on an XW9300 with 5x15k SCSI disks in it. This machine will run a substantial ETL job in a fairly reasonable length of time and (in mid-2007) cost me about £1,700 on eBay including the disks. From a developer's perspective, especially on database-centric projects like data warehousing, the line between a developer and a DBA is quite blurred. As I write this I am building a partition management framework for a SQL Server 2005 data warehouse.
Developers should have one or more development databases of their own for (IMO) these reasons:
Requires people to keep stored procedures, patch scripts and schema definition files in source control. Applying the patches can be automated to a fairly large extent. There are even tools such as Redgate SQL Compare Pro that do much of the grunt work for this.
Encourages an application architecture that facilitates easy configuration management and deployment, as people have to deploy onto their own workstations. Many deployment wrinkles will get sorted out long before they hit production or people even realise they could have gone wrong.
Avoids developers tripping up on each other's work. On something like a data warehouse where people are working with ETL code this is an even bigger win.
It encourages a degree of responsibility, as developers have to learn basic database administration. This also eliminates a lot of the requirements for operational support staff and some of the dev-vs-ops friction.
If you have your own database, there are no gatekeepers obstructing experimentation or other work on it. The politics around managing 'servers' disappear as there are no 'servers'.
This is a productivity win in any environment with significant incumbent bureaucracy.
For small data volumes an ordinary PC is fast enough for this. Developer editions or licencing are available for most if not all database management systems and will run on a desktop O/S. If you're working with Linux or Unix this is even less of an issue. For larger data volumes, up to and including most MIS applications, a workstation like an HP XW9400 or Lenovo D10 can be outfitted with five 15k disks for less than the cost of a lot of professional development tooling. (Yes, I know it's dual licence, but a commercial all-platform licence for Qt is about £4,000 a seat.)
A machine like this will run an ETL process with tens to hundreds of millions of rows faster than you might think.
It facilitates setting up more than one environment for smoke testing or reconciliation purposes. As you have complete control over the machine, you have quite a lot of scope for mocking up conditions in a production environment. For example, I once made a simple emulator for Control-M by just bodging some of its runtime scripts.
Where you have this level of control and transparency over the environment you can produce a fairly robustly tested deployment process which does quite a lot to eliminate opportunities for finger-pointing in production deployment.
I've seen small teams working with 14 environments, and had 7 active on a workstation at the same time. On database-heavy work such as ETL, where you're working with whole tables, working in a single dev environment is a recipe for time wastage or spending your time walking on eggshells.
Also, you can use single user development licences for database platforms, which can save you the cost of the workstations just in database licencing. Most developer licences (Microsoft and OTN are a couple of examples I'm familiar with) will let you use the system on a single workstation for a single developer free or for a nominal price.
Conversely, licencing terms on shared development servers are often somewhat murky and I've seen vendors try to shake customers down for licencing on dev servers on more than one occasion.
Each of our developers has a fully functional database. Changes are scripted and source controlled like any other code.
Ideally, yes, each developer should have a "sandbox" development environment, so they can test their code even before deploying it to a shared testing/staging environment.
Each developer's environment should run scripted tests that reset the database to a known state. This is impossible to do in a shared environment.
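A minimal sketch of such a reset script (the table and fixture names are hypothetical):

    -- return the sandbox to a known state before each test run
    DELETE FROM dbo.Orders;      -- children first, to satisfy foreign keys
    DELETE FROM dbo.Customers;

    -- reload the canonical fixture data
    INSERT INTO dbo.Customers (CustomerId, Name) VALUES (1, 'Test Customer');
    INSERT INTO dbo.Orders (OrderId, CustomerId, Amount) VALUES (1, 1, 10.00);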
The cost of giving each developer their own instance is less than the cost of the chaos resulting from multiple developers trying to test volatile changes together in a shared environment.
On the other hand, in many IT shops the system uses complex infrastructure, involving multiple application servers or multiple physical nodes. Then the economics change; it's less expensive for people to cooperate and avoid stepping on each other's work than it would be to replicate it for each developer. Especially true if you integrate expensive third-party systems that don't give you licenses for multiple development environments.
So the answer is yes and no. :-) Do give each developer their own environment if that environment can be reproduced inexpensively.
My recommendation is to have 2 levels of development environment:
Each developer has their own personal development system, with its own db, web servers, etc. This allows them to code against a known setup, write automated (system-level) tests that initialize their database and systems to a known state, etc.
The development integration environment is shared by all developers and used to make sure everything is working together as expected before handing it off to QA. Code is checked out from source control and installed there, and there's only a single instance of any servers (db or otherwise).
This question hints at what a developer needs to do his/her job. Certainly a private DB instance should be provided. Equally important, I would make sure that the DB is the same product/version as what you intend to deploy to. Don't develop on MySQL 6.x and deploy to MySQL 5.x. (This goes for app servers, and web servers as well!)
Having a developer DB doesn't necessarily mean you need it hosted on your local machine. You could have a central DBMS host machine with all dev dbs located on it. The pros are the guarantee that you develop against the target DB, less overhead on dev boxes, and more space/horsepower for beefy IDEs and app servers. The cons are a single point of failure for all devs (if the DBMS server goes down, nobody can work), lack of dev exposure to setting up and administering the DBMS, and that devs cannot experiment as easily with upcoming DB releases or alternate DB choices to solve tough problems.
Some of the pros can be cons and vice versa, depending on your organization and structure. Maybe you don't want devs administering the DBMS. Maybe you do plan to support varying DB platforms. The decision boils down to your organization as well as your target platform choices. If you plan to target a variety of DB/OS/app server combinations, then each dev should not only have their own DB but should work in a unique combination (MySQL/Tomcat/OS X for one, DB2/Jetty/Linux for another, Postgres/Geronimo/WinXP for a third, etc.). If, on the other hand, you set up an ASP (Application Service Provider) type shop on an iSeries, then of course you'll likely have a central host with all dev DBMSes; still, each dev should have at least a separate db instance to allow structural changes to the schema.
I have an instance of SQL Server Development Edition installed locally. We have a QA DB server, as well as multiple production servers. All development and integration testing is done using my local server (or other developers' local servers). New releases are staged to the QA server. Each release, after acceptance by the customer, is put into production.
Since I mostly do web development, I use the web server bundled with VS2008 for development and local test, then publish the web app to a QA web server hosted on a VM. Once accepted by the customer, it is published to one of several different production web servers -- some virtual, some not, depending on the application.
My department at my company only has limited development environments, purely because of cost of support and hardware. We have a couple of environments which are based on t-1 nightly refreshes from production, and some static ones.
Ideally, everyone should have their own, but in many cases, this is going to be impractical when the following are true:
you have a large number of developers needing resources (our department has maybe 80)
each developer needs multiple resources (typically I use 4-5 different dbs each day)
up-to-date data is important (you just can't refresh them fast enough)
In these cases, shared instances and good communication are what's needed.
One advantage of one database per developer: each developer has a snapshot of their own data in a "known" state.
I like the idea of using a local version when a developer must be isolated - developing a schema change, performance testing, setting up specific scenarios, etc...
At other times, use the shared version to ensure everything is in sync with everything else.
I think there's a terminology problem here. It's been a while since I've worn my DBA hat (golly gee, almost 10 years) - so someone else can chime in and correct me.
I think everyone is in agreement that each developer should have his own sandbox schema set.
In MySQL and Sybase/MS SQL Server, each database engine can support multiple databases. Each database is (normally) fully independent of the others. So you can have one database engine instance and give each developer his own database space to do with as he wishes. The only problem is if the developers are using tempdb - there can be collisions there (I think - this you will need to look up). Just be careful that cross-database queries with fixed database names are not used.
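As a sketch, on MySQL or SQL Server a per-developer sandbox is then just another database on the shared engine (the developer names are hypothetical):

    -- one sandbox database per developer on the shared engine
    CREATE DATABASE dev_alice;
    CREATE DATABASE dev_bob;
    -- cross-database queries with fixed database names would quietly
    -- couple these sandboxes back together, so avoid them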
In Oracle, the database engine instance is tied to a particular schema set. If you have multiple developers on the same engine, they are all pointing to the same tables. In this case, yes, you will need to run multiple instances.
Each of our developers has a local database. We store the create script AND a dump of the "standard data" in our SVN repo. We have an extensive set of tests that must pass against this test data. We also have a "sandbox" database that is available for people to put data in that they want shared into the standard data. This works well for us and allows us to let developers modify their local copies of data to test things, but we control what gets passed to other developers. We also strictly control schema changes, so we don't encounter the problems that someone else mentioned.
It really depends on the nature of your application. If yours is a client-server architecture in a distributed environment, it is best to have a central database that everyone uses. If the product gives users an environment with local database instances, you can use that. It is best if your development mirrors the real world environment as closely as possible.
It also depends on what stage of development you are in. In the early stages, you probably don't want to get bogged down by connectivity, network and distributed-environment issues and just want to be up and running. In such a case, you can start with a database-instance-per-user model before switching to the central model as the product reaches some level of maturity.
In my company we tend to copy the entire DB when working on non-trivial new features. The reasoning there is disk space is cheap, whereas accidental data loss (even if it's test data) isn't.
I've worked in both types of development environments. Personally, I prefer to have my own DB/app server. However, there may be some advantages to using a shared infrastructure for development.
The main one is that a shared environment more closely resembles a real-world scenario: you are more likely to uncover problems with locking or transactions when all developers share a DB. Giving each developer their own DB may lead to "it works on my DB" syndrome.
However, if you need to apply and test schema changes or optimisations, then I can see problems in this sort of set-up.
Maybe a compromise solution would work best: all developers share infrastructure, and if someone needs to test schema changes, they create their own temporary DB instance (maybe there is one just sitting there for this purpose?) until they are happy to commit the new schema to source control.
You do have your entire schema (and test data) in source control, right? Right???
I like the compromise solution suggested above: all developers share infrastructure, and anyone who needs to test schema changes creates their own temporary DB instance until they are happy to commit the new schema to source control.
One DB per developer. No question. But the real issue is how to script entire databases, "control data", and version them. My solution is here: http://dbsourcetools.codeplex.com/
Have fun. - Nathan.
The database schemas should be held in source control and developers should own the changesets checked in for code and db together. Prior to checkin the developer should be working on his own database. After checkin, an automated build (eg: on checkin, nightly, etc), should update a central integrated db, along with the apps themselves.
At developer instance level the data loaded should be appropriate for unit testing, at least. At integrated level, the shared db should hold data also appropriate for testing, but should not rely on production replication - this is just a slack substitute for managed test data.
In my experience, the only reason developers opt for a shared db is that they believe developing and running against recent production data is somehow 'real' and means they can put less effort into testing. They prefer to tread on each other's toes and put up with a shared db that slowly corrupts before the next production refresh rather than write and manage proper tests. It's this kind of management practice that gives the IT world the poor delivery reputation it currently has.
I'd suggest to use one instance of the database. You don't want your database to be a moving target.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get the benefits of separability - it's trivial to pull out the data associated with a given client and move it to a different server, or to restore a backup of that client when they call up to say "We've deleted some key data!", using the built-in database mechanisms.
You get easy and free server mobility - if you outgrow one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "but the database also contains other clients" issue. It fits the mental model of the user, in which they exist alone. Advantages like being able to do easy reporting on all clients at once are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
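A minimal sketch of the layout described above; the table and column names are assumptions, not any particular product's schema:

    -- master customer database: one row per tenant
    CREATE TABLE Customers (
        CustomerId       int          NOT NULL PRIMARY KEY,
        ConnectionString varchar(500) NOT NULL
    );

    -- every tenant table carries CustomerId, even while several tenants
    -- still share a single physical database
    CREATE TABLE Orders (
        CustomerId int NOT NULL,
        OrderId    int NOT NULL,
        PRIMARY KEY (CustomerId, OrderId)
    );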
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of competing with hundreds of thousands of records that may be in the database but are not theirs. I don't care how well everything is indexed and optimized; there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses the one-DB-per-customer approach as well. It also makes the code a bit easier to maintain.
In regulated industries such as health care, one database per customer may be a requirement, possibly even a separate database server per customer.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
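A hedged sketch of that per-database upgrade step in SQL; the DDL and the version bookkeeping are hypothetical:

    -- run against each client database in turn
    BEGIN TRANSACTION;
        ALTER TABLE Invoices ADD DueDate date NULL;
        UPDATE SchemaVersion SET VersionNumber = 43;
    COMMIT TRANSACTION;
    -- on any failure, roll back (or restore the snapshot) and stop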
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have many smaller databases to distribute over multiple machines, you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster, but the majority probably don't).
I'd be interested in hearing a little more context from you on this, because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world (Google Bigtable, etc.) but it is solving a different set of problems, and loses some of the useful features of an RDBMS.
There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case of an errant WHERE clause that showed me your customer data as well as my own.
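As a hedged illustration of the kind of errant query meant here (the schema is hypothetical):

    -- intended: only the signed-in client's records
    SELECT * FROM Invoices WHERE ClientId = @ClientId;

    -- the bug: the filter is lost, exposing every client's rows; with one
    -- database per client there are no other clients' rows to leak
    SELECT * FROM Invoices;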
