Questions and considerations to ask client for designing a database - database

so as title says, I would like to hear your advices what are the most important questions to consider and ask end-users before designing database for their application. We are to make database-oriented app, with special attenion to pay on db security (access control, encryption, integrity, backups)... Database will also keep some personal information about people, which is considered sensitive by law regulations, so security must be good.
I worked on school projects with databases, but this is first time working "in real world", where this db security has real implications.
So I found some advices and questions to ask on internet, but here I always get best ones. All help appreciated!
Thank you!

Some other specifics besides what has already been said:
Do you have any Regulatory
requirements for data access and
storage (Sarbanes-Oxley and HIPAA
come to mind)
Do you need to be able to audit
record changes
What internal controls do you need
reflected in the database
What business rules must be followed
under what circumstances
How large to you expect the data to
get - the larger the data store
expected the more critical to design
with performance in mind from the
start
How flexible do you want the system
to be (do you want to be able to add
columns on the fly? OR add business
rules) Be careful with this one, make
sure the client understands that
flexibilty often comes at the cost of
performance.
Do you need a separate data warehouse
for reporting?
How do you need the data populated?
Will it come from an application,
multiple applications, data imports
or a combination?
What databases do you currently have
license for? Do you want to have
this application use it?
Will different groups of users need
different accesses?
How is the process currently being
handled, can we have access to that
database or see the current process
in action. Observe, for a minimum of
one day, the client using the current
system. Take extensive notes, you will learn many things no one will think to tell you.
Do you need to migrate data from the
old system

i would start with:
Please explain your business to me.
Which processes are you looking to
automate or improve?
Do you have any reports you need to
generate?
Do you need inputs to any other
systems?

use cases (google for that, it does not need to be drawings, text is fine)
inputs
outputs
static data
historical data
From there you derive the info you need to store, you apply 4th NF, and go !
Good luck ! 8-))

Related

Transact SQL, prefix with schema or not?

I'm developing a tool where I've prefixed tables etc. with "dbo" now I get requests for custom schema names. I'm thinking of skipping them and instead let the user control this via the associated login against the Db. I know there's talk about "performance" since it needs to search the users's schemes and then fallback on dbo etc, but is that really an issue? Opinions?
First, I would look at this question as a feature request from your customers (users?). So the immediate decision to make is, should you even consider looking into this now, or do you have other feature requests that are obviously more important and deliver more benefit to the customer?
For example, for now you could simply tell customers that your application requires its own database that should not be shared with other applications or manipulated in any way by the customer. Then you don't have to worry about schemas or the same object name in two schemas because your application 'owns' the database. Perhaps this is already the case, but if so then I don't understand why your customers care which schema your objects are in.
Second, assuming that you do decide to work on it, you should gather some information about why people are asking for this, to make sure that you clearly understand what they expect you to deliver and what the benefit is for them. If customers are really saying "your application runs slowly" then the choice of schema is highly unlikely to be the reason, it's much more probable that indexing, schema design or your application code are the areas to look at.
Finally, if you still want to go ahead you need to find a technical solution. This is partly a deployment issue and partly a coding issue. It's a deployment issue because you have to deploy your database objects in a specific schema that is specified at installation time, and all your patches and later releases need to be aware of that too. The coding issue is that you need your database code to be "schema-aware", in case you end up in a situation where you have dbo.TableName, MyTool.TableName and OtherSchema.TableName all in the same database. The solution is obviously to reference the schema name in all code, which is considered an important best practice anyway. But exactly how you do this depends on how you have structured your application, if you use an ORM etc.

CMS Database design - Master database or Multi-Db per site

I am in process of designing my CMS that I am about to create. I was thinking about the database and how I want to go by approaching it.
Do you think its best to create 1 master database for all my clients websites? or Should I have 1 database per site?
What is the benefits and negatives on both approaches? I am always thinking about the future so I was thinking about implementing memcache or APC cache to the project, to offer an option to my client.
Just trying to learn the best practices and what other developers apporach would be
I've run both. My business chooses to separate client-specific data into separate tables so that if one happens to go corrupt, not all are taken down. In an ideal world this might never happen, but murphy's law....It does seem very easy to find things with them separated. You will know with 100% certainty that one client's content will never show up on another's page.
If you do go down that route, be prepared to create scripts that build and configure databases for you. There's nothing fun about building a great system and having demand for it, only to spend your time manually setting up DB's and installs all day long. Also, setting db names is one additional step that's not part of using a single db table--it's a headache that will repeat itself seemingly over and over again.
Develop the single master DB. It will take a small amount of additional effort and add a little bit more complexity to the database design, but will give you a few nice features. The biggest is being able to share data between sites.
Designing for a master database means that you have the option to combine sites when it makes sense, but also lets you install a master per site. Best of both worlds.
It depends greatly upon the amount of customization each client will require. If you forsee clients asking for many one-off features specific to their deployment, separate databases based off of a single core structure might make sense. I would highly recommend trying to make any customizations usable by all clients though, and keep all structure defined in one place/database instead of duplicating it across multiple databases. By using one database, you make updating the structure straightforward and the implementation consistent across all sites so they can all use the same CMS code.

Which comes first: database or application logic?

What is the best way or recommended best practice in the flow of database driven asp.net web application? I mean the database first or coding first or side by side?
Your data access code won't compile without an existing database - unless you stub (or Mock) it. So probably the database comes first.
But it is a bad idea to do whole chunks of the application in isolation. Ideally you should design and build slivers of the system - database and application - hand-in-hand. These slivers should be cohesive sub-sets of functionality, probably smaller than sub-systems. Inevitably, the act of coding screens and business rules will throw up problems in the data model. So it is good to have a data modeller or DBA who is happy to work incrementally alongside the developers.
edit
Stephanie makes an extremely pertinent point:
"the core tables which are persisting
your app's data really can't be
piecemealed. Most of the data is known
at project start. It has a form, you
need to find it."
I agree that the core entities are knowable at project start, and the physical data model can be derived from that logical data model. But I don't think it is ever possible to nail down completely the structure of any table, even a core table, at the start. This is because at the start of the design/build phase all we have to go on are the Requirements, and if there's one thing history tells us about the Requirements it is that they will change.
So, new tables will be needed and some existing tables will become obsolete. There will be columns which need to be added, columns which need to be modified, columns which need to be dropped. This is why Nature gave us the ALTER TABLE statement.
I am not suggesting that we don't design our tables, or assemble them piecemeal. I am merely suggesting that when we start designing the HR sub-system we need to worry about the EMPLOYEES table and the SALARIES table. We don't need to concern ourselves with INVENTORY or ORDERS until we commence work on Sales.
We personally start with the Domain and do things side-by-side. The important part is that we implement vertical slices of the application (fully working end-to-end features), not horizontal slices (e.g. first the whole database layer, then the data access, then the services, then the presentation): we build the application incrementally and demonstrate progress with working code after each iteration.
Applications are all about features.
You don't build apps to store data,
but to provide functionality. If we
can't agree on that, the discussion is
moot of course. Software should be
developed to satisfy the needs of its
users and not of its developers.
Well I have really no understanding of the second sentence. If you think my company pays me a good salary to write code that satisfies me and not my users you're crazy. So that argument is a strawman. Back to the first.
This is a common view point of application centric people (they), vs. database centric people (We). They see the entire point of the exercise to "provide features". Those are things the clients know they want and ask for them. To them, the database is just persistence required for these features. And when they are done, that's it, features delivered, database is sufficient for those features. Could be an entire Rube Goldberg inside the database with redundant data, severe violations of normal forms, constraints enforced by the application, what have you.
think overall usability alone outweighs database design
If the design of your database is affecting your usability than the design was bad. I have no doubt that one who strives for features will leave the database in such a state that it severely hampers usability.
Data Centric people, don't look at a system as a place to provide only what's been asked for, but a repository of Intellectual Capital that can be exploited by more than whatever the Application-du-jour is. I can't begin to describe the number of cases where one team has used the database of some other team's app to enhance their apps value. Just look at all the medical research that is nothing more that the meta-analysis of existing studies. None of that is possible if you believe that only the features of your app matter and subsequent uses of your apps data do not.
A good data model isn't inviolate. Sure you'll add to it, change it when requirements change. But if you don't completely understand your data, I don't know how anyone can begin to write code.
I guess you need first to define datamodel and only then going coding. You should plan everything carefully before actually writting the code.
First is a feature list.
Then, detailed spec.
Then test plan and design of all, including databases.
Then, it wouldn't matter which to implement first.
You'll probably end up doing it "side by side".
You need some data to be able to test the application, but you need the application to be able to verify that you're storing the correct data.
Do some modelling first and then build the minimum you can for one or two features. Then when these are working add the next feature and so on.
You'll need to write some database update procedures (both the code and the rules about what and when to update) as you will have to extend your tables, but you'll need those for the final system anyway as it will have to change as new requirements come along.
Having done it quite a few times, I find myself invariably doing it like so:
Define the problem I'm trying to solve.
Write out some use-cases.
Have my significant other or a friend tell me if this is even a problem.
Sketch out a few sample screens.
Write flow diagrams for the use cases.
Ask my Rubber-duck questions.
Use questions to refine 1-6.
Write out the 'nouns'. Those become my data Model.
Write out the actions. Those become application logic.
Code data Model.
Code Application Logic.
Realize I've gotten it a little wrong.
Repeat 10-12 as many times as needed.
Ask, "Have I solved the problem"?
If not, rinse, lather and repeat 1-15.
This is a trick question. IMO, they both come in parallel during your planning and design phase. They are so closely related that it make sense to do them together. Just keep in mind that your database design will be almost fully developed while your code is still in its infancy (though your application logic should be almost fully mapped out in you head or on paper)
The idea is that you're designing your solution in the context of the problem. When you're planning out your solution you will be (or should be) defining your application as a set of things and actions (nouns and verbs).
For example, a very basic helpdesk program has people and tickets. People need to create tickets, update tickets, and close tickets. The nouns that require persistent storage will comprise your database, and the nouns + actions will be contained in your application.
Sometimes your table mappings and the relationship between tables will be obvious (IE people create tickets, ticket.creatorID = people.personID) and other times the relationship doesn't really click in your head until you start working through use cases or until you start writing your code (IE different ppl have different access levels defining what they can do. At a glance this would seem like a simple field in a table, but in practice it is better as a separate table).

Why shouldn't I give outsiders access to my database?

Lots of sites today have APIs that allow users to get data from the site as XML or JSON using a GET HTTP request. Flickr and del.icio.us are example of sites with APIs. These APIs require the server to access the database, and then output the result as either XML or JSON.
Why do we need this translation though? Why not just create a user on the database (for example MySQL)? The user would be given limited access to the database, only being allowed to SELECT, and only certain tables and certain columns in those tables. Wouldn't this be a lot more efficient for the server (it wouldn't have to deal with the HTTP request), and it would be easier for developers, who could now access exactly the data they need, the way they need it.
Security considerations aside, so that you can change your database structure without affecting your clients. Also, poorly formed queries tie up your server, not the clients.
Can you prevent a malicious individual from crafting a super-complex SQL query that will peg your database's CPU at 100%? Can you prevent a lot of innocent programmers from crafting inefficient queries that will never be optimized that will do the same thing?
Coding to Contract - with APIs, you may change everything behind them without affecting outsiders use of them. Here you'd be tying them to not just MySQL but your schema
Caching - Allowing them any query almost removes any opportunity for caching that predictable queries over http that can be used. This is probably the number one way to remove the often number one bottleneck, the database.
Security - with this approach, it would be easy for a denial of service attack, even by accident. Not to mention the fact you'd have to give access to data layer, which is often put in a restricted zone where security can be tightened
Usability - not everyone is a developer or wants to understand a your internal domain. They probably prefer a pre baked straight forward and self-explaining API. An extreme example would be to give managers db privileges rather than reports.
An API:
Makes it easier to montior and control usage (implementing 'limited queries per X' for DB users may be harder)
Allows for presenting simpler structures to the user than may be used in the DB.
Means the user doesn't have to understand your DB structure.
Allows for DB portability. (Oh you've grown massive and now need to implement: sharding, move to bigtable, etc. - With an API the user doesn't need to know)
Allows for different (better? / variable?) caching of requests.
Means you don't have to pay for extra DB users (If that's how the DB is licensed.)
Portability too. Lets say for licensing reasons and scaling you make the business decision to move from MSSQL to MySql. Syntax ain't quite the same and your clients will all have to change their code.
Much better to just buffer it all off and keep the implementation abstracted away. Whose to say you're not persisting the state of the application using trained monkeys scratching marks on bottletops?
Security is the number 1 reason but I hope those reasons are obvious. The user tying up precious resources with bad queries is another good reason.
Beyond that though, why an abstraction layer?
Might you ever want to add some logging to database queries to diagnose speed or to help debug?
Might you ever go from MySQL to MS SQL or vice versa where SQL other than pure ANSI might break?
Should the customer really have to learn your schema rather than a more logical abstraction?
When a new programmer learns of normalization and can now see your whole schema including your carefully balanced denormalizations, do you want to put up with every uninformed criticism?
When a more experienced db person points out improvements, do you want to be stuck with your old schema?
Why to use an API is a question of why to use abstractions and my list here barely scratches the surface.
the web server gives you a buffer that you can control. if there is some bug in your sql server or whatever, you don't want it exposed directly to the internet. true, if the web server has bugs, it might be just as bad ... except you have that extra layer between the data and the world.
-don
It's not as much a 'why not' than a 'why should you' question. Handling HTTP requests is a small penalty for complete control over what all data you allow or disallow a user from accessing. Further, should the nature / quantity / security level of data change in future, you will be better off with a JSON / XML response than allowing total access.
The thing to bear in mind when you're thinking of security issues is that it's really hard to anticipate all of the possible vectors that someone could use to attack you. For instance, are you really sure you've gotten your database permissions set so that people can't mess things up?
Therefore, you want to try restricting actions to only what you know to be good, not just trying to restrict the things you know to be bad. This can be done with a web service that you have absolute control over, but it's difficult to allow somebody to access the database directly and be sure that you're secure.
API is a kind of Wrapper around of database. Users do not know anything about database internal representation of data, he only need to send a number of unified requests and get unified response on it. How and when data will be processed on the server - it's not his headache.

Defining the database schema in the application or in the database?

I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
Bear in mind that applications and databases tend to live in a M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to handle things modeling your Domain data. The database is also the biggest source of slow down in applications and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because really the database is simply a place to persist information from the application, it should never lead the design. It's not the only place to persist information as well. We have users that want to work from flat files or the database for instance. They could also use XML files. So by starting with your domain objects and then generating tables (or flat file or XML schema or whatever) from there will lead to a much better design in the end.
While this may depend on you using an object-oriented language, using an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you up to, and including generating the database creation scripts.
In reference to performance, performance should be one of the last things you look at in an application, it should never drive the design. After you get a good schema up and running based on your domain you can always make tweaks to improve its performance.
Alot depends on your skill level with the specific database product that you're going to use. Think of it as the difference between a "manual" and "automatic" transmission car. ORMs provide you with that "automatic" transmission, just start designing your classes, and let the ORM worry about getting it stored into the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI "persistence ignorant", they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction, up towards the application space, or down towards the physical database. But remember, only YOU know if it's more advantageous to use a view instead of a table, should you normalize or de-normalize a table, what non-clustered index(es) make sense with this table, is a natural or surrogate key more appropriate for this table, etc... Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing, you really need to seperate the application design from the database design. They are almost never the same. How important is that data? Could another application be designed to use that data? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way. Since it's a perfect example of the DRY principle.
Having said that however, getting away with it is always going to be hard to pull off since you're practically choosing to give up most database specific optimizations. (more so, with querying, but it still applies to schemas (indexes, etc)).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares :)
My 2 Cents
Design each based on their own requirements as much as possible. Trying to keep them in too rigid sync is a good illustration of increased coupling/decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).

Resources