Are bad data issues that common? [closed] - database

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I've worked for clients that had a large number of distinct, small to mid-sized projects, each interacting with each other via properly defined interfaces to share data, but not reading and writing to the same database. Each had their own separate database, their own cache, their own file servers/system that they had dedicated access to, and so they never caused any problems. One of these clients is a mobile content vendor, so they're lucky in a way that they do not have to face the same problems that everyday business applications do. They can create all those separate compartments where their components happily live in isolation of the others.
However, for many business applications, this is not possible. I've worked with a few clients, one of whose applications I am doing the production support for, where there are "bad data issues" on an hourly basis. Yeah, it's that crazy. Some data records from one of the instances (lower than production, of course) would have been run a couple of weeks ago, and caused some other user's data to get corrupted. And then, a data script will have to be written to fix this issue. And I've seen this happening so much with this client that I have to ask.
I've seen this happening at a moderate rate with other clients, but this one just seems to be out of order.
If you're working with business applications that share a large amount of data by reading and writing to/from the same database, are "bad data issues" that common in your environment?

Bad data issues occur all the time. The only reasonably effective defense is a properly designed, normalized database, preferably interacting with the outside world only through stored procedures.

This is why it is important to put the required data rules at the database level and not the application. (Of course, it seems that many systems don't bother at the application level either.)
It also seems that a lot of people who design data imports don't bother to clean the data before putting it in their system. Of course, it's hard to find all the possible ways to mess up the data; I've done imports for years and I still get surprised sometimes. My favorite was the company where the data entry people obviously didn't care about the field names, and the application just moved to the next field when the first field was full. I got names like "McDonald, Ja" in the last name field and "mes" in the first name field.
I do data imports from many, many clients and vendors. Out of hundreds of different imports I've developed, I can think of only one or two where the data was clean. For some reason the email field seems to be particularly bad and is often used for notes instead of emails. It's really hard to send an email to "His secretary is the hot blonde."
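As a minimal sketch of putting such rules at the database level, the snippet below uses Python's built-in sqlite3 module. The contact table, its columns, and the try_insert helper are hypothetical names chosen for illustration, and the email check is only a crude shape test (something@something, no spaces), not real address validation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE contact (
        id         INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL CHECK (length(trim(last_name)) > 0),
        first_name TEXT NOT NULL CHECK (length(trim(first_name)) > 0),
        -- Crude shape check only: text, an @, more text, and no spaces.
        email      TEXT CHECK (
            email IS NULL OR (email LIKE '_%@_%' AND email NOT LIKE '% %')
        )
    )
    """
)

def try_insert(last_name, first_name, email):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO contact (last_name, first_name, email) VALUES (?, ?, ?)",
                (last_name, first_name, email),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With a table like this, a note typed into the email column is rejected by the database itself, whichever application or import script sent it.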

Yes, very common. Getting the customer to understand the extent of the problem is another matter. At one customer I had to resort to writing an application which analyzed their database and beeped every time it encountered a record which didn't match their own published data format. I took the laptop with their DB installed to a meeting and ran the program, then watched all the heads at the table swivel around to stare at their DBA while my machine beeped crazily in the background. There's nothing quite like grinding the customer's nose in his own problems to gain attention.

I don't think you are talking about bad data (though it would only be polite of you to answer the various questions raised in comments) but about invalid data. For example, '9A!' stored in a field that is supposed to contain a 3-character ISO currency code is probably invalid data, and should have been caught at data entry time. Bad data is usually taken to be equivalent to corruption caused by disk errors etc. The former is quite common, depending on the quality of the data input applications, while the latter is pretty rare.
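To illustrate catching such a value at data entry time, here is a small sketch of an input check. The function name is hypothetical, and the KNOWN_CODES set is a deliberately tiny stand-in; a real system would load the full ISO 4217 list from an authoritative source.

```python
import re

# Shape check only: exactly three uppercase ASCII letters.
_CODE_SHAPE = re.compile(r"[A-Z]{3}")

# Hypothetical allow-list for illustration, NOT the complete ISO 4217 table.
KNOWN_CODES = {"USD", "EUR", "GBP", "JPY"}

def is_valid_currency_code(value: str) -> bool:
    """Accept a value only if it has the right shape and is a known code."""
    return bool(_CODE_SHAPE.fullmatch(value)) and value in KNOWN_CODES
```

The shape test alone would reject '9A!' but still accept nonsense such as 'QQQ', which is why the sketch also checks membership in a list of known codes.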

I assume that by "bad data issues" you mean "issues of data that does not satisfy all applicable business constraints".
They can only be a consequence of two things: bad database design by the database designer (that is, either unintentional or, even worse, intentional omission of integrity constraints in the database definition), or else the inability of the DBMS to support the more complex types of database constraint, combined with a flawed program written by the programmer to enforce the DBMS-unsupported integrity constraint.
Given how poor SQL databases are at integrity constraints, and given the poor level of knowledge of data management among the average "modern programmer", yes such issues are everywhere.

If the data gets corrupted because users shut down their application in the middle of complex database updates, then transactions are your friends. This way you don't end up with an entry in the Invoice table but no entries in the InvoiceItems table. Unless the transaction is committed at the end of the process, all changes made are rolled back.
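A minimal sketch of that idea, using Python's built-in sqlite3 module. The Invoice and InvoiceItems table names come from the answer above; the columns and the save_invoice helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Invoice (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE InvoiceItems (
        invoice_id   INTEGER NOT NULL REFERENCES Invoice(id),
        description  TEXT NOT NULL,
        amount_cents INTEGER NOT NULL CHECK (amount_cents > 0)
    );
    """
)

def save_invoice(customer, items):
    """Insert the invoice and all of its items, or nothing at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "INSERT INTO Invoice (customer) VALUES (?)", (customer,)
            )
            conn.executemany(
                "INSERT INTO InvoiceItems (invoice_id, description, amount_cents) "
                "VALUES (?, ?, ?)",
                [(cur.lastrowid, desc, cents) for desc, cents in items],
            )
        return True
    except sqlite3.Error:
        return False
```

If any item fails (here, a non-positive amount stands in for the user killing the application halfway through), the invoice header inserted a moment earlier disappears with it.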

Related

Database design for a company: single or multi database? [closed]

Closed 9 years ago.
We're (re)designing a corporate information system. For the database design, we are exploring these options:
[Option 1->] a single CompanyBigDatabase that has everything,
[Option 2->] several databases across the company (say, HRD_DB, FinanceDB, MarketingDB), which are then synchronized through an application layer. EmployeeTable is owned by HRD; if Finance wants to refer to employees, it queries EmployeeTable from HRD_DB via a web service.
What is the best practice? What are the pros and cons? We want it to have high availability and to be reasonably reliable. Does Option 1 necessitate clustering-and-all for this? Do big companies and universities (like Toyota, Samsung, Stanford Uni, MIT, ...) always opt for Option 1?
I was looking in many DB textbooks but I could not find a sufficient explanation on this topic.
Any thoughts, tips, links, or advice is welcome. Thanks.
I have done this type of work for 20 years. Enterprise Architecting is one term used to describe this. If you are asking this question in a real enterprise scenario, I'm going to recommend you get advice. If it's a uni question, there are so many things to consider:
budget
politics
timeframes
legacy systems or green field,
Scope of Build
In house or Hosted
Complete Outsource of some or all of the functionality (SaaS)
....
Entire Methodologies are written to support projects that do this.
You can end up with many answers to the variables.
Even agreeing on how to weight features and outcomes is tough.
This is a HUGE question you could write a book on.
It's like a 2 paragraph question where I have seen 10 people spend a month putting a business case together to do X. That's just costing and planning the various options, without selection of the final approach.
So I have not directly answered your question... that, my friend, is a serious research project, not really a StackOverflow question.
There is no single answer. It depends on many other factors such as database load, application architecture, scalability, etc. My suggestion: start the simplest way possible (single database) and change it based on your needs.
A single database has its advantages: simpler joins, referential integrity, a single backup. Only separate pieces of data when you have a valid reason or need.
In my opinion, it would be more appropriate to have the database normalized and have several databases across the company based on departments. This would allow you to manage data more effectively in terms of storing, retrieving and updating information and providing access to users based on department type or user type. You can also provide different views of the database. It will be a lot easier to manage data.
There is a general principle of databases in particular, and computing in general, that there should be a single authoritative source for every data item.
Extending this to sets of data, as soon as you have multiple customer lists, multiple lists of items, multiple email addresses, you are soon into a quagmire of uncertainty that will then call for a business intelligence solution to resolve them all.
Now I'm a business intelligence chap by historical leaning, but I'd be the first to say that this is not a path that you want to go down simply because Marketing and Accounts cannot decide the definition of "customer". You do it because your well-normalised OLTP systems do not make it easy to count how many customers there were yesterday, last week, and last year.
Nor should they either, because then they would be in danger of sacrificing their true purpose -- to maintain a high performance, high-integrity persistent store of the "data universe" that your company exists in.
So in other words, the single database approach has data integrity on its side, and you really do not want to work in a company that does not have data integrity. As a Business Intelligence practitioner I can tell you that it is a horrible place.
On the other hand, you are going to have practical situations in which you simply must have separate systems due to application vendor constraints etc, and in that case the problem becomes one of keeping the data as tightly coupled as possible, and of Metadata Management (ugh) in which the company agrees what data in what location has what meaning.
Either will work, and other decisions will mostly affect your specification. To some extent your question could be described as "Should I go down the ERP path or the SaaS path?" I think it is telling that right now most systems are tending towards SaaS.
How will you be managing the applications? If they will be updated at different times, separate DBs make more sense (the SaaS path). On the other hand, having one DB to connect to, one authorisation system, one place to look for details, one place to back up, etc. appears to decrease complexity in the technical space. But it does not allow decisions affecting one part of the business to be considered separately from other parts of the business.
Once the business is involved, trying to find a single time at which every department agrees to an upgrade can be hell. Having a degree of abstraction so that you only have to get one department to align before updating its part of the stack has real advantages in coming years. And if your web services are robust and don't change with each release, this can be a much easier path.
Don't forget you can have views of data in other DBs.
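As a rough illustration of a view over data owned by another database, the sketch below uses SQLite's ATTACH DATABASE via Python's sqlite3 module. Two in-memory databases stand in for FinanceDB and HRD_DB from the question; the Expense table, the EmployeeDirectory view, and the column names are assumptions. Other DBMSs offer their own mechanisms (linked servers, database links, foreign data wrappers) with different syntax.

```python
import sqlite3

# The main connection stands in for FinanceDB; the attached one for HRD_DB.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS hrd")

conn.execute(
    "CREATE TABLE hrd.EmployeeTable "
    "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE Expense (employee_id INTEGER NOT NULL, amount_cents INTEGER NOT NULL)"
)

# In SQLite only a TEMP view may reference a table in another attached database.
# The view exposes id and name, but not salary.
conn.execute(
    "CREATE TEMP VIEW EmployeeDirectory AS SELECT id, name FROM hrd.EmployeeTable"
)

with conn:
    conn.execute(
        "INSERT INTO hrd.EmployeeTable (id, name, salary) VALUES (1, 'Ada', 9000)"
    )
    conn.executemany("INSERT INTO Expense VALUES (?, ?)", [(1, 1000), (1, 500)])

def expenses_by_employee():
    """Join Finance's own table against the view of HR's data."""
    return conn.execute(
        """
        SELECT d.name, SUM(e.amount_cents)
        FROM Expense AS e
        JOIN EmployeeDirectory AS d ON d.id = e.employee_id
        GROUP BY d.id, d.name
        ORDER BY d.name
        """
    ).fetchall()
```

HR remains the single place where employee rows are written, while Finance reads them through the view without keeping its own copy.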
And as for your question of how most big companies work: generally by a mishmash of big and little systems that sometimes talk to each other, sometimes don't, and often repeat data. Having said that, repeating data is a real problem; always have an authoritative source and copies (or even better, only one copy). A method I have seen work well in a number of enterprises is to have one place where details can be CRUDed (Created, Retrieved, Updated and Deleted) and numerous places where they can be read.
And really this design decision has little or nothing to do with availability and reliability. That tends to come from good design (simplicity, knowing where things live, etc.), good practices (good release practices, admin practices, backups, intelligent redundancy, etc.) and spending money. Not from having one or multiple systems.

Approaches to finding / controlling illegal data [closed]

Closed 10 years ago.
Search and destroy / capturing illegal data...
The Environment:
I manage a few very "open" databases. The type of access is usually full select/insert/update/delete. The mechanism for accessing the data is usually through linked tables (to SQL-Server) in custom-build MS Access databases.
The Rules
No social security numbers, etc. (e.g., think FERPA/HIPAA).
The Problem
Users enter / hide the illegal data in creative ways (e.g., ssn in the middle name field, etc.); administrative/disciplinary control is weak/ineffective. The general attitude (even from most of the bosses) is that security is a hassle, if you find a way around it then good for you, etc. I need a (better) way to find the data after it has been entered.
What I've Tried
Initially, I made modifications to the various custom-built user interfaces folks had (that I was aware of), all the way down to the table structures that they were linking to on our database server. The SSNs, for example, no longer had a field of their own, etc. And yet... I continue to find them buried in other data fields.
After a secret audit some folks at my institution did, where they found this buried data, I wrote some SQL that (literally) checks every character in every field in every table of the database, looking for anything that matches an SSN pattern. It takes a long time to run, and the users are finding ways around my pattern definitions.
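For readers who want a concrete picture of such a scan, here is a rough sketch. It is not the SQL described above: it uses Python's sqlite3 module against a SQLite database as a stand-in for SQL Server, and both the SSN_PATTERN regular expression and the find_ssn_like_values helper are assumptions for illustration. It also relies on SQLite's rowid, so it would not work on WITHOUT ROWID tables.

```python
import re
import sqlite3

# Nine digits grouped 3-2-4, with optional dash, dot or whitespace separators,
# and not embedded in a longer run of digits. This is an illustrative pattern.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}[-\s.]?\d{2}[-\s.]?\d{4}(?!\d)")

def find_ssn_like_values(conn):
    """Return (table, column, rowid) for every value that looks like an SSN."""
    hits = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            query = (
                f'SELECT rowid, "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL'
            )
            for rowid, value in conn.execute(query):
                if SSN_PATTERN.search(str(value)):
                    hits.append((table, column, rowid))
    return hits
```

Like any pattern-based scan, this reads every value in every table, so it is slow on large databases, and it will miss values that users disguise in ways the pattern does not anticipate.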
My Question
Of course, a real solution would entail policy enforcement. That has to be addressed (way) above my head, however, it is beyond the scope and authority of my position.
Are you aware of, or do you use, any (free or commercial) tools that are targeted at auditing for FERPA and HIPAA data (or, if not those policies specifically, then just data patterns in general)?
I'd like to find something that I can run on a schedule, and that stays updated with new pattern definitions.
I would monitor the users, in two ways.
The same users are likely to be entering the same data, so track who is getting around the roadblocks, and identify them. Ensure that they are documented as fouling the system, so that they are disciplined appropriately. Their efforts create risk (monetary and legal, which becomes monetary) for the entire organization.
Look at the queries that users issue. If they are successful in searching for the information, then it is somehow stored in the repository.
If you are unable to track users, begin instituting passwords.
In the long-run, though, your organization needs to upgrade its users.
In the end you are fighting an impossible battle unless you have support from management. If it's illegal to store an SSN in your DB, then this rule must have explicit support from the top. #Iterator is right, record who is entering this data and document their actions: implement an audit trail.
Search across the audit trail, not the database itself. This should be quicker, since you only have one day (or one hour, or ...) of data to search. Record and publish each violation.
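One possible shape for such an audit trail is sketched below with SQLite triggers through Python's sqlite3 module. Everything named here (the person and audit_log tables, the entered_by column supplied by the application, the SSN_PATTERN expression and the flagged_changes helper) is an assumption for illustration; on SQL Server the trigger syntax and the way the current user is obtained would differ.

```python
import re
import sqlite3

SSN_PATTERN = re.compile(r"(?<!\d)\d{3}[-\s.]?\d{2}[-\s.]?\d{4}(?!\d)")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        id          INTEGER PRIMARY KEY,
        middle_name TEXT,
        entered_by  TEXT NOT NULL      -- filled in by the application (assumption)
    );
    CREATE TABLE audit_log (
        id         INTEGER PRIMARY KEY,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        changed_by TEXT NOT NULL,
        row_id     INTEGER NOT NULL
    );
    CREATE TRIGGER person_audit_insert AFTER INSERT ON person
    BEGIN
        INSERT INTO audit_log (changed_by, row_id) VALUES (NEW.entered_by, NEW.id);
    END;
    CREATE TRIGGER person_audit_update AFTER UPDATE ON person
    BEGIN
        INSERT INTO audit_log (changed_by, row_id) VALUES (NEW.entered_by, NEW.id);
    END;
    """
)

def flagged_changes(after_audit_id=0):
    """Scan only rows touched after the given audit id; return (who, row_id)."""
    rows = conn.execute(
        """
        SELECT a.changed_by, p.id, p.middle_name
        FROM audit_log AS a
        JOIN person AS p ON p.id = a.row_id
        WHERE a.id > ?
        ORDER BY a.id
        """,
        (after_audit_id,),
    )
    return [(who, row_id) for who, row_id, value in rows
            if value and SSN_PATTERN.search(value)]
```

A scheduled job would remember the highest audit id it has already processed and pass it in on the next run, so each scan touches only recent changes and each hit comes with the name of whoever made it.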
You could tighten up some validation. No numeric field, I guess, needs to be as long as an SSN. No name field needs numbers in it. No address field needs more than 5 or 6 digits in it (how many houses are there on Route 66?). Hmmm, could a phone number be used to represent an SSN? Trouble is, you can't stop someone entering acaaabdf etc. (encoding 131126 etc.); there's always a way to defeat your checks.
You'll never achieve perfection, but you can at least catch the accidental offender.
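The validation rules suggested above could be sketched as CHECK constraints like the following. This uses SQLite through Python's sqlite3 module; the person table, its columns, and the accepts helper are hypothetical, and the GLOB operator is SQLite-specific (SQL Server would use LIKE with a character class instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id           INTEGER PRIMARY KEY,
        -- Name fields may not contain any digit.
        last_name    TEXT NOT NULL CHECK (last_name NOT GLOB '*[0-9]*'),
        middle_name  TEXT CHECK (middle_name IS NULL
                                 OR middle_name NOT GLOB '*[0-9]*'),
        -- Too short to hold a nine-digit number.
        house_number TEXT CHECK (house_number IS NULL
                                 OR length(house_number) <= 6)
    )
    """
)

def accepts(last_name, middle_name, house_number):
    """Return True if the database accepts the row, False if a CHECK rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO person (last_name, middle_name, house_number) "
                "VALUES (?, ?, ?)",
                (last_name, middle_name, house_number),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

As the answer says, this only catches the accidental offender: a letter-encoded value such as acaaabdf passes these checks untouched.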
One other suggestion: you can post a new question asking about machine learning plugins (essentially statistical pattern recognition) for your database of choice (MS Access). By flagging some of the database updates as good/bad, you may be able to leverage an automated tool to find the bad stuff and bring it to your attention.
This is akin to spam filters that find the bad stuff and remove it from your attention. However, to get good answers on this, you may need to provide a bit more details in the question, such as the # of samples you have (if it's not many, then a ML plugin would not be useful), your programming skills (for what's known as feature extraction), and so on.
Despite this suggestion, I believe it's better to target the user behavior than to build a smarter mousetrap.

How to learn database management system on own? [closed]

Closed 10 years ago.
I want to learn database management systems on my own. Could you suggest some tips?
Don't try what others have suggested to "learn by doing".
That is monkey-style learning.
Human intelligence/knowledge has gone far beyond the point where, for most specimens of the species, it is still possible/productive to "learn by making mistakes".
Trust me, if you just go down the "learn-exclusively-by-doing" path, the only thing you will end up doing is repeating the same mistakes that millions have already made before you.
The only thing suggested in the other answers here that I can second is "buy a good book". But "being a good book" does mean that it should give thorough coverage of the theory. Otherwise you are likely to end up building database applications like the Romans built their aqueducts: by overdoing certain things "just to be on the safe side".
(That is exactly what those Roman engineers did when they built those monuments that often still stand to this very day, in the absence of any knowledge/understanding of things such as gravity and the strength of concrete spans: they threw in a few dozen extra bricks, without really knowing precisely whether they were needed or not. But their world wasn't as economically competitive as ours.)
I'm not a big fan of learning without an actual application to which it's applied.
Books on Modeling and Design are all well and good, but all of those suggestions need to be placed in the context of an application.
I have seen my share of "horrible" data models that work fine for the application. While there's a purity in having a "good design", the simple truth is that not everything needs a "good design". Or, better said, a "good design" for one application may be completely different than one for another.
Many simple applications perform fine with "stupid", "dumb", "bad" designs.
A lot of learning happens from doing the wrong thing.
To paraphrase Thomas Edison, "Progress, I've made lots of progress. I know a thousand things that don't work."
A lot of things are easier to learn when they're being applied in the "real world" and then judged against the metric of whether or not they're working, versus simply being held up to something read in a book but never applied.
The benefit of "good design" is that it works well with the "Code has Momentum" meme, specifically that a bad design, once entrenched, is difficult to refactor or remove and replace with a good design, so you want to start with a good design up front.
That said, as a corollary, especially if you blindly follow many books on modeling and architecture, you end up with simple applications that are terribly over-engineered, with a lot of unnecessary code for circumstances that simply do not, and will not, exist in the application.
The game is finding the balance between "the perfect" solution, and the "workable solution".
So, read all the books you like, but also apply it to something of value to you, and then fix your errors as you grow. Not everyone should start at ground zero, you want to "stand on the shoulder of giants", but it's important to understand, also, the paths the giants took in the first place to better appreciate why, and in what situations, they advocate their choices.
1) choose a target database, Oracle, MySQL, MS SQL...
2) buy a good book,
3) participate in community discussions.
0) of course, setup an env to play around...
I too suggest you get a good book (or more, performance tuning is so complex a subject, it usually is covered in a separate book) on one of the major databases (whichever one you choose to learn first).
Subjects you need to learn are (at a minimum):
Querying, including the use of joins. If you do not thoroughly understand joins, you can't do anything except the simplest of queries, or you may get a result set that is incorrect. You can't learn too much about joins.
If data is to have meaning, you must understand how to tell if the results you are producing are the correct results.
Normalization - if you don't understand normalization, you cannot design an effective relational database. Learn also about primary and foreign keys. Do not ever design a database table without a way to uniquely identify a record.
Set theory - you have to learn to think in terms of manipulating groups of records not looping through records one at a time.
Performance - if you can't get results out in a timely manner, no one will use your software. Database design should consider performance carefully, as databases are not so easy to refactor when they perform badly, and there are many techniques that are known to be faster than other techniques that produce the same result. You should learn what these are and not use the poorly performing ones because you find them easier to understand.
Data integrity - if you can't rely on the data to be correct, you do not have a useful database. You should know how to ensure that data has the correct values and relationships to other data. This also includes using the correct data type (Store dates as a datetime or date data type for instance).
Security - including protecting against both SQL injection attacks and possible internal fraud.
Constraints and triggers and stored procs and user-defined functions.
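To make the point about joins and incorrect result sets concrete, here is a small sketch of one common trap: joining two independent child tables to the same parent multiplies the rows, so the sums come out inflated. The schema and data are invented for this example and run on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount INTEGER NOT NULL
    );
    CREATE TABLE payment (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount INTEGER NOT NULL
    );
    INSERT INTO customer (id, name) VALUES (1, 'Ann');
    INSERT INTO orders (customer_id, amount) VALUES (1, 100), (1, 50);
    INSERT INTO payment (customer_id, amount) VALUES (1, 30), (1, 20);
    """
)

# Each order row is paired with each payment row, so every amount is
# counted once per row of the other table.
FAN_OUT = """
    SELECT c.name, SUM(o.amount), SUM(p.amount)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.id
    JOIN payment AS p ON p.customer_id = c.id
    GROUP BY c.id, c.name
"""

# Aggregating each child table on its own avoids the multiplication.
PER_TABLE = """
    SELECT c.name,
           (SELECT SUM(o.amount) FROM orders AS o WHERE o.customer_id = c.id),
           (SELECT SUM(p.amount) FROM payment AS p WHERE p.customer_id = c.id)
    FROM customer AS c
"""

print(conn.execute(FAN_OUT).fetchall())    # inflated totals
print(conn.execute(PER_TABLE).fetchall())  # totals that match the data
```

Both queries run without error and return one tidy row, which is exactly why you need to know how to tell whether the results you are producing are the correct ones.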
Finally, do not use object-oriented thinking in database design. Relational databases often use tools and techniques that you would not use in object-oriented programming. It is a different subject and thus subject to different rules and constraints.
Get a good Modeling and Design book before you do anything else.
Try to do some simple applications using databases in your favorite programming language. Learn by doing. And when you get a problem, read the DBMS-documentation and learn.
http://video.stoimen.com/2009/04/12/zend-framework-zend_db-tutorial/ perhaps

What should every developer know about databases? [closed]

Closed 9 years ago.
Whether we like it or not, many if not most of us developers either regularly work with databases or may have to work with one someday. And considering the amount of misuse and abuse in the wild, and the volume of database-related questions that come up every day, it's fair to say that there are certain concepts that developers should know - even if they don't design or work with databases today.
What is one important concept that developers and other software professionals ought to know about databases?
The very first thing developers should know about databases is this: what are databases for? Not how do they work, nor how do you build one, nor even how do you write code to retrieve or update the data in a database. But what are they for?
Unfortunately, the answer to this one is a moving target. In the heyday of databases, the 1970s through the early 1990s, databases were for the sharing of data. If you were using a database, and you weren't sharing data, you were either involved in an academic project or you were wasting resources, including yourself. Setting up a database and taming a DBMS were such monumental tasks that the payback, in terms of data exploited multiple times, had to be huge to match the investment.
Over the last 15 years, databases have come to be used for storing the persistent data associated with just one application. Building a database for MySQL, or Access, or SQL Server has become so routine that databases have become almost a routine part of an ordinary application. Sometimes, that initial limited mission gets pushed upward by mission creep, as the real value of the data becomes apparent. Unfortunately, databases that were designed with a single purpose in mind often fail dramatically when they begin to be pushed into a role that's enterprise wide and mission critical.
The second thing developers need to learn about databases is the whole data centric view of the world. The data centric world view is more different from the process centric world view than anything most developers have ever learned. Compared to this gap, the gap between structured programming and object oriented programming is relatively small.
The third thing developers need to learn, at least in an overview, is data modeling, including conceptual data modeling, logical data modeling, and physical data modeling.
Conceptual data modeling is really requirements analysis from a data centric point of view.
Logical data modeling is generally the application of a specific data model to the requirements discovered in conceptual data modeling. The relational model is used far more than any other specific model, and developers need to learn the relational model for sure. Designing a powerful and relevant relational model for a nontrivial requirement is not a trivial task. You can't build good SQL tables if you misunderstand the relational model.
Physical data modeling is generally DBMS specific, and doesn't need to be learned in much detail, unless the developer is also the database builder or the DBA. What developers do need to understand is the extent to which physical database design can be separated from logical database design, and the extent to which producing a high speed database can be accomplished just by tweaking the physical design.
The next thing developers need to learn is that while speed (performance) is important, other measures of design goodness are even more important, such as the ability to revise and extend the scope of the database down the road, or simplicity of programming.
Finally, anybody who messes with databases needs to understand that the value of data often outlasts the system that captured it.
Whew!
Good question. The following are some thoughts in no particular order:
Normalization, to at least the second normal form, is essential.
Referential integrity is also essential, with proper cascading delete and update considerations.
Good and proper use of check constraints. Let the database do as much work as possible.
Don't scatter business logic in both the database and middle tier code. Pick one or the other, preferably in middle tier code.
Decide on a consistent approach for primary keys and clustered keys.
Don't over index. Choose your indexes wisely.
Consistent table and column naming. Pick a standard and stick to it.
Limit the number of columns in the database that will accept null values.
Don't get carried away with triggers. They have their use but can complicate things in a hurry.
Be careful with UDFs. They are great but can cause performance problems when you're not aware how often they might get called in a query.
Get Celko's book on database design. The man is arrogant but knows his stuff.
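As a small sketch of the referential integrity and check constraint points in the list above, the following uses SQLite through Python's sqlite3 module. The customer and orders tables and the helper functions are invented for illustration. Note that SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on for the connection; most other DBMSs enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off otherwise
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customer(id) ON DELETE CASCADE,
        total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
    );
    INSERT INTO customer (id, name) VALUES (1, 'Ann');
    """
)

def add_order(customer_id, total_cents):
    """Return True if the order was stored, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
                (customer_id, total_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def order_count():
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

An order for a customer that does not exist, or with a negative total, is refused by the database, and deleting a customer removes that customer's orders rather than leaving orphans. Whether cascading deletes are the right choice is a per-table decision, which is the "proper considerations" part of the advice.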
First, developers need to understand that there is something to know about databases. They're not just magic devices where you put in the SQL and get out result sets, but rather very complicated pieces of software with their own logic and quirks.
Second, that there are different database setups for different purposes. You do not want a developer making historical reports off an on-line transactional database if there's a data warehouse available.
Third, developers need to understand basic SQL, including joins.
Past this, it depends on how closely the developers are involved. I've worked in jobs where I was developer and de facto DBA, where the DBAs were just down the aisle, and where the DBAs are off in their own area. (I dislike the third.) Assuming the developers are involved in database design:
They need to understand basic normalization, at least the first three normal forms. Anything beyond that, get a DBA. For those with any experience with US courtrooms (and random television shows count here), there's the mnemonic "Depend on the key, the whole key, and nothing but the key, so help you Codd."
They need to have a clue about indexes, by which I mean they should have some idea what indexes they need and how they're likely to affect performance. This means not having useless indices, but not being afraid to add them to assist queries. Anything further (like the balance) should be left for the DBA.
They need to understand the need for data integrity, and be able to point to where they're verifying the data and what they're doing if they find problems. This doesn't have to be in the database (where it will be difficult to issue a meaningful error message for the user), but has to be somewhere.
They should have the basic knowledge of how to get a plan, and how to read it in general (at least enough to tell whether the algorithms are efficient or not).
They should know vaguely what a trigger is, what a view is, and that it's possible to partition pieces of databases. They don't need any sort of details, but they need to know to ask the DBA about these things.
They should of course know not to meddle with production data, or production code, or anything like that, and they should know that all source code goes into a VCS.
I've doubtless forgotten something, but the average developer need not be a DBA, provided there is a real DBA at hand.
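For the point about knowing how to get a plan and read it in general, here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. The orders table, the index name, and the plan helper are assumptions; other DBMSs have their own commands (for example EXPLAIN or a graphical plan viewer) and their own output formats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total_cents INTEGER NOT NULL)"
)

def plan(sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

QUERY = "SELECT id, total_cents FROM orders WHERE customer_id = ?"

# Without an index the only option is to read the whole table.
plan_without_index = plan(QUERY, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index in place the same query can seek directly to matching rows.
plan_with_index = plan(QUERY, (42,))

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan describes a scan of the table and the second a search using idx_orders_customer. Being able to spot that difference is roughly the level of plan reading the answer asks for.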
Basic Indexing
I'm always shocked to see a table or an entire database with no indexes, or arbitrary/useless indexes. Even if you're not designing the database and just have to write some queries, it's still vital to understand, at a minimum:
What's indexed in your database and what's not;
The difference between types of scans, how they're chosen, and how the way you write a query can influence that choice;
The concept of coverage (why you shouldn't just write SELECT *);
The difference between a clustered and non-clustered index;
Why more/bigger indexes are not necessarily better;
Why you should try to avoid wrapping filter columns in functions.
Designers should also be aware of common index anti-patterns, for example:
The Access anti-pattern (indexing every column, one by one)
The Catch-All anti-pattern (one massive index on all or most columns, apparently created under the mistaken impression that it would speed up every conceivable query involving any of those columns).
The quality of a database's indexing - and whether or not you take advantage of it with the queries you write - accounts for by far the most significant chunk of performance. 9 out of 10 questions posted on SO and other forums complaining about poor performance invariably turn out to be due to poor indexing or a non-sargable expression.
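A short sketch of what a non-sargable expression looks like in practice, using SQLite through Python's sqlite3 module. The orders table, its ISO-8601 text timestamps, and the index name are assumptions made for this example; the same effect appears in other DBMSs, with different plan output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, created_at TEXT NOT NULL, total_cents INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_orders_created ON orders (created_at)")
conn.executemany(
    "INSERT INTO orders (created_at, total_cents) VALUES (?, ?)",
    [
        ("2024-01-14 23:59:59", 300),
        ("2024-01-15 09:30:00", 500),
        ("2024-01-16 00:00:00", 700),
    ],
)

# Wrapping the filter column in a function hides it from the index.
NON_SARGABLE = (
    "SELECT id, total_cents FROM orders WHERE date(created_at) = '2024-01-15'"
)

# The same filter written as a range on the bare column can use the index.
SARGABLE = (
    "SELECT id, total_cents FROM orders "
    "WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16'"
)

def plan(sql):
    """Return the detail column of SQLite's query plan for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

print(plan(NON_SARGABLE))
print(plan(SARGABLE))
```

Both queries return the same row, but only the second one lets the planner search idx_orders_created; the first has to evaluate date() for every row in the table.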
Normalization
It always depresses me to see somebody struggling to write an excessively complicated query that would have been completely straightforward with a normalized design ("Show me total sales per region.").
If you understand this at the outset and design accordingly, you'll save yourself a lot of pain later. It's easy to denormalize for performance after you've normalized; it's not so easy to normalize a database that wasn't designed that way from the start.
At the very least, you should know what 3NF is and how to get there. With most transactional databases, this is a very good balance between making queries easy to write and maintaining good performance.
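As a rough illustration of the "total sales per region" example, here is a small sketch with Python's sqlite3 module. The three tables and their columns are hypothetical; the point is only that with a normalized layout the query is a couple of joins and a GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE stores (
        id INTEGER PRIMARY KEY,
        region_id INTEGER NOT NULL REFERENCES regions(id)
    );
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        store_id INTEGER NOT NULL REFERENCES stores(id),
        amount INTEGER NOT NULL
    );
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
    INSERT INTO stores VALUES (10, 1), (11, 1), (20, 2);
    INSERT INTO sales (store_id, amount) VALUES (10, 100), (11, 50), (20, 70);
""")

# Each fact lives in exactly one place, so the report is a plain join + aggregate.
totals = conn.execute("""
    SELECT r.name, SUM(s.amount)
    FROM sales s
    JOIN stores st ON st.id = s.store_id
    JOIN regions r ON r.id = st.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(totals)
```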
How Indexes Work
It's probably not the most important, but for sure the most underestimated topic.
The problem with indexing is that SQL tutorials usually don't mention them at all and that all the toy examples work without any index.
Even more experienced developers can write fairly good (and complex) SQL without knowing more about indexes than "An index makes the query fast".
That's because SQL databases do a very good job working as black-box:
Tell me what you need (gimme SQL), I'll take care of it.
And that works perfectly to retrieve the correct results. The author of the SQL doesn't need to know what the system is doing behind the scenes--until everything becomes sooo slooooow.....
That's when indexing becomes a topic. But that's usually very late and somebody (some company?) is already suffering from a real problem.
That's why I believe indexing is the No. 1 topic not to forget when working with databases. Unfortunately, it is very easy to forget it.
Disclaimer
The arguments are borrowed from the preface of my free eBook "Use The Index, Luke". I am spending quite a lot of my time explaining how indexes work and how to use them properly.
I just want to point out an observation: the majority of responses seem to assume that "database" is interchangeable with "relational database". There are also object databases and flat-file databases. It is important to assess the needs of the software project at hand. From a programmer's perspective, the database decision can be delayed until later. Data modeling, on the other hand, can be done early on and lead to much success.
I think data modeling is a key component and is a relatively old concept yet it is one that has been forgotten by many in the software industry. Data modeling, especially conceptual modeling, can reveal the functional behavior of a system and can be relied on as a road map for development.
On the other hand, the type of database required can be determined based on many different factors, including environment, user volume, and available local hardware such as hard drive space.
Avoiding SQL injection and how to secure your database
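A minimal sketch of the injection problem and the usual first defense, bound parameters, using Python's sqlite3 module. The table and the malicious input are made up for the example, and parameter binding is only one part of securing a database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])

malicious = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and changes the query's logic.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '" + malicious + "'"
).fetchall()

# Safer: the input is bound as a value and is only ever compared as a string.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The concatenated query returns every user, while the parameterized one returns nothing because no user has that literal name.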
Every developer should know that this is false: "Profiling a database operation is completely different from profiling code."
There is a clear Big-O in the traditional sense. When you do an EXPLAIN PLAN (or the equivalent) you're seeing the algorithm. Some algorithms involve nested loops and are O( n ^ 2 ). Other algorithms involve B-tree lookups and are O( n log n ).
This is very, very serious. It's central to understanding why indexes matter. It's central to understanding the speed-normalization-denormalization tradeoffs. It's central to understanding why a data warehouse uses a star-schema which is not normalized for transactional updates.
If you're unclear on the algorithm being used do the following. Stop. Explain the Query Execution plan. Adjust indexes accordingly.
Also, the corollary: More Indexes are Not Better.
Sometimes an index focused on one operation will slow other operations down. Depending on the ratio of the two operations, adding an index may have good effects, no overall impact, or be detrimental to overall performance.
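Here is a small sketch of the "stop, explain the plan, adjust indexes" loop using Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN. The orders table and index name are hypothetical, and other engines expose the same idea through their own EXPLAIN variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i) for i in range(1000)],
)

query = "SELECT total FROM orders WHERE customer_id = 42"

def plan(sql):
    """Return the list of plan steps SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index the engine has to visit every row of the table.
before = plan(query)

# After adding an index on the filter column, it can seek directly.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)

print(before)
print(after)
```

The same query text moves from a full scan to an index search, which is the change in algorithm that the plan makes visible. Whether the extra index is worth its write cost depends on the workload, as noted above.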
I think every developer should understand that databases require a different paradigm.
When writing a query to get at your data, a set-based approach is needed. Many people with an iterative background struggle with this. And yet, when they embrace it, they can achieve far better results, even though the solution may not be the one that first presented itself in their iteratively focused minds.
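To show the difference in mindset, here is a sketch with Python's sqlite3 module that applies the same price change twice: once row by row, once as a single set-based statement. The products table is invented for the example.

```python
import sqlite3

def make_db():
    """Build a fresh in-memory database with a few sample products."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products (category, price) VALUES (?, ?)",
        [("book", 100), ("book", 200), ("toy", 300)],
    )
    return conn

def prices(conn):
    return conn.execute("SELECT price FROM products ORDER BY id").fetchall()

# Iterative habit: fetch rows, loop, and issue one UPDATE per row.
iterative = make_db()
rows = iterative.execute(
    "SELECT id, price FROM products WHERE category = 'book'"
).fetchall()
for row_id, price in rows:
    iterative.execute("UPDATE products SET price = ? WHERE id = ?", (price + 10, row_id))

# Set-based: describe the whole change in one statement.
set_based = make_db()
set_based.execute("UPDATE products SET price = price + 10 WHERE category = 'book'")

print(prices(iterative) == prices(set_based))
```

Both versions end in the same state, but the set-based one is a single statement the engine can plan as a whole instead of one round trip per row.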
Excellent question. Let's see: first, no one should consider querying a database who does not thoroughly understand joins. That's like driving a car without knowing where the steering wheel and brakes are. You also need to know datatypes and how to choose the best one.
Another thing that developers should understand is that there are three things you should have in mind when designing a database:
Data integrity - if the data can't be relied on you essentially have no data - this means do not put required logic in the application as many other sources may touch the database. Constraints, foreign keys and sometimes triggers are necessary to data integrity. Don't fail to use them because you don't like them or don't want to be bothered to understand them.
Performance - it is very hard to refactor a poorly performing database and performance should be considered from the start. There are many ways to do the same query and some are known to be faster almost always, it is short-sighted not to learn and use these ways. Read some books on performance tuning before designing queries or database structures.
Security - this data is the life-blood of your company, it also frequently contains personal information that can be stolen. Learn to protect your data from SQL injection attacks and fraud and identity theft.
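The data integrity point can be sketched with Python's sqlite3 module: the constraints below live in the schema, so they apply to every writer, not just one application. Table and column names are hypothetical, and note that SQLite in particular needs foreign key enforcement switched on per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    );
    INSERT INTO customers VALUES (1, 'a@example.com');
""")

def try_insert(sql):
    """Return the error message if the database rejects the row, else None."""
    try:
        conn.execute(sql)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

# Order for a customer that does not exist: rejected by the foreign key.
orphan = try_insert("INSERT INTO orders (customer_id, quantity) VALUES (999, 1)")

# Negative quantity: rejected by the CHECK constraint.
negative = try_insert("INSERT INTO orders (customer_id, quantity) VALUES (1, -5)")

# Well-formed row: accepted.
valid = try_insert("INSERT INTO orders (customer_id, quantity) VALUES (1, 2)")

print(orphan, negative, valid, sep="\n")
```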
When querying a database, it is easy to get the wrong answer. Make sure you understand your data model thoroughly. Remember that actual decisions are often made based on the data your query returns. When it is wrong, the wrong business decisions are made. You can kill a company with bad queries or lose a big customer. Data has meaning; developers often seem to forget that.
Data almost never goes away, think in terms of storing data over time instead of just how to get it in today. That database that worked fine when it had a hundred thousand records, may not be so nice in ten years. Applications rarely last as long as data. This is one reason why designing for performance is critical.
Your database will probably need fields that the application doesn't need to see: things like GUIDs for replication, date-inserted fields, etc. You may also need to store a history of changes and who made them and when, and be able to restore bad changes from this storehouse. Think about how you intend to do this before you come ask a web site how to fix the problem where you forgot to put a WHERE clause on an update and updated the whole table.
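One possible way to keep such a history is an audit table filled by a trigger. The sketch below uses Python's sqlite3 module; the accounts and accounts_history tables are invented for illustration, and a real design would also record who made the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE accounts_history (
        account_id INTEGER,
        old_status TEXT,
        new_status TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER accounts_audit AFTER UPDATE OF status ON accounts
    BEGIN
        INSERT INTO accounts_history (account_id, old_status, new_status)
        VALUES (OLD.id, OLD.status, NEW.status);
    END;
    INSERT INTO accounts VALUES (1, 'active'), (2, 'active'), (3, 'closed');
""")

# The classic accident: an UPDATE with no WHERE clause touches every row.
conn.execute("UPDATE accounts SET status = 'suspended'")

# Because the old values were recorded, each row can be put back
# to the most recent old_status stored for it.
conn.execute("""
    UPDATE accounts
    SET status = (SELECT h.old_status FROM accounts_history h
                  WHERE h.account_id = accounts.id
                  ORDER BY h.rowid DESC LIMIT 1)
""")

restored = conn.execute("SELECT id, status FROM accounts ORDER BY id").fetchall()
history_rows = conn.execute("SELECT COUNT(*) FROM accounts_history").fetchone()[0]
print(restored, history_rows)
```

The restore is itself an update, so it is logged too. That is usually what you want from an audit trail.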
Never develop in a newer version of a database than the production version. Never, never, never develop directly against a production database.
If you don't have a database administrator, make sure someone is making backups and knows how to restore them and has tested restoring them.
Database code is code, there is no excuse for not keeping it in source control just like the rest of your code.
Evolutionary Database Design. http://martinfowler.com/articles/evodb.html
These agile methodologies make database change process manageable, predictable and testable.
Developers should know what it takes to refactor a production database in terms of version control, continuous integration and automated testing.
The Evolutionary Database Design process has administrative aspects; for example, a column is to be dropped after some lifetime period in all databases of this codebase.
At the very least, know that the Database Refactoring concept and methodologies exist.
http://www.agiledata.org/essays/databaseRefactoringCatalog.html
Classification and process description makes it possible to implement tooling for these refactorings too.
About the following comment to Walter M.'s answer:
"Very well written! And the historical perspective is great for people who weren't doing database work at that time (i.e. me)".
The historical perspective is in a certain sense absolutely crucial. "Those who forget history, are doomed to repeat it.". Cfr XML repeating the hierarchical mistakes of the past, graph databases repeating the network mistakes of the past, OO systems forcing the hierarchical model upon users while everybody with even just a tenth of a brain should know that the hierarchical model is not suitable for general-purpose representation of the real world, etcetera, etcetera.
As for the question itself:
Every database developer should know that "Relational" is not equal to "SQL". Then they would understand why they are being let down so abysmally by the DBMS vendors, and why they should be telling those same vendors to come up with better stuff (e.g. DBMS's that are truly relational) if they want to go on sucking hilarious amounts of money out of their customers for such crappy software.
And every database developer should know everything about the relational algebra. Then there would no longer be a single developer left who had to post these stupid "I don't know how to do my job and want someone else to do it for me" questions on Stack Overflow anymore.
From my experience with relational databases, every developer should know:
- The different data types:
Using the correct type for the correct job will make your DB design more robust, your queries faster and your life easier.
- Learn about 1xM and MxM:
This is the bread and butter of relational databases. You need to understand one-to-many and many-to-many relations and apply them when appropriate.
- "K.I.S.S." principle applies to the DB as well:
Simplicity always works best. Provided you have studied how DBs work, you will avoid unnecessary complexity, which leads to maintenance and speed problems.
- Indices:
It's not enough to know what they are. You need to understand when to use them and when not to.
also:
Boolean algebra is your friend
Images: Don't store them on the DB. Don't ask why.
Test DELETE with SELECT
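"Test DELETE with SELECT" can be sketched like this with Python's sqlite3 module: run a SELECT with exactly the WHERE clause you plan to delete with, check the result, then delete. The sessions table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany("INSERT INTO sessions (expired) VALUES (?)", [(1,), (0,), (1,), (0,)])

# A fixed condition written by the developer (not user input), reused verbatim
# so the preview and the DELETE cannot drift apart.
condition = "expired = 1"

# Step 1: preview how many rows the condition matches.
to_delete = conn.execute(
    "SELECT COUNT(*) FROM sessions WHERE " + condition
).fetchone()[0]

# Step 2: delete only after the preview matches expectations.
deleted = conn.execute("DELETE FROM sessions WHERE " + condition).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]

print(to_delete, deleted, remaining)
```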
I would like everyone, both DBAs and developer/designer/architects, to better understand how to properly model a business domain, and how to map/translate that business domain model into both a normalized database logical model, an optimized physical model, and an appropriate object oriented class model, each one of which is (can be) different, for various reasons, and understand when, why, and how they are (or should be) different from one another.
I would say strong basic SQL skills. I've seen a lot of developers so far who know a little about databases but are always asking for tips about how to formulate a quite simple query. Queries are not always that easy and simple. You do have to use multiple joins (inner, left, etc.) when querying a well normalized database.
I think a lot of the technical details have been covered here and I don't want to add to them. The one thing I want to say is more social than technical, don't fall for the "DBA knowing the best" trap as an application developer.
If you are having performance issues with query take ownership of the problem too. Do your own research and push for the DBAs to explain what's happening and how their solutions are addressing the problem.
Come up with your own suggestions too after you have done the research. That is, I try to find a cooperative solution to the problem rather than leaving database issues to the DBAs.
Simple respect.
It's not just a repository
You probably don't know better than the vendor or the DBAs
You won't support it at 3 a.m. with senior managers shouting at you
Consider Denormalization as a possible angel, not the devil, and also consider NoSQL databases as an alternative to relational databases.
Also, I think the Entity-Relationship model is a must-know for every developer, even if you don't design databases. It'll let you thoroughly understand what your database is all about.
Never insert data with the wrong text encoding.
Once your database becomes polluted with multiple encodings, the best you can do is apply some kind of combination of heuristics and manual labor.
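A tiny Python sketch of how this kind of pollution arises: text written as UTF-8 bytes but later read as Latin-1. The sample string is arbitrary.

```python
original = "café"

# Written as UTF-8 bytes but read back as Latin-1: classic mojibake.
stored_bytes = original.encode("utf-8")
misread = stored_bytes.decode("latin-1")

# The damage is reversible only while you know which wrong decoding was applied.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)
print(repaired)
```

When rows with different mistakes are mixed in the same column, there is no single round trip like this that fixes all of them, which is where the heuristics and manual labor come in.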
Aside from syntax and conceptual options they employ (such as joins, triggers, and stored procedures), one thing that will be critical for every developer employing a database is this:
Know how your engine is going to perform the query you are writing with specificity.
The reason I think this is so important is simply production stability. You should know how your code performs so you're not stopping all execution in your thread while you wait for a long function to complete, so why would you not want to know how your query will affect the database, your program, and perhaps even the server?
This is actually something that has hit my R&D team more times than missing semicolons or the like. The presumption is that the query will execute quickly because it does on their development system with only a few thousand rows in the tables. Even if the production database is the same size, it is more than likely going to be used a lot more, and thus suffer from other constraints like multiple users accessing it at the same time, or something going wrong with another query elsewhere, thus delaying the result of this query.
Even simple things like how joins affect performance of a query are invaluable in production. There are many features of many database engines that make things easier conceptually, but may introduce gotchas in performance if not thought of clearly.
Know your database engine execution process and plan for it.
For a middle-of-the-road professional developer who uses databases a lot (writing/maintaining queries daily or almost daily), I think the expectation should be the same as any other field: You wrote one in college.
Every C++ geek wrote a string class in college. Every graphics geek wrote a raytracer in college. Every web geek wrote interactive websites (usually before we had "web frameworks") in college. Every hardware nerd (and even software nerds) built a CPU in college. Every physician dissected an entire cadaver in college, even if she's only going to take my blood pressure and tell me my cholesterol is too high today. Why would databases be any different?
Unfortunately, they do seem different, today, for some reason. People want .NET programmers to know how strings work in C, but the internals of your RDBMS shouldn't concern you too much.
It's virtually impossible to get the same level of understanding from just reading about them, or even working your way down from the top. But if you start at the bottom and understand each piece, then it's relatively easy to figure out the specifics for your database. Even things that lots of database geeks can't seem to grok, like when to use a non-relational database.
Maybe that's a bit strict, especially if you didn't study computer science in college. I'll tone it down some: You could write one today, completely, from scratch. I don't care if you know the specifics of how the PostgreSQL query optimizer works, but if you know enough to write one yourself, it probably won't be too different from what they did. And you know, it's really not that hard to write a basic one.
The order of columns in a non-unique index is important.
The first column should be the column that has the most variability in its content (i.e. cardinality).
This is to aid SQL Server's ability to create useful statistics on how to use the index at runtime.
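The claim above is specific to SQL Server statistics, which this sketch does not try to verify. It only illustrates a related, more general reason column order matters: a composite index can be searched by its leading column, but generally not by a trailing column alone. It uses Python's sqlite3 module with an invented events table; other engines may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX idx_events_user_kind ON events (user_id, kind)")

def plan(sql):
    """Return the list of plan steps SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Filter on the leading index column: the index supports a search.
leading = plan("SELECT payload FROM events WHERE user_id = 7")

# Filter only on the second index column: the index is not searchable this way.
trailing_only = plan("SELECT payload FROM events WHERE kind = 'login'")

print(leading)
print(trailing_only)
```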
Understand the tools that you use to program the database!!!
I wasted so much time trying to understand why my code was mysteriously failing.
If you're using .NET, for example, you need to know how to properly use the objects in the System.Data.SqlClient namespace. You need to know how to manage your SqlConnection objects to make sure they are opened, closed, and when necessary, disposed properly.
You need to know that when you use a SqlDataReader, it is necessary to close it separately from your SqlConnection. You need to understand how to keep connections open when appropriate and how to minimize the number of hits to the database (because they are relatively expensive in terms of computing time).
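The same discipline applies outside .NET. As a rough analogue (not the SqlClient API), here is how deterministic cleanup of a connection and a cursor might look with Python's sqlite3 module; count_rows and the table t are made up for the example.

```python
import sqlite3
from contextlib import closing

def count_rows(db_path):
    """Open a connection, run a few statements, and always release resources."""
    # closing() guarantees close() runs even if a query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        # The cursor is closed separately from the connection.
        with closing(conn.cursor()) as cur:
            cur.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
            cur.execute("INSERT INTO t VALUES (1)")
            cur.execute("SELECT COUNT(*) FROM t")
            return cur.fetchone()[0]

print(count_rows(":memory:"))
```

Note that in sqlite3, using the connection itself in a with statement only manages the transaction and does not close the connection, which is exactly the kind of tool-specific detail worth knowing.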
Basic SQL skills.
Indexing.
Deal with different incarnations of DATE/ TIME/ TIMESTAMP.
JDBC driver documentation for the platform you are using.
Deal with binary data types (CLOB, BLOB, etc.)
For some projects, an Object-Oriented model is better.
For other projects, a Relational model is better.
The impedance mismatch problem, and the common deficiencies of ORMs.
RDBMS Compatibility
Look if it is needed to run the application in more than one RDBMS. If yes, it might be necessary to:
avoid RDBMS SQL extensions
eliminate triggers and stored procedures
follow strict SQL standards
convert field data types
change transaction isolation levels
Otherwise, these questions should be treated separately and different versions (or configurations) of the application would be developed.
Don't depend on the order of rows returned by an SQL query.
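A short sketch with Python's sqlite3 module (the tasks table is invented): without ORDER BY the result order is whatever the engine happens to produce, so the code below makes no assumption about it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, priority INTEGER)")
conn.executemany("INSERT INTO tasks (priority) VALUES (?)", [(3,), (1,), (2,)])

# No ORDER BY: whatever order comes back is an implementation detail.
unordered = conn.execute("SELECT priority FROM tasks").fetchall()

# ORDER BY: the only order the query actually promises.
ordered = conn.execute("SELECT priority FROM tasks ORDER BY priority").fetchall()

print(ordered)
```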
Three (things) is the magic number:
Your database needs version control too.
Cursors are slow and you probably don't need them.
Triggers are evil*
*almost always

Business Logic in Database versus Code? [closed]

Closed 10 years ago.
As a software engineer, I have a strong bias towards writing business logic in the application layer, while typically relying on the database for little more than CRUD (Create Retrieve Update and Delete) operations. On the other hand, I have run across applications (typically older ones) where a large amount of the business logic was written in stored procedures, so there are people out there that prefer to write business logic in the database layer.
For the people that have and/or enjoy written/writing business logic in a stored procedure, what were/are your reasons for using this method?
I try to seriously limit my business logic in the DB to only procs that have to do a lot of querying and updating to perform a single application operation. Some may argue that even that should be in the app, but I like to keep the IO down if I can.
Databases are great for CRUD but if they get bloated with logic:
It becomes confusing where the logic is,
Typically databases are a silo and do not scale horizontally nearly as well as the app servers.
T-SQL/PL/SQL is hard to read and procedural in nature
You forfeit all of the benefits of OOAD.
To the maximum extent possible, keep your business logic in the environment that is the most testable and debuggable. There are some valid reasons for storing business logic in the database in other people's existing answers, but they are almost always far outweighed by this.
Limiting the business logic to the application layer is short-sighted at best. Experienced professional database designers rarely allow it on their systems. Databases need to have constraints and triggers and stored procs to help define how the data from any source will go into them.
If the database is to maintain its integrity and to ensure that all sources of new data or data changes follow the rules, the database is the place to put the required logic. Putting it in the application layer is a data nightmare waiting to happen. Databases do not get information just from one application. Business logic in the application is often unintentionally bypassed by imports (assume you got a new customer who wanted their old historical data imported to your system, or a large number of target records; no one is going to enter a million possible targets through the interface, it will happen in an import). It is also bypassed by changes made through the query window to fix one-time issues (things like increasing the price of all products by 10%). If you have application layer logic that should have been applied to the data change, it won't be. Now, it's OK to put it in the application layer as well (no sense sending bad data to the database and wasting network bandwidth), but failing to put it in the database will sooner or later cause data problems.
Another reason to keep all of this in the database has to do with the possibility of users committing fraud. If you put all your logic in the application layer, then you must grant the users access directly to the tables. If you encapsulate all your logic in stored procs, they can be limited to doing only what the stored procs allow and not anything else. I would not consider allowing any kind of access by users to a database that stores financial records or personal information (such as health records), as I would not allow anyone except a couple of DBAs to directly access the production records in any way, shape or form. More fraud is committed than many developers realize, and almost none of them consider the possibility in their design.
If you need to import a large amount of data, going through a data access layer could slow the import down to a crawl because it doesn't take advantage of the set-based operations that databases are designed to handle.
Your usage of the term "business logic" is rather vague.
It can be interpreted to mean to include the enforcement of constraints on the data (aka 'business rules'). Enforcement of these unequivocally belongs in the dbms, period.
It can also be interpreted to mean to include things like "if a new customer arrives, then within a week we send him a welcome letter." Trying to push stuff like this in the data layer is probably a big mistake. In such cases, the driver for "create a new welcome letter" should probably be the application that also triggers the new customer row insertion. Imagine every new database row insertion triggering a new welcome letter, and then suddenly we take over another company and we must integrate that company's customers in our own database ... Ouch.
We do a lot of processing in the DB tier, where appropriate. There's a lot of operations you wouldn't want to pull back large datasets to the app tier to do analysis on. It's also an easier deployment for us -- a single point vs. updating applications at all install points. But a lot depends on your application and what it does; there's no single good answer here.
On a couple of occasions I have put 'logic' in sprocs because the CRUD might be happening in more than one place. By 'logic' I would have to say it is not really business logic but more 'integrity logic'. It might be the same - some cleanup might be necessary if something gets deleted or updated in a certain way, and if that delete or update could happen from more than one tool with different code-bases it made sense to put it in the proc they all used.
In addition, sometimes the 'business logic line' is pretty blurry. Take reports for example - they may rely on stored procedures or views that encapsulate 'smarts' about what the schema means to the business. How often have you seen CASE statements and the like that 'do things' based on column values or other criteria? Could be construed as business logic and yet it probably does belong in the DB where it can be optimized, etc.
I'd say if 'business-logic' means application flow, user control, timed operations and generally 'doing-business-stuff' then it should be in the application layer. But if it means making sure that no matter how you dig around in the data, it always makes sense and is a sensible, non-self-conflicting whole, then the checks to enforce those rules go in the DB, absolutely, no questions. There are always many ways to push data into the DB and manipulate it once it's there. Not all those ways have 'business-logic' built into them. You will find a SQL session into a DB through a DOS window on a support call at 3am is very liberal in what it allows for example! If the logic isn't in the DB to make sure that ALL data changes make sense, you can bet for sure that the data will get very, very screwed up over time. And since a system is only as valuable as the data it holds, that makes for a much lower return on investment.
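As a sketch of this kind of 'integrity logic' living in the database, the example below uses a declarative ON DELETE CASCADE rule rather than a stored procedure (SQLite, used here through Python's sqlite3 module, has no stored procedures). The invoices tables are hypothetical; the point is that the cleanup happens no matter which tool issues the delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY);
    CREATE TABLE invoice_lines (
        id INTEGER PRIMARY KEY,
        invoice_id INTEGER NOT NULL REFERENCES invoices(id) ON DELETE CASCADE
    );
    INSERT INTO invoices VALUES (1), (2);
    INSERT INTO invoice_lines (invoice_id) VALUES (1), (1), (2);
""")

# Any client that deletes an invoice gets the dependent rows cleaned up too.
conn.execute("DELETE FROM invoices WHERE id = 1")
lines_left = conn.execute("SELECT invoice_id FROM invoice_lines").fetchall()
print(lines_left)
```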
Two good reasons for putting the business logic in the database are:
It secures your logic and data against additional applications that may access the database that don't implement similar logic.
Database designs usually outlive the application layer and it reduces the work necessary when you move to new technologies on the client side.
You often find business logic at the database layer because it can often be faster to make a change and deploy. I think often the best intentions are not to put the logic there but because of the ease of deployment it ends up there.
The primary reason I would put BL in stored procs in the past is that transactions were easier in the database.
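For comparison, here is roughly what an application-managed transaction looks like today, sketched with Python's sqlite3 module. The accounts table and the transfer function are invented, and this is not a claim about how things worked in the tools the answer refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

def transfer(amount):
    """Move money from account 1 to account 2 atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 2", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 1", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

first = transfer(30)    # fits within the balance
second = transfer(500)  # would overdraw account 1, so both updates are undone
balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(first, second, balances)
```

The second transfer credits account 2 before the debit fails the CHECK constraint, and the rollback removes that credit again.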
If deployments are difficult for your app and you don't have an app-server, changing the BL in stored procedures is the most effective way to deploy a change.
I work for a financial type company where certain rules are applied by states, and these rules and their calculations are subject to change almost daily if not surely weekly. That being the case, it made more sense to move parts of the logic dealing with calculations to the database, where a change can be tested and applied without having to recompile and redistribute an application, which is impossible to do daily without disrupting business. The stored proc is tested, approved, applied and the end user is none the wiser.
With the move to web based applications, the reliance on moving the logic to the database is less but still present. Even web apps (depending on the language) must be compiled and published to the site which could cause downtime.
Sometimes business logic is too slow to run on the app layer. This is especially true on older systems where client power and bandwidth were more limited.
The main reason for using the database to do the work is that you have a single point of control. Often, app developers re-use or rewrite code fragments in different parts of the application. Even assuming that these all work exactly the same way (which is doubtful), when the business logic changes, the app needs to be reviewed, recoded, recompiled. Unless the parameters change, this would not be necessary where the business logic is stored only in the database.
My preference is to keep any complicated business logic out of the database, simply for maintenance purposes. If I get a call at 2 o'clock in the morning I would rather debug my application code than try to step through database scripts.
I'm on a team that builds and maintains a rather large financial system, and I find no way to put the logic into the application layer for actions that affect, or take constraints from, tens of thousands of records.
Besides the performance issue, should errors happen, rectifying a stored procedure is much faster than debugging the application, fixing, recompiling, and redeploying the code, with longer downtime.
I think that, especially for the older applications I work on (banking), where the business logic is huge, it's next to impossible to perform all of this business logic in the application layer. It's also a big performance hit when we put this logic in the application layer: the number of fetches from the database goes up, which results in more resource utilization (more Java objects if it's done in the Java layer) and network issues, and you can forget about performance.

Resources