How to learn database management system on own? [closed] - database

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I want to learn database management systems on my own. Would you suggest some tips?

Don't try what others have suggested to "learn by doing".
That is monkey-style learning.
Human intelligence/knowledge has gone far beyond the point where, for most members of the species, it is still possible/productive to "learn by making mistakes".
Trust me, if you just go down the "learn-exclusively-by-doing" path, the only thing you will end up doing is repeating the same mistakes that millions have already made before you.
The only thing suggested in the other answers here that I can second is "buy a good book". But being a good book does mean that it should give thorough coverage of the theory. Otherwise you are likely to end up building database applications like the Romans built their aqueducts: by overdoing certain things "just to be on the safe side".
(That is exactly what those Roman engineers did when they built those monuments that often still stand to this very day, in the absence of any knowledge/understanding of things such as gravity and the strength of concrete spans: they threw in a few dozen extra bricks, without really knowing precisely whether they were needed or not. But their world wasn't as economically competitive as ours.)

I'm not a big fan of learning without an actual application to which it's applied.
Books on Modeling and Design are all well and good, but all of those suggestions need to be placed in the context of an application.
I have seen my share of "horrible" data models that work fine for the application. While there's a purity in having a "good design", the simple truth is that not everything needs a "good design". Or, better said, a "good design" for one application may be completely different than one for another.
Many simple applications perform fine with "stupid", "dumb", "bad" designs.
A lot of learning happens from doing the wrong thing.
To paraphrase Thomas Edison, "Progress, I've made lots of progress. I know a thousand things that don't work."
A lot of things are easier to learn when they're applied in the "real world" and judged by whether or not they actually work, rather than simply held up against something read in a book but never applied.
The benefit of "good design" is that it works well with the "Code has Momentum" meme, specifically that a bad design, once entrenched, is difficult to refactor or remove and replace with a good design, so you want to start with a good design up front.
That said, as a corollary, especially if you blindly follow many books on modeling and architecture, you end up with simple applications that are terribly over-engineered, with a lot of unnecessary code for circumstances that simply do not, and will not, exist in the application.
The game is finding the balance between "the perfect" solution, and the "workable solution".
So, read all the books you like, but also apply it to something of value to you, and then fix your errors as you grow. Not everyone should start at ground zero; you want to "stand on the shoulders of giants", but it's important to also understand the paths the giants took in the first place, to better appreciate why, and in what situations, they advocate their choices.

1) choose a target database, Oracle, MySQL, MS SQL...
2) buy a good book,
3) participate in community discussions.
0) of course, set up an environment to play around in...

I too suggest you get a good book (or more, performance tuning is so complex a subject, it usually is covered in a separate book) on one of the major databases (whichever one you choose to learn first).
Subjects you need to learn are (at a minimum):
Querying, including the use of joins (if you do not thoroughly understand joins, you can't do anything except the simplest of queries, or you may get a result set that is incorrect). You can't learn too much about joins.
If data is to have meaning, you must understand how to tell if the results you are producing are the correct results.
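As a minimal illustration (using Python's built-in sqlite3, with invented table names), here is how the choice of join type silently changes which rows you get back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.0);  -- only Ada has placed an order
""")

# INNER JOIN: customers with no matching order silently disappear.
inner = conn.execute("""
    SELECT c.name, o.total FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchall()

# LEFT JOIN: every customer appears; a missing order comes back as NULL.
left = conn.execute("""
    SELECT c.name, o.total FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
""").fetchall()

print(inner)  # [('Ada', 99.0)]
print(left)   # [('Ada', 99.0), ('Grace', None)]
```

Picking the wrong join type here is exactly the kind of mistake that produces a result set that looks plausible but is incorrect.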
Normalization - if you don't understand normalization, you cannot design an effective relational database. Learn also about primary and foreign keys. Do not ever design a database table without a way to uniquely identify a record.
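A small sketch of that advice with sqlite3 (schema invented for illustration; note that SQLite only enforces foreign keys once the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE departments (
        id   INTEGER PRIMARY KEY,      -- every row uniquely identifiable
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES departments(id)  -- foreign key
    );
    INSERT INTO departments (id, name) VALUES (1, 'HR');
""")

conn.execute("INSERT INTO employees VALUES (1, 'Ada', 1)")  # valid reference
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Bob', 99)")  # no such department
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)  # True
```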
Set theory - you have to learn to think in terms of manipulating groups of records not looping through records one at a time.
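For example, in sqlite3 (hypothetical table), one set-based UPDATE replaces an entire row-at-a-time loop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(i, 100.0) for i in range(1, 1001)])

# Set-based thinking: one statement touches every qualifying row at once,
# instead of fetching rows and updating them one by one in application code.
conn.execute("UPDATE accounts SET balance = balance * 1.05 WHERE balance >= 100")

total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
print(round(total, 2))  # 105000.0
```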
Performance - if you can't get results out in a timely manner, no one will use your software. Database design should consider performance carefully, as databases are not so easy to refactor when they perform badly, and there are many techniques that are known to be faster than other techniques that produce the same result. You should learn what these are and not use the poorly performing ones just because you find them easier to understand.
Data integrity - if you can't rely on the data to be correct, you do not have a useful database. You should know how to ensure that data has the correct values and relationships to other data. This also includes using the correct data type (Store dates as a datetime or date data type for instance).
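A sketch of letting the database itself enforce correct values, using sqlite3 CHECK constraints (invented schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoices (
        id       INTEGER PRIMARY KEY,
        amount   REAL NOT NULL CHECK (amount > 0),                 -- no zero/negative invoices
        due_date TEXT NOT NULL CHECK (due_date IS date(due_date))  -- valid ISO dates only
    )
""")

conn.execute("INSERT INTO invoices VALUES (1, 250.0, '2024-01-31')")  # fine

try:
    conn.execute("INSERT INTO invoices VALUES (2, -5.0, '2024-01-31')")
    bad_amount_rejected = False
except sqlite3.IntegrityError:
    bad_amount_rejected = True

try:
    conn.execute("INSERT INTO invoices VALUES (3, 10.0, '31/01/2024')")
    bad_date_rejected = False
except sqlite3.IntegrityError:
    bad_date_rejected = True

print(bad_amount_rejected, bad_date_rejected)  # True True
```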
Security - including protecting against both SQL injection attacks and possible internal fraud.
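To illustrate the SQL injection point with sqlite3 (toy data): string concatenation lets input rewrite the query, while a bound parameter is always treated as data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, is_admin INTEGER);
    INSERT INTO users VALUES ('alice', 0), ('admin', 1);
""")

attacker_input = "' OR '1'='1"

# Unsafe: the input becomes part of the SQL text and changes its meaning.
unsafe = conn.execute(
    "SELECT name FROM users WHERE name = '" + attacker_input + "'"
).fetchall()

# Safe: a bound parameter is treated as a value, never as SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (attacker_input,)
).fetchall()

print(unsafe)  # every row leaks: [('alice',), ('admin',)]
print(safe)    # [] - no user is literally named "' OR '1'='1"
```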
Constraints and triggers and stored procs and user-defined functions.
Finally, do not use object-oriented thinking in database design. Relational databases often use tools and techniques that you would not use in object-oriented programming. It is a different subject and thus subject to different rules and constraints.

Get a good Modeling and Design book before you do anything else.

Try to do some simple applications using databases in your favorite programming language. Learn by doing. And when you get a problem, read the DBMS-documentation and learn.

http://video.stoimen.com/2009/04/12/zend-framework-zend_db-tutorial/ perhaps

Related

Reasons for absence of performance test for oss database systems? [closed]

Closed 9 years ago.
I used open source java implementation of TPC-C benchmark (called TCJ - TPC-C via JDBC (created by MMatejka last year)) to compare the performance of Oracle and 2 OSS DBMS.
TPC-C is standard in the proprietary sphere and my question is:
What are the main reasons that there is not systematically implemented performance test for OSS database systems?
Firstly, I'm not sure your question is a perfect fit for SO, as it is getting close to asking for opinion, and so all of my answer is more opinion than fact. Most of this I have read over the years, but I would struggle to find references/proof anymore. I am not a TPC member, but I did heavily investigate trying to get a distributed column-store database tested under the TPC-H suite.
Benchmarks
These are great at testing a single feature and comparing them, unfortunately, that is nowhere as easy as it sounds. Companies will spend large amounts of effort to get better results, sometimes (so I have heard) implementing specific functions in the source for a benchmark. There is a lot of discussion about how reliable benchmark results are overall. Also, a benchmark may be a perfect fit for some product, but not another.
Your example uses JDBC, but not every database has JDBC, or worse, it may be a 'minor bolt-on' just to enable that class of application. So performing benchmarks via JDBC, when all main usage will be embedded SQL, may portray some solutions unfairly/poorly.
There are some arguments around that benchmarks distract vendors from real priorities: they spend effort and implement features solely for benchmarks.
Benchmarks can also be very easily misunderstood; even TPC is a suite of different benchmarks, and you need to select the correct one for your needs (TPC-C for OLTP, TPC-H for DSS, etc.).
TPC
If this reads as negative for TPC, please forgive me; I am pro TPC.
TPC defines a very tight set of test requirements. You must follow these to the letter. For TPC-H, this is an example of what you must do:
do multiple runs, some in parallel, some single user
use exactly the SQL provided; you must not change it at all. If you need to change it because your system uses a slightly different syntax, you must get a waiver.
you must use an external auditor.
you may not index columns etc. beyond what is specified.
for TPC-H you must do writing in a specified way (which eliminates 'single writer' style databases)
The above ensures that people reading the results can have trust in the integrity of the results, which is great for a corporate buyer.
TPC is a non-profit org and anybody can join. There is a fee, but it isn't a major barrier, except for OSS. You are only realistically going to pay this fee if you think you can achieve really great results, or you need published results to bid for govt contracts etc.
The biggest problem I see with TPC for OSS is that it is heavily skewed towards relational vendors, and very few OSS solutions can meet the entry criteria with their offerings, or if they do, they may not perform well enough for every test. Doing a benchmark may also be a distraction for some teams.
Alternatives to TPC
Of course alternatives to TPC exist, but none has really gained traction yet, that I am aware of. Major vendors often stipulate that you cannot benchmark their products and publish the results, so any new benchmark will need to be politically astute to get them on board. I agree with the vendors' stance here; I would hate for someone to mis-implement a benchmark and report my product poorly.
The database landscape has fractured a lot since TPC started, but many 'bet your business' applications still run on 'classic' databases, so they still have a place. However, with the rise of NoSQL etc., there is a place for new benchmarks, but the real question becomes what to measure - even choosing xyz LIKE '%kitten%' vs. xyz LIKE 'kitten%' will have dramatic effects on different solutions. If you solve that, what common interface will you allow (ODBC, JDBC, HTTP/AJAX, embedded SQL, etc.)? Each of these interfaces affects performance greatly. What about the actual models, such as ACID for relational models vs. eventual consistency models? What about hardware/software solutions that use specifically designed hardware?
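The '%kitten%' vs 'kitten%' point is easy to see with SQLite's EXPLAIN QUERY PLAN (toy table; other engines behave analogously but report plans differently). Note the pragma: SQLite only applies its LIKE optimization to a default index when case-sensitive LIKE is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT)")
conn.execute("CREATE INDEX pets_name_idx ON pets (name)")
conn.execute("PRAGMA case_sensitive_like = ON")  # lets SQLite use the index for prefix LIKE

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable step in the last column
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

prefix   = plan("SELECT * FROM pets WHERE name LIKE 'kitten%'")
contains = plan("SELECT * FROM pets WHERE name LIKE '%kitten%'")

print(prefix)    # a SEARCH: the prefix pattern can use the index
print(contains)  # a SCAN: the leading wildcard defeats the index entirely
```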
Each database has made design trade offs for different needs, and a benchmark is attempting to level the playing field, which is only really possible if you have something in common, or report lots of different metrics.
One of the problems with trying to create an alternative is: who will pay? You need consensus over the type of tests to perform, and then you need to audit results for them to be meaningful. This all costs money.

Database design for a company: single or multi database? [closed]

Closed 9 years ago.
We're (re)designing a corporate information system. For the database design, we are exploring these options:
[Option 1->] a single CompanyBigDatabase that has everything,
[Option 2->] several databases across the company (say, HRD_DB, FinanceDB, MarketingDB), which then synchronized through a layer of application. EmployeeTable is owned by HRD, if Finance wants to refer to employees, it queries EmployeeTable from HRD_DB via a web-service.
What is the best practice? What's the pros and cons? We want it to have high availability and to be reasonably reliable. Does Option 1 necessitate clustering-and-all for this? Do big companies and universities (like Toyota, Samsung, Stanford Uni, MIT, ...), always opt for Option 1?
I was looking in many DB textbooks but I could not find a sufficient explanation on this topic.
Any thoughts, tips, links, or advice is welcome. Thanks.
I've done this type of work for 20 yrs. Enterprise Architecting is one term used to describe this. If you are asking this question in a real enterprise scenario, I'm going to recommend you get advice. If it's a uni question, there are so many things to consider:
budget
politics
timeframes
legacy systems or green field,
Scope of Build
In house or Hosted
Complete Outsource of some or all of the functionality (SaaS)
....
Entire Methodologies are written to support projects that do this.
You can end up with many answers to the variables.
Even agreeing on how to weight features and outcomes is tough.
This is a HUGE question; you could write a book on it.
It's like a 2-paragraph question where I have seen 10 people spend a month putting a business case together to do X. That's just costing and planning the various options, without even selecting the final approach.
So I have not directly answered your question... that, my friend, is a serious research project, not really a StackOverflow question.
There is no single answer. It depends on many other factors such as database load, application architecture, scalability, etc. My suggestion: start the simplest way possible (a single database) and change it based on need.
A single database has its advantages: simpler joins, referential integrity, a single backup. Only separate pieces of data when you have a valid reason/need.
In my opinion, it would be more appropriate to have the database normalized and have several databases across the company based on departments. This would allow you to manage data more effectively in terms of storing, retrieving and updating information, and provide access to users based on department type or user type. You can also provide different views of the database. It will be a lot easier to manage data.
There is a general principle of databases in particular, and computing in general, that there should be a single authoritative source for every data item.
Extending this to sets of data, as soon as you have multiple customer lists, multiple lists of items, multiple email addresses, you are soon into a quagmire of uncertainty that will then call for a business intelligence solution to resolve them all.
Now I'm a business intelligence chap by historical leaning, but I'd be the first to say that this is not a path that you want to go down simply because Marketing and Accounts cannot decide the definition of "customer". You do it because your well-normalised OLTP systems do not make it easy to count how many customers there were yesterday, last week, and last year.
Nor should they either, because then they would be in danger of sacrificing their true purpose -- to maintain a high performance, high-integrity persistent store of the "data universe" that your company exists in.
So in other words, the single-database approach has data integrity on its side, and you really do not want to work in a company that does not have data integrity. As a Business Intelligence practitioner I can tell you that it is a horrible place.
On the other hand, you are going to have practical situations in which you simply must have separate systems due to application vendor constraints etc, and in that case the problem becomes one of keeping the data as tightly coupled as possible, and of Metadata Management (ugh) in which the company agrees what data in what location has what meaning.
Either will work, and other decisions will mostly affect your specification. To some extent your question could be described as 'Should I go down the ERP path or the SaaS path?' I think it is telling that right now most systems are tending towards SaaS.
How will you be managing the applications? If they will be updated at different times, separate DBs make more sense (the SaaS path). On the other hand, having one DB to connect to, one authorisation system, one place to look for details, one place to back up, etc. appears to decrease complexity in the technical space. But then it does not allow decisions affecting one part of the business to be considered separately from other parts of the business.
Once the business is involved trying to get a single time each department agrees to an upgrade can be hell. Having a degree of abstraction so that you only have to get one department to align before updating its part of the stack has real advantages in coming years. And if your web services are robust and don't change with each release this can be a much easier path.
Don't forget you can have views of data in other DBs.
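As a sketch of that idea with sqlite3 (file and table names invented): SQLite exposes another database file via ATTACH, and a TEMP view may reference it, so one department can read another's data through a view.

```python
import sqlite3
import os
import tempfile

# Two separate database files, as two departments might have.
hr_path = os.path.join(tempfile.mkdtemp(), "hr.db")
hr = sqlite3.connect(hr_path)
hr.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
hr.execute("INSERT INTO employees VALUES (1, 'Ada')")
hr.commit()
hr.close()

# The finance database attaches HR's file and reads it through a view.
fin = sqlite3.connect(":memory:")
fin.execute("ATTACH DATABASE ? AS hr", (hr_path,))
fin.execute("CREATE TEMP VIEW staff AS SELECT id, name FROM hr.employees")
rows = fin.execute("SELECT name FROM staff").fetchall()
print(rows)  # [('Ada',)]
```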
And as for your question of how most big companies work: generally by a mish-mash of big and little systems that sometimes talk to each other, sometimes don't, and often repeat data. Having said that, repeating data is a real problem; always have an authoritative source and copies (or even better, only one copy). A method I have seen work well in a number of enterprises is to have one place where details can be CRUDed (Created, Retrieved, Updated and Deleted) and numerous places where they can be read.
And really, this design decision has little or nothing to do with availability and reliability. That tends to come from good design (simplicity, knowing where things live, etc.), good practices (good release practices, admin practices, backups, intelligent redundancy, etc.) and spending money. Not from having one or multiple systems.

What should every developer know about databases? [closed]

Closed 9 years ago.
Whether we like it or not, many if not most of us developers either regularly work with databases or may have to work with one someday. And considering the amount of misuse and abuse in the wild, and the volume of database-related questions that come up every day, it's fair to say that there are certain concepts that developers should know - even if they don't design or work with databases today.
What is one important concept that developers and other software professionals ought to know about databases?
The very first thing developers should know about databases is this: what are databases for? Not how do they work, nor how do you build one, nor even how do you write code to retrieve or update the data in a database. But what are they for?
Unfortunately, the answer to this one is a moving target. In the heyday of databases, the 1970s through the early 1990s, databases were for the sharing of data. If you were using a database and you weren't sharing data, you were either involved in an academic project or you were wasting resources, including yourself. Setting up a database and taming a DBMS were such monumental tasks that the payback, in terms of data exploited multiple times, had to be huge to match the investment.
Over the last 15 years, databases have come to be used for storing the persistent data associated with just one application. Building a database for MySQL, or Access, or SQL Server has become so routine that databases have become almost a routine part of an ordinary application. Sometimes, that initial limited mission gets pushed upward by mission creep, as the real value of the data becomes apparent. Unfortunately, databases that were designed with a single purpose in mind often fail dramatically when they begin to be pushed into a role that's enterprise wide and mission critical.
The second thing developers need to learn about databases is the whole data centric view of the world. The data centric world view is more different from the process centric world view than anything most developers have ever learned. Compared to this gap, the gap between structured programming and object oriented programming is relatively small.
The third thing developers need to learn, at least in an overview, is data modeling, including conceptual data modeling, logical data modeling, and physical data modeling.
Conceptual data modeling is really requirements analysis from a data centric point of view.
Logical data modeling is generally the application of a specific data model to the requirements discovered in conceptual data modeling. The relational model is used far more than any other specific model, and developers need to learn the relational model for sure. Designing a powerful and relevant relational model for a nontrivial requirement is not a trivial task. You can't build good SQL tables if you misunderstand the relational model.
Physical data modeling is generally DBMS specific, and doesn't need to be learned in much detail, unless the developer is also the database builder or the DBA. What developers do need to understand is the extent to which physical database design can be separated from logical database design, and the extent to which producing a high speed database can be accomplished just by tweaking the physical design.
The next thing developers need to learn is that while speed (performance) is important, other measures of design goodness are even more important, such as the ability to revise and extend the scope of the database down the road, or simplicity of programming.
Finally, anybody who messes with databases needs to understand that the value of data often outlasts the system that captured it.
Whew!
Good question. The following are some thoughts in no particular order:
Normalization, to at least the second normal form, is essential.
Referential integrity is also essential, with proper cascading delete and update considerations.
Good and proper use of check constraints. Let the database do as much work as possible.
Don't scatter business logic in both the database and middle tier code. Pick one or the other, preferably in middle tier code.
Decide on a consistent approach for primary keys and clustered keys.
Don't over index. Choose your indexes wisely.
Consistent table and column naming. Pick a standard and stick to it.
Limit the number of columns in the database that will accept null values.
Don't get carried away with triggers. They have their use but can complicate things in a hurry.
Be careful with UDFs. They are great but can cause performance problems when you're not aware how often they might get called in a query.
Get Celko's book on database design. The man is arrogant but knows his stuff.
First, developers need to understand that there is something to know about databases. They're not just magic devices where you put in the SQL and get out result sets, but rather very complicated pieces of software with their own logic and quirks.
Second, that there are different database setups for different purposes. You do not want a developer making historical reports off an on-line transactional database if there's a data warehouse available.
Third, developers need to understand basic SQL, including joins.
Past this, it depends on how closely the developers are involved. I've worked in jobs where I was developer and de facto DBA, where the DBAs were just down the aisle, and where the DBAs are off in their own area. (I dislike the third.) Assuming the developers are involved in database design:
They need to understand basic normalization, at least the first three normal forms. Anything beyond that, get a DBA. For those with any experience with US courtrooms (and random television shows count here), there's the mnemonic "Depend on the key, the whole key, and nothing but the key, so help you Codd."
They need to have a clue about indexes, by which I mean they should have some idea what indexes they need and how they're likely to affect performance. This means not having useless indices, but not being afraid to add them to assist queries. Anything further (like the balance) should be left for the DBA.
They need to understand the need for data integrity, and be able to point to where they're verifying the data and what they're doing if they find problems. This doesn't have to be in the database (where it will be difficult to issue a meaningful error message for the user), but has to be somewhere.
They should have the basic knowledge of how to get a plan, and how to read it in general (at least enough to tell whether the algorithms are efficient or not).
They should know vaguely what a trigger is, what a view is, and that it's possible to partition pieces of databases. They don't need any sort of details, but they need to know to ask the DBA about these things.
They should of course know not to meddle with production data, or production code, or anything like that, and they should know that all source code goes into a VCS.
I've doubtless forgotten something, but the average developer need not be a DBA, provided there is a real DBA at hand.
Basic Indexing
I'm always shocked to see a table or an entire database with no indexes, or arbitrary/useless indexes. Even if you're not designing the database and just have to write some queries, it's still vital to understand, at a minimum:
What's indexed in your database and what's not:
The difference between types of scans, how they're chosen, and how the way you write a query can influence that choice;
The concept of coverage (why you shouldn't just write SELECT *);
The difference between a clustered and non-clustered index;
Why more/bigger indexes are not necessarily better;
Why you should try to avoid wrapping filter columns in functions.
Designers should also be aware of common index anti-patterns, for example:
The Access anti-pattern (indexing every column, one by one)
The Catch-All anti-pattern (one massive index on all or most columns, apparently created under the mistaken impression that it would speed up every conceivable query involving any of those columns).
The quality of a database's indexing - and whether or not you take advantage of it with the queries you write - accounts for by far the most significant chunk of performance. 9 out of 10 questions posted on SO and other forums complaining about poor performance invariably turn out to be due to poor indexing or a non-sargable expression.
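A quick sqlite3 demonstration of the sargability point (invented schema): wrapping the filter column in a function forces a scan, while the bare column can be matched against the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE INDEX events_created_idx ON events (created)")

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row describes the step
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Sargable: the bare column can be compared against index entries directly.
sargable = plan("SELECT id FROM events WHERE created >= '2024-01-01'")

# Non-sargable: the function hides the column from the index.
wrapped = plan("SELECT id FROM events WHERE substr(created, 1, 4) = '2024'")

print(sargable)  # SEARCH ... USING ... INDEX events_created_idx
print(wrapped)   # SCAN: every row must be examined
```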
Normalization
It always depresses me to see somebody struggling to write an excessively complicated query that would have been completely straightforward with a normalized design ("Show me total sales per region.").
If you understand this at the outset and design accordingly, you'll save yourself a lot of pain later. It's easy to denormalize for performance after you've normalized; it's not so easy to normalize a database that wasn't designed that way from the start.
At the very least, you should know what 3NF is and how to get there. With most transactional databases, this is a very good balance between making queries easy to write and maintaining good performance.
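For instance, with a normalized toy schema in sqlite3 (names invented), "total sales per region" really is a straightforward join plus GROUP BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE stores  (id INTEGER PRIMARY KEY, region_id INTEGER REFERENCES regions(id));
    CREATE TABLE sales   (id INTEGER PRIMARY KEY, store_id INTEGER REFERENCES stores(id), amount REAL);
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
    INSERT INTO stores  VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO sales   VALUES (1, 1, 100.0), (2, 2, 50.0), (3, 3, 75.0);
""")

# One fact table, one aggregation - no string-splitting or repeated columns.
totals = conn.execute("""
    SELECT r.name, SUM(s.amount)
    FROM sales s
    JOIN stores st ON st.id = s.store_id
    JOIN regions r ON r.id = st.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(totals)  # [('North', 150.0), ('South', 75.0)]
```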
How Indexes Work
It's probably not the most important, but for sure the most underestimated topic.
The problem with indexing is that SQL tutorials usually don't mention them at all and that all the toy examples work without any index.
Even more experienced developers can write fairly good (and complex) SQL without knowing more about indexes than "An index makes the query fast".
That's because SQL databases do a very good job working as black-box:
Tell me what you need (gimme SQL), I'll take care of it.
And that works perfectly to retrieve the correct results. The author of the SQL doesn't need to know what the system is doing behind the scenes--until everything becomes sooo slooooow.....
That's when indexing becomes a topic. But that's usually very late and somebody (some company?) is already suffering from a real problem.
That's why I believe indexing is the No. 1 topic not to forget when working with databases. Unfortunately, it is very easy to forget it.
Disclaimer
The arguments are borrowed from the preface of my free eBook "Use The Index, Luke". I am spending quite a lot of my time explaining how indexes work and how to use them properly.
I just want to point out an observation: it seems that the majority of responses assume database is interchangeable with relational database. There are also object databases and flat-file databases. It is important to assess the needs of the software project at hand. From a programmer's perspective, the database decision can be delayed until later. Data modeling, on the other hand, can be done early on and lead to much success.
I think data modeling is a key component and is a relatively old concept yet it is one that has been forgotten by many in the software industry. Data modeling, especially conceptual modeling, can reveal the functional behavior of a system and can be relied on as a road map for development.
On the other hand, the type of database required can be determined based on many different factors to include environment, user volume, and available local hardware such as harddrive space.
Avoiding SQL injection and how to secure your database
Every developer should know that this is false: "Profiling a database operation is completely different from profiling code."
There is a clear Big-O in the traditional sense. When you do an EXPLAIN PLAN (or the equivalent) you're seeing the algorithm. Some algorithms involve nested loops and are O( n ^ 2 ). Other algorithms involve B-tree lookups and are O( n log n ).
This is very, very serious. It's central to understanding why indexes matter. It's central to understanding the speed-normalization-denormalization tradeoffs. It's central to understanding why a data warehouse uses a star-schema which is not normalized for transactional updates.
If you're unclear on the algorithm being used do the following. Stop. Explain the Query Execution plan. Adjust indexes accordingly.
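A minimal sqlite3 version of that loop (toy tables): read the plan, spot the nested-loop scan (O(n * m)), add the index, and watch it turn into a B-tree search (O(n log m)). The pragma disabling SQLite's transient automatic indexes is there so the raw nested-loop plan stays visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA automatic_index = OFF")  # show the raw nested-loop plan
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust INTEGER);
    CREATE TABLE customers (cust INTEGER, name TEXT);  -- no index on cust yet
""")

def plan(sql):
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

join_sql = "SELECT name FROM orders JOIN customers USING (cust)"

before = plan(join_sql)  # both tables scanned: nested loops, O(n * m)
conn.execute("CREATE INDEX customers_cust_idx ON customers (cust)")
after = plan(join_sql)   # inner table becomes an indexed SEARCH: O(n log m)

print(before)
print(after)
```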
Also, the corollary: More Indexes are Not Better.
Sometimes an index focused on one operation will slow other operations down. Depending on the ratio of the two operations, adding an index may have good effects, no overall impact, or be detrimental to overall performance.
I think every developer should understand that databases require a different paradigm.
When writing a query to get at your data, a set-based approach is needed. Many people with an iterative background struggle with this. And yet, when they embrace it, they can achieve far better results, even though the solution may not be the one that first presented itself in their iterative-focused minds.
Excellent question. Let's see: first, no one should consider querying a database who does not thoroughly understand joins. That's like driving a car without knowing where the steering wheel and brakes are. You also need to know data types and how to choose the best one.
Another thing that developers should understand is that there are three things you should have in mind when designing a database:
Data integrity - if the data can't be relied on you essentially have no data - this means do not put required logic in the application as many other sources may touch the database. Constraints, foreign keys and sometimes triggers are necessary to data integrity. Don't fail to use them because you don't like them or don't want to be bothered to understand them.
Performance - it is very hard to refactor a poorly performing database and performance should be considered from the start. There are many ways to do the same query and some are known to be faster almost always, it is short-sighted not to learn and use these ways. Read some books on performance tuning before designing queries or database structures.
Security - this data is the life-blood of your company, it also frequently contains personal information that can be stolen. Learn to protect your data from SQL injection attacks and fraud and identity theft.
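A small sqlite3 sketch of the first and third points: declared constraints reject bad data no matter which application writes it, and parameter placeholders are the standard defense against SQL injection. (Table names are invented; note that SQLite needs `PRAGMA foreign_keys = ON` per connection.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL CHECK (total >= 0)
    );
""")
con.execute("INSERT INTO customers VALUES (1, 'Ada')")

# Integrity: the database rejects an order for a customer that doesn't exist,
# no matter which application (or ad-hoc script) tries to insert it.
try:
    con.execute("INSERT INTO orders VALUES (1, 999, 10.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Security: placeholders keep user input out of the SQL text, so injection
# attempts arrive as harmless values, not as executable SQL.
user_input = "1 OR 1=1"
rows = con.execute("SELECT * FROM orders WHERE customer_id = ?", (user_input,)).fetchall()
print(rows)  # [] -- the input is treated as a value, not as SQL
```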
When querying a database, it is easy to get the wrong answer. Make sure you understand your data model thoroughly. Remember that actual business decisions are often made based on the data your query returns. When it is wrong, the wrong business decisions are made. You can kill a company with bad queries, or lose a big customer. Data has meaning; developers often seem to forget that.
Data almost never goes away, think in terms of storing data over time instead of just how to get it in today. That database that worked fine when it had a hundred thousand records, may not be so nice in ten years. Applications rarely last as long as data. This is one reason why designing for performance is critical.
Your database will probably need fields that the application doesn't need to see: things like GUIDs for replication, date-inserted fields, etc. You also may need to store a history of changes and who made them and when, and be able to restore bad changes from this storehouse. Think about how you intend to do this before you come ask a web site how to fix the problem where you forgot to put a WHERE clause on an UPDATE and updated the whole table.
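One possible sketch of such a history storehouse, using a trigger in SQLite (the names are invented for illustration; a real audit scheme would usually also record the user and the application):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
    -- History table the application never shows, but you'll be glad it exists.
    CREATE TABLE accounts_history (
        account_id INT,
        old_balance REAL,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER trg_accounts_audit
    AFTER UPDATE OF balance ON accounts
    BEGIN
        INSERT INTO accounts_history (account_id, old_balance)
        VALUES (OLD.id, OLD.balance);
    END;
    INSERT INTO accounts VALUES (1, 100.0);
""")

# The dreaded UPDATE without a WHERE clause:
con.execute("UPDATE accounts SET balance = 0")

# The old values survive in the history table, so the mistake is recoverable.
print(con.execute("SELECT account_id, old_balance FROM accounts_history").fetchall())
# [(1, 100.0)]
```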
Never develop in a newer version of a database than the production version. Never, never, never develop directly against a production database.
If you don't have a database administrator, make sure someone is making backups and knows how to restore them and has tested restoring them.
Database code is code, there is no excuse for not keeping it in source control just like the rest of your code.
Evolutionary Database Design. http://martinfowler.com/articles/evodb.html
These agile methodologies make database change process manageable, predictable and testable.
Developers should know what it takes to refactor a production database in terms of version control, continuous integration, and automated testing.
The Evolutionary Database Design process has administrative aspects; for example, a column may need to be dropped from all databases of this codebase after a defined lifetime.
At the very least, know that the Database Refactoring concept and its methodologies exist.
http://www.agiledata.org/essays/databaseRefactoringCatalog.html
The classification and process descriptions also make it possible to implement tooling for these refactorings.
About the following comment to Walter M.'s answer:
"Very well written! And the historical perspective is great for people who weren't doing database work at that time (i.e. me)".
The historical perspective is in a certain sense absolutely crucial. "Those who forget history are doomed to repeat it." Cf. XML repeating the hierarchical mistakes of the past, graph databases repeating the network mistakes of the past, OO systems forcing the hierarchical model upon users while everybody with even just a tenth of a brain should know that the hierarchical model is not suitable for general-purpose representation of the real world, and so on.
As for the question itself:
Every database developer should know that "Relational" is not equal to "SQL". Then they would understand why they are being let down so abysmally by the DBMS vendors, and why they should be telling those same vendors to come up with better stuff (e.g. DBMSs that are truly relational) if they want to go on sucking hilarious amounts of money out of their customers for such crappy software.
And every database developer should know everything about the relational algebra. Then there would no longer be a single developer left who had to post these stupid "I don't know how to do my job and want someone else to do it for me" questions on Stack Overflow anymore.
From my experience with relational databases, every developer should know:
- The different data types:
Using the correct type for the correct job will make your DB design more robust, your queries faster and your life easier.
- Learn about 1xM and MxM:
This is the bread and butter of relational databases. You need to understand one-to-many and many-to-many relations and apply them when appropriate.
- "K.I.S.S." principle applies to the DB as well:
Simplicity always works best. Provided you have studied how DBs work, you will avoid unnecessary complexity, which would otherwise lead to maintenance and speed problems.
- Indices:
It's not enough to know what they are. You need to understand when to use them and when not to.
also:
Boolean algebra is your friend
Images: Don't store them in the DB. Don't ask why.
Test DELETE with SELECT
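To illustrate the 1xM and MxM point above: a many-to-many relation is conventionally modeled with a junction table and resolved with two joins. A sqlite3 sketch (all names invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Many-to-many: students <-> courses via a junction table.
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollments (
        student_id INT REFERENCES students(id),
        course_id  INT REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)  -- no duplicate enrollments
    );
    INSERT INTO students VALUES (1, 'Ada');
    INSERT INTO courses VALUES (1, 'Databases'), (2, 'Algebra');
    INSERT INTO enrollments VALUES (1, 1), (1, 2);
""")

# Resolving the M:N relation takes two joins through the junction table.
rows = con.execute("""
    SELECT s.name, c.title
    FROM students s
    JOIN enrollments e ON e.student_id = s.id
    JOIN courses c     ON c.id = e.course_id
""").fetchall()
print(sorted(rows))  # [('Ada', 'Algebra'), ('Ada', 'Databases')]
```

A one-to-many relation needs only the single foreign key column; the junction table is what distinguishes MxM from 1xM.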
I would like everyone, both DBAs and developer/designer/architects, to better understand how to properly model a business domain, and how to map/translate that business domain model into both a normalized database logical model, an optimized physical model, and an appropriate object oriented class model, each one of which is (can be) different, for various reasons, and understand when, why, and how they are (or should be) different from one another.
I would say strong basic SQL skills. I've seen a lot of developers so far who know a little about databases but are always asking for tips about how to formulate a quite simple query. Queries are not always that easy and simple. You do have to use multiple joins (inner, left, etc.) when querying a well normalized database.
I think a lot of the technical details have been covered here and I don't want to add to them. The one thing I want to say is more social than technical, don't fall for the "DBA knowing the best" trap as an application developer.
If you are having performance issues with a query, take ownership of the problem too. Do your own research and push the DBAs to explain what's happening and how their solutions address the problem.
Come up with your own suggestions too after you have done the research. That is, I try to find a cooperative solution to the problem rather than leaving database issues to the DBAs.
Simple respect.
It's not just a repository
You probably don't know better than the vendor or the DBAs
You won't support it at 3 a.m. with senior managers shouting at you
Consider Denormalization as a possible angel, not the devil, and also consider NoSQL databases as an alternative to relational databases.
Also, I think the Entity-Relationship model is a must-know for every developer, even if you don't design databases. It will let you understand thoroughly what your database is all about.
Never insert data with the wrong text encoding.
Once your database becomes polluted with multiple encodings, the best you can do is apply some combination of heuristics and manual labor.
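A small Python sketch of the "decode at the boundary" discipline that keeps mojibake out of the database in the first place (the incoming bytes here are invented for illustration):

```python
import sqlite3

# Bytes arriving from outside (a file, a socket, a legacy system).
raw = "Ceci n'est pas à l'envers".encode("utf-8")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notes (body TEXT)")

# Wrong: guessing the codec produces mojibake that pollutes the table forever.
polluted = raw.decode("latin-1")   # decodes "successfully", but 'à' is mangled
# Right: decode with the encoding the producer actually used, store str.
clean = raw.decode("utf-8")

con.execute("INSERT INTO notes VALUES (?)", (clean,))
print(con.execute("SELECT body FROM notes").fetchone()[0])  # round-trips intact
```

SQLite stores TEXT as UTF-8 internally, so once you consistently hand it properly decoded strings, the encoding problem cannot re-enter through the database layer.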
Aside from syntax and conceptual options they employ (such as joins, triggers, and stored procedures), one thing that will be critical for every developer employing a database is this:
Know how your engine is going to perform the query you are writing with specificity.
The reason I think this is so important is simply production stability. You should know how your code performs so you're not blocking all execution in your thread while you wait for a long function to complete; why would you not want to know how your query will affect the database, your program, and perhaps even the server?
This is actually something that has hit my R&D team more times than missing semicolons or the like. The presumption is that the query will execute quickly because it does on the development system, with only a few thousand rows in the tables. Even if the production database is the same size, it is more than likely going to be used a lot more, and thus suffer from other constraints such as multiple users accessing it at the same time, or another query going wrong elsewhere and delaying the result of this one.
Even simple things like how joins affect performance of a query are invaluable in production. There are many features of many database engines that make things easier conceptually, but may introduce gotchas in performance if not thought of clearly.
Know your database engine execution process and plan for it.
For a middle-of-the-road professional developer who uses databases a lot (writing/maintaining queries daily or almost daily), I think the expectation should be the same as any other field: You wrote one in college.
Every C++ geek wrote a string class in college. Every graphics geek wrote a raytracer in college. Every web geek wrote interactive websites (usually before we had "web frameworks") in college. Every hardware nerd (and even software nerds) built a CPU in college. Every physician dissected an entire cadaver in college, even if she's only going to take my blood pressure and tell me my cholesterol is too high today. Why would databases be any different?
Unfortunately, they do seem different, today, for some reason. People want .NET programmers to know how strings work in C, but the internals of your RDBMS shouldn't concern you too much.
It's virtually impossible to get the same level of understanding from just reading about them, or even working your way down from the top. But if you start at the bottom and understand each piece, then it's relatively easy to figure out the specifics for your database. Even things that lots of database geeks can't seem to grok, like when to use a non-relational database.
Maybe that's a bit strict, especially if you didn't study computer science in college. I'll tone it down some: You could write one today, completely, from scratch. I don't care if you know the specifics of how the PostgreSQL query optimizer works, but if you know enough to write one yourself, it probably won't be too different from what they did. And you know, it's really not that hard to write a basic one.
The order of columns in a non-unique index is important.
The first column should be the column that has the most variability in its content (i.e. the highest cardinality).
This is to aid SQL Server's ability to create useful statistics on how to use the index at runtime.
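The statistics detail above is SQL Server-specific, but the portable consequence of column order is that only a filter on the leading column(s) can seek into the index. A sqlite3 sketch (names invented; exact plan wording varies by version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INT, status TEXT)")
con.execute("CREATE INDEX idx_events_user_status ON events (user_id, status)")

# Filtering on the leading column: the engine can SEARCH (seek) the index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 7"
).fetchall()
print(plan[0][3])  # e.g. "SEARCH events USING COVERING INDEX idx_events_user_status (user_id=?)"

# Filtering on the trailing column alone: no seek is possible, only a SCAN.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE status = 'paid'"
).fetchall()
print(plan[0][3])  # e.g. "SCAN events ..."
```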
Understand the tools that you use to program the database!!!
I wasted so much time trying to understand why my code was mysteriously failing.
If you're using .NET, for example, you need to know how to properly use the objects in the System.Data.SqlClient namespace. You need to know how to manage your SqlConnection objects to make sure they are opened, closed, and when necessary, disposed properly.
You need to know that when you use a SqlDataReader, it is necessary to close it separately from your SqlConnection. You need to understand how to keep connections open when appropriate, and how to minimize the number of hits to the database (because they are relatively expensive in terms of computing time).
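The advice above is .NET-specific, but the discipline is universal: tie connection and reader lifetimes to a scope so they are always released, even when a query raises. A Python sqlite3 analog using context managers (note that `with connection:` in sqlite3 manages transactions, not closing, hence `contextlib.closing`):

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE t (x INT)")
    con.execute("INSERT INTO t VALUES (1)")
    # The cursor (the reader analog) gets its own scope, closed independently.
    with closing(con.execute("SELECT x FROM t")) as cur:
        for (x,) in cur:
            print(x)
# Here both the cursor and the connection are closed, no matter what happened.
```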
Basic SQL skills.
Indexing.
Deal with different incarnations of DATE/TIME/TIMESTAMP.
JDBC driver documentation for the platform you are using.
Deal with binary data types (CLOB, BLOB, etc.)
For some projects, an Object-Oriented model is better.
For other projects, a Relational model is better.
The impedance mismatch problem, and the common deficiencies of ORMs.
RDBMS Compatibility
Check whether the application needs to run on more than one RDBMS. If so, it might be necessary to:
avoid RDBMS SQL extensions
eliminate triggers and stored procedures
follow strict SQL standards
convert field data types
change transaction isolation levels
Otherwise, these questions should be treated separately and different versions (or configurations) of the application would be developed.
Don't depend on the order of rows returned by an SQL query.
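In other words: if the order matters, say so with ORDER BY. A quick sqlite3 illustration (data invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, points INT)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("Ada", 90), ("Grace", 95), ("Edsger", 80)])

# Without ORDER BY the engine may return rows in any order it likes --
# insertion order today, index order after tomorrow's schema change.
# If the order matters, make it explicit:
rows = con.execute("SELECT name FROM scores ORDER BY points DESC").fetchall()
print([r[0] for r in rows])  # ['Grace', 'Ada', 'Edsger']
```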
Three (things) is the magic number:
Your database needs version control too.
Cursors are slow and you probably don't need them.
Triggers are evil*
*almost always

Comparing Database Platforms [closed]

Closed 9 years ago.
My employer has a database committee and we've been discussing different platforms. Thus far we've had people present on SQLite, MySQL, and PostgreSQL. We currently use the Microsoft stack so we're all quite familiar with Microsoft SQL Server.
As a part of this comparison I thought it would be interesting to create a small reference application for each database platform to explore the details involved in working with it.
First: does that idea make sense, or does the comparison require going beyond the scope of a trivial sample application?
Second: I would imagine each reference application having a discrete but small set of requirements that fulfill many of the scenarios we run into on a regular basis. Here is what I have so far; what else can be added to the list while still keeping the application small enough to be built in a very limited timespan?
Connectivity from the application layer
Tools for database administration
Process of creating a schema (small "s" schema, tables/views/functions other objects)
Simple CRUD (Create, Retrieve, Update, Delete)
Transaction support
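As a rough idea of scale: points 3-5 of such a reference application fit in a few dozen lines against SQLite. A sketch with invented table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Schema creation (small "s" schema).
con.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Simple CRUD.
con.execute("INSERT INTO contacts (name, email) VALUES (?, ?)",
            ("Ada", "ada@example.org"))
print(con.execute("SELECT name FROM contacts").fetchall())  # [('Ada',)]
con.execute("UPDATE contacts SET email = ? WHERE name = ?",
            ("ada@lovelace.net", "Ada"))
con.execute("DELETE FROM contacts WHERE name = ?", ("Ada",))
con.commit()

# Transaction support: either the whole block lands, or none of it does.
try:
    with con:  # sqlite3 commits on success, rolls back on exception
        con.execute("INSERT INTO contacts (name) VALUES ('Grace')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
print(con.execute("SELECT COUNT(*) FROM contacts").fetchone()[0])  # 0
```

Which rather supports the point made in the answers below: the basics are too easy to differentiate the platforms.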
Third: has anyone gone through this process, what are your findings?
Does that idea make sense, or does the comparison require going beyond the scope of a trivial sample application?
I don't think it's a good idea. Most of the things that will really affect you are long term database management issues, and how the database management system you choose can handle those things.
You could be tempted in the short term with things like "I found out in 3 seconds how to do this with XYZ database management system". Now, I'm not saying support is not important; quite the contrary. But finding an answer in google in 3 seconds means that you got an answer to a simple question. How quickly, if ever, can you find an answer to a challenging problem?
A short list (not exhaustive) of important things to consider are:
backup and recovery -- at both logical level and physical level
good support for functions (or stored procedures), triggers, various SQL query constructs
APIs that allow real extensibility -- these things can get you out of tough situations and allow you to solve problems in creative ways. You'd be surprised what can be accomplished with user-defined types and functions. How do the user-defined types interact with the indexing system?
SQL standard support -- doesn't trump everything else, but if support is lacking in a few areas, really consider why it is lacking, what the workarounds are, and what are the costs of those workarounds.
A powerful executor that offers a range of fundamental algorithms (e.g. hash join, merge join, etc.) and indexing structures (btree, hash, maybe a full text option, etc.). If it's missing some algorithms or index structures, consider the types of questions that the database will be inefficient at answering. Note: I don't just mean "slow" here; the wrong algorithm can easily be worse by orders of magnitude.
Can the type system reasonably represent your business? If the set of types available is incredibly weak, you will have a mess. Representing everything as strings is kind of like assembly programming (untyped), and you will have a mess.
A trivial application won't show you any of those things. Simple things are simple to solve. If you have a "database committee" then your company cares about its data, and you should take the responsibility seriously. You need to make sure that you can develop applications on it easily with the results you and your developers expect; and when you run into problems you need to have access to a powerful system and quality support that can get you through it.
Actually, learning the capabilities of each RDBMS is more crucial, because the right choice depends on the application. If you need spatial data capabilities, PostGIS on PostgreSQL is better than MySQL. If you need easy replication and high-availability features, MySQL seems better. There are also license issues. A link for comparison here. All have strengths and weaknesses. First gather the requirements of your project or projects, then compare them against the feature lists of the RDBMSs you are considering, and decide which one to go with.
I don't think you need to test the simple CRUD stuff, it's hard to imagine a vendor that doesn't support the basics.
Firstly, you're going beyond the scope of a sample app, in my humble opinion.
Secondly, I'd pick the one most appropriate to the tool or application you wish to develop. For example, are schemas and transactions relevant for a database that stores a single-user app configuration?
Thirdly, I've worked with Access, SQL Server, SQLite, MySQL, PostgreSQL and Oracle, and they all have their place. If you're in the MS space, go with SQL Server (and don't forget Express). There are also ADO.NET ways to talk to the others in my list. It depends on what you want.
Frankly, I doubt an arbitrarily-defined simple application would be likely to really highlight the differences between database engines. I think you'd be better to read the advertising literature for the various engines to see what they claim as their strong points. Then consider which of these matter to you, and construct tests specifically designed to verify claims that you care about.
For example, here are pros and cons of database engines I've used the most that have mattered to me. I don't claim this is an exhaustive list, but it may give you an idea of things to think about:
MySQL: Note: MySQL has two main engines internally: MyISAM and InnoDB. I have never used the InnoDB.
Pros: Fast. Free to cheap depending on how you're using it. Very convenient and easy-to-use commands for managing the schema. Some very useful extensions to the SQL standard, like "insert ... on duplicate".
Cons: The MyISAM engine does not support transactions, i.e. there's no rollback. MyISAM engine does not manage foreign keys for you. (InnoDB does not have these drawbacks, but as I say, I've never used it, so I can't comment much further.) Many deviations from SQL standards.
Oracle: Pros: Fast. Generally good conformance to SQL standards. My brother works for Oracle so if you buy there you'll be helping support my family. (Okay, maybe that's not an important pro for you ...)
Cons: Difficult to install and manage. Expensive.
Postgres: Pros: Very high conformance to SQL standards. Free. Very good "explain" plans.
Cons: Relatively slow. Optimizer is easily confused on complex queries. Some awkwardness in modifying existing tables.
Access: Pros: Easy to install and manage. Very easy to use schema management. Built-in data entry tools and query builder for quick-and-dirty stuff. Cheap.
Cons: Slow. Unreliable with multiple users.
I think that you can investigate Firebird too.
This is an extract from Firebird-General on Yahoo Groups, and I find it quite objective:
Our natural audience is developers who want to package and sell proprietary applications. Firebird is easier to package and install than Postgres; more capable than SQLite; and doesn't charge a royalty like MySQL.

Why did object oriented databases fail? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
Why did object oriented databases fail?
I find it astonishing that:
foo bar = new foo();
bar.saveToDatabase();
Lost to:
foo bar = new foo();
/* write complicated code to extract stuff from foo */
/* write complicated code to write stuff to database */
Related questions:
Are Object oriented databases still in use?
Object Oriented vs Relational Databases
Why have object oriented databases not been successful (yet)?
Probably because of their coupling with specific programming languages.
First, I don't believe they have "failed" entirely. There are still some around, and they're still used by a couple of companies, as far as I know.
Anyway, probably because a lot of the data we want to store in a database is relational in nature.
The problem is that while yes, OO databases are easier to integrate into OO programming languages, relational databases make it easier to define queries and, well, relations between the data stored. Which is often the complicated part.
I have been using db4o (an object-oriented database) lately on my personal pet projects, just because it is so quick to set up and get going. No need to deal with the nitty-gritty details.
That aside, as I see it, the main reasons why they haven't become popular are:
Reporting is more difficult on object oriented databases. Related to this, it is also easier to manually look into the actual data in a relational database.
Microsoft and Oracle base so much of their business on relational databases.
Lots of businesses already have relational databases in place.
The IT departments often have relational database expertise.
And, as Jan Aagaard has pointed out, lately it is because the problem has been solved in a different way, giving programmers an object-oriented feel even though they program against a relational database.
There are countless numbers of existing applications out there storing their data in relational databases. This data is the lifeblood of those companies. They have collectively invested huge amounts in storing, maintaining and reporting on this data. The cost and risk of moving this priceless information into a fundamentally different environment is extremely high.
Now consider that ORM tools can map modern application data structures into traditional relational models, and you remove pretty much any incentive to migrate to OODBMS. They provide a low-risk alternative to a very costly high risk migration.
Because, as much as ODBMS advertisements were laden with derogatory language about ORM systems, it wasn't that hard to make ORMs do the job, and without all the various hits taken in switching to a pure ODBMS.
What actually happened is that your first code sample won; it just happens to sit on top of an RDBMS persistence layer.
I think it is because the problem was solved differently. You might be using a relational database behind the scenes when you are coding in Ruby on Rails or LINQ to SQL, but it feels like you are working with objects.
Very subjective, but a few reasons come to mind:
Performance has not been as good as relational databases (or at least that's the perception)
Again with performance - relational databases allow you to do things like denormalizing data to further improve performance.
Legacy support for all the non-OO apps that need to access the data.
I think a lot of your answer lies in the "Why we abandoned Object Databases" answer of "Object Oriented vs Relational Databases".
As far as your example goes, it doesn't have to be that way. Linq to SQL is actually a quite nice basic layer over a DBMS, and Linq to Entities (v2 -- v1 sucked) will be pretty cool too. (N)Hibernate has been solving the problem you're having for years now using RDBMSes.
So I guess my answer to you is that O/R mappers are getting to the point where they solve your problem nicely, and you don't need an ODBMS to get what you need.
They will succeed some day. They are future.
Looking back at software technologies in history, the trend is sacrificing performance to reduce complexity (Assembly => C => C++ => .NET). An application that takes 30 minutes to code now once took a month.
ORMs are the right answer to the wrong question. Currently they are the choice, since they make life easier in the absence of a better solution. But they cannot handle the level of complexity they are aimed at. "Problems cannot be solved by the same level of thinking that created them." (A. Einstein)
As others mentioned, relational databases are heavily used and relied upon, and replacing them carries a lot of risk. Look at the interval between SQL Server versions and the major changes between those versions and other Microsoft products (a conservative approach, which is necessary here). I'll also add the following items:
The current approach still works. You may argue it will work forever (we can still write assembly, after all), but it stops making sense once the average complexity of projects, and the time it takes to develop them on relational databases, rings the alarm bell.
Major companies have not gotten seriously involved. When the market signals, they will.
The problem is not well defined yet. Unfortunately, it is mostly the current failures that help define it.
It needs improvements in other sciences (quantum computing, AI) rather than in computing alone. Storing and querying multidimensional data on a flat infrastructure, without enough smartness for self-organization, are the top obstacles at the theoretical level.
Why not?
I guess they were a solution to a problem nobody was having, or at least not having badly enough to pay for a solution.
Further, OOP and set-based programming are not always very compatible.
Personally, when I started reading about OO databases, I couldn't help but think "Boy, I hope I never have to work on one of those, update 1 million rows out of a 6 million row table and then make sure all appropriate records in other tables get updated as well"
