I know there are similar questions on SO but some are a decade old and others don't have helpful answers.
I am a newbie in the system design world. I have some experience with relational DBMSs, where I built a few small-scale projects.
Every post on 'Relational vs Non-Relational DBMS' points out that because of ACID transactions, referential integrity constraints and consistency, relational DBMSs are difficult to scale. But on the other hand, giants like Amazon and financial services continue to use relational DBMSs, and they don't seem to have any problems with scalability.
I just want to theoretically understand whether relational DBMSs are actually difficult to scale. If they are, how are these companies using them with terabytes of data?
Thank you!
As long as the data is in Third Normal Form, and appropriate indexes are applied, there should be no problem. This requires knowing what data is to be stored and how it needs to be accessed.
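As a minimal sketch of what that looks like in practice (the table and column names here are made up for illustration): keep repeating attributes in their own table and index the columns the application actually filters and joins on.

    -- Hypothetical 3NF layout: customers in one table, orders referencing them.
    CREATE TABLE Customer (
        CustomerId   INT IDENTITY PRIMARY KEY,
        CustomerName NVARCHAR(200) NOT NULL
    );

    CREATE TABLE CustomerOrder (
        OrderId    INT IDENTITY PRIMARY KEY,
        CustomerId INT NOT NULL REFERENCES Customer(CustomerId),
        OrderDate  DATETIME2 NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL
    );

    -- Index the access path the queries actually use (a customer's orders by date).
    CREATE INDEX IX_CustomerOrder_Customer_Date
        ON CustomerOrder (CustomerId, OrderDate);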
At some scale (e.g. I had a system which added up to 18 million rows per day to one table) you may want to have a process in place to transfer new data to an analysis database (OLAP, e.g. SQL Server Analysis Services, SSAS). An OLAP database is relatively easy to design, but the process to keep it up to date on even a daily basis can be difficult to design and manage without some experience. Queries to an OLAP database are optimised for reporting; I don't believe that any query for the system I mentioned ever took more than an average of 3 seconds to complete.
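For what it's worth, a rough sketch of that kind of daily transfer step might look like the following (all object names are hypothetical; a real process also has to cope with late-arriving and corrected rows, which is where most of the difficulty lies):

    -- Copy rows added since the last load from the transactional table
    -- into the analysis/staging table, then record the new high-water mark.
    DECLARE @LastLoaded DATETIME2 =
        (SELECT MAX(LoadedThrough) FROM Olap.LoadControl);
    DECLARE @LoadUpTo DATETIME2 = CAST(SYSDATETIME() AS DATE);  -- up to last midnight

    INSERT INTO Olap.FactTransaction (SourceTxnId, TxnDate, Amount, CustomerId)
    SELECT t.TxnId, t.TxnDate, t.Amount, t.CustomerId
    FROM   dbo.TransactionLog AS t
    WHERE  t.TxnDate >  @LastLoaded
      AND  t.TxnDate <= @LoadUpTo;

    UPDATE Olap.LoadControl SET LoadedThrough = @LoadUpTo;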
We are planning to create a web app to store banking transactions for customers, e.g. purchases, transfers, etc., and allow them to tag/categorize each transaction.
Could someone point us to the best DB for this purpose? It needs to scale horizontally and we also need to perform analysis on all transactions.
Thanks
The best database to store banking transactions is the one the banks use, DB2/z.
But, since I doubt you'd be able to afford a System z mainframe, that's probably not an option. That doesn't make it any less the best database of course.
If, however, you're talking about storing transactions for Joe Bloggs or Dodgy Brothers Rug Emporium (as opposed to the two hundred million or so customers of ICBC), pretty well any database will be up to the task - Oracle (despite its inability to differentiate NULLs from empty strings), SQL Server, MySQL, PostgreSQL, even SQLite probably.
I'm going to start this by saying it's almost impossible to recommend a system based on what you've described. It could serve such a varied range of uses, from mission-critical real-time financial data that needs to be there and needs to be accurate, through to a web app that sucks in financial records from a bank/credit card statement and lets the user annotate them, in which case it isn't as sensitive.
If you're storing mission critical, sensitive data, I'd go with a commercial option that includes significant support. Also a DBA would be a good idea.
Oracle or MS SQL would be my inclination, and probably Oracle over MS SQL because of its multi-platform support. If you're happy to run on Windows then MS SQL is fine.
If you're storing existing transactions that can be tagged (à la Blippy), then any database would be sufficient. If you're thinking of scaling this out to the nth degree, you might like one of the document database flavours of the month (MongoDB, CouchDB, etc.).
Really, I think the question should be reconsidered from the context of what your application will do, not the fact that it happens to do it with financial data. The fact that financial data may require additional security or additional accuracy checks forms part of what the system will do, as does the way users interact with your web app.
This may not answer your question directly, but here is what I have experienced.
I think it's really about how you'd save your banking transactions. Most database vendors provide a sufficient amount of performance, so all you have to do is choose one over another.
What you are left with is the actual information to be saved (besides the schema). You might think about using a database encryption option, but that's not entirely realistic in your case: because you are talking about transactions, I assume there are quite a lot of transactions coming in, and you'll be doing a large amount of reads for your reporting (besides the writes), possibly for mining, etc.
Usually (in SQL Server), with encryption enabled, any data that is written to the database files is encrypted. Snapshots and backups also use encryption, and the transaction log is protected as well, so it can hit the performance that you might desire.
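For reference, this is roughly what enabling Transparent Data Encryption looks like in SQL Server - the names, password and database are placeholders, and TDE is an Enterprise-only feature in older versions, so treat this as a sketch rather than a recipe:

    USE master;
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';
    CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

    USE BankTxnDb;
    CREATE DATABASE ENCRYPTION KEY
        WITH ALGORITHM = AES_256
        ENCRYPTION BY SERVER CERTIFICATE TdeCert;

    ALTER DATABASE BankTxnDb SET ENCRYPTION ON;
    -- Data files, log, snapshots and backups are now encrypted at rest;
    -- expect some CPU overhead on heavy read/write workloads.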
So, I see your question really boiling down to: how do you protect sensitive data?
Btw, I have deployed solutions with Oracle, SQL Server and even Sybase as back ends, with several transactions pouring in from ATMs, and what I really look for is performance, besides security. Except for minor limitations of one over another, they are all much the same.
The following articles might help:
Database security: protecting sensitive and critical information
Using One-Way Functions to Protect Sensitive Information in SQL Server Databases
We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.
Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we use in-house, which we'll use to do the "heavier" reporting as the requirements solidify.
My question is: how do you know when a separate database for reporting is required?
What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?
In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.
When transaction performance is critical.
When it's hard to get a maintenance window on the transactional app.
If reporting needs to correlate results not only from this app, but from other application silos.
If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
If the reports are long running.
If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations; see the sketch after this list).
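A trivial example of the kind of cleansing step meant in the last point, assuming a hypothetical staging table and a lookup table of canonical abbreviations:

    -- Map free-text state names to canonical two-letter codes during the load.
    UPDATE s
    SET    s.StateCode = m.StateCode
    FROM   Staging.CustomerAddress AS s
    JOIN   Ref.StateNameMap        AS m
           ON m.StateName = LTRIM(RTRIM(s.StateName));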
It adds non-trivial complexity, so imo, there has to be a good reason to split.
Typically, I would try to report off the transactional database initially.
Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.
When you do go to a reporting database, remember there are only a few reasons you are going there:
Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So if your reporting database is a straight copy of the same database, you're simply using delayed snapshots which won't interfere with production transactions.
Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.
Now both of the above could be done on the same server (even on the same instance, in a separate database or even just a separate schema) and you would still see benefits. When CPU and IO are completely pegged, at that point you definitely need to have it on a completely separate box (or upgrade your single box).
Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database holds the same data in a different model. Reporting on large amounts of data (particularly aggregates) is extremely fast in dimensional models because star schemas are very efficient for that. A dimensional model is also efficient for a larger variety of queries without a lot of re-indexing or analysis to change indexes, because it lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request).
You could view this as a kind of mini data warehouse, where you use data warehousing techniques but aren't necessarily implementing a full-blown data warehouse. Also, star schemas are particularly easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI or reporting tools from star schemas. You could do this on the same box or a different box, etc., just as discussed earlier.
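A minimal sketch of the kind of dimensional model being described (all names invented): one narrow fact table keyed to small dimension tables, which is what makes the "slice and dice" aggregates cheap.

    CREATE TABLE DimDate (
        DateKey      INT PRIMARY KEY,      -- e.g. 20240131
        CalendarDate DATE     NOT NULL,
        [Month]      TINYINT  NOT NULL,
        [Year]       SMALLINT NOT NULL
    );

    CREATE TABLE DimCustomer (
        CustomerKey  INT IDENTITY PRIMARY KEY,
        CustomerName NVARCHAR(200) NOT NULL,
        Region       NVARCHAR(50)  NOT NULL
    );

    CREATE TABLE FactSales (
        DateKey     INT NOT NULL REFERENCES DimDate(DateKey),
        CustomerKey INT NOT NULL REFERENCES DimCustomer(CustomerKey),
        Quantity    INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    );

    -- Typical ad-hoc aggregate: group by any mix of dimension attributes.
    SELECT d.[Year], c.Region, SUM(f.Amount) AS TotalSales
    FROM   FactSales   AS f
    JOIN   DimDate     AS d ON d.DateKey     = f.DateKey
    JOIN   DimCustomer AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY d.[Year], c.Region;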
This question requires experience rather than science.
As a BI architect, the approach I take in designing each BI solution for my clients is very different. I don't go through a checklist. It requires a general understanding of their system, their reporting requirements, budget and manpower.
I personally prefer to keep the reporting processes as much as possible on the database side (best practice in the BI world). REPORTING TOOLS ARE FOR DISPLAY PURPOSES ONLY (AT MOST FOR SMALL CALCULATIONS). This approach requires a lot of pre-processing of data, which requires different staging tables, triggers, etc.
When you said:
I work on projects with hundreds of millions of rows with real-time reporting, along with hundreds of users accessing the application/database at the same time without issue.
There are a few things wrong with your statement.
Hundreds of millions of rows are A LOT. Even today's in-memory tools like Cognos TM1 or QlikView would struggle to get such results (look at SAP HANA to understand how giants in the industry handle it).
If you have hundreds of millions of rows in a database, it doesn't necessarily mean that the report needs to go through all those records. Maybe the report worked on thousands, not millions; that's probably what you saw.
Transactional reports are very different from dashboards. Most dashboard tools pre-process and cache the data.
My point is that it all comes down to experience when deciding when to:
design a new schema
create a semantic database
work on the same transactional database
or even use a reporting tool (sometimes handwritten dashboards with Java/JSF/Ajax/jQuery or JSP work fine for the client)
The main reason you would need a separate database for reporting is when the generation of the reports interferes with the transactional responsibilities of the app. E.g. if a report takes 20 minutes to generate and utilizes 100% of the CPU/disk/etc. during a time of high activity, you might think of using a separate database for reporting.
As for questions, here are some basic ones:
Can I do the high intensity reports during non-peak hours?
Does it interfere with the users using the system?
If yes to #2, what are the costs of the interference vs. the cost of another database server, refactoring code, etc.?
I would also add another reason for which you might use a reporting database, and that is the CQRS pattern (Command Query Responsibility Segregation).
If you have a large number of users accessing and writing to a small set of data, you would do well to consider this pattern. Basically, in its simplest form, it means that all your commands (create, update, delete) are pushed to the transactional database.
All of your queries (reads) go to your reporting database. This lets you scale and evolve each side of your architecture independently.
There is MUCH more to the pattern; I just mentioned the bit that was interesting given your question about a reporting database.
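In its simplest form, and with entirely made-up object names, the split looks something like this: writes go through commands against the transactional database, while reads query a copy that is kept in sync by replication, log shipping or an ETL job.

    -- Command side (transactional database): all writes go through here.
    CREATE PROCEDURE dbo.TagTransaction
        @TxnId INT,
        @Tag   NVARCHAR(50)
    AS
    BEGIN
        INSERT INTO dbo.TransactionTag (TxnId, Tag) VALUES (@TxnId, @Tag);
    END;
    GO

    -- Query side (run against the reporting copy, never the transactional DB).
    SELECT t.TxnId, t.TxnDate, t.Amount, tg.Tag
    FROM   dbo.BankTransaction   AS t
    LEFT JOIN dbo.TransactionTag AS tg ON tg.TxnId = t.TxnId
    WHERE  t.CustomerId = 42;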
Basically, when the database load from the app becomes incompatible with the database load for reporting. This could be due to:
Reporting consuming an inordinate amount of database server resources, impacting the app's DB performance.
Part of this category would be app DB work having to wait on a very slow report query due to locking, though it might be possible to resolve that with less drastic methods such as lock tuning.
Reporting queries being very incompatible with app queries as far as tuning goes (e.g. indexes, but not limited to that) - the dumbest example would be something like a hot spot affecting app inserts because of a reporting-purpose index.
Timing issues, e.g. the only small windows available for DB maintenance (due to application usage) are the times of heavy reporting work.
Reporting data's sheer volume (e.g. logging, auditing, statistics) is so big that your primary DB server architecture is a bad solution for such reporting (see Sybase ASE vs. Sybase IQ). BTW, this is a real scenario - we moved our performance reporting to IQ because of this.
I would also add that transactional databases are meant to hold current state and oftentimes do so to be self-maintaining. You don't want transactional databases growing beyond their necessary means. When a workflow or transaction is complete then move that data out and into a Reporting database, which is much better designed to hold historical data.
As part of my role at the firm I'm at, I've been forced to become the DBA for our database. Some of our tables have row counts approaching 100 million, and many of the things that I know how to do in SQL Server (like joins) simply break down at this level of data. I'm left with a couple of options:
1) Go out and find a DBA with experience administering VLDBs. This is going to cost us a pretty penny and come at the expense of other work that we need to get done. I'm not a huge fan of it.
2) Most of our data is historical data that we use for analysis. I could simply create a copy of our database schema and start from scratch with new data, putting on hold any analysis of our current data until I find a proper way to solve the problem (this is my current "best" solution).
3) Reach out to the developer community to see if I can learn enough about large databases to get us through until I can implement solution #1.
Any help that anyone could provide, or any books you could recommend would be greatly appreciated.
Here are a few thoughts, but none of them are quick fixes:
Develop an archival strategy for the data in your large tables. Create tables with similar formats to the existing transactional tables and copy the data out into those tables on a periodic basis. If you can get away with whacking the data out of the tx system, then fine. (See the sketch after this list.)
Develop a relational data warehouse to store the large data sets, complete with star schemas consisting of fact tables and dimensions. For an introduction to this approach there is no better book (IMHO) than Ralph Kimball's Data Warehouse Toolkit.
For analysis, consider using MS Analysis Services for pre-aggregating this data for fast querying.
Of course, you could also look at your indexing strategy within the existing database. Be careful with any changes, as you could add indexes that would improve querying at the cost of insert and transactional performance.
You could also research partitioning in SQL Server.
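A bare-bones version of the archival idea in the first point, with hypothetical names; a real job would work in smaller batches and run off-peak:

    -- Move rows older than two years into an archive table of the same shape,
    -- then remove them from the live table.
    BEGIN TRANSACTION;

    INSERT INTO Archive.OrderHistory (OrderId, CustomerId, OrderDate, Amount)
    SELECT OrderId, CustomerId, OrderDate, Amount
    FROM   dbo.OrderHistory
    WHERE  OrderDate < DATEADD(YEAR, -2, SYSDATETIME());

    DELETE FROM dbo.OrderHistory
    WHERE  OrderDate < DATEADD(YEAR, -2, SYSDATETIME());

    COMMIT TRANSACTION;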
Don't feel bad about bringing in a DBA on a contract basis to help out...
To me, your best bet would be to begin investigating movement of that data out of the transactional system if it is not necessary for day to day use.
Of course, you are going to need to pick up some new skills for dealing with these amounts of data. Whatever you decide to do, make a backup first!
One more thing you should do is ensure that your I/O is being spread appropriately across as many spindles as possible. Your data files, log files and SQL Server tempdb data files should all be on separate drives with a database system that large.
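As an illustration (drive letters, sizes and names are placeholders; tempdb is configured once per instance, and the file move only takes effect after a restart):

    -- Put data and log on separate drives when creating the database.
    CREATE DATABASE BigDb
    ON PRIMARY
        ( NAME = BigDb_data, FILENAME = 'D:\SQLData\BigDb.mdf', SIZE = 100GB )
    LOG ON
        ( NAME = BigDb_log,  FILENAME = 'E:\SQLLogs\BigDb.ldf', SIZE = 20GB );

    -- Move the tempdb data file to its own drive.
    ALTER DATABASE tempdb
    MODIFY FILE ( NAME = tempdev, FILENAME = 'F:\TempDb\tempdb.mdf' );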
DBA's are worth their weight in gold, if you can find a good one. They specialize in doing the very thing that you are describing. If this is a one time problem, maybe you can subcontract one.
I believe Microsoft offers a similar service. You might want to ask.
You'll want to get a DBA in there, at least on contract to performance tune the database.
Joining to a 100-million-record table shouldn't bring the database server to its knees. My company's customers do it many hundreds (possibly thousands) of times per minute on our system.
Our in-house system is built on SQL Server 2008 with a 40-table 6NF schema. Most of the tables FK to 3 others, a key few as many as 7. The system will ultimately support 100s of employees working with 10s of 1000s of customers and store 100s of 1000s of transactional records -- prime-time access should peak at 1000 rows per second.
Is there any reason to think that this depth of RDBMS inter-relation would overburden a system built using modern hardware with ample RAM? I'm attempting to evaluate whether we need to adjust our design or project direction/goals before we approach the final development phase (in a couple of months).
In SQL Server terms, what you describe is a smallish database. With correct design SQL Server can handle terabytes of data.
That is not to guarantee that your current design will perform well. There are many ways to construct poorly performing T-SQL and many bad database design choices.
If I were you, I would load test data to twice the size you expect the tables to have and then start testing your code. Load testing might also be a good idea. It is far easier to fix database performance problems before they go to production. Far, far easier!
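One hedged way of generating that kind of test volume in T-SQL (the target table, columns and row count are all made up; adjust to your own schema):

    -- Insert ~10 million rows of random-ish data using a generated number sequence.
    WITH Numbers AS (
        SELECT TOP (10000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_columns AS a
        CROSS JOIN sys.all_columns AS b
    )
    INSERT INTO dbo.CustomerTransaction (CustomerId, TxnDate, Amount)
    SELECT CAST(n % 100000 + 1 AS INT),                          -- ~100k customers
           DATEADD(DAY, -CAST(n % 730 AS INT), SYSDATETIME()),   -- spread over 2 years
           CAST((n % 50000) / 100.0 AS DECIMAL(18,2))            -- amounts up to 500.00
    FROM Numbers;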
MS SQL Server and Oracle, which one is better in terms of scalability?
For example, if the data size reaches 500 TB, etc.
Both Oracle and SQL Server are shared-disk databases so they are constrained by disk bandwidth for queries that table scan over large volumes of data. Products such as Teradata, Netezza or DB/2 Parallel Edition are 'shared nothing' architectures where the database stores horizontal partitions on the individual nodes. This type of architecture gives the best parallel query performance as the local disks on each node are not constrained through a central bottleneck on a SAN.
Shared disk systems (such as Oracle Real Application Clusters or clustered SQL Server installations) still require a shared SAN, which has constrained bandwidth for streaming. On a VLDB this can seriously restrict the table-scanning performance that is possible to achieve. Most data warehouse queries run table or range scans across large blocks of data. If the query will hit more than a few percent of rows, a single table scan is often the optimal query plan.
Multiple local direct-attached disk arrays on the nodes give more disk bandwidth.
Having said that, I am aware of an Oracle DW shop (a major European telco) that has an Oracle-based data warehouse that loads 600 GB per day, so the shared disk architecture does not appear to impose insurmountable limitations.
Between MS-SQL and Oracle there are some differences. IMHO Oracle has better VLDB support than SQL Server for the following reasons:
Oracle has native support for bitmap indexes, which are an index structure suitable for high-speed data warehouse queries (see the sketch after this list). They essentially trade CPU for I/O as they are run-length encoded and use relatively little space. On the other hand, Microsoft claim that Index Intersection is not appreciably slower.
Oracle has better table partitioning facilities than SQL Server. IIRC, table partitioning in SQL Server 2005 can only be done on a single column.
Oracle can be run on somewhat larger hardware than SQL Server, although one can run SQL Server on some quite respectably large systems.
Oracle has more mature support for materialized views and query rewrite to optimise relational queries. SQL 2005 does have some query rewrite capability, but it is poorly documented and I haven't seen it used in a production system. However, Microsoft will suggest that you use Analysis Services, which does actually support shared-nothing configurations.
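For illustration, the two Oracle features mentioned above look roughly like this (table and column names are made up, and the bitmap index assumes an existing fact table; bitmap indexes only pay off on low-cardinality columns in read-mostly warehouses):

    -- Bitmap index on a low-cardinality warehouse column (Oracle syntax).
    CREATE BITMAP INDEX ix_sales_region ON fact_sales (region_code);

    -- Range partitioning by date, one partition per year (Oracle syntax).
    CREATE TABLE fact_sales_part (
        sale_date   DATE         NOT NULL,
        region_code VARCHAR2(4)  NOT NULL,
        amount      NUMBER(18,2)
    )
    PARTITION BY RANGE (sale_date) (
        PARTITION p2008 VALUES LESS THAN (DATE '2009-01-01'),
        PARTITION p2009 VALUES LESS THAN (DATE '2010-01-01'),
        PARTITION pmax  VALUES LESS THAN (MAXVALUE)
    );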
Unless you have truly biblical data volumes and are choosing between Oracle and a shared nothing architecture such as Teradata you will probably see little practical difference between Oracle and SQL Server. Particularly since the introduction of SQL2005 the partitioning facilities in SQL Server are viewed as good enough and there are plenty of examples of multi-terabyte systems that have been successfully implemented on it.
When you are talking 500TB, that is (a) big and (b) specialized.
I'd be going to a consultancy firm with appropriate specialists to look at the existing skill sets, integration with existing technology stacks, expected usage, backup/recovery/DR requirements....
In short, it's not the sort of project I'd be heading into based on opinions from stackoverflow. No offence intended, but there's simply too many factors to take into account, a lot of which would be business confidential.
Whether Oracle or MSSQL will scale/perform better is question #15. The data model is the first make-or-break item regardless of whether you're running Oracle, MSSQL, Informix or anything else. Data model structure, what kind of application it is, how it accesses the DB, and which platform your developers know well enough to target for a large system are the first questions you should ask yourself.
I've worked as a DBA on Oracle (although some years back) and I use MSSQL extensively now, although not as a formal DBA. My advice would be that in the vast majority of cases both will meet everything you can throw at them, and your performance issues will be much more dependent upon database design and deployment than the underlying characteristics of the products, which in both cases are absolutely and utterly solid (MSSQL is the best product that MS makes in many people's opinion, so don't let the usual perception of MS blind you on that).
Myself I would tend towards MSSQL unless your system is going to be very large and truly enterprise level (massive numbers of users, multiple 9's uptime etc.) simply because in my experience Oracle tends to require a higher level of DBA knowledge and maintenance than MSSQL to get the best out of it. Oracle also tends to be more expensive, both for initial deployment and in the cost to hire DBAs for it. OTOH if you are looking at an enterprise system then Oracle would have the edge, not least because if you can afford it their support is second to none.
I have to agree with those who said design was more important.
I've worked with superfast and super slow databases of many different flavors (the absolute worst being an Oracle database, but it wasn't Oracle's fault). Design of the database and how you decide to index it and partition it and query it have far more to do with the scalability than whether the product is from MSSQL Server or Oracle.
I think you may more easily find Oracle DBAs with terabyte-database experience (running a large database is a specialty, just like knowing a particular flavor of SQL), but that could depend on your local area.
Oracle people will tell you Oracle is better; SQL Server people will tell you SQL Server is better.
I say they scale pretty much the same. Use what you know better. There are databases out there of that size on Oracle as well as SQL Server.
When you get to OBSCENE database sizes (where over 1TB is really big enough, and 500TB is frigging massive), then operational support must come very high up on the list of requirements. With that much data, you don't mess about with penny pinching system specifications.
How are you going to backup that size of system? Upgrade the OS and patch the database? Scalability and reliability a concern?
I have experience of both Oracle and MS SQL, and for the really really big systems (users, data or importance) then Oracle is better designed for operational support and data management.
Ever tried to back up and restore a 1 TB+ SQL Server database split over multiple databases on multiple instances, with transaction log files being spat out everywhere by each database, and trying to keep it all in sync? Good luck with that.
With Oracle, you have ONE database (so I disagree that the "shared nothing" approach is better) with ONE set of REDO logs(1) and one set of archive logs(2), and you can just add extra hardware nodes without changing (i.e. repartitioning) your application and data.
(1) Redo logs are, of course, mirrored.
(2) Archive logs are, of course, stored in multiple locations.
It would also depend on what your application is meant for. If it uses only inserts with very few updates, then I think MSSQL would be more scalable and better in terms of performance. However, if one has lots of updates, then Oracle would scale up better.
I very much doubt that you are going to get an objective answer to that particular question, until you come across anyone that has implemented the same database (schema, data, etc.) on both platforms.
However, given the fact that you can find millions of happy users of both databases, I dare say it's not too much of a stretch to say either will scale just fine (I've seen a snappy SQL 2005 implementation of 300 TB that seemed pretty responsive).
Oracle is like a high-quality manual film camera, which needs the best photographer to take the best picture, while MS SQL is like an automatic digital camera. In the old days, of course, all professional photographers used film cameras; now think about how many professional photographers use automatic digital cameras.