What's TOO BIG for a database? - database

I have a buddy who runs a web app for people listing cars for sale. There are a few thousand clients who use it, and each client has hundreds and sometimes thousands of rows in the database (some have been on for 5 years with hundreds of cars selling each month, and 10s of rows per sale (comments, messages, etc)). He has run this system in one SQL Server database in one physical server with like 20GB or RAM and a couple processors for the whole time, with no problems. Is this some sort of miracle?
Just like most programmers, I'm no DBA and just get by, thanks to ORMs, etc. Everywhere I look, people talk about having the need to shard or get a separate database server for big users of a web app. Why is this? Is it really that inefficient to have a large DB with lots or rows? Should I plan to use Cassandra or something, or can I rely on scaling up well with Postgres?

I personally don't think what you've described is that large of a database. The server (20 gigs of ram? ;)) sounds decent. It's more about usage and design. If the database is indexed and well designed, it can grow much, much larger on the current hardware.
Before doing any sort of switch, I'd simply look at archiving useless data and optimising queries if there's a fear of performance issues.

The reason for sharding and separate db servers is that at some point it's going to be cheaper to use multiple cheaper machines than one expensive one. Hardware price doesn't scale linearly with performance and once you reach a certain point it'll be much cheaper to get twice as many machines as to get a machine that's twice as fast.

You should have no problem in SQL server, Oracle, or any modern relational or non-relational database. I have administered databases with 100's of millions of records and Terabytes of data.

Typically you split components up across different servers so you can manage up time, resilience, and performance more easily.
It's certainly quite possible to have one monster machine which does it all, but then you may need another monster machine in case your motherboard dies, or your datacenter is unavailable.
By splitting a web site or application up, amongst different server's it's easier to get cheaper machines, and more of them.
Thus you can build in resilience, and not have components which have similiar demands on hardware clashing.
It's also important to think about restore times for servers, and recovery plans.
What happens when your machine dies, can you replace it in the agreed upon time? Can you restore from backups in that time?
SQL Server or other enterprise class databases shouldn't have any problems with 10's or 100GB databases, as long as they not designed too badly. (We have a few machines with that capacity/use which aren't struggling at all.).

In my mind that's nothing. Having tens of millions of rows on multiple tables with database size exceeding 10 GB has not caused problems for MS SQL Server. Of course it is not too fast with that much data, but otherwise it works just fine.
And to answer the question, too big is so big it does cause problems. And when it starts causing problems depends on the table structure and your performance demands.

Databases are extremely efficient at storing and retrieving relational data (i.e. data that is structured and has references to other data) - that's what they're designed to do. Honestly, 99% of the people spewing about key-value stores and Cassandra and whatnot have no clue what they're doing. A database server is just fine for storing large volumes of data, particularly if you're willing to put a bit of work into tuning it properly.
That said, there are use cases for Cassandra et. al. - if you have mostly unstructured key/value data or don't need consistency or want to shard for redundancy, it may be worth investigating.
Unless you're an extremely popular website, you probably can get by just fine with a decent database server - don't switch until you've determined why you need to switch. Switching is fine, just make sure you are switching because it serves your needs better, and not because it's the "cool web-scale thing to do"

Related

Performance tuning of ERP System called axavia

I'm working here in a small company and one of my jobs is the administration of the ERP system 'AXAVIA' (www.axavia.com)
There are .NET Clients and a MSSQL Server 2005 Database with a size of about 10GB.
The system works on a metadata model, this means they have very few tables (one for each datatype and some for the relations) and this data is computed with adhoc queries. Up to 2000 batches / sec...
I guess they don't really hava a database specialist, because the didn't know anything about index fragmentation and i allready deleted a lot of unused indexes - now the db is about 30% smaller...
What else can i do for more performance?
- I rebuild now the indexes every night
I think, there are no 'missing indexes' and also the primary keys are at least 'ok'
The filesystem is a fast 10 raid - and with 6,6 GB Ram there is very little IO
The Server is a VM Ware with one virtual CPU - here i guess is the beste possibility: The huge ammount of small batches would benefit from a phyical cpu with 4 cores?!
I'm also thinking about partitioned tables, but in the moment the database isn't big enough to benefit much from this.
So - any other ideas?
Add a CPU, at lesat for test. I Would say you likely run into a problem here. Generally - and I mean really in general - I never have one core VMS anymore. Even the smallest machine has 2 cores. Makes thigns a lot faster even on windows level (OS operations ahppen on the second core).
10gb is tiny today. Still there is no database crappy programming can not kill (and it is likely in your case that is a lot of crappy programming going on, from your explanations). Start a full analysis of why things are waiting. If they are just hitting he server with a lot of sequential SQL for any operation the only thing you can do is make sure (a) you have as little waits as possible and (b) you have as fast a CPU as possible. In a sdatabase like you describe it the problem is seriously in the program - and basically there is only so much you can tune down at the database level.
If not already, have your data and log files on seperate drives. You can also move your tempdb to it's own drive, and also split it into multiple files. Read Brent's piece on tempdb here: Brent Ozar
I suggest you to use Glenn Berry's script to determine troubles in your server:
https://dl.dropboxusercontent.com/u/13748067/SQL%20Server%202005%20Diagnostic%20Information%20Queries(September%202014).sql
There are many another potential problems, not only missing indexes.
I was used this script as knowledge database to create my own tool to check my ERP health. And I can tell you it works well.

Would it ever be wise to have a SQL server per web server?

I'm wondering if, under the circumstances that
You get lots more reads than writes
Your SQL server of choice is cheap/free and offers a fast mirroring/replication service
Your database isn't insanely large
rather than having separate SQL servers it would be better to have an instance of SQL on each machine getting instant updates from the master. This way there would be no network latency when doing all the read queries, but there would be a per box performance hit as the SQL instance has to execute. Would this be better overall for performance? Are there any other pros/cons that might come up?
Your SQL Server should always be on a different box to the webserver, of that there is no question.
How many DB servers and webservers you have, and how they mirror (or otherwise) is up to how you scale your application.
You have SQL Server on a different machine because it needs (and deserves) a lot of RAM.
It's quite a common architectural pattern to have read-only replicas of a database. We accept some degree of stalesness in them, perhaps they are even only updated once a day.
The general rule will be that multiple copies will introduce complexity in terms of operations and management and tend to introduce the possibilities of inconsistency of data - almost inevitably the copies will not be perfectly is step (or the costs of making them soo will be too high.)
An example: what happens if your replication processing breaks a bit. So that some, but not all copies become stale. Now your users start to see radically different views of the world. How much might that matter to you? If it's a site with low value data (eg. celebrity sightings in London suberbs) then perhaps that's fine. If it's on hand inventory, and being out of date means that your customers can't place orders, then maybe you care rather more.
My advice: things that sound simple at a boxed on paper sort of level don't always work out that way when you're sitting in an operations room at 3AM. Be very sure that you can easily operate your solution.
How would your SQL Server be cheap/free? I should have said the licensing costs for this setup would be crippling. At retail prices you're looking at $6000 per server. See also Jeff's comments about costs. Scale out the web servers by all means, but not your SQL Server until it's pretty much on its' knees.
You might instead want to think about a distributed cache like Velocity or NCache.
Either way, run your site first with one SQL server and see how it copes with the load, then think about mirroring/replication across servers, otherwise you're just optimising prematurely. Measure first!
An immediate con is that there is no distributed lock co-ordinator in SQL Server so you can get merge conflicts as updates can change the same row on two different servers at the same time.
Depending on the size of the database and the disks in the web servers, you will find your network latency is smaller than the disk latency you will start suffering as the web server disks will not usually be as performant as the disk array you give to the database. If you wanted that kind of performance, you would be buying it per web server.
Replication performance is not without latency either, the distribution of the transactions isn't 'free' and careful maintenance of the transaction log would have to be planned to ensure you did not get log fragmentation (too many vlog's wthin the transaction log) which kills replication performance.

Best database engine for huge datasets

I do datamining and my work involves loading and unloading +1GB database dump files into MySQL. I am wondering is there any other free database engine that works better than MySQL on huge databases? is PostgreSQL better in terms of performance?
I only use basic SQL commands so speed is the only factor for me to choose a database
It is unlikely that substituting a different database engine will provide a huge increase in performance. The slow down you mention is more likely to be related to your schema design and data access patterns. Maybe you could provide some more information about that? For example, is the data stored as a time series? Are records written once sequentially or inserted / updated / deleted arbitrarily?
As long as you drop indexes before inserting huge data, should not be much difference between those two.
HDF is the storage choice of NASA's Earth Observing System, for instance. It's not exactly a database in the traditional sense, and it has its own quirks, but in terms of pure performance it's hard to beat.
If your datamining tool will support it, consider working from flat file sources. This should save most of your import/export operations. It does have some caveats, though:
You may need to get proficient with a scripting language like Perl or Python to do the data munging (assuming you're not already familiar with one).
You may need to expand the memory on your computer or go to a 64-bit platform if you need more memory.
Your data mining tool might not support working from flat data files in this manner, in which case you're buggered.
Modern disks - even SATA ones - will pull 100MB/sec or so off the disk in sequential reads. This means that something could inhale a 1GB file fairly quickly.
Alternatively, you could try getting SSDs on your machine and see if that improves the performance of your DBMS.
If you are doing datamining, perhaps you could use a document-oriented database.
These are faster than relational databases if you do not use my SQL.
MongoDB
and CouchDB are both good options. I prefer MongoDB because I don't know Java, and found CouchDB easier to get up and running.
Here are some articles on the topic:
Why we migrated from MySQL to MongoDB
MySQL vs. CouchDB vs. MongoDB
I am using PostgreSQL with my current project and also have to dump/restore databases pretty often. It takes less than 20 mins to restore 400Mb compressed dump.
You may give it a try, although some server configuration parameters need to be tweaked to comply with your hardware configuration. These parameters include, but not limited to:
shared_buffers
work_mem
temp_buffers
maintenance_work_mem
commit_delay
effective_cache_size
Your question is too ambiguous to answer usefully. "Performance" means many different things to different people. I can comment on how MySQL and PostgreSQL compare in a few areas that might be important, but without information it's hard to say which of these actually matter to you. I've written up a bunch more background information on this topic at Why PostgreSQL Instead of MySQL: Comparing Reliability and Speed. Which is faster certainly depends on what you're doing.
Is the problem that loading data into the database is too slow? That's one area that PostgreSQL doesn't do particularly well at, the COPY command in Postgres is not a particularly fastest bulk-loading mechanism.
Is the problem that queries run too slowly? Is so, how complicated are they? On complicated queries, the PostgreSQL optimizer can do a better job than the one in SQL, particularly if there are many table joins involved. Small, simple queries tend to run faster in MySQL because it isn't doing as much thinking about how to execute the query before beginning; smarter execution costs a bit of overhead.
How many clients are involved? MySQL can do a good job with a small number of clients, at higher client counts the locking mechanism in PostgreSQL might do a better job.
Do you care about transactional integrity? If not, it's easier to turn more of those features off in MySQL, which gives it a significant speed advantage compared to PostgreSQL.

What database to use for big data storage and manipulation?

I have to make a decision of which database server to use for my next project, but the simple decision to use MySQL like almost all the projects I did is harder now, because I expect very much records.
The database will store a user list, some other irrelevant tables, and the last one, some user-collected data. Let's say, if I have 6000 users responding to a quiz about each other. Simple math shows that from those users, if each one completes the quiz about everyone (and in my project that is 99% sure that will happen) I'll end up with 35.99million records(they will exclude themselves and in this particular situation the operation is 6000*5999). Unfortunately 6000 maybe is a small number, the real one growing day by day.
What to choose? MySQL and maybe if things go well and the project grows to expand it in a cluster? PostgreSQL, MSSQL? Oracle?
I've read about all of them, each one has it's pros and cons, but still don't know what to choose. The advantage of MySQL and PostgreSQL is of course, the starting price of $0 which is pretty nice in a usual self-funded startup.
Any opinions, pieces of advice? If you encountered this situation in your experience as developers, I'd love to hear from you.
These days, free isn't something that differenciates between databases any more. Both Oracle and SQL Server have free versions, but the limitations is resources - 4 GB database, RAM & single CPU utilization. Millions of records is not a concern - it's what datatypes you're using.
I saw the OPs comment about not liking MS software - that's your prerogative, but using the free versions of either Oracle or SQL Server do benefit from seamless transition to upscale versions of the respective database.
Personally, my choice would be either Oracle or SQL Server because of IMHO, real feature considerations like hierarchical query support, subquery factoring/CTE, packages (long before I get concerned with functions/procedures), full text searching, xml support, etc.
MySQL will handle 35 million records no problem. Worry about scalability when you get there. You can easily add raid hard disks backing your database tables, and if you really start getting big you can get a compellant SAN that will scream... Don't worry about the DB engine as much as the underlying hardware.. MySQL rocks for us with millions of records.
I've had no problems handling tables as large as 36,000,000 rows on MySQL and Oracle.
Just be sure that you index the proper columns, run EXPLAINs for your queries, and maintain proper design principles.
Most of the truly large scale web properties use a distributed key-value store. That said, 35 million is large, but not that large. With most modern databases, your main two scaling worries should be throughput and what happens when no single box can contain your entire database anymore. And both of these problems can be solved to some degree for any database you choose to use. (Caching, replication, sharding, etc.)
Use MySQL until you can't anymore. At that point, you ought to be rolling in dough anyways and you now have a very desirable problem.
Use MySQL as it's free and you have experience with it.
Besides in my opinion it matters more on how you design the tables than which database you use.
35 million records can be easily handled by MS SQL Server (assuming proper database design, indices, etc.). You can start with the free SQL Server Express edition and later, if you need, you can upgrade to the full version which supports clustering, etc.
SQL Server Express does have some limitations - single CPU, 1 GB memory, max 4 GB database size and a few other things. I'm not sure how quickly these limitations will become a problem but you can always move to the full version when you run into them.
MySQL(i) & Postgre
0$ of costs
large community
many tutorials
well documentated
MSSQL
You can get "money" from MS if you promote that you are using MSSQL (secret information from some companies I worked for)
MS tools work very well
Complete tool set from C# IDE over .NET lib to Windows Server 2003
Oracle
Professional and commercial provider
Used by many large companies (I also heard about Blizzard (World of Warcraft) using Oracle)
- expensive
The final decision depends on the very special requirements of your project.
Make yourself a quick list of things , that ARE IMPORTANT for your project (e.g. quick performed queries) and look up which Database pros are matching the most to your requirements.
Everything is about design. SQL Database are some kind of cars, you just have to know which component has to be placed here and which there.
Make a clear design and you won't struggle with any of them.
May be you can test Firebird
Blog post about big Firebird database here
MySQL licence is here (not allways free).
Postgresql and Firebird are free.
First of all, don't think about performance. Premature optimization being the root of all evil and all that. You can always throw more hardware and/or tuning at it later.
All of the mentioned should perform nicely if tuned/maintained correctly. I'd focus on manageability and familiarity. IMHO open source databases excels on manageability (perhaps not the best GUIs, but the CLI has been my home for a long long time).
And if the database becomes the bottleneck, why limit yourself to those choices? How about a key-value distributed database? Or perhaps serialize data directly to disk? Storing data outside of a RDBMS, while often frowned upon, might be the correct path. Or simply use the common route of denormalization.
Always remember not to optimize prematurely.
As far as opinions go (since you specifically asked for it) I favor open source databases, specifically PostgreSQL. It's rock solid, fast and very well-featured. And even with (relatively) large datasets it has performed superbly on mediocre hardware (some tuning involved, of course, but you can't skip that step no matter which db you end up choosing).

RDBMS data-relation burden

Our in-house system is built on SQL Server 2008 with a 40-table 6NF schema. Most of the tables FK to 3 others, a key few as many as 7. The system will ultimately support 100s of employees working with 10s of 1000s of customers and store 100s of 1000s of transactional records -- prime-time access should peak at 1000 rows per second.
Is there any reason to think that this depth of RDBMS inter-relation would overburden a system built using modern hardware with ample RAM? I'm attempting to evaluate whether we need to adjust our design or project direction/goals before we approach the final development phase (in a couple of months).
In SQl Server terms what you describe is a smallish database. With correct design SQL Server can handle terrabytes of data.
This is not to guarantee that your current design may perform well. There are many ways to construct poorly performing t-SQL and many bad database design choices.
If I were you I would load test data to twice the size you expect the tables to have and then start testing your code. Load testing might also be a good idea. It is far easier to fix database performance problems before they go to production. Far, far easier!

Resources