I'm finishing a web app that consumes a lot of data. Currently there are 150 tables per db, many with 1,000 or fewer records, some with 10,000 or fewer, and just a few with 100,000 or fewer, and there is one database per customer. I'm setting up a server with 64GB RAM, a 1 TB NVMe SSD, a 4-core Xeon at 4GHz and 1GB bandwidth. I'm planning to run 10 instances of MariaDB with 6GB RAM each, leaving the rest for the OS (Ubuntu 18 64-bit), and to put 10-15 databases on each instance.
Do you think this could be a good approach for the project?
Although 15000 tables is a lot, I see no advantage in setting up multiple instances on a single machine. Here's the rationale:
Unless you have lousy indexes, even a single instance will not chew up all the CPUs.
Having 10 instances means the ancillary data (tables, caches, other overhead) will be repeated 10 times.
When one instance is busy, it cannot rob RAM from the other instances.
If you do a table scan on 100K rows in a 6G instance (perhaps 4G for innodb_buffer_pool_size), it may push the other users in that instance out of the buffer_pool. Yeah, there are DBA things you could do, but remember, you're not a DBA.
150 tables? Could it be that there is some "over-normalization"?
I am actually a bit surprised that you are worried about 150 tables where the max size of some tables is 100k or fewer rows?
That certainly does not look like much of a problem for the modern hardware that you are planning to use. I would, however, suggest allocating more RAM and using a large enough InnoDB buffer pool; 70% of your total RAM is the number suggested by MariaDB.
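To put rough numbers on the buffer pool advice, here is a minimal sketch. It assumes the figures from the thread (64GB server, 6GB per instance in the 10-instance plan) and applies the 70% rule of thumb to both layouts. The function name is made up for illustration, and the real `innodb_buffer_pool_size` should still be settled by testing.

```python
# Rough buffer pool arithmetic: one big instance vs. ten small ones.
# All figures come from the thread; the 70% fraction is the rule of
# thumb quoted above, not a measured optimum.

TOTAL_RAM_GB = 64            # server RAM from the question
RAM_PER_INSTANCE_GB = 6      # planned RAM for each of the 10 instances
BUFFER_POOL_FRACTION = 0.70  # suggested share of RAM for the buffer pool


def buffer_pool_gb(ram_gb, fraction=BUFFER_POOL_FRACTION):
    """Return the buffer pool size in GB for a given amount of RAM."""
    return ram_gb * fraction


single_pool = buffer_pool_gb(TOTAL_RAM_GB)        # one shared pool
split_pool = buffer_pool_gb(RAM_PER_INSTANCE_GB)  # pool per small instance

print(f"single instance pool: {single_pool:.1f} GB")
print(f"per-instance pool:    {split_pool:.1f} GB")
```

With one instance, a busy customer can use most of the shared pool when the others are idle. With ten instances, each customer's working set is capped at its own instance's small pool no matter how idle the rest of the machine is.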
I have seen clients running MariaDB with much larger data sets and a lot more tables + very high concurrency!
Anyway, the best way to proceed is, of course, to test during your UAT cycles with data loads similar to production. I am still sure it's not going to be a problem.
If you can provide a Master/Slave setup, consider putting MariaDB MaxScale on top. MaxScale can provide automatic server failover, connection failover and transaction replay, which gives you the highest level of availability, and it also takes care of load balancing with its cool "Read/Write" splitting service. Your servers' load will be balanced and the overall experience will be smooth :)
I'm not sure whether you have planned for the above in your design, but I'm putting it here for a second look, as your application will certainly benefit from it.
Hope this helps.
Cheers,
Faisal.
Related
I'm looking for a DB solution for a high performance application.
The database will need to be local and stored in RAM for performance and will be several GB in size.
It will be local to the application, but it may be accessed by multiple processes running on the machine (up to 40). The data in the DB is immutable once it's been inserted and I only need a basic key value store rather than anything relational.
The obvious candidates are Memcached and Redis, but I believe they both have limitations with overhead and bottlenecks from the network component.
Something like Berkeley DB would also appear to be ideal, but it's only single process as far as I can see.
Throughput is the most important consideration (more so than latency).
I'm working here in a small company and one of my jobs is the administration of the ERP system 'AXAVIA' (www.axavia.com)
There are .NET Clients and a MSSQL Server 2005 Database with a size of about 10GB.
The system works on a metadata model, this means they have very few tables (one for each datatype and some for the relations) and this data is computed with adhoc queries. Up to 2000 batches / sec...
I guess they don't really have a database specialist, because they didn't know anything about index fragmentation, and I have already deleted a lot of unused indexes - now the db is about 30% smaller...
What else can i do for more performance?
- I now rebuild the indexes every night
I think, there are no 'missing indexes' and also the primary keys are at least 'ok'
The filesystem is on a fast RAID 10 - and with 6.6 GB RAM there is very little IO
The server is a VMware VM with one virtual CPU - here, I guess, is the best opportunity: the huge amount of small batches would benefit from a physical CPU with 4 cores?!
I'm also thinking about partitioned tables, but at the moment the database isn't big enough to benefit much from this.
So - any other ideas?
Add a CPU, at least for a test. I would say you are likely running into a problem here. Generally - and I mean really in general - I never have one-core VMs anymore. Even the smallest machine has 2 cores. It makes things a lot faster even at the Windows level (OS operations happen on the second core).
10GB is tiny today. Still, there is no database that crappy programming cannot kill (and from your explanations it is likely that a lot of crappy programming is going on in your case). Start a full analysis of why things are waiting. If they are just hitting the server with a lot of sequential SQL for every operation, the only thing you can do is make sure (a) you have as few waits as possible and (b) you have as fast a CPU as possible. In a database like the one you describe, the problem is seriously in the program - and basically there is only so much you can tune at the database level.
If you haven't already, put your data and log files on separate drives. You can also move your tempdb to its own drive, and also split it into multiple files. Read Brent's piece on tempdb here: Brent Ozar
I suggest you to use Glenn Berry's script to determine troubles in your server:
https://dl.dropboxusercontent.com/u/13748067/SQL%20Server%202005%20Diagnostic%20Information%20Queries(September%202014).sql
There are many other potential problems, not only missing indexes.
I used this script as a knowledge base to create my own tool for checking my ERP's health. And I can tell you it works well.
I have a buddy who runs a web app for people listing cars for sale. There are a few thousand clients who use it, and each client has hundreds and sometimes thousands of rows in the database (some have been on for 5 years with hundreds of cars selling each month, and 10s of rows per sale (comments, messages, etc)). He has run this system in one SQL Server database on one physical server with like 20GB of RAM and a couple of processors for the whole time, with no problems. Is this some sort of miracle?
Just like most programmers, I'm no DBA and just get by, thanks to ORMs, etc. Everywhere I look, people talk about needing to shard or get a separate database server for big users of a web app. Why is this? Is it really that inefficient to have a large DB with lots of rows? Should I plan to use Cassandra or something, or can I rely on scaling up well with Postgres?
I personally don't think what you've described is that large of a database. The server (20 gigs of ram? ;)) sounds decent. It's more about usage and design. If the database is indexed and well designed, it can grow much, much larger on the current hardware.
Before doing any sort of switch, I'd simply look at archiving useless data and optimising queries if there's a fear of performance issues.
The reason for sharding and separate db servers is that at some point it's going to be cheaper to use multiple cheaper machines than one expensive one. Hardware price doesn't scale linearly with performance and once you reach a certain point it'll be much cheaper to get twice as many machines as to get a machine that's twice as fast.
You should have no problem in SQL server, Oracle, or any modern relational or non-relational database. I have administered databases with 100's of millions of records and Terabytes of data.
Typically you split components up across different servers so you can manage up time, resilience, and performance more easily.
It's certainly quite possible to have one monster machine which does it all, but then you may need another monster machine in case your motherboard dies, or your datacenter is unavailable.
By splitting a web site or application up amongst different servers, it's easier to get cheaper machines, and more of them.
Thus you can build in resilience, and avoid having components with similar demands on hardware clashing.
It's also important to think about restore times for servers, and recovery plans.
What happens when your machine dies, can you replace it in the agreed upon time? Can you restore from backups in that time?
SQL Server or other enterprise class databases shouldn't have any problems with databases of 10s or 100s of GB, as long as they are not designed too badly. (We have a few machines with that capacity/use which aren't struggling at all.)
In my mind that's nothing. Having tens of millions of rows on multiple tables with database size exceeding 10 GB has not caused problems for MS SQL Server. Of course it is not too fast with that much data, but otherwise it works just fine.
And to answer the question, too big is so big it does cause problems. And when it starts causing problems depends on the table structure and your performance demands.
Databases are extremely efficient at storing and retrieving relational data (i.e. data that is structured and has references to other data) - that's what they're designed to do. Honestly, 99% of the people spewing about key-value stores and Cassandra and whatnot have no clue what they're doing. A database server is just fine for storing large volumes of data, particularly if you're willing to put a bit of work into tuning it properly.
That said, there are use cases for Cassandra et al. - if you have mostly unstructured key/value data or don't need consistency or want to shard for redundancy, it may be worth investigating.
Unless you're an extremely popular website, you probably can get by just fine with a decent database server - don't switch until you've determined why you need to switch. Switching is fine, just make sure you are switching because it serves your needs better, and not because it's the "cool web-scale thing to do"
I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8 hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, eg. "How many transactions of type A did each user run this month?" (could be more complex, but that's the general idea).
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar process at present (a web domain crawling database) with the same significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A single database server can support millions of writes a day; what matters is the amount of CPU used per "write" and the speed at which the writes can be committed.
(Network performance also often is an early bottleneck)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up-to-date the summaries need to be, and this is critical. I would try to push back from "realtime" and suggest overnight (or if you can get away with it monthly) summary calculations.
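A minimal sketch of the overnight summary idea, using Python's built-in `sqlite3` as a stand-in for whichever database is chosen. The table and column names (`tx_log`, `daily_summary`, `tx_day` and so on) are invented for illustration. The point is only that the monthly billing question is answered from small per-day rollups instead of a scan of the raw log.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a raw log and a rollup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tx_log (user_id INTEGER, tx_type TEXT, tx_day TEXT);
        CREATE TABLE daily_summary (
            tx_day TEXT, user_id INTEGER, tx_type TEXT, n INTEGER,
            PRIMARY KEY (tx_day, user_id, tx_type)
        );
    """)
    return conn


def rollup_day(conn, day):
    """Overnight job: recompute the summary rows for one day."""
    conn.execute("DELETE FROM daily_summary WHERE tx_day = ?", (day,))
    conn.execute("""
        INSERT INTO daily_summary (tx_day, user_id, tx_type, n)
        SELECT tx_day, user_id, tx_type, COUNT(*)
        FROM tx_log
        WHERE tx_day = ?
        GROUP BY tx_day, user_id, tx_type
    """, (day,))
    conn.commit()


def monthly_counts(conn, month, tx_type):
    """Per-user counts for one month, read from the rollups only."""
    rows = conn.execute("""
        SELECT user_id, SUM(n)
        FROM daily_summary
        WHERE tx_day LIKE ? AND tx_type = ?
        GROUP BY user_id
        ORDER BY user_id
    """, (month + "-%", tx_type))
    return dict(rows.fetchall())


conn = build_demo()
conn.executemany("INSERT INTO tx_log VALUES (?, ?, ?)", [
    (1, "A", "2024-01-01"),
    (1, "A", "2024-01-02"),
    (2, "A", "2024-01-02"),
    (2, "B", "2024-01-02"),
    (1, "A", "2024-02-01"),
])
for day in ("2024-01-01", "2024-01-02", "2024-02-01"):
    rollup_day(conn, day)

january_a = monthly_counts(conn, "2024-01", "A")
print(january_a)
```

Because each day is rolled up independently, a failed night can simply be rerun, and in a partitioned setup old raw partitions could be moved to cheaper storage once summarised.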
Finally, we're using a 2 CPU 4GB RAM Windows 2003 virtual SQL Server 2005 and a single CPU 1GB RAM IIS Webserver as our test system and we can ingest 20 million records in a 10 hour period (and the storage is RAID 5 on a shared SAN). We get ingest rates up to 160 records per second batched in blocks of 40 records per network round trip.
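The batching mentioned above can be sketched like this. It is not the poster's actual code: `sqlite3` stands in for the real server, one `executemany` call stands in for one network round trip, and the `crawl_log` table and the helper names are made up for illustration. Only the block size of 40 comes from the answer.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 40  # block size quoted in the answer


def batched(records, size=BATCH_SIZE):
    """Yield lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def ingest(conn, records, size=BATCH_SIZE):
    """Insert records block by block; return the number of round trips."""
    round_trips = 0
    for block in batched(records, size):
        with conn:  # one transaction (and one commit) per block
            conn.executemany("INSERT INTO crawl_log (url) VALUES (?)", block)
        round_trips += 1
    return round_trips


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_log (url TEXT)")

records = [(f"http://example.com/{i}",) for i in range(100)]
trips = ingest(conn, records)
stored = conn.execute("SELECT COUNT(*) FROM crawl_log").fetchone()[0]
print(trips, stored)
```

Committing once per block rather than once per row is usually what makes the difference, since the commit is where the write has to reach the log.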
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).
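The rate quoted above is easy to check. This small sketch redoes the arithmetic with the figures from the question; the 22 workdays per month used for the scan size is my own assumption, not something stated in the thread.

```python
# Sustained write rate implied by the question: 200 million rows per 8 hour day.
ROWS_PER_DAY = 200_000_000
HOURS_PER_DAY = 8
WORKDAYS_PER_MONTH = 22  # assumption for illustration only


def rows_per_second(rows, hours):
    """Average rows per second needed to write `rows` in `hours`."""
    return rows / (hours * 3600)


rate = rows_per_second(ROWS_PER_DAY, HOURS_PER_DAY)
rows_per_month = ROWS_PER_DAY * WORKDAYS_PER_MONTH

print(f"average write rate: {rate:,.0f} rows/s")
print(f"rows in a monthly scan: {rows_per_month:,}")
```

This is an average, so the peak rate the system has to absorb may be noticeably higher if the load is bursty.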
Greenplum or Teradata would be a good option. These databases are MPP and can handle peta-scale data. Greenplum is a distributed PostgreSQL db and also has its own mapreduce. Hadoop may solve your storage problem, but it wouldn't be helpful for performing summary queries on your data.
I want to access my sql server database files on an INTEL SS4000-E storage unit. It's a NAS storage. Would it be possible to use it as storage for sql server 2000? If not, what is the best solution?
I strongly recommend against it.
Put your data files locally on the server itself, with RAID mirrored drives. The reasons are twofold:
SQL Server will run much faster for all but the smallest workloads
SQL Server will be much less prone to corruption in case the link to the NAS gets broken.
Use the NAS to store backups of your SQL Server, not to host your datafiles. I don't know what your database size will be, or what your usage pattern will be, so I can't tell you what you MUST have. At a minimum for a database that's going to take any significant load in a production environment, I would recommend two logical drives (one for data, one for your transaction log), each consisting of a RAID 1 array of the fastest drives you can stomach to buy. If that's overkill, put your database on just two physical drives, (one for the transaction log, and one for data). If even THAT is over budget, put your data on a single drive, back up often. But if you choose the single-drive or NAS solution, IMO you are putting your faith in the Power of Prayer (which may not be a bad thing, it just isn't that effective when designing databases).
Note that a NAS is not the same thing as a SAN (on which people typically DO put database files). A NAS typically is much slower and has much less bandwidth than a SAN connection, which is designed for very high reliability, high speed, advanced management, and low latency. A NAS is geared more toward reducing your cost of network storage.
My gut reaction - I think you're mad risking your data on a NAS. SQL's expectation is continuous low-latency uninterrupted access to your storage subsystem. The NAS is almost certainly none of those things - you want local or SAN storage (in order of performance, simplicity and therefore preference) - leave the NAS for offline file storage/backups.
The following KB lists some of the constraints and issues you'd encounter trying to use a NAS with SQL - while the KB covers SQL 7 through 2005, a lot of the information still applies to SQL 2008 too.
http://support.microsoft.com/kb/304261
local is almost always faster than networked storage.
Your performance for sql will depend on how your objects, files, and filegroups are defined, and how consumers use the data.
Well "best" means different things to different people, but I think "best" performance would be a TMS RAMSAN or a RAID of SSDs... etc
Best capacity would be achieved with a RAID of large HDDs...
Best reliability/data safety would be achieved with mirroring across many drives, and regular backups (off site preferably)...
Best availability... I don't know... maybe clone the system and have a hot backup ready to go at all times.
Best security would require encryption, but mainly limiting physical access to the machine (and its backups) is enough unless it's internet connected.
As the other answers point out, there will be a performance penalty here.
It is also worth mentioning that these things sometimes implement a RAM cache to improve I/O performance. If that is the case and you do trial this config, the NAS should be on the same power protection / UPS as the server hardware, otherwise in case of a power outage the NAS may 'lose' the part of the file in cache. ouch!
It can work but a dedicated fiber attached SAN will be better.
Local will usually be faster but it has limited size and won't scale easily.
I'm not familiar with the hardware but we initially deployed a warehouse on a shared NAS. Here's what we found.
We were regularly competing for resources on the head unit -- there was only so much bandwidth that it could handle. Massive warehouse queries and data loads were severely impacted.
We needed 1.5 TB for our warehouse (data/indexes/logs), and we put each of these resources onto a separate set of LUNs (like you might do with attached storage). Data was spanning just 10 disks. We ran into all sorts of IO bottlenecks with this. The better solution was to create one big partition across lots of small disks and store data, indexes and logs all in the same place. This sped things up considerably.
If you're dealing with a moderately used OLTP system, you might be fine but a NAS can be troublesome.