I'm looking for a DB solution for a high performance application.
The database will need to be local and stored in RAM for performance and will be several GB in size.
It will be local to the application, but it may be accessed by multiple processes running on the machine (up to 40). The data in the DB is immutable once it's been inserted and I only need a basic key value store rather than anything relational.
The obvious candidates are Memcached and Redis, but I believe they both have limitations with overhead and bottlenecks from the network component.
Something like Berkeley DB would also appear to be ideal, but it's only single process as far as I can see.
Throughput is the most important consideration (more so than latency).
Hi, as per the link mentioned below, spilling is described as: "When Snowflake cannot fit an operation in memory, it starts spilling data first to disk, and then to remote storage."
Part #1
- "cannot fit an operation in memory": does that mean the warehouse's memory is too small to handle the workload, and the queries end up in a queued state?
- What operations could cause this other than joins?
Part #2
- "it starts spilling data first to disk, and then to remote storage": what does "disk" refer to in this context, given that a warehouse is just a compute unit with no disk in it?
- Does this mean that data which can't fit in warehouse memory spills into the storage layer?
- What is referred to as "remote storage"? Does that mean an internal stage?
Please help me understand disk spilling in Snowflake.
https://community.snowflake.com/s/article/Recognizing-Disk-Spilling
Memory is the compute server's memory (fastest to access), local storage is the EBS volume attached to the EC2 instance, and remote storage is S3 (slowest to access).
This spilling can have a profound effect on query performance (especially if remote disk is used for spilling). To alleviate it, it's recommended to:
- use a larger warehouse (effectively increasing the available memory/local disk space for the operation), and/or
- process the data in smaller batches.
Docs Reference: https://docs.snowflake.com/en/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
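If it helps, here's a rough sketch for finding queries that spilled, assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE share (the view lags a bit behind real time):

    -- Recent queries that spilled, and how much went to local vs. remote storage
    SELECT query_id,
           warehouse_name,
           warehouse_size,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND (bytes_spilled_to_local_storage > 0
           OR bytes_spilled_to_remote_storage > 0)
    ORDER BY bytes_spilled_to_remote_storage DESC;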
Yes, remote spilling is to S3 (local is the local instance cache) - and generally, by the time things come to remote spilling the situation is quite bad and the query's performance is suffering.
Other than rewriting the query, you can always try running it on a bigger warehouse as mentioned in the docs - it will have more cache of its own and spilling should reduce noticeably.
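For example (the warehouse name is hypothetical; the resize is just standard ALTER WAREHOUSE syntax):

    -- Bump the warehouse up for the heavy query, then scale it back down
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
    -- ... run the spilling query here ...
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';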
Variations of JOIN (like FLATTEN, which creates more rows) can cause this, as can aggregation operations like COUNT DISTINCT.
Just yesterday I was doing some COUNT DISTINCTs over two years' worth of data, with monthly aggregation, and it was spilling to both local and remote storage.
I realized I was doing COUNT(DISTINCT column1, column2) when I wanted COUNT(*), as all those pairs of values were already distinct, and that stopped the remote spill. To avoid some/most of the local spill I split my SQL into batches of one year each (the data was clustered on time, so the reads were not wasteful) and inserted the result sets into a table. Lastly, I ran the batches on an extra-large warehouse instead of a medium.
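As an illustration only (table, column, and date values below are made up, not my real schema), the batching looked roughly like this:

    -- One INSERT per one-year batch instead of a single two-year query,
    -- so each batch fits in warehouse memory
    CREATE TABLE IF NOT EXISTS monthly_counts (month DATE, cnt NUMBER);

    INSERT INTO monthly_counts
    SELECT DATE_TRUNC('month', event_time) AS month, COUNT(*) AS cnt
    FROM events
    WHERE event_time >= '2021-01-01' AND event_time < '2022-01-01'
    GROUP BY 1;

    -- ...then the same statement again for the second year's range.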
I do not know the exact answer as to where the local/remote disk is, but many EC2 instance types come with local disk, so it's possible they use those; otherwise it would likely be EBS. I believe remote is S3.
But the moral of the story is: just like a PC using swap memory, it's nice that the operation doesn't instantly fail, but most of the time you'd be better off if it did, because how long it's going to take is painful.
I'm finishing a web app that consumes a lot of data. Currently there are 150 tables per DB (many tables with 1,000 or fewer records, some with 10,000 or fewer, and just a few with 100,000 or fewer), with a database per customer. I'm setting up a server with 64 GB RAM, a 1 TB NVMe SSD, a 4-core Xeon at 4 GHz, and 1 GB bandwidth. I'm planning to run 10 instances of MariaDB with 6 GB RAM each, leaving the rest for the OS (Ubuntu 18 64-bit), and putting 10-15 databases per instance.
Do you think this could be a good approach for the project?
Although 15000 tables is a lot, I see no advantage in setting up multiple instances on a single machine. Here's the rationale:
Unless you have lousy indexes, even a single instance will not chew up all the CPUs.
Having 10 instances means the ancillary data (tables, caches, other overhead) will be repeated 10 times.
When one instance is busy, it cannot rob RAM from the other instances.
If you do a table scan on 100K rows in a 6 GB instance (with perhaps 4 GB for innodb_buffer_pool_size), it may push the other users of that instance out of the buffer pool. Yeah, there are DBA things you could do, but remember, you're not a DBA.
150 tables? Could it be that there is some "over-normalization"?
I am actually a bit surprised that you are worried about 150 tables where the max size of some tables is 100k or fewer rows.
That certainly does not look like much of a problem for the modern hardware you are planning to use. I will, however, suggest allocating more RAM and using a large enough InnoDB buffer pool size; 70% of your total RAM is the number suggested by MariaDB.
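As a rough sketch only (the 45G figure assumes a dedicated 64 GB box at ~70%; on older MariaDB versions the variable isn't resizable at runtime, so set it in my.cnf/server.cnf and restart instead):

    -- ~70% of 64 GB for the InnoDB buffer pool (MariaDB 10.2+ can resize online)
    SET GLOBAL innodb_buffer_pool_size = 45 * 1024 * 1024 * 1024;
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';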
I have seen clients running MariaDB with much larger data sets and a lot more tables + very high concurrency!
Anyway, the best way to proceed is, of course, to test during your UAT cycles with data loads similar to production. I am still sure it's not going to be a problem.
If you can provide a master/slave setup and use MariaDB MaxScale on top, MaxScale can give you automatic db server failover, connection handling, and transaction replay, which gives you the highest level of availability; it also takes care of load balancing with its cool "Read/Write" splitting service. Your servers' load will be balanced, for an overall smooth experience :)
Not too sure if the above is something you have planned for in your design, but just putting it here for a second look, as your application will certainly benefit.
Hope this helps.
Cheers,
Faisal.
I've read quite a bit about SQL Servers using SSDs performing much better than traditional hard drives. In load tests with my app in a test environment, though, I'm able to keep my test DB server (SQL 2005) pegged between 75% and 100% CPU usage without much of a strain on disk access (as far as I can tell). My data set is still pretty small; database backups are under 100 MB. The test server I'm using is not new, but is also no slouch.
So, my questions:
Is the CPU the bottleneck (as opposed to the storage) because the dataset is small and therefore fits in memory?
Will this change once the dataset grows so paging is necessary?
Approximately how big (as a percentage of system memory) does the dataset have to get before SQL Server starts paging? Or does that depend on a lot of other factors?
As the app and its dataset grows, are there other bottlenecks that will tend to crop up besides CPU, storage, and lack of proper indexes?
Yes
Yes
If you have SQL Server configured to use as much memory as it can get, probably when the dataset exceeds the available system memory. But what causes paging is very setup-dependent (the query being executed is the most prevalent cause); a quick memory-configuration sketch follows these answers.
I/O between the requesting machine and the server is the only other one I can think of, and that only matters if you are retrieving large datasets. I also would not group a lack of indexes as a bottleneck; rather, indexes enable better performance with regard to searching.
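If it's useful, a minimal sketch of capping SQL Server's memory so the OS keeps some headroom (the 12288 MB value is purely illustrative; tune it for your box):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- 'max server memory' is specified in MB
    EXEC sp_configure 'max server memory (MB)', 12288;
    RECONFIGURE;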
As long as the CPU is the bottleneck on your dedicated SQL-Server machine, you don't have to worry about disk speed (assuming nothing's wrong with the machine). SQL-Server WILL use heavy memory caching. SQL-Server has built-in strategies to perform best under a given load and available resources. Just don't worry about it!
I have a buddy who runs a web app for people listing cars for sale. There are a few thousand clients who use it, and each client has hundreds and sometimes thousands of rows in the database (some have been on it for 5 years with hundreds of cars selling each month, and tens of rows per sale for comments, messages, etc.). He has run this system in one SQL Server database on one physical server with about 20 GB of RAM and a couple of processors the whole time, with no problems. Is this some sort of miracle?
Just like most programmers, I'm no DBA and just get by, thanks to ORMs, etc. Everywhere I look, people talk about the need to shard or get a separate database server for big users of a web app. Why is this? Is it really that inefficient to have a large DB with lots of rows? Should I plan to use Cassandra or something, or can I rely on scaling up well with Postgres?
I personally don't think what you've described is that large of a database. The server (20 gigs of ram? ;)) sounds decent. It's more about usage and design. If the database is indexed and well designed, it can grow much, much larger on the current hardware.
Before doing any sort of switch, I'd simply look at archiving useless data and optimising queries if there's a fear of performance issues.
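As a sketch of what I mean (table, column, and index names here are hypothetical, not from your schema):

    -- Index the columns the slow listing queries actually filter/sort on
    CREATE INDEX ix_listings_client_created ON listings (client_id, created_at);

    -- Move stale rows to an archive table with the same structure
    INSERT INTO listings_archive SELECT * FROM listings WHERE created_at < '2009-01-01';
    DELETE FROM listings WHERE created_at < '2009-01-01';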
The reason for sharding and separate db servers is that at some point it's going to be cheaper to use multiple cheaper machines than one expensive one. Hardware price doesn't scale linearly with performance and once you reach a certain point it'll be much cheaper to get twice as many machines as to get a machine that's twice as fast.
You should have no problem with SQL Server, Oracle, or any modern relational or non-relational database. I have administered databases with hundreds of millions of records and terabytes of data.
Typically you split components up across different servers so you can manage up time, resilience, and performance more easily.
It's certainly quite possible to have one monster machine which does it all, but then you may need another monster machine in case your motherboard dies, or your datacenter is unavailable.
By splitting a web site or application up among different servers, it's easier to get cheaper machines, and more of them.
Thus you can build in resilience, and avoid components with similar demands on hardware clashing.
It's also important to think about restore times for servers, and recovery plans.
What happens when your machine dies, can you replace it in the agreed upon time? Can you restore from backups in that time?
SQL Server or other enterprise-class databases shouldn't have any problems with databases in the tens of GB or 100 GB range, as long as they're not designed too badly. (We have a few machines with that capacity/usage which aren't struggling at all.)
In my mind that's nothing. Having tens of millions of rows on multiple tables with database size exceeding 10 GB has not caused problems for MS SQL Server. Of course it is not too fast with that much data, but otherwise it works just fine.
And to answer the question, too big is so big it does cause problems. And when it starts causing problems depends on the table structure and your performance demands.
Databases are extremely efficient at storing and retrieving relational data (i.e. data that is structured and has references to other data) - that's what they're designed to do. Honestly, 99% of the people spewing about key-value stores and Cassandra and whatnot have no clue what they're doing. A database server is just fine for storing large volumes of data, particularly if you're willing to put a bit of work into tuning it properly.
That said, there are use cases for Cassandra et al. - if you have mostly unstructured key/value data, or don't need consistency, or want to shard for redundancy, it may be worth investigating.
Unless you're an extremely popular website, you probably can get by just fine with a decent database server - don't switch until you've determined why you need to switch. Switching is fine, just make sure you are switching because it serves your needs better, and not because it's the "cool web-scale thing to do".
I want to put my SQL Server database files on an Intel SS4000-E storage device. It's a NAS. Would it be possible to use it as storage for SQL Server 2000? If not, what is the best solution?
I strongly recommend against it.
Put your data files locally on the server itself, with RAID mirrored drives. The reasons are twofold:
SQL Server will run much faster for all but the smallest workloads
SQL Server will be much less prone to corruption in case the link to the NAS gets broken.
Use the NAS to store backups of your SQL Server, not to host your datafiles. I don't know what your database size will be, or what your usage pattern will be, so I can't tell you what you MUST have. At a minimum for a database that's going to take any significant load in a production environment, I would recommend two logical drives (one for data, one for your transaction log), each consisting of a RAID 1 array of the fastest drives you can stomach to buy. If that's overkill, put your database on just two physical drives, (one for the transaction log, and one for data). If even THAT is over budget, put your data on a single drive, back up often. But if you choose the single-drive or NAS solution, IMO you are putting your faith in the Power of Prayer (which may not be a bad thing, it just isn't that effective when designing databases).
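To make that concrete, a sketch in T-SQL (the database name and drive letters are illustrative; the point is data and log files on separate mirrored arrays):

    CREATE DATABASE SalesDB
    ON PRIMARY
        (NAME = SalesDB_data, FILENAME = 'D:\SQLData\SalesDB.mdf')
    LOG ON
        (NAME = SalesDB_log,  FILENAME = 'E:\SQLLogs\SalesDB.ldf');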
Note that a NAS is not the same thing as a SAN (on which people typically DO put database files). A NAS typically is much slower and has much less bandwidth than a SAN connection, which is designed for very high reliability, high speed, advanced management, and low latency. A NAS is geared more toward reducing your cost of network storage.
My gut reaction - I think you're mad risking your data on a NAS. SQL Server's expectation is continuous, low-latency, uninterrupted access to your storage subsystem. The NAS is almost certainly none of those things - use local or SAN storage (in that order of performance, simplicity, and therefore preference) - and leave the NAS for offline file storage/backups.
The following KB lists some of the constraints and issues you'd encounter trying to use a NAS with SQL - while the KB covers SQL 7 through 2005, a lot of the information still applies to SQL 2008 too.
http://support.microsoft.com/kb/304261
Local is almost always faster than networked storage.
Your SQL performance will depend on how your objects, files, and filegroups are defined, and on how consumers use the data.
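For example (names are hypothetical), a dedicated filegroup on its own volume for the hottest tables:

    ALTER DATABASE AppDB ADD FILEGROUP HotData;
    ALTER DATABASE AppDB
        ADD FILE (NAME = AppDB_hot, FILENAME = 'F:\SQLData\AppDB_hot.ndf')
        TO FILEGROUP HotData;

    -- Place a heavily used table on that filegroup
    CREATE TABLE dbo.Orders (OrderID INT PRIMARY KEY, CustomerID INT, Placed DATETIME)
        ON HotData;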
Well "best" means different things to different people, but I think "best" performance would be a TMS RAMSAN or a RAID of SSDs... etc
Best capacity would be achieved with a RAID of large HDDs...
Best reliability/data saftey would be achieved with Mirroring across many drives, and regular backups (off site preferably)...
Best availability... I don't know... maybe a clone the system and have a hot backup ready to go at all times.
Best security would require encryption, but mainly limiting physical access to the machine (and it's backups) is enough unless it's internet connected.
As the other answers point out, there will be a performance penalty here.
It is also worth mentioning that these things sometimes implement a RAM cache to improve I/O performance. If that is the case and you do trial this config, the NAS should be on the same power protection / UPS as the server hardware; otherwise, in case of a power outage, the NAS may 'lose' the part of the file held in cache. Ouch!
It can work but a dedicated fiber attached SAN will be better.
Local will usually be faster but it has limited size and won't scale easily.
I'm not familiar with the hardware but we initially deployed a warehouse on a shared NAS. Here's what we found.
We were regularly competing for resources on the head unit -- there was only so much bandwidth that it could handle. Massive warehouse queries and data loads were severely impacted.
We needed 1.5 TB for our warehouse (data/indexes/logs), and we put each of these onto a separate set of LUNs (like you might do with attached storage), with the data spanning just 10 disks. We ran into all sorts of I/O bottlenecks with this. The better solution was to create one big partition across lots of small disks and store data, indexes, and logs all in the same place. This sped things up considerably.
If you're dealing with a moderately used OLTP system, you might be fine but a NAS can be troublesome.