OpenStack Swift deployment documentation says:
Swift’s disk usage pattern is the worst case possible for RAID, and performance degrades very quickly using RAID 5 or 6.
But I failed to find any elaboration or explanation of that. So, before I dig deep into the Swift source code, I'd like to ask the community:
what should the RAID-friendly "disk usage pattern" be?
what's so special about Swift's disk usage?
Swift has an almost entirely random I/O pattern because of its Ring data structure. In short, the Ring maps objects uniformly across all disks.
RAID 5 and RAID 6 perform very poorly under a heavy random-write workload, because every small write also has to read and rewrite parity.
The scenario is similar to a database. Databases such as MongoDB also spread their data fairly uniformly, and you will find that they do not recommend RAID 5 or RAID 6 either; only RAID 10 is recommended.
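To put rough numbers on the random-write argument, here is a minimal sketch. It assumes the commonly quoted write penalties (back-end I/Os per small random write) of 2 for RAID 10, 4 for RAID 5 and 6 for RAID 6; the disk count, per-disk IOPS and write ratio are made-up illustration values, not measurements from Swift.

```python
# Back-end I/Os needed per small random write (commonly quoted textbook
# values, used here as an assumption; real controllers and caches vary).
WRITE_PENALTY = {"RAID 0": 1, "RAID 10": 2, "RAID 5": 4, "RAID 6": 6}


def effective_iops(level, disks, iops_per_disk, write_fraction):
    """Estimate front-end IOPS for a mixed random read/write workload."""
    raw = disks * iops_per_disk
    penalty = WRITE_PENALTY[level]
    # Each read costs 1 back-end I/O, each write costs `penalty` I/Os.
    cost_per_request = (1 - write_fraction) + write_fraction * penalty
    return raw / cost_per_request


if __name__ == "__main__":
    # Hypothetical array: 8 disks at 150 IOPS each, 70% random writes.
    for level in ("RAID 10", "RAID 5", "RAID 6"):
        print(level, round(effective_iops(level, 8, 150, 0.7)))
```

With a write-only workload the same eight disks deliver half their raw IOPS under RAID 10, a quarter under RAID 5 and a sixth under RAID 6, which is the degradation the Swift documentation warns about.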
Why do you need RAID with Swift in the first place?
Swift natively uses XFS, and most operations are handled by its own placement algorithm, the Ring.
Alternatively, if you want to dig deeper into the Ring algorithm, my colleagues did a video deep dive on it.
Hope it helps,
Atul
what should the RAID-friendly "disk usage pattern" be?
People use RAID card for the following reasons:
1) protect from single drive failures (except for RAID 0)
2) gain higher I/O performance than single drives (RAID 5,6,10,50, etc, and write back cache etc. with BBU)
3) Use more drives than a motherboard can support with RAID/HBA cards
4) Some storage management features (GUI or command line tools)
what's so special about Swift's disk usage?
Swift disk I/O is:
1) mostly random on the A/C/O (account/container/object) servers
2) highly concurrent
3) amplified at least 6x for each object PUT (3 object writes plus 3 container updates), before counting replication, auditing and other background processes
OpenStack Swift is designed to use commodity servers and hard drives, meaning the lowest cost on reasonably good quality hardware, which often does not include RAID cards. However, you need a RAID or HBA card to use 8-10+ HDDs in a server if the motherboard cannot support the number of HDDs the chassis can hold. In practice, many deployments therefore use a RAID card with each HDD configured as a single-drive RAID 0, or use an HBA card.
You certainly can use RAID 5, 6 or 10 and give up some capacity to gain some protection and performance, but that often costs more than necessary. Swift already has a tunable replication factor, which defaults to 3x.
I have been monitoring the performance of an OLTP database (approx. 150GB); the average disk sec/read and average disk sec/write values are exceeding 20 ms over a 24hr period.
I need to arrive at a clear explanation as to why the business application has no influence over the 'less-than-stellar' performance on these counters. I also need to exert some pressure to have the storage folk re-examine their configuration as it applies to the placement of the mdf, ldf and tempdb files on their SAN. At present, my argument is shaky but I am pressing my point with people who don't understand the difference between IOPs and disk latency.
Beyond the limitations of physical hardware and the placement of data files across physical disks, is there anything else that would influence these counter values? For instance: the number of transactions per second, the size of the query, poorly written queries or missing indexes? My readings say 'no' but I need a voice of authority in this debate.
There are a lot of factors that can affect the overall latency. To truly determine whether it is the SAN or not, you will want to look at the "Avg. Disk sec/Read" and "Avg. Disk sec/Write" counters that you mentioned. Just make sure you are looking at the "Physical Disk" object, and not the "Logical Disk" object. The logical disk counters include the file system overhead, and may differ depending on various factors.
Once you have the counters for the physical disks, you will want to compare them to the latency counters for the storage unit the server is connected to. You mentioned "storage folk", so I'm going to assume that is a different team; hopefully they will be nice and provide the info to you.
If it is a storage unit issue, then both sets of counters should match up pretty well. That indicates the storage unit is truly running slow. If the storage unit counters are significantly better, then the problem is something in between. Depending on what type of storage network you are using, this would be the HBAs, NICs or switches that connect the server and storage together. If it's a VM, then the host machine stats would prove useful as well.
Apart from obvious reasons such as "not enough memory for buffer pool", latency mostly depends on how your storage is actually implemented.
If your server uses an external SAN, the usual problem is that it might give you stellar throughput, but it will never (again, usually) give you stellar latency. It's just the way things are. It can certainly become a real headache for heavily loaded OLTP systems.
So, if you are about to squeeze every last microsecond from your storage, most probably you will need local drives. That, and your RAID 10 should have enough spindles to cope with the load.
Is the storage capacity of an in-memory database limited to the size of RAM? If yes, are there any ways to increase its capacity other than adding RAM? If no, please explain.
As previously mentioned, in-memory storage capacity is limited by the addressable memory, not by the amount of physical memory in the system. Simon was also correct that the OS will swap memory to the page file, but you really want to avoid that. In the context of the DBMS, the OS will do a worse job of it than if you simply used a persistent database with as large of a cache as you have physical memory to support. IOW, the DBMS will manage its cache more intelligently than the OS would manage paged memory containing in-memory database content.
On a 32-bit system, each process is limited by its virtual address space, not by installed RAM: on 32-bit Windows a process can address 2GB by default (3GB with the /3GB switch), whether you have 3GB physically or 512MB. If you have more data (including the in-memory DB) and code than will fit into physical RAM, then the page file on disk is used to swap out memory that is currently not being used. Swapping does slow everything down, though. There are some tricks you can use to extend the limit, such as memory-mapped files and the /3GB switch, but these are not easy to implement.
On 64-bit machines, a process's memory limit is huge. I forget what it is, but it's up in the TB range.
VoltDB is an in-memory SQL database that runs on a cluster of 64-bit Linux servers. It has high performance durability to disk for recovery purposes, but tables, indexes and materialized views are stored 100% in-memory. A VoltDB cluster can be expanded on the fly to increase the overall available RAM and throughput capacity without any down time. In a high-availability configuration, individual nodes can also be stopped to perform maintenance such as increasing the server's RAM, and then rejoined to the cluster without any down time.
The design of VoltDB, led by Michael Stonebraker, was for a no-compromise approach to performance and scalability of OLTP transaction processing workloads with full ACID guarantees. Today these workloads are often described as Fast Data. By using main memory, and single-threaded SQL execution code distributed for parallel processing by core, the data can be accessed as fast as possible in order to minimize the execution time of transactions.
There are in-memory solutions that can work with data sets larger than RAM. Of course, this is accomplished by adding some operations on disk. Tarantool's Vinyl, for example, can work with data sets that are 10 to 1000 times the size of available RAM. Like other databases of recent vintage such as RocksDB and Bigtable, Vinyl's write algorithm uses LSM trees instead of B trees, which helps with its speed.
We are planning our new EBS structure on amazon to get the best performance out of SQL Server. During the process some doubts appeared:
1 - Using the Amazon calculator (http://calculator.s3.amazonaws.com/index.html) we got the costs below:
General purpose (SSD) - 1000GB - 3000 IOPS = $184,30
Provisioned IOPS (SSD) - 1000GB - 3000 IOPS = $511,00
That is a huge difference in monthly cost for what looks like the same performance. I'm aware of the "IOPS burst implementation" on General Purpose SSD, but according to the documentation:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
When the volume size is 1000 GB the burst duration is "infinite" (Always 3000 IOPS).
Is it safe to say that the performance of the two disks above is exactly the same?
2 - We need about 1700 GB for 100 databases, what layout should we use?
Options:
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and distribute the workload between the two.
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and put them together with RAID 0? (We will be able to have 6000 IOPS burst, but should I be worried about EBS faults?)
Get four disks (GP SSD) with 1000GB (3000 IOPS) each and use RAID 10? (Is it necessary with EBS?)
Please give your suggestions; I will be glad to hear them.
From Amazon support, hope this helps!
Greetings
The disk cost question is easy enough to answer. General Purpose (SSD) and Provisioned IOPS (SSD) use similar technology. Side by side they can achieve the same speeds, the only difference being that the GP2 maximum speed is 3000 IOPS and the PIOPS maximum is 4000 IOPS, per volume. One reason PIOPS is much more expensive is that you also pay for the IOPS you provision, whereas with GP2 there is no per-IO cost.
As for the design of the 1700GB datastore, there are two main factors: redundancy and performance. And of course cost is a big factor. To provide proper guidance here we would need to know what your actual needs are going to be; then we could suggest some solutions. However, a few common RAID levels match what you suggested, so we can talk about those.
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and distribute the workload between the two.
No RAID. I take it you mean just having some databases on one volume and some on the other? This, to me, is actually fine. All I would do in addition is back up the DBs to some other locally attached EBS volumes. This would be for workloads no greater than 3000 IOPS (reads and writes combined). It's also easily expandable: just add more disks.
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and put them together with RAID 0? (We will be able to have 6000 IOPS burst, but should I be worried about EBS faults?)
RAID 0. All you have done here is make things twice as fast. But lose one disk and you lose everything. Again, if you are happy to restore from backup if a disk fails, this is a fast, cheap config for up to 6000 IOPS. Not easily expandable.
Get four disks (GP SSD) with 1000GB (3000 IOPS) each and use RAID 10? (Is it necessary with EBS?)
RAID 5, 6, and 10. These are all faster and more redundant. Arguably, RAID 10 is the best config for database, and probably the right config for you. With 1700 GB of data, if things go wrong there will be lots and lots of unhappy people.
Any suggestions?
Have you considered Amazon RDS? RDS has lots of advantages. We do all the heavy lifting, including multi AZ deployments, and RDS can expand vertically (CPU) and horizontally (Space) as your needs grow.
http://aws.amazon.com/rds/details/
The other thing to consider with GP2 is that you might not need to provision 1TB volumes. You probably do not need the 3000 IOPS "infinite" burst model. Let's say you do want to run at 3000 IOPS all the time. Why not provision 5 x 200GB volumes, where each volume has 3 IOPS per GB? So 5 x 200 x 3 = 3000 IOPS baseline. Put the 5 volumes in RAID 5 (for example) and you should get around 3000 IOPS all day long, and never run out of credit as long as you don't go over that (and IO is equally distributed).
However, those volumes can each burst to 3000 IOPS for 30 minutes continuously before you get rate limited to 600 IOPS per volume, which is still 3000 IOPS in total. So in this config you can burst to 15,000 IOPS at any time, and when you do get limited you still have the 3000 IOPS you predicted you need. Just don't run at over 3000 for longer than needed or you'll have no burst left.
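The arithmetic behind the 5 x 200GB layout can be sketched as follows. The 3 IOPS per GB baseline and the 3000 IOPS per-volume burst ceiling come from the answer above; the 5.4 million I/O credit bucket is an assumption taken from the GP2 documentation of the time, so check the current EBS docs before relying on it. The function names are hypothetical, and RAID overhead is ignored.

```python
import math

BASELINE_IOPS_PER_GB = 3    # from the answer above
BURST_IOPS = 3000           # per-volume burst ceiling, from the answer above
CREDIT_BUCKET = 5_400_000   # ASSUMPTION: GP2 I/O credit bucket size


def baseline_iops(size_gb):
    """Sustained IOPS a GP2 volume of this size earns."""
    return size_gb * BASELINE_IOPS_PER_GB


def burst_seconds(size_gb):
    """How long a full credit bucket lasts at the burst ceiling."""
    drain_rate = BURST_IOPS - baseline_iops(size_gb)
    if drain_rate <= 0:
        return math.inf  # baseline already meets the ceiling: never drains
    return CREDIT_BUCKET / drain_rate


def striped_set(volumes, size_gb):
    """Aggregate figures for several equal volumes with evenly spread I/O."""
    per_volume = baseline_iops(size_gb)
    return {
        "baseline": volumes * per_volume,
        "burst": volumes * max(per_volume, BURST_IOPS),
        "burst_minutes": burst_seconds(size_gb) / 60,
    }


if __name__ == "__main__":
    print(striped_set(5, 200))
    print(burst_seconds(1000))
```

Under these assumptions five 200GB volumes give a 3000 IOPS baseline and a 15,000 IOPS burst, and a full bucket lasts about 37 minutes at the ceiling, the same order as the 30 minutes quoted. A 1000GB volume never drains its bucket, which is the "infinite" burst the question refers to.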
Neat, huh? I think it is worthwhile to call or chat in to discuss your actual needs and answer any questions. Ultimately, though, you will need to test and benchmark whichever design you decide to go with, as talking about things and actual results will always differ! I imagine you guys are quite advanced, but here is a great example benchmark if you want to do some simple tests on various designs to help you decide what is best.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_piops.html
I am investigating different structures for our database, which is expected to contain millions of files. I have narrowed it down to two different models; one of which is 4 times faster and uses 3 times less CPU, but uses 4 times more IO reads than the other.
So what is more expensive in both money and server bottlenecks, considering we are planning to host it in either Amazon or Azure cloud, IO or CPU?
It totally depends on the type of IO device and the size of the virtualized instance used. In a cloud hosted environment the real hardware specs are totally abstracted into marketing terms like EC2 Compute Unit. The only real way to know is to spin up in all environments and load test. Anything else is just a plain old guess.
Just want to add one more variable - Memory.
High memory instances can dramatically reduce the IOPS / CPU requirements.
For example, a MongoDB instance that has most of its working set in memory hardly does any IO calls.
And I agree with jeremyjjbrown - test, test, test.
Your KPIs would be transactions (R/W) per second and transactions per dollar.
We are using Mnesia as the primary database for a very large system. Mnesia fragmented tables have behaved very well over the testing period. The system has about 15 tables, each replicated across 2 sites (nodes), and each table is highly fragmented. During the testing phase (which focused on availability, efficiency and load tests), we concluded that Mnesia, with its many advantages for complex structures, will do for us, given that all our applications running on top of the service are Erlang/OTP apps. We are running Yaws 1.91 as the main web server.
For efficiently configuring Fragmented Tables, we used a number of references who have used mnesia in large systems:
These are: Mnesia One Year Later Blog, Part 2 of the Blog, Followed it even here, About Hashing. These blog posts have helped us fine tune here and there to a better performance.
Now, the problem. Mnesia has table size limits, yes, we agree. However, limits on the number of fragments have not been mentioned anywhere. For performance reasons, and to cater for large data, about how many fragments would keep Mnesia "okay"?
In some of our tables, we have 64 fragments, with n_disc_only_copies set to the number of nodes in the cluster so that each node has a copy of every fragment. This has helped us solve Mnesia write failures when a given node is out of reach at some instant. Also, in the blog above, the author suggests that the number of fragments should be a power of 2; this (he says) follows from the way Mnesia hashes records. We need more explanation on this, however, and on which power of two is meant here: 2, 4, 8, 16, 32, 64, 128, ...?
The system is intended to run on HP ProLiant G6 servers with Intel processors (2 processors, each with 4 cores at 2.4 GHz and 8 MB of cache), 20 GB of RAM and 1.5 terabytes of disk space. Two of these high-powered machines are at our disposal. The system database should be replicated across the two. Each server runs 64-bit Solaris 10.
At what number of fragments may Mnesia's performance start to degrade? Is it okay if we increase the number of fragments from 64 to 128 for a given table? How about 65536 fragments (2^16)? How do we scale out our Mnesia to make use of the terabyte of space by using fragmentation?
Please do provide the answers to the questions and you may provide advice on any other parameters that may enhance the System.
NOTE: All tables that are to hold millions of records are created in disc_only_copies type, so no RAM problems. The RAM will be enough for the few RAM Tables we run. Other DBMS like MySQL Cluster and CouchDB will also contain data and are using the same hardware with our Mnesia DBMS. MySQL Cluster is replicated across the two servers (each holding two NDB Nodes, a MySQL server), the Management Node being on a different HOST.
The hint about using a power-of-two number of fragments comes simply from the fact that the default fragmentation module, mnesia_frag, uses linear hashing, so 2^n fragments ensures that records are distributed equally (more or less, obviously) between fragments.
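To see why the power of two matters, here is a small sketch of textbook linear hashing. It is not the actual mnesia_frag_hash implementation; the function names are made up, and plain integers stand in for uniformly distributed hash values.

```python
from collections import Counter


def linear_hash_fragment(key_hash, n_fragments):
    """Textbook linear hashing: map a hash value to a fragment index."""
    # Smallest power of two that is >= n_fragments.
    size = 1
    while size < n_fragments:
        size *= 2
    frag = key_hash % size
    if frag >= n_fragments:
        # That bucket has not been split off yet, so the key stays in the
        # bucket it would later be split from.
        frag = key_hash % (size // 2)
    return frag


def distribution(n_fragments, n_keys=96_000):
    """Count how many of n_keys evenly spread hashes land in each fragment."""
    counts = Counter(linear_hash_fragment(k, n_fragments) for k in range(n_keys))
    return [counts[i] for i in range(n_fragments)]


if __name__ == "__main__":
    for n in (64, 96, 128):
        d = distribution(n)
        print(n, "fragments: min", min(d), "max", max(d))
```

With 64 or 128 fragments every fragment receives the same number of keys. With 96 fragments the not-yet-split fragments hold twice as many records as the rest, which is the imbalance the power-of-two advice avoids.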
Regarding the hardware at disposal, it's more a matter of performance testing.
The factors that can reduce performance are many and configuring a database like Mnesia is just one single part of the general problem.
I simply advise you to stress test one server and then run the same test on both servers to see whether it scales correctly.
As for scaling the number of Mnesia fragments, remember that with disc_only_copies most of the time is spent in two operations:
decide which fragment holds which record
retrieve the record from corresponding dets table (Mnesia backend)
The first one does not really depend on the number of fragments, given that by default Mnesia uses linear hashing.
The second one is more related to hard disk latency than to other factors.
In the end, a good solution could be to have more fragments and fewer records per fragment, while at the same time looking for a middle ground that does not lose the benefit of hard disk performance boosts such as buffers and caches.