Cassandra hardware configuration - database

I have a question regarding hardware related performance issues.
Our Cassandra nodes had 4 cores with 2GB RAM, and we suffered from unreasonable response times (1.5 seconds on average per read at 200 calls/sec).
We then upgraded the machines to 8 cores with 8GB RAM and immediately saw an improvement (around 300ms now).
However, server analytics don't show any peak or extra use of CPU power.
How can this be explained? Does an upgrade from 4 cores to 8 cores explain such a performance boost even though the server's CPU usage seems unaffected?
Thanks

Cassandra needs more memory to hold data in memtables and for faster read response times.
We assign 8GB to 16GB to the Cassandra process, with the JVM parameters tuned accordingly, on nodes with 4 quad-core CPUs (16 cores per node) and SATA drives.
Make sure the commit log and the data directory are on separate disks.
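For reference, both locations are set in cassandra.yaml; the paths below are just the usual packaged defaults and only illustrative, the point is that they should live on different physical disks:

```yaml
# cassandra.yaml (illustrative paths)
commitlog_directory: /var/lib/cassandra/commitlog   # dedicate one disk to the commit log
data_file_directories:
    - /var/lib/cassandra/data                       # keep SSTable data on a different disk
```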

If you left the configuration parameters at their defaults, you saw an improvement in performance partly because your key cache is now larger.
key_cache_size_in_mb = min(5% of heap in MB, 100MB)
cassandra-env.sh automatically decides how much heap to give to the JVM (if you did not change it):
set max heap size based on the following
max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
calculate 1/2 ram and cap to 1024MB
calculate 1/4 ram and cap to 8192MB
pick the max
Since your max heap size was increased, your key cache was increased as well.
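To put rough numbers on that, using the cassandra-env.sh logic quoted above: with 2GB of RAM the heap works out to max(min(1024MB, 1024MB), min(512MB, 8192MB)) = 1024MB, so the key cache is min(5% of 1024MB, 100MB) ≈ 51MB. With 8GB of RAM the heap becomes max(min(4096MB, 1024MB), min(2048MB, 8192MB)) = 2048MB, so the key cache is min(5% of 2048MB, 100MB) = 100MB, roughly double what you had before the upgrade.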

Related

Sawtooth CPU load pattern caused by memcpy(). A CPU-cache or memory-management issue?

Although this may sound like an ffmpeg-related issue, I believe it is not.
We have a system that processes live TV feeds by using ffmpeg's filters.
1. We capture frames from a video capture card and copy them into our own data structures.
2. We copy the frames into ffmpeg's native structures.
3. We run the filters.
4. We copy the resulting frame from ffmpeg's native structures back into our own structures.
A single frame uses 4.15 MB of dynamically allocated memory.
Frame buffers are allocated by using _aligned_malloc().
The server is a 2x Intel Xeon E5-2697 v4 box running Windows Server 2016 with 64 GB of memory.
There are 2 NUMA nodes, with one process assigned to each node.
There are 2 channels per process, for a total of 4 channels per server, each at 25 fps.
A single default process heap is used for all memory allocations.
In addition to the frame buffers, various dynamic memory allocations take place for multiple purposes.
Each process uses 2 GB of physical memory.
When we run the system, everything works fine. A flat CPU load is observed.
After a while (generally a couple of hours), we start seeing a sawtooth pattern in the CPU load:
When we investigate by using Intel's VTune Amplifier 2018, it shows that memcpy() in step2 consumes a lot of CPU time during those high-CPU periods.
Below is what we see from VTune 2018 Hot-Spot Analysis
**LOW CPU PERIOD (a total of 8.13 sec)**
SleepConditionVariableCS 14.167 sec CPU time
WaitForSingleObjectEx 14.080 sec CPU time
memcpy 3.443 sec CPU time with the following decomposition:
-- get_frame(step4) -- 2.568
-- put_frame(step2) -- 0.740
-- decklink_cb(step1) -- 0.037
**HIGH CPU PERIOD (a total of 8.13 sec)**
memcpy 16.812 sec CPU time with the following decomposition:
-- put_frame(step2) -- 10.429
-- get_frame(step4) -- 3.692
-- decklink_cb(step1) -- 2.236
SleepConditionVariableCS 14.765 sec CPU time
WaitForSingleObjectEx 13.928 sec CPU time
_aligned_free() 3.532 sec CPU time
Below is the graph of the time it takes to perform the memcpy() operation in step 2.
Y-axis is the time in milliseconds.
Below is the graph of total processing time for a single frame.
Y-axis is the time in milliseconds.
"0" readings are frame drops due to delay and should be neglected.
When the CPU load gets higher and memcpy() starts to take longer, the handle count of our process decreases.
We logged the addresses that are returned from _aligned_malloc() in step2.
This is the buffer location in memory that we copy the 4.15 MB of frame data to.
For low CPU periods, the addresses returned from _aligned_malloc() are very close to each other, i.e. the difference between the addresses returned by two consecutive allocations tends to be small.
For high CPU periods, the addresses returned from _aligned_malloc() span a very large range.
The sawtooth pattern we see in the total CPU load seems to be due to the memcpy() operation in step2.
For the high CPU periods, the buffer addresses returned from _aligned_malloc() seem to have poor locality compared to the buffer addresses returned during low CPU periods.
We believe the operating system starts doing something extra when we call memcpy(). Some additional work during page faults? Some cleanup? Is this a cache issue? We don't know.
Can anybody comment on the reason for, and a solution to, this situation?
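If the allocation pattern really is the culprit, one thing worth trying is to stop allocating and freeing a 4.15 MB buffer per frame and instead recycle a small, fixed set of aligned buffers. Below is a minimal sketch of such a pool; the class name and sizes are illustrative, not the poster's actual code, and it assumes every frame buffer has the same size:

```cpp
#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC CRT)
#include <cstddef>
#include <mutex>
#include <vector>

// Recycles a fixed set of aligned frame buffers so the same few addresses
// are reused, instead of calling _aligned_malloc()/_aligned_free() per frame.
class FrameBufferPool {
public:
    FrameBufferPool(std::size_t bufferSize, std::size_t count, std::size_t alignment = 64) {
        for (std::size_t i = 0; i < count; ++i) {
            if (void* p = _aligned_malloc(bufferSize, alignment)) {
                all_.push_back(p);
                free_.push_back(p);
            }
        }
    }

    ~FrameBufferPool() {
        for (void* p : all_) _aligned_free(p);
    }

    // Hands out a preallocated buffer, or nullptr if the pool is exhausted.
    void* acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Returns a buffer to the pool instead of freeing it.
    void release(void* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(p);
    }

private:
    std::vector<void*> all_;   // everything we allocated; freed in the destructor
    std::vector<void*> free_;  // buffers currently available
    std::mutex mutex_;
};
```

Reusing the same handful of addresses keeps the memcpy() targets on pages that are already mapped and warm, and it avoids whatever extra work the heap manager and the OS do once the address space becomes fragmented.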

High Paging file % Usage while memory is not full

I was handed a server hosting SQL Server and was asked to find the causes of its poor performance.
While monitoring PerfMon I found that:
Paging file: % Usage = 25% average for 3 days.
Memory: Pages/sec > 1 average for 3 days.
What I know is that if % Usage is above 2%, there is too much paging because of memory pressure and a lack of memory space. However, when I opened the Memory tab of Resource Monitor, I found:
- 26 GB in use (out of 32 GB total RAM)
- 2 GB standby
- 4 GB of memory free!
If there is 4 GB of free memory, why the paging? And, most importantly, why is the paging file % usage so high?
Could someone please explain this situation and how the paging file % usage can be lowered to normal.
Note that SQL Server's max memory is set to 15GB.
Page file usage on its own isn't a major red flag. The OS will tend to use a page file even when there's plenty of RAM available, because it allows it to dump the relevant parts of memory from RAM when needed - don't think of the page file usage as memory moved from RAM to HDD - it's just a copy. All the accesses will still use RAM, the OS is simply preparing for a contingency - if it didn't have the memory pre-written to the page file, the memory requests would have to wait for "old" memory to be dumped before freeing the RAM for other uses.
Also, it seems you're a bit confused about how paging works. All user-space memory is always paged; this has nothing to do with the page file itself - it simply means you're using virtual memory. The metric you're looking for is Hard faults per second (EDIT: uh, I misread which one you're reading - Pages/sec is how many hard faults there are; still, the rest applies), which tells you how often the OS had to actually read data from the page file. Even then, 1 per second is extremely low. You will rarely see anything until that number goes above fifty per sec or so, and much higher for SSDs (on my particular system, I can get thousands of hard faults with no noticeable memory lag - this varies a lot based on the actual HDD and your chipset and drivers).
Finally, there are way too many ways SQL Server performance can suffer. If you don't have a real DBA (or at least someone with plenty of DB experience), you're in trouble. Most of your lines of inquiry will lead you to dead-ends - something as complex and optimized as a DB engine is always hard to diagnose properly. Identify signs - is there a high CPU usage? Is there a high RAM usage? Are there queries with high I/O usage? Are there specific queries that are giving you trouble, or does the whole DB suffer? Are your indices and tables properly maintained? Those are just the very basics. Once you have some extra information like this, try DBA.StackExchange.com - SO isn't really the right place to ask for DBA advice :)
Just some shots in the dark really; this might be a little random, but I can hardly spot anything straight away:
- Might there be processes that select uselessly large data sets or run operations too frequently? (e.g. the awful app-developer practice of using SELECT * everywhere, fetching all data and then filtering it at the application level, or running DB queries in loops instead of fetching record sets once, etc.)
- Is indexing proper? (e.g. are leaf elements employed where possible to reduce key lookup operations, are heavy queries backed by proper indices to avoid table and index scans, etc.)
- How is data population managed? (e.g. is it possible that there are too many page splits due to improper clustered indices or parallel inserts, are there index rebuilds taking place, etc.)

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64MB, while the block size on a disk is generally 4KB.
What does a 64MB block size mean? Does it mean that the smallest unit read from disk is 64MB?
If yes, what is the advantage of doing that? Is it to make continuous access to large files in HDFS easy?
Can we do the same by using the disk's original 4KB block size?
What does 64MB block size mean?
The block size is the smallest data unit that a file system can store. If you store a file that's 1KB or 60MB, it'll take up one block. Once you cross the 64MB boundary, you need a second block.
If yes, what is the advantage of doing that?
HDFS is meant to handle large files. Let's say you have a 1000MB file. With a 4KB block size, you'd have to make 256,000 requests to get that file (one request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the NameNode to determine where that block can be found. That's a lot of traffic! If you use 64MB blocks, the number of requests goes down to 16, significantly reducing the overhead and the load on the NameNode.
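To make that arithmetic concrete, here is a tiny sketch (using only the numbers from the example above) of how the number of blocks, and therefore the number of NameNode lookups and metadata entries, scales with block size:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double fileSizeMB = 1000;                      // the 1000 MB example file
    const double blockSizesMB[] = {4.0 / 1024, 64, 128}; // 4 KB, 64 MB, 128 MB blocks

    for (double blockMB : blockSizesMB) {
        // One block-location request (and one NameNode metadata entry) per block.
        long long blocks = static_cast<long long>(std::ceil(fileSizeMB / blockMB));
        std::printf("block size %9.4f MB -> %lld blocks\n", blockMB, blocks);
    }
    return 0;
}
```

This reproduces the 256,000 vs. 16 requests mentioned above, plus 8 requests for a 128MB block size.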
HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the reasons for large block sizes as stated in the original GFS paper (note on GFS vs HDFS terminology: chunk = block, chunkserver = datanode, master = namenode):
A large chunk size offers several important advantages. First, it reduces clients' need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
In HDFS, the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed in your cluster.
So what's the point of choosing a higher block size instead of some low value? While in theory an equal distribution of data is a good thing, too low a block size has some significant drawbacks. The NameNode's capacity is limited, so having a 4KB block size instead of 128MB also means having 32,768 times more block information to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice the theoretical benefits are lost by not being able to perform sequential, buffered reads and because of the latency of each map task.
In a normal OS the block size is 4K, whereas in Hadoop it is 64 MB. Why? For easier maintenance of the metadata in the NameNode.
Suppose we had only a 4K block size in Hadoop and were trying to load 100 MB of data: we would need a very large number of 4K blocks, and the NameNode would have to maintain metadata for all of those blocks.
If we use a 64MB block size, the data is loaded into only two blocks (64MB and 36MB), so the size of the metadata is decreased.
Conclusion:
To reduce the burden on the NameNode, HDFS prefers a 64MB or 128MB block size. The default block size is 64MB in Hadoop 1.0 and 128MB in Hadoop 2.0.
It has more to do with disk seeks on HDDs (hard disk drives). Over time, disk seek time has not improved much compared to disk throughput. So, when the block size is small (which leads to too many blocks), there will be too many disk seeks, which is not very efficient. As we move from HDDs to SSDs, seek time matters less, since there are no moving parts in an SSD.
Also, if there are too many blocks it will strain the NameNode. Note that the NameNode has to store the entire metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB, and in Cloudera Hadoop the default is 128 MB.
If the block size were set to less than 64MB, there would be a huge number of blocks throughout the cluster, which would force the NameNode to manage an enormous amount of metadata.
Since we need a Mapper for each block, there would be a lot of Mappers, each processing a tiny piece of data, which isn't efficient.
The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.
- Having a much smaller block size would cause seek overhead to increase.
- Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
- Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.
See Google Research Publication: MapReduce
http://research.google.com/archive/mapreduce.html
Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45).
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.

This argument shouldn't be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
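Spelling out the arithmetic behind that quoted figure: to keep the seek time at 1% of the transfer time, the block size must be at least transfer rate × seek time / 0.01 = 100 MB/s × 0.010 s / 0.01 = 100 MB.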

Uploading Large (8GB) File Issue using Weka

I am trying to load an 8GB file into Weka in order to use the Apriori algorithm. The server configuration is as follows:
It is an 8-processor server with 4 cores each; the physical address space is 40 bits and the virtual address space is 48 bits. It is a 64-bit processor.
Physical memory = 26GB and swap = 27GB.
The JVM is 64-bit. We have allocated 32GB for the JVM heap using the -Xmx option. Our concern is that loading such a huge file is taking a very long time (around 8 hours); Java is using 107% CPU and 91% memory, it has not thrown an OutOfMemory exception, and Weka still shows that it is reading from the file.
Please help me: how do I handle such a huge file, and what exactly is happening here?
Regards,
Aniket
I can't speak to Weka; I don't know your data set or how many elements are in it. The number of elements matters because in a 64-bit JVM the pointers are huge, and they add up.
But do NOT create a JVM larger than physical RAM. Swap is simply not an option for Java. A swapping JVM is a dead JVM. Swap is for rarely used, idle processes.
Also note that the Xmx value and the physical heap size are not the same; the physical size will always be larger than the Xmx value.
You should pre-allocate your JVM heap (Xms == Xmx) and try out various values until MOST of your physical RAM is consumed. This will limit full GCs and memory fragmentation. It also helps (a little) to do this on a fresh system if you're allocating such a large portion of the total memory space.
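For illustration only (the right number depends on what else runs on the box, and this assumes Weka is launched from weka.jar): with 26GB of physical RAM, a pre-allocated heap of around 20GB rather than the 32GB currently configured might look like this:

```
java -Xms20g -Xmx20g -jar weka.jar
```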
But whatever you do, do not let Java swap. Swapping and Garbage Collectors do not mix.

For LevelDB, how can I get random-write performance matching the claimed "official" performance report?

On the official site of LevelDB (http://code.google.com/p/leveldb/), there is a performance report. I have pasted it below.
Below is from the official LevelDB benchmark.
Here is a performance report (with explanations) from the run of the included db_bench program. The results are somewhat noisy, but should be enough to get a ballpark performance estimate.
Setup
We use a database with a million entries. Each entry has a 16 byte key, and a 100 byte value. Values used by the benchmark compress to about half their original size.
LevelDB: version 1.1
CPU: 4 x Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
CPUCache: 4096 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
Raw Size: 110.6 MB (estimated)
File Size: 62.9 MB (estimated)
Write performance
The "fill" benchmarks create a brand new database, in either sequential, or random order.
The "fillsync" benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The "overwrite" benchmark does random writes that update existing keys in the database.
fillseq : 1.765 micros/op; 62.7 MB/s
fillsync : 268.409 micros/op; 0.4 MB/s (10000 ops)
fillrandom : 2.460 micros/op; 45.0 MB/s
overwrite : 2.380 micros/op; 46.5 MB/s
Each "op" above corresponds to a write of a single key/value pair. I.e., a random write benchmark goes at approximately 400,000 writes per second.
Below is from My leveldb benchmark
I did some benchmarking of LevelDB but got a write speed 100 times lower than the report.
Here is my experiment settings:
CPU: Intel Core2 Duo T6670 2.20GHz
3.0GB memory
32-bit Windows 7
without compression
options.write_buffer_size = 100MB
options.block_cache = 640MB
What I did is very simple: I just put 2 million {key, value} pairs, with no reads at all. The key is a byte array of 20 random bytes and the value is also a byte array, of 100 random bytes. I kept putting newly generated random {key, value} pairs 2 million times, with no other operations.
In my experiment, I can see that the write speed decreases from the very beginning. The instantaneous speed (measured over every 1024 writes) swings between 50/s and 10,000/s, and my overall average write speed for the 2 million pairs is around 3,000/s. The peak write speed is 10,000/s.
Since the report claims that the write speed can be 400,000/s, my benchmark is 40 to 130 times slower, and I am just wondering what's wrong with it.
I don't need to paste my test code here as it is very simple: I just have a while loop that runs 2 million times, and in each iteration I generate a 20-byte key and a 100-byte value and then put them into the LevelDB database. I also measured the time spent on {key, value} generation; it costs 0 ms.
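For reference, here is roughly what such a loop looks like with the official LevelDB C++ API and the settings listed above. This is a generic sketch, not the poster's actual code (which may use a different language binding), and the database path and random-byte generation are just placeholders:

```cpp
#include <leveldb/cache.h>
#include <leveldb/db.h>
#include <cstdlib>
#include <string>

int main() {
    leveldb::Options options;
    options.create_if_missing = true;
    options.compression = leveldb::kNoCompression;                  // "without compression"
    options.write_buffer_size = 100 * 1024 * 1024;                  // 100 MB write buffer
    options.block_cache = leveldb::NewLRUCache(640 * 1024 * 1024);  // 640 MB block cache

    leveldb::DB* db = nullptr;
    leveldb::Status status = leveldb::DB::Open(options, "benchdb", &db);
    if (!status.ok()) return 1;

    std::string key(20, '\0'), value(100, '\0');
    for (int i = 0; i < 2000000; ++i) {
        // Fill key and value with random bytes (a stand-in for the real generator).
        for (char& c : key)   c = static_cast<char>(std::rand() & 0xFF);
        for (char& c : value) c = static_cast<char>(std::rand() & 0xFF);
        db->Put(leveldb::WriteOptions(), key, value);  // async write (sync flag not set)
    }

    delete db;
    delete options.block_cache;
    return 0;
}
```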
Can anyone help me with this? How can I achieve a 400,000/s write speed with LevelDB? Which settings should I tune?
Thanks
Moreover
I just ran the official db_bench.cc on my machine. It is 28 times slower than the report.
I think, since I used their own benchmark program, the only difference between my benchmark and theirs is the machine.
You have 2 million key-value pairs and each key value pair is a total of 120 bytes, so 2 million * 120 bytes = 228 MB of data! Your cache is 640 MB so it's quite possible that all of your data is still in RAM and it never really got to disk. As Kitsune pointed out: your hardware is nowhere near as fast as the one that Google tested with and if Google had the same cache size then that could easily produce 30 times difference.
Other potential issues:
It's difficult to know exactly how "random" the keys were: LevelDB performs differently depending on the distribution of the keys (even if they are "random").
20-byte keys would be less efficient than 16-byte keys, because they don't align as well.
Depending on your hard drive, your disk write speed might be slower (check yours).
We can keep going on and on and on, but there are just too many variables to consider. If you post some code that demonstrates how your test runs, then we can recommend some optimizations so you can get better performance.
When you run the same benchmark on completely different hardware you're bound to see some differences.
Your CPU is ~9x weaker: 2 cores @ 2.2GHz vs 16 cores @ 2.4GHz.
Your hard drive and the drive used in the official benchmark were not mentioned (fiber NAS vs. a solid-state drive (SSD) vs. a hard disk drive (HDD)).
Can't compare apples to oranges or apples to [unknown fruit].
