Huge number of "dTLB-load-misses" when DPDK forwarding test - arm

Recently I have been running forwarding tests with the DPDK "testpmd" application, and I found something interesting.
When 512 descriptors are used for TX and RX, the performance is better than with 4096 descriptors. After checking the counters with the perf command, I see a huge number of "dTLB-load-misses" with 4096 descriptors, roughly 100 times more than with 512 descriptors. Page faults, however, are always zero. Using the ":u" and ":k" modifiers, it seems that most of the TLB misses happen in user space. All the buffers that hold the network payload data live in one huge page of 512MB, each buffer is less than 3KB, and buffers and descriptors are mapped one-to-one.
Is there any way to track down the source of these TLB misses? And can they actually degrade performance?

In general, TLB capacity depends on the page size: the L1/L2 TLBs may have a different number of entries for 4KB pages than for 512MB pages.
For example, for ARM Cortex-A75:
The data micro TLB is a 48-entry fully associative TLB that is used by load and store operations. The cache entries have 4KB, 16KB, 64KB, and 1MB granularity of VA to PA mappings only.
Source: ARM Info Center
For ARM Cortex-A55:
The Cortex-A55 L1 data TLB supports 4KB pages only.
Any other page sizes are fractured after the L2 TLB and the appropriate page size sent to the L1 TLB.
Source: ARM Info Center
Basically, this means that the 512MB huge page mapping will be fractured into smaller translations (down to 4KB), and only those small pieces will be cached in the L1 dTLB.
So even if your application fits into a single 512MB page, performance will still depend heavily on the actual memory footprint.
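As a back-of-envelope illustration (a minimal sketch: the ~3KB buffer size and descriptor counts come from the question, the 48-entry micro TLB is the Cortex-A75 figure quoted above, and the 1024-entry main TLB is a hypothetical value, so substitute the numbers for your core):

#include <stdio.h>

/* Rough estimate of dTLB pressure when a huge page is fractured into
 * 4KB-granularity entries. The TLB sizes are assumptions for illustration. */
int main(void)
{
    const long buf_size  = 3 * 1024;    /* ~3KB per packet buffer          */
    const long micro_tlb = 48;          /* Cortex-A75 data micro TLB       */
    const long main_tlb  = 1024;        /* hypothetical unified main TLB   */
    const long descs[]   = {512, 4096};

    for (int i = 0; i < 2; i++) {
        long footprint = descs[i] * buf_size;   /* bytes touched            */
        long entries   = footprint / 4096;      /* 4KB-granularity mappings */
        printf("%ld descriptors: ~%ld KB footprint, ~%ld TLB entries "
               "(micro TLB holds %ld, main TLB ~%ld)\n",
               descs[i], footprint / 1024, entries, micro_tlb, main_tlb);
    }
    return 0;
}

With 512 descriptors the fractured working set could still fit in a reasonably sized main TLB, while with 4096 descriptors it clearly cannot, which would be consistent with a roughly 100x jump in dTLB-load-misses without any page faults.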

Related

what does STREAM memory bandwidth benchmark really measure?

I have a few questions about the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules).
Below is the comment from stream.c. What is the rationale for requiring the arrays to be 4 times the size of the cache?
* (a) Each array must be at least 4 times the size of the
* available cache memory. I don't worry about the difference
* between 10^6 and 2^20, so in practice the minimum array size
* is about 3.8 times the cache size.
I originally assumed STREAM measures peak memory bandwidth. But I later found that adding extra arrays and array accesses gives me larger bandwidth numbers, so it seems STREAM does not guarantee to saturate memory bandwidth. What does STREAM really measure, and how should the numbers it reports be used?
For example, I added two extra arrays, made sure to access them together with the original a/b/c arrays, and modified the byte accounting accordingly. With these two extra arrays, my bandwidth number went up by ~11.5%.
> diff stream.c modified_stream.c
181c181,183
< c[STREAM_ARRAY_SIZE+OFFSET];
---
> c[STREAM_ARRAY_SIZE+OFFSET],
> e[STREAM_ARRAY_SIZE+OFFSET],
> d[STREAM_ARRAY_SIZE+OFFSET];
192,193c194,195
< 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
< 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
---
> 5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
> 5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
270a273,274
> d[j] = 3.0;
> e[j] = 3.0;
335c339
< c[j] = a[j]+b[j];
---
> c[j] = a[j]+b[j]+d[j]+e[j];
345c349
< a[j] = b[j]+scalar*c[j];
---
> a[j] = b[j]+scalar*c[j] + d[j]+e[j];
CFLAGS = -O2 -fopenmp -D_OPENMP -DSTREAM_ARRAY_SIZE=50000000
My last level cache is around 35MB.
Any comment?
Thanks!
This is for a Skylake Linux server.
Memory accesses in modern computers are a lot more complex than one might expect, and it is very hard to tell when the "high-level" model falls apart because of some "low-level" detail that you did not know about before....
The STREAM benchmark code only measures execution time -- everything else is derived. The derived numbers are based on both decisions about what I think is "reasonable" and assumptions about how the majority of computers work. The run rules are the product of trial and error -- attempting to balance portability with generality.
The STREAM benchmark reports "bandwidth" values for each of the kernels. These are simple calculations based on the assumption that each array element on the right hand side of each loop has to be read from memory and each array element on the left hand side of each loop has to be written to memory. Then the "bandwidth" is simply the total amount of data moved divided by the execution time.
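As a concrete illustration of that accounting (a minimal sketch, not code from stream.c itself; the array size and kernel time below are placeholder values, not measurements):

#include <stdio.h>

/* STREAM-style bandwidth accounting: bytes moved at the source-code level
 * divided by the measured execution time. */
int main(void)
{
    const long   n            = 50000000;        /* STREAM_ARRAY_SIZE            */
    const int    elem         = sizeof(double);  /* STREAM_TYPE                  */
    const double triad_time_s = 0.075;           /* example measured kernel time */

    /* Triad a[j] = b[j] + scalar*c[j]: two arrays read, one array written. */
    double bytes = 3.0 * elem * n;
    printf("Triad: %.1f MB moved, %.1f MB/s\n",
           bytes / 1e6, bytes / 1e6 / triad_time_s);
    return 0;
}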
There are a surprising number of assumptions involved in this simple calculation.
The model assumes that the compiler generates code to perform all the loads, stores, and arithmetic instructions that are implied by the memory traffic counts. The approach used in STREAM to encourage this is fairly robust, but an advanced compiler might notice that all the array elements in each array contain the same value, so only one element from each array actually needs to be processed. (This is how the validation code works.)
Sometimes compilers move the timer calls out of their source-code locations. This is a (subtle) violation of the language standards, but is easy to catch because it usually produces nonsensical results.
The model assumes a negligible number of cache hits. (With cache hits, the computed value is still a "bandwidth", it is just not the "memory bandwidth".) The STREAM Copy and Scale kernels only load one array (and store one array), so if the stores bypass the cache, the total amount of traffic going through the cache in each iteration is the size of one array. Cache addressing and indexing are sometimes very complex, and cache replacement policies may be dynamic (either pseudo-random or based on run-time utilization metrics). As a compromise between size and accuracy, I picked 4x as the minimum array size relative to the cache size to ensure that most systems have a very low fraction of cache hits (i.e., low enough to have negligible influence on the reported performance).
The data traffic counts in STREAM do not "give credit" to additional transfers that the hardware does, but that were not explicitly requested. This primarily refers to "write allocate" traffic -- most systems read each store target address from memory before the store can update the corresponding cache line. Many systems have the ability to skip this "write allocate", either by allocating a line in the cache without reading it (POWER) or by executing stores that bypass the cache and go straight to memory (x86). More notes on this are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
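For illustration only (a minimal x86/SSE2 sketch, not part of STREAM; the function name and buffer sizes are made up for the example): a Copy-style kernel written with streaming stores avoids the write-allocate read, at the cost of alignment requirements and an explicit fence.

#include <immintrin.h>
#include <stdlib.h>

/* Copy src into dst using non-temporal (cache-bypassing) stores.
 * Assumes dst is 16-byte aligned and n is a multiple of 2. */
static void copy_nt(double *dst, const double *src, size_t n)
{
    for (size_t j = 0; j < n; j += 2) {
        __m128d v = _mm_loadu_pd(&src[j]);  /* normal (cached) load            */
        _mm_stream_pd(&dst[j], v);          /* store bypasses the cache,
                                               no read-for-ownership transfer  */
    }
    _mm_sfence();                           /* make the streaming stores
                                               globally visible                */
}

int main(void)
{
    size_t n = 1 << 20;
    double *a = aligned_alloc(64, n * sizeof(double));
    double *c = aligned_alloc(64, n * sizeof(double));
    for (size_t j = 0; j < n; j++) a[j] = 1.0;
    copy_nt(c, a, n);
    free(a); free(c);
    return 0;
}

Whether a compiler emits such stores automatically for the STREAM kernels depends on the compiler and flags, which is one reason reported STREAM numbers vary between builds.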
Multicore processors with more than 2 DRAM channels are typically unable to reach asymptotic bandwidth using only a single core. The OpenMP directives that were originally provided for large shared-memory systems now must be enabled on nearly every processor with more than 2 DRAM channels if you want to reach asymptotic bandwidth levels.
Single-core bandwidth is still important, but is typically limited by the number of cache misses that a single core can generate, and not by the peak DRAM bandwidth of the system. The issues are presented in http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
For the single-core case, the number of outstanding L1 Data Cache misses is much too small to get full bandwidth -- for your Xeon Scalable processor about 140 concurrent cache misses are required for each socket, but a single core can only support 10-12 L1 Data Cache misses. The L2 hardware prefetchers can generate additional memory concurrency (up to ~24 cache misses per core, if I recall correctly), but reaching average values near the upper end of this range requires simultaneous accesses to multiple 4KiB pages. Your additional array reads give the L2 hardware prefetchers more opportunity to generate (close to) the maximum number of concurrent memory accesses. An increase of 11%-12% is completely reasonable.
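A rough concurrency (Little's law) check of those numbers, with the bandwidth and latency figures being assumptions for illustration only:

#include <stdio.h>

/* Little's law: sustained bandwidth ~= outstanding misses * line size / latency,
 * so required concurrency = bandwidth * latency / line size. */
int main(void)
{
    const double line      = 64.0;      /* bytes per cache line              */
    const double latency   = 90e-9;     /* ~90 ns load-to-use latency        */
    const double socket_bw = 100e9;     /* ~100 GB/s per socket (example)    */

    printf("concurrency needed for %.0f GB/s: ~%.0f misses in flight\n",
           socket_bw / 1e9, socket_bw * latency / line);
    printf("bandwidth from 12 L1D misses:          ~%.1f GB/s\n",
           12 * line / latency / 1e9);
    printf("bandwidth from ~24 prefetched streams: ~%.1f GB/s\n",
           24 * line / latency / 1e9);
    return 0;
}

With these assumed values the required per-socket concurrency comes out around 140 misses in flight, while a single core's 10-24 outstanding misses only sustain a fraction of that, which is the point made above.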
Increasing the fraction of reads is also expected to increase the performance when using all the cores. In this case the benefit is primarily by reducing the number of "read-write turnaround stalls" on the DDR4 DRAM interface. With no stores at all, sustained bandwidth should reach 90% peak on this processor (using 16 or more cores per socket).
Additional notes on avoiding "write allocate" traffic:
In x86 architectures, cache-bypassing stores typically invalidate the corresponding address from the local caches and hold the data in a "write-combining buffer" until the processor decides to push the data to memory. Other processors are allowed to keep and use "stale" copies of the cache line during this period. When the write-combining buffer is flushed, the cache line is sent to the memory controller in a transaction that is very similar to an IO DMA write. The memory controller has the responsibility of issuing "global" invalidations on the address before updating memory. Care must be taken when these streaming stores are used to update memory that is shared across cores. The general model is to execute the streaming stores, execute a store fence, then execute an "ordinary" store to a "flag" variable. The store fence will ensure that no other processor can see the updated "flag" variable until the results of all of the streaming stores are globally visible. (With a sequence of "ordinary" stores, results always become visible in program order, so no store fence is required.)
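A minimal sketch of that publish pattern (SSE2 intrinsics plus a C11 atomic flag; the function name and values are made up for the example, and buf is assumed to be 16-byte aligned):

#include <immintrin.h>
#include <stdatomic.h>
#include <stddef.h>

/* Producer: fill buf with streaming stores, then publish via a flag.
 * Consumers must not read buf until they observe ready != 0. */
void publish(double *buf, size_t n, atomic_int *ready)
{
    __m128d v = _mm_set1_pd(42.0);
    for (size_t j = 0; j + 2 <= n; j += 2)
        _mm_stream_pd(&buf[j], v);       /* cache-bypassing stores            */

    _mm_sfence();                        /* drain the write-combining buffers:
                                            all streaming stores become
                                            globally visible before anything
                                            that follows                      */
    atomic_store_explicit(ready, 1, memory_order_release);  /* ordinary store
                                                                to the flag   */
}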
In the PowerPC/POWER architecture, the DCBZ (or DCLZ) instruction can be used to avoid write allocate traffic. If the line is in cache, its contents are set to zero. If the line is not in cache, a line is allocated in the cache with its contents set to zero. One downside of this approach is that the cache line size is exposed here. DCBZ on a PowerPC with 32-Byte cache lines will clear 32 Bytes. The same instruction on a processor with 128-Byte cache lines will clear 128 Bytes. This was irritating to a vendor who used both. I don't remember enough of the details of the POWER memory ordering model to comment on how/when the coherence transactions become visible with this instruction.
The key point here, as pointed out by Dr. Bandwidth's answer (he's the author of the benchmark), is that STREAM only counts the useful bandwidth seen by the source code.
In practice the write stream also incurs read bandwidth for the RFO (Read For Ownership) requests: when a CPU wants to write 16 bytes (for example) of a cache line, it first has to load the full original line and then modify it in L1d cache.
(Unless your compiler auto-vectorizes with NT stores that bypass the cache and avoid that RFO. Some compilers will do that for loops they expect to write an array too large for cache before any of it is re-read.)
See Enhanced REP MOVSB for memcpy for more about cache-bypassing stores that avoid an RFO.
So increasing the number of read streams vs. write streams brings the software-observed bandwidth closer to the actual hardware bandwidth. (A mixed read/write workload is also not perfectly efficient for the memory itself.)
The purpose of the STREAM benchmark is not to measure the peak memory bandwidth (i.e., the maximum memory bandwidth that can be achieved on the system), but to measure the "memory bandwidth" of a number of kernels (COPY, SCALE, SUM, and TRIAD) that are important to the HPC community. So when the bandwidth reported by STREAM is higher, it means that HPC applications will probably run faster on the system.
It's also important to understand the meaning of the term "memory bandwidth" in the context of the STREAM benchmark, which is explained in the last section of the documentation. As mentioned there, there are at least three ways to count the number of bytes for a benchmark. STREAM uses the STREAM method, which counts the number of bytes read and written at the source-code level. For example, in the SUM kernel (a(i) = b(i) + c(i)), two elements are read and one element is written, so, assuming all accesses go to memory, the number of bytes accessed per iteration equals the number of arrays multiplied by the element size (8 bytes). STREAM calculates bandwidth by multiplying the total number of elements accessed (counted with the STREAM method) by the element size and dividing by the execution time of the kernel. To account for run-to-run variation, each kernel is run multiple times and the arithmetic average, minimum, and maximum bandwidths are reported.
As you can see, the bandwidth reported by STREAM is not the real memory bandwidth (at the hardware level), so it doesn't even make sense to say that it is the peak bandwidth. In addition, it's almost always much lower than the peak bandwidth. For example, this article shows how ECC and 2MB pages impact the bandwidth reported by STREAM. Writing a benchmark that actually achieves the maximum possible memory bandwidth (at the hardware level) on modern Intel processors is a major challenge and may be a good problem for a whole Ph.D. thesis. In practice, though, the peak bandwidth is less important than the STREAM bandwidth in the HPC domain. (Related: See my answer for information on the issues involved in measuring the memory bandwidth at the hardware level.)
Regarding your first question, notice that STREAM just assumes that all reads and writes are satisfied by the main memory and not by any cache. Allocating an array that is much larger than the size of the LLC helps in making it more likely that this is the case. Essentially, complex and undocumented aspects of the LLC including the replacement policy and the placement policy need to be defeated. It doesn't have to be exactly 4x larger than the LLC. My understanding is that this is what Dr. Bandwidth found to work in practice.

Is there any benefit to using 4kb allocation pools?

Until recently, I thought CPU caches (at least some levels) were 4KB in size, so grouping my allocations into 4KB slabs seemed like a good idea to improve cache locality.
However, reading Line size of L1 and L2 caches I see that cache lines at all levels are 64 bytes, not 4KB. Does this mean my 4KB slabs are useless, and I should just go back to using regular malloc?
Thanks!
4KB does have a significance: it is a common page size for your operating system, and thus for entries in your TLB.
Memory-mapped files, swap space, kernel I/Os (write to file, or socket) will operate on this page size even if you're using only 64 bytes of the page.
For this reason, it can still be somewhat advantageous to use page-sized pools.
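For illustration (a minimal sketch assuming POSIX posix_memalign; the struct and function names are made up, and a real pool allocator would need free-list management and error handling), a page-sized, page-aligned slab can be carved up like this:

#include <stdlib.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* A trivial bump allocator inside one page-aligned 4KB slab. */
struct slab {
    char  *base;
    size_t used;
};

static int slab_init(struct slab *s)
{
    s->used = 0;
    /* page-aligned, so the whole slab maps through a single TLB entry */
    return posix_memalign((void **)&s->base, PAGE_SIZE, PAGE_SIZE);
}

static void *slab_alloc(struct slab *s, size_t n)
{
    n = (n + 15) & ~(size_t)15;               /* keep 16-byte alignment */
    if (s->used + n > PAGE_SIZE) return NULL; /* slab exhausted         */
    void *p = s->base + s->used;
    s->used += n;
    return p;
}

int main(void)
{
    struct slab s;
    if (slab_init(&s) != 0) return 1;
    void *a = slab_alloc(&s, 100);
    void *b = slab_alloc(&s, 200);
    printf("a=%p b=%p (same 4KB page)\n", a, b);
    free(s.base);
    return 0;
}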

How to use Intel Westmere 1GB pages on Linux?

Edit: I updated my question with the details of my benchmark
For benchmarking purposes, I am trying to set up 1GB pages on a Linux 3.13 system running on two Intel Xeon 56xx ("Westmere") processors. For that I modified my boot parameters to add support for 1GB pages (10 pages). These boot parameters request only 1GB pages, not 2MB ones. Running hugeadm --pool-list leads to:
Size Minimum Current Maximum Default
1073741824 10 10 10 *
My kernel boot parameters are taken into account. In my benchmark I am allocating 1GiB of memory that I want to be backed by a 1GiB huge page using:
#define PROTECTION (PROT_READ | PROT_WRITE)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)

uint64_t size = 1UL * 1024 * 1024 * 1024;
/* MAP_HUGETLB uses the default huge page size (1GiB with my boot parameters) */
memory = mmap(0, size, PROTECTION, FLAGS, 0, 0);
if (memory == MAP_FAILED) {
    perror("mmap");
    exit(1);
}
sleep(200);
Looking at /proc/meminfo while the benchmark is sleeping (the sleep call above), we can see that one huge page has been allocated:
AnonHugePages: 4096 kB
HugePages_Total: 10
HugePages_Free: 9
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Note: I disabled THP (through the /sys file system) before running the bench, so I guess the AnonHugePages field reported by /proc/meminfo represents the huge pages allocated by THP before stopping it.
At this point everything seems fine, but unfortunately my benchmark leads me to think that many 2MiB pages are being used rather than a single 1GiB page. Here is why:
The benchmark accesses the allocated memory randomly via pointer chasing: a first step fills the memory so that each cell points to another cell, and a second step walks through the memory using
pointer = *pointer;
Using the perf_event_open system call, I count data TLB read misses for the second step only. When the allocated memory size is 64MiB, the number is tiny: about 0.01% of my 6,400,000 memory accesses miss in the TLB. In other words, 64MiB of memory can be kept in the TLB. As soon as the allocated size exceeds 64MiB I see data TLB read misses; at 128MiB, 50% of my 6,400,000 accesses miss. 64MiB appears to be exactly what fits in the TLB, and 64MiB = 32 entries (as reported below) * 2MiB pages. I conclude that I am not using 1GiB pages but 2MiB ones.
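For reference, the access pattern looks roughly like this (a minimal sketch, not the actual mem_load code; function names and sizes are made up):

#include <stdio.h>
#include <stdlib.h>

/* Link all cells into one random cycle, then chase pointers through it.
 * Every dereference lands on an essentially random page, so each access
 * needs its own TLB translation. */
static void fill_cycle(void **cells, size_t n)
{
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)                /* link along the shuffle */
        cells[idx[i]] = &cells[idx[(i + 1) % n]];
    free(idx);
}

static void *chase(void **p, size_t steps)
{
    while (steps--)
        p = (void **)*p;                          /* pointer = *pointer; */
    return p;
}

int main(void)
{
    size_t n = (64u << 20) / sizeof(void *);      /* 64MiB arena */
    void **cells = malloc(n * sizeof *cells);
    fill_cycle(cells, n);
    /* step 2: this is the loop whose data TLB read misses are counted */
    void *sink = chase(&cells[0], 6400000);
    printf("%p\n", sink);                         /* keep the chase live */
    free(cells);
    return 0;
}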
Can you see any explanation for this behavior?
Moreover, the cpuid tool reports the following about the TLB on my system:
cache and TLB information (2):
0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x55: instruction TLB: 2M/4M pages, fully, 7 entries
0xb0: instruction TLB: 4K, 4-way, 128 entries
0xca: L2 TLB: 4K, 4-way, 512 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
As you can see, there is no information about 1GiB pages. How many such pages can be cached in the TLB?
TL;DR
You (specifically, your processor) cannot benefit from 1GB pages in this scenario, but your code is correct without modifications on systems that can.
Long version
I followed these steps to attempt to reproduce your problem.
My System: Intel Core i7-4700MQ, 32GB RAM 1600MHz, Chipset H87
svn co https://github.com/ManuelSelva/c4fun.git
cd c4fun.git/trunk
make. Discovered a few dependencies were needed. Installed them. Build failed, but mem_load did build and link, so did not pursue the rest further.
Rebooted the system, appending at GRUB time to the boot arguments the following:
hugepagesz=1G hugepages=10 default_hugepagesz=1G
which reserves 10 1GB pages.
cd c4fun.git/trunk/mem_load
Ran several tests using mem_load in random-access mode, pinning it to core 3 (any core other than 0, the bootstrap processor).
./mem_load -a rand -c 3 -m 1073741824 -i 1048576
This resulted in approximately nil TLB misses.
./mem_load -a rand -c 3 -m 10737418240 -i 1048576
This resulted in approximately 60% TLB misses. On a hunch I did
./mem_load -a rand -c 3 -m 4294967296 -i 1048576
This resulted in approximately nil TLB misses. On a hunch I did
./mem_load -a rand -c 3 -m 5368709120 -i 1048576
This resulted in approximately 20% TLB misses.
At this point I downloaded the cpuid utility. It gave me this for cpuid -1 | grep -i tlb:
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
As you can see, my TLB has 4 entries for 1GB pages. This explains my results well: for 1GB and 4GB arenas, the 4 TLB slots are entirely sufficient to satisfy all accesses. For a 5GB arena in random-access mode, only 4 of the 5 pages can be mapped through the TLB, so chasing a pointer into the remaining one causes a miss. The probability of chasing a pointer into the unmapped page is 1/5, so we expect a miss rate of 1/5 = 20%, and that is what we get. For 10GB, 4/10 pages are mapped and 6/10 are not, so the miss rate is 6/10 = 60%, and that is what we got.
So your code works without modification, at least on my system; the code itself does not appear to be the problem.
I then did some research on CPU-World, and while not all CPUs are listed with TLB geometry data, some are. The only one I saw that matched your cpuid printout exactly (there could be more) is the Xeon Westmere-EP X5650; CPU-World does not explicitly say that the Data TLB0 has entries for 1GB pages, but does say the processor has "1 GB large page support".
I then did more research and finally nailed it. An author at RealWorldTech makes an off-hand comment (admittedly, one I have yet to find a primary source for) in the discussion of the memory subsystem of Sandy Bridge. It reads as follows:
After address generation, uops will access the DTLB to translate from a virtual to a physical address, in parallel with the start of the cache access. The DTLB was mostly kept the same, but the support for 1GB pages has improved. Previously, Westmere added support for 1GB pages, but fragmented 1GB pages into many 2MB pages since the TLB did not have any 1GB page entries. Sandy Bridge adds 4 dedicated entries for 1GB pages in the DTLB.
(Emphasis added)
Conclusion
Whatever nebulous concept "CPU supports 1GB pages" represents, Intel thinks it does not imply "TLB supports 1GB page entries". I'm afraid that you will not be able to use 1GB pages on an Intel Westmere processor to reduce the number of TLB misses.
That, or Intel is hoodwinking us by distinguishing huge pages (in the TLB) from large pages.

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64MB, while the block size on disk is generally 4KB.
What does a 64MB block size mean? Does it mean that the smallest unit read from disk is 64MB?
If so, what is the advantage of that? Easier sequential access to large files in HDFS?
Can we achieve the same thing using the disk's original 4KB block size?
What does 64MB block size mean?
The block size is the smallest data unit that a file system can store. If you store a file that's 1KB or 60MB, it will take up one block. Once you cross the 64MB boundary, you need a second block.
If yes, what is the advantage of doing that?
HDFS is meant to handle large files. Let's say you have a 1000MB file. With a 4KB block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the NameNode to determine where that block can be found. That's a lot of traffic! If you use 64MB blocks, the number of requests goes down to 16, significantly reducing the overhead and the load on the NameNode.
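A quick back-of-envelope check of those numbers (a minimal sketch; the 1000MB file and the one-request-per-block model are just the example above):

#include <stdio.h>

/* Number of block-location requests needed to read one file,
 * for different block sizes (1 request per block, as above). */
int main(void)
{
    const long long file    = 1000LL * 1024 * 1024;           /* 1000MB file  */
    const long long sizes[] = {4LL * 1024, 64LL * 1024 * 1024}; /* 4KB, 64MB  */

    for (int i = 0; i < 2; i++) {
        long long blocks = (file + sizes[i] - 1) / sizes[i];   /* round up    */
        printf("block size %lld KB -> %lld blocks/requests\n",
               sizes[i] / 1024, blocks);
    }
    return 0;
}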
HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the two reasons for large block sizes as stated in the original GFS paper (note 1 on GFS terminology vs HDFS terminology: chunk = block, chunkserver = datanode, master = namenode; note 2: bold formatting is mine):
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed in your cluster.
So what is the point of choosing a higher block size instead of some low value? While in theory equal distribution of data is a good thing, too low a block size has significant drawbacks. The NameNode's capacity is limited, so a 4KB block size instead of 128MB means 32768 times more metadata to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice those theoretical benefits are lost to the inability to perform sequential, buffered reads and to the latency of each map task.
In a normal OS the block size is 4KB, while in Hadoop it is 64MB.
This makes it easier to maintain the metadata in the NameNode.
Suppose Hadoop used a 4KB block size and we tried to load 100MB of data: we would need an enormous number of 4KB blocks, and the NameNode would have to maintain metadata for all of them.
If we use a 64MB block size, the data is loaded into only two blocks (64MB and 36MB), so the amount of metadata is much smaller.
Conclusion:
To reduce the burden on the NameNode, HDFS prefers a 64MB or 128MB block size. The default block size is 64MB in Hadoop 1.0 and 128MB in Hadoop 2.0.
It has more to do with disk seeks on HDDs (Hard Disk Drives). Over time, disk seek time has not improved much compared to disk throughput. So when the block size is small (which leads to too many blocks), there are too many disk seeks, which is not very efficient. As we move from HDDs to SSDs, seek time matters less and less, since SSDs have no moving parts.
Also, too many blocks strain the NameNode, which has to store all of the metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64MB, and in Cloudera Hadoop the default is 128MB.
If the block size were set to less than 64MB, there would be a huge number of blocks throughout the cluster, forcing the NameNode to manage an enormous amount of metadata.
Since we need a Mapper for each block, there would be a lot of Mappers, each processing a tiny piece of data, which isn't efficient.
The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.
Having a much smaller block size would cause seek overhead to increase.
Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.
See Google Research Publication: MapReduce
http://research.google.com/archive/mapreduce.html
Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45).
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

How many bytes the cache controller fetches a time from main memory to L2 cache?

I just read two articles on this topic that provide inconsistent information, so I want to know which one is correct. Perhaps both are correct, but in what context?
The first one states that we fetch a page-sized chunk at a time:
The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the memory position that has just been read.
To give you a real example, if the CPU loaded data stored in the address 1,000, the cache controller will load data from "n" addresses after the address 1,000. This number "n" is called a page; if a given processor is working with 4 KB pages (which is a typical value), it will load data from the 4,096 addresses below the current memory position being loaded (address 1,000 in our example). In the following figure, we illustrate this example.
The second one states that we fetch sizeof(cache line) + sizeof(prefetcher) at a time:
So we can summarize how the memory cache works as:
The CPU asks for an instruction/data stored in address "a".
Since the contents of address "a" aren't inside the memory cache, the CPU has to fetch it directly from RAM.
The cache controller loads a line (typically 64 bytes) starting at address "a" into the memory cache. This is more data than the CPU requested, so if the program continues to run sequentially (i.e. asks for address a+1), the next instruction/data the CPU asks for will already be in the memory cache.
A circuit called the prefetcher loads more data located after this line, i.e. starts loading the contents from address a+64 onward into the cache. To give you a real example, Pentium 4 CPUs have a 256-byte prefetcher, so they load the next 256 bytes after the line already loaded into the cache.
Completely hardware implementation dependent. Some implementations load a single line from main memory at a time — and cache line sizes vary a lot between different processors. I've seen line sizes from 64 bytes all the way up to 256 bytes. Basically what the size of a "cache line" means is that when the CPU requests memory from main RAM, it does so n bytes at a time. So if n is 64 bytes, and you load a 4-byte integer at 0x1004, the MMU will actually send 64 bytes across the bus, all the addresses from 0x1000 to 0x1040. This entire chunk of data will be stored in the data cache as one line.
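To make the 0x1004 example concrete (a minimal sketch; _SC_LEVEL1_DCACHE_LINESIZE is a glibc/Linux extension and may report 0 on some systems, so a fallback is used):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Show which cache line a given address falls in. */
int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;                        /* fall back to a common value    */

    uintptr_t addr  = 0x1004;             /* the 4-byte load in the example */
    uintptr_t start = addr & ~((uintptr_t)line - 1);

    printf("line size %ld bytes: a load at %#lx pulls in %#lx..%#lx\n",
           line, (unsigned long)addr,
           (unsigned long)start, (unsigned long)(start + line));
    return 0;
}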
Some MMUs can fetch multiple cache lines across the bus per request -- so a request at address 0x1000 on a machine with 64-byte cache lines might actually load four lines, from 0x1000 to 0x1100. Some systems let you do this explicitly with special cache prefetch or DMA opcodes.
The article at your first link, however, is completely wrong. It confuses the size of an OS memory page with a hardware cache line. These are totally different concepts: the former is the minimum amount of virtual address space the OS will allocate at once, the latter is a detail of how the CPU talks to main RAM.
They resemble each other only in the sense that when the OS runs low on physical memory, it will page some not-recently-used virtual memory to disk; then later on, when you use that memory again, the OS loads that whole page from disk back into physical RAM. This is analogous (but not related) to the way that the CPU loads bytes from RAM, which is why the author of "Hardware Secrets" was confused.
A good place to learn all about computer memory and why caches work the way they do is Ulrich Drepper's paper, What Every Programmer Should Know About Memory.
