I'm trying to get a better understanding of btrfs compression. I've seen this useful post on reddit
(www.reddit.com/r/linux/comments/1mn5zy/), from which I gather compression is done per extent, i.e. in roughly 150KB units. I'm trying to understand:
Small random writes - does a small random write cause the entire extent to be re-compressed and rewritten to disk?
Page cache - are the extents kept in their compressed form in the page cache and decompressed on demand? If so, this saves a lot of RAM at the cost of repeated decompression (i.e. CPU time).
Related
Like appending log entries at the tail of a file, or like MySQL writing its redo log, people always say that sequential writes are much faster than random writes. But why? I mean, when you write data to disk, the seek time and rotation time dominate the performance. But between two of your consecutive sequential writes, there may be lots of other write requests (like nginx recording access.log). Those write requests may move the magnetic head to other tracks, and when your process does its sequential write, it needs to move the magnetic head back again and incur the rotation time. Even when there's no other process and the magnetic head can stand still, you still need to wait for the rotation.
So, is it true that sequential writes are better than random writes just because, in many cases, a sequential write doesn't incur the seek time while a random write always does, while both sequential and random writes incur the rotation time?
The write performance of a disk is influenced by the physical properties of the storage device (e.g. the physical rotational speed in revolutions per minute in the case of a mechanical disk), the ratio of the disk I/O unit to the I/O request size, and the OS/application.
Physical properties of an HDD
One major drawback of traditional mechanical disks is that, for an I/O request to be honored, the head has to reach the desired starting position (seek delay) and the platter has to reach the desired starting position (rotational delay).
This is true for sequential and random I/O. However, with sequential I/O, this delay gets considerably less noticeable because more data can be written without repositioning the head. An "advanced format" hard disk has a sector size of 4096 bytes (the smallest I/O unit) and a cylinder size in the megabyte-range. A whole cylinder can be read without repositioning the head. So, yes, there's a seek and rotational delay involved but the amount of data that can be read/written without further repositioning is considerably higher. Also, moving from one cylinder to the next is significantly more efficient than moving from the innermost to the outermost cylinder (worst-case seek).
Writing 10 consecutive sectors incurs the seek and rotational delay once; writing 10 sectors spread across the disk incurs them 10 times.
In general, both sequential and random I/O involve seek and rotational delays; sequential I/O takes advantage of sequential locality to minimize them.
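To make the repositioning cost tangible, here is a minimal benchmark sketch (plain C, POSIX assumed; the scratch file name, block size, and block count are arbitrary choices for illustration, not part of the question) that issues the same number of 4 KiB writes once sequentially and once at random offsets:

```c
/* Rough illustration only: compares sequential vs. random 4 KiB writes.
 * Assumes a POSIX system; "scratch.dat" is a hypothetical scratch file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   4096
#define NBLOCKS 4096          /* 16 MiB of data in total */

static double timed_writes(int fd, int sequential)
{
    char buf[BLOCK];
    memset(buf, 'x', sizeof buf);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NBLOCKS; i++) {
        off_t block = sequential ? i : (rand() % NBLOCKS);
        if (pwrite(fd, buf, BLOCK, block * (off_t)BLOCK) != BLOCK) {
            perror("pwrite");
            exit(1);
        }
    }
    fsync(fd);                /* force the data out so the disk is actually measured */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int fd = open("scratch.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    printf("sequential: %.3f s\n", timed_writes(fd, 1));
    printf("random:     %.3f s\n", timed_writes(fd, 0));
    close(fd);
    return 0;
}
```

On an HDD the random pass is typically much slower because of the repeated repositioning; on an SSD the gap is smaller, for the reasons discussed in the next section.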
Physical properties of an SSD
As you know, a solid-state disk does not have moving parts because it's usually built from flash memory. The data is stored in cells. Multiple cells form a page - the smallest I/O unit ranging from 2K to 16K in size. Multiple pages are managed in blocks - a block contains between 128 and 256 pages.
The problem is two-fold. Firstly, a page can only be written to once. If all pages contain data, they cannot be written to again unless the whole block is erased. Assuming that a page in a block needs to be updated and all pages contain data, the whole block must be erased and rewritten.
Secondly, the number of write cycles of an individual block is limited. To prevent approaching or exceeding the maximum number of write cycles faster for some blocks than others, a technique called wear leveling is used so that writes are distributed evenly across all blocks independent from the logical write pattern. This process also involves block erasure.
To alleviate the performance penalty of block erasure, SSDs employ a garbage collection process: the still-valid pages of a block are copied to a new block, the stale pages are dropped, and the original block is erased.
Both aspects can cause more data to be physically read and written than required by the logical write. A single full-page write can trigger a read/write sequence that is 128 to 256 times larger, depending on the page/block ratio. This is known as write amplification. Random writes potentially hit considerably more blocks than sequential writes, making them significantly more expensive.
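As a back-of-the-envelope illustration (the page and block sizes below are assumptions for the sake of the example, not a statement about any particular device):

```c
/* Worst-case write amplification for an in-place update of one page,
 * assuming a hypothetical 4 KiB page and 256 pages per erase block. */
#include <stdio.h>

int main(void)
{
    const long page_size       = 4096;   /* assumed page size           */
    const long pages_per_block = 256;    /* assumed pages per block     */

    long logical_write  = page_size;                     /* what the host asked for  */
    long physical_write = page_size * pages_per_block;   /* what the device rewrites */

    printf("write amplification = %ldx (%ld bytes moved for a %ld-byte update)\n",
           physical_write / logical_write, physical_write, logical_write);
    return 0;
}
```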
Disk I/O unit to I/O request size ratio
As outlined before, a disk imposes a minimum on the I/O unit that can be involved in reads and writes. If a single byte is written to disk, the whole unit has to be read, modified, and written.
In contrast to sequential I/O, where the likelihood of triggering large writes as the I/O load increases is high (e.g. in the case of a database transaction log), random I/O tends to involve smaller I/O requests. As those requests become smaller than the smallest I/O unit, the overhead of processing them increases, adding to the cost of random I/O. This is another example of write amplification as a consequence of storage device characteristics. In this case, however, both HDD and SSD scenarios are affected.
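A rough sketch of that rounding effect (the 4 KiB unit and the request sizes are illustrative assumptions; real devices and file systems differ):

```c
/* Illustrative only: physical bytes touched when every write is rounded
 * up to a whole I/O unit (assumed to be 4096 bytes here). */
#include <stdio.h>

static long physical_bytes(long request, long unit)
{
    long units = (request + unit - 1) / unit;   /* round up to whole units */
    return units * unit;
}

int main(void)
{
    const long unit = 4096;
    long requests[] = { 64, 512, 4096, 65536 };

    for (int i = 0; i < 4; i++) {
        long phys = physical_bytes(requests[i], unit);
        printf("%6ld-byte write -> %6ld bytes physically, overhead %.1fx\n",
               requests[i], phys, (double)phys / requests[i]);
    }
    return 0;
}
```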
OS/application
The OS has various mechanisms to optimize both sequential and random I/O.
A write triggered by an application is usually not processed immediately (unless requested by the application by means of synchronous/direct I/O or a sync command); the changes are executed in memory based on the so-called page cache and written to disk at a later point in time.
By doing so, the OS maximizes the total amount of data available and the size of individual I/Os. Individual I/O operations that would have been inefficient to execute can be aggregated into one potentially large, more efficient operation (e.g. several individual writes to a specific sector can become a single write). This strategy also allows for I/O scheduling, choosing a processing order that is most efficient for executing the I/Os even though the original order as defined by the application or applications was different.
Consider the following scenario: a web server request log and a database transaction log are being written to the same disk. The web server write operations would normally interfere with the database write operations if they were executed in order, as issued by the applications involved. Due to the asynchronous execution based on the page cache, the OS can reorder those I/O requests to trigger two large sequential write requests each. While those are being executed, the database can continue to write to the transaction log without any delay.
One caveat here is that, while this should be true for the web server log, not all writes can be reordered at will. A database triggers a disk sync operation (fsync on Linux/UNIX, FlushFileBuffers on Windows) whenever the transaction log has to be written to stable storage as part of a commit. Then, the OS cannot delay the write operations any further and has to execute all previous writes to the file in question immediately. If the web server were to do the same, there could be a noticeable performance impact because the order is then dictated by those two applications. That is why it's a good idea to place a transaction log on a dedicated disk to maximize sequential I/O throughput in the presence of other disk syncs / large amounts of other I/O operations (the web server log shouldn't be a problem). Otherwise, asynchronous writes based on the page cache might not be able to hide the I/O delays anymore as the total I/O load and/or the number of disk syncs increase.
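To illustrate the commit path described above, here is a minimal sketch (POSIX assumed; the file name and record format are made up for the example): the write itself only dirties the page cache, and it is the fsync that forces the OS to push the log records to stable storage before the commit can be acknowledged.

```c
/* Sketch of a write-ahead-log style commit: buffered write + fsync.
 * "txn.log" and the record format are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int append_commit_record(int fd, const char *record)
{
    /* The write only reaches the page cache; the OS may delay the disk I/O. */
    if (write(fd, record, strlen(record)) < 0)
        return -1;

    /* fsync forces all buffered writes for this file to stable storage
     * before the commit can be acknowledged. */
    return fsync(fd);
}

int main(void)
{
    int fd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (append_commit_record(fd, "COMMIT txn=42\n") != 0)
        perror("commit");

    close(fd);
    return 0;
}
```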
The default data block size of HDFS/Hadoop is 64MB. The block size in the disk is generally 4KB.
What does 64MB block size mean? -> Does it mean that the smallest unit of reading from disk is 64MB?
If yes, what is the advantage of doing that? -> easier continuous access of large files in HDFS?
Can we do the same by using the disk's original 4KB block size?
What does 64MB block size mean?
The block size is the smallest data unit that a file system can store. If you store a file that's 1KB or 60MB, it'll take up one block. Once you cross the 64MB boundary, you need a second block.
If yes, what is the advantage of doing that?
HDFS is meant to handle large files. Let's say you have a 1000MB file. With a 4KB block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the Name Node to determine where that block can be found. That's a lot of traffic! If you use 64MB blocks, the number of requests goes down to 16, significantly reducing the overhead and the load on the Name Node.
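For completeness, the arithmetic behind those request counts is just a ceiling division by the block size; here is a tiny sketch using the numbers from the example above:

```c
/* Number of HDFS block requests for a file, given a block size.
 * Values mirror the example above (1000 MB file, 4 KB vs 64 MB blocks). */
#include <stdio.h>

static long blocks_needed(long file_bytes, long block_bytes)
{
    return (file_bytes + block_bytes - 1) / block_bytes;   /* ceiling division */
}

int main(void)
{
    long file = 1000L * 1024 * 1024;                 /* 1000 MB */
    printf("4 KB blocks : %ld requests\n", blocks_needed(file, 4L * 1024));
    printf("64 MB blocks: %ld requests\n", blocks_needed(file, 64L * 1024 * 1024));
    return 0;
}
```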
HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the two reasons for large block sizes as stated in the original GFS paper (note 1 on GFS terminology vs HDFS terminology: chunk = block, chunkserver = datanode, master = namenode; note 2: bold formatting is mine):
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed across your cluster.
So what's the point of choosing a higher block size instead of some low value? While in theory an equal distribution of data is a good thing, too low a block size has some significant drawbacks. The NameNode's capacity is limited, so having a 4KB block size instead of 128MB also means having 32768 times more metadata to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and CPU cores, but in practice the theoretical benefits are lost because sequential, buffered reads are no longer possible and because of the latency of each map task.
In a normal OS the block size is 4KB, while in Hadoop it is 64MB.
This makes it easier to maintain the metadata in the NameNode.
Suppose we had only a 4KB block size in Hadoop and we were trying to load 100MB of data; we would then need a very large number of 4KB blocks, and the NameNode would have to maintain metadata for all of them.
If we use a 64MB block size, the data is loaded into only two blocks (64MB and 36MB), so the size of the metadata is decreased.
Conclusion:
To reduce the burden on the NameNode, HDFS prefers a 64MB or 128MB block size. The default block size is 64MB in Hadoop 1.0 and 128MB in Hadoop 2.0.
It has more to do with disk seeks of the HDD (Hard Disk Drive). Over time, disk seek time has not been improving much compared to disk throughput. So, when the block size is small (which leads to too many blocks), there will be too many disk seeks, which is not very efficient. As we move from HDD to SSD, disk seek time matters less, as there are no moving parts in an SSD.
Also, if there are too many blocks it will strain the Name Node. Note that the Name Node has to store the entire metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB and in Cloudera Hadoop the default is 128 MB.
If the block size were set to less than 64MB, there would be a huge number of blocks throughout the cluster, which would cause the NameNode to manage an enormous amount of metadata.
Since we need a Mapper for each block, there would be a lot of Mappers, each processing a tiny piece of data, which isn't efficient.
The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.
Having a much smaller block size would cause seek overhead to increase.
Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.
See Google Research Publication: MapReduce
http://research.google.com/archive/mapreduce.html
Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45).
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.

This argument shouldn't be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
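The quoted back-of-the-envelope calculation is easy to reproduce; here is a tiny sketch using the book's assumed figures (10 ms seek time, 100 MB/s transfer rate, seeks limited to 1% of transfer time):

```c
/* Reproduces the book's calculation: how large a block must be
 * for the seek to cost only 1% of the transfer time.
 * The seek time and transfer rate are the figures assumed in the quote. */
#include <stdio.h>

int main(void)
{
    double seek_s        = 0.010;        /* 10 ms seek time                     */
    double transfer_MBps = 100.0;        /* 100 MB/s transfer rate              */
    double seek_fraction = 0.01;         /* seeks allowed to be 1% of transfer  */

    /* transfer_time = block / rate; we want seek_s <= seek_fraction * transfer_time */
    double block_MB = seek_s * transfer_MBps / seek_fraction;

    printf("block size needed: %.0f MB\n", block_MB);   /* -> 100 MB */
    return 0;
}
```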
I have a program that processes a large dataset consisting of a large number (300+) of sizable (40MB+) memory-mapped files. All the files are needed together, though they are accessed sequentially. At the moment I am memory mapping the files and then using madvise with MADV_SEQUENTIAL, since I don't want the thing to be any more of a memory hog than it needs to be (without any madvise the consumption becomes a problem). The problem is that the program runs much slower (like 50x slower) than the disk I/O of the system would indicate it should, and it becomes worse faster than linearly as the number of files involved increases. Processing 100 files is more than 10x faster than processing 300 files, even though the latter is only 3x the data. I suspect that the memory-mapped files are generating a page fault every time a 4KB page boundary is crossed, with the net result that disk seek time exceeds disk transfer time.
Can anyone think of a better way than using madvise with MADV_WILLNEED and MADV_DONTNEED every so often, and if this is the best way, any ideas as to how far to look ahead?
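For what it's worth, here is a minimal sketch of the windowed madvise approach the question describes (POSIX assumed; the input file name, the 8 MiB window size, and the empty process() function are placeholders, not recommendations): prefetch the next window with MADV_WILLNEED before processing the current one, then release the finished window with MADV_DONTNEED.

```c
/* Sketch of windowed readahead over a memory-mapped file:
 * MADV_WILLNEED on the window about to be processed, MADV_DONTNEED on the
 * window just finished. The 8 MiB window size is an arbitrary assumption. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define WINDOW (8UL * 1024 * 1024)

static void process(const unsigned char *p, size_t len)
{
    /* placeholder for the real per-window work */
    (void)p; (void)len;
}

int main(void)
{
    int fd = open("data.bin", O_RDONLY);        /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    size_t size = (size_t)st.st_size;

    unsigned char *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    madvise(map, size, MADV_SEQUENTIAL);

    for (size_t off = 0; off < size; off += WINDOW) {
        size_t len = (size - off < WINDOW) ? size - off : WINDOW;

        /* ask the kernel to start reading the next window ahead of time */
        if (off + len < size)
            madvise(map + off + len,
                    (size - (off + len) < WINDOW) ? size - (off + len) : WINDOW,
                    MADV_WILLNEED);

        process(map + off, len);

        /* let the kernel reclaim the window we just finished */
        madvise(map + off, len, MADV_DONTNEED);
    }

    munmap(map, size);
    close(fd);
    return 0;
}
```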
I have an I/O-intensive simulation program that logs the simulation trace/data to a file at every iteration. As the simulation runs for millions of iterations or so and logs the data to a file on disk (overwriting the file each time), I am curious whether that would wear out the hard disk, since most storage devices have an upper limit on write/erase cycles (e.g. flash disks allow up to 100,000 write/erase cycles). Would splitting the file into multiple files be a better option?
You need to recognize that a million write calls to a single file may only write to each block of the disk once, which doesn't cause any harm to magnetic disks or SSD devices. If you overwrite the first block of the file one million times, you run a greater risk of wearing things out, but there are lots of mitigating factors. First, if it is a single run of a program, the OS is likely to keep the disk image in memory without writing to disk at all in the interim (unless, perhaps, you're using a journalled file system). If it is a journalled file system, then the actual writing will be spread over lots of different blocks.
If you manage to write to the same block on a magnetic spinning hard disk a million times, you are still not at serious risk of wearing the disk out.
A Google search on 'hard disk write cycles' shows a lot of informative articles (more particularly, perhaps, about SSD), and the related searches may also help you out.
On an SSD, there is a limited number of writes (or, more accurately, erase cycles) to any particular block. It's probably somewhere from 100K to 1 million for any given block, and SSDs use "wear leveling" to avoid unnecessary writes to the same block every time. SSDs can only write zeros, so when you "reset" a bit to one, you have to erase the whole block. [One could put an inverter on the cell to make it the other way around, but you get one or the other, so it doesn't help much.]
Real hard disks are more of a mechanical device, so there isn't so much of a problem with how many times you write to the same place; it's more about the head movements.
I wouldn't worry too much about it. Writing one file should be fine, it has little consequence whether you have one file or many.
I have a program that reads from a file and performs operations on it (counting word frequencies). I have 4 different file sizes, and I get cache speed on all but the largest. Why does the largest file only run at disk speed no matter how many times I run it? Does too much RAM usage prevent the cache from working? The large file is 27 GB, running on Windows. This is about file caching, not CPU caching.
Cache == memory. Run out of memory, you run out of cache. If you have a file that is greater than the size of the cache, and you're streaming through it, it's as if you had pretty much no cache at all. Cache only helps when you read the data again; it has no effect on the first read.
When the file is greater than the memory, there is never any of the original file left in memory when you try to re-use it, so the cache has pretty much no value in that case. The other dark side is that, when you do that, you may well lose the cache on all of the other small files that the system accesses often, since they are no longer cached. So it may take a bit longer for things to reload and get back up to speed.
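If you want to see the effect directly, a small sketch (written for a POSIX system, so it only mirrors the Windows situation described in the question; the file name is a placeholder) is to time two consecutive sequential reads of the same file: if the file fits in RAM, the second pass runs at memory speed, and if it doesn't, both passes run at disk speed.

```c
/* Times two consecutive sequential reads of the same file.
 * If the file fits in the page cache, the second pass is much faster.
 * "bigfile.dat" is a placeholder; POSIX system assumed. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double read_all(const char *path)
{
    static char buf[1 << 20];                /* 1 MiB read buffer */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (read(fd, buf, sizeof buf) > 0)
        ;                                    /* just stream through the file */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("first pass : %.2f s\n", read_all("bigfile.dat"));
    printf("second pass: %.2f s\n", read_all("bigfile.dat"));
    return 0;
}
```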