RAMDISK data transfers in C (fread)

Right, so I'm trying to optimize a program that needs to read a huge image file (1.3 GB) in C/OpenCL in order to transfer it to the device in 40 MB blocks.
I created a RAMDISK with tmpfs to store the file, but when I analyze the transfer rates I find that reading from the RAMDISK is actually a bit slower than reading the image file from my SSD.
So I'm wondering: does reading the file (with fopen/fread) do a RAM-to-RAM transfer to store the data in the buffer? Or is it the filesystem's overhead that causes this performance issue?
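For reference, a minimal sketch of the kind of 40 MB-block read loop described above (the tmpfs path and the hand-off to clEnqueueWriteBuffer are assumptions for illustration, not the actual program):

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (40UL * 1024 * 1024)   /* 40 MB per transfer */

int main(void)
{
    /* hypothetical path on the tmpfs mount */
    FILE *f = fopen("/mnt/ramdisk/image.raw", "rb");
    if (!f) { perror("fopen"); return 1; }

    char *buf = malloc(BLOCK_SIZE);
    if (!buf) { fclose(f); return 1; }

    size_t n;
    while ((n = fread(buf, 1, BLOCK_SIZE, f)) > 0) {
        /* each block would be handed to the device here,
           e.g. via clEnqueueWriteBuffer (not shown) */
    }

    free(buf);
    fclose(f);
    return 0;
}

With this pattern, fread always copies from the page cache (or from tmpfs pages) into buf, so there is a RAM-to-RAM copy either way; measuring this loop against both the tmpfs path and the SSD path is what produces the comparison described above.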

Related

C Disk I/O - write after read at the same offset of a file will make read throughput very low

Background:
I'm developing a database-related program, and I need to flush dirty metadata from memory to disk sequentially.
/dev/sda1 is used as a raw volume (no file system on it), so data on /dev/sda1 is accessed block by block, and the blocks are physically adjacent when accessed sequentially.
And I use direct I/O, so the I/O bypasses the file system's caching mechanism and accesses the blocks on the disk directly.
Problems:
After opening /dev/sda1, I read one block, update it, and write it back to the same offset from the beginning of /dev/sda1, iteratively.
The code looks like this:
// block_size = 256 KB; with O_DIRECT the buffer must be suitably aligned (e.g. posix_memalign)
int file = open("/dev/sda1", O_RDWR | O_LARGEFILE | O_DIRECT);
for (int i = 0; i < N; i++) {
    pread(file, buffer, block_size, (off_t)i * block_size);
    // Update the buffer
    pwrite(file, buffer, block_size, (off_t)i * block_size);
}
I found that if I don't do pwrite, read throughput is 125 MB/s.
If I do pwrite, read throughput will be 21 MB/s, and write throughput is 169 MB/s.
If I do pread after pwrite, write throughput is 115 MB/s, and read throughput is 208 MB/s.
I also tried read()/write() and aio_read()/aio_write(), but the problem remains. I don't know why writing after reading at the same position of a file makes the read throughput so low.
If I access more blocks at a time, like this:
pread(file, buffer, num_blocks * block_size, i * block_size);
the problem is mitigated (see the chart).
And I use direct I/O, so the I/O bypasses the file system's caching mechanism and accesses the blocks on the disk directly.
If you don't have a file system on the device and you are using the device directly for reads and writes, then there is no file system cache in the picture.
The behavior you observed is typical disk access and I/O behavior.
I found that if I don't do pwrite, read throughput is 125 MB/s
Reason: The disk just reads data; it doesn't have to go back to the offset and write data, so that's one less operation.
If I do pwrite, read throughput will be 21 MB/s, and write throughput is 169 MB/s.
Reason: Your disk might simply have better write speed; most likely the disk's buffer is caching the writes rather than hitting the media directly.
If I do pread after pwrite, write throughput is 115 MB/s, and read throughput is 208 MB/s.
Reason: Most likely the data that was written is being cached at the disk level, so the read gets the data from the cache instead of the media.
To get optimal performance, you should use asynchronous I/O and transfer a number of blocks at a time. However, you have to use a reasonable number of blocks; you can't use a very large number. You should find out what is optimal by trial and error.
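A sketch of that multi-block approach, reusing the raw-device setup from the question (the 16-blocks-per-request figure and the 4096-byte alignment are assumptions; as said above, the right chunk size has to be found by trial and error):

#define _GNU_SOURCE            /* for O_DIRECT and O_LARGEFILE */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static void rewrite_blocks(const char *dev, size_t block_size,
                           int num_blocks, long total_blocks)
{
    int fd = open(dev, O_RDWR | O_LARGEFILE | O_DIRECT);
    if (fd < 0)
        return;

    /* O_DIRECT requires an aligned buffer; 4096 bytes is assumed here,
       the real requirement depends on the device's logical block size */
    size_t chunk = (size_t)num_blocks * block_size;
    void  *buf   = NULL;
    if (posix_memalign(&buf, 4096, chunk) != 0) { close(fd); return; }

    /* assumes total_blocks is a multiple of num_blocks */
    for (long i = 0; i < total_blocks; i += num_blocks) {
        pread(fd, buf, chunk, (off_t)i * block_size);
        /* update the whole chunk in memory here */
        pwrite(fd, buf, chunk, (off_t)i * block_size);
    }

    free(buf);
    close(fd);
}

Called as, say, rewrite_blocks("/dev/sda1", 256 * 1024, 16, N), this issues one 4 MB request to the disk instead of sixteen separate 256 KB requests.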

How to read/write at maximum speed from a hard disk? The multi-threaded program I coded cannot go above 15 MB/sec

I have 256 CSV files totaling 5 GB which I need to read at optimum speed and then write the data back in binary form.
I made the following arrangements to achieve this:
For each file, there is one corresponding thread.
I am using the C functions fscanf and fwrite.
But Resource Monitor shows no more than 12 MB/sec of hard disk throughput and 100% Highest Active Time.
Google says a hard disk can read/write at up to 100 MB/sec.
Machine configuration: Intel Core i7, 3.4 GHz, 8 cores.
Please give me your perspective.
My aim is to complete this process within 1 minute.
Using one thread it took me 12 minutes.
If all the files reside on the same disk, using multiple threads is likely to be counter-productive. If you read from many files in parallel, the HDD heads will keep moving back and forth between different areas of the disk, drastically reducing throughput.
I would measure how long it takes a built-in OS utility to read the files (on Unix, something like dd or cat into /dev/null) and then use that as a baseline, bearing in mind that you also need to write stuff back. Writing can be costly both in terms of throughput and seek times.
I would then come up with a single-threaded implementation that reads and writes data in large chunks, and see whether I can get it to perform similarly to the OS tools.
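For example, a minimal single-threaded sketch along those lines (the file names, the 8 MB stdio buffers, and the one-double-per-field CSV layout are assumptions; the real CSV-to-binary conversion would go in the loop):

#include <stdio.h>

#define CHUNK (8 * 1024 * 1024)   /* 8 MB stdio buffers (assumed size) */

int convert_file(const char *csv_path, const char *bin_path)
{
    FILE *in  = fopen(csv_path, "r");
    FILE *out = fopen(bin_path, "wb");
    if (!in || !out) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return -1;
    }

    /* large stdio buffers so the disk sees big sequential transfers
       instead of many tiny ones */
    setvbuf(in,  NULL, _IOFBF, CHUNK);
    setvbuf(out, NULL, _IOFBF, CHUNK);

    double value;
    while (fscanf(in, "%lf,", &value) == 1) {   /* assumed CSV layout */
        fwrite(&value, sizeof value, 1, out);   /* write the binary form */
    }

    fclose(in);
    fclose(out);
    return 0;
}

Processing the 256 files one after another in a single loop, rather than with 256 threads, keeps the disk reading and writing sequentially instead of seeking between files.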
P.S. If you have 5 GB of data, your HDD's top raw throughput is 100 MB/s, and you also need to write the converted data back onto the same disk, your goal of 1 minute is not realistic.

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64 MB. The block size on a disk is generally 4 KB.
What does a 64 MB block size mean? Does it mean that the smallest unit read from disk is 64 MB?
If yes, what is the advantage of doing that? Does it make continuous access of large files in HDFS easier?
Can we do the same by using the disk's original 4 KB block size?
What does a 64 MB block size mean?
The block size is the smallest data unit that a file system can store. If you store a file that's 1 KB or 60 MB, it'll take up one block. Once you cross the 64 MB boundary, you need a second block.
If yes, what is the advantage of doing that?
HDFS is meant to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to get that file (one request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the NameNode to determine where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, significantly reducing the overhead and the load on the NameNode.
HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the reasons for large block sizes as stated in the original GFS paper (note on GFS vs. HDFS terminology: chunk = block, chunkserver = datanode, master = namenode):
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed in your cluster.
So what's the point of choosing a higher block size instead of some low value? While in theory equal distribution of data is a good thing, too low a block size has some significant drawbacks. The NameNode's capacity is limited, so having a 4 KB block size instead of 128 MB also means having 32,768 times more metadata to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice the theoretical benefits are lost by not being able to perform sequential, buffered reads and because of the latency of each map task.
In a normal OS the block size is 4 KB, and in Hadoop it is 64 MB.
This is to make maintaining the metadata in the NameNode easier.
Suppose we had only a 4 KB block size in Hadoop and we were trying to load 100 MB of data; we would then need a very large number of 4 KB blocks, and the NameNode would have to maintain metadata for all of them.
If we use a 64 MB block size then the data is loaded into only two blocks (64 MB and 36 MB), so the amount of metadata is decreased.
Conclusion:
To reduce the burden on the NameNode, HDFS prefers a 64 MB or 128 MB block size. The default block size is 64 MB in Hadoop 1.0 and 128 MB in Hadoop 2.0.
It has more to do with the disk seeks of HDDs (hard disk drives). Over time, disk seek time has not improved much compared to disk throughput. So, when the block size is small (which leads to too many blocks) there will be too many disk seeks, which is not very efficient. As we move from HDDs to SSDs, seek time matters less, since SSDs have no moving parts.
Also, if there are too many blocks it will strain the NameNode. Note that the NameNode has to store the entire metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB, and in Cloudera's Hadoop the default is 128 MB.
If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would cause the NameNode to manage an enormous amount of metadata.
Since we need a mapper for each block, there would be a lot of mappers, each processing a tiny piece of data, which isn't efficient.
The reason Hadoop chose 64 MB was that Google chose 64 MB. The reason Google chose 64 MB was a Goldilocks argument.
Having a much smaller block size would cause seek overhead to increase.
Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.
See Google Research Publication: MapReduce
http://research.google.com/archive/mapreduce.html
Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45):
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
This argument shouldn't be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
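To spell out the book's calculation: keeping a 10 ms seek at 1% of the total means the data transfer itself must take about 10 ms × 100 = 1 s, and 1 s at 100 MB/s moves 100 MB, which is where the roughly 100 MB block size comes from.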

NAND JFFS2 file system - binary & text files can exceed the size of the NAND

I am writing an embedded application based on an ARM9 v5 processor and am using 64 MB of NAND. My problem is that when I copy text or binary files of 3-4 MB in size, the free space reported gets reduced by only a few KB, whereas ls -l shows the file size in MB.
By repeating the same process I reached a point where df shows only 10 MB free while du shows a total size of 239 MB.
I have only 64 MB of NAND, so how am I able to add files totaling 239 MB?
JFFS2 is a compressed filesystem, so it keeps the files compressed on the flash, which explains this apparent conflict: du lists the disk usage, while df reports the available capacity as seen by the filesystem.

Sectors written when over-writing a file?

Imagine there is a file of size 5 MB. I open it in write mode in C and then fill it with exactly 5 MB of junk data. Will the same disk sectors that were previously used be overwritten, or will the OS select new disk sectors for the new data?
It depends on the file system.
Classically, the answer would be 'yes, the same sectors would be overwritten with the new data'.
With journalled file systems, the answer might be different. With flash drive systems, the answer would almost certainly be 'no; new sectors will be written to avoid wearing out the currently written sectors'.
The filesystem could do anything it wishes. But any real file system will write the data back to the same sectors.
Imagine if it didn't. Every time you wrote to a file, the file system would have to find a new free sector, write to that sector, then update the file system metadata for the file to point to the new sector. This would also cause horrible file fragmentation, because writing a single sector in the middle of your contiguous 5 MB file would cause it to fragment. So it's much easier to just write back to the same sector.
The only exception I can think of is JFFS2 because it was designed to support wear leveling on flash.
Now, the file system will write to the same sector, but the disk hardware could write anywhere it wants. In fact, on SSDs/flash drives the hardware is almost guaranteed to write the data to a different sector in order to handle wear leveling. But that is transparent to the OS/file system. (It's possible on hard drives as well, due to sector sparing.)
