While reading this, I found a reasonable answer, which says:
Case 1: Directly Writing to File On Disk
100 writes x 1 ms = 100 ms
I understood that. Next,
Case 3: Buffering in Memory before Writing to File on Disk
(100 writes x 0.5 ms) + 1 ms = 51 ms
I didn't understand the 1 ms. What is the difference between writing 100 pieces of data to disk and writing 1 piece of data to disk? Why do both of them cost 1 ms?
Disk access (transferring data to the disk) does not happen byte by byte; it happens in blocks. So we cannot conclude that if writing 1 byte of data takes 1 ms, then x bytes of data will take x ms. It is not a linear relation.
The amount of data written to the disk at a time depends on the block size. For example, if a disk access costs you 1 ms and the block size is 512 bytes, then a write of anywhere between 1 and 512 bytes will cost you the same: 1 ms.
So, coming back to the equation: if you have, say, 16 bytes of data to be written in each operation for 20 iterations, then
for the direct-write case
time = 20 iterations * 1 ms = 20 ms
for buffered access
time = (20 iterations * 0.5 ms (buffering time)) + 1 ms (to write it all at once) = 10 + 1 = 11 ms
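As a rough sketch of the two cases in C (the write() call and the 16-byte/20-iteration numbers mirror the example above; the file names are placeholders, not from the original answer):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 16
#define ITERATIONS  20

/* Direct case: one disk write per 16-byte record (20 I/O operations). */
static void write_direct(int fd, const char *record)
{
    for (int i = 0; i < ITERATIONS; i++)
        write(fd, record, RECORD_SIZE);
}

/* Buffered case: accumulate the records in memory (cheap), then issue
   a single disk write at the end (1 I/O operation). */
static void write_buffered(int fd, const char *record)
{
    char buf[RECORD_SIZE * ITERATIONS];
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(buf + i * RECORD_SIZE, record, RECORD_SIZE);
    write(fd, buf, sizeof buf);
}

int main(void)
{
    const char record[RECORD_SIZE] = "0123456789abcde";

    int fd = open("direct.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write_direct(fd, record);
    close(fd);

    fd = open("buffered.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write_buffered(fd, record);
    close(fd);
    return 0;
}
```

Both versions push the same 320 bytes to disk; they differ only in how many separate I/O operations are issued.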
It is because of how the disk physically works.
Disks can take larger buffers (called pages) and save them in one go.
If you save the data every time, you need multiple alterations of one page; if you use a buffer, you edit quickly accessible memory and then save everything in one go.
His example explains the cost of each operation.
For loading data into memory you have 100 operations of 0.5 ms each, and then one operation that alters the disk (an I/O operation). What is not described in the answer, and is probably not obvious, is that nearly all disks provide a bulk-transfer operation. So 1 I/O operation means 1 save to the disk, not necessarily saving 1 byte (it can be much more data).
When writing 1 byte at a time, each write requires:
disk seek time (which can vary) to place the 'head' over the correct track on the disk,
disk rotational latency while waiting for the correct sector of the disk to be under the 'head',
disk read time while the sector is read (the rotational latency and sector read time may have to be performed more than once if the CRC does not match that saved on the disk),
inserting the new byte into the correct location in the sector,
rotational latency waiting for the proper sector to again be under the 'head',
sector write time (including the new CRC).
Repeating all the above for each byte (esp. since a disk is orders of magnitude slower than memory) takes a LOT of time.
It takes no longer to write a whole sector of data than to update a single byte.
That is why writing a buffer full of data is so very much faster than writing a series of individual bytes.
There are also other overheads like updating the inodes that:
track the directories
track the individual file
Each of those directory and file inodes is updated each time the file is updated.
Those inodes are (simply) other sectors on the disk. Overall, lots of disk activity occurs each time a file is modified.
So modifying the file only once rather than numerous times is a major time saving. Buffering is the technique used to minimize the number of disk activities.
Among other things, data is written to disk in whole "blocks" only. A block is usually 512 bytes. Even if you only change a single byte inside the block, the OS and the disk will have to write all 512 bytes. If you change all 512 bytes in the block before writing, the actual write will be no slower than when changing only one byte.
The automatic caching inside the OS and/or the disk does in fact avoid this issue to a great extent. However, every "real" write operation requires a call from your program to the OS and probably all the way through to the disk driver. This takes some time. In comparison, writing into a char/byte/... array in your own process's memory in RAM costs virtually nothing.
Related
When you give read a start position - does it slow down read()? Does it have to read everything before the position to find the text it's looking for?
In other words, we have two different read commands,
read(fd,1000,2000)
read(fd,50000,51000)
where we give it two arguments:
read(file descriptor, start, end)
Is there a way to implement read so that the two commands take the same amount of computing time?
You don't name a specific file system implementation or one specific language library so I will comment in general.
In general, a file interface will be built directly on top of the OS level file interface. In the OS level interface for most types of drives, data can be read in sectors with random access. The drive can seek to the start of a particular sector (without reading data) and can then read that sector without reading any of the data before it in the file. Because data is typically read in chunks by sector, if the data you request doesn't perfectly align on a sector boundary, it's possible the OS will read the entire sector containing the first byte you requested, but it won't be a lot and won't make a meaningful difference in performance as once the read/write head is positioned correctly, a sector is typically read in one DMA transfer.
Disk access times to read a given set of bytes for a spinning hard drive are not entirely predictable so it's not possible to design a function that will take exactly the same time no matter which bytes you're reading. This is because there's OS level caching, disk controller level caching and a difference in seek time for the read/write head depending upon what the read/write head was doing beforehand. If there are any other processes or services running on your system (which there always are) some of them may also be using the disk and contending for disk access too. In addition, depending upon how your files were written and how many bytes you're reading and how well your files are optimized, all the bytes you read may or may not be in one long readable sequence. It's possible the drive head may have to read some bytes, then seek to a new position on the disk and then read some more. All of that is not entirely predictable.
Oh, and some of this is different if it's a different type of drive (like an SSD) since there's no drive head to seek.
When you give read a start position - does it slow down read()?
No. The OS reads the directory entry to find out where the file is located on the disk, then calculates where on the disk your desired read should be, seeks to that position on the disk and starts reading.
Does it have to read everything before the position to find the text it's looking for?
No. Since it reads sectors at a time, it may read a few bytes before what you requested (whatever is before it in the sector), but sectors are not huge (often 8K) and are typically read in one fell swoop using DMA so that extra part of the sector before your desired data is not likely noticeable.
Is there a way to implement read so that the two commands take the same amount of computing time?
So no, not really. Disk reads, even of identical number of bytes vary a bit depending upon the situation and what else might be happening on the computer and what else might be cached already by the OS or the drive itself.
If you share what problem you're really trying to solve, we could probably suggest alternate approaches rather than relying on a given disk read taking an exact amount of time.
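To illustrate those answers, here is a minimal POSIX sketch (the file name is a placeholder, not from the question); pread() reads starting at a given offset, so reading at byte 50000 does not require reading any of the bytes before it:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file name, used for illustration only. */
    int fd = open("datafile.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[1000];

    /* Read bytes 1000..1999: the OS seeks straight to the containing
       sectors; it does not read bytes 0..999 first. */
    ssize_t n1 = pread(fd, buf, sizeof buf, 1000);

    /* Read bytes 50000..50999: same cost model, just a different seek. */
    ssize_t n2 = pread(fd, buf, sizeof buf, 50000);

    printf("read %zd and %zd bytes\n", n1, n2);
    close(fd);
    return 0;
}
```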
Well, filesystems usually split the data in a file into even-sized blocks. In most file systems the allocated blocks are organized in trees with a high branching factor, so it effectively takes the same time, computing-wise, to find the nth data block as the first data block of the file.
The only general exception to this rule is the brain-damaged floppy-disk file system FAT from Microsoft, which should have become extinct in the 1980s, because in it the blocks of a file are organized in a singly-linked list, so to find the nth block you need to scan through n items in the list. Of course, decent operating systems have all sorts of tricks to address the shortcomings here.
The next thing is that your reads should touch the same number of blocks or operating-system memory pages. Operating-system pages are usually 4 KB nowadays and disk blocks something like 4 KB too, so making every offset and count a multiple of 4096, 8192 or 16384 is a better design than using round decimal numbers.
i.e.
read(fd, 4096, 8192)
read(fd, 50 * 4096, 51 * 4096)
While it does not affect the computing time in a multiprocessing system, the type of media matters a lot: on magnetic disks the heads need to move around to find the new read position, and the disk must spin to the reading position, whereas SSDs have identical random-access timings regardless of where on the disk the data is positioned. Additionally, the operating system might cache frequently accessed locations, or expect that the block read after block N will be N + 1 and hence that such an access order will be faster. But most of the time you wouldn't care.
Finally: perhaps instead of read you should consider using memory mapped I/O for random accesses!
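A minimal sketch of that memory-mapped approach on a POSIX system (the file name is a placeholder; offsets that are multiples of the 4 KB page size line up with whole pages, as suggested above):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file name, used for illustration only. */
    int fd = open("datafile.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; pages are faulted in lazily on first access. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Random access is now just pointer arithmetic; a page-aligned
       offset touches whole 4 KiB pages cleanly. */
    if (st.st_size > 2 * 4096)
        printf("byte at offset 8192: %d\n", data[2 * 4096]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```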
Read typically reads data from the given file descriptor into a buffer. The amount of data it reads runs from start (arg 2) to end (arg 3); put more generically, the amount of data read can be found with (end - start). So if you have the following reads
read(fd1, 0xffff, 0xffffffff)
and
read(fd2, 0xf, 0xff)
the second read will be quicker because its end (0xff) minus its start (0xf) is less than the first read's end (0xffffffff) minus its start (0xffff). In other words, fewer bytes are being read.
I have to implement an optimal solution to store sensor values with timestamps into NOR flash when the connection is lost, and to send them to the central server when the connection comes back. A queue-like implementation is needed. Can anyone please suggest an implementation, open source or proprietary, or any algorithms for the same? It should have properties like wear levelling, write fail-safety and erase fail-safety.
It is a 256 Mb Spansion NOR flash (S25FL256S). I need to store less than 64 bytes (including the timestamp) every 60 seconds, if there is no connection. The page size of the flash is 256 bytes and the sector size is 256 KB. The erase-cycle endurance of the flash is 100,000.
A simple solution is to use N flash sectors where N > 1. Ideally the sectors will be of equal size, and smaller sectors are more efficient if only a small amount of data storage is needed. The sectors start blank (all 0xFF). Each record size is ideally a factor of the total sector size (so 64 or 128 bytes might be good choices). Each record comprises:
<timestamp><data><integrity check>
Each record is written sequentially and contiguously to the flash memory, but critically the <integrity check> is written last. The <integrity check> may be a CRC or simply the one's complement of the <timestamp> for example. I suggest a Unix Epoch timestamp of type time_t as it is simple and universal. Writing the <integrity check> last allows the detection of incomplete records (due to power fail during write for example).
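A possible sketch of such a record in C (the exact field layout, the one's-complement check and the flash_write() routine are illustrative assumptions, not part of any specific driver API):

```c
#include <stdint.h>
#include <string.h>

#define DATA_SIZE 52   /* keeps the whole record at 64 bytes */

/* One 64-byte log record: timestamp and data first, integrity check last. */
typedef struct {
    uint32_t timestamp;          /* Unix epoch seconds */
    uint8_t  data[DATA_SIZE];    /* sensor payload */
    uint32_t pad;                /* reserved, keeps the check at the end */
    uint32_t check;              /* one's complement of the timestamp */
} log_record_t;

/* flash_write() is a placeholder for the NOR-flash driver's program
   routine; it is assumed here, not a real API. */
extern int flash_write(uint32_t addr, const void *src, uint32_t len);

int log_append(uint32_t addr, uint32_t timestamp, const uint8_t *payload)
{
    log_record_t rec;
    rec.timestamp = timestamp;
    memcpy(rec.data, payload, DATA_SIZE);
    rec.pad = 0xFFFFFFFF;                 /* leave as erased */
    rec.check = ~timestamp;               /* integrity check */

    /* Write everything except the check word first... */
    if (flash_write(addr, &rec, sizeof rec - sizeof rec.check) != 0)
        return -1;
    /* ...and only then the check word, so a record interrupted by a
       power failure is detectable (check still erased or mismatched). */
    return flash_write(addr + sizeof rec - sizeof rec.check,
                       &rec.check, sizeof rec.check);
}
```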
As each sector boundary is crossed, the next sector is erased, wrapping from sector N-1 to 0, so you will always have at least N-1 sectors' worth of data and up to N sectors' worth (this is why a larger number of small sectors is better than a small number of large sectors). This circular buffer of flash sectors provides wear-levelling. For example, given 2 * 256 KB sectors and 64-byte records written once per minute, each sector will be erased every 2.844 days, so for an endurance of 100,000 erase/write cycles the flash will endure 779 years. This is overkill, but it uses the minimum number of sectors, which here happen to be rather large; you need to calculate the endurance for any specific combination of N and sector size:
(((N - 1) * sector_size) / record_size) * write_period * flash_endurance
For example, 2 x 4 KB sectors with 64-byte records written every minute would last only 12 years.
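A quick check of those two figures with the formula above (a sketch; the constants are the values used in this answer):

```c
#include <stdio.h>

/* Lifetime in years: ((N-1) * sector_size / record_size) writes fill the
   buffer once, each write_period minutes apart, repeated once per
   erase/write cycle of the flash. */
static double lifetime_years(int n_sectors, double sector_size,
                             double record_size, double write_period_min,
                             double erase_cycles)
{
    double minutes = ((n_sectors - 1) * sector_size / record_size)
                     * write_period_min * erase_cycles;
    return minutes / 60.0 / 24.0 / 365.0;
}

int main(void)
{
    /* 2 x 256 KB sectors, 64-byte records, one per minute, 100,000 cycles */
    printf("256 KB sectors: %.0f years\n",
           lifetime_years(2, 256.0 * 1024, 64, 1, 100000));  /* ~779 */
    /* 2 x 4 KB sectors, same records */
    printf("4 KB sectors:   %.1f years\n",
           lifetime_years(2, 4096, 64, 1, 100000));          /* ~12.2 */
    return 0;
}
```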
Initialisation at start-up requires scanning the flash, reading the timestamps to find the highest timestamp that is not 0xffffffff (or 0xffffffffffffffff for a 64-bit time_t) and that passes the integrity check, so as to know where to write the next record (immediately after this "most recent" record). Note that if you really want to run for 779 years - or even past Jan 19 2038 - you will need a 64-bit timestamp, but for most purposes you'll get away with 32 - at least for the product warranty period!
A 2 * 256 KB sector implementation guarantees that at least 256 KB of history data will be available; for 64-byte records, that is a little more than 68 hours. If only 2 * 4 KB sectors were used, the minimum history is reduced to 64 minutes. You could log only when the connection is lost, but it may be better to log continuously and then have the remote client request missing data by start/end timestamp whenever it needs it. This approach will allow data to be retrieved even when it is "lost" for reasons other than connection loss detectable by your system, and it could allow multiple clients to be connected that may independently lose connection and need to recover data independently.
I am trying to benchmark file read times (sequential access) for NTFS. My code gets a start time, performs a read of size equal to 4096 bytes (cluster size of NTFS on the system) and records the end time. The difference between the two times is then stored and the process is repeated until end of file is reached. The file size I currently use is 40K, so I get 10 time difference values.
When accessing the file opened (using CreateFile) without FILE_FLAG_NO_BUFFERING, access time for the first block is close to 30 micro-seconds and drops to about 7 micro-seconds for subsequent accesses (due to caching).
When using FILE_FLAG_NO_BUFFERING, the first block's access time is close to 21 milli-seconds and drops to about 175 micro-seconds for subsequent accesses.
Shouldn't the first block's access time be the same with or without the flag, since it's not buffered? Also, why do access times drop after the first read when the flag is used? I was expecting them to remain constant, since we've specified that we don't want buffering.
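For reference, a minimal sketch of this kind of benchmark loop on Windows (the path is a placeholder; FILE_FLAG_NO_BUFFERING requires a sector-aligned buffer, obtained here with _aligned_malloc):

```c
#include <windows.h>
#include <stdio.h>
#include <malloc.h>

#define BLOCK 4096

int main(void)
{
    /* Placeholder path; remove FILE_FLAG_NO_BUFFERING to compare runs. */
    HANDLE h = CreateFileA("test.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE) { printf("open failed\n"); return 1; }

    /* Unbuffered reads must use a buffer aligned to the sector size. */
    void *buf = _aligned_malloc(BLOCK, BLOCK);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    DWORD n = 0;
    do {
        QueryPerformanceCounter(&t0);
        if (!ReadFile(h, buf, BLOCK, &n, NULL)) break;
        QueryPerformanceCounter(&t1);
        if (n > 0)
            printf("%u bytes in %.1f us\n", (unsigned)n,
                   (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart);
    } while (n == BLOCK);

    _aligned_free(buf);
    CloseHandle(h);
    return 0;
}
```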
Amongst other things, the access time includes several other (longish) factors besides the actual data transfer time.
Such times include searching the directory structure (the first time only) to find the actual file; this includes a 'head seek' time (which is very long, as it requires the physical movement of the heads),
then the rotation time to get over the correct sector on the disk,
then the actual data transfer time.
This is followed by a 'head seek' time to the actual beginning of the file cylinder,
followed by a 'sector seek' time to get over the correct sector,
followed by the actual data transfer time.
Subsequent reads will not include the accessing of the directory info.
Any access can (but does not always) include some 'head seek' time, which varies in length and depends on where the heads are currently located and where the desired data is located.
With buffering, the subsequent access times are greatly reduced (on most reads) because each actual transfer includes multiple sectors, so the program only occasionally needs to actually access the disk.
When not buffered, a lot depends on whether the disk itself performs any buffering (these days, most do perform local buffering). Accessing data that is already in the disk's buffer eliminates all seek times (head and sector), making the transfer much faster.
I have 5 GB of data in 256 CSV files which I need to read at optimum speed and then write back in binary form.
I made the following arrangements to achieve it:
For each file, there is one corresponding thread.
I am using the C functions fscanf and fwrite.
But Resource Monitor shows no more than 12 MB/sec of hard-disk throughput and 100% Highest Active Time.
Google says a hard disk can read/write at up to 100 MB/sec.
The machine configuration is:
Intel Core i7, 3.4 GHz, with 8 cores.
Please give me your perspective.
My aim is to complete this process within 1 minute.
Using one thread, it took me 12 minutes.
If all the files reside on the same disk, using multiple threads is likely to be counter-productive. If you read from many files in parallel, the HDD heads will keep moving back and forth between different areas of the disk, drastically reducing throughput.
I would measure how long it takes a built-in OS utility to read the files (on Unix, something like dd or cat into /dev/null) and then use that as a baseline, bearing in mind that you also need to write stuff back. Writing can be costly both in terms of throughput and seek times.
I would then come up with a single-threaded implementation that reads and writes data in large chunks, and see whether I can get it to perform similarly to the OS tools.
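A minimal single-threaded sketch of that idea in C (the 8 MB chunk size and the file names are arbitrary illustrative choices, and the CSV-to-binary conversion itself is left as a stub):

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (8 * 1024 * 1024)   /* read/write in 8 MB chunks */

int main(void)
{
    /* Placeholder names; in practice loop over the 256 input files. */
    FILE *in  = fopen("input.csv", "rb");
    FILE *out = fopen("output.bin", "wb");
    if (!in || !out) { perror("fopen"); return 1; }

    char *buf = malloc(CHUNK);
    if (!buf) return 1;
    size_t n;

    /* One big sequential read followed by one big sequential write per
       chunk keeps the disk head streaming instead of seeking. */
    while ((n = fread(buf, 1, CHUNK, in)) > 0) {
        /* ... convert the CSV text in buf to binary here ... */
        fwrite(buf, 1, n, out);
    }

    free(buf);
    fclose(in);
    fclose(out);
    return 0;
}
```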
P.S. If you have 5 GB of data, your HDD's top raw throughput is 100 MB/s, and you also need to write the converted data back onto the same disk, your goal of 1 minute is not realistic.
The default data block size of HDFS/Hadoop is 64MB. The block size in the disk is generally 4KB.
What does a 64 MB block size mean? Does it mean that the smallest unit read from disk is 64 MB?
If yes, what is the advantage of doing that? Easier continuous access to large files in HDFS?
Can we do the same by using the disk's original 4KB block size?
What does 64MB block size mean?
The block size is the smallest data unit that a file system can store. If you store a file that's 1 KB or 60 MB, it'll take up one block. Once you cross the 64 MB boundary, you need a second block.
If yes, what is the advantage of doing that?
HDFS is meant to handle large files. Let's say you have a 1000Mb file. With a 4k block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the Name Node to determine where that block can be found. That's a lot of traffic! If you use 64Mb blocks, the number of requests goes down to 16, significantly reducing the cost of overhead and load on the Name Node.
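The arithmetic behind those request counts, as a quick sketch (1000 MB is the file size from the example above):

```c
#include <stdio.h>

int main(void)
{
    long long file_mb = 1000;

    /* 4 KB blocks: 1000 MB / 4 KB = 256,000 block requests. */
    long long small_blocks = file_mb * 1024 * 1024 / (4 * 1024);

    /* 64 MB blocks: 1000 MB / 64 MB rounds up to 16 block requests. */
    long long large_blocks = (file_mb + 63) / 64;

    printf("4 KB blocks:  %lld requests\n", small_blocks);  /* 256000 */
    printf("64 MB blocks: %lld requests\n", large_blocks);  /* 16 */
    return 0;
}
```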
HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the two reasons for large block sizes as stated in the original GFS paper (note 1 on GFS terminology vs HDFS terminology: chunk = block, chunkserver = datanode, master = namenode; note 2: bold formatting is mine):
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed across your cluster.
So what's the point of choosing a higher block size instead of some low value? While in theory equal distribution of data is a good thing, having too low a block size has some significant drawbacks. The NameNode's capacity is limited, so having a 4 KB block size instead of 128 MB also means having 32768 times more information to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice the theoretical benefits are lost because sequential, buffered reads are no longer possible and because of the latency of each map task.
In a normal OS the block size is 4 KB, and in Hadoop it is 64 MB.
Why? For easier maintenance of the metadata in the NameNode.
Suppose we had only a 4 KB block size in Hadoop and we were trying to load 100 MB of data into it; we would then need a very large number of 4 KB blocks (25,600 of them), and the NameNode would need to maintain metadata for all of those blocks.
If we use a 64 MB block size, the data is loaded into only two blocks (64 MB and 36 MB), so the amount of metadata is decreased.
Conclusion:
To reduce the burden on the NameNode, HDFS prefers a 64 MB or 128 MB block size. The default block size is 64 MB in Hadoop 1.0 and 128 MB in Hadoop 2.0.
It has more to do with the disk seeks of HDDs (hard disk drives). Over time, disk seek time has not improved much compared to disk throughput. So, when the block size is small (which leads to too many blocks) there will be too many disk seeks, which is not very efficient. As we move from HDDs to SSDs, seek time matters less, since there are no moving parts in an SSD.
Also, if there are too many blocks it will strain the NameNode. Note that the NameNode has to store the entire metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB and in Cloudera Hadoop the default is 128 MB.
If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would cause the NameNode to manage an enormous amount of metadata.
Since we need a Mapper for each block, there would be a lot of Mappers, each processing a tiny piece of data, which isn't efficient.
The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.
Having a much smaller block size would cause seek overhead to increase.
Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.
See Google Research Publication: MapReduce
http://research.google.com/archive/mapreduce.html
Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45).
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
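As a quick check of the book's back-of-the-envelope figure (using the quoted 10 ms seek time and 100 MB/s transfer rate):

```c
#include <stdio.h>

int main(void)
{
    double seek_s        = 0.010;   /* 10 ms seek time */
    double transfer_mb_s = 100.0;   /* 100 MB/s transfer rate */

    /* Block size for which the seek is 1% of the transfer time:
       transfer_time = 100 * seek_time  =>  block = rate * 100 * seek. */
    double block_mb = transfer_mb_s * 100.0 * seek_s;

    printf("block size for 1%% seek overhead: %.0f MB\n", block_mb); /* 100 */
    return 0;
}
```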