NTFS file access time with and without FILE_FLAG_NO_BUFFERING (C)

I am trying to benchmark file read times (sequential access) for NTFS. My code gets a start time, performs a read of size equal to 4096 bytes (cluster size of NTFS on the system) and records the end time. The difference between the two times is then stored and the process is repeated until end of file is reached. The file size I currently use is 40K, so I get 10 time difference values.
When accessing the file opened (using CreateFile) without FILE_FLAG_NO_BUFFERING, access time for the first block is close to 30 microseconds and drops to about 7 microseconds for subsequent accesses (due to caching).
When using FILE_FLAG_NO_BUFFERING, the first block's access time is close to 21 milliseconds and drops to about 175 microseconds for subsequent accesses.
Shouldn't the first block's access time be the same with or without the flag, since it's not buffered either way? Also, why do access times drop after the first read when the flag is used? I was expecting them to remain constant, since we've specified we don't want buffering.
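For reference, a minimal sketch of the kind of measurement loop described above (Windows C; the file name test.dat and the stripped-down error handling are placeholders, and FILE_FLAG_NO_BUFFERING is toggled by editing the flags argument):
#include <windows.h>
#include <stdio.h>
#define BLOCK_SIZE 4096
int main(void)
{
    /* Toggle FILE_FLAG_NO_BUFFERING here to compare the two cases.
       With the flag set, reads must be sector-aligned and a multiple of
       the sector size, so the buffer comes from VirtualAlloc (page-
       aligned) rather than malloc. */
    HANDLE h = CreateFileA("test.dat", GENERIC_READ, 0, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING /* or 0 */, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;
    void *buf = VirtualAlloc(NULL, BLOCK_SIZE, MEM_COMMIT, PAGE_READWRITE);
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    DWORD got;
    for (;;) {
        QueryPerformanceCounter(&t0);
        if (!ReadFile(h, buf, BLOCK_SIZE, &got, NULL) || got == 0)
            break;                        /* end of file or error */
        QueryPerformanceCounter(&t1);
        printf("%.1f us\n",
               (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart);
    }
    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}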

Amongst other things, the access time includes several other (longish) factors besides the actual data transfer time.
Such times include searching the directory structure (the first time only) to find the actual file. This includes a 'head seek' (which is very long, as it requires physical movement of the heads),
then the rotation time to get over the correct sector on the disk,
then the actual data transfer time.
This is followed by another 'head seek' to the beginning of the file's cylinder,
followed by a 'sector seek' to get over the correct sector,
followed by the actual data transfer time.
Subsequent reads will not include the accessing of the directory info.
Any access can (but does not always) include some 'head seek' time, which varies in length and depends on where the heads are currently located and where the desired data is located.
With buffering, subsequent access times are greatly reduced (on most reads) because each actual transfer includes multiple sectors, so the disk only occasionally needs to be accessed.
When not buffered, a lot depends on whether the disk itself performs any buffering (these days, most do). Accessing data that is already in the disk's buffer eliminates all seek times (head and sector), making the transfer much faster.

Related

Giving read() a start position

When you give read a start position - does it slow down read()? Does it have to read everything before the position to find the text it's looking for?
In other words, we have two different read commands,
read(fd,1000,2000)
read(fd,50000,51000)
where, besides the file descriptor, we give it two position arguments:
read(file descriptor, start, end)
Is there a way to implement read so that the two commands take the same amount of computing time?
You don't name a specific file system implementation or one specific language library so I will comment in general.
In general, a file interface will be built directly on top of the OS-level file interface. In the OS-level interface for most types of drives, data can be read in sectors with random access. The drive can seek to the start of a particular sector (without reading data) and can then read that sector without reading any of the data before it in the file. Because data is typically read in chunks by sector, if the data you request doesn't perfectly align on a sector boundary, the OS may read the entire sector containing the first byte you requested, but the extra amount is small and won't make a meaningful difference in performance: once the read/write head is positioned correctly, a sector is typically read in one DMA transfer.
Disk access times to read a given set of bytes for a spinning hard drive are not entirely predictable so it's not possible to design a function that will take exactly the same time no matter which bytes you're reading. This is because there's OS level caching, disk controller level caching and a difference in seek time for the read/write head depending upon what the read/write head was doing beforehand. If there are any other processes or services running on your system (which there always are) some of them may also be using the disk and contending for disk access too. In addition, depending upon how your files were written and how many bytes you're reading and how well your files are optimized, all the bytes you read may or may not be in one long readable sequence. It's possible the drive head may have to read some bytes, then seek to a new position on the disk and then read some more. All of that is not entirely predictable.
Oh, and some of this is different if it's a different type of drive (like an SSD) since there's no drive head to seek.
When you give read a start position - does it slow down read()?
No. The OS reads the directory entry to find out where the file is located on the disk, then calculates where on the disk your desired read should be, seeks to that position on the disk and starts reading.
Does it have to read everything before the position to find the text it's looking for?
No. Since it reads sectors at a time, it may read a few bytes before what you requested (whatever is before it in the sector), but sectors are not huge (typically 512 bytes to a few KB) and are typically read in one fell swoop using DMA, so the extra part of the sector before your desired data is not likely noticeable.
Is there a way to implement read so that the two commands take the same amount of computing time?
So no, not really. Disk reads, even of an identical number of bytes, vary a bit depending upon the situation, what else might be happening on the computer, and what else might already be cached by the OS or the drive itself.
If you share what problem you're really trying to solve, we could probably suggest alternate approaches rather than relying on a given disk read taking an exact amount of time.
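To make the point concrete, here is a small sketch using the POSIX pread() call, which expresses the question's hypothetical read(fd, start, end) as a count plus an offset; the file name and offsets are illustrative:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;
    char buf[1000];
    /* Read 1000 bytes starting at offset 50000 without touching
       the first 50000 bytes of the file. */
    ssize_t n = pread(fd, buf, sizeof buf, 50000);
    if (n < 0) { close(fd); return 1; }
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}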
Well, filesystems usually split the data in a file into even-sized blocks. In most file systems the allocated blocks are organized in trees with a high branching factor, so computing-wise it takes effectively the same time to find the nth data block of the file as the first.
The only general exception to this rule is the brain-damaged floppy-disk file system FAT from Microsoft, which should have become extinct in the 1980s: in it, the blocks of a file are organized in a singly-linked list, so to find the nth block you need to scan through n items in the list. Of course, decent operating systems have all sorts of tricks to address this shortcoming.
Then the next thing is that your reads should touch the same number of blocks or operating-system memory pages. Operating system pages are usually 4K nowadays and disk blocks something like 4K too, so making every offset and count a multiple of 4096, 8192 or 16384 is a better design than using round decimal numbers.
i.e.
read(fd, 4096, 8192)
read(fd, 50 * 4096, 51 * 4096)
While it does not affect the computing time in a multiprocessing system, the type of media matters a lot: on magnetic disks the heads need to move around to find the new read position and the disk must have spun into reading position, whereas SSDs have identical random-access timings regardless of where the data is located. Additionally, the operating system might cache frequently accessed locations, or expect that the block read after block N will be N + 1 and make such ordered reads faster. But most of the time you wouldn't care.
Finally: perhaps instead of read you should consider using memory-mapped I/O for random accesses!
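A minimal sketch of that suggestion, assuming a POSIX system and an existing file data.bin larger than the offset used:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return 1; }
    /* Map the whole file; the OS pages data in on demand. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }
    /* Random access at any offset is plain pointer arithmetic,
       with no explicit seek or read calls. */
    printf("byte at 50000: %d\n", p[50000]);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}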
read typically reads data from the given file descriptor into a buffer. The amount of data it reads runs from start (arg 2) to end (arg 3); more generically put, the amount of data read is (end - start). So if you have the following reads
read(fd1, 0xffff, 0xffffffff)
and
read(fd2, 0xf, 0xff)
the second read will be quicker, because its end (0xff) minus its start (0xf) is less than the first read's end (0xffffffff) minus its start (0xffff); i.e., fewer bytes are being read.

Read a file after write and closing it in C

My code does the following
1. do 100 times: open a new file; write 10M of data; close it
2. open the 100 files together, read and merge their data into a larger file
3. do steps 1 and 2 many times in a loop
I was wondering if I can keep the 100 files open, without opening and closing them so many times. What I can do is fopen them with w+. After writing, I set the position to the beginning to read; after reading, I set the position to the beginning to write; and so on.
The questions are:
if I read after writing without closing, do we always read all the written data?
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to be worth saving?
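A minimal sketch of the w+ pattern described above (file name and data are placeholders). Note that the C standard requires a flush or a repositioning call such as fseek/rewind when switching between writing and reading on an update stream, so the rewind below is not optional:
#include <stdio.h>
int main(void)
{
    FILE *f = fopen("chunk00.dat", "w+");
    if (!f) return 1;
    const char data[] = "10M of data would go here";
    fwrite(data, 1, sizeof data, f);
    rewind(f);                   /* switch from writing to reading */
    char buf[sizeof data];
    size_t n = fread(buf, 1, sizeof buf, f);
    printf("read back %zu bytes\n", n);
    rewind(f);                   /* switch from reading to writing */
    /* ...the next round of writes overwrites the old contents... */
    fclose(f);
    return 0;
}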
Based on the comments and discussion, I will explain why I need to do this in my work. It is also related to my other post:
how to convert large row-based tables into column-based tables efficently
I have a calculation that generates a stream of results. So far the results are saved in a row-storage table. This table has 1M columns; each column could be 10M long. Actually, each column is one attribute the calculation produces. As the calculation runs, I dump and append the intermediate results to the table. The intermediate results could be 2 or 3 double values per column. I wanted to dump them soon because they already consume >16M of memory, and the calculation needs more memory. This ends up as a table like the following:
aabbcc...zzaabbcc..zz.........aabb...zz
A row of data is stored together. The problem happens when I want to analyze the data column by column: I have to read 16 bytes, then seek to the next row to read another 16 bytes, and keep going. There are too many seeks; it is much slower than if each column were stored together so I could read it sequentially.
I can make the calculation dump less frequently. But to make the later reads more efficient, I may want to have 4K of data stored together, since I assume each fread gets 4K by default even if I read only 16 bytes. But this means I would need to buffer 1M * 4K = 4G in memory...
So I was wondering if I can merge the fragmented data into larger chunks, as that post suggests:
how to convert large row-based tables into column-based tables efficently
So I wanted to use files as offline buffers. I may need 256 files to get 4K of contiguous data after the merge, if each file contains 2 doubles for each of the 1M columns. This work can be done asynchronously with respect to the main calculation. But I wanted to ensure the merge overhead is small, so that when it runs in parallel it can finish before the main calculation is done. So I came up with this question.
I guess this is very related to how column-based databases are constructed. When people create them, do they have similar issues? Is there any description of how they work at creation time?
You can use w+ as long as the maximum number of open files on your system allows it; this is usually 255 or 1024, and can be set (e.g. on Unix by ulimit).
But I'm not too sure this will be worth the effort.
On the other hand, 100 files of 10M each is one gigabyte; you might want to experiment with a RAM disk. Or with a large file system cache.
I suspect that much larger savings might be reaped by analyzing your specific problem structure. Why is it 100 files? Why 10M? What kind of "merge" are you doing? Are those 100 files always accessed in the same order and with the same frequency? Could some data be kept in RAM and never be written at all?
Update
So, you have several large buffers like,
ABCDEFG...
ABCDEFG...
ABCDEFG...
and you want to pivot them so they read
AAA...
BBB...
CCC...
If you already have the total size (i.e., you know that you are going to write 10 GB of data), you can do this with two files, pre-allocating the output file and using fseek() to write into it. With memory-mapped files, this should be quite efficient. In practice, row Y, column X (of 1,000,000) has been dumped at offset 16*X in file Y.dat; to pivot, you need to write it at offset 16*(X*NROWS + Y) into largeoutput.dat, where NROWS is the number of rows, so that each column's values end up contiguous.
Actually, you could write the data even during the first calculation. Or you could have two processes communicating via a pipe, one calculating, one writing to both the row-column and column-row files, so that you can monitor the performance of each.
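As a concrete (hypothetical) sketch of this scheme: NROWS row files named 0.dat, 1.dat, ... each hold NCOLS cells of 16 bytes, and each cell is copied to its column-major slot in the output file. All names and counts here are assumptions; a memory-mapped output, as suggested above, would avoid the per-cell fseek:
#include <stdio.h>
#define CELL  16
#define NROWS 100L        /* number of row files (assumption) */
#define NCOLS 1000000L    /* cells per row (from the question) */
int main(void)
{
    FILE *out = fopen("largeoutput.dat", "wb");
    if (!out) return 1;
    char cell[CELL];
    for (long y = 0; y < NROWS; y++) {
        char name[32];
        snprintf(name, sizeof name, "%ld.dat", y);
        FILE *row = fopen(name, "rb");
        if (!row) { fclose(out); return 1; }
        /* Cell (y, x) moves to column-major slot x*NROWS + y; use
           fseeko/_fseeki64 instead if the output can exceed 2 GB. */
        for (long x = 0; x < NCOLS; x++) {
            if (fread(cell, 1, CELL, row) != CELL) break;
            fseek(out, CELL * (x * NROWS + y), SEEK_SET);
            fwrite(cell, 1, CELL, out);
        }
        fclose(row);
    }
    fclose(out);
    return 0;
}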
Frankly, I think that adding more RAM and/or a fast I/O layer (SSD maybe?) could get you more bang for the same buck. Your time costs too, and the memory will remain available after this one work has been completed.
Yes. You can keep the 100 files open without doing the opening-closing-opening cycle. Most systems do have a limit on the number of open files though.
if I read after write w/o closing, do we always read all the written data
It depends on you. You can fseek to wherever you want in the file and read data from there. It's all up to you and your logic.
would this save some overhead? File open and close must have some overhead, but is this overhead large enough to save?
This would definitely save some overhead, such as unnecessary additional I/O operations. Also, on some systems the content you write to a file is not immediately flushed to the physical file; it may be buffered and flushed periodically, or at the time of fclose.
So, such overheads are saved. But the real question is: what do you achieve by saving such overheads? How does it fit into the overall picture of your application? This is the call you must make before deciding on the logic.
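To see the per-process open-file limit both answers refer to, a quick POSIX check (the shell equivalent is ulimit -n):
#include <stdio.h>
#include <sys/resource.h>
int main(void)
{
    struct rlimit rl;
    /* RLIMIT_NOFILE caps the number of simultaneously open files. */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("open files: soft %llu, hard %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}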

Comparing time taken to read() from file system

I have created a program that measures the time taken for a read() to be performed on a file, and I do this several times to determine the block size of my file system.
My question:
After plotting this data, every time I try it, no matter what size I read in each iteration, the first read takes significantly longer than any other read. I know that once a block has been fully read, the next read (from a new block) takes a bit more time (which I have observed in my plot), but this first read value is much higher than that too.
Does anyone have a filesystems/O.S. based answer to why this is the case?
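For context, the measurement loop the question describes would look roughly like this on a POSIX system (the file name and 4096-byte read size are assumptions):
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
int main(void)
{
    int fd = open("test.dat", O_RDONLY);
    if (fd < 0) return 1;
    char buf[4096];
    struct timespec t0, t1;
    ssize_t n;
    for (;;) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        n = read(fd, buf, sizeof buf);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (n <= 0) break;           /* end of file or error */
        printf("%ld ns\n",
               (long)((t1.tv_sec - t0.tv_sec) * 1000000000L
                      + (t1.tv_nsec - t0.tv_nsec)));
    }
    close(fd);
    return 0;
}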
I can think of a couple of reasons why this might be the case.
The file system might cache (pre-fetch) the data read from disk, so that even if it only returns (say) 1 block to your program, it might have actually read multiple blocks from the disk; the next time you do a read, you're actually just pulling more from that cached data. It's also possible that the first read involves the read head having to move to the start of the file; this is probably very file-system-dependent. I think that caching is the more likely cause.

Why is buffered I/O faster than unbuffered I/O

While reading this, I found a reasonable answer, which says:
Case 1: Directly Writing to File On Disk
100 times x 1 ms = 100 ms
I understood that. Next,
Case 3: Buffering in Memory before Writing to File on Disk
(100 times x 0.5 ms) + 1 ms = 51 ms
I didn't understand the 1 ms. What is the difference between writing 100 pieces of data to disk and writing 1 piece of data to disk? Why do both of them cost 1 ms?
The disk access (transferring data to the disk) does not happen byte-by-byte; it happens in blocks. So we cannot conclude that if the time taken to write 1 byte of data is 1 ms, then x bytes of data will take x ms; it is not a linear relation.
The amount of data written to the disk at a time depends on the block size. For example, if a disk access costs you 1 ms and the block size is 512 bytes, then a write of any size between 1 and 512 bytes will cost you the same: 1 ms.
So, coming back to the equation, if you have, say, 16 bytes of data to be written in each operation over 20 iterations, then:
for the direct write case
time = (20 iterations * 1 ms) == 20 ms.
for buffered access
time = (20 iterations * 0.5 ms (buffering time)) + 1 ms (to write all at once) = 10 + 1 == 11 ms.
It is because of how the disk physically works.
Disks can take larger buffers (called pages) and save them in one go.
If you save the data every time, you need multiple alterations of one page; if you use a buffer, you edit quickly accessible memory and then save everything in one go.
His example is explaining the costs of the operations.
Loading the data into memory costs 100 operations of 0.5 ms each, and then you have one alteration of the disk (an I/O operation). What is not described in the answer, and is probably not obvious, is that nearly all disks provide a bulk-transfer operation, so 1 I/O operation means one save to disk, not necessarily a 1-byte save (it can be much more data).
When writing 1 byte at a time, each write requires:
disk seek time (which can vary) to place the 'head' over the correct track on the disk,
disk rotational latency while waiting for the correct sector of the disk to be under the 'head',
disk read time while the sector is read (the rotational latency and sector read time may have to be performed more than once if the CRC does not match that saved on the disk),
time to insert the new byte into the correct location in the sector,
rotational latency waiting for the proper sector to again be under the 'head',
sector write time (including the new CRC).
Repeating all the above for each byte (esp. since a disk is orders of magnitude slower than memory) takes a LOT of time.
It takes no longer to write a whole sector of data than to update a single byte.
That is why writing a buffer full of data is so very much faster than writing a series of individual bytes.
There are also other overheads like updating the inodes that:
track the directories
track the individual file
Each of those directory and file inodes is updated each time the file is updated.
Those inodes are (simply) other sectors on the disk. Overall, lots of disk activity occurs each time a file is modified.
So modifying the file only once rather than numerous times is a major time saving. Buffering is the technique used to minimize the number of disk activities.
Among other things, data is written to disk in whole "blocks" only. A block is usually 512 bytes. Even if you only change a single byte inside the block, the OS and the disk will have to write all 512 bytes. If you change all 512 bytes in the block before writing, the actual write will be no slower than when changing only one byte.
The automatic caching inside the OS and/or the disk does in fact avoid this issue to a great extent. However, every "real" write operation requires a call from your program to the OS, and probably all the way through to the disk driver. This takes some time. In comparison, writing into a char/byte/... array in your own process's memory in RAM costs virtually nothing.
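The difference is easy to demonstrate. A sketch contrasting the two cases (one write() system call per byte versus one large write() from a user-space buffer); the file names and the 4096-byte size are illustrative, and timing the two with time(1) or similar shows the gap:
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#define N 4096
int main(void)
{
    char buf[N];
    memset(buf, 'x', N);
    /* Unbuffered: N system calls, one byte each. */
    int fd1 = open("slow.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd1 < 0) return 1;
    for (int i = 0; i < N; i++)
        write(fd1, &buf[i], 1);
    close(fd1);
    /* Buffered: the data was staged in memory, then one system call. */
    int fd2 = open("fast.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd2 < 0) return 1;
    write(fd2, buf, N);
    close(fd2);
    return 0;
}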

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the data that was written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It may have to do with setting up correctly aligned buffers. I have also tried the posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
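For reference, a sketch of the aligned-buffer setup that O_DIRECT typically requires, shown for Linux; whether and how O_DIRECT is honoured on IBM i (PASE) is a separate question, and the 4096-byte alignment is an assumption about the device's logical block size:
#define _GNU_SOURCE           /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define ALIGN 4096
int main(void)
{
    void *buf;
    /* Buffer address, transfer size and file offset must all be
       multiples of the logical block size under O_DIRECT. */
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) return 1;
    memset(buf, 'x', ALIGN);
    int fd = open("raw.dat", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0) { free(buf); return 1; }
    /* Write exactly one aligned block; short or misaligned writes
       typically fail with EINVAL under O_DIRECT. */
    ssize_t n = write(fd, buf, ALIGN);
    close(fd);
    free(buf);
    return (n == ALIGN) ? 0 : 1;
}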
My thought is that if you write ENOUGH data, there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in the OS/filesystem driver, but I would be very worried about doing that: it will almost certainly slow the system down to a crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER: I guess "none of them" is a slight exaggeration; I know which disks belong to SYSBASE, and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar with IBM i. In the picture below, you will see that the write requests are driving the % busy up, but the read requests are not, even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory-starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk-controller caches overflow, and use a busy machine so that other tasks are demanding disk I/O as well.
