When you give read a start position - does it slow down read()? Does it have to read everything before the position to find the text it's looking for?
In other words, we have two different read commands,
read(fd,1000,2000)
read(fd,50000,51000)
where we give it two arguments:
read(file descriptor, start, end)
is there a way to implement read so that the two commands take the same amount of computing time?
You don't name a specific file system implementation or one specific language library so I will comment in general.
In general, a file interface will be built directly on top of the OS level file interface. In the OS level interface for most types of drives, data can be read in sectors with random access. The drive can seek to the start of a particular sector (without reading data) and can then read that sector without reading any of the data before it in the file. Because data is typically read in chunks by sector, if the data you request doesn't perfectly align on a sector boundary, it's possible the OS will read the entire sector containing the first byte you requested, but it won't be a lot and won't make a meaningful difference in performance as once the read/write head is positioned correctly, a sector is typically read in one DMA transfer.
Disk access times to read a given set of bytes for a spinning hard drive are not entirely predictable so it's not possible to design a function that will take exactly the same time no matter which bytes you're reading. This is because there's OS level caching, disk controller level caching and a difference in seek time for the read/write head depending upon what the read/write head was doing beforehand. If there are any other processes or services running on your system (which there always are) some of them may also be using the disk and contending for disk access too. In addition, depending upon how your files were written and how many bytes you're reading and how well your files are optimized, all the bytes you read may or may not be in one long readable sequence. It's possible the drive head may have to read some bytes, then seek to a new position on the disk and then read some more. All of that is not entirely predictable.
Oh, and some of this is different if it's a different type of drive (like an SSD) since there's no drive head to seek.
When you give read a start position - does it slow down read()?
No. The OS reads the directory entry to find out where the file is located on the disk, then calculates where on the disk your desired read should be, seeks to that position on the disk and starts reading.
Does it have to read everything before the position to find the text it's looking for?
No. Since it reads sectors at a time, it may read a few bytes before what you requested (whatever is before it in the sector), but sectors are not huge (often 8K) and are typically read in one fell swoop using DMA so that extra part of the sector before your desired data is not likely noticeable.
Is there a way to implement read so that the two commands take the same amount of computing time?
So no, not really. Disk reads, even of identical number of bytes vary a bit depending upon the situation and what else might be happening on the computer and what else might be cached already by the OS or the drive itself.
If you share what problem you're really trying to solve, we could probably suggest alternate approaches rather than relying on a given disk read taking an exact amount of time.
Well, filesystems usually split the data in a file in even-sized blocks. In most file systems the allocated blocks are organized in trees with high branching factor so it is effectively the same time to find the the nth data block than the first data block of the file, computing-wise.
The only general exception to this rule is the brain-damaged floppy disk file system FAT from Microsoft that should have become extinct in 1980s, because in it the blocks of the file are organized in a singly-linked list so to find the nth block you need to scan through n items in the list. Of course decent operating systems then have all sorts of tricks to address the shortcomings here.
Then the next thing is that your reads should touch the same number of blocks or operating system memory pages. Usually operating system pages are 4K nowadays and disk blocks something like 4k too so having every count being a multiple of 4096, 8192 or 16384 is better design than to have decimal even numbers.
i.e.
read(fd, 4096, 8192)
read(fd, 50 * 4096, 51 * 4096)
While it does not affect the computing time in a multiprocessing system, the type of media affects a lot: in magnetic disks the heads need to move around to find the new read position, and the disk must have spun to be in the reading position whereas SSDs have identical random access timings regardless of where on disk the data is positioned. And additionally the operating system might cache frequently accessed locations or expect that the block that is read after N would be N + 1 and hence such order be faster. But most of the time you wouldn't care.
Finally: perhaps instead of read you should consider using memory mapped I/O for random accesses!
Read typically reads data from the given file descriptor into a buffer. The amount of data it reads is from start (arg2) - end (arg3). More generically put the amount of data read can be found with (end-start). So if you have the following reads
read(fd1, 0xffff, 0xffffffff)
and
read(fd2, 0xf, 0xff)
the second read will be quicker because the end (0xff) - the start (0xf) is less than the first reads end (0xffffffff) - start (0xffff). AKA less bytes are being read.
Related
I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.
I've run into a situation where lseek'ing forward repeatedly through a 500MB file and reading a small chunk (300-500 bytes) between each seek appears to be slower than read'ing through the whole file from the beginning and ignoring the bytes I don't want. This appears to be true even when I only do 5-10 seeks (so when I only end up reading ~1% of the file). I'm a bit surprised by this -- why would seeking forward repeatedly, which should involve less work, be slower than reading which actually has to copy the data from kernel space to userspace?
Presumably on local disk when seeking the OS could even send a message to the drive to seek without sending any data back across the bus for even more savings. But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
Regardless of whether reading from local disk or a network filesystem, how could this happen? My only guess is the OS is prefetching a ton of data after each location I seek to. Is this something that can normally occur or does it likely indicate a bug in my code?
The magnitude of the difference will be a factor of the ratio of the seek count/data being read to the size of the entire file.
But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
If there's rotational magnetic drives at the other end of the network, the effect will still be present and likely significantly compounded by the round trip time. The network protocol may play a role too. Even solid state drives may take some penalty.
I/O schedulers may reorder requests in order to minimize head movements (perhaps naively even for storage devices without a head). A single bulk request might give you some greater efficiency across many layers. The filesystems have an opportunity to interfere here somewhat.
Regardless of whether reading from local disk or a network filesystem, how could this happen?
I wouldn't be quick to dismiss the effect of those layers -- do you have measurements which show the same behavior from a local disk? It's much easier to draw conclusions without quite so much between you and the hardware. Start with a raw device and bisect from there.
Have you considered using a memory map instead? It's perfect for this use case.
Depending on the filesystem, the specific lseek implementation make create some overhead.
For example, I believe when using NFS, lseek locks the kernel by calling remote_llseek().
I've been experimenting with the performance of reading and writing files on Linux, specifically O_DIRECT, and I'm wondering, both at a hard drive level and the posix/Linux API level, is it possible to write only a few bytes to a sector, without destroying the rest of the sector, and without reading it first?
My experience with disk drives is that they expect data to be sent to them in entire sectors. So, basically, there's no way of writing less than an entire sector and if you wish to change the start of a sector without changing the end, you must read the whole sector, modify and write back. That is partly to do with how the disk head interacts with the platter (for physical disks anyway. In the case of flash drives, it's more likely to be with how small a chunk of the flash can be erased in one go).
In a portable way? Probably not.
In Linux and a few other Unix-like systems, you can open the block device for the drive, seek to a position (probably aligned to the sector size) and write some data to it, but I don't know what effect it would have on the remaining portion of that block.
Your best bet is to try it out on a virtual machine and see what happens. (Obviously, you'll have to have permission to write to the block device.)
Overview
I have a program bounded significantly by IO and am trying to speed it up.
Using mmap seemed to be a good idea, but it actually degrades the performance relative to just using a series of fgets calls.
Some demo code
I've squeezed down demos to just the essentials, testing against an 800mb file with about 3.5million lines:
With fgets:
char buf[4096];
FILE * fp = fopen(argv[1], "r");
while(fgets(buf, 4096, fp) != 0) {
// do stuff
}
fclose(fp);
return 0;
Runtime for 800mb file:
[juhani#xtest tests]$ time ./readfile /r/40/13479/14960
real 0m25.614s
user 0m0.192s
sys 0m0.124s
The mmap version:
struct stat finfo;
int fh, len;
char * mem;
char * row, *end;
if(stat(argv[1], &finfo) == -1) return 0;
if((fh = open(argv[1], O_RDONLY)) == -1) return 0;
mem = (char*)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if(mem == (char*)-1) return 0;
madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);
row = mem;
while((end = strchr(row, '\n')) != 0) {
// do stuff
row = end + 1;
}
munmap(mem, finfo.st_size);
close(fh);
Runtime varies quite a bit, but never faster than fgets:
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m28.891s
user 0m0.252s
sys 0m0.732s
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m42.605s
user 0m0.144s
sys 0m0.472s
Other notes
Watching the process run in top, the memmapped version generated a few thousand page faults along the way.
CPU and memory usage are both very low for the fgets version.
Questions
Why is this the case? Is it just because the buffered file access implemented by fopen/fgets is better than the aggressive prefetching that mmap with madvise POSIX_MADV_SEQUENTIAL?
Is there an alternative method of possibly making this faster(Other than on-the-fly compression/decompression to shift IO load to the processor)? Looking at the runtime of 'wc -l' on the same file, I'm guessing this might not be the case.
POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.
The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entierly, whereas fgets has the IO entirely done in kernel space and just copies the pages into a buffer that doesn't change.
This also has more potential for overlap, since the IO is done by some kernel thread.
You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads reading the first byte of each page. This (or these) thread then would have all the page faults and the time your application thread would come at a particular page it would already be loaded.
Reading the man pages of mmap reveals that the page faults could be prevented by adding MAP_POPULATE to mmap's flags:
MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This way a page faulting pre-load thread (as suggested by Jens) will become obsolete.
Edit:
First of all the benchmarks you make should be done with the page cache flushed to get meaningful results:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Additionally: The MADV_WILLNEED advice with madvise will pre-fault the required pages in (same as the POSIX_FADV_WILLNEED with fadvise). Currently unfortunately these calls block until the requested pages are faulted in, even if the documentation tells differently. But there are kernel patches underway which queue the pre-fault requests into a kernel work-queue to make these calls asynchronous as one would expect - making a separate read-ahead user space thread obsolete.
What you're doing - reading through the entire mmap space - is supposed to trigger a series of page faults. with mmap, the OS only lazily loads pages of the mmap'd data into memory (loads them when you access them). So this approach is an optimization. Although you interface with mmap as if the entire thing is in RAM, it is not all in RAM - it is just a chunk set aside in virtual memory.
In contrast, when you do a read of a file into a buffer the OS pulls the entire structure into RAM (into your buffer). This can apply alot of memory pressure, crowding out other pages, forcing them to be written back to disk. It can lead to thrashing if you're low on memory.
A common optimization technique when using mmap is to page-walk the data into memory: loop through the mmap space, incrementing your pointer by the page size, accessing a single byte per page and triggering the OS to pull all the mmap's pages into memory; triggering all these page faults. This is an optimization technique to "prime the RAM", pulling the mmap in and readying it for future use. With this approach, the OS won't need to do as much lazy loading. You can do this on a separate thread to lead the pages in prior to your main threads access - just make sure you don't run out of RAM or get too far ahead of the main thread, you'll actually begin to degrade performance.
What is the difference between page walking w/ mmap and read() into a large buffer? That's kind of complicated.
Older versions of UNIX, and some current versions, don't always use demand-paging (where the memory is divided up into chunks and swapped in / out as needed). Instead, in some cases, the OS uses traditional swapping - it treats data structures in memory as monolithic, and the entire structure is swapped in / out as needed. This may be more efficient when dealing with large files, where demand-paging requires copying pages into the universal buffer cache, and may lead to frequent swapping or even thrashing. Swapping may avoid use of the universal buffer cache - reducing memory consumption, avoiding an extra copy operation and avoiding frequent writes. Downside is you can't benefit from demand-paging.
With mmap, you're guaranteed demand-paging; with read() you are not.
Also bear in mind that page-walking in a full mmap memory space is always about 60% slower than a flat out read (not counting if you use MADV_SEQUENTIAL or other optimizations).
One note when using mmap w/ MADV_SEQUENTIAL - when you use this, you must be absolutely sure your data IS stored sequentially, otherwise this will actually slow down the paging in of the file by about 10x. Usually your data is not mapped to a continuous section of the disk, it's written to blocks that are spread around the disk. So I suggest you be careful and look closely into this.
Remember, too much data in RAM will pollute the RAM, making page faults alot more common elsewhere. One common misconception about performance is that CPU optimization is more important than memory footprint. Not true - the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with todays SSDs. Therefor, when program execution speed is a concern, memory footprint and utilization is far more important.
A nice thing about read() is the data can be stored on the stack (assuming the stack is large enough), which will further speed up processing.
Using read() with a streaming approach is a good alternative to mmap, if it fits your use case. This is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented with read). Here what you do is, in a loop, read into a buffer, process the data, & then read in the next section / overwrite the old data. Streaming like this can keep your memory consumption very low, and can be the most efficient way of doing I/O. The only downside is that you never have the entire file in memory at once, and it doesn't persist in memory. So it's a one-off approach. If you can use it - great, do it. If not... use mmap.
So whether read or mmap is faster... it depends on many factors. Testing is probably what you need to do. Generally speaking, mmap is nice if you plan on using the data for an extended period, where you will benefit from demand-paging; or if you just can't handle that amount of data in memory at once. Read() is better if you are using a streaming approach - the data doesn't have to persist, or the data can fit in memory so memory pressure isn't a concern. Also if the data won't be in memory for very long, read() may be preferable.
Now, with your current implementation - which is a sort of streaming approach - you are using fgets() and stopping on \n. Large, bulk reads are more efficient than calling read() repeatedly a million times (which is what fgets does). You don't have to use a giant buffer - you don't want excess memory pressure (which can pollute your cache & other things), & the system also has some internal buffering it uses. But you do want to be reading into a buffer of... lets say 64k in size. You definitely dont want to be calling read line by line.
You could multithread the parsing of that buffer. Just make sure the threads access data in different cache blocks - so find the size of the cache block, get your threads working on different portions of the buffer distanced by at least the cache block size.
Some more specific suggestions for your particular problem:
You might try reformatting the data into some binary format. For example, try changing the file encoding to a custom format instead of UTF-8 or whatever it is. That could reduce its size. 3.5 million lines is quite alot of characters to loop through... it's probably ~150 million character comparisons that you are doing.
If you can sort the file by line length prior to the program running... you can write an algorithm to much more quickly parse the lines - just increment a pointer and test the character you arrive at, making sure it's '\n'. Then do whatever processing you need to do.
You'll need to find a way to maintain the sorted file by inserting new data into appropriate places with this approach.
You can take this a step further - after sorting your file, maintain a list of how many lines of a given length are in the file. Use that to guide your parsing of lines - jump right to the end of each line w/out having to do character comparisons.
If you can't sort the file, just create a list of all the offsets from the start of each line to its terminating newline. 3.5 million offsets.
Write algorithms to update that list on insertion/deletion of lines from the file
When you get into file processing algorithms such as this... it begins to resemble the implementation of a noSQL database. An alternative might just be to insert all this data into a noSQL database. Depends on what you need to do: believe it or not, sometimes just raw custom file manipulation & maintenance described above is faster than any database implementation, even noSQL databases.
A few more things:
When you use this streaming approach with read() you must take care to handle the edge cases - where you reach the end of one buffer, and start a new buffer - appropriately. That's called buffer-stitching.
Lastly, on most modern systems when you use read() the data still gets stored in the universal buffer cache and then copied into your process. That's an extra copy operation. You can disable the buffer cache to speed up the IO in certain cases where you're handling big files. Beware, this will disable paging. But if the data is only in memory for a brief time, this doesn't matter.
The buffer cache is important - find a way to reenable it after the IO was finished. Maybe disable it just for the particular process, do your IO in a separate process, or something... I'm not sure about the details, but this is something that can be done.
I don't think that's actually your problem, though, tbh I think the character comparisons - once you fix that it should just be fine.
That's the best I've got, maybe the experts will have other ideas.
Carry onward!
I am considering using a FAT file system for an embedded data logging application. The logger will only create one file to which it continually appends 40 bytes of data every minute. After a couple years of use this would be over one million write cycles. MY QUESTION IS: Does a FAT system change the File Allocation Table every time a file is appended? How does it keep track where the end of the file is? Does it just put an EndOfFile marker at the end or does it store the length in the FAT table? If it does change the FAT table every time I do a write, I would ware out the FLASH memory in just a couple of years. Is a FAT system the right thing to use for this application?
My other thought is that I could just store the raw data bytes in the memory card and put an EndOfFile marker at the end of my data every time I do a write. This is less desirable though because it means the only way of getting data out of the logger is through serial transfers and not via a PC and a card reader.
FAT updates the directory table when you modify the file (at least, it will if you close the file, I'm not sure what happens if you don't). It's not just the file size, it's also the last-modified date:
http://en.wikipedia.org/wiki/File_Allocation_Table#Directory_table
If your flash controller doesn't do transparent wear levelling, and your flash driver doesn't relocate things in an effort to level wear, then I guess you could cause wear. Consult your manual, but if you're using consumer hardware I would have thought that everything has wear-levelling somewhere.
On the plus side, if the event you're worried about only occurs every minute, then you should be able to speed that up considerably in a test to see whether 2 years worth of log entries really does trash your actual hardware. Might even be faster than trying to find the relevant manufacturer docs...
No, a flash file system driver is explicitly designed to minimize the wear and spread it across the memory cells. Taking advantage of the near-zero seek time. Your data rates are low, it's going to last a long time. Specifying a yearly replacement of the media is a simple way to minimize the risk.
If your only operation is appending to one file it may be simpler to forgo a filesystem and use the flash device as a data tape. You have to take into account the type of flash and its block size, though.
Large flash chips are divided into sub-pages that are a power-of-two multiple of 264 (256+8) bytes in size, pages that are a power-of-two multiple of that, and blocks which are a power-of-two multiple of that. A blank page will read as all FF's. One can write a page at a time; the smallest unit one can write is a sub-page. Once a sub-page is written, it may not be rewritten until the entire block containing it is erased. Note that on smaller flash chips, it's possible to write the bytes of a page individually, provided one only writes to blank bytes, but on many larger chips that is not possible. I think in present-generation chips, the sub-page size is 528 bytes, the page size is 2048+64 bytes, and the block size is 128K+4096 bytes.
An MMC, SD, CompactFlash, or other such card (basically anything other than SmartMedia) combines a flash chip with a processor to handle PC-style sector writes. Essentially what happens is that when a sector is written, the controller locates a blank page, writes a new version of that sector there along with up to 16 bytes of 'header' information indicating what sector it is, etc. The controller then keeps a map of where all the different pages of information are located.
A SmartMedia card exposes the flash interface directly, and relies upon the camera, card reader, or other device using it to perform such data management according to standard methods.
Note that keeping track of the whereabouts of all 4,000,000 pages on a 2 gig card would require either having 12-16 megs of RAM, or else using 12-16 meg of flash as a secondary lookup table. Using the latter approach would mean that every write to a flash page would also require a write to the lookup table. I wouldn't be at all surprised if slower flash devices use such an approach (so as to only have to track the whereabouts of about 16,000 'indirect' pages).
In any case, the most important observation is that flash write times are not predictable, but you shouldn't normally have to worry about flash wear.
Did you check what happens to the FAT file system consistency in case of a power failure or reset of your device?
When your device experience such a failure you must not lose only that log entry, that you are just writing. Older entries must stay valid.
No, FAT is not the right thing if you need to read back the data.
You should further consider what happens, if the flash memory is filled with data. How do you get space for new data? You need to define the requirements for this point.