I have a program that processes a large dataset consisting of a large number (300+) of sizable (40MB+) memory-mapped files. All the files are needed together, though each is accessed sequentially. At the moment I am memory mapping the files and then using madvise with MADV_SEQUENTIAL, since I don't want the program to be any more of a memory hog than it needs to be (without any madvise the consumption becomes a problem). The problem is that the program runs much slower (around 50x slower) than the disk I/O of the system would indicate it should, and it gets worse faster than linearly as the number of files involved increases: processing 100 files is more than 10x faster than processing 300 files, despite 300 files being only 3x the data. I suspect that the memory-mapped files are generating a page fault every time a 4 KB page boundary is crossed, with the net result that disk seek time is greater than disk transfer time.
Can anyone think of a better way than using madvise with MADV_WILLNEED and MADV_DONTNEED every so often, and if this is the best way, any ideas as to how far to look ahead?
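For reference, the WILLNEED/DONTNEED approach described in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: the chunk size and look-ahead depth are made-up starting values that would have to be tuned by measurement, `scan_with_window` is a hypothetical name, and the per-byte sum stands in for the real processing. It assumes a read-only, file-backed mapping; on private anonymous memory or dirty private pages MADV_DONTNEED discards the contents.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical tuning knobs; the right values have to be measured. */
#define CHUNK_BYTES      ((size_t)1 << 20)  /* process 1 MiB at a time      */
#define LOOKAHEAD_CHUNKS 4                  /* prefetch this many chunks    */

/* Walk a page-aligned mapping sequentially while keeping only a small
 * window resident: prefetch ahead, drop what has already been consumed.
 * The byte sum is a placeholder for the real per-chunk work. */
uint64_t scan_with_window(unsigned char *base, size_t len)
{
    assert((uintptr_t)base % (uintptr_t)sysconf(_SC_PAGESIZE) == 0);
    const size_t window = (size_t)LOOKAHEAD_CHUNKS * CHUNK_BYTES;
    uint64_t sum = 0;

    /* Prime the window so the first chunks are prefetched as well. */
    madvise(base, len < window ? len : window, MADV_WILLNEED);

    for (size_t off = 0; off < len; off += CHUNK_BYTES) {
        size_t ahead = off + window;
        if (ahead < len) {
            size_t n = len - ahead < CHUNK_BYTES ? len - ahead : CHUNK_BYTES;
            madvise(base + ahead, n, MADV_WILLNEED);   /* advisory only */
        }
        size_t end = len - off < CHUNK_BYTES ? len : off + CHUNK_BYTES;
        for (size_t i = off; i < end; i++)
            sum += base[i];
        /* Tell the kernel this chunk will not be touched again. */
        madvise(base + off, end - off, MADV_DONTNEED);
    }
    return sum;
}
```

With 300+ files open at once, the resident set is then roughly bounded by the window size times the number of files, so the look-ahead depth trades memory for fewer synchronous faults.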
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
1) mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does.
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
1) mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
1) is wrong... mmap() assigns a region of virtual address space corresponding to file content. Whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmapping the file and starting to access it means the disk I/O is not done sequentially and may therefore take longer until the entire file is read into memory; but while that's happening, lookups are succeeding and dependent work can be undertaken, and parts of the file that are never actually needed are never read. (Allow for the granularity of disk and memory pages, and note that even when using memory mapping, many OSes let you give hints about your planned access patterns so they can proactively read ahead or release memory more aggressively, knowing you're unlikely to return to it.)
2) absolutely true
3) "The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
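To make the "less copying" and "in-place usage" points concrete, here is a minimal sketch assuming a POSIX system. The function names are made up for illustration; both functions count occurrences of a byte, one by copying through a user-space buffer with read(), the other by scanning the mapped pages directly. Which one is faster on a given machine is exactly what you would have to benchmark.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte c by copying the file through a user buffer. */
long count_with_read(const char *path, unsigned char c)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char buf[64 * 1024];
    long count = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)   /* one syscall per buffer */
        for (ssize_t i = 0; i < n; i++)
            count += (buf[i] == c);
    close(fd);
    return n < 0 ? -1 : count;
}

/* Same count, but scanning the file in place through a mapping. */
long count_with_mmap(const char *path, unsigned char c)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  /* cannot map 0 bytes */
    const unsigned char *p =
        mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;
    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)  /* pages fault in on first touch */
        count += (p[i] == c);
    munmap((void *)p, (size_t)st.st_size);
    return count;
}
```

The mmap version has no application-side buffer management at all, which is the "simplify the parsing job" point; the read version makes the I/O boundaries explicit.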
There are also reasons why mmap may be slower - do read Linus Torvalds's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
mmap() mappings can be shared between processes.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with the kernel block cache if possible, so there is less copying.
Memory for mmap is allocated by the kernel; it is always page-aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.
I was under the impression that sequential scan of a file would actually be a sequential seek on disk. However, I read recently that the blocks of a file might not be written contiguously on disk by a file system. If inodes are used as a map and each block is obtained by following the block pointer, I am wondering whether the actual mechanism with which a file system retrieves the blocks of a file is actually sequential?
If the answer is file system dependent, it would be great to cite some major filesystems.
Thanks.
Filesystems try to allocate as many sequential blocks as possible during writes. But as they age (i.e. after lots of creates and deletes over time), fragmentation becomes inevitable. There are heuristics to reduce fragmentation, like speculative preallocation, delayed allocation etc. Applications themselves can do things like preallocating space (for example with fallocate), enabling readahead, and running defragmentation tools, depending on the features available in the filesystem, to make the blocks contiguous or at least make reads faster.
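As a small illustration of application-side preallocation, here is a sketch using posix_fallocate(). The helper name `open_preallocated` is made up; whether the reserved blocks actually end up contiguous is up to the filesystem, so treat this as a hint rather than a guarantee.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or open) `path` and reserve `bytes` of disk space up front,
 * giving the filesystem a chance to pick a contiguous extent.
 * Returns an open descriptor, or -1 on failure. */
int open_preallocated(const char *path, off_t bytes)
{
    assert(path != NULL && bytes > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    /* posix_fallocate returns an error number directly instead of
     * setting errno. */
    if (posix_fallocate(fd, 0, bytes) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After the call the file size is at least `bytes`, and later writes inside that range should not fail for lack of disk space.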
I was handed a server hosting SQL Server and was asked to find the causes of its poor performance.
While monitoring PerfMon I found that:
Paging file: % Usage = 25% average for 3 days.
Memory: Pages/sec > 1 average for 3 days.
As far as I know, if % Usage is > 2% then there is too much paging because of memory pressure and a lack of memory space. However, when I opened Resource Monitor, Memory tab, I found:
-26 GB in use (out of 32 GB total RAM)
-2 GB standby
-and 4 GB Memory free !!!!!!
If there is 4 GB of free memory, why is there paging at all?! And most importantly, why is the paging file % usage so high?!
Someone please explain this situation and how paging file % usage can be lowered to normal.
Note that SQL Server Max. memory is set to 15GB
Page file usage on its own isn't a major red flag. The OS will tend to use a page file even when there's plenty of RAM available, because it allows it to dump the relevant parts of memory from RAM when needed - don't think of the page file usage as memory moved from RAM to HDD - it's just a copy. All the accesses will still use RAM, the OS is simply preparing for a contingency - if it didn't have the memory pre-written to the page file, the memory requests would have to wait for "old" memory to be dumped before freeing the RAM for other uses.
Also, it seems you're a bit confused about how paging works. All user-space memory is always paged, this has nothing to do with the page file itself - it simply means you're using virtual memory. The metric you're looking for is Hard faults per second (EDIT: uh, I misread which one you're reading - Pages/sec is how many hard faults there are; still, the rest still applies), which tells you how often the OS had to actually read data from the page file. Even then, 1 per second is extremely low. You will rarely see anything until that number goes above fifty per sec or so, and much higher for SSDs (on my particular system, I can get thousands of hard faults with no noticeable memory lag - this varies a lot based on the actual HDD and your chipset and drivers).
Finally, there's way too many ways SQL Server performance can suffer. If you don't have a real DBA (or at least someone with plenty of DB experience), you're in trouble. Most of your lines of inquiry will lead you to dead-ends - something as complex and optimized as a DB engine is always hard to diagnose properly. Identify signs - is there a high CPU usage? Is there a high RAM usage? Are there queries with high I/O usage? Are there specific queries that are giving you trouble, or does the whole DB suffer? Are your indices and tables properly maintained? Those are just the very basics. Once you have some extra information like this, try DBA.StackExchange.com - SO isn't really the right place to ask for DBA advice :)
Just some shots in the dark really, might be a little random but I could hardly spot something straight away:
might there be processes that select uselessly large data sets or run too frequent operations? (e.g. the awful app developers' practice to use SELECT * everywhere or get all data and then filter it on application level or run DB queries in loops instead of getting record sets once, etc.)
is indexing proper? (e.g. are leaf elements employed where possible to reduce the key lookup operations, are heavy queries backed up with proper indices to avoid table & index scans etc.)
how is data population managed? (e.g. is it possible that there are too many page splits due to improper clustered indices or parallel inserts, are there some index rebuilds taking place etc.)
I have an I/O-intensive simulation program that logs the simulation trace / data to a file at every iteration. As the simulation runs for millions of iterations or more and logs the data to a file on the disk (overwriting the file each time), I am curious to know whether that would wear out the hard disk, as most storage devices have an upper limit on write/erase cycles (e.g. flash disks allow up to 100,000 write/erase cycles). Would splitting the file into multiple files be a better option?
You need to recognize that a million write calls to a single file may only write to each block of the disk once, which doesn't cause any harm to magnetic disks or SSD devices. If you overwrite the first block of the file one million times, you run a greater risk of wearing things out, but there are lots of mitigating factors. First, if it is a single run of a program, the o/s is likely to keep the disk image in memory without writing to disk at all in the interim — unless, perhaps, you're using a journalled file system. If it is a journalled file system, then the actual writing will be spread over lots of different blocks.
If you manage to write to the same block on a magnetic spinning hard disk a million times, you are still not at serious risk of wearing the disk out.
A Google search on 'hard disk write cycles' shows a lot of informative articles (more particularly, perhaps, about SSD), and the related searches may also help you out.
On an SSD, there is a limited number of writes (or erase cycles, to be more accurate) to any particular block. It's probably in the range of 100K to 1 million for any given block, and SSDs use "wear leveling" to avoid unnecessary "writes" to the same block every time. SSDs can only write zeros, so when you "reset" a bit to one, you have to erase the whole block. [One could put an inverter on the cell to make it the other way around, but you get one or t'other, so it doesn't help much].
Real hard disks are more of a mechanical device, so there isn't so much of a problem with how many times you write to the same place; it's more the head movements.
I wouldn't worry too much about it. Writing one file should be fine, it has little consequence whether you have one file or many.
I have a thread that needs to write data from an in-memory buffer to a disk thousands of times. I have some requirements of how long each write takes because the buffer needs to be cleared for a separate thread to write to it again.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long. My block of code looks like (this is in C by the way):
last = get_timestamp();
write();
now = get_timestamp();
if (longest_write < now - last)
    longest_write = now - last;
And at the end I print out the longest write. I found that for a 32K buffer, I am seeing a longest write time of about 47ms. This is way too long to meet the requirements of my application. I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds? Thanks
Edit:
I am in fact using multiple buffers of the type I declare above and striping between them to multiple disks. One solution to my problem would be to just increase the number of buffers to amortize the cost of long writes. However I would like to keep the amount of memory being used for buffering as small as possible to avoid dirtying the cache of the thread that is producing the data written into the buffer. My question should be constrained to dealing with variance in the latency of writing a small block to disk and how to reduce it.
I'm assuming that you are using an ATA or SATA drive connected to the built-in disk controller in a standard computer. Is this a valid assumption, or are you using anything out of the ordinary (hardware RAID controller, SCSI drives, external drive, etc)?
As an engineer who does a lot of disk I/O performance testing at work, I would say that this sounds a lot like your writes are being cached somewhere. Your "high latency" I/O is a result of that cache finally being flushed. Even without a filesystem, I/O operations can be cached in the I/O controller or in the disk itself.
To get a better view of what is going on, record not just your max latency, but your average latency as well. Consider recording your max 10-15 latency samples so you can get a better picture of how (in-)frequent these high-latency samples are. Also, throw out the data recorded in the first two or three seconds of your test and start your data logging after that. There can be high-latency I/O operations seen at the start of a disk test that aren't indicative of the disk's true performance (can be caused by things like the disk having to rev up to full speed, the head having to do a large initial seek, disk write cache being flushed, etc).
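Something along these lines would record the average and the worst few samples instead of only the maximum. This is a rough sketch, not a polished tool: the struct and function names are made up, and the struct is assumed to start out zero-initialized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOP_N 10   /* how many of the worst samples to keep */

struct lat_stats {
    uint64_t count;          /* samples recorded                  */
    uint64_t total_ns;       /* sum of all samples, for the mean  */
    uint64_t worst[TOP_N];   /* largest samples, sorted descending */
};

/* Record one latency sample, in nanoseconds. */
void lat_record(struct lat_stats *s, uint64_t ns)
{
    assert(s != NULL);
    s->count++;
    s->total_ns += ns;
    /* Insert into the descending top-N list, dropping the smallest. */
    for (size_t i = 0; i < TOP_N; i++) {
        if (ns > s->worst[i]) {
            for (size_t j = TOP_N - 1; j > i; j--)
                s->worst[j] = s->worst[j - 1];
            s->worst[i] = ns;
            break;
        }
    }
}

/* Mean latency in nanoseconds, or 0 if nothing was recorded. */
double lat_mean_ns(const struct lat_stats *s)
{
    assert(s != NULL);
    return s->count ? (double)s->total_ns / (double)s->count : 0.0;
}
```

You would take the timestamps around each write() with something like clock_gettime(CLOCK_MONOTONIC, ...) and only start calling lat_record() once the first few seconds of warm-up have passed.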
If you are wanting to benchmark disk I/O performance, I would recommend using a tool like IOMeter instead of using dd or rolling your own. IOMeter makes it easy to see what kind of a difference it makes to change the I/O size, alignment, etc, plus it keeps track of a number of useful statistics.
Requiring an I/O operation to happen within a certain amount of time is a risky thing to do. For one, other applications on the system can compete with you for disk access or CPU time and it is nearly impossible to predict their exact effect on your I/O speeds. Your disk might encounter a bad block, in which case it has to do some extra work to remap the affected sectors before processing your I/O. This introduces an unpredictable delay. You also can't control what the OS, driver, and disk controller are doing. Your I/O request may get backed up in one of those layers for any number of unforeseeable reasons.
If the only reason you have a hard limit on I/O time is because your buffer is being re-used, consider changing your algorithm instead. Try using a circular buffer so that you can flush data out of it while writing into it. If you see that you are filling it faster than flushing it, you can throttle back your buffer usage. Alternatively, you can also create multiple buffers and cycle through them. When one buffer fills up, write that buffer to disk and switch to the next one. You can be writing to the new buffer even if the first is still being written.
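A bare-bones sketch of the rotating-buffer idea follows. The names and the pool size are made up, and it deliberately contains only the bookkeeping: with a separate producer and writer thread, `filled`, `head` and `tail` would need a mutex/condition variable or atomics around them.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF     4            /* hypothetical pool size        */
#define BUF_SIZE (32 * 1024)  /* 32K, as in the question       */

struct buf_ring {
    unsigned char data[NBUF][BUF_SIZE];
    size_t head;    /* next buffer the producer will fill  */
    size_t tail;    /* next buffer the writer will flush   */
    size_t filled;  /* buffers filled but not yet flushed  */
};

/* Producer: get a free buffer to fill, or NULL if the writer has
 * fallen behind and every buffer is still waiting to be flushed. */
unsigned char *ring_begin_fill(struct buf_ring *r)
{
    assert(r != NULL);
    return r->filled < NBUF ? r->data[r->head] : NULL;
}

/* Producer: mark the buffer from ring_begin_fill() as ready to flush. */
void ring_end_fill(struct buf_ring *r)
{
    assert(r != NULL && r->filled < NBUF);
    r->head = (r->head + 1) % NBUF;
    r->filled++;
}

/* Writer: get the oldest filled buffer, or NULL if nothing is pending. */
unsigned char *ring_begin_flush(struct buf_ring *r)
{
    assert(r != NULL);
    return r->filled > 0 ? r->data[r->tail] : NULL;
}

/* Writer: the write() has completed, so the buffer can be reused. */
void ring_end_flush(struct buf_ring *r)
{
    assert(r != NULL && r->filled > 0);
    r->tail = (r->tail + 1) % NBUF;
    r->filled--;
}
```

A NULL from ring_begin_fill() is the signal to throttle the producer; a slow write then only costs you a buffer slot rather than stalling the producer immediately.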
Response to comment:
You can't really "get the kernel out of the way", it's the lowest level in the system and you have to go through it to one degree or another. You might be able to build a custom version of the driver for your disk controller (provided it's open source) and build in a "high-priority" I/O path for your application to use. You are still at the mercy of the disk controller's firmware and the firmware/hardware of the drive itself, which you can't necessarily predict or do anything about.
Hard drives traditionally perform best when doing large, sequential I/O operations. Drivers, device firmware, and OS I/O subsystems take this into account and try to group smaller I/O requests together so that they only have to generate a single, large I/O request to the drive. If you are only flushing 32K at a time, then your writes are probably being cached at some level, coalesced, and sent to the drive all at once. By defeating this coalescing, you should reduce the number of I/O latency "spikes" and see more uniform disk access times. However, these access times will be much closer to the large times seen in your "spikes" than to the moderate times that you are normally seeing. The latency spike corresponds to an I/O request that didn't get coalesced with any others and thus had to absorb the entire overhead of a disk seek. Request coalescing is done for a reason: by bundling requests you amortize the overhead of a drive seek operation over multiple commands. Defeating coalescing leads to doing more seek operations than you would normally, giving you slower I/O speeds overall. Coalescing is a trade-off: it reduces your average I/O latency at the expense of occasionally having an abnormal, high-latency operation. It is a beneficial trade-off, however, because the increase in average latency that comes with disabling coalescing is nearly always a bigger disadvantage than the more consistent access time is an advantage.
I'm also assuming that you have already tried adjusting thread priorities, and that this isn't a case of your high-bandwidth producer thread starving out the buffer-flushing thread for CPU time. Have you confirmed this?
You say that you do not want to disturb the high-bandwidth thread that is also running on the system. Have you actually tested various output buffer sizes/quantities and measured their impact on the other thread? If so, please share some of the results you measured so that we have more information to use when brainstorming.
Given the amount of memory that most machines have, moving from a 32K buffer to a system that rotates through 4 32K buffers is a rather inconsequential jump in memory usage. On a system with 1GB of memory, the increase in buffer size represents only 0.0092% of the system's memory. Try moving to a system of alternating/rotating buffers (to keep it simple, start with 2) and measure the impact on your high-bandwidth thread. I'm betting that the extra 32K of memory isn't going to have any sort of noticeable impact on the other thread. This shouldn't be "dirtying the cache" of the producer thread. If you are constantly using these memory regions, they should always be marked as "in use" and should never get swapped out of physical memory. The buffer being flushed must stay in physical memory for DMA to work, and the second buffer will be in memory because your producer thread is currently writing to it. It is true that using an additional buffer will reduce the total amount of physical memory available to the producer thread (albeit only very slightly), but if you are running an application that requires high bandwidth and low latency then you would have designed your system such that it has quite a lot more than 32K of memory to spare.
Instead of solving the problem by trying to force the hardware and low-level software to perform to specific performance measurements, the easier solution is to adjust your software to fit the hardware. If you measure your max write latency to be 1 second (for the sake of nice round numbers), write your program such that a buffer that is flushed to disk will not need to be re-used for at least 2.5-3 seconds. That way you cover your worst-case scenario, plus provide a safety margin in case something really unexpected happens. If you use a system where you rotate through 3-4 output buffers, you shouldn't have to worry about re-using a buffer before it gets flushed. You aren't going to be able to control the hardware too closely, and if you are already writing to a raw volume (no filesystem) then there's not much between you and the hardware that you can manipulate or eliminate. If your program design is inflexible and you are seeing unacceptable latency spikes, you can always try a faster drive. Solid-state drives don't have to "seek" to do I/O operations, so you should see a fairly uniform hardware I/O latency.
As long as you are using O_DIRECT | O_SYNC, you can use ioprio_set() to set the IO scheduling priority of your process/thread (although the man page says "process", I believe you can pass a TID as given by gettid()).
If you set a real-time IO class, then your IO will always be given first access to the disk - it sounds like this is what you want.
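For illustration, a sketch of what that call could look like. glibc has no wrapper for ioprio_set(), so this goes through syscall(). The constant names below are local stand-ins that mirror the kernel's ioprio values, and `set_thread_io_priority` is a made-up helper name. Note that per the man page the real-time class requires CAP_SYS_ADMIN, and I/O priorities only take effect with an I/O scheduler that supports them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local names mirroring the kernel's ioprio constants (assumption:
 * defined here so the sketch does not depend on a particular header). */
#define IO_CLASS_SHIFT  13
#define IO_CLASS_RT     1   /* real-time class   */
#define IO_CLASS_BE     2   /* best-effort class */
#define IO_WHO_PROCESS  1   /* "who" is a single process/thread id */

/* Pack a class and a level (0 = highest, 7 = lowest) into one value. */
static int ioprio_value(int io_class, int level)
{
    assert(level >= 0 && level <= 7);
    return (io_class << IO_CLASS_SHIFT) | level;
}

/* Set the I/O priority of the calling thread.
 * Returns 0 on success, -1 with errno set on failure. */
int set_thread_io_priority(int io_class, int level)
{
    pid_t tid = (pid_t)syscall(SYS_gettid);
    return (int)syscall(SYS_ioprio_set, IO_WHO_PROCESS, tid,
                        ioprio_value(io_class, level));
}
```

The writer thread would call set_thread_io_priority(IO_CLASS_RT, 0) once at startup and check the return value, since the call fails without the required privilege.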
I have a thread that needs to write data from an in-memory buffer to a disk thousands of times.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
The dd's block size is aligned with file system block size. I guess your log file isn't.
Plus probably your application writes not only the log file, but also does some other file operations. Or your application isn't alone using the disk.
Generally, disk I/O isn't optimized for latencies, it is optimized for the throughput. High latencies are normal - and networked file systems have them even higher.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long.
Some writes take longer because at some point you saturate the write queue and the OS finally decides to actually flush the data to disk. The I/O queues are by default configured to be pretty short, to avoid excessive buffering and information loss due to a crash.
N.B. If you want to see the real speed, try setting the O_DSYNC flag when opening the file.
If your blocks are really aligned you might try using the O_DIRECT flag, since that would remove contentions (with other applications) on the Linux disk cache level. The writes would work at the real speed of the disk.
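A minimal sketch of the two flags together, assuming Linux. O_DIRECT generally needs the buffer address, the transfer length and the file offset to be aligned to the device's logical block size; the 4096 below is an assumed safe value, not something queried from the device, and the helper names are made up.

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_ALIGN 4096     /* assumed alignment, see note above */

/* Allocate a zeroed buffer whose address and length suit O_DIRECT.
 * The caller releases it with free(). */
void *alloc_direct_buffer(size_t len)
{
    assert(len > 0 && len % BLOCK_ALIGN == 0);
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_ALIGN, len) != 0)
        return NULL;
    memset(buf, 0, len);
    return buf;
}

/* Open for writes that bypass the page cache (O_DIRECT) and only
 * return once the data has been handed to the device (O_DSYNC). */
int open_direct_sync(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);
}
```

Timing write() calls on a descriptor opened this way, with a buffer from alloc_direct_buffer(), should give numbers closer to what the disk itself is doing. Not every filesystem accepts O_DIRECT, so check the open() result.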
100MB/s with dd - without any syncing - is a highly synthetic benchmark, as you never know that data have really hit the disk. Try adding conv=dsync to the dd's command line.
Also try using a larger block size. 32K is still small. IIRC a 128K size was optimal when I was testing sequential vs. random I/O a few years ago.
I am seeing a longest write speed of about 47ms.
"Real time" != "fast". If I define max response time of 50ms, and your app consistently responds within the 50ms (47 < 50) then your app would classify as real-time.
I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds?
I do not think you can avoid the write() delays. Latencies are an inherent property of disk I/O. You can't avoid them - you have to expect and handle them.
I can think only of the following option: use two buffers. First would be used by write(), second - used for storing new incoming data from threads. When write() finishes, switch the buffers and if there is something to write, start writing it. That way there is always a buffer for threads to put the information into. Overflow might still happen if threads generate information faster than the write() can write. Dynamically adding more buffers (up to some limit) might help in the case.
Otherwise, you can achieve some sort of real-time-ness for (rotational) disk I/O only if your application is the sole user of the disk. (The old rule of real-time applications applies: there can be only one.) O_DIRECT helps somewhat to remove the influence of the OS itself from the equation. (Though you would still have the overhead of the file system, in the form of occasional delays due to block allocation when the file is extended. Under Linux that works pretty fast, but it can still be avoided by preallocating the whole file in advance, e.g. by writing zeros.) If the timing is really important, consider buying a dedicated disk for the job. SSDs have excellent throughput and do not suffer from seeking.
Are you writing to a new file or overwriting the same file?
The big difference with dd is likely to be seek time: dd is streaming to a (mostly) contiguous list of blocks, whereas if you are writing lots of small files the head may be seeking all over the drive to allocate them.
The best way of solving the problem is likely to be removing the requirement for the log to be written in a specific time. Can you use a set of buffers so that one is being written (or at least sent to the drive's buffer) while new log data is arriving into another one?
Linux does not write anything directly to the disk; it goes through virtual memory (the page cache) first, and then a kernel thread called pdflush writes the data to the disk. The behavior of pdflush can be controlled through sysctl -w ""