why sequential write is faster than random write on HDD - database

Like appending log entries at the tail of a file, or like MySQL writing its redo log, people always say that sequential writes are much faster than random writes. But why? I mean, when you write data to disk, seek time and rotational latency dominate the performance. But between two of your consecutive sequential writes there may be lots of other write requests (like nginx recording access.log). Those requests may move the magnetic head to other tracks, and when your process issues its next sequential write, the head has to move back again and incur the rotational delay. Even when there is no other process and the head can stay still, you still have to wait for the rotation.
So, is it true that sequential writes are faster than random writes just because, in many cases, a sequential write avoids the seek time while a random write always pays it, even though both sequential and random writes pay the rotational delay?

The write performance of a disk is influenced by the physical properties of the storage device (e.g. the physical rotational speed in revolutions per minute in case of a mechanical disk), the disk I/O unit to I/O request size ratio and the OS/application.
Physical properties of an HDD
One major drawback of traditional mechanical disks is that, for an I/O request to be honored, the head has to reach the desired starting position (seek delay) and the platter has to reach the desired starting position (rotational delay).
This is true for sequential and random I/O. However, with sequential I/O, this delay gets considerably less noticeable because more data can be written without repositioning the head. An "advanced format" hard disk has a sector size of 4096 bytes (the smallest I/O unit) and a cylinder size in the megabyte-range. A whole cylinder can be read without repositioning the head. So, yes, there's a seek and rotational delay involved but the amount of data that can be read/written without further repositioning is considerably higher. Also, moving from one cylinder to the next is significantly more efficient than moving from the innermost to the outermost cylinder (worst-case seek).
Writing 10 consecutive sectors involves one seek and one rotational delay; writing 10 sectors spread across the disk involves 10 seeks and 10 rotational delays.
In general, both sequential and random I/O involve seek and rotational delays. Sequential I/O takes advantage of sequential locality to minimize those delays.
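To make the gap concrete, here is a tiny back-of-the-envelope model in C; the seek, rotational and transfer figures below are assumed typical values for a 7200 RPM drive, not measurements:
/* Rough model: ~9 ms average seek, ~4.2 ms average rotational delay
 * (half a revolution at 7200 RPM), ~0.05 ms to transfer one 4 KiB sector. */
#include <stdio.h>
int main(void) {
    double seek_ms = 9.0, rot_ms = 4.2, xfer_ms = 0.05;
    int sectors = 10;
    /* Sequential: position once, then stream the sectors. */
    double sequential = seek_ms + rot_ms + sectors * xfer_ms;
    /* Random: pay the positioning cost for every single sector. */
    double random = sectors * (seek_ms + rot_ms + xfer_ms);
    printf("sequential: %.1f ms, random: %.1f ms\n", sequential, random);
    return 0;
}
With these assumed numbers, ten consecutive sector writes cost roughly 14 ms while ten scattered ones cost over 130 ms, which is the argument above in miniature.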
Physical properties of an SSD
As you know, a solid-state disk does not have moving parts because it's usually built from flash memory. The data is stored in cells. Multiple cells form a page - the smallest I/O unit ranging from 2K to 16K in size. Multiple pages are managed in blocks - a block contains between 128 and 256 pages.
The problem is two-fold. Firstly, a page can only be written to once. If all pages contain data, they cannot be written to again unless the whole block is erased. Assuming that a page in a block needs to be updated and all pages contain data, the whole block must be erased and rewritten.
Secondly, the number of write cycles of an individual block is limited. To prevent approaching or exceeding the maximum number of write cycles faster for some blocks than others, a technique called wear leveling is used so that writes are distributed evenly across all blocks independent from the logical write pattern. This process also involves block erasure.
To alleviate the performance penalty of block erasure, SSDs employ a garbage collection process that reclaims pages marked as stale: the block's non-stale pages are copied to a new block and the original block is erased.
Both aspects can cause more data to be physically read and written than required by the logical write. A full page write can trigger a read/write sequence that is 128 to 256 times larger, depending on the page/block relationship. This is known as write amplification. Random writes potentially hit considerably more blocks than sequential writes, making them significantly more expensive.
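As a rough worked example (using the page and block sizes assumed above), compare the worst-case amount of flash that has to be rewritten for 1024 page updates issued randomly versus sequentially:
/* Worst-case rewrite cost for n page updates: random updates each land in a
 * different block and force a full block rewrite; sequential updates fill a
 * block before moving to the next. Sizes are the assumed values from above. */
#include <stdio.h>
int main(void) {
    long pages_per_block = 256, page_kib = 4, n = 1024;
    long logical_kib = n * page_kib;
    long random_kib = n * pages_per_block * page_kib;
    long sequential_kib = ((n + pages_per_block - 1) / pages_per_block)
                          * pages_per_block * page_kib;
    printf("logical: %ld KiB, random worst case: %ld KiB, sequential: %ld KiB\n",
           logical_kib, random_kib, sequential_kib);
    return 0;
}
The logical 4 MiB of updates stays 4 MiB when written sequentially, but can balloon to 1 GiB of physical writes in the random worst case.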
Disk I/O unit to I/O request size ratio
As outlined before, a disk imposes a minimum on the I/O unit that can be involved in reads and writes. If a single byte is written to disk, the whole unit has to be read, modified, and written.
In contrast to sequential I/O, where the likelihood of triggering large writes as the I/O load increases is high (e.g. in the case of a database transaction log), random I/O tends to involve smaller I/O requests. As those requests become smaller than the smallest I/O unit, the overhead of processing them increases, adding to the cost of random I/O. This is another example of write amplification as a consequence of storage device characteristics; in this case, however, both HDD and SSD scenarios are affected.
OS/application
The OS has various mechanisms to optimize both sequential and random I/O.
A write triggered by an application is usually not processed immediately (unless the application requests it by means of synchronous/direct I/O or an explicit sync command). Instead, the changes are applied in memory, in the so-called page cache, and written to disk at a later point in time.
By doing so, the OS maximizes the total amount of data available and the size of individual I/Os. Individual I/O operations that would have been inefficient to execute can be aggregated into one potentially large, more efficient operation (e.g. several individual writes to a specific sector can become a single write). This strategy also allows for I/O scheduling, choosing a processing order that is most efficient for executing the I/Os even though the original order as defined by the application or applications was different.
Consider the following scenario: a web server request log and a database transaction log are being written to the same disk. The web server write operations would normally interfere with the database write operations if they were executed in order, as issued by the applications involved. Due to the asynchronous execution based on the page cache, the OS can reorder those I/O requests to trigger two large sequential write requests each. While those are being executed, the database can continue to write to the transaction log without any delay.
One caveat here is that, while this should be true for the web server log, not all writes can be reordered at will. A database triggers a disk sync operation (fsync on Linux/UNIX, FlushFileBuffers on Windows) whenever the transaction log has to be written to stable storage as part of a commit. Then, the OS cannot delay the write operations any further and has to execute all previous writes to the file in question immediately. If the web server was to do the same, there could be a noticeable performance impact because the order is then dictated by those two applications. That is why it's a good idea to place a transaction log on a dedicated disk to maximize sequential I/O throughput in the presence of other disk syncs / large amounts of other I/O operations (the web server log shouldn't be a problem). Otherwise, asynchronous writes based on the page cache might not be able to hide the I/O delays anymore as the total I/O load and/or the number of disk syncs increase.
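For reference, a commit on Linux boils down to something like the following minimal sketch; the file name and record contents are placeholders and error handling is reduced to early returns:
/* Append one log record, then force it to stable storage with fsync(). */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
int main(void) {
    int fd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;
    const char record[] = "COMMIT txn=42\n";
    if (write(fd, record, strlen(record)) < 0)
        return 1;
    /* Not durable until fsync() returns; until then the record may only
     * live in the page cache. */
    if (fsync(fd) < 0)
        return 1;
    close(fd);
    return 0;
}
Until the fsync() call the OS is free to delay and reorder the write as described above; the sync is the point where that freedom ends.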

Related

How could WAL (write ahead log) have better performance than write directly to disk?

The WAL (Write-Ahead Log) technology has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
Write a log to disk and return to the client
Write the data to disk, cache or memory asynchronously
There are two benefits:
If some exception occurs (e.g. power loss) we can recover the data from the log.
The performance is good because we write data asynchronously and can batch operations
Why not just write the data to disk directly? Make every write go straight to disk: on success you tell the client it succeeded, and if the write fails you return a failure response or a timeout.
In this way, you still have those two benefits.
You do not need to recover anything after a power loss, because every success response returned to the client means the data is really on disk.
Performance should be the same. We touch the disk frequently, but so does the WAL (every successful WAL write also means a write that reached the disk).
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. These writes do not need to be performed, with only the log writes performed for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying log writes slightly and then performing a single larger write can significantly improve throughput.
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue. That is the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are less important, but avoiding some writes entirely, and issuing large sequential writes, still helps.
Update:
SSDs also have better performance with large sequential writes but for different reasons. It is not as simple as saying "no seek time or rotational latency therefore just randomly write". For example, writing large blocks into space the SSD knows is "free" (eg. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially mapping updates into different internal block sizes.
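A minimal sketch of the log-write batching ("group commit") mentioned above, assuming a simple in-memory staging buffer; the buffer size, flush policy and names are arbitrary choices for illustration:
/* Accumulate log records in memory, then flush them with one sequential
 * write() and a single fsync(). */
#include <string.h>
#include <unistd.h>
#define LOG_BUF_SIZE (64 * 1024)
static char log_buf[LOG_BUF_SIZE];
static size_t log_used;
int log_flush(int fd) {
    if (log_used == 0)
        return 0;
    if (write(fd, log_buf, log_used) != (ssize_t)log_used)
        return -1;
    if (fsync(fd) < 0)              /* one durability point for many records */
        return -1;
    log_used = 0;
    return 0;
}
int log_append(int fd, const char *rec, size_t len) {
    if (len > LOG_BUF_SIZE)
        return -1;                  /* record too large for this sketch */
    if (log_used + len > LOG_BUF_SIZE && log_flush(fd) < 0)
        return -1;
    memcpy(log_buf + log_used, rec, len);
    log_used += len;
    return 0;
}
Many appends share the cost of one sequential write and one fsync(), which is exactly the throughput win described above.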
As you note a key contribution of a WAL is durability. After a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:
write all records to the end of some file
the files are somehow structured
If you go with 1) it is needless to say that the cost of read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM, which uses files that are internally sorted by key. For that reason, "directly writing to disk" means that you possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you
write to the WAL for persistence
update the memtables (in RAM)
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you will rewrite your file-structures to the updated state as a result of many writes, which makes each write substantially cheaper.
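A compressed sketch of that write path; the memtable below is just a small sorted array standing in for a real skip list or balanced tree, and all names and sizes are illustrative:
/* Durability comes from the WAL append, ordering from the in-memory memtable. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#define MAX_ENTRIES 1024
struct entry { char key[32]; char value[64]; };
static struct entry memtable[MAX_ENTRIES];
static int memtable_count;
static int wal_append(int fd, const char *key, const char *value) {
    char rec[128];
    int len = snprintf(rec, sizeof rec, "%s=%s\n", key, value);
    if (len < 0 || len >= (int)sizeof rec)
        return -1;                  /* record too large for this sketch */
    if (write(fd, rec, len) != len)
        return -1;
    return fsync(fd);               /* sequential append, then make it durable */
}
static void memtable_insert(const char *key, const char *value) {
    if (memtable_count >= MAX_ENTRIES)
        return;                     /* a real engine would flush to disk here */
    int i = memtable_count;
    while (i > 0 && strcmp(memtable[i - 1].key, key) > 0) {
        memtable[i] = memtable[i - 1];   /* keep the array sorted by key */
        i--;
    }
    snprintf(memtable[i].key, sizeof memtable[i].key, "%s", key);
    snprintf(memtable[i].value, sizeof memtable[i].value, "%s", value);
    memtable_count++;
}
int put(int wal_fd, const char *key, const char *value) {
    if (wal_append(wal_fd, key, value) < 0)
        return -1;                  /* not durable, do not acknowledge */
    memtable_insert(key, value);    /* cheap in-memory update, no random disk I/O */
    return 0;
}
The on-disk write is always a sequential append; the expensive sorted rewrite is deferred to the flush/compaction step.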
I have a guess.
Making every write go directly to disk does avoid the need for recovery after a power loss. But the performance question needs to be considered in two situations.
situation 1:
All your storage devices are spinning disks. The WAL approach will perform better, because writing the WAL is a sequential write while writing the data in place is a random write, and random writes are far slower than sequential writes on a spinning disk.
situation 2:
All your devices are SSDs. Then the difference may be much smaller, because sequential and random writes have much closer performance on an SSD.

Does the distance between read and write locations have an effect on cache performance?

I have a buffer of size n that is full, and a successor buffer of size n that is empty. I want to insert a value within the first buffer at position i, but I would need to move a range of memory forward in order to do that, since the buffer is full (ie. sequential insert). I have two options here:
Prefer write close to read (adjacent):
Push the last value of the first buffer into the second.
Move between i and n - 1 in the first buffer one forward.
Insert at i.
Prefer fewer steps:
Copy the range i to n - 1 from the first into the second buffer.
Insert at i.
Most of what I can find only talks about locality in a read context, and I am wondering whether the distance between the read and the write memory should be considered.
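For concreteness, the two options map to roughly the following sketch (names are hypothetical; first and second are the two buffers of n elements):
/* Option 1: keep the write adjacent to the read. */
#include <string.h>
void insert_adjacent(int *first, int *second, size_t n, size_t i, int value) {
    second[0] = first[n - 1];                     /* push the last element over */
    memmove(&first[i + 1], &first[i], (n - 1 - i) * sizeof first[0]);
    first[i] = value;
}
/* Option 2: fewer steps, but the copy lands in the other buffer. */
void insert_fewer_steps(int *first, int *second, size_t n, size_t i, int value) {
    memcpy(second, &first[i], (n - i) * sizeof first[0]);   /* move the tail away */
    first[i] = value;
}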
Does the distance between read and write locations have an effect on cache performance?
Yes. Normally (not including rare situations where CPU can write an entire cache line with new data) the CPU has to fetch the most recent version of a cache line into its cache before doing the write. If the cache line is already in the cache (e.g. due to a previous read of some other data that happened to be in the same cache line) then CPU won't need to fetch the cache line before doing the write.
Note that there are also various other quirks (cache aliasing, TLB misses, etc.); and all of it depends on the specific situation and which CPU (e.g. if all of the process' data fits in the CPU's cache, there's no shared memory involved, and there are no task switches or other processes using the CPU, then you can assume everything will always be in the cache anyway).
I want to insert a value within the first buffer at position i, but I would need to move a range of memory forward in order to do that, since the buffer is full (ie. sequential insert).
Without more information (how often this happens, how much data is involved, etc) I can't really make any suggestions. However (at first glance, without much information), the entire idea seems bad. More specifically, it sounds like you're adding a bunch of hassle to make two smaller arrays behave exactly the same as one larger array would have (and then worrying about the cost of insertion because arrays aren't good for insertion in general).
this is a component deep within a data structure at the lowest level where n is small and constant
By small I assume you mean smaller than the L1 CPU cache (somewhere under 1 MB), or the L2 cache (up to 10-20 MB, depending on your CPU). If so, then no.
I am wondering whether the distance between the read and the write memory should be considered.
Sometimes. If all the data fits into the L1/L2/L3 cache of the CPU the process is running on, then random access behaves the way you expect: it all has the same latency. You can get nitty-gritty and delve into the differences between the L1, L2 and L3 caches, but for the sake of brevity (and I simply take it for granted), anywhere within a given memory boundary access latency is essentially the same. So in your case, where n is small and everything fits into the CPU cache (the first of many boundaries), what affects performance (time to complete) is the manner and efficiency in which you move or change values, and the number of times you end up doing it.
Now if n were big, for example on a system with two or more sockets (connected over Intel QPI or UPI), and the data resided in DDR RAM attached to the memory controller of the other CPU, then definitely yes, there is a big performance hit (relatively speaking), because a boundary has been crossed. Whatever could not fit into the cache of the CPU the process is running on (and was initially fetched from the DIMMs local to that CPU's memory controller) now incurs the overhead of talking to the other CPU over the QPI or UPI path (which is still very fast compared to previous architectures): the other CPU fetches the data from its own set of memory DIMMs and sends it back over QPI or UPI to the CPU your process is running on.
So when you exceed the L1 cache and spill into L2 there is a performance hit, likewise into L3, all within one CPU. When a process has to repeatedly fetch from its local set of DIMMs data that does not fit into cache, that is another performance hit. When that data is not on DIMMs local to that CPU, it is slower. When that data is not on the same motherboard and goes across some kind of high-speed fabric (RDMA), it is slower still. Across Ethernet, even slower... and so on.

Will writing million times to a file, spoil my harddisk?

I have an I/O-intensive simulation program that logs the simulation trace/data to a file on every iteration. As the simulation runs for a million iterations or more and logs the data to a file on disk (overwriting the file each time), I am curious whether that would wear out the hard disk, since most storage devices have an upper limit on write/erase cycles (e.g. flash disks allow up to 100,000 write/erase cycles). Would splitting the output into multiple files be a better option?
You need to recognize that a million write calls to a single file may only write to each block of the disk once, which doesn't cause any harm to magnetic disks or SSD devices. If you overwrite the first block of the file one million times, you run a greater risk of wearing things out, but there are lots of mitigating factors. First, if it is a single run of a program, the o/s is likely to keep the disk image in memory without writing to disk at all in the interim — unless, perhaps, you're using a journalled file system. If it is a journalled file system, then the actual writing will be spread over lots of different blocks.
If you manage to write to the same block on a magnetic spinning hard disk a million times, you are still not at serious risk of wearing the disk out.
A Google search on 'hard disk write cycles' shows a lot of informative articles (more particularly, perhaps, about SSD), and the related searches may also help you out.
On an SSD, there is a limited number of writes (or erase cycles, to be more accurate) to any particular block. It's probably more than 100K to 1 million for any given block, and SSDs use wear leveling to avoid unnecessary writes to the same block every time. SSDs can only write zeros, so when you "reset" a bit to one, you have to erase the whole block. [One could put an inverter on the cell to make it the other way around, but you get one or t'other, so it doesn't help much.]
Spinning hard disks are mechanical devices, so there isn't so much of a problem with how many times you write to the same place; the wear comes more from the head movements.
I wouldn't worry too much about it. Writing one file should be fine, it has little consequence whether you have one file or many.

When are sequential seeks with small reads slower than reading a whole file?

I've run into a situation where lseek'ing forward repeatedly through a 500MB file and reading a small chunk (300-500 bytes) between each seek appears to be slower than read'ing through the whole file from the beginning and ignoring the bytes I don't want. This appears to be true even when I only do 5-10 seeks (so when I only end up reading ~1% of the file). I'm a bit surprised by this -- why would seeking forward repeatedly, which should involve less work, be slower than reading which actually has to copy the data from kernel space to userspace?
Presumably on local disk when seeking the OS could even send a message to the drive to seek without sending any data back across the bus for even more savings. But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
Regardless of whether reading from local disk or a network filesystem, how could this happen? My only guess is the OS is prefetching a ton of data after each location I seek to. Is this something that can normally occur or does it likely indicate a bug in my code?
The magnitude of the difference will depend on the ratio of the seek count / data being read to the size of the entire file.
But I'm accessing a network mount, where I'd expect read to be much slower than seek (sending one packet saying to move N bytes ahead versus actually transferring data across the network).
If there are rotational magnetic drives at the other end of the network, the effect will still be present and likely significantly compounded by the round-trip time. The network protocol may play a role too. Even solid-state drives may take some penalty.
I/O schedulers may reorder requests in order to minimize head movements (perhaps naively even for storage devices without a head). A single bulk request might give you some greater efficiency across many layers. The filesystems have an opportunity to interfere here somewhat.
Regardless of whether reading from local disk or a network filesystem, how could this happen?
I wouldn't be quick to dismiss the effect of those layers -- do you have measurements which show the same behavior from a local disk? It's much easier to draw conclusions without quite so much between you and the hardware. Start with a raw device and bisect from there.
Have you considered using a memory map instead? It's perfect for this use case.
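A minimal sketch of the memory-map approach; the file name, offsets and chunk size are placeholders and error handling is trimmed:
/* Map the file once and copy the small chunks straight out of the mapping;
 * the kernel pages in only what is touched (plus readahead). */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
int main(void) {
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;
    char chunk[512];
    off_t offsets[] = { 1 << 20, 50 << 20, 200 << 20 };   /* example positions */
    for (size_t k = 0; k < sizeof offsets / sizeof offsets[0]; k++)
        if (offsets[k] + (off_t)sizeof chunk <= st.st_size)
            memcpy(chunk, map + offsets[k], sizeof chunk);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}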
Depending on the filesystem, the specific lseek implementation may create some overhead.
For example, I believe when using NFS, lseek locks the kernel by calling remote_llseek().

real-time writes to disk

I have a thread that needs to write data from an in-memory buffer to a disk thousands of times. I have some requirements of how long each write takes because the buffer needs to be cleared for a separate thread to write to it again.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long. My block of code looks like (this is in C by the way):
last = get_timestamp();
write();
now = get_timestamp();
if (longest_write < now - last)
    longest_write = now - last;
And at the end I print out the longest write. I found that for a 32K buffer, I am seeing a longest write time of about 47 ms. This is way too long to meet the requirements of my application. I don't think this can be solely attributed to the rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write times? Thanks
Edit:
I am in fact using multiple buffers of the type I declare above and striping between them to multiple disks. One solution to my problem would be to just increase the number of buffers to amortize the cost of long writes. However I would like to keep the amount of memory being used for buffering as small as possible to avoid dirtying the cache of the thread that is producing the data written into the buffer. My question should be constrained to dealing with variance in the latency of writing a small block to disk and how to reduce it.
I'm assuming that you are using an ATA or SATA drive connected to the built-in disk controller in a standard computer. Is this a valid assumption, or are you using anything out of the ordinary (hardware RAID controller, SCSI drives, external drive, etc)?
As an engineer who does a lot of disk I/O performance testing at work, I would say that this sounds a lot like your writes are being cached somewhere. Your "high latency" I/O is a result of that cache finally being flushed. Even without a filesystem, I/O operations can be cached in the I/O controller or in the disk itself.
To get a better view of what is going on, record not just your max latency, but your average latency as well. Consider recording your max 10-15 latency samples so you can get a better picture of how (in-)frequent these high-latency samples are. Also, throw out the data recorded in the first two or three seconds of your test and start your data logging after that. There can be high-latency I/O operations seen at the start of a disk test that aren't indicative of the disk's true performance (can be caused by things like the disk having to rev up to full speed, the head having to do a large initial seek, disk write cache being flushed, etc).
If you are wanting to benchmark disk I/O performance, I would recommend using a tool like IOMeter instead of using dd or rolling your own. IOMeter makes it easy to see what kind of a difference it makes to change the I/O size, alignment, etc, plus it keeps track of a number of useful statistics.
Requiring an I/O operation to happen within a certain amount of time is a risky thing to do. For one, other applications on the system can compete with you for disk access or CPU time and it is nearly impossible to predict their exact effect on your I/O speeds. Your disk might encounter a bad block, in which case it has to do some extra work to remap the affected sectors before processing your I/O. This introduces an unpredictable delay. You also can't control what the OS, driver, and disk controller are doing. Your I/O request may get backed up in one of those layers for any number of unforeseeable reasons.
If the only reason you have a hard limit on I/O time is because your buffer is being re-used, consider changing your algorithm instead. Try using a circular buffer so that you can flush data out of it while writing into it. If you see that you are filling it faster than flushing it, you can throttle back your buffer usage. Alternatively, you can also create multiple buffers and cycle through them. When one buffer fills up, write that buffer to disk and switch to the next one. You can be writing to the new buffer even if the first is still being written.
Response to comment:
You can't really "get the kernel out of the way", it's the lowest level in the system and you have to go through it to one degree or another. You might be able to build a custom version of the driver for your disk controller (provided it's open source) and build in a "high-priority" I/O path for your application to use. You are still at the mercy of the disk controller's firmware and the firmware/hardware of the drive itself, which you can't necessarily predict or do anything about.
Hard drives traditionally perform best when doing large, sequential I/O operations. Drivers, device firmware, and OS I/O subsystems take this into account and try to group smaller I/O requests together so that they only have to generate a single, large I/O request to the drive. If you are only flushing 32K at a time, then your writes are probably being cached at some level, coalesced, and sent to the drive all at once. By defeating this coalescing, you should reduce the number of I/O latency "spikes" and see more uniform disk access times. However, these access times will be much closer to the large times seen in your "spikes" than the moderate times that you are normally seeing. The latency spike corresponds to an I/O request that didn't get coalesced with any others and thus had to absorb the entire overhead of a disk seek. Request coalescing is done for a reason; by bundling requests you are amortizing the overhead of a drive seek operation over multiple commands. Defeating coalescing leads to doing more seek operations than you would normally, giving you overall slower I/O speeds. It's a trade-off: you reduce your average I/O latency at the expense of occasionally having an abnormal, high-latency operation. It is a beneficial trade-off, however, because the increase in average latency associated with disabling coalescing is nearly always more of a disadvantage than having a more consistent access time is an advantage.
I'm also assuming that you have already tried adjusting thread priorities, and that this isn't a case of your high-bandwidth producer thread starving out the buffer-flushing thread for CPU time. Have you confirmed this?
You say that you do not want to disturb the high-bandwidth thread that is also running on the system. Have you actually tested various output buffer sizes/quantities and measured their impact on the other thread? If so, please share some of the results you measured so that we have more information to use when brainstorming.
Given the amount of memory that most machines have, moving from a 32K buffer to a system that rotates through 4 32K buffers is a rather inconsequential jump in memory usage. On a system with 1GB of memory, the increase in buffer size represents only 0.0092% of the system's memory. Try moving to a system of alternating/rotating buffers (to keep it simple, start with 2) and measure the impact on your high-bandwidth thread. I'm betting that the extra 32K of memory isn't going to have any sort of noticeable impact on the other thread. This shouldn't be "dirtying the cache" of the producer thread. If you are constantly using these memory regions, they should always be marked as "in use" and should never get swapped out of physical memory. The buffer being flushed must stay in physical memory for DMA to work, and the second buffer will be in memory because your producer thread is currently writing to it. It is true that using an additional buffer will reduce the total amount of physical memory available to the producer thread (albeit only very slightly), but if you are running an application that requires high bandwidth and low latency then you would have designed your system such that it has quite a lot more than 32K of memory to spare.
Instead of solving the problem by trying to force the hardware and low-level software to perform to specific performance measurements, the easier solution is to adjust your software to fit the hardware. If you measure your max write latency to be 1 second (for the sake of nice round numbers), write your program such that a buffer that is flushed to disk will not need to be re-used for at least 2.5-3 seconds. That way you cover your worst-case scenario, plus provide a safety margin in case something really unexpected happens. If you use a system where you rotate through 3-4 output buffers, you shouldn't have to worry about re-using a buffer before it gets flushed. You aren't going to be able to control the hardware too closely, and if you are already writing to a raw volume (no filesystem) then there's not much between you and the hardware that you can manipulate or eliminate. If your program design is inflexible and you are seeing unacceptable latency spikes, you can always try a faster drive. Solid-state drives don't have to "seek" to do I/O operations, so you should see a fairly uniform hardware I/O latency.
As long as you are using O_DIRECT | O_SYNC, you can use ioprio_set() to set the IO scheduling priority of your process/thread (although the man page says "process", I believe you can pass a TID as given by gettid()).
If you set a real-time IO class, then your IO will always be given first access to the disk - it sounds like this is what you want.
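A sketch of what that looks like on Linux; there is no glibc wrapper for ioprio_set, so it goes through syscall(), and the constants below mirror the kernel's definitions in linux/ioprio.h:
/* Give the calling thread real-time I/O scheduling priority. Needs
 * CAP_SYS_ADMIN (or root) and an I/O scheduler that honors priorities. */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#define IOPRIO_CLASS_RT     1
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))
int set_rt_ioprio(void) {
    /* who = 0 means "the calling thread/process"; priority level 0 is the
     * highest within the real-time class. */
    return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                   IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
}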
I have a thread that needs to write data from an in-memory buffer to a disk thousands of times.
I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.
dd's block size is aligned with the file system block size. I guess your log file's writes aren't.
Plus, your application probably doesn't just write the log file; it does other file operations too. Or your application isn't the only one using the disk.
Generally, disk I/O isn't optimized for latency; it is optimized for throughput. High latencies are normal, and networked file systems have even higher ones.
In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and I find that some writes are taking very long.
Some writes take longer because at some point you saturate the write queue and the OS finally decides to actually flush the data to disk. The I/O queues are configured pretty short by default, to avoid excessive buffering and information loss due to a crash.
N.B. If you want to see the real speed, try setting the O_DSYNC flag when opening the file.
If your blocks are really aligned you might try using the O_DIRECT flag, since that would remove contention (with other applications) at the Linux disk cache level. The writes would work at the real speed of the disk.
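For reference, a direct, synchronous write needs both an aligned buffer and an aligned I/O size; a minimal sketch (the device path and the 4096-byte alignment are placeholder assumptions, so check your device's logical block size):
/* Open with O_DIRECT | O_DSYNC and write one aligned 32 KiB block. */
#define _GNU_SOURCE             /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define BLOCK 32768
int main(void) {
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT | O_DSYNC);   /* placeholder path */
    if (fd < 0)
        return 1;
    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK) != 0)   /* O_DIRECT needs an aligned buffer */
        return 1;
    memset(buf, 0, BLOCK);
    if (write(fd, buf, BLOCK) != BLOCK)
        return 1;
    free(buf);
    close(fd);
    return 0;
}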
100 MB/s with dd - without any syncing - is a highly synthetic benchmark, as you never know whether the data has really hit the disk. Try adding conv=dsync to dd's command line.
Also try using a larger block size. 32K is still small. IIRC 128K was the optimal size when I was testing sequential vs. random I/O a few years ago.
I am seeing a longest write speed of about 47ms.
"Real time" != "fast". If I define max response time of 50ms, and your app consistently responds within the 50ms (47 < 50) then your app would classify as real-time.
I don't think this can be solely attributed to rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds?
I do not think you can avoid the write() delays. Latencies are an inherent property of disk I/O. You can't avoid them - you have to expect and handle them.
I can think of only the following option: use two buffers. The first would be used by write(), the second for storing new incoming data from the threads. When write() finishes, switch the buffers and, if there is something to write, start writing it. That way there is always a buffer for the threads to put their information into. Overflow might still happen if the threads generate information faster than write() can write it out. Dynamically adding more buffers (up to some limit) might help in that case.
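That double-buffering scheme looks roughly like this (a sketch; the locking around the swap and all error handling are omitted, and the names and sizes are illustrative):
/* The producer fills buffers[active] while the writer flushes the other one. */
#include <unistd.h>
#define BUF_SIZE 32768
static char buffers[2][BUF_SIZE];
static int active;                       /* index of the buffer being filled */
void flush_and_swap(int fd) {
    int full = active;
    active = 1 - active;                 /* producer immediately gets the other buffer */
    if (write(fd, buffers[full], BUF_SIZE) != BUF_SIZE) {
        /* handle a short or failed write here */
    }
}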
Otherwise, you can achieve some sort of real-time behavior for (rotational) disk I/O only if your application is the sole user of the disk. (The old rule of real-time applications applies: there can be only one.) O_DIRECT helps somewhat to remove the influence of the OS itself from the equation. (Though you would still have the overhead of the file system in the form of occasional delays due to block allocation as the file grows. Under Linux that works pretty fast, but it can still be avoided by preallocating the whole file in advance, e.g. by writing zeros.) If the timing is really important, consider buying a dedicated disk for the job. SSDs have excellent throughput and do not suffer from seeking.
Are you writing to a new file or overwriting the same file?
The big difference with dd is likely to be seek time: dd is streaming to a (mostly) contiguous list of blocks, whereas if you are writing lots of small files the head may be seeking all over the drive to allocate them.
The best way of solving the problem is likely to be removing the requirement for the log to be written in a specific time. Can you use a set of buffers so that one is being written (or at least sent to the drive's buffer) while new log data is arriving into another one?
Linux does not write anything directly to the disk; it goes through the page cache, and a kernel thread (pdflush) later writes that data out to disk. The behavior of pdflush can be tuned through sysctl (e.g. the vm.dirty_* parameters).
