Related
I am working on an embedded application without any OS that needs the use of a File System. I've been over this many times with the people in the project and some agree with me that the system must make a proper shut down of the system whenever there is a power failure or else the file system might go crazy.
Some people say that it doesn't matter if you simply power off the system and let nature run its course, but I think that's one of the worst things to do, especially if you know this will bring you a problem and probably shorten your product's life span.
In the last paragraph I just assumed that it is a problem, but my question remains:
Does a power down have any effect on the file system?
Here is a list of various techniques to help an embedded system tolerate a power failure. These may not be practical for your particular application.
Use a Journaling File System - Can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but do your homework to confirm.
Unless your application needs the write performance, disable all write caching. Check your disk drivers for caching options. Under Linux/Unix, consider mounting the filesystem in sync mode.
Unless it must be writable, make it read-only. Try to keep your application executables and operating system files on their own partition(s), with write protections in place (e.g. mount read only in Linux). Your read/write data should be on its own partition. Even if your application data gets corrupted, your system should still be able to boot (albeit with a fail safe default configuration).
3a. For data that is only written once (e.g. Configuration Settings), try to keep it mounted as read-only most of the time. If there is a settings change mount is as R/W temporarily, update the data, and then unmount/remount it as read-only.
3b. Use a technique similar to 3a to handle application/OS updates in the field.
3c. If it is impractical for you to mount the FS as read-only, at least consider opening individual files as read-only (e.g. fp=fopen("configuration.ini", "r")).
If possible, use separate devices for your storage. Keeping things in separate partitions provides some protection, but there are still edge cases where a partition table may become corrupt and render the entire drive unreadable. Using physically separate devices further isolates against one corrupt device bringing down the whole system. In a perfect world, you would have at least 4 separate devices:
4a. Boot Loader
4b. Operating System & Application Code
4c. Configuration Settings
4e. Application Data
Know the characteristics of your storage devices, and control the brand/model/revision of devices used. Some hard disks ignore cache flush commands from the OS. We had cases where some models of CompactFlash cards would corrupt themselves during a power failure, but the "industrial" models did not have this problem. Of course, this information was not published in any datasheet, and had to be gathered by experimental testing. We developed a list of approved CF cards, and kept inventory of those cards. We periodically had to update this list as older cards became obsolete, or the manufacturer would make a revision.
Put your temporary files in a RAM Disk. If you keep those writes off-disk, you eliminate them as a potential source of corruption. You also reduce flash wear and tear.
Develop automated corruption detection and recovery methods. - All of the above techniques will not help you if the application simply hangs because a missing config file. You need to be able to recover as gracefully as possible:
7a. Your system should maintain at least two copies of its configuration settings, a "primary" and a "backup". If the primary fails for some reason, switch to the backup. You should also consider mechanisms for making backups whenever whenever the configuration is changed, or after a configuration has been declared "good" by the user (testing vs production mode).
7b. Did your Application Data partition fail to mount? Automatically run chkdsk/fsck.
7c. Did chkdsk/fsck fail to fix the problem? Automatically re-format the partition and get it back to a known state.
7d. Do you have a Boot Loader or other method to restore the OS and application after a failure?
7e. Make sure your system will beep, flash an LED, or something to indicate to the user what happened.
Power Failures should be part of your system qualification testing. The only way you will be sure you have a robust system is to test it. Yank the power cord from the system and document what happens. Try yanking the power at multiple points in the system operation (during runtime, while booting, mid configuration, etc). Repeat each test multiple times.
If you cannot mitigate all power failure problems, incorporate a battery or Supercapacitor into the system - Keep in mind that you will need a background process in your OS to initiate a graceful shutdown when power gets low. Also, batteries will require periodic testing and replacement with age.
Addition to msemack's response, unfortunately my rating is too low to post a comment to his answer vs. a separate answer.
Does a power down have any effect on the file system?
Yes, if proper measures aren't put in place to prevent corruption. See previous answers for file system options to help mitigate. However if ATA flush/sleep aren't properly implemented on your device you may run into the scenario we did. In our scenario the device was corrupt beyond the file system, and fdisk/format would not recover the device.
Instead an ATA security-erase was required to recover the device once corruption occurs. In order to avoid this, we implemented an ATA sleep command prior to power loss. This required hold-up of 400ms to support the 160ms ATA sleep took, and leave some head room for degradation of the caps over the life of the product.
Notes from our scenario:
fdisk/format failed to repair/recover the drive.
Our power-safe file system's check disk utility returned that the device had bad blocks, but there really weren't any.
flush/sync returned success, quickly, and most likely weren't implemented.
Once corrupt, dd could not read the device beyond the 1st partition boundary and returned i/o errors after.
hdparm used to issue ATA security-erase, as only method of recovery for some corruption scenarios.
For non-journalling filesystem unexpected turn-off can mean corruption of certain data including directory structure. This happens if there's unsaved data in the cache or if the FS is in the process of writing multi-block update and interruption happens when only some blocks are written.
Journalling addresses this problem mostly - if there's interruption in the middle, recovery routine or check-and-repair operation done by the FS (usually implicitly) brings the filesystem to consistent state. However this state is not always the latest - i.e. if there were some data in the memory cache, they can be lost even with journalling. This is because journalling saves you from corruption of the filesystem but doesn't do magic.
Write-through mode (no write caching) reduces possibility of the data loss but doesn't solve the problem completely, as journalling will work as a cache (for a very short time).
So unfortunately backup or data duplication are the main ways to prevent data loss.
It totally depends on the file system you are using and if it is acceptable to loose some data at power off based on your project requirements.
One could imagine using a file system that is secured against unattended power-off and is able to recover from a partial write sequence. So on the applicative side, if you don't have critic data that absolutely needs to be written before shuting down, there is no need for a specific power off detection procedure.
Now if you want a more specific answer for your project you will have to give more information on the file system you are using and your project requirements.
Edit: As you have critical applicative data to save before power-off, i think you have answered the question yourself. The only way to secure unattended power-off is to have a brown-out detection that alerts your embedded device coupled with some hardware circuitry that allows keeping delivering enought power to the device to perform the shutdown procedure.
The FAT file-system is particularly prone to corruption if a write is in progress or a file is open on shutdown - specifically if ther is a buffered operation that is not flushed . On one project I worked on the solution was to run a file system integrity check and repair (essentially chkdsk/scandsk) on start-up. This strategy did not prevent data loss, but it did prevent the file system becoming unusable.
A number of vendors provide journalling add-on components for FAT to counter exactly this problem. These include Segger, Quadros and Micrium for example.
Either way, your system should generally adopt a open-write-close approach to file access, or open-write-flush if you feel the need to keep the file open.
I have a project for which i need to minimize the impact of sending / receiving computed data over the network.
In my configuration, a (small) grid of computers will compute a large number of values (matrix_i^n, each machine having a large set of i's assigned). These values will then be send over the network to another computer depending on properties of the computed value (on average, every computer receives the same number of values).
I would like to optimize the time needed to compute these values (up to a power m, predetermined). In oder to do this, i need to choose the best way to transfer the intermediary results:
Precompute everything then exchange all the values to the right computer
Send every value to the right computer as soon as it is available
Hybrid solution where small packs of data are exchanged during the computation
Since network transfers are very slow, I have the feeling that I should start transferring data asap but i'm not sure that the overhead on the CPU (handling more exceptions, hence more work for the scheduler) would not blow the performance of the computation.
Do you know documentation i could rely on or a good benchmark suite (written in C) i could use to make some test by myself ?
Thank you
CPU usage for networking is generally determined by the amount of I/O calls you make. If you can, design your app in a way that lets you tweak your buffer size easily so that you can test. Using enough buffering, 10gb/s is no sweat.
What OS are you using? I doubt you'll need it, but Windows 8 has Registered I/O which is designed for extremely low latency and CPU usage.
Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reasons why I don't want to rely on fsync() is: most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I found so far is: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centiseconds. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
Let me restate this your problem in only slightly-uncharitable terms: You're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee, rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS specific syncing mechanism, and as per PostgreSQL's Reliability Docs.
When the operating system sends a write request to the storage
hardware, there is little it can do to make sure the data has arrived
at a truly non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage
components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your only solution is to use a fuzzy logic timer to estimate when the data will be written.
In my opinion this is the wrong way. You have control about when the OS is writing to the hard drive, so use the possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think, you should tell the user/admin that he must take care when choosing the right hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe, it's up to you to start a row of tests with different hard drives and Brad Fitzgerald's tool to get a good estimation of when hard drives will have written all data. But of course - if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that system will keep working with aid of hardware such as RAID arrays and other high availability configurations.
As far software solution goes, I think the answer is, trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned after every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question (in my case, I have an Android apps which lazy-writes data back to disk and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps...it may hurt performance but losing data kills performance a whole lot more if you consider the hours it takes to recover it).
At work we're discussing the design of a new platform and one of the upper management types said it needed to run our current code base (C on Linux) but be real time because it needed to respond in less than a second to various inputs. I pointed out that:
That point doesn't mean it needs to be "real time" just that it needs a faster clock and more streamlining in its interrupt handling
One of the key points to consider is the OS that's being used. They wanted to stick with embedded Linux, I pointed out we need an RTOS. Using Linux will prevent "real time" because of the kernel/user space memory split thus I/O is done via files and sockets which introduce a delay
What we really need to determine is if it needs to be deterministic (needs to respond to input in <200ms 90% of the time for example).
Really in my mind if point 3 is true, then it needs to be a real time system, and then point 2 is the biggest consideration.
I felt confident answering, but then I was thinking about it later... What do others think? Am I on the right track here or am I missing something?
Is there any difference that I'm missing between a "real time" system and one that is just "deterministic"? And besides a RTC and a RTOS, am I missing anything major that is required to execute a true real time system?
Look forward to some great responses!
EDIT:
Got some good responses so far, looks like there's a little curiosity about my system and requirements so I'll add a few notes for those who are interested:
My company sells units in the 10s of thousands, so I don't want to go over kill on the price
Typically we sell a main processor board and an independent display. There's also an attached network of other CAN devices.
The board (currently) runs the devices and also acts as a webserver sending basic XML docs to the display for end users
The requirements come in here where management wants the display to be updated "quickly" (<1s), however the true constraints IMO come from the devices that can be attached over CAN. These devices are frequently motor controlled devices with requirements including "must respond in less than 200ms".
You need to distinguish between:
Hard realtime: there is an absolute limit on response time that must not be breached (counts as a failure) - e.g. this is appropriate for example when you are controlling robotic motors or medical devices where failure to meet a deadline could be catastrophic
Soft realtime: there is a requirement to respond quickly most of the time (perhaps 99.99%+), but it is acceptable for the time limit to be occasionally breached providing the response on average is very fast. e.g. this is appropriate when performing realtime animation in a computer game - missing a deadline might cause a skipped frame but won't fundamentally ruin the gaming experience
Soft realtime is readily achievable in most systems as long as you have adequate hardware and pay sufficient attention to identifying and optimising the bottlenecks. With some tuning, it's even possible to achieve in systems that have non-deterministic pauses (e.g. the garbage collection in Java).
Hard realtime requires dedicated OS support (to guarantee scheduling) and deterministic algorithms (so that once scheduled, a task is guaranteed to complete within the deadline). Getting this right is hard and requires careful design over the entire hardware/software stack.
It is important to note that most business apps don't require either: in particular I think that targeting a <1sec response time is far away from what most people would consider a "realtime" requirement. Having said that, if a response time is explicitly specified in the requirements then you can regard it as soft realtime with a fairly loose deadline.
From the definition of the real-time tag:
A task is real-time when the timeliness of the activities' completion is a functional requirement and correctness condition, rather than merely a performance metric. A real-time system is one where some (though perhaps not all) of the tasks are real-time tasks.
In other words, if something bad will happen if your system responds too slowly to meet a deadline, the system needs to be real-time and you will need a RTOS.
A real-time system does not need to be deterministic: if the response time randomly varies between 50ms and 150ms but the response time never exceeds 150ms then the system is non-deterministic but it is still real-time.
Maybe you could try to use RTLinux or RTAI if you have sufficient time to experiment with. With this, you can keep the non realtime applications on the linux, but the realtime applications will be moved to the RTOS part. In that case, you will(might) achieve <1second response time.
The advantages are -
Large amount of code can be re-used
You can manually partition realtime and non-realtime tasks and try to achieve the response <1s as you desire.
I think migration time will not be very high, since most of the code will be in linux
Just on a sidenote be careful about the hardware drivers that you might need to run on the realtime part.
The following architecture of RTLinux might help you to understand how this can be possible.
It sounds like you're on the right track with the RTOS. Different RTOSs prioritize different things either robustness or speed or something. You will need to figure out if you need a hard or soft RTOS and based on what you need, how your scheduler is going to be driven. One thing is for sure, there is a serious difference betweeen using a regular OS and a RTOS.
Note: perhaps for the truest real time system you will need hard event based resolution so that you can guarantee that your processes will execute when you expect them too.
RTOS or real-time operating system is designed for embedded applications. In a multitasking system, which handles critical applications operating systems must be
1.deterministic in memory allocation,
2.should allow CPU time to different threads, task, process,
3.kernel must be non-preemptive which means context switch must happen only after the end of task execution. etc
SO normal windows or Linux cannot be used.
example of RTOS in an embedded system: satellites, formula 1 cars, CAR navigation system.
Embedded System: System which is designed to perform a single or few dedicated functions.
The system with RTOS: also can be an embedded system but naturally RTOS will be used in the real-time system which will need to perform many functions.
Real-time System: System which can provide the output in a definite/predicted amount of time. this does not mean the real-time systems are faster.
Difference between both :
1.normal Embedded systems are not Real-Time System
2. Systems with RTOS are real-time systems.
I want to ensure I have done all I can to configure a system's disks for serious database use. The three areas I know of (any others?) to be concerned about are:
I/O size: the database engine and disk's native size should either match, or the database's native I/O size should be a multiple of the disk's native I/O size.
Disks that are capable of Direct Memory Access (eg. IDE) should be configured for it.
When a disk says it has written data persistently, it must be so! No keeping it in cache and lying about it.
I have been looking for information on how to ensure these are so for CENTOS and Ubuntu, but can't seem to find anything at all!
I want to be able to check these things and change them if needed.
Any and all input appreciated.
PLEASE NOTE: The actual hardware involved is VERY modest. The point is to get the most out of what hardware we do have, even though it's "not very serious hardware" from a broader perspective.
MORE:
I appreciate the time taken to read and reply, but I'm hoping to get "answers" that aren't just good database / hardware advice but answers that actually address the specific things I asked about. Namely:
1) What's a good easy way to tell what the I/O unit size is that the OS wants to do? How can I change it? (IOW: If this exclusively a file-system-format issue, how can I tell what was used on an already-created file system? I know /etc/fstab will tell me the file system format... In this case, it's ext3.
2) How can I tell if a disk drive has DMA? If so, how can I turn it on? (I've been told that some drives have this capability, but now I want to follow up and ensure that if these drives have it, it's turned on.)
And, finally;
3) How can I tell if a drive is merely telling the writer that their material is written when it's actually still in cache? And, more importantly, how can I set the system to NOT use such features if / when they exist?
Thank you for your insights.
RT
1) Check /sys/block/sdX/queue/{max_hw_sectors_kb,max_sectors_kb}. The first is that max transfer size the hw allows, the other is the current maximum which can be set to any value <= max_hw_sectors_kb
2) hdparm -i /dev/sdX
3) Turn off write-back caching (hdparm can do it), or make sure that the filesystem issues barriers when synchronizing (as in fsync(), or journal commit).
"serious database use" and you mention IDE in the same sentence?
SSDs or 15k SCSI in a many spindle RAID 1+0 array with separate arrays for data, log and backup. Consider a separate array for tempdb too.
You'd also switch the controller cache to 100% read too to avoid caching issues
Of course, if it's "serious" then you'd consider clustering etc: so a SAN comes in useful here but you may not be as quick as local spindles
You didn't include any info on filesystem or database, so here are some misc pointers.
It is inevitable that you will lose a disk eventually, so its equally important to put a good backup and recovery strategy in place, and mirror your transaction logs, so you can handle a disk failure or even full datafile loss.
1) If possible, put at least one copy of your transaction log on a fixed disk. Don't put your sole transaction log to an external storage subsystem. (Assuming you use a db that supports log mirroring).
2) I agree with gbn, in practice, don't use write caching. I've lost databases on RAID arrays with battery backup. Configure the storage controller card for write-through.
3) Raw devices provide guaranteed writes, but its not worth the hassle. Some filesystems provide synchronous write options too, use one if possible. I am partial to VxFS, but I'm from the Sun world. On Linux, btrfs is eminent at least, but for now, Ext3 works fine if you setup your db properly.