I have bought an I2C EEPROM and want to store sensor and voltage data on it. I'm assuming that a value can be bigger than one byte, and that there can be a lot of data. Is it worth implementing a filesystem with a small file allocation table in such a case? It would make it easier for me to peek through the EEPROM, for example.
I see two reasons for using FAT on an EEPROM:
If there is a requirement for the flexibility of having different files, such as for data logging or configuration. It allows multiple such configuration/log files to be independent and easily added in the future. This can be a very successful building block for future projects.
For ease of access by other devices or libraries. Typically this is only an option if the memory device is directly accessible through another interface, whereas in this case it is an EEPROM. If your device were directly USB capable, such as an ATmega32U4 (Leonardo), then you could use LUFA to have it show up as USB mass storage, making FAT an ideal solution. The same might apply if the device has an Ethernet Shield.
With all that being said, if this is simply a data logger, then KISS (Keep It Simple Solution) may be a good way to go, so that you can keep your focus on the original goal of collecting the data itself.
It is worth noting that SD cards can be added easily and cheaply using either the well-established SD library (ships with the IDE) or the SdFat library (on GitHub, more features), giving almost unlimited logging capacity on FAT32. The only trade-off is that they consume a fair chunk of code space.
I think mpflaga is on the right path.
Some questions you should consider include:
Is the device/microcontroller that is writing the data going to be the same as the one that is reading the data?
How many records are you hoping to fit into your storage device?
How robust/recoverable do you want your storage format to be to events such as reboots/power outages/etc?
My opinion regarding these points is that:
It is going to be the same device reading and writing, so you can probably get away with a very specific/custom format rather than a full blown file system.
You probably want to extract as many bytes as possible for use as storage, so a format well-designed for your application will probably help.
This is tricky. You could use self-describing structures, such as TLV (type-length-value), which would pack your bytes tightly but be harder to search, or you could use a fixed-length structure, which wastes a lot of bytes but allows easy access. You could also just assume the storage will always remain valid, but what happens if power is removed halfway through a write?
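As a rough illustration of the fixed-length option with some protection against a write that is cut off by power loss, here is a minimal sketch. The record layout (sequence number, sensor id, value, CRC-32) and all names are assumptions made up for this example, and Python is used only to show the byte layout; on the microcontroller the same layout would be packed in C.

```python
import struct
import zlib

# Hypothetical layout, little-endian, no padding:
#   uint32 sequence number, uint8 sensor id, int32 raw value, uint32 CRC-32
_BODY = struct.Struct("<IBi")
_CRC = struct.Struct("<I")
RECORD_SIZE = _BODY.size + _CRC.size  # 9 + 4 = 13 bytes per slot


def encode_record(seq, sensor_id, value):
    """Pack one sample and append a CRC-32 computed over the packed body."""
    body = _BODY.pack(seq, sensor_id, value)
    return body + _CRC.pack(zlib.crc32(body))


def decode_record(raw):
    """Return (seq, sensor_id, value), or None if the slot fails its check."""
    if len(raw) != RECORD_SIZE:
        return None
    body, crc_bytes = raw[:_BODY.size], raw[_BODY.size:]
    (stored_crc,) = _CRC.unpack(crc_bytes)
    if zlib.crc32(body) != stored_crc:
        return None  # torn or corrupted write
    return _BODY.unpack(body)


def slot_address(index):
    """Fixed-length records make the EEPROM address a simple multiplication."""
    return index * RECORD_SIZE
```

A reader that scans slots in order can treat the first slot that fails its check as the end of valid data, which is one simple way to recover after an interrupted write.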
Overall, my recommendation would be:
Use an existing library
Use an application-specific format first, but ensure you abstract the storage of the data from the data itself.
If you find you need a filesystem, rewrite the storage layer to use a filesystem.
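The idea of separating the data from its storage could look roughly like the sketch below. `MemoryBackend` and `SampleLog` are hypothetical names invented for this example; the in-memory backend just stands in for an EEPROM driver, and a filesystem-based backend with the same two methods could replace it later without touching the logging code.

```python
class MemoryBackend:
    """Stand-in for an EEPROM driver: a flat byte array with read/write."""

    def __init__(self, size):
        self._cells = bytearray(b"\xff" * size)  # erased EEPROM reads as 0xFF

    def read(self, address, length):
        return bytes(self._cells[address:address + length])

    def write(self, address, data):
        if address < 0 or address + len(data) > len(self._cells):
            raise ValueError("write outside device")
        self._cells[address:address + len(data)] = data


class SampleLog:
    """Application-facing API; it only needs a backend with read() and write()."""

    SLOT = 4  # one signed 32-bit sample per slot (illustrative choice)

    def __init__(self, backend, capacity):
        self._backend = backend
        self._capacity = capacity
        self._count = 0  # kept in RAM only in this sketch

    def append(self, value):
        if self._count >= self._capacity:
            raise OverflowError("log is full")
        raw = value.to_bytes(self.SLOT, "little", signed=True)
        self._backend.write(self._count * self.SLOT, raw)
        self._count += 1

    def get(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        raw = self._backend.read(index * self.SLOT, self.SLOT)
        return int.from_bytes(raw, "little", signed=True)
```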
Having a small standard file system, like FAT16, is worth implementing because you can map this file system over USB or a network to other devices/computers.
Standardization in your design is a big compliance advantage.
You can find ready-made sources/libraries or, since FAT16 is really simple and well described/documented, try implementing it yourself.
Related
I would like to programmatically copy a section of a file to another file. Is there any Win32 API I could use without moving bytes through my program? Or should I just read from the source file and write to the target?
I know how to do this by reading and writing chunks of bytes, I just wanted to avoid doing it myself if the OS already offers that.
What you're asking for can be achieved, but not easily. Device drivers routinely transfer data without CPU involvement, but doing that requires kernel mode code. Basically, you would have to write a device driver. The benefits would have to be huge to justify the difficulties associated with developing, testing, and distributing a kernel mode driver. So unless you think there is a huge benefit at stake here, I'm afraid that ReadFile/WriteFile are the best you can do.
Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reason why I don't want to rely on fsync() is that most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I have found so far: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centisecs. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
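For the Linux part, the current values can be read at run time rather than assumed. The sketch below reads the kernel tunables dirty_expire_centisecs and dirty_writeback_centisecs and adds them as a rough upper estimate of how long a dirty page may wait before the kernel submits it. The function names are made up for this example, and the estimate is an assumption: it ignores I/O queueing and the drive's own cache.

```python
from pathlib import Path


def read_centisecs(name, root="/proc/sys/vm"):
    """Return a vm tunable given in centiseconds as seconds, or None."""
    try:
        text = (Path(root) / name).read_text()
        return int(text.strip()) / 100.0
    except (OSError, ValueError):
        return None  # not Linux, or the file is unreadable


def estimated_writeback_delay(root="/proc/sys/vm"):
    """Rough estimate: page age limit plus one flusher wake-up interval."""
    expire = read_centisecs("dirty_expire_centisecs", root)
    wakeup = read_centisecs("dirty_writeback_centisecs", root)
    if expire is None or wakeup is None:
        return None
    if wakeup == 0:
        return None  # periodic writeback is disabled, no time bound applies
    return expire + wakeup
```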
Let me restate your problem in only slightly uncharitable terms: You're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee, rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS-specific syncing mechanism. As PostgreSQL's reliability documentation puts it:
When the operating system sends a write request to the storage
hardware, there is little it can do to make sure the data has arrived
at a truly non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage
components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
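A portable wrapper of the kind mentioned above might start out like this minimal sketch. The function names are invented for the example. It uses only the OS-level calls Python exposes (`os.fsync`, plus `F_FULLFSYNC` where the `fcntl` module offers it, as on macOS), and it cannot make a drive that misreports its cache state honest.

```python
import os

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None


def flush_to_device(fd):
    """Ask the OS to push a file's data as far toward stable storage as it can."""
    # Where F_FULLFSYNC exists (macOS), it also asks the drive to flush its cache.
    if fcntl is not None and hasattr(fcntl, "F_FULLFSYNC"):
        try:
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return
        except OSError:
            pass  # not supported on this filesystem; fall back to fsync
    os.fsync(fd)  # fsync() on POSIX; CPython uses _commit() on Windows


def durable_write(path, offset, data):
    """Write data at offset and return only after the flush call has returned."""
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        flush_to_device(fd)
    finally:
        os.close(fd)
```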
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your solution is to use only a fuzzy timer to estimate when the data will have been written.
In my opinion this is the wrong way. You have control over when the OS writes to the hard drive, so use that possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think you should tell the user/admin to take care when choosing a hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe it's up to you to run a series of tests with different hard drives and Brad Fitzpatrick's tool to get a good estimate of when hard drives will have written all data. But of course, if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that the system will keep working with the aid of hardware such as RAID arrays and other high-availability configurations.
As far as a software solution goes, I think the answer is to trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question. In my case, I have an Android app which lazy-writes data back to disk, and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps. It may hurt performance, but losing data kills performance a whole lot more if you consider the hours it takes to recover it.
I need to save very large amounts of data (>500GB) which is being streamed (800Mb/s) from another device connected to my PC. The speed rules out use of a database, e.g. MySQL/ISAM, and I am looking for a fast, light library which sits on top of the 'C' stdio file lib (i.e. fopen/fclose/fwrite) and which will allow me to write/read a very large file (up to available disk space).
Behind-the-scenes, the large file can be broken up into smaller files e.g. 1GB and I want the API to take care of these details.
The data arrives at the PC in a compressed binary format and no further processing is needed before writing it to the hard-disk.
The library should work on Windows and Linux.
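To make the splitting requirement concrete, here is a minimal sketch of a writer that presents one append-only stream and rolls over to a new segment file whenever the current one is full. `SegmentedWriter` and the `prefix.NNNNNN` naming scheme are assumptions made up for this example; it is not an existing library, and a C version on top of fopen/fwrite would follow the same structure.

```python
class SegmentedWriter:
    """Append-only stream stored as prefix.000000, prefix.000001, ..."""

    def __init__(self, prefix, segment_size):
        if segment_size <= 0:
            raise ValueError("segment_size must be positive")
        self._prefix = prefix
        self._segment_size = segment_size
        self._index = 0    # number of the next segment to create
        self._used = 0     # bytes written to the current segment
        self._file = None

    def _open_next(self):
        if self._file is not None:
            self._file.close()
        self._file = open(f"{self._prefix}.{self._index:06d}", "wb")
        self._index += 1
        self._used = 0

    def write(self, data):
        view = memoryview(data)
        while len(view) > 0:
            if self._file is None or self._used == self._segment_size:
                self._open_next()
            room = self._segment_size - self._used
            chunk = view[:room]          # never exceed the segment size
            self._file.write(chunk)
            self._used += len(chunk)
            view = view[len(chunk):]

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

With a fixed segment size, a reader can locate any byte offset by dividing by the segment size to get the file number and taking the remainder as the position inside it.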
If you need random access into the data, take a look at memory-mapped files.
They let you map a file (or a section of a file) into memory transparently, without having to explicitly allocate memory and read data. This works on Windows and Linux (there is a Boost library that wraps the differences).
On Windows you can handle files larger than 4 GB on a 32-bit OS by using multiple windows into the file.
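The "window into the file" idea can be sketched with Python's standard `mmap` module, which wraps the platform calls on both Windows and Linux. The helper name `read_window` is made up for this example. The mapping offset has to be a multiple of `mmap.ALLOCATIONGRANULARITY`, so the sketch rounds the offset down and skips the extra leading bytes; it assumes the requested range lies inside the file and that `length` is greater than zero.

```python
import mmap


def read_window(path, offset, length):
    """Return `length` bytes at `offset` by mapping only the window needed."""
    granularity = mmap.ALLOCATIONGRANULARITY
    base = (offset // granularity) * granularity  # aligned start of the mapping
    delta = offset - base                         # bytes to skip inside the window
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=base) as window:
            return window[delta:delta + length]
```

Only the mapped window consumes address space, which is what makes files larger than the address space usable on a 32-bit system.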
Edit: Sorry, 800Mb/s! I don't know any disks that can cope with that. You might be looking at a RAID array of SSDs.
There used to be image capture cards that used an attached drive as a simple series of bytes with no filesystem to get very high speed sustained writes. I don't know if you are going to need something like that.
For ultimate speed, I suggest you go highly platform specific.
The objective is to get as close as you can to connecting the input device directly to hard drive. One method is to write a driver for the input device that writes directly to the hard drive.
The generic algorithm is to use either a very large circular byte buffer or use multiple buffers. You need extra space to compensate for the speed difference between the input device and the output device; provided the input device is non-stop.
If you can pause the input device, the issue becomes easier.
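The multiple-buffer approach can be sketched as a bounded queue between the code that receives blocks and a separate writer thread. `stream_to_sink` and its parameters are hypothetical names for this example. The queue depth is the "extra space" that absorbs speed differences, and in this sketch the producer simply blocks when every buffer is full, which is only acceptable if the input device can be paused.

```python
import queue
import threading


def stream_to_sink(source_blocks, sink_write, depth=8):
    """Copy blocks from an iterable to sink_write() through a bounded queue."""
    pending = queue.Queue(maxsize=depth)  # at most `depth` blocks in flight
    errors = []

    def writer():
        while True:
            block = pending.get()
            if block is None:             # sentinel: the producer has finished
                return
            try:
                sink_write(block)
            except Exception as exc:      # remember the failure, keep draining
                errors.append(exc)

    thread = threading.Thread(target=writer)
    thread.start()
    for block in source_blocks:
        pending.put(block)                # blocks only when all buffers are full
    pending.put(None)
    thread.join()
    if errors:
        raise errors[0]
```

For a device that cannot be paused, the `put` call would need a timeout or a drop policy instead of blocking, and `depth` would have to cover the longest expected stall of the disk.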
I want to ensure I have done all I can to configure a system's disks for serious database use. The three areas I know of (any others?) to be concerned about are:
I/O size: the database engine and disk's native size should either match, or the database's native I/O size should be a multiple of the disk's native I/O size.
Disks that are capable of Direct Memory Access (e.g. IDE) should be configured for it.
When a disk says it has written data persistently, it must be so! No keeping it in cache and lying about it.
I have been looking for information on how to ensure these are so for CentOS and Ubuntu, but can't seem to find anything at all!
I want to be able to check these things and change them if needed.
Any and all input appreciated.
PLEASE NOTE: The actual hardware involved is VERY modest. The point is to get the most out of what hardware we do have, even though it's "not very serious hardware" from a broader perspective.
MORE:
I appreciate the time taken to read and reply, but I'm hoping to get "answers" that aren't just good database / hardware advice but answers that actually address the specific things I asked about. Namely:
1) What's a good, easy way to tell what I/O unit size the OS wants to use? How can I change it? (IOW: If this is exclusively a file-system-format issue, how can I tell what was used on an already-created file system? I know /etc/fstab will tell me the file system format... In this case, it's ext3.)
2) How can I tell if a disk drive has DMA? If so, how can I turn it on? (I've been told that some drives have this capability, but now I want to follow up and ensure that if these drives have it, it's turned on.)
And, finally;
3) How can I tell if a drive is merely telling the writer that their material is written when it's actually still in cache? And, more importantly, how can I set the system to NOT use such features if / when they exist?
Thank you for your insights.
RT
1) Check /sys/block/sdX/queue/{max_hw_sectors_kb,max_sectors_kb}. The first is the maximum transfer size the hardware allows; the second is the current maximum, which can be set to any value <= max_hw_sectors_kb.
2) hdparm -i /dev/sdX
3) Turn off write-back caching (hdparm can do it), or make sure that the filesystem issues barriers when synchronizing (as in fsync(), or journal commit).
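The check in point 1 can be scripted. The sketch below reads the two sysfs files named there for a given block device; the function name and the `sysfs_root` parameter (present so the code can be pointed at a test directory) are assumptions for this example.

```python
from pathlib import Path


def queue_limits(device, sysfs_root="/sys/block"):
    """Return max_hw_sectors_kb and max_sectors_kb for a device such as 'sda'."""
    queue_dir = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ("max_hw_sectors_kb", "max_sectors_kb"):
        try:
            limits[name] = int((queue_dir / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None  # device missing or attribute unreadable
    return limits
```

Changing the current limit means writing a new value to max_sectors_kb as root; this sketch only reads.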
"serious database use" and you mention IDE in the same sentence?
SSDs or 15k SCSI in a many spindle RAID 1+0 array with separate arrays for data, log and backup. Consider a separate array for tempdb too.
You'd also switch the controller cache to 100% read too to avoid caching issues
Of course, if it's "serious" then you'd consider clustering etc., so a SAN comes in useful here, but it may not be as quick as local spindles.
You didn't include any info on filesystem or database, so here are some misc pointers.
It is inevitable that you will lose a disk eventually, so it's equally important to put a good backup and recovery strategy in place, and to mirror your transaction logs, so you can handle a disk failure or even full datafile loss.
1) If possible, put at least one copy of your transaction log on a fixed disk. Don't put your sole transaction log on an external storage subsystem. (Assuming you use a db that supports log mirroring.)
2) I agree with gbn: in practice, don't use write caching. I've lost databases on RAID arrays with battery backup. Configure the storage controller card for write-through.
3) Raw devices provide guaranteed writes, but it's not worth the hassle. Some filesystems provide synchronous write options too; use one if possible. I am partial to VxFS, but I'm from the Sun world. On Linux, btrfs is imminent at least, but for now, ext3 works fine if you set up your db properly.
I will write something to a file/memory just before system shutdown or a service shutdown. On the next restart of the system, is it possible to access the same file or the same memory on the disk before the filesystem loads? The actual requirement is this: we have a driver that sits between the volume-level drivers and the filesystem driver, and in that part of the driver code I want to access some memory or a file.
Thanks & Regards,
calvin
The logical thing here is to read/write this into the registry if it is not too big. Is there a reason you do not want to use the registry?
If you need to access large data, you are writing a volume or device filter, and you cannot rely on the ZwOpen/Read/Write/Close functions in the kernel, one approach would be to create the file in user mode, get its device name and cluster chain, and store them in the registry. On the next boot, you can get the device and clusters from the registry and do direct I/O on them.
Since you want to access this before the filesystem loads, my first thought is to allocate and use a block of storage space on the hard drive outside of the filesystem. You can create a hidden mini-partition on the drive and use low-level I/O commands to read and write your data.
This is a common task in the world of embedded systems, and we often implement it by adding some sort of non-volatile memory device into the system (flash, battery-backed DRAM, etc) and reading and writing to that device. Since you likely don't have the same level of control over the available hardware as embedded developers do, the closest analogue I can think of would be to reserve a chunk of space on a physical disk that you can read from without having to mount as a filesystem. A dedicated mini-partition might work the best because if you know the size of it, you can treat it as one big raw-access buffer and can avoid having to hassle with filenames, filesystems, etc.
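Treating the reserved space as one raw buffer usually still calls for a small header, so that the reader can tell valid data from an uninitialized region. The sketch below shows one possible layout (magic marker, payload length, CRC-32) using ordinary file I/O. All names and the layout are assumptions for this example, and it is a user-mode illustration only: inside a filter driver the same bytes would be read and written with the driver's own low-level I/O. `region_path` would be the raw device or partition node; a regular file works for trying it out.

```python
import os
import struct
import zlib

_HEADER = struct.Struct("<4sII")  # magic, payload length, CRC-32 of payload
_MAGIC = b"RAW1"                  # hypothetical marker for an initialized region


def save_blob(region_path, payload, region_size):
    """Write header and payload at the start of an existing raw region."""
    if _HEADER.size + len(payload) > region_size:
        raise ValueError("payload does not fit in the reserved region")
    header = _HEADER.pack(_MAGIC, len(payload), zlib.crc32(payload))
    with open(region_path, "r+b") as region:
        region.seek(0)
        region.write(header + payload)
        region.flush()
        os.fsync(region.fileno())  # ask the OS to push it out before shutdown


def load_blob(region_path, region_size):
    """Return the stored payload, or None if the region holds no valid blob."""
    with open(region_path, "rb") as region:
        header = region.read(_HEADER.size)
        if len(header) < _HEADER.size:
            return None
        magic, length, crc = _HEADER.unpack(header)
        if magic != _MAGIC or length > region_size - _HEADER.size:
            return None
        payload = region.read(length)
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```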