btrfs uncorrectable error on unmodified file after passing scrub

I have two relatively new 4T hard drives (WD Data Center Re WD4000FYYZ) formatted as btrfs with raid1 data and raid1 metadata.
I copied a large binary file to the volume (~76 GB). Soon after copying the file, I ran a btrfs scrub. There were no errors.
A few months later, a scrub returned an unrecoverable error on that file. It has not been modified since it was originally copied. I might add that the SMART attributes for both drives do not indicate any errors (Current_Pending_Sector or otherwise).
The system with the drives does not have ECC memory.
The only thing I can think of that might cause this kind of error is memory corruption: while writing to another file whose data checksums were stored in the same metadata block as some of the checksums for the big file, corruption in memory could have allowed bad data to pollute one or more of the big file's checksums.
Unfortunately, I had hoped in migrating to btrfs that once data was loaded and scrubbed successfully, you could be confident it would stay intact as long as it was not written to (in a raid1/5/6 configuration, of course). Obviously, this is not the case.
Can anyone explain how this could have happened? Also, if I had taken a snapshot of the volume that contained the big file, would I still have had access to the original, uncorrupted data from the snapshot?

This silent data corruption was caused by a bad memory stick. The memory was replaced and the problem has not reappeared.

Related

Why can a database file become corrupted if it's copied while in use?

Many sources online claim that copying a database file while it's in use can corrupt that file. But I couldn't find any explanation for why this is an issue, and I'm confused about it. Copying a file is a read-only operation on that file, so why should any of the data stored in that file get affected in any way?
Any file can become corrupted if copied while being in use by another process. "Being in use" means that the other process can modify the file. Since the speed of copying is finite, it might happen that, by the time you copy the last bytes of the file, that other process has already changed the first bytes, thus making the copy inconsistent.
Copying a file is a read-only operation on that file ...
Yes it is, but it is not "atomic".
The file is not locked while it is being read (to be copied) and the running database will continue to write to the file while the copy is running. The database writes to the datafile in "Blocks" and these blocks can be written to any part of the file, at any time. The copy process might have read the first part of the file, creating its copy, but then the database overwrites bits that have already been read. As a result, the copy is incomplete and inconsistent.
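The interleaving described above can be simulated deterministically from the shell. This is a hypothetical sketch (file names are placeholders): the "copier" reads the file in two halves, and the "writer" replaces the whole file in between, producing a copy that matches neither version.

```shell
# Simulate a torn copy: reader and writer interleave on the same file.
printf 'AAAA1111' > data.bin        # the "database file", version 1
head -c 4 data.bin > copy.bin       # copier reads the first half
printf 'BBBB2222' > data.bin        # writer commits version 2
tail -c 4 data.bin >> copy.bin      # copier reads the second half
cat copy.bin                        # AAAA2222: matches neither version
```

The resulting copy is internally inconsistent even though every individual read was correct at the moment it happened.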
Now there are tools/mechanisms that handle this (e.g. SQL Server's VSS integration) by effectively asking the database to "stop writing to the data files for a bit, please" (which sounds good, but effectively "stalls" the database entirely).
The Basic Rules of Thumb, of course, should be
Always do Database "Stuff" with Database Tools.

Will file system be corrupted on raw disk modification?

Some OSs allow programs to access the raw disk, bypassing the file system. If I'm wrong (i.e. only raw reads are allowed, not raw writes), please correct me.
If I'm right, then what happens if a program writes to the raw disk, modifying some blocks but not updating the disk tables (free block list, FAT, etc.)? Will the file system update itself automatically, notify the program, become corrupted, or something else?
For example, if nothing is done, the file system may write a new file to the blocks containing that data, thinking they are currently free (if the free block list was not updated by the program).

btrfs can't mount after broken disk removed

I want to use btrfs as the filesystem on my server, and I am still researching how it behaves in all worst-case conditions.
Currently I want to test how the RAID setup handles a crash. The conditions I want to test are:
if a disk breaks, how do I replace it?
if I can't replace it, how do I save my data?
if I (or my team) accidentally format one of the disks, how do I fix it?
if one of my disks is stolen (I don't think this is really possible; it's just a worst-case condition), how do I replace it?
Of all the questions above, I can only answer two.
Answer to number one: I can use the replace method before unplugging the broken disk.
Answer to number two: I can plug in an external hard drive, mount it, and use the restore method to save my data.
The other questions I failed to test.
For questions 3 and 4 (if I replace the disk with another one), I tried mount -o degraded, but the mount fails with the error wrong fs type, bad option, bad superblock on /dev/sdb. I tried to rebalance with the balance method, but I can't mount the filesystem at all.
Please, I need answers to my questions 3 and 4.
The replace operation needs to be run before the disk completely dies, or it won't work (and will likely screw up the array). If the disk is already unreadable, then yank it and mount with the degraded option. Add a new disk into the array, tell it to delete missing devices, and it should sort it all out.
If your array has redundancy on both data and metadata, a single failed disk shouldn't cost you any of your data. If, for some reason, the array is corrupted and won't accept a replacement disk, you can use btrfs restore to copy as much as is recoverable out of the array and into a different storage system, then rebuild the array.
Formatting a disk is no different from having one go bad, except you don't actually need a new physical disk. If your array is redundant, mount degraded, add the formatted disk back in, and delete missing. It should automatically rebalance the affected data. Running a scrub when you're done might also be wise.
A stolen disk is the same as having one go bad: mount degraded, add in a new one, and delete missing.
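Put together, the mount-degraded / add / delete-missing sequence might look like the sketch below. The device names and mount point are placeholders, and this assumes a raid1 profile with one lost member; it is not runnable without real block devices.

```shell
# Hypothetical recovery of a btrfs raid1 array that lost one member.
# /dev/sda = surviving member, /dev/sdc = replacement, /mnt = mount point.
btrfs device scan                   # re-detect btrfs member devices
mount -o degraded /dev/sda /mnt     # mount from a surviving device
btrfs device add /dev/sdc /mnt      # add the replacement disk
btrfs device delete missing /mnt    # rebuild redundancy onto it
btrfs scrub start /mnt              # optional: verify checksums afterwards
```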
Your bad superblock issue is most likely caused by attempting to mount the disk that was formatted/replaced. Formatting removes the btrfs filesystem identifiers, so the system won't be able to detect the other drives in the array. Use one of the devices that's still part of the array in the mount command and it should detect the rest. If it doesn't, your array was probably not in a consistent state before you removed/formatted the disk, and there is insufficient redundancy to repair it; btrfs restore may be your only option at that point. Depending on circumstances, you may need to run btrfs device scan to re-detect which devices are and are not part of the array.

How does fwrite/putc write to Disk?

Suppose we have an already existing file, say <File>. This file has been opened by a C program for update (r+b). We use fseek to navigate to a point inside <File>, other than the end of it. Now we start writing data using fwrite/fputc. Note that we don't delete any data previously existing in <File>...
How does the system handle those writes? Does it rewrite the whole file to another position in the Disk, now containing the new data? Does it fragment the file and write only the new data in another position (and just remember that in the middle there is some free space)? Does it actually overwrite in place only the part that has changed?
There is a good reason for asking: in the first case, if you continuously update a file, the system can get slow. In the second case, it could be faster but will mess up the file system if done to too many files. In the third case, especially on a solid-state disk, updating the same spot of a file over and over again may render that part of the disk useless.
Actually, that's where my question originates. I've read that, to save disk sectors from overuse, solid-state disks move data to less-used sectors using various techniques. But how exactly do the stdio functions handle such situations?
Thanks in advance for your time! :D
The filesystem driver maintains a kind of dictionary mapping files to sectors on the disk, so when you update the content of a file, the filesystem looks up this dictionary, which tells it in which sectors on the disk the file data is located. Then the disk seeks (or waits until the platter arrives there) and updates the appropriate sectors.
That's the short version.
So in the case of updating a file, the file is normally not moved to a new place. When you write new data to the file, appending to it, and the data doesn't fit into the existing sectors, then additional sectors are allocated and the data is written there.
If you delete a file, then usually its sectors are marked as free and are reused. So only if you create a new file and rewrite it can the file end up in different sectors than before.
But the details can vary depending on the hardware. AFAIK, if you overwrite data on a CD, the data is written anew (as long as the session is not finalized), because you cannot update data on a CD once it is written.
Your understanding is incorrect: "Note that we don't delete any data previously existing in File"
If you seek into the middle of a file and start writing it will write over whatever was at that position before.
How this is done under the covers probably depends on how the computer inside the hard disk implements it. It's supposed to be invisible outside the hard disk and shouldn't matter.
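The overwrite-in-place behaviour is easy to observe from the shell; here dd stands in for fseek + fwrite (the file name is made up):

```shell
printf 'AAAAAAAAAA' > demo.bin    # a ten-byte file of 'A's
# Seek 3 bytes in and overwrite 2 bytes, without truncating the rest:
printf 'BB' | dd of=demo.bin bs=1 seek=3 conv=notrunc 2>/dev/null
cat demo.bin                      # AAABBAAAAA
```

The file stays the same length: the two bytes at offsets 3 and 4 are replaced, and nothing is shifted or deleted, exactly as with fseek followed by fwrite in C.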

Getting data from MATLAB Simulink every 0.008s in .txt file

I need to get data from my Simulink model, write it to a txt file, and have another program read it, all every 0.008 s.
Is there any way to do this? All I could manage was getting the data into the workspace.
Also, the system is discrete.
You should use a To File block to save the data to disk. It will figure out the correct buffer size, etc., for you and write the data to disk. You just have to poll from the other program to get new data.
8 milliseconds is generally not enough data to justify the overhead of disk IO, so the To File block needs more than this to write to disk, and your other program needs more than this to read. This obviously introduces latency.
If you want a lower-latency solution, consider using the UDP or TCP communication blocks that exist in the DSP System Toolbox library.
Of course, it's impossible to say anything without a lot more detail.
How much data? What operating system? What happens if you "miss"? What kind of disk is the file on? Does it really have to be a file on-disk, can't you use e.g. pipes or something to avoid hitting disk? What does the "other program" have to do with the data?
8 milliseconds is not a lot of time for a disk to do anything, you're basically going to be assuming all accesses are in cache in order to work, so factor out the disk. Use a pipe or a RAM disk.
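As a sketch of the pipe suggestion (the pipe name and payload are placeholders), a named pipe hands each sample straight from producer to consumer without touching the disk:

```shell
mkfifo simdata                       # create a named pipe (FIFO)
printf '0.008,1.23\n' > simdata &    # producer: stands in for the model
read -r line < simdata               # consumer blocks until data arrives
echo "$line"                         # 0.008,1.23
```

The writer blocks until a reader opens the pipe, so nothing is buffered on disk and the read completes as soon as the sample is produced.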
8 milliseconds is also not a lot of time for a typical desktop operating system.
