Btrfs snapshot size too big. Do they actually contain the diffs only?

I don't have a good understanding of the mechanics of CoW snapshots, but I expect them to contain only the diffs, with data shared among all snapshots that have the same parent subvolume.
I wrote a script to check the disk space consumed by btrfs snapshots.
#!/usr/bin/zsh
# Append one line to the test file, then snapshot the subvolume - 2000 times
for i in {1..2000}
do
    echo "line$i" >> /btrfs/test-volume/btrfs-doc.txt
    /usr/bin/time -f "execution time: %E" \
        btrfs subvolume snapshot /btrfs/test-volume /btrfs/snapshots/test-volume-snap$i
done
After running it, I displayed the directory sizes; here is what I got:
❯ btrfs filesystem df /btrfs
Data, single: total=8.00MiB, used=6.84MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=102.38MiB, used=33.39MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
❯ btrfs filesystem du -s /btrfs
Total Exclusive Set shared Filename
18.54MiB 6.74MiB 36.00KiB /btrfs
❯ df -h /btrfs
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vgstoragebox-btrfs 2.0G 77M 1.8G 5% /btrfs
❯ du -sh /btrfs
20M /btrfs
❯ ll /btrfs/test-volume/btrfs-doc.txt
-rw-r--r-- 1 root root 17K Jul 6 14:50 /btrfs/test-volume/btrfs-doc.txt
❯ tree -hU /btrfs/snapshots
/btrfs/snapshots
├── [ 26] test-volume-snap1
│   └── [ 6] btrfs-doc.txt
├── [ 26] test-volume-snap2
│   └── [ 12] btrfs-doc.txt
├── [ 26] test-volume-snap3
│   └── [ 18] btrfs-doc.txt
...
├── [ 26] test-volume-snap1998
│   └── [ 16K] btrfs-doc.txt
├── [ 26] test-volume-snap1999
│   └── [ 16K] btrfs-doc.txt
└── [ 26] test-volume-snap2000
└── [ 16K] btrfs-doc.txt
2000 directories, 2000 files
All the utilities calculated the size differently, so I can't say exactly how much disk space the /btrfs/snapshots directory actually consumes, but I can see it is much bigger than even double the size of the file /btrfs/test-volume/btrfs-doc.txt. My current thinking is that it should be around double the size, if btrfs snapshots contain only the diffs and merely link to the shared data.
For comparison, I ran the same test with LVM snapshots, and they consumed only a small amount of disk space.

From a userland perspective, btrfs snapshots are just simple directories containing the files and contents of the subvolume at the time the snapshot was created. You can access them like any other directory.
Therefore the userland tools you used will report the sizes of the individual files within the snapshot just as with any other file.
If you create, say, 10 snapshots of the same subvolume, userland tools such as du will report the same total size for each snapshot, and summing this over all 10 snapshots will report a disk usage of 10 times the size of the initial subvolume.
But: due to the CoW nature of these subvolumes, the files within the snapshots actually all share the same data blocks on disk. So although du will report 10 times the total size, the data takes up space on disk only once.
The way Copy-on-Write works is that a new copy of a file (e.g. created with cp --reflink) or a new snapshot is at first nothing more than a new pointer to the same physical data on disk as the original file/subvolume. So the new file will not use any additional disk space (besides some additional metadata).
Only when the data is changed is the new data written to a new place on the disk, and the pointer of the file/snapshot is updated to include that data block. All unchanged parts of the data are still shared with the original copy.
This is why creating snapshots is very fast and uses next to no additional disk space. But over time the disk space used by a snapshot may grow, since its referenced data blocks diverge from the original subvolume and fewer and fewer data blocks will actually be shared.
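You can observe this with a reflink copy (a minimal sketch; the file names are illustrative):
# Create a reflink copy: no file data is duplicated, only new metadata is written
$ cp --reflink=always original.bin copy.bin
# Appending to the copy writes only the changed blocks to a new location;
# all unchanged blocks remain shared with original.bin
$ echo "new data" >> copy.bin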
If you want to see the amount of data that is shared between or unique to the individual subvolumes, you can use the quota support feature of btrfs.
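A minimal sketch of that, using the paths from the question above:
# Enable quota tracking on the filesystem
$ sudo btrfs quota enable /btrfs
# Show referenced vs. exclusive usage per subvolume; the "excl" column
# approximates the space that is unique to each snapshot
$ sudo btrfs qgroup show /btrfs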

Related

Why would file checksums inconsistently fail?

I created a ~2MiB file.
dd if=/dev/urandom of=file.bin bs=2M count=1
Then I copied that file a large number of times and generated a checksum for each (identical) copy.
for i in $(seq 50000)
do
    name="file.${i}.bin"
    cp file.bin "${name}"
    sha512sum "${name}" > "${name}.sha512"
done
I then verified all of those checksummed files with a validation script to run sha512sum against each file.
for file in $(find . -regex ".*\.sha512")
do
    sha512sum --check --quiet "${file}" || (
        cat "${file}" && sha512sum "${file%.sha512}"
    )
done
I just created these files, yet when I validate them moments later, I see intermittent failures and inconsistencies in the data (console text truncated for readability):
will:/mnt/usb $ for file in `find ...
file.5602.bin: FAILED
sha512sum: WARNING: 1 computed checksum did NOT match
91fc201a3812e93ef3d4890 ... file.5602.bin
b176e8e3ea63a223130f3a0 ... ./file.5602.bin
The checksum files are all identical, since the source files are all identical.
The problem seems to be that my computer is, seemingly at random, generating the wrong checksum for some of my files when I go to validate. A different file fails the checksum every time, and files that previously failed will pass.
will:/mnt/usb $ for file in `find ...
sha512sum: WARNING: 1 computed checksum did NOT match
91fc201a3812e93ef3d4890 ... file.3248.bin
442a1d8805ed134c9ab5252 ... ./file.3248.bin
Keep in mind that all of these files are identical.
I see the same behavior with SATA SSDs and HDDs and with USB devices, with md5 and sha512, and with xfs, btrfs, ext4, and vfat. I tried live-booting another OS. I see the same strange behavior regardless. I also see that rsync --checksum thinks the checksums are wrong and re-copies these files even though they have not changed.
What could explain this behavior? Since it's happening on multiple devices with all the scenarios I described, I doubt this is bit rot. My kernel logs show no obvious errors. I would assume this is a hardware issue based on my troubleshooting, but how can this be diagnosed? Is it the CPU, the motherboard, the RAM?
What could explain this behavior? How can this be diagnosed?
From what I've read, a number of issues could explain this behavior: bad disk(s), a bad PSU (power supply), bad RAM, or filesystem issues.
I tried the following to determine what was happening. I repeated the experiment with different...
Disks
Types of disks (SDD vs HDD)
External drives (3.5 and 2.5 enclosures)
Flash drives (USB 2 and 3 on various ports)
Filesystems (ext4, vfat (fat32), xfs, btrfs)
Different PSU
Different OS (live boot)
Nothing seemed to resolve this.
Finally, I gave memtest86+ v5.0.1 a try via an Ubuntu live USB.
Voila. It found bad memory. Through process of elimination I determined one of my memory sticks was bad, then tested the other overnight to ensure it was in good shape. I re-ran my experiment and I am now seeing consistent checksums on all my files.
What a subtle bug. I only noticed this behavior by accident; if I hadn't been messing around with file checksums, I do not think I would have found this bad RAM.
This makes me want to schedule a regular routine in which I verify and test my RAM. A consequence of this bad memory stick is that some of my test data did end up corrupt, but more often than not, the checksum verifications were just intermittent failures.
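As a sketch of such a routine (assuming the userspace memtester tool is installed; it cannot test memory the kernel is using, so an occasional offline memtest86+ run is still more thorough):
# Hypothetical entry for root's crontab: lock and test 1024MB of RAM
# once a week, one pass (adjust the path for your distribution)
0 3 * * 0 /usr/sbin/memtester 1024M 1 > /var/log/memtester.log 2>&1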
In one sample data pool, all the checksums start with cb2848ca0e1ff27202a309408ec76..., because all ~50,000 files are identical.
There are, however, two files that are corrupt, but this is not bit rot or file-integrity damage.
What seems most likely is that these files were created corrupt, because cp hit the bad RAM at the time they were written. Those files consistently return bad checksums of 58fe24f0e00229e8399dc6668b9... and bd85b51065ce5ec31ad7ebf3..., while the other 49,998 files return the same checksum.
This has been a fun, extremely frustrating experiment in debugging.

Google Compute Engine VM instance: VFS: Unable to mount root fs on unknown-block

My instance on Google Compute Engine is not booting up due to some boot order issues.
So, I have created another instance and re-configured my machine.
My questions:
How can I handle these issues when I host some websites?
How can I recover my data from the old disk?
Logs:
[ 0.348577] Key type trusted registered
[ 0.349232] Key type encrypted registered
[ 0.349769] AppArmor: AppArmor sha1 policy hashing enabled
[ 0.350351] ima: No TPM chip found, activating TPM-bypass!
[ 0.351070] evm: HMAC attrs: 0x1
[ 0.351549] Magic number: 11:333:138
[ 0.352077] block ram3: hash matches
[ 0.352550] rtc_cmos 00:00: setting system clock to 2015-12-19 17:06:53 UTC (1450544813)
[ 0.353492] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 0.354108] EDD information not available.
[ 0.536267] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input2
[ 0.537862] md: Waiting for all devices to be available before autodetect
[ 0.538979] md: If you don't use raid, use raid=noautodetect
[ 0.539969] md: Autodetecting RAID arrays.
[ 0.540699] md: Scanned 0 and added 0 devices.
[ 0.541565] md: autorun ...
[ 0.542093] md: ... autorun DONE.
[ 0.542723] VFS: Cannot open root device "sda1" or unknown-block(0,0): error -6
[ 0.543731] Please append a correct "root=" boot option; here are the available partitions:
[ 0.545011] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 0.546199] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-39-generic #44~14.04.1-Ubuntu
[ 0.547579] Hardware name: Google Google, BIOS Google 01/01/2011
[ 0.548728] ffffea00008ae140 ffff880024ee7db8 ffffffff817af92b 000000000000111e
[ 0.549004] ffffffff81a7c7c8 ffff880024ee7e38 ffffffff817a976b ffff880024ee7dd8
[ 0.549004] ffffffff00000010 ffff880024ee7e48 ffff880024ee7de8 ffff880024ee7e38
[ 0.549004] Call Trace:
[ 0.549004] [] dump_stack+0x45/0x57
[ 0.549004] [] panic+0xc1/0x1f5
[ 0.549004] [] mount_block_root+0x210/0x2a9
[ 0.549004] [] mount_root+0x54/0x58
[ 0.549004] [] prepare_namespace+0x16d/0x1a6
[ 0.549004] [] kernel_init_freeable+0x1f6/0x20b
[ 0.549004] [] ? initcall_blacklist+0xc0/0xc0
[ 0.549004] [] ? rest_init+0x80/0x80
[ 0.549004] [] kernel_init+0xe/0xf0
[ 0.549004] [] ret_from_fork+0x58/0x90
[ 0.549004] [] ? rest_init+0x80/0x80
[ 0.549004] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 0.549004] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
What Causes This?
That is the million dollar question. After inspecting my GCE VM, I found out there were 14 different kernels installed, taking up several hundred MB of space. Most of the kernels didn't have a corresponding initrd.img file and were therefore not bootable (including 3.19.0-39-generic).
I certainly never went around trying to install random kernels, and once removed, they no longer appear as available upgrades, so I'm not sure what happened. Seriously, what happened?
Edit: New response from Google Cloud Support.
I received another disconcerting response. This may explain the additional, errant kernels.
"On rare occasions, a VM needs to be migrated from one physical host to another. In such case, a kernel upgrade and security patches might be applied by Google."
1. "How can I handle these issues when I host some websites?"
My first instinct is to recommend using AWS instead of GCE. However, GCE is less expensive. Before doing any upgrades, make sure you take a snapshot, and try rebooting the server to see if the upgrades broke anything.
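For example, to snapshot the boot disk first (a sketch; the disk name, snapshot name, and zone are placeholders):
$ gcloud compute disks snapshot my-boot-disk --snapshot-names pre-upgrade-snap --zone us-central1-f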
2. How can I recover my data from the old disk?
Even Better - How to recover your instance...
After several back-and-forth emails, I finally received a response from support that allowed me to resolve the issue. Be mindful, you will have to change things to match your unique VM.
Take a snapshot of the disk first in case we need to roll back any of the changes below.
Edit the properties of the broken instance to disable this option: "Delete boot disk when instance is deleted"
Delete the broken instance.
IMPORTANT: ensure not to select the option to delete the boot disk. Otherwise, the disk will get removed permanently!!
Start up a new temporary instance.
Attach the broken disk (it will appear as /dev/sdb1) to the temporary instance.
When the temporary instance is booted up, do the following:
In the temporary instance:
# Run fsck to fix any disk corruption issues
$ sudo fsck.ext4 -a /dev/sdb1
# Mount the disk from the broken vm
$ sudo mkdir /mnt/sdb
$ sudo mount /dev/sdb1 /mnt/sdb/ -t ext4
# Find out the UUID of the broken disk. In this case, the uuid of sdb1 is d9cae47b-328f-482a-a202-d0ba41926661
$ ls -alt /dev/disk/by-uuid/
lrwxrwxrwx. 1 root root 10 Jan 6 07:43 d9cae47b-328f-482a-a202-d0ba41926661 -> ../../sdb1
lrwxrwxrwx. 1 root root 10 Jan 6 05:39 a8cf6ab7-92fb-42c6-b95f-d437f94aaf98 -> ../../sda1
# Update the UUID in grub.cfg (if necessary)
$ sudo vim /mnt/sdb/boot/grub/grub.cfg
Note: This ^^^ is where I deviated from the support instructions.
Instead of modifying all the boot entries to set root=UUID=[uuid character string], I looked for all the entries that set root=/dev/sda1 and deleted them. I also deleted every entry that didn't set an initrd.img file. The top boot entry with correct parameters in my case ended up being 3.19.0-31-generic. But yours may be different.
# Flush all changes to disk
$ sudo sync
# Shut down the temporary instance
$ sudo shutdown -h now
Finally, detach the HDD from the temporary instance, and create a new instance based off of the fixed disk. It will hopefully boot.
Assuming it does boot, you have a lot of work to do. If you have half as many unused kernels as I did, you might want to purge the unused ones (especially since some are likely missing a corresponding initrd.img file).
I used the second answer (the terminal-based one) in this askubuntu question to purge the other kernels.
Note: Make sure you don't purge the kernel you booted in with!
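The commands look roughly like this (a sketch; the kernel versions shown are from my system):
# Confirm the running kernel so you don't purge it
$ uname -r
3.19.0-31-generic
# List the installed kernel packages
$ dpkg -l 'linux-image-*' | grep ^ii
# Purge an unused kernel and regenerate the grub config
$ sudo apt-get purge linux-image-3.19.0-39-generic linux-image-extra-3.19.0-39-generic
$ sudo update-grub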
How to handle these issues when I host some websites?
I'm not sure how you got into this situation, but it would be nice to have additional information (see my comment above) to be able to understand what triggered this issue.
How to recover my data from the old disk?
Attach and mount the disk
Assuming you did not delete the original disk when you deleted the instance, you can simply mount this disk from another VM to read the data from it. To do this:
attach the disk to another VM instance, e.g.,
gcloud compute instances attach-disk $INSTANCE --disk $DISK
mount the disk:
sudo mkdir -p /mnt/disks/[MNT_DIR]
sudo mount [OPTIONS] /dev/disk/by-id/google-[DISK_NAME] /mnt/disks/[MNT_DIR]
Note: you'll need to substitute appropriate values for:
MNT_DIR: directory
OPTIONS: options appropriate for your disk and filesystem
DISK_NAME: the id of the disk after you attach it to the VM
Unmounting and detaching the disk
When you are done using the disk, reverse the steps:
Note: Before you detach a non-root disk, unmount the disk first. Detaching a mounted disk might result in incomplete I/O operation and data corruption.
unmount the disk
sudo umount /dev/disk/by-id/google-[DISK_NAME]
detach the disk from the VM:
gcloud compute instances detach-disk $INSTANCE --device-name my-new-device
In my case, the first menuentry (3.19.0-51-generic) in grub's /boot/grub/grub.cfg was missing an initrd line and was unable to boot.
Upon further investigation, looking at dpkg for the specific kernel, it is marked as failed and unconfigured:
dpkg -l | grep 3.19.0-51-generic
iF linux-image-3.19.0-51-generic 3.19.0-51.58~14.04.1
iU linux-image-extra-3.19.0-51-generic 3.19.0-51.58~14.04.1
This all stemmed from the Ubuntu image supplied by Google having unattended-upgrades enabled. For some reason the initrd build was killed while it was running, and something else came along and ran update-grub2.
unattended-upgrades-dpkg_2016-03-10_06:49:42.550403.log:update-initramfs: Generating /boot/initrd.img-3.19.0-51-generic
Killed
E: mkinitramfs failure cpio 141 xz -8 --check=crc32 137
unattended-upgrades-dpkg_2016-03-10_06:49:42.550403.log:update-initramfs: failed for /boot/initrd.img-3.19.0-51-generic with 1.
To work around the immediate problem, run:
dpkg --force-confold --configure -a
Although unattended-upgrades in theory is a great idea, having it enabled by default can have unattended consequences.
There are a few cases where the kernel fails to handle an initrd-less boot. Disable the GRUB_FORCE_PARTUUID option so that it boots with an initrd.
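On Ubuntu cloud images this setting typically lives under /etc/default/grub.d/ (a sketch; the exact file name, 40-force-partuuid.cfg here, may differ on your image):
# Comment out the GRUB_FORCE_PARTUUID line, then regenerate the grub config
$ sudo sed -i 's/^GRUB_FORCE_PARTUUID=/#GRUB_FORCE_PARTUUID=/' /etc/default/grub.d/40-force-partuuid.cfg
$ sudo update-grub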

How do I recover a btrfs filesystem that will not mount (but mount returns without error), checks ok, and errors out on restore?

SYNOPSIS
mount -o degraded,ro /dev/disk/by-uuid/ec3 /mnt/ec3/ && echo noerror
noerror
DESCRIPTION
mount -t btrfs fails (nothing gets mounted) but returns success, as above, and only since the last reboot.
btrfs check seems clean to me (I am a simple user).
btrfs restore errors out with "We have looped trying to restore files in"...
I have a lingering artifact: btrfs filesystem show reports " *** Some devices missing " on the volume. This means it will not automount at boot, and I have been mounting it manually (while searching for a resolution to that).
I have previously used rdfind to deduplicate with hard links (as many as 10 per file).
I had just backed up using btrfs send/receive, but have to check whether I have everything - this was the main RAID1 server.
DETAILS
btrfs-find-root /dev/disk/by-uuid/ec3
Superblock thinks the generation is 103093
Superblock thinks the level is 1
Found tree root at 8049335181312 gen 103093 level 1
btrfs restore -Ds /dev/disk/by-uuid/ec3 restore_ec3
We have looped trying to restore files in
df -h /mnt/ec3/
Filesystem Size Used Avail Use% Mounted on
/dev/dm-0 16G 16G 483M 97% /
mount -o degraded,ro /dev/disk/by-uuid/ec3 /mnt/ec3/ && echo noerror
noerror
df /mnt/ec3/
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/dm-0 16775168 15858996 493956 97% /
btrfs filesystem show /dev/disk/by-uuid/ec3
Label: none uuid: ec3
Total devices 3 FS bytes used 1.94TiB
devid 6 size 2.46TiB used 1.98TiB path /dev/mapper/26d2e367-65ea-47ad-b298-d5c495a33efe
devid 7 size 2.46TiB used 1.98TiB path /dev/mapper/3c193018-5956-4637-9ec2-dd5e49a4a412
*** Some devices missing #### comment, this is an old artifact unchanged since before unable to mount
btrfs check /dev/disk/by-uuid/ec3
Checking filesystem on /dev/disk/by-uuid/ec3
UUID: ec3
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 2132966506496 bytes used err is 0
total csum bytes: 2077127248
total tree bytes: 5988204544
total fs tree bytes: 3492638720
total extent tree bytes: 242151424
btree space waste bytes: 984865976
file data blocks allocated: 3685012271104
referenced 3658835013632
btrfs-progs v4.1.2
Update:
After a reboot (I had to wait for a slot to take the system down), the filesystem mounts manually, but not completely cleanly.
Now asking the question on IRC #btrfs:
! http://pastebin.com/359EtZQX
Hi, I'm scratching my head and searched in vain to remove *** Some devices missing. Can anyone help give me a clue to clean this up?
- is there a good way to 'fix' the artifacts I am seeing? trying: scrub, balance (see the sketch below). Try: resize, defragment.
- would I be advised to move to a new clean volume set?
- would a fix via btrfs send/receive be safe from propagating errors?
- or (more painfully) rsync to a clean volume? http://pastebin.com/359EtZQX (my first ever day using IRC)
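For reference, the maintenance commands I am trying look like this (a sketch; /mnt/ec3 is my mount point, and the device delete step is the usual way to clear "Some devices missing" on a degraded RAID1, though it requires a read-write mount):
# Verify checksums on all devices without modifying anything
$ sudo btrfs scrub start -Bd -r /mnt/ec3
# Rebalance block groups (read-write mount required)
$ sudo btrfs balance start /mnt/ec3
# Remove the record of the missing device (read-write mount required)
$ sudo btrfs device delete missing /mnt/ec3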

Why are the tests not running? (Golang) - goapp test - bug?

I'm trying to run GAE tests on multiple packages. My app (testapp) looks as below:
testapp>
README.md package1 package2
Each package has two go files: the package itself and its 'test' package.
package1
$ls package1
package1.go package1_test.go
package2
$ls package2
package2.go package2_test.go
To run the tests I use
goapp test -v ./...
Output:
warning: building out-of-date packages:
github.com/mihai/API
installing these packages with 'go test -i ./...' will speed future tests.
=== RUN TestGetDiskFile
codelistgobfile.gob
codelist.gob written successfully
--- PASS: TestGetDiskFile (0.00 seconds)
PASS
ok testapp/package1 0.010s
However, as you can see above, it seems to run only the first test (TestGetDiskFile) from package1. After that it gets stuck, with no output of any kind. If I go into each package (cd package1) and run goapp test, all the tests (about 20) run successfully.
Any idea how I can fix this and run all the tests without getting stuck, or at least how I can debug it further? Is this a goapp bug?
I've tried on two different machines (Mac OS X and Ubuntu); the result is the same.
To debug, strip things down to a minimal test case. For example, the following is a minimal test case for go test -v ./.... Try something similar for goapp test -v ./....
$ dir
package1 package2
$ tree ../packages
../packages
├── package1
│   └── package1_test.go
└── package2
└── package2_test.go
2 directories, 2 files
$ go test -v ./...
=== RUN TestPackage1
--- PASS: TestPackage1 (0.00 seconds)
PASS
ok packages/package1 0.004s
=== RUN TestPackage2
--- PASS: TestPackage2 (0.00 seconds)
PASS
ok packages/package2 0.004s
$
File: package1_test.go:
package package1
import "testing"
func TestPackage1(t *testing.T) {}
File: package2_test.go:
package package2
import "testing"
func TestPackage2(t *testing.T) {}

How to find which process is leaking file handles in Linux?

The problem incident:
Our production system started denying services with the error message "Too many open files in system". Most services were affected, including the inability to start a new ssh session or even to log in at the virtual console from the physical terminal. Luckily, one root ssh session was open, so we could interact with the system (moral: always keep one root session open!). As a side effect, some services (named, dbus-daemon, rsyslogd, avahi-daemon) saturated the CPU (100% load). The system also serves a large directory via NFS to a very busy client, which was backing up 50000 small files at the moment. Restarting all kinds of services and programs normalized their CPU behavior, but did not solve the "Too many open files in system" problem.
The suspected cause
Most likely, some program is leaking file handles. Probably the culprit is my Tcl program, which also saturated the CPU (not normal). However, killing it did not help, and, most disturbingly, lsof would not reveal large numbers of open files.
Some evidence
We had to reboot, so whatever information was collected is all we have.
root@xeon:~# cat /proc/sys/fs/file-max
205900
root@xeon:~# lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init 1 root cwd DIR 8,6 4096 2 /
init 1 root rtd DIR 8,6 4096 2 /
init 1 root txt REG 8,6 124704 7979050 /sbin/init
init 1 root mem REG 8,6 42580 5357606 /lib/i386-linux-gnu/libnss_files-2.13.so
init 1 root mem REG 8,6 243400 5357572 /lib/i386-linux-gnu/libdbus-1.so.3.5.4
...
A pretty normal list, definitely not 200K files, more like two hundred.
This is probably where the problem started:
less /var/log/syslog
Mar 27 06:54:01 xeon CRON[16084]: (CRON) error (grandchild #16090 failed with exit status 1)
Mar 27 06:54:21 xeon kernel: [8848865.426732] VFS: file-max limit 205900 reached
Mar 27 06:54:29 xeon postfix/master[1435]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Mar 27 06:54:29 xeon kernel: [8848873.611491] VFS: file-max limit 205900 reached
Mar 27 06:54:32 xeon kernel: [8848876.293525] VFS: file-max limit 205900 reached
netstat did not show noticeable anomalies either.
The man pages for ps and top do not indicate an ability to show open file count. Probably the problem will repeat itself after a few months (that was our uptime).
Any ideas on what else can be done to identify the open files?
UPDATE
This question changed its meaning after qehgt identified the likely cause.
Apart from the bug in the NFS v4 code, I suspect there is a design limitation in Linux: kernel-leaked file handles can NOT be identified. Consequently, the original question transforms into:
"Who is responsible for file handles in the Linux kernel?" and "Where do I post that question?". The first answer was helpful, but I am willing to accept a better answer.
Probably the root cause is a bug in the NFSv4 implementation: https://stackoverflow.com/a/5205459/280758
They have similar symptoms.
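For the broader question of identifying open files next time, one thing worth capturing before a reboot is a per-process descriptor count from /proc (a sketch; note it will not show handles leaked inside the kernel itself, which is what an NFS bug like this produces):
# System-wide view: allocated handles, free handles, and the limit
$ cat /proc/sys/fs/file-nr
# Per-process open descriptor counts, largest first
$ for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$p"
  done | sort -rn | head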
