ext4 filesystem: application corrupting the superblock (C)

I found many links, but almost all of them describe the fix, not the cause.
I created a 7 GB ext4 partition on an SD card connected to a PC through a USB card reader. I have an application that writes 10488576 bytes to that partition (/dev/sdc2). After the application runs, the filesystem looks corrupt:
#fsck.ext4 -v /dev/sdc2
e2fsck 1.42.8 (20-Jun-2013)
ext2fs_open2: Bad magic number in super-block
fsck.ext4: Superblock invalid, trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear<y>? no
fsck.ext4: Illegal inode number while checking ext3 journal for /dev/sdc2
/dev/sdc2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdc2: ********** WARNING: Filesystem still has errors **********
#dumpe2fs /dev/sdc2
dumpe2fs 1.42.8 (20-Jun-2013)
dumpe2fs: Bad magic number in super-block while trying to open /dev/sdc2
Couldn't find valid filesystem superblock.
The application does something like the following (I can't post the exact code):
char *write_buf; // declared in header
write_buf = (char *) malloc(size); // size = 10488576; allocated in function a(), which is called from main
char *buf; // declared locally in function b()
buf = write_buf; // in function b()
write(fd,buf,size); // in function b()
The filesystem block size is 4K.
Backup superblock at 32768 , 98304 ,163840 ,229376 , 294912 ,819200, 884736 ,1605632
Let me know if any more information is required. I need to understand what might cause this corruption, because I'm fairly sure that something is wrong with the application code.
EDIT:
I can see that the primary superblock is in block 0, and the lseek() call before write() does a SEEK_SET to offset 0, so the write overwrites the superblock. I am going to try an lseek() to a position well past the superblock before the write().
EDIT:
I fixed this by doing what I described above. The dumpe2fs output for group 0 was:
Group 0: (Blocks 0-32767)
Checksum 0x8bba, unused inodes 8069
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-474
Block bitmap at 475 (+475), Inode bitmap at 491 (+491)
Inode table at 507-1011 (+507)
24175 free blocks, 8069 free inodes, 2 directories, 8069 unused inodes
Free blocks: 8593-32767
Free inodes: 12-8080
So before writing I did an lseek() to 8593*4096, the start of the first free block. Now the filesystem no longer gets corrupted.
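The "Bad magic number" message fits this explanation: in ext2/ext3/ext4 the primary superblock sits 1024 bytes into the partition, and its magic value 0xEF53 is stored 56 bytes into the superblock. The following is a minimal sketch, not the application's code; the helper names range_hits_primary_superblock and buffer_has_ext_magic are made up for illustration. It only checks the primary superblock, not the other metadata (group descriptors, bitmaps, inode tables, backup superblocks).

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_SB_OFFSET 1024u   /* primary superblock starts 1024 bytes into the partition */
#define EXT_SB_SIZE   1024u   /* the superblock structure occupies 1024 bytes */
#define EXT_MAGIC_OFF 56u     /* offset of s_magic inside the superblock */
#define EXT_MAGIC     0xEF53  /* ext2/ext3/ext4 magic number */

/* Hypothetical helper: returns 1 if writing len bytes at offset would
 * overlap the primary superblock, 0 otherwise. */
static int range_hits_primary_superblock(uint64_t offset, uint64_t len)
{
    return len > 0
        && offset < EXT_SB_OFFSET + EXT_SB_SIZE
        && offset + len > EXT_SB_OFFSET;
}

/* Hypothetical helper: start points at the first len bytes of the partition.
 * Returns 1 if the ext magic is present, 0 if it is absent or len is too short. */
static int buffer_has_ext_magic(const unsigned char *start, size_t len)
{
    size_t pos = EXT_SB_OFFSET + EXT_MAGIC_OFF;
    if (len < pos + 2)
        return 0;
    /* s_magic is stored little-endian on disk */
    return (start[pos] | (start[pos + 1] << 8)) == EXT_MAGIC;
}
```

With the numbers from the question, a 10488576-byte write at offset 0 overlaps the superblock, while the same write at 8593*4096 does not.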

In an ext2 filesystem, pulling the superblock and group descriptor is easy, but how do I pull a specific inode when there can be many inodes, each relating to a file?

Pulling the superblock from a filesystem (i.e. if my sda storage is ext2 formatted) is easy. I just need to skip 1024 bytes to get to the superblock on the sda storage:
lseek(fd, 1024, SEEK_SET);
read(fd, &super_block, sizeof(super_block));
Pulling the group descriptor is also easy (if I understood the code correctly):
lseek(fd, 1024 + block_size, SEEK_SET); /* block_size is 1024 here */
read(fd, &block_group, sizeof(block_group));
or
lseek(fd, 1024 + 1024, SEEK_SET);
read(fd, &block_group, sizeof(block_group));
1024 = base offset
But I am not comfortable yet, because the real challenge is pulling an inode when all I have is a file name. I know file names are stored in directory structs, so the first challenge is to extract the directory struct, which gives me the inode number, and from the inode number I can extract the inode struct. But I do not know how to extract a directory struct from an ext2 formatted image. Can anyone please tell me how? Thanks.
Yes, pulling the superblock is just a matter of skipping BASE_OFFSET = 1024 bytes in ext2 and then reading it like so:
lseek(fd, BASE_OFFSET, SEEK_SET);
//BASE_OFFSET for EXT2 == 1024
read(fd, &super_block, sizeof(super_block));
block_size = 1024 << super_block.s_log_block_size;
printf("Block size is [%u]\n", block_size);
The size of a block is given by s_log_block_size. This value expresses the size of a block as a power of 2, using 1024 bytes as the unit. Thus, 0 denotes 1024-byte blocks, 1 denotes 2048-byte blocks, and so on. To calculate the size in bytes of a block:
unsigned int block_size = 1024 << super.s_log_block_size; /* block size in bytes */
super.s_log_block_size is 0 when the block size is 1024, so block_size can be hardcoded to 1024 only for such an image. Each increment of s_log_block_size doubles the block size.
Then I can extract the group descriptor. My image has only one group descriptor. I don't know how many descriptors there would be on 1 TB of storage, which I do have, formatted as ext4. Maybe someone can tell me.
The group descriptor is extracted by moving a further 1024 bytes forward, like this:
lseek(fd, BASE_OFFSET + block_size, SEEK_SET);
read(fd, &block_group, sizeof(block_group));
I think this shows how to find out how many group descriptors there are on ext2 storage:
unsigned int group_count = 1 + (super_block.s_blocks_count-1) / super_block.s_blocks_per_group;
For example, my device image has 128 blocks: the first block always holds boot info, the second block contains the superblock, and the third block contains the first group descriptor. I would still like to know what the offset of the second group descriptor would be if I had more space on my storage. Please, someone, shed light on this.
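As a rough sketch of the arithmetic: the group descriptors form a contiguous array starting in the block after the superblock's block, and a classic ext2 descriptor is 32 bytes, so descriptor n sits n*32 bytes after the first one. The function names below are made up for illustration, and the sketch assumes the classic layout (ext4 with the 64bit feature uses larger descriptors, and features such as meta_bg move the table).

```c
#include <stdint.h>

#define EXT2_GROUP_DESC_SIZE 32u  /* size of a classic ext2 group descriptor */

/* Number of block groups, using the formula quoted above. */
static uint64_t group_count(uint64_t blocks_count, uint64_t blocks_per_group)
{
    return 1 + (blocks_count - 1) / blocks_per_group;
}

/* Byte offset (from the start of the partition) of group descriptor
 * number `group`, counted from 0. first_data_block is the superblock's
 * s_first_data_block: 1 for 1 KiB blocks, 0 for larger blocks. */
static uint64_t group_desc_offset(uint64_t first_data_block,
                                  uint64_t block_size, uint64_t group)
{
    uint64_t table_start = (first_data_block + 1) * block_size;
    return table_start + group * EXT2_GROUP_DESC_SIZE;
}
```

For the 128-block image this gives one group, with the first descriptor at byte 2048 and a second one (if it existed) at byte 2080. Assuming 4 KiB blocks and 32768 blocks per group, which is an assumption and not something stated in the question, a 1 TiB filesystem has 2^28 blocks and therefore 8192 groups.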
Moving on, this is the formula to seek to the offset of a specific inode:
lseek(fd, BLOCK_OFFSET(block_group->bg_inode_table)+(inode_no-1)*sizeof(struct ext2_inode),
SEEK_SET);
bg_inode_table can be used to locate the inode.
The group descriptor tells us the location of the block and inode bitmaps and of the inode table (described later) through the bg_block_bitmap, bg_inode_bitmap and bg_inode_table fields.
Now, to extract the root inode (which should be inode number 2), for example, I just need to do:
lseek(fd, BLOCK_OFFSET(block_group->bg_inode_table)+(2-1)*sizeof(struct ext2_inode),
SEEK_SET);
The block number of the first block of the inode table is stored in the bg_inode_table field of the group descriptor.
So the inode table helps in finding a specific inode.
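The seek formula above covers a single group. A hedged generalisation for several groups might look like the sketch below: inode numbers start at 1, so (ino - 1) is split into a group number and an index inside that group's inode table. The function names and parameter order are made up for illustration; inode_table_block stands for the bg_inode_table of the group returned by inode_group, and inode_size would be sizeof(struct ext2_inode), typically 128.

```c
#include <stdint.h>

/* Which block group holds inode number ino (inode numbers start at 1). */
static uint32_t inode_group(uint32_t ino, uint32_t inodes_per_group)
{
    return (ino - 1) / inodes_per_group;
}

/* Byte offset of inode ino from the start of the partition.
 * inode_table_block must be bg_inode_table of the group that
 * inode_group() returned for this inode. */
static uint64_t inode_offset(uint32_t ino, uint32_t inodes_per_group,
                             uint64_t inode_table_block,
                             uint32_t block_size, uint32_t inode_size)
{
    uint32_t index = (ino - 1) % inodes_per_group;  /* slot inside the table */
    return inode_table_block * block_size + (uint64_t)index * inode_size;
}
```

For 1024-byte blocks, inode_table_block * block_size equals BLOCK_OFFSET(bg_inode_table) as used above, since BASE_OFFSET + (b - 1) * 1024 is b * 1024.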
To extract the directory struct, I just need to use the inode.i_block[] array that was filled in by the last step.
Each i_block element is a block number: basically a pointer to an actual block holding the content of the file that the inode describes. It can be used this way:
lseek(...BASE_OFFSET+(i_block[x]-1)*block_size...)
block_size is 1024 for my image. ext2 also allows 2048 and 4096, in which case the offset is simply i_block[x]*block_size.
This way I can read the block that contains the directory struct in the ext2 filesystem:
void *block = malloc(block_size);
read(fd, block, block_size);
The above gives me the first directory entry mapped to the specific inode.
I can simply loop to get all the entries.
http://www.science.smith.edu/~nhowe/teaching/csc262/oldlabs/ext2.html
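The loop over the entries could be sketched as follows. Each ext2 directory entry starts with a 4-byte inode number, a 2-byte record length, a 1-byte name length and a 1-byte file type, followed by the name (not NUL-terminated); rec_len gives the distance to the next entry. The function name dir_block_lookup is made up for illustration, and the fields are decoded byte by byte so the sketch does not depend on host endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode little-endian on-disk integers. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: scan one directory block for name and return its
 * inode number, or 0 if the name is not in this block. */
static uint32_t dir_block_lookup(const unsigned char *block, size_t block_size,
                                 const char *name)
{
    size_t want = strlen(name);
    size_t pos = 0;

    while (pos + 8 <= block_size) {              /* 8 = fixed entry header */
        const unsigned char *e = block + pos;
        uint32_t ino = le32(e);
        uint16_t rec_len = le16(e + 4);
        uint8_t name_len = e[6];

        if (rec_len < 8 || pos + rec_len > block_size)
            break;                               /* malformed entry: stop */
        if (ino != 0 && name_len == want && 8u + name_len <= rec_len &&
            memcmp(e + 8, name, want) == 0)
            return ino;                          /* inode 0 marks an unused entry */
        pos += rec_len;                          /* jump to the next entry */
    }
    return 0;
}
```

The returned inode number can then be fed back into the inode-table seek shown earlier to read the file's inode.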

How to write data into an offset which is not 512*n bytes using linux native AIO?

I'm writing an app, similar to a BitTorrent client, that downloads a file from the net and writes it to a local file. The data arrives in pieces, and I write each piece to the file.
For example, while downloading a 1 GB file I might get offset 100 with 312 bytes of data, offset 1000000 with 12345 bytes, and offset 4000000 with 888 bytes.
I'm using Linux native AIO (io_setup, io_submit, io_getevents), and I found this:
When using linux kernel AIO, files are required to be opened in O_DIRECT mode. This introduces further requirements of all read and write operations to have their file offset, memory buffer and size be aligned to 512 bytes.
So how can I write data into some offset which is not 512 aligned?
For example, first I write 4 bytes to a file, so I have to do something like this:
fd = open("a.txt", O_CREAT | O_RDWR | O_DIRECT, 0666);
struct iocb cb;
char data[512] = "asdf";
cb.aio_buf = ALIGN(data, 512);
cb.aio_offset = 512;
cb.aio_nbytes = 512;
Then I would like to append data after asdf:
struct iocb cb2;
char data2[512] = "ghij";
cb2.aio_buf = ALIGN(data2, 512);
cb2.aio_offset = 5;
cb2.aio_nbytes = 512;
The write fails with:
Invalid argument (-22)
So how to do it?
You have to do what the driver would do if you weren't using O_DIRECT. That is, read the whole block, update the portion you want, and write it back. Block devices simply don't allow smaller accesses.
Doing it yourself can be more efficient (for example, you can update a number of disconnected sequences in the same block for the cost of one read and write). However, since you aren't letting the driver do the work you also aren't getting any atomicity guarantees across the read-modify-write operation.
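The bookkeeping for such a read-modify-write can be sketched as below. Given an arbitrary offset and length, align_span computes the enclosing 512-byte-aligned range, and patch_span copies the new data into a buffer that already holds that range's current contents. The names are made up for illustration, and 512 is taken from the quote in the question; the real alignment requirement depends on the device.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR 512u  /* alignment from the question; device-dependent in practice */

struct aligned_span {
    uint64_t start;   /* aligned file offset to read and write back */
    uint64_t length;  /* aligned byte count, a multiple of SECTOR */
    uint64_t skip;    /* where the new data begins inside the span */
};

/* Smallest sector-aligned range covering [offset, offset + nbytes). */
static struct aligned_span align_span(uint64_t offset, uint64_t nbytes)
{
    struct aligned_span s;
    uint64_t end = offset + nbytes;

    s.start = offset - offset % SECTOR;
    s.skip = offset - s.start;
    s.length = (end - s.start + SECTOR - 1) / SECTOR * SECTOR;
    return s;
}

/* The "modify" step: sectors must already contain the span read from the file. */
static void patch_span(unsigned char *sectors, struct aligned_span s,
                       const void *data, size_t nbytes)
{
    memcpy(sectors + s.skip, data, nbytes);
}
```

A caller would read s.length bytes at s.start into a suitably aligned buffer, call patch_span, and submit a write of the same range. As noted above, nothing makes that sequence atomic.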
You don't. The Linux AIO API is not useful, especially not for what you're trying to do. It was added for the sake of Oracle, who wanted to bypass the kernel's filesystems and block device buffer layer for Reasons™. It does not have anything to do with POSIX AIO or other things reasonable people mean when they talk about "AIO".

Ext3 Block Group Descriptor

I am having a problem understanding how to find the Block Group Descriptor table. The literature (D. Poirier: "The 2nd extended filesystem") states that the block group descriptor table is located in the block right after the superblock.
Now, when I look at the first disk, with a block size of 1024 bytes, the structure is like this:
MBR, 0-512 bytes
Superblock, 1536-2560 bytes
BG Descriptor, 2560 - ... bytes
And this structure is fine, because the superblock starts at the 3rd sector and the BGD follows right after. However, when I look at the second disk, with a block size of 4096 bytes, the structure is like this:
MBR, 0-512 bytes
Superblock, 1536-2560 bytes
BG Descriptor, 4608 - ... bytes
In this case, the BGD is located 3072(?) bytes away from the superblock. Could someone enlighten me and tell me exactly how the BGD position is determined? I'm writing a program that reads and analyses the ext structure, and I can't write a generic program unless I know how to find the BGD.
The BGD start offset varies depending on the block size (1k, 2k, 4k).
In a partition, the first 1024 bytes are reserved, followed by 1024 bytes of superblock. The BGD starts in the first block after the one holding the superblock, so depending on the block size:
BLK=1K, the BGD starts at partition offset 2048 (1024 reserved + 1024 super block).
BLK=2K, the BGD starts at partition offset 2048 (1024 reserved + 1024 super block).
BLK=4K, the BGD starts at partition offset 4096, which is 1 block from start, that is the result you see 3072 bytes apart from the super block.
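The three cases above reduce to one rule: find the block that contains byte 1024 (the superblock), and the BGD table starts at the beginning of the next block. A minimal sketch, with a made-up function name and offsets relative to the start of the partition:

```c
#include <stdint.h>

#define SUPERBLOCK_OFFSET 1024u  /* superblock starts 1024 bytes into the partition */

/* Byte offset of the block group descriptor table for a given block size. */
static uint64_t bgd_table_offset(uint32_t block_size)
{
    /* Block holding the superblock: 1 for 1 KiB blocks, 0 for larger ones. */
    uint64_t superblock_block = SUPERBLOCK_OFFSET / block_size;
    return (superblock_block + 1) * block_size;
}
```

On the disks in the question the partition appears to start at byte 512, so the on-disk positions are 512 + 2048 = 2560 and 512 + 4096 = 4608, matching the observed layout.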

Sequential access to hugepages in kernel driver

I'm working on a driver that uses a buffer backed by hugepages, and I'm running into problems with the ordering of the hugepages.
In userspace, the program allocates a big buffer backed by hugepages using the mmap syscall. The buffer is then communicated to the driver through an ioctl call. The driver uses the get_user_pages function to get the memory address of that buffer.
This works perfectly with a buffer size of 1 GB (1 hugepage). get_user_pages returns a lot of pages (HUGE_PAGE_SIZE / PAGE_SIZE) but they're all contiguous, so there's no problem. I just grab the address of the first page with page_address and work with that. The driver can also map that buffer back to userspace with remap_pfn_range when another program does an mmap call on the char device.
However, things get complicated when the buffer is backed by more than one hugepage. It seems that the kernel can return a buffer backed by non-sequential hugepages. I.e., if the hugepage pool's layout is something like this
+------+------+------+------+
| HP 1 | HP 2 | HP 3 | HP 4 |
+------+------+------+------+
then a request for a hugepage-backed buffer could be fulfilled by reserving HP1 and HP4, or maybe HP3 and then HP2. That means that when I get the pages with get_user_pages in the last case, the address of page 0 is actually 1 GB after the address of page 262144 (the next hugepage's head).
Is there any way to make access to those pages sequential? I tried reordering the addresses to find the lowest one so I can use the whole buffer (e.g., if the kernel gives me a buffer backed by HP3, HP2, I use the address of HP2 as the base address), but it seems that would scramble the data in userspace (offset 0 in that reordered buffer is maybe offset 1 GB in the userspace buffer).
TL;DR: Given >1 unordered hugepages, is there any way to access them sequentially in a Linux kernel driver?
By the way, I'm working on a Linux machine with 3.8.0-29-generic kernel.
Using the function suggested by CL, vm_map_ram, I was able to remap the memory so it can be accessed sequentially, independently of the number of hugepages mapped. I leave the code here (error control not included) in case it helps anyone.
struct page** pages;
int retval;
int nid; // NUMA node of the first page
unsigned long npages;
unsigned long buffer_start = (unsigned long) huge->addr; // Address from user-space map.
void* remapped;
npages = 1 + ((bufsize- 1) / PAGE_SIZE);
pages = vmalloc(npages * sizeof(struct page *));
down_read(&current->mm->mmap_sem);
retval = get_user_pages(current, current->mm, buffer_start, npages,
1 /* Write enable */, 0 /* Force */, pages, NULL);
up_read(&current->mm->mmap_sem);
nid = page_to_nid(pages[0]); // Remap on the same NUMA node.
remapped = vm_map_ram(pages, npages, nid, PAGE_KERNEL);
// Do work on remapped.

Why do inode numbers start from 1 and not 0?

The C language convention counts array indices from 0. Why do inode numbers start from 1 and not 0?
If inode 0 is reserved for some special use, then what is the significance of inode 0?
0 is used as a sentinel value to indicate null or no inode, similar to how pointers can be NULL in C. Without a sentinel, you'd need an extra bit to test if an inode in a struct was set or not.
more info here:
All block and inode addresses start at 1. The first block on the disk is block 1. 0 is used to indicate no block. (Sparse files can have these inside them)
http://uranus.chrysocome.net/explore2fs/es2fs.htm
for instance, in old filesystems where directories were represented as a fixed array of file entries, deleting a file would result in setting that entry's inode val to 0. when traversing the directory, any entry with an inode of 0 would be ignored.
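A toy illustration of that convention is sketched below. The struct and function names are made up, and the 2-byte inode plus 14-byte name layout is only modelled on classic fixed-size Unix directory entries, not taken from any particular filesystem's headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OLD_NAME_LEN 14

/* Fixed-size directory slot: ino == 0 means the slot is empty. */
struct old_dirent {
    uint16_t ino;
    char name[OLD_NAME_LEN];
};

/* "Delete" a file by zeroing the inode field of its slot. */
static void dir_remove(struct old_dirent *entries, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (entries[i].ino != 0 &&
            strncmp(entries[i].name, name, OLD_NAME_LEN) == 0)
            entries[i].ino = 0;
    }
}

/* Traverse the directory, ignoring slots whose inode is 0. */
static size_t dir_count_live(const struct old_dirent *entries, size_t count)
{
    size_t live = 0;
    for (size_t i = 0; i < count; i++) {
        if (entries[i].ino != 0)
            live++;
    }
    return live;
}
```

Because 0 never names a real inode, no separate "in use" flag is needed in each slot.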
Usually, inode 0 is reserved because a return value of 0 usually signals an error. Multiple functions in the Linux kernel, especially in the VFS layer shared by all file systems, return an ino_t, e.g. find_inode_number.
There are more reserved inode numbers. For example in ext2:
#define EXT2_BAD_INO 1 /* Bad blocks inode */
#define EXT2_ROOT_INO 2 /* Root inode */
#define EXT2_BOOT_LOADER_INO 5 /* Boot loader inode */
#define EXT2_UNDEL_DIR_INO 6 /* Undelete directory inode */
and ext3 has:
#define EXT3_BAD_INO 1 /* Bad blocks inode */
#define EXT3_ROOT_INO 2 /* Root inode */
#define EXT3_BOOT_LOADER_INO 5 /* Boot loader inode */
#define EXT3_UNDEL_DIR_INO 6 /* Undelete directory inode */
#define EXT3_RESIZE_INO 7 /* Reserved group descriptors inode */
#define EXT3_JOURNAL_INO 8 /* Journal inode */
and ext4 has:
#define EXT4_BAD_INO 1 /* Bad blocks inode */
#define EXT4_ROOT_INO 2 /* Root inode */
#define EXT4_USR_QUOTA_INO 3 /* User quota inode */
#define EXT4_GRP_QUOTA_INO 4 /* Group quota inode */
#define EXT4_BOOT_LOADER_INO 5 /* Boot loader inode */
#define EXT4_UNDEL_DIR_INO 6 /* Undelete directory inode */
#define EXT4_RESIZE_INO 7 /* Reserved group descriptors inode */
#define EXT4_JOURNAL_INO 8 /* Journal inode */
Other filesystems use ino 1 as the root inode number. In general, a file system is free to choose its inode numbers and its reserved ino values (with the exception of 0).
OSX specifies that inode 0 signifies a deleted file whose directory entry has not yet been removed; this may have also been used in other filesystems, as OSX is BSD-derived, although at least NetBSD seems to have now removed this usage.
See the OSX manpage for getdirentries http://developer.apple.com/library/ios/#documentation/System/Conceptual/ManPages_iPhoneOS/man2/getdirentries.2.html
When I wrote a filesystem ages ago, I used inode 0 for the .badblocks pseudo-file.
On some filesystems .badblocks is actually present in the root directory as a regular file owned by root and mode 0. root can open it but reading or writing it is undefined.
There is some ancient tradition that inodes start from 1, #1 is .badblocks, and #2 is the root directory. Even though .badblocks is not particularly well-guaranteed, many filesystems go out of their way to make root #2.
