File holes are empty regions in a file that take up no disk space and read back as null bytes. As a result, the file's apparent size is larger than the space it actually occupies on disk.
However, I don't know how to create a file with file holes for experimenting with.
Use the dd command with a seek parameter.
dd if=/dev/urandom bs=4096 count=2 of=file_with_holes
dd if=/dev/urandom bs=4096 seek=7 count=2 of=file_with_holes
That creates for you a file with a nice hole from byte 8192 to byte 28671.
Here's an example, demonstrating that indeed the file has holes in it (the ls -s command tells you how many disk blocks are being used by a file):
$ dd if=/dev/urandom bs=4096 count=2 of=fwh # fwh = file with holes
2+0 records in
2+0 records out
8192 bytes (8.2 kB) copied, 0.00195565 s, 4.2 MB/s
$ dd if=/dev/urandom seek=7 bs=4096 count=2 of=fwh
2+0 records in
2+0 records out
8192 bytes (8.2 kB) copied, 0.00152742 s, 5.4 MB/s
$ dd if=/dev/zero bs=4096 count=9 of=fwnh # fwnh = file with no holes
9+0 records in
9+0 records out
36864 bytes (37 kB) copied, 0.000510568 s, 72.2 MB/s
$ ls -ls fw*
16 -rw-rw-r-- 1 hopper hopper 36864 Mar 15 10:25 fwh
36 -rw-rw-r-- 1 hopper hopper 36864 Mar 15 10:29 fwnh
As you can see, the file with holes takes up fewer disk blocks, despite being the same size.
If you want a program that does it, here it is:
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <fcntl.h>
int main(int argc, const char *argv[])
{
char random_garbage[8192]; /* Don't even bother to initialize */
int fd = -1;
if (argc < 2) {
fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0666);
if (fd < 0) {
perror("Can't open file"); /* perror appends ": " and the error text itself */
return 2;
}
write(fd, random_garbage, 8192);
lseek(fd, 5 * 4096, SEEK_CUR);
write(fd, random_garbage, 8192);
close(fd);
return 0;
}
The above should work on any Unix. Someone else replied with a nice alternative method that is very Linux specific. I highlight it here because it's a method distinct from the two I gave, and can be used to put holes in existing files.
Create a file.
Seek to position N.
Write some data.
There will be a hole at the start of the file (up to, and excluding, position N). You can similarly create files with holes in the middle.
The following document has some sample C code (search for "Sparse files"): http://www.win.tue.nl/~aeb/linux/lk/lk-6.html
Aside from creating files with holes, since about two months ago (mid-January 2011) you can also punch holes in existing files on Linux, using fallocate(2) with the FALLOC_FL_PUNCH_HOLE flag. See the LWN article, the git commit in Linus' tree, and the patch to the Linux man pages.
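As a rough sketch of what punching a hole can look like in C: the helper name punch_hole is made up for illustration, the code is Linux-only, and it assumes both the kernel and the file system support FALLOC_FL_PUNCH_HOLE (otherwise the call fails, typically with EOPNOTSUPP).

```c
#define _GNU_SOURCE          /* needed for fallocate() in glibc */
#include <fcntl.h>
#include <linux/falloc.h>    /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>

/* Hypothetical helper: deallocate `len` bytes starting at `offset` in an
 * already-open, writable file. On success the range reads back as zeros
 * and the file size is unchanged. Returns 0 on success, -1 on error
 * (errno is set by fallocate). */
int punch_hole(int fd, off_t offset, off_t len)
{
    /* FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

Calling punch_hole(fd, 0, 4096) on a file that holds real data in its first block should leave the size alone while ls -s reports fewer blocks, provided the range covers whole file system blocks.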
The problem is discussed in detail in section 3.6 of W. Richard Stevens's famous book "Advanced Programming in the UNIX Environment" (APUE for short). The example there uses the lseek function, declared in unistd.h, which explicitly sets an open file's offset. The prototype of lseek is as follows:
off_t lseek(int filedes, off_t offset, int whence);
Here, filedes is the file descriptor, offset is the value to apply, and whence is one of three constants defined in the header file: SEEK_SET, meaning that the file's offset is set to offset bytes from the beginning of the file; SEEK_CUR, meaning that the file's offset is set to its current value plus the offset argument; and SEEK_END, meaning that the file's offset is set to the size of the file plus the offset argument.
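To make the three whence values concrete, here is a small sketch. The two helper names are invented for this illustration; only lseek itself comes from the text above.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report the current offset without moving it.
 * SEEK_CUR with an offset of 0 means "current position plus nothing". */
off_t current_offset(int fd)
{
    return lseek(fd, 0, SEEK_CUR);
}

/* Hypothetical helper: report the file size by seeking to the end
 * (SEEK_END with offset 0), then put the offset back where it was
 * (SEEK_SET with the saved value). Returns -1 on error. */
off_t file_size_via_lseek(int fd)
{
    off_t saved = lseek(fd, 0, SEEK_CUR);
    if (saved == (off_t)-1)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END);
    if (end == (off_t)-1)
        return -1;

    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)
        return -1;

    return end;
}
```

Both helpers fail on descriptors that cannot seek, such as pipes, in which case lseek returns -1 and sets errno to ESPIPE.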
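The example to create a file with holes in C under UNIX-like OSs is as follows:
/* Creating a file with a hole of size 810 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
/* Minimal stand-ins for FILE_MODE and err_sys, which the book gets from its own apue.h */
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
static void err_sys(const char *msg)
{
perror(msg);
exit(1);
}
/* Two strings to write to the file */
char buf1[] = "abcde";
char buf2[] = "ABCDE";
int main(void)
{
int fd; /* file descriptor */
if ((fd = creat("file_with_hole", FILE_MODE)) < 0)
err_sys("creat error");
if (write(fd, buf1, 5) != 5)
err_sys("buf1 write error");
/* offset now 5 */
if (lseek(fd, 815, SEEK_SET) == -1)
err_sys("lseek error");
/* offset now 815 */
if (write(fd, buf2, 5) != 5)
err_sys("buf2 write error");
/* offset now 820 */
return 0;
}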
In the code above, err_sys is the function that handles a fatal error from a system call: it prints a message and terminates the program.
A hole is created when data is written at an offset beyond the current file size, or when the file is truncated to a size larger than its current size.
I am using the following program to find out the size of a file and allocate memory dynamically. This program has to work on multiple platforms.
But when I run the program on a Linux machine and on a Windows machine using Cygwin, I see different outputs. Why?
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
/*
Observation on Linux
When reading text file remember
the content in the text file if arranged in lines like below:
ABCD
EFGH
the size of file is 12, its because each line is ended by \r\n, so add 2 bytes for every line we read.
*/
off_t fsize(char *file) {
struct stat filestat;
if (stat(file, &filestat) == 0) {
return filestat.st_size;
}
return 0;
}
void ReadInfoFromFile(char *path)
{
FILE *fp;
unsigned int size;
char *buffer = NULL;
unsigned int start;
unsigned int buff_size =0;
char ch;
int noc =0;
fp = fopen(path,"r");
start = ftell(fp);
fseek(fp,0,SEEK_END);
size = ftell(fp);
rewind(fp);
printf("file size = %u\n", size);
buffer = (char*) malloc(sizeof(char) * (size + 1) );
if(!buffer) {
printf("malloc failed for buffer \n");
return;
}
buff_size = fread(buffer,sizeof(char),size,fp);
printf(" buff_size = %u\n", buff_size);
if(buff_size == size)
printf("%s \n", buffer);
else
printf("problem in file size \n %s \n", buffer);
fclose(fp);
}
int main(int argc, char *argv[])
{
printf(" using ftell etc..\n");
ReadInfoFromFile(argv[1]);
printf(" using stat\n");
printf("File size = %u\n", fsize(argv[1]));
return 0;
}
The problem is that fread returns a different count depending on the platform.
I have not tried a native Windows compiler yet.
But what would be the portable way to read the contents of a file?
Output on Linux:
using ftell etc..
file size = 34
buff_size = 34
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
using stat
File size = 34
Output on Cygwin:
using ftell etc..
file size = 34
buff_size = 30
problem in file size
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
_ROAMINGPRã9œw
using stat
File size = 34
Transferring comments into an answer.
The trouble is probably that on Windows, the text file has CRLF line endings ("\r\n"). The input processing maps those to "\n" to match Unix because you use "r" in the open mode (open text file for reading) instead of "rb" (open binary file for reading). This leads to a difference in the byte counts — ftell() reports the bytes including the '\r' characters, but fread() doesn't count them.
But how can I allocate memory if I don't know the actual size? Even in this case, the return value of fread is 30 or 34, but my content is only 26 bytes.
Define your content — there's a newline or CRLF at the end of each of 4 lines. When the file is opened on Windows (Cygwin) in text mode (no b), then you will receive 3 lines of 9 bytes (8 letters and a newline) plus one line with 3 bytes (2 letters and a newline), for 30 bytes in total. Compared to the 34 that's reported by ftell() or stat(), the difference is the 4 CR characters ('\r') that are not returned. If you opened the file as a binary file ("rb"), then you'd get all 34 characters — 3 lines with 10 bytes and 1 line with 4 bytes.
The good news is that the size reported by stat() or ftell() is bigger than the final number of bytes returned, so allocating enough space is not too hard. It might become wasteful if you have a gigabyte size file with every line containing 1 byte of data and a CRLF. Then you'd "waste" (not use) one third of the allocated space. You could always shrink the allocation to the required size with realloc().
Note that there is no difference between text and binary mode on Unix-like (POSIX) systems such as Linux. It does not do mapping of CRLF to NL line endings. If the file is copied from Windows to Linux without mapping the line endings, you will get CRLF at the end of each line on Linux. If the file is copied and the line endings are mapped, you'll get a smaller size on Linux than under Cygwin. (Using "rb" on Linux does no harm; it doesn't do any good either. Using "rb" on Windows/Cygwin could be important; it depends on the behaviour you want.)
See also the C11 standard §7.21.2 Streams and also §7.21.3 Files.
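One way to sidestep the mismatch is sketched below. It assumes you are happy to receive the raw bytes (CRLF included) and therefore opens the file in binary mode; the function name read_whole_file is made up for this example. The important part is that the buffer is terminated using the count fread actually returned, so the same code stays safe if you switch the mode back to "r".

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a NUL-terminated buffer.
 * Binary mode keeps the byte count equal to what ftell() reports.
 * Stores the number of bytes read in *out_len. Returns a malloc'd
 * buffer that the caller must free, or NULL on error. */
char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    /* One extra byte for the terminator. */
    char *buffer = malloc((size_t)size + 1);
    if (buffer == NULL) { fclose(fp); return NULL; }

    /* Trust the count fread() returns, not the size computed above. */
    size_t got = fread(buffer, 1, (size_t)size, fp);
    buffer[got] = '\0';
    fclose(fp);

    *out_len = got;
    return buffer;
}
```

Terminating at buffer[got] is also what prevents stray bytes after the text, like the ones in the Cygwin output of the question, from being printed.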
I am confused about lseek()'s return value (which is the new file offset).
I have a text file named prwtest. It contains the letters a to z.
The code I wrote is the following:
1 #include <unistd.h>
2 #include <fcntl.h>
3 #include <stdlib.h>
4 #include <stdio.h>
5 #include <string.h>
6
7 #define BUF 50
8
9 int main(void)
10 {
11 char buf1[]="abcdefghijklmnopqrstuvwxyz";
12 char buf2[BUF];
13 int fd;
14 int read_cnt;
15 off_t cur_offset;
16
17 fd=openat(AT_FDCWD, "prwtest", O_CREAT | O_RDWR | O_APPEND);
18 cur_offset=lseek(fd, 0, SEEK_CUR);
19 //pwrite(fd, buf1, strlen(buf1), 0);
20 //write(fd, buf1, strlen(buf1));
21 //cur_offset=lseek(fd, 0, SEEK_END);
22
23 printf("current offset of file prwtest: %d \n", cur_offset);
24
25 exit(0);
26 }
On line 17, I use the O_APPEND flag, so I expected prwtest's current file offset to be taken from the i-node's current file size (which is 26).
On line 18, I call lseek() with SEEK_CUR and an offset of 0.
But the resulting cur_offset is 0. (I assumed it would be 26, because SEEK_CUR refers to the current file offset.)
However, SEEK_END gives me what I expected: cur_offset is 26.
Why does lseek(fd, 0, SEEK_CUR); return 0 and not 26?
O_APPEND takes effect before each write to the file, not when the file is opened.
Therefore, right after the open the offset remains 0, but once you call write, lseek with SEEK_CUR will return the value you expect.
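This is easy to check for yourself. The sketch below uses a hypothetical helper, append_offsets, that records the offset right after open() and again after a single append.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open an existing file with O_APPEND, record the
 * offset immediately after open() and again after appending `len`
 * bytes. Returns 0 on success, -1 on error. */
int append_offsets(const char *path, const void *data, size_t len,
                   off_t *after_open, off_t *after_write)
{
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;

    /* O_APPEND has not repositioned anything yet. */
    *after_open = lseek(fd, 0, SEEK_CUR);

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* The write went to the end of the file, and the offset followed it. */
    *after_write = lseek(fd, 0, SEEK_CUR);

    close(fd);
    return 0;
}
```

For a 26-byte file and a 1-byte append, after_open comes back as 0 and after_write as 27.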
Your issue is with open() / openat(), not lseek().
From the open() manpage, emphasis mine:
O_APPEND
The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2).
Since you don't write to the file, the offset is never repositioned to the end of the file.
While we're at it, you should be closing the file before ending the program...
Actually, while we're really at it, if you do #include <stdio.h> already, why not use the standard's file I/O (fopen() / fseek() / fwrite()) instead of the POSIX-specific stuff? ;-)
Also, on Linux, your commented-out code won't work as you expect. This code:
17 fd=openat(AT_FDCWD, "prwtest", O_CREAT | O_RDWR | O_APPEND);
18 cur_offset=lseek(fd, 0, SEEK_CUR);
19 pwrite(fd, buf1, strlen(buf1), 0);
will fail to write the contents of buf1 at the beginning of the file (unless the file is empty).
pwrite on Linux is buggy:
BUGS
POSIX requires that opening a file with the O_APPEND flag should
have no effect on the location at which pwrite() writes data.
However, on Linux, if a file is opened with O_APPEND, pwrite()
appends data to the end of the file, regardless of the value of
offset.
If you run dd with this:
dd if=/dev/zero of=sparsefile bs=1 count=0 seek=1048576
You appear to get a completely unallocated sparse file (this is ext4):
smark@we:/sp$ ls -ls sparsefile
0 -rw-rw-r-- 1 smark smark 1048576 Nov 24 16:19 sparsefile
fibmap agrees:
smark@we:/sp$ sudo hdparm --fibmap sparsefile
sparsefile:
filesystem blocksize 4096, begins at LBA 2048; assuming 512 byte sectors.
byte_offset begin_LBA end_LBA sectors
Without having to dig through the source of dd, I'm trying to figure out how to do that in C.
I tried fseeking and fwriting zero bytes, but it did nothing.
Not sure what else to try, I figured somebody might know before I hunt down dd's innards.
EDIT: including my example...
FILE *f = fopen("/sp/sparse2", "wb");
fseek(f, 1048576, SEEK_CUR);
fwrite("x", 1, 0, f);
fclose(f);
When you write to a file using write or various library routines that ultimately call write, there's a file offset pointer associated with the file descriptor that determines where in the file the bytes will go. It's normally positioned at the end of the data that was processed by the most recent call to read or write. But you can use lseek to position the pointer anywhere within the file, and even beyond the current end of the file. When you write data at a point beyond the current EOF, the area that was skipped is conceptually filled with zeroes. Many systems will optimize things so that any whole filesystem blocks in that skipped area simply aren't allocated, producing a sparse file. Attempts to read such blocks will succeed, returning zeroes.
Writing block-sized areas full of zeroes to a file generally won't produce a sparse file, although it's possible for some filesystems to do this.
Another way to produce a sparse file, used by GNU dd, is to call ftruncate. The documentation says this:
The ftruncate() function causes the regular file referenced by fildes to have a size of length bytes.
If the file previously was larger than length, the extra data is discarded. If it was previously shorter than length, it is unspecified whether the file is changed or its size increased. If the file is extended, the extended area appears as if it were zero-filled.
Support for sparse files is filesystem-specific, although virtually all designed-for-UNIX local filesystems support them.
This is complementary to the answer by @MarkPlotnick: a simple sample implementation of the feature you requested, using ftruncate():
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
int
main(void)
{
int file;
int mode;
mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
file = open("sparsefile", O_WRONLY | O_CREAT, mode);
if (file == -1)
return 1;
/* Extend the file to 1 MiB without writing any data */
if (ftruncate(file, 0x100000) == -1) {
close(file);
return 1;
}
close(file);
return 0;
}
How do I read/write a block device? I heard that I can read/write it like a normal file, so I set up a loop device by doing
sudo losetup /dev/loop4 ~/file
Then I ran the app on the file and then on the loop device
sudo ./a.out file
sudo ./a.out /dev/loop4
The run on the file worked perfectly. The run on the loop device read 0 bytes. In both cases I got FP==3 and off==0. With the file, the program correctly gets the string length and prints the string, while with the loop device I get 0 and nothing is printed.
How do I read/write to a block device?
#include <fcntl.h>
#include <cstdio>
#include <unistd.h>
int main(int argc, char *argv[]) {
char str[1000];
if(argc<2){
printf("Error args\n");
return 0;
}
int fp = open(argv[1], O_RDONLY);
printf("FP=%d\n", fp);
if(fp<=0) {
perror("Error opening file");
return(-1);
}
off_t off = lseek(fp, 0, SEEK_SET);
ssize_t len = read(fp, str, sizeof str);
str[len]=0;
printf("%d, %d=%s\n", len, static_cast<int>(off), str);
close(fp);
}
losetup seems to map the file in 512-byte sectors. If the file size is not a multiple of 512, the remainder is cut off.
When mapping a file to /dev/loopX with losetup,
for a file that is smaller than 512 bytes it gives the following warning:
Warning: file is smaller than 512 bytes;
the loop device may be useless or invisible for system tools.
For a file whose size is not a multiple of 512:
Warning: file does not fit into a 512-byte sector;
the end of the file will be ignored
This warning was added in util-linux version 2.22.
You cannot simply pad the file with zeros or random values to reach 512-byte alignment, because you would no longer know where the real content ends. Instead, use the first few bytes to store the file size, followed by the file content. That way you know where the content ends, and you can then append arbitrary data as padding to reach 512-byte alignment.
e.g. File structure:
[File Size] [Data][<padding to get 512 alignment>]
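The arithmetic for that layout could be sketched like this. The 8-byte header width and both function names are assumptions made for the example; the answer itself only says "the first few bytes".

```c
#include <stdint.h>

#define SECTOR_SIZE 512u   /* sector size named in the losetup warnings */
#define HEADER_SIZE 8u     /* assumed width of the [File Size] field */

/* Hypothetical helper: total size of header + data + padding,
 * rounded up to a whole number of sectors. */
uint64_t padded_size(uint64_t data_len)
{
    uint64_t raw = HEADER_SIZE + data_len;
    return (raw + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/* Hypothetical helper: number of padding bytes that follow the data. */
uint64_t padding_len(uint64_t data_len)
{
    return padded_size(data_len) - HEADER_SIZE - data_len;
}
```

For 100 bytes of data this gives a 512-byte container with 404 bytes of padding; a reader takes the length from the header and ignores everything after the data.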
I have a file whose length I wanted to get using the stat() function, as in the code below:
FILE *file = fopen(filename, "r");
int filesize, i;
if(file==NULL)
{
printf("Could not open mea.dat!\n");
return ;
}
struct stat st;
stat(filename, &st);
filesize = st.st_size;
.........
But when I checked filesize, I got the value 1504, even though by counting the characters myself, the length of the file content is 101, so filesize should have been about 102. Where am I going wrong?
where am i missing it?
When the size returned by stat() and the size you got by counting numerically (whatever that means) differ, the chances are that your counting is wrong.
You need to check the return value of stat() before deciding to trust the value in the struct:
if (stat(filename, &st)) exit(EXIT_FAILURE);
filesize = st.st_size;
Is the file in question a sparse file?
With sparse files, some blocks of data are not actually present on disk, yet ls -l still reports the full size. Here is an example sparse file:
ls -ls sparse
2 -rw-r--r-- 1 root sys 1048577 Feb 20 12:58 sparse
The leftmost 2 is the actual number of blocks used, and 1048577 is the file's size in bytes (not all of them are actually on disk). Since 2 blocks (1024 bytes each on the box where I did this) do not add up to 1048577 bytes, you can spot a sparse file this way.
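The same comparison can be made from C with stat(). The sketch below uses an invented helper name, looks_sparse, and assumes st_blocks is counted in 512-byte units, which is the case on Linux but is not guaranteed everywhere. Treat the result as a heuristic: file systems that compress data or store tiny files inline can also report fewer blocks than the size suggests.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if the file at `path` occupies fewer
 * bytes on disk than its apparent size, 0 if it does not, and -1 if
 * stat() fails. Assumes st_blocks is in 512-byte units. */
int looks_sparse(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    /* Bytes actually allocated versus the size ls -l would show. */
    return (long long)st.st_blocks * 512 < (long long)st.st_size;
}
```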