I am using the following program to find the size of a file and allocate memory dynamically. The program has to work on multiple platforms.
But when I run it on a Linux machine and on a Windows machine using Cygwin, I see different outputs. Why?
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
/*
Observation on Linux
When reading text file remember
the content in the text file if arranged in lines like below:
ABCD
EFGH
the size of the file is 12. That is because each line ends with \r\n, so add 2 bytes for every line read.
*/
off_t fsize(char *file) {
struct stat filestat;
if (stat(file, &filestat) == 0) {
return filestat.st_size;
}
return 0;
}
void ReadInfoFromFile(char *path)
{
FILE *fp;
unsigned int size;
char *buffer = NULL;
unsigned int start;
unsigned int buff_size =0;
char ch;
int noc =0;
fp = fopen(path,"r");
start = ftell(fp);
fseek(fp,0,SEEK_END);
size = ftell(fp);
rewind(fp);
printf("file size = %u\n", size);
buffer = (char*) malloc(sizeof(char) * (size + 1) );
if(!buffer) {
printf("malloc failed for buffer \n");
return;
}
buff_size = fread(buffer,sizeof(char),size,fp);
printf(" buff_size = %u\n", buff_size);
if(buff_size == size)
printf("%s \n", buffer);
else
printf("problem in file size \n %s \n", buffer);
fclose(fp);
}
int main(int argc, char *argv[])
{
printf(" using ftell etc..\n");
ReadInfoFromFile(argv[1]);
printf(" using stat\n");
printf("File size = %u\n", fsize(argv[1]));
return 0;
}
The problem is that fread reads a different number of bytes depending on the compiler.
I have not tried a native Windows compiler yet.
What would be a portable way to read the contents of a file?
Output on Linux:
using ftell etc..
file size = 34
buff_size = 34
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
using stat
File size = 34
Output on Cygwin:
using ftell etc..
file size = 34
buff_size = 30
problem in file size
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
_ROAMINGPRã9œw
using stat
File size = 34
Transferring comments into an answer.
The trouble is probably that on Windows, the text file has CRLF line endings ("\r\n"). The input processing maps those to "\n" to match Unix because you use "r" in the open mode (open text file for reading) instead of "rb" (open binary file for reading). This leads to a difference in the byte counts — ftell() reports the bytes including the '\r' characters, but fread() doesn't count them.
But how can I allocate memory if I don't know the actual size? Even in this case the return value of fread is 30 or 34, but my content is only 26 bytes.
Define your content — there's a newline or CRLF at the end of each of 4 lines. When the file is opened on Windows (Cygwin) in text mode (no b), then you will receive 3 lines of 9 bytes (8 letters and a newline) plus one line with 3 bytes (2 letters and a newline), for 30 bytes in total. Compared to the 34 that's reported by ftell() or stat(), the difference is the 4 CR characters ('\r') that are not returned. If you opened the file as a binary file ("rb"), then you'd get all 34 characters — 3 lines with 10 bytes and 1 line with 4 bytes.
The good news is that the size reported by stat() or ftell() is bigger than the final number of bytes returned, so allocating enough space is not too hard. It might become wasteful if you have a gigabyte size file with every line containing 1 byte of data and a CRLF. Then you'd "waste" (not use) one third of the allocated space. You could always shrink the allocation to the required size with realloc().
Note that there is no difference between text and binary mode on Unix-like (POSIX) systems such as Linux. It does not do mapping of CRLF to NL line endings. If the file is copied from Windows to Linux without mapping the line endings, you will get CRLF at the end of each line on Linux. If the file is copied and the line endings are mapped, you'll get a smaller size on Linux than under Cygwin. (Using "rb" on Linux does no harm; it doesn't do any good either. Using "rb" on Windows/Cygwin could be important; it depends on the behaviour you want.)
See also the C11 standard §7.21.2 Streams and also §7.21.3 Files.
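As a minimal sketch of the approach described above (not the asker's code): open the file in binary mode so that ftell() and fread() agree, leave room for a terminating null byte, and shrink the allocation with realloc() if fewer bytes arrive than expected. The function name read_whole_file is hypothetical, and the sketch assumes the file fits in memory and that seeking to the end works on the stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a
   NUL-terminated heap buffer. Returns NULL on failure; on success the
   caller owns the buffer and *out_len holds the number of bytes read. */
char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");   /* binary mode: no CRLF translation */
    if (fp == NULL)
        return NULL;

    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *buffer = malloc((size_t)size + 1);   /* +1 for the terminator */
    if (buffer == NULL) { fclose(fp); return NULL; }

    size_t got = fread(buffer, 1, (size_t)size, fp);
    fclose(fp);
    buffer[got] = '\0';   /* printf("%s") needs a terminated string */

    if (got < (size_t)size) {
        /* Fewer bytes than expected: shrink to what was actually read. */
        char *smaller = realloc(buffer, got + 1);
        if (smaller != NULL)
            buffer = smaller;
    }
    *out_len = got;
    return buffer;
}
```

The program in the question never writes a terminating null byte into its buffer, which would explain the garbage printed after "YX" in the Cygwin output.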
Related
Is it possible to read a text file that has non-English text?
Example of text in file:
E 37
SVAR:
Fettembolisyndrom. (1 poäng)
Example of what is present in buffer which stores "fread" output using "puts" :
E 37 SVAR:
Fettembolisyndrom.
(1 poäng)
Under Linux my program was working fine, but on Windows I am seeing this problem with non-English letters. Any advice on how this can be fixed?
Program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
int debug = 0;
int main(int argc, char* argv[])
{
if (argc < 2)
{
puts("ERROR! Please enter a filename\n");
exit(1);
}
else if (argc > 2)
{
debug = atoi(argv[2]);
puts("Debugging mode ENABLED!\n");
}
FILE *fp = fopen(argv[1], "rb");
fseek(fp, 0, SEEK_END);
long fileSz = ftell(fp);
fseek(fp, 0, SEEK_SET);
char* buffer;
buffer = (char*) malloc (sizeof(char)*fileSz + 1); /* +1 for the terminator that strstr() needs */
size_t readSz = fread(buffer, 1, fileSz, fp);
buffer[readSz] = '\0';
rewind(fp);
if (readSz == fileSz)
{
char tmpBuff[100];
fgets(tmpBuff, 100, fp);
if (!ferror(fp))
{
printf("100 characters from text file: %s\n", tmpBuff);
}
else
{
printf("Error encounter");
}
}
if (strstr(buffer, "FRÅGA") == NULL) /* haystack first, then needle */
{
printf("String not found!");
}
return 0;
}
Summary: If you read text from a file encoded in UTF-8 and display it on the console you must either set the console to UTF-8 or transcode the text from UTF-8 to the encoding used by the console (in English-speaking countries, usually MS-DOS code page 437 or 850).
Longer explanation
Bytes are not characters and characters are not bytes. The char data type in C holds a byte, not a character. In particular, the character Å (Unicode <U+00C5>) mentioned in the comments can be represented in many ways, called encodings:
In UTF-8 it is two bytes, '\xC3' '\x85';
In UTF-16 it is two bytes, either '\xC5' '\x00' (little-endian UTF-16), or '\x00' '\xC5' (big-endian UTF-16);
In Latin-1 and Windows-1252, it is one byte, '\xC5';
In MS-DOS code page 437 and code page 850, it is one byte, '\x8F'.
It is the responsibility of the programmer to translate between the internal encoding used by the program (usually but not always Unicode), the encoding used in input or output files, and the encoding expected by the display device.
Note: Sometimes, if the program does not do much with the characters it reads and outputs, one can get by just by making sure that the input files, the output files, and the display device all use the same encoding. In Linux, this encoding is almost always UTF-8. Unfortunately, on Windows the existence of multiple encodings is a fact of life. System calls expect either UTF-16 or Windows-1252. By default, the console displays Code Page 437 or 850. Text files are quite often in UTF-8. Windows is old and complicated.
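To illustrate what "transcode" means in practice, here is a minimal sketch that converts UTF-8 to Latin-1. Latin-1 is chosen only because its byte values equal the first 256 Unicode code points, so no lookup table is needed; targeting code page 437 or 850 would require a mapping table that is not shown here. The function name utf8_to_latin1 is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: converts UTF-8 text in `in` (in_len bytes) to
   Latin-1 in `out`, which must have room for at least in_len bytes.
   Code points above U+00FF and malformed bytes become '?'.
   Returns the number of bytes written. */
size_t utf8_to_latin1(const unsigned char *in, size_t in_len, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < in_len) {
        unsigned char b = in[i];
        if (b < 0x80) {                                /* ASCII: one byte */
            out[o++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < in_len
                   && (in[i + 1] & 0xC0) == 0x80) {    /* two-byte sequence */
            unsigned int cp = ((b & 0x1Fu) << 6) | (in[i + 1] & 0x3Fu);
            out[o++] = (cp >= 0x80 && cp <= 0xFF) ? (unsigned char)cp : '?';
            i += 2;
        } else {
            /* Longer sequence or malformed input: emit one '?' and skip
               the lead byte plus any continuation bytes that follow. */
            out[o++] = '?';
            i += 1;
            while (i < in_len && (in[i] & 0xC0) == 0x80)
                i += 1;
        }
    }
    return o;
}
```

With this sketch the two UTF-8 bytes '\xC3' '\x85' for Å become the single Latin-1 byte '\xC5', matching the list of encodings above.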
I am trying to check the size of my text files using lseek. Unfortunately it doesn't work.
My T.txt contains 16 characters: ABCDABCDDABCDABCD, nothing more. So the number variable should be 16+1. Why is it 19 instead? The second problem: why can't I use
SEEK_END-1 to start from the last position minus 1? I would be grateful for help with that.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
int fd1 = open("T.txt", O_RDONLY);
long number;
if (fd1 < 0) {
return -1;
}
number = lseek(fd1, 0, SEEK_END);
printf("FILE size PROGRAM>C: %ld\n", number);
return 0;
}
This is probably because of the \r\n characters in your file, which stand for newline on Windows systems.
On my machine (Mac OS X 10.10) your code gives the right result for your file, provided it doesn't have any newline character on the end, i.e. only the string: ABCDABCDDABCDABCD (output is then: 17).
You use the lseek() function correctly except that the result of lseek() is off_t not a long.
Probably your text file contains a BOM header 0xEF,0xBB,0xBF.
Try to print the file contents in HEX and see if it prints these extra 3 characters.
You can learn more about file headers and the BOM here: https://en.wikipedia.org/wiki/Byte_order_mark
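On the second part of the question (seeking to the last position minus 1), here is a minimal sketch, assuming a POSIX system. The offset goes in the second argument of lseek(), so the call is lseek(fd, -1, SEEK_END); writing SEEK_END-1 changes the whence constant itself rather than the offset. The helper name size_and_last_byte is hypothetical. Looking at the last byte is also a quick way to check for a trailing '\n'.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the file size in bytes, or -1 on error.
   If the file is not empty, *last receives its final byte. */
off_t size_and_last_byte(const char *path, unsigned char *last)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);   /* offset 0 relative to the end */
    if (size > 0) {
        /* Offset -1 relative to the end: positions on the last byte. */
        if (lseek(fd, -1, SEEK_END) < 0 || read(fd, last, 1) != 1)
            size = -1;
    }
    close(fd);
    return size;
}
```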
How do I read/write a block device? I heard I can read/write it like a normal file, so I set up a loop device by doing
sudo losetup /dev/loop4 ~/file
Then I ran the app on the file and then on the loop device
sudo ./a.out file
sudo ./a.out /dev/loop4
The run on the file worked perfectly. The loop device reads 0 bytes. In both cases I got FP==3 and off==0. With the file, the program correctly gets the string length and prints the string, while with the loop device it gets 0 and prints nothing.
How do I read/write to a block device?
#include <fcntl.h>
#include <cstdio>
#include <unistd.h>
int main(int argc, char *argv[]) {
char str[1000];
if(argc<2){
printf("Error args\n");
return 0;
}
int fp = open(argv[1], O_RDONLY);
printf("FP=%d\n", fp);
if(fp<0) {
perror("Error opening file");
return(-1);
}
off_t off = lseek(fp, 0, SEEK_SET);
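ssize_t len = read(fp, str, sizeof str - 1); // leave room for the terminator
if(len<0) {
perror("Error reading file");
close(fp);
return(-1);
}
str[len]=0;
printf("%zd, %d=%s\n", len, static_cast<int>(off), str);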
close(fp);
}
losetup seems to map the file in 512-byte sectors. If the file size is not a multiple of 512, the remainder is truncated.
When mapping a file to /dev/loopX with losetup,
for a file that is smaller than 512 bytes it gives the following warning:
Warning: file is smaller than 512 bytes;
the loop device may be useless or invisible for system tools.
For a file whose size is not a multiple of 512:
Warning: file does not fit into a 512-byte sector;
the end of the file will be ignored
This warning has been present since util-linux 2.22.
You cannot simply pad the file with zeros or random values to get 512-byte alignment, because a reader could not tell where the content ends. Instead, use the first few bytes to store the file size, followed by the file content, so you know where the content ends. Then add arbitrary padding to reach 512-byte alignment.
e.g. File structure:
[File Size] [Data][<padding to get 512 alignment>]
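The layout above could be written roughly as follows. This is only a sketch under stated assumptions: the header is an 8-byte length in host byte order (so writer and reader are assumed to run on the same kind of machine), the padding is zeros rather than random data, and the name write_padded is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR 512u

/* Hypothetical helper: writes [8-byte length][data][zero padding] so that
   the total size is a multiple of 512.
   Returns the total number of bytes written, or 0 on error. */
size_t write_padded(const char *path, const void *data, uint64_t len)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return 0;

    size_t total = sizeof len + (size_t)len;
    size_t pad = (SECTOR - total % SECTOR) % SECTOR;   /* 0..511 bytes */
    unsigned char zeros[SECTOR];
    memset(zeros, 0, sizeof zeros);

    int ok = fwrite(&len, sizeof len, 1, fp) == 1
          && fwrite(data, 1, (size_t)len, fp) == (size_t)len
          && fwrite(zeros, 1, pad, fp) == pad;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? total + pad : 0;
}
```

A reader would first read the 8-byte length and then exactly that many bytes of data, ignoring whatever padding follows.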
I have a text file(unsigned short values) as follows
abc.txt
2311
1231
1232
54523
32423
I'm reading this file in my function using while loop and storing in a buffer as follows
while(!feof(ref))
{
fscanf(ref,"%d\n",&ref[count]);
count++;
}
Reading a large file this way takes too much time. Is there any way to optimize the fscanf operation?
This is because secondary memory access is slower than primary memory access. First dump the file into primary memory using fread() in binary mode. Then read from primary memory integer by integer.
A common way is to read a larger chunk into a large memory buffer, and then parse out the data from that buffer.
Another way may be to instead memory map the file, then the OS will put the file into your process virtual memory map, so you can read it like reading from memory.
Use a local buffer and read blocks of data using fread() in binary mode. Parse your text data and continue with the next block.
Tune your buffer size properly, maybe 64 KB or 1 MB; it depends on your application.
#include <stdio.h>
int BUFFER_SIZE = 1024;
FILE *source;
FILE *destination;
int n;
int count = 0;
int written = 0;
int main()
{
unsigned char buffer[BUFFER_SIZE];
source = fopen("myfile", "rb");
if (source)
{
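/* Loop on fread()'s return value rather than on feof() */
while ((n = fread(buffer, 1, BUFFER_SIZE, source)) > 0)
{
count += n;
// here parse data
}
fclose(source); /* only close a stream that was actually opened */
}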
return 0;
}
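The "here parse data" step is the tricky part, because a number can be split across two blocks. Below is one possible sketch: it keeps the partially parsed value between fread() calls. It assumes unsigned decimal numbers separated by whitespace, does no overflow checking, and uses a static block buffer (so it is not reentrant). The name parse_numbers is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: reads unsigned decimal numbers separated by
   whitespace from `fp` in 64 KiB blocks and stores up to `max` of them
   in `out`. A number split across two blocks is handled because the
   partial value is carried in `value` between fread() calls.
   Returns the count of numbers stored. */
size_t parse_numbers(FILE *fp, unsigned long *out, size_t max)
{
    static unsigned char block[64 * 1024];
    unsigned long value = 0;
    int in_number = 0;
    size_t count = 0, n;

    while (count < max && (n = fread(block, 1, sizeof block, fp)) > 0) {
        for (size_t i = 0; i < n && count < max; i++) {
            unsigned char c = block[i];
            if (c >= '0' && c <= '9') {
                value = value * 10 + (unsigned long)(c - '0');
                in_number = 1;
            } else if (in_number) {      /* a separator ends the number */
                out[count++] = value;
                value = 0;
                in_number = 0;
            }
        }
    }
    if (in_number && count < max)        /* last number without a newline */
        out[count++] = value;
    return count;
}
```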
This may be faster if each line has only one number: atoi() is a lot faster than fscanf().
#define BUFLEN 128
#define ARRAY_SIZE 12345
int myarray[ARRAY_SIZE];
char buffer[BUFLEN];
FILE *fp= fopen(...);
int index=0;
/* fgets() returns NULL at end of file or on error, so no feof() test is needed */
while( index < ARRAY_SIZE && fgets(buffer, BUFLEN, fp) != NULL )
{
myarray[index++]= atoi(buffer);
}
...hastily typed in code, not compiled or run ;)
You can improve file reading by setting a larger stream buffer. Call setvbuf() right after opening the file, before the first read, e.g.
#define STRMBUF_SIZE (64*1024)
char strmbuf[STRMBUF_SIZE];
setvbuf( fp, strmbuf,_IOFBF,STRMBUF_SIZE);
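Put together, a reading loop with a larger stream buffer might look like the sketch below. The function name read_ints is hypothetical. The loop tests the return value of fscanf() instead of feof(), and the buffer is static so that it stays valid for as long as the stream is open (which also makes the function non-reentrant).

```c
#include <stdio.h>

#define STRMBUF_SIZE (64 * 1024)

/* Hypothetical helper: reads up to `max` integers from the text file at
   `path` into `out`. Returns the count of integers read. */
size_t read_ints(const char *path, int *out, size_t max)
{
    static char strmbuf[STRMBUF_SIZE];   /* must outlive the open stream */
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 0;

    /* setvbuf() has to be called before any other operation on the stream */
    if (setvbuf(fp, strmbuf, _IOFBF, sizeof strmbuf) != 0) {
        fclose(fp);
        return 0;
    }

    size_t count = 0;
    while (count < max && fscanf(fp, "%d", &out[count]) == 1)
        count++;

    fclose(fp);
    return count;
}
```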
OK, I have been reading up on fread() (which returns a size_t) and saw several posts about large files and the issues others have had, but I am still having problems. This function takes a file pointer and a long long int. The long long value comes from main, where I use another function to get the actual file size, which is 6448619520 bytes.
char *getBuffer(FILE *fptr, long long size) {
char *bfr;
size_t result;
printf("size of file in allocate buffer: %lld\n", size);
//size here is 6448619520
bfr = (char*) malloc(sizeof(char) * size);
if (bfr == NULL) {
printf("Error, malloc failed..\n");
exit(EXIT_FAILURE);
}
//positions fptr to offset location which is 0 here.
fseek(fptr, 0, SEEK_SET);
//read the entire input file into bfr
result = fread(bfr, sizeof(char), size, fptr);
printf("result = %lld\n", (long long) result);
if(result != size)
{
printf("File failed to read\n");
exit(5);
}
return (bfr);
}
I have tested it on files of around 1-2 GB in size and it works fine. However, when I test it on a 6 GB file, nothing is read into the buffer. Ignore the other output and focus on the "result =" lines; the issue lies with reading the data into bfr. Here are some of the results I get.
Run #1, against a file that is 735844352 bytes (700+ MB):
root#redbox:/data/projects/C/stubs/# ./testrun -x 45004E00 -i /data/Helix2008R1.iso
Image file is /data/Helix2008R1.iso
hex string = 45004E00
Total size of file: 735844352
size of file in get buffer: 735844352
result = 735844352
Begin parsing the command line hex value: 45004E00
Total number of bytes in hex string: 4
Results of hex string search:
Hex string 45004E00 was found at byte location: 37441
Hex string 45004E00 was found at byte location: 524768
....
Run #2 against a 6gb file:
root#redbox:/data/projects/C/stubs/# ./testrun -x BF1B0650 -i /data/images/sixgbimage.img
Image file is /data/images/sixgbimage.img
hex string = BF1B0650
Total size of file: 6448619520
size of file in allocate buffer: 6448619520
result = 0
File failed to read
I am still not sure why it is failing with large files and not smaller ones. Is it a >4 GB issue? I am using the following:
/* Support Large File Use */
#define _LARGEFILE_SOURCE 1
#define _LARGEFILE64_SOURCE 1
#define _FILE_OFFSET_BITS 64
BTW, I am using an ubuntu 9.10 box (2.6.x kernel). tia.
If you're just going to be reading through the file, not modifying it, I suggest using mmap(2) instead of fread(3). This should be much more efficient, though I haven't tried it on huge files. You'll need to change my very simplistic found/not found to report offsets if that is what you would rather have, but I'm not sure what you want the pointer for. :)
#define _GNU_SOURCE
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char* argv[]) {
char *base, *found;
off_t len;
struct stat sb;
int ret;
int fd;
unsigned int needle = 0x45004E00;
ret = stat(argv[1], &sb);
if (ret) {
perror("stat");
return 1;
}
len = sb.st_size;
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 1;
}
base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
if (base == MAP_FAILED) { /* mmap() reports failure with MAP_FAILED, not NULL */
perror("mmap");
return 1;
}
found = memmem(base, len, &needle, sizeof(unsigned int));
if (found)
printf("Found %X at %p\n", needle, found);
else
printf("Not found");
return 0;
}
Some tests:
$ ./mmap ./mmap
Found 45004E00 at 0x7f8c4c13a6c0
$ ./mmap /etc/passwd
Not found
If this is a 32 bit process, as you say, then size_t is 32 bit and you simply cannot store more than 4GB in your process's address space (actually, in practice, a bit less than 3GB). In this line here:
bfr = (char*) malloc(sizeof(char) * size);
The result of the multiplication will be reduced modulo SIZE_MAX + 1, which means it'll only try and allocate around 2GB. Similarly, the same thing happens to the size parameter in this line:
result = fread(bfr, sizeof(char), size, fptr);
If you wish to work with large files in a 32 bit process, you have to work on only a part of them at a time (eg. read the first 100 MB, process that, read the next 100 MB, ...). You can't read the entire file in one go - there just isn't enough memory available to your process to do that.
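The arithmetic behind "reduced modulo SIZE_MAX + 1" can be checked with a small sketch that models a 32-bit size_t using uint32_t. Both function names are hypothetical. The second helper shows one way to guard a malloc() call against a size that does not fit in size_t on the current platform.

```c
#include <stdint.h>

/* Models what a 32-bit size_t does to a 64-bit byte count:
   the value is reduced modulo 2^32 (SIZE_MAX + 1 on such a system). */
uint32_t as_32bit_size(long long size)
{
    return (uint32_t)size;
}

/* Returns 1 if `size` can be represented in this platform's size_t. */
int fits_in_size_t(long long size)
{
    return size >= 0 && (unsigned long long)size <= SIZE_MAX;
}
```

For the file in the question, as_32bit_size(6448619520LL) gives 2153652224, which is 6448619520 - 4294967296, or roughly 2 GB.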
On POSIX systems, when fread fails it sets errno to indicate why. What is the value of errno after the call to fread that returns zero?
Update:
Are you required to read the entire file in one fell swoop? What happens if you read in the file, say, 512MB at a time?
According to your comment above, you are using a 32-bit OS. In that case, you will be unable to handle 6 GB at a time (for one, size_t won't be able to hold that large of a number). You should, however, be able to read in and process the file in smaller chunks.
I would argue that reading a 6GB file into memory is probably not the best solution to your problem even on a 64-bit OS. What exactly are you trying to accomplish that is requiring you to buffer a 6GB file? There's probably a better way to approach the problem.
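To answer the errno question raised above, a short read can be diagnosed along these lines. This is a sketch with a hypothetical helper name, read_and_report; it relies on POSIX behaviour for errno and uses ferror() and feof() to tell a real error from end of file.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads up to `want` bytes into `buf` and reports
   on stderr why fewer bytes arrived, if that happens.
   Returns the number of bytes actually read. */
size_t read_and_report(FILE *fp, void *buf, size_t want)
{
    errno = 0;
    size_t got = fread(buf, 1, want, fp);
    if (got < want) {
        if (ferror(fp))
            fprintf(stderr, "read error: %s\n", strerror(errno));
        else if (feof(fp))
            fprintf(stderr, "end of file after %zu of %zu bytes\n", got, want);
    }
    return got;
}
```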
After taking everyone's advice, I broke the 6 GB file up into 4K chunks, parsed the hex bytes, and was able to get the byte locations, which will help me later when I pull out the MBR from a VMFS partition that has been dd imaged. Here is the quick and dirty way of reading it chunk by chunk:
#define DEFAULT_BLOCKSIZE 4096
...
while((bytes_read = fread(chunk, sizeof(unsigned char), sizeof(chunk), fptr)) > 0) {
chunkptr = chunk;
for(z = 0; z < bytes_read; z++) {
if (*chunkptr == pattern_buffer[current_search]) {
current_search++;
if (current_search > (counter - 1)) {
current_search = 0;
printf("Hex string %s was found at starting byte location: %lld\n",
hexstring, (long long int) (offsetctr-1));
matches++;
}
} else {
current_search = 0;
}
chunkptr++;
//printf("[%lld]: %02X\n", offsetctr, chunk[z] & 0xff);
offsetctr++;
}
master_counter += bytes_read;
}
...
and here were the results I got...
root#redbox:~/workspace/bytelocator/Debug# ./bytelocator -x BF1B0650 -i /data/images/sixgbimage.img
Total size of /data/images/sixgbimage.img file: 6448619520 bytes
Parsing the hex string now: BF1B0650
Hex string BF1B0650 was found at starting byte location: 18
Hex string BF1B0650 was found at starting byte location: 193885738
Hex string BF1B0650 was found at starting byte location: 194514442
Hex string BF1B0650 was found at starting byte location: 525033370
Hex string BF1B0650 was found at starting byte location: 1696715251
Hex string BF1B0650 was found at starting byte location: 1774337550
Hex string BF1B0650 was found at starting byte location: 2758859834
Hex string BF1B0650 was found at starting byte location: 3484416018
Hex string BF1B0650 was found at starting byte location: 3909721614
Hex string BF1B0650 was found at starting byte location: 3999533674
Hex string BF1B0650 was found at starting byte location: 4018701866
Hex string BF1B0650 was found at starting byte location: 4077977098
Hex string BF1B0650 was found at starting byte location: 4098838010
Quick stats:
================
Number of bytes that have been read: 6448619520
Number of signature matches found: 13
Total number of bytes in hex string: 4
Have you verified that malloc and fread are actually taking in the right type of parameters? You may want to compile with the -Wall option and check if your 64-bit values are actually being truncated. In this case, malloc won't report an error but would end up allocating far less than what you had asked for.