I am trying to check the size of my text files using lseek. Unfortunately, it doesn't work.
My T.txt contains 16 characters: ABCDABCDDABCDABCD, nothing more. So the number variable should be 16+1. Why is it 19 instead? The second problem: why can't I use
SEEK_END-1 to start from the last position minus 1? I would be grateful for help with that.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("T.txt", O_RDONLY);
    long number;
    if (fd1 < 0) {
        return -1;
    }
    number = lseek(fd1, 0, SEEK_END);
    printf("FILE size PROGRAM>C: %ld\n", number);
    return 0;
}
This is probably because of the \r\n characters in your file, which stand for newline on Windows systems.
On my machine (Mac OS X 10.10) your code gives the right result for your file, provided it doesn't have any newline character on the end, i.e. only the string: ABCDABCDDABCDABCD (output is then: 17).
You use the lseek() function correctly, except that the result of lseek() is an off_t, not a long.
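On the second part of the question: SEEK_END is only the "whence" argument, so SEEK_END-1 is just a different constant, not "one before the end". The distance goes into the offset argument instead. A minimal sketch, assuming a POSIX system; the function names file_size and last_byte are made up for illustration:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the size of the file in bytes, or -1 on error. */
off_t file_size(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);   /* offset 0 from the end == size */
    close(fd);
    return size;
}

/* Reads the last byte of the file into *out.
   Returns 0 on success, -1 on error (including an empty file). */
int last_byte(const char *path, unsigned char *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The offset is -1; SEEK_END stays unchanged as the reference point. */
    int ok = lseek(fd, -1, SEEK_END) != (off_t)-1
          && read(fd, out, 1) == 1;
    close(fd);
    return ok ? 0 : -1;
}
```

To print the result portably, cast it, e.g. printf("%lld\n", (long long)file_size("T.txt")). Note that the string quoted in the question is 17 letters long, which matches the 17 reported above.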
Probably your text file contains a BOM header 0xEF,0xBB,0xBF.
Try to print the file contents in HEX and see if it prints these extra 3 characters.
You can learn more about [file headers and BOM here](https://en.wikipedia.org/wiki/Byte_order_mark).
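A small sketch of the hex check suggested above, using only standard C. The helper names hex_dump and has_utf8_bom are hypothetical. The file is opened with "rb" so that no newline translation hides bytes:

```c
#include <stdio.h>
#include <string.h>

/* Prints every byte of the file as two hex digits, 16 per line.
   Returns the number of bytes printed, or -1 if the file cannot be opened. */
long hex_dump(const char *path)
{
    FILE *fp = fopen(path, "rb");          /* binary mode: no newline translation */
    if (!fp)
        return -1;

    long count = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        printf("%02X%c", (unsigned)c, (count % 16 == 15) ? '\n' : ' ');
        count++;
    }
    if (count % 16 != 0)
        putchar('\n');                     /* finish a partial last line */
    fclose(fp);
    return count;
}

/* Returns 1 if the file starts with the UTF-8 BOM EF BB BF,
   0 if it does not, -1 if the file cannot be opened. */
int has_utf8_bom(const char *path)
{
    unsigned char head[3];
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return -1;

    size_t n = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return n == 3 && memcmp(head, "\xEF\xBB\xBF", 3) == 0;
}
```

In the dump, EF BB BF at the start would indicate a BOM, and 0D 0A at the end would indicate a Windows line ending. Either could account for the unexpected extra bytes.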
The program works correctly on Linux, but I get extra characters after the end of the file when running on Windows or through Wine. They are not garbage but repeated text that was already written. The issue persists whether I write to stdout or to a file, but it doesn't occur with small files; a few hundred KB is needed.
I nailed down the issue to this function:
static unsigned long read_file(const char *filename, const char **output)
{
    struct stat file_stats;
    int fdescriptor;
    unsigned long file_sz;
    static char *file;
    fdescriptor = open(filename, O_RDONLY);
    if (fdescriptor < 0 || (fstat(fdescriptor ,&file_stats) < 0))
    {   printf("Error opening file: %s \n", filename);
        return (0);
    }
    if (file_stats.st_size < 0)
    {   printf("file %s reports an Incorrect size", filename);
        return (0);
    }
    file_sz = (unsigned long)file_stats.st_size;
    file = malloc((file_sz) * sizeof(*file));
    if (!file)
    {   printf("Error allocating memory for file %s of size %lu\n", filename, file_sz);
        return (0);
    }
    read(fdescriptor, file, file_sz);
    *output = file;
    write(STDOUT_FILENO, file, file_sz), exit(1); //this statement added for debugging.
    return (file_sz);
}
I can't debug through Wine, much less on Windows, but by using printf statements I can tell the file size is correct. The issue is in either the reading or the writing, and without a debugger I can't look at the contents of the buffer in memory.
The program was compiled with x86_64-w64-mingw32-gcc, version 8.3, which is the same version of gcc as on my system.
At this point I'm just perplexed; I would love to hear any ideas you may have.
Thank you.
Edit: The issue was that fewer bytes were being read than the reported file size and I was writing more than necessary. Thanks to Matt for telling me where to look.
read() can return a count different from the size reported by fstat(). I was writing the reported file size instead of the actual number of bytes read, which led to the issue. When writing, use the number of bytes that read() actually returned.
It is always best to check the return value of read()/write() for failure and to make sure all bytes have been read, because read() can return fewer bytes than requested when reading from a pipe or when interrupted by a signal, in which case multiple calls are necessary.
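A minimal sketch of such a loop, assuming a POSIX-style read(). The name read_full is hypothetical. It keeps calling read() until the requested count is reached, end of file is hit, or a real error occurs:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads up to `count` bytes from fd into buf, retrying after short reads
   and EINTR. Returns the number of bytes actually read (less than count
   only if end of file was reached), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data: try again */
            return -1;
        }
        if (n == 0)
            break;                 /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

In the function above, that would mean storing the result, e.g. ssize_t got = read_full(fdescriptor, file, file_sz);, and then writing only got bytes. One possible reason for the short count on Windows (my assumption, the thread does not say) is that a descriptor opened without O_BINARY is in text mode there, so line endings are translated and the byte count shrinks.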
Thanks to Mat and Felix for the answer.
I am using the following program to find out the size of a file and allocate memory dynamically. This program has to be multi-platform functional.
But when I run the program on Linux machine and on a Windows machine using Cygwin, I see different outputs — why?
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
/*
 Observation on Linux
 When reading text file remember
 the content in the text file if arranged in lines like below:
 ABCD
 EFGH
 the size of file is 12, its because each line is ended by \r\n, so add 2 bytes for every line we read.
*/
off_t fsize(char *file) {
    struct stat filestat;
    if (stat(file, &filestat) == 0) {
        return filestat.st_size;
    }
    return 0;
}
void ReadInfoFromFile(char *path)
{
    FILE *fp;
    unsigned int size;
    char *buffer = NULL;
    unsigned int start;
    unsigned int buff_size =0;
    char ch;
    int noc =0;
    fp = fopen(path,"r");
    start = ftell(fp);
    fseek(fp,0,SEEK_END);
    size = ftell(fp);
    rewind(fp);
    printf("file size = %u\n", size);
    buffer = (char*) malloc(sizeof(char) * (size + 1) );
    if(!buffer) {
        printf("malloc failed for buffer \n");
        return;
    }
    buff_size = fread(buffer,sizeof(char),size,fp);
    printf(" buff_size = %u\n", buff_size);
    if(buff_size == size)
        printf("%s \n", buffer);
    else
        printf("problem in file size \n %s \n", buffer);
    fclose(fp);
}
int main(int argc, char *argv[])
{
    printf(" using ftell etc..\n");
    ReadInfoFromFile(argv[1]);
    printf(" using stat\n");
    printf("File size = %u\n", fsize(argv[1]));
    return 0;
}
The problem is that fread reads a different number of bytes depending on the compiler.
I have not tried a proper Windows compiler yet.
But what would be a portable way to read the contents of a file?
Output on Linux:
using ftell etc..
file size = 34
buff_size = 34
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
using stat
File size = 34
Output on Cygwin:
using ftell etc..
file size = 34
buff_size = 30
problem in file size
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
_ROAMINGPRã9œw
using stat
File size = 34
Transferring comments into an answer.
The trouble is probably that on Windows, the text file has CRLF line endings ("\r\n"). The input processing maps those to "\n" to match Unix because you use "r" in the open mode (open text file for reading) instead of "rb" (open binary file for reading). This leads to a difference in the byte counts — ftell() reports the bytes including the '\r' characters, but fread() doesn't count them.
But how can I allocate memory if I don't know the actual size? Even in this case, the return value of fread is 30/34, but my content is only 26 bytes.
Define your content — there's a newline or CRLF at the end of each of 4 lines. When the file is opened on Windows (Cygwin) in text mode (no b), then you will receive 3 lines of 9 bytes (8 letters and a newline) plus one line with 3 bytes (2 letters and a newline), for 30 bytes in total. Compared to the 34 that's reported by ftell() or stat(), the difference is the 4 CR characters ('\r') that are not returned. If you opened the file as a binary file ("rb"), then you'd get all 34 characters — 3 lines with 10 bytes and 1 line with 4 bytes.
The good news is that the size reported by stat() or ftell() is bigger than the final number of bytes returned, so allocating enough space is not too hard. It might become wasteful if you have a gigabyte size file with every line containing 1 byte of data and a CRLF. Then you'd "waste" (not use) one third of the allocated space. You could always shrink the allocation to the required size with realloc().
Note that there is no difference between text and binary mode on Unix-like (POSIX) systems such as Linux. They do not map CRLF to NL line endings. If the file is copied from Windows to Linux without mapping the line endings, you will get CRLF at the end of each line on Linux. If the file is copied and the line endings are mapped, you'll get a smaller size on Linux than under Cygwin. (Using "rb" on Linux does no harm; it doesn't do any good either. Using "rb" on Windows/Cygwin could be important; it depends on the behaviour you want.)
See also the C11 standard §7.21.2 Streams and also §7.21.3 Files.
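As a rough sketch of the approach described above: open the file in binary mode, allocate from the reported size, and then trust only the count that fread() returns. The function name read_whole_file is hypothetical. Seeking to SEEK_END on a binary stream works on the mainstream platforms discussed here, but the C standard does not strictly guarantee it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file in binary mode. On success returns a malloc'd,
   NUL-terminated buffer and stores the number of bytes read in *len.
   Returns NULL on error. The caller frees the buffer. */
char *read_whole_file(const char *path, size_t *len)
{
    FILE *fp = fopen(path, "rb");          /* "rb": no CRLF-to-LF mapping */
    if (!fp)
        return NULL;

    char *buf = NULL;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);
    if (size >= 0 && fseek(fp, 0, SEEK_SET) == 0)
        buf = malloc((size_t)size + 1);    /* +1 for the terminating NUL */
    if (buf) {
        size_t got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';                   /* terminate at what was actually read */
        *len = got;
    }
    fclose(fp);
    return buf;
}
```

Terminating the buffer at the returned count is also what keeps printf("%s") from running past the data, which is the likely source of the stray characters at the end of the Cygwin output in the question. If text mode is preferred, the same pattern applies: allocate from the reported size and terminate at the fread() count.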
I'm working on an example problem where I have to reverse the text in a text file using fseek() and ftell(). I was successful, but when printing the same output to a file, I got some weird results.
The text file I input was the following:
redivider
racecar
kayak
civic
level
refer
These are all palindromes
The result in the command line works great. In the text file that I create however, I get the following:
ÿsemordnilap lla era esehTT
referr
levell
civicc
kayakk
racecarr
redivide
I am aware from the answer to this question that the ÿ corresponds to the text file version of EOF in C. I'm just confused as to why the command line and text file outputs are different.
#include <stdio.h>
#include <stdlib.h>
/**********************************
This program is designed to read in a text file and then reverse the order
of the text.
The reversed text then gets output to a new file.
The new file is then opened and read.
**********************************/
int main()
{
    //Open our files and check for NULL
    FILE *fp = NULL;
    fp = fopen("mainText.txt","r");
    if (!fp)
        return -1;
    FILE *fnew = NULL;
    fnew = fopen("reversedText.txt","w+");
    if (!fnew)
        return -2;
    //Go to the end of the file so we can reverse it
    int i = 1;
    fseek(fp, 0, SEEK_END);
    int endNum = ftell(fp);
    while(i < endNum+1)
    {
        fseek(fp,-i,SEEK_END);
        printf("%c",fgetc(fp));
        fputc(fgetc(fp),fnew);
        i++;
    }
    fclose(fp);
    fclose(fnew);
    fp = NULL;
    fnew = NULL;
    return 0;
}
No errors, I just want identical outputs.
The outputs are different because your loop reads two characters from fp per iteration.
For example, in the first iteration i is 1 and so fseek sets the current file position of fp just before the last byte:
...
These are all palindromes
                        ^
Then printf("%c",fgetc(fp)); reads a byte (s) and prints it to the console. Having read the s, the file position is now
...
These are all palindromes
                         ^
i.e. we're at the end of the file.
Then fputc(fgetc(fp),fnew); attempts to read another byte from fp. This fails and fgetc returns EOF (a negative value, usually -1) instead. However, your code is not prepared for this and blindly treats -1 as a character code. Converted to a byte, -1 corresponds to 255, which is the character code for ÿ in the ISO-8859-1 encoding. This byte is written to your file.
In the next iteration of the loop we seek back to the e:
...
These are all palindromes
                       ^
Again the loop reads two characters: e is written to the console, and s is written to the file.
This continues backwards until we reach the beginning of the input file:
redivider
^
Yet again the loop reads two characters: r is written to the console, and e is written to the file.
This ends the loop. The end result is that your output file contains one character that doesn't exist (from the attempt to read past the end of the input file) and never sees the first character.
The fix is to only call fgetc once per loop:
while(i < endNum+1)
{
    fseek(fp,-i,SEEK_END);
    int c = fgetc(fp);
    if (c == EOF) {
        perror("error reading from mainText.txt");
        exit(EXIT_FAILURE);
    }
    printf("%c", c);
    fputc(c, fnew);
    i++;
}
In addition to @melpomene's correction about using only one fgetc() per loop iteration, other issues exist.
fseek(questionable_offset)
fopen("mainText.txt","r"); opens the file in text mode, not binary mode. Using fseek() with arbitrary computed values as offsets into a text stream is therefore prone to trouble. It is usually not a problem on *nix systems.
I do not have a simple alternative.
ftell() return type
ftell() returns long. Use long instead of int for i and endNum. (Not a concern with small files.)
Check return values
ftell() and fseek() can fail. Test for error returns.
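One possible way to combine these points, offered as a sketch rather than a definitive fix: open both files in binary mode so byte offsets are exact, keep positions in a long, and check every call. The function name reverse_file is hypothetical. The trade-off is that a file with CRLF line endings would come out with the pairs reversed to LF CR.

```c
#include <stdio.h>

/* Writes the bytes of in_path to out_path in reverse order.
   Returns 0 on success, -1 on any I/O error. */
int reverse_file(const char *in_path, const char *out_path)
{
    FILE *in = fopen(in_path, "rb");       /* binary: byte offsets are exact */
    if (!in)
        return -1;
    FILE *out = fopen(out_path, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    int status = -1;
    long size;                             /* ftell() returns long */
    if (fseek(in, 0, SEEK_END) != 0 || (size = ftell(in)) < 0)
        goto done;

    /* Walk from the last byte to the first, reading exactly one byte each time. */
    for (long pos = size - 1; pos >= 0; pos--) {
        int c;
        if (fseek(in, pos, SEEK_SET) != 0 || (c = fgetc(in)) == EOF)
            goto done;
        if (fputc(c, out) == EOF)
            goto done;
    }
    status = 0;

done:
    fclose(in);
    if (fclose(out) != 0)                  /* a failed flush is also an error */
        status = -1;
    return status;
}
```

With the file names from the question, the call would be reverse_file("mainText.txt", "reversedText.txt"). Seeking once per byte is slow for large files; reading the file into memory first would be faster, at the cost of more code.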
Is it possible to read a text file that has non-English text?
Example of text in file:
E 37
SVAR:
Fettembolisyndrom. (1 poäng)
Example of what is present in buffer which stores "fread" output using "puts" :
E 37 SVAR:
Fettembolisyndrom.
(1 poäng)
Under Linux my program was working fine, but on Windows I am seeing this problem with non-English letters. Any advice on how this can be fixed?
Program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
int debug = 0;
int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        puts("ERROR! Please enter a filename\n");
        exit(1);
    }
    else if (argc > 2)
    {
        debug = atoi(argv[2]);
        puts("Debugging mode ENABLED!\n");
    }
    FILE *fp = fopen(argv[1], "rb");
    fseek(fp, 0, SEEK_END);
    long fileSz = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    char* buffer;
    buffer = (char*) malloc (sizeof(char)*fileSz);
    size_t readSz = fread(buffer, 1, fileSz, fp);
    rewind(fp);
    if (readSz == fileSz)
    {
        char tmpBuff[100];
        fgets(tmpBuff, 100, fp);
        if (!ferror(fp))
        {
            printf("100 characters from text file: %s\n", tmpBuff);
        }
        else
        {
            printf("Error encounter");
        }
    }
    if (strstr("FRÅGA",buffer) == NULL)
    {
        printf("String not found!");
    }
    return 0;
}
Summary: If you read text from a file encoded in UTF-8 and display it on the console you must either set the console to UTF-8 or transcode the text from UTF-8 to the encoding used by the console (in English-speaking countries, usually MS-DOS code page 437 or 850).
Longer explanation
Bytes are not characters and characters are not bytes. The char data type in C holds a byte, not a character. In particular, the character Å (Unicode <U+00C5>) mentioned in the comments can be represented in many ways, called encodings:
In UTF-8 it is two bytes, '\xC3' '\x85';
In UTF-16 it is two bytes, either '\xC5' '\x00' (little-endian UTF-16), or '\x00' '\xC5' (big-endian UTF-16);
In Latin-1 and Windows-1252, it is one byte, '\xC5';
In MS-DOS code page 437 and code page 850, it is one byte, '\x8F'.
It is the responsibility of the programmer to translate between the internal encoding used by the program (usually but not always Unicode), the encoding used in input or output files, and the encoding expected by the display device.
Note: Sometimes, if the program does not do much with the characters it reads and outputs, one can get by just by making sure that the input files, the output files, and the display device all use the same encoding. In Linux, this encoding is almost always UTF-8. Unfortunately, on Windows the existence of multiple encodings is a fact of life. System calls expect either UTF-16 or Windows-1252. By default, the console displays Code Page 437 or 850. Text files are quite often in UTF-8. Windows is old and complicated.
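To make the idea of transcoding concrete, here is a minimal sketch that converts UTF-8 to Latin-1, the simplest case because code points U+0000 to U+00FF map directly to single bytes. The function name utf8_to_latin1 is hypothetical, and this is not what the console itself needs: a console on code page 437 or 850 would require a lookup table instead (Å becomes '\x8F' there, as listed above).

```c
#include <stddef.h>

/* Transcodes UTF-8 to Latin-1 (ISO-8859-1). Code points above U+00FF and
   malformed sequences are replaced by '?'. `out` must have room for
   in_len + 1 bytes. Returns the number of bytes written, excluding the NUL. */
size_t utf8_to_latin1(const unsigned char *in, size_t in_len, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < in_len) {
        unsigned char b = in[i];
        if (b < 0x80) {                                /* ASCII: one byte */
            out[o++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < in_len &&
                   (in[i + 1] & 0xC0) == 0x80) {       /* U+0080..U+00FF */
            out[o++] = (unsigned char)(((b & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {                                       /* not representable */
            out[o++] = '?';
            i += 1;
            while (i < in_len && (in[i] & 0xC0) == 0x80)
                i++;                                   /* skip continuation bytes */
        }
    }
    out[o] = '\0';
    return o;
}
```

For the two UTF-8 bytes '\xC3' '\x85' this yields the single Latin-1 byte '\xC5'. The alternative mentioned in the summary, switching the console itself to UTF-8, avoids transcoding entirely; on Windows that is typically done with chcp 65001 before running the program.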
I have a solution here that supposedly opens a file and changes the last character of it. I don't quite understand how this works. Could you please explain?
void readlast()
{
    int handle = open("./file.txt", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
    if (handle < 0)
    {
        return;
    }
Okay, this part opens the file and if it doesn't work, returns.
First question: Why is a file opening an integer (int handle)? What is being stored in it?
    char c='N';
    lseek(handle, -2*sizeof(c), SEEK_END);
lseek apparently changes the location of a reader. So I guess this sets the reader to the end of a file (SEEK_END). But why do we need an offset of -2*sizeof(c) if we just want to write one character?
    write(handle, &c, sizeof(c));
    close(handle);
}
I do understand this last part.
Thank you!
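open() returns a file descriptor, which is a small non-negative integer that identifies the open file to the kernel. Descriptors 0, 1, and 2 are customarily standard input, standard output, and standard error.
An offset of file size - 2 [octets] addresses the last character when the file ends with a newline; the final byte itself is at file size - 1.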
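A small sketch of the same operation with the offset made explicit, assuming a POSIX system. The function name overwrite_from_end and its back parameter are made up for illustration. Passing the offset as a signed off_t also sidesteps a subtlety in the original: sizeof yields an unsigned size_t, so -2*sizeof(c) is computed in unsigned arithmetic and only ends up as -2 where size_t and off_t have the same width.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrites the byte `back` positions before the end of the file with c.
   back == 1 is the very last byte; back == 2 is the byte before it, which
   is the last visible character when the file ends in '\n'.
   Returns 0 on success, -1 on error. */
int overwrite_from_end(const char *path, off_t back, char c)
{
    int fd = open(path, O_RDWR);   /* fd: small integer naming the open file */
    if (fd < 0)
        return -1;

    /* Move to `back` bytes before the end, then write one byte there. */
    int ok = lseek(fd, -back, SEEK_END) != (off_t)-1
          && write(fd, &c, 1) == 1;
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For a file containing "hello\n", overwrite_from_end("./file.txt", 2, 'N') would leave "hellN\n", which corresponds to what the snippet in the question does.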