Convert binary data file generated in windows to linux - c

I apologize ahead of time for my lack of c knowledge, as I am a native FORTRAN programmer. I was given some c code to debug which ingests a binary file and parses it into an input file containing several hundred records (871, to be exact) for a Fortran program that I'm working with. The problem is that these input binaries, and the associated c code, were created in a Windows environment. The parser reads through the binary until it reaches the end of the file:
SAGE_Lvl0_Packet GetNextPacket()
{
    int i;
    SAGE_Lvl0_Packet inpkt;
    WORD rdbuf[128];
    memset(rdbuf,0,sizeof(rdbuf));
    fprintf(stdout,"Nbytes: %u\n",Nbytes);//returns 224
    if((i = fread(rdbuf,Nbytes,1,Fp)) != 1)
        FileEnd = 1;
    else
    {
        if(FileType == 0)
            memcpy(&(inpkt.CCSDS),rdbuf,Nbytes);
        else
            memcpy(&inpkt,rdbuf,Nbytes);
        memcpy(&CurrentPacket,&inpkt,sizeof(inpkt));
    }
    return inpkt;
}
So when the code gets to packet 872, this snippet should return FileEnd = 1. Instead, the parser attempts to read a large amount of data from (near) the end of the file. This, I would think, would cause the program to crash (at least it would in Fortran. Would c just start reading the next portion of memory?) Fortunately, there is a CRC later on in the code that catches that the parser isn't reading correct data and exits gracefully.
I assume the problem originates with the binary buffer size and value in a Windows binary being larger/different than that in Linux. If that is the case, is there an easy way to convert Windows' binaries to Linux either in c or Linux? If I'm wrong in my assumption, then perhaps I need to look over the code some more. BTW, a WORD is an unsigned short int, and a SAGE_Lvl0_Packet is a 3-tiered structure with a total of 106 WORDs.

I think the biggest problem here is that, when fread() indicates end of file, the FileEnd flag gets set, but the function still ends up returning an (invalid) zeroed-out packet. Not a particularly robust design. I assume that the caller should be checking FileEnd before it attempts to use the packet just returned, but since that's not shown, it's quite possible that's a false assumption.
Also, not knowing what the packet looks like, it's impossible to tell whether the various memcpy() calls are correct. The fact that memcpy() is asked to copy 224 bytes into a structure that is supposedly only 212 bytes long is highly problematic.
There are likely other issues, but those are the big ones I see at the moment.
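To make the two points above concrete, here is a minimal sketch of a more defensive read. The helper read_record() and its status codes are hypothetical (they are not part of the parser in the question); the idea is only that the function refuses to copy more bytes than the destination can hold and reports end-of-file through its return value instead of a global flag.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes for this sketch. */
enum read_status { READ_OK = 0, READ_EOF = 1, READ_TOO_BIG = 2 };

/*
 * Read one record of `nbytes` bytes from `fp` into `dst`, which can hold
 * `dst_size` bytes. Nothing is written to `dst` unless a full record was read.
 */
enum read_status read_record(FILE *fp, void *dst, size_t dst_size, size_t nbytes)
{
    unsigned char buf[256];   /* staging buffer, same size as the original rdbuf */

    assert(fp != NULL && dst != NULL);

    /* Reject sizes that would overflow either buffer (e.g. 224 into 212). */
    if (nbytes == 0 || nbytes > sizeof buf || nbytes > dst_size)
        return READ_TOO_BIG;

    /* fread returns the number of complete items read: 1 or 0 here. */
    if (fread(buf, nbytes, 1, fp) != 1)
        return READ_EOF;      /* end of file, short record, or read error */

    memcpy(dst, buf, nbytes);
    return READ_OK;
}
```

A caller would loop while read_record() returns READ_OK and never touch the packet after any other status, which removes the need to remember to check a separate FileEnd flag.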

Related

Buffer overflow that overwrites local variables

I'm doing a buffer overflow exercise where the source code is given. The exercise allows you to change the number of argument vectors you feed into the program so you can get around the null problem making it easy.
However the exercise also mentions that it is possible to use just 1 argument vector to compromise this code. I'm curious to see how this can be done. Any ideas on how to approach this would be greatly appreciated.
The problem here is that length needs to be overwritten in order for the overflow to take place and the return address to be compromised. To my knowledge, you can't really use NULs in the string, since they are being passed in via execve arguments. So the length ends up being a very large number, because you have to write some non-zero bytes, which causes the entire stack to go boom; it's the same case with the return address. Am I missing something obvious? Does strlen need to be exploited? I saw some references to arithmetic overflow of signed numbers, but I'm not sure if turning the local variables negative does anything.
The code is posted below and returns to a main function which then ends the program and runs on a little endian system with all stack protection turned off as this is an introductory exercise for infosec:
int TrickyOverflowSeq ( char *in )
{
    char to_be_exploited[128];
    int c;
    int limit;
    limit = strlen(in);
    if (limit > 144)
        limit = 144;
    for (c = 0; c <= limit; c++)
        to_be_exploited[c] = in[c];
    return(0);
}
I don't know where arg comes from, but since your buffer is only 128 bytes, and you cap the max length to 144, you need only pass in a string longer than 128 bytes to cause a buffer overrun when copying in to to_be_exploited. Any malicious code would be in the input buffer from positions 129 to 144.
Whether or not that will properly set up a return to a different location depends on many factors.
"However the exercise also mentions that it is possible to use just 1 argument vector to compromise this code. I'm curious to see how this can be done. [...] The problem here is that length needs to be overwritten in order for the overflow to take place and the return address to be compromised."
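It seems pretty straightforward to me. That magic number 144 makes sense if sizeof(int) == 8, which is what I am assuming here. Be aware that the common 64-bit ABIs (LP64 on Linux, LLP64 on Windows) keep int at 4 bytes, so check sizeof(int) and the actual stack layout on your target before relying on these offsets.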
So assuming a stack layout where to_be_exploited comes before c and limit, you can simply pass in a very long string with junk in the bytes starting at offset 136 (i.e., 128 + sizeof(int)), and then carefully crafted junk in the bytes starting with offset 144. This will overwrite limit starting with that byte, thus disabling the length check. Then the carefully crafted junk overwrites the return address.
You could put almost anything into the 8 bytes starting at offset 136 and have them make a number that is large enough to disable the security check. Just make sure you don't end up with a negative number. For example, the string "HAHAHAHA" read as a big-endian 64-bit integer is 5206522089439316033 (on a little-endian machine the same bytes read as 4704081589377188168). Either number is larger than 144... actually, it's too large, as you want this function to stop copying once your string is copied. So you just need to figure out how long your attack string actually is and put the correct bytes for that length into that position, and the attack will be copied in.
Note that normal string-handling functions in C use a NUL byte as a terminator, and stop copying. This function doesn't do that; it just trusts limit. So you could put any junk you want in the input string to exploit this function. However, if normal C library functions need to copy the input data, you might end up needing to avoid NUL bytes.
Of course nobody should put code this silly into production.
EDIT: I wrote the above in a hurry. Now that I have more time, I re-read your question and I think I better understand what you wanted to have explained.
You are wondering how a string can correctly clobber limit with a correct length without having strlen() chop it off short. This is impossible on a big-endian computer, but perfectly possible on a little-endian computer.
On a little-endian computer, the first byte is the least significant byte. See the Wikipedia entry:
http://en.wikipedia.org/wiki/Endianness
Any number that is not ridiculously large must have zero in its most significant bytes. On a big-endian computer that means the first several bytes will all be zero, will act like a NUL, and will cause strlen() to chop the string before the function can clobber limit. However, on a little-endian computer, the important bytes you want copied will all come before the NUL bytes.
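The byte-order argument can be checked with a small sketch. The helpers below are hypothetical names, and a 4-byte value is assumed for brevity; they lay a number out in both byte orders explicitly (so the result does not depend on the machine running the code) and count how many bytes a strlen()-style scan would see before the first zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write `value` into out[0..3], least significant byte first. */
void encode_le32(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++)
        out[i] = (unsigned char)(value >> (8 * i));
}

/* Write `value` into out[0..3], most significant byte first. */
void encode_be32(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++)
        out[i] = (unsigned char)(value >> (8 * (3 - i)));
}

/* Count the bytes before the first zero byte, the way strlen() would. */
size_t bytes_before_nul(const unsigned char *bytes, size_t n)
{
    size_t i = 0;

    assert(bytes != NULL);
    while (i < n && bytes[i] != 0)
        i++;
    return i;
}
```

For the value 200, the little-endian layout is C8 00 00 00, so the meaningful byte comes before any zero byte; the big-endian layout is 00 00 00 C8, so a string scan stops immediately and never reaches it.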
In the early days of the Internet, it was common for big-endian computers (often bought from Sun Microsystems) to run Internet server apps. These days, commodity x86 server hardware is most common, and x86 is little-endian. In practice, anyone deploying such exploitable code as the TrickyOverflowSeq() function will get 0wned.
If you don't think this answer is thorough enough, please post a comment explaining what part you think I need to cover better and I'll update the answer.
I am aware that this is quite an old post, however I stumbled on your question because I found myself in the same situation with exactly the same questions as the ones you ask in your post and in the comments.
A few minutes later, I solved the problem. I don't know how much of it I should "spoil" here, since AFAIK this is a typical problem in many Computer Security courses. I can say, however, that the solution can indeed be achieved with exactly one argument... and with a couple of environment variables. Additional hint: environment variables are stored after function arguments on the stack (that is, at higher addresses than the function arguments).

What really is EOF for binary files? Condition? Character?

I have managed this far with the knowledge that EOF is a special character inserted automatically at the end of a text file to indicate its end. But I now feel the need for some more clarification on this. I checked on Google and the Wikipedia page for EOF but they couldn't answer the following, and there are no exact Stack Overflow links for this either. So please help me on this:
My book says that binary mode files keep track of the end of file from the number of characters present in the directory entry of the file. (In contrast to text files which have a special EOF character to mark the end). So what is the story of EOF in context of binary files? I am confused because in the following program I successfully use !=EOF comparison while reading from an .exe file in binary mode:
#include<stdio.h>
#include<stdlib.h>
int main()
{
int ch;
FILE *fp1,*fp2;
fp1=fopen("source.exe","rb");
fp2=fopen("dest.exe","wb");
if(fp1==NULL||fp2==NULL)
{
printf("Error opening files");
exit(-1);
}
while((ch=getc(fp1))!=EOF)
putc(ch,fp2);
fclose(fp1);
fclose(fp2);
}
Is EOF a special "character" at all? Or is it a condition as Wikipedia says, a condition where the computer knows when to return a particular value like -1 (EOF on my computer)? Example of such "condition" being when a character-reading function finishes reading all characters present, or when character/string I/O functions encounter an error in reading/writing?
Interestingly, the Stack Overflow tag for EOF blended both those definitions of the EOF. The tag for EOF said "In programming realm, EOF is a sequence of byte (or a character) which indicates that there are no more contents after this.", while it also said in the "about" section that "End of file (commonly abbreviated EOF) is a condition in a computer operating system where no more data can be read from a data source. The data source is usually called a file or stream."
But I have a strong feeling EOF won't be a character as every other function seems to be returning it when it encounters an error during I/O.
It will be really nice of you if you can clear the matter for me.
The various EOF indicators that C provides to you do not necessarily have anything to do with how the file system marks the end of a file.
Most modern file systems know the length of a file because they record it somewhere, separately from the contents of the file. The routines that read the file keep track of where you are reading and they stop when you reach the end. The C library routines generate an EOF value to return to you; they are not returning a value that is actually in the file.
Note that the EOF returned by C library routines is not actually a character. The C library routines generally return an int, and that int is either a character value or an EOF. E.g., in one implementation, the characters might have values from 0 to 255, and EOF might have the value −1. When the library routine encountered the end of the file, it did not actually see a −1 character, because there is no such character. Instead, it was told by the underlying system routine that the end of file had been reached, and it responded by returning −1 to you.
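A minimal sketch of why the return type matters, using a hypothetical count_bytes() helper: because getc() hands back an int, a data byte with value 0xFF (255) stays distinct from EOF, and the loop ends only when the library reports that nothing more can be read.

```c
#include <assert.h>
#include <stdio.h>

/* Count the bytes in a stream by reading until getc() reports EOF. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int ch;                 /* int, not char: must hold 0..255 and EOF */

    assert(fp != NULL);
    while ((ch = getc(fp)) != EOF)
        n++;
    return n;
}
```

After the loop, feof(fp) and ferror(fp) tell you whether EOF meant a real end of file or a read error. If ch were declared as char, a 0xFF byte could compare equal to EOF on platforms where char is signed, and the copy would stop early.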
Old and crude file systems might have a value in the file that marks the end of file. For various reasons, this is usually undesirable. In its simplest implementation, it makes it impossible to store arbitrary data in the file, because you cannot store the end-of-file marker as data. One could, however, have an implementation in which the raw data in the file contains something that indicates the end of file, but data is transformed when reading or writing so that arbitrary data can be stored. (E.g., by “quoting” the end-of-file marker.)
In certain cases, things like end-of-file markers also appear in streams. This is common when reading from the terminal (or a pseudo-terminal or terminal-like device). On Windows, pressing control-Z is an indication that the user is done entering input, and it is treated similarly to reaching an end-of-file. This does not mean that control-Z is an EOF. The software reading from the terminal sees control-Z, treats it as end-of-file, and returns end-of-file indications, which are likely different from control-Z. On Unix, control-D is commonly a similar sentinel marking the end of input.
This should clear it up nicely for you.
Basically, EOF is just a macro with a pre-defined value representing the error code from I/O functions indicating that there is no more data to be read.
The file doesn't actually contain an EOF. EOF isn't a character of sorts - remember a byte can be between 0 and 255, so it wouldn't make sense if a file could contain a -1. The EOF is a signal from the operating system that you're using, which indicates the end of the file has been reached. Notice how getc() returns an int - that is so it can return that -1 to tell you the stream has reached the end of the file.
The EOF signal is treated the same for binary and text files - the actual definition of binary and text stream varies between the OSes (for example on *nix binary and text mode are the same thing.) Either way, as stated above, it is not part of the file itself. The OS passes it to getc() to tell the program that the end of the stream has been reached.
From the GNU C library:
This macro is an integer value that is returned by a number of narrow stream functions to indicate an end-of-file condition, or some other error situation. With the GNU C Library, EOF is -1. In other libraries, its value may be some other negative number.
EOF is not a character. In this context, it's -1, which, technically, isn't a character (if you wanted to be extremely precise, it could be argued that it could be a character, but that's irrelevant in this discussion). EOF, just to be clear is "End of File". While you're reading a file, you need to know when to stop, otherwise a number of things could happen depending on the environment if you try to read past the end of the file.
So, a macro was devised to signal that End of File has been reached in the course of reading a file, which is EOF. For getc this works because it returns an int rather than a char, so there's extra room to return something other than a char to signal EOF. Other I/O calls may signal EOF differently, such as by throwing an exception.
As a point of interest, in DOS (and maybe still on Windows?) an actual, physical character ^Z was placed at the end of a file to signal its end. So, on DOS, there actually was an EOF character. Unix never had such a thing.
Well, it is pretty much possible to find the EOF of a binary file if you study its structure.
No, you don't need the OS to know the EOF of an executable file.
Almost every type of executable has a Page Zero which describes the basic information that the OS might need while loading the code into the memory and is stored as the first page of that executable.
Let's take the example of an MZ executable.
https://wiki.osdev.org/MZ
Here at offset 2 we have the number of bytes used in the last page, and right after that, at offset 4, the total number of complete/partial pages. This information is generally used by the OS to safely load the code into memory, but you can use it to calculate the EOF of your binary file.
Algorithm:
1. Start
2. Parse the parameter and instantiate the file pointer as per your requirement.
3. Load the first page (zero) in a (char) buffer of default size of page zero and print it.
4. Get the value at *((short int*)(buffer+4)) and store it in a loop variable called (short int) i.
5. Get the value at *((short int*)(buffer+2)) and store it in a variable called (short int) l.
6. i--
7. Load and print (or do whatever you wanted to do) 'size of page' characters into a buffer until i equals zero.
8. Once the loop has finished executing just load `l` bytes into that buffer and again perform whatever you wanted to
9. Stop
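The arithmetic behind these steps can be sketched as follows. This assumes the MZ header layout described on the linked OSDev page (bytes in the last page at offset 2, page count at offset 4, 512-byte pages, little-endian fields); mz_image_size() and read_le16() are hypothetical helper names. Note that the result is the size recorded in the header, which can be smaller than the real file if extra data is appended after the image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MZ_PAGE_SIZE 512L

/* Read a 16-bit little-endian value from two bytes. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Return the image size in bytes recorded in an MZ header, or -1 if the
 * buffer does not start with the "MZ" signature. `hdr` must hold at least
 * the first 6 bytes of the file.
 */
long mz_image_size(const unsigned char *hdr)
{
    assert(hdr != NULL);
    if (hdr[0] != 'M' || hdr[1] != 'Z')
        return -1;

    uint16_t last_page_bytes = read_le16(hdr + 2);  /* offset 2 */
    uint16_t pages = read_le16(hdr + 4);            /* offset 4, counts the last page too */

    if (pages == 0)
        return 0;
    if (last_page_bytes == 0)                       /* 0 means the last page is full */
        return (long)pages * MZ_PAGE_SIZE;
    return (long)(pages - 1) * MZ_PAGE_SIZE + last_page_bytes;
}
```

Decoding the fields byte by byte avoids the alignment and byte-order assumptions that a cast such as (short int*) carries.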
If you're designing your own binary file format then consider adding some sort of meta data at the start of that file or a special character or word that denotes the end of that file.
And there's a good amount of probability that the OS loads the size of the file from here with the help of simple maths and by analyzing the meta-data even though it might seem that the OS has stored it somewhere along with other information it's expected to store (Abstraction to reduce redundancy).

C File Input/Output for Unknown File Types: File Copying

Having some issues with a networking assignment. The end goal is to have a C program that grabs a file from a given URL via HTTP and writes it to a given filename. I've got it working fine for most text files, but I'm running into some issues, which I suspect all come from the same root cause.
Here's a quick version of the code I'm using to transfer the data from the network file descriptor to the output file descriptor:
unsigned long content_length; // extracted from HTTP header
unsigned long successfully_read = 0;
while(successfully_read != content_length)
{
    char buffer[2048];
    int extracted = read(connection,buffer,2048);
    fprintf(output_file,buffer);
    successfully_read += extracted;
}
As I said, this works fine for most text files (though the % symbol confuses fprintf, so it would be nice to have a way to deal with that). The problem is that it just hangs forever when I try to get non-text files (a .png is the basic test file I'm working with, but the program needs to be able to handle anything).
I've done some debugging and I know I'm not going over content_length, getting errors during read, or hitting some network bottleneck. I looked around online but all the C file i/o code I can find for binary files seems to be based on the idea that you know how the data inside the file is structured. I don't know how it's structured, and I don't really care; I just want to copy the contents of one file descriptor into another.
Can anyone point me towards some built-in file i/o functions that I can bludgeon into use for that purpose?
Edit: Alternately, is there a standard field in the HTTP header that would tell me how to handle whatever file I'm working with?
You are using the wrong tool for the job. fprintf takes a format string and extra arguments, like this:
fprintf(output_file, "hello %s, today is the %d", cstring, dayoftheweek);
If you pass the second argument from an unknown source (like the web, which you are doing) you can accidentally have %s or %d or other format specifiers in the string. Then fprintf will try to read more arguments than it was passed, and cause undefined behaviour.
Use fwrite for this:
fwrite(buffer, 1, extracted, output_file);
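Putting that call into the loop, a minimal sketch might look like the following. It assumes a POSIX system (read() from unistd.h); copy_body() is a hypothetical helper name. Besides switching to fwrite, it never asks for more than the bytes still expected and stops when read() returns 0 (peer closed the connection) or -1 (error), so it cannot spin forever.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy up to `want` bytes from descriptor `fd` to stream `out`.
 * Returns the number of bytes actually copied, which is less than `want`
 * if the peer closes the connection or an error occurs.
 */
unsigned long copy_body(int fd, FILE *out, unsigned long want)
{
    unsigned long done = 0;
    char buffer[2048];

    assert(out != NULL);
    while (done < want) {
        unsigned long left = want - done;
        size_t chunk = left < sizeof buffer ? (size_t)left : sizeof buffer;
        ssize_t got = read(fd, buffer, chunk);

        if (got <= 0)                 /* 0: end of stream, -1: error */
            break;
        /* fwrite copies raw bytes: no format parsing, NUL bytes are fine. */
        if (fwrite(buffer, 1, (size_t)got, out) != (size_t)got)
            break;
        done += (unsigned long)got;
    }
    return done;
}
```

The caller can compare the return value with content_length to detect a truncated download.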
A couple things with your code:
For fprintf - you are using the data as the second argument, when in fact it should be the format, and the data should be the third argument. This is why you are getting problems with the % character, and why it is struggling when presented with binary data, because it is expecting a format string.
You need to use a different function, such as fwrite, to output the file.
As a side note this is a bit of a security problem - if you fetch a specially crafted file from the server it is possible to expose random areas of your memory.
In addition to Seth's answer: unless you are using a third-party library for handling all the HTTP stuff, you need to deal with the Transfer-Encoding header and the possible compression, or at least detect them and throw an error if you don't know how to handle that case.
In general, it may (or may not) be a good idea to parse the HTTP response headers, and only if they contain exclusively stuff that you understand should you continue to interpret the data that follows the header.
I bet your program is hanging because it's expecting X bytes but receiving Y instead, with X < Y (most likely, sans compression - but PNG don't compress well with gzip). You'll get chunks [*] of data, with one of the chunks most likely spanning content_length so your condition while(successfully_read != content_length) is always true.
You could try running your program under strace, or whatever its equivalent is for your OS, if you want to see how your program continues trying to read data that it will never get (because you've likely made an HTTP/1.1 request that holds the connection open, and you haven't made a second request) or that has ended (if the server closes the connection, your repeated calls to read(2) will just return 0, which leaves your still-true loop condition unchanged).
If you are sending your program's output to stdout, you may find that it produces no output - this can happen if the resource you are retrieving contains no newline or other flush-forcing control characters. Other stdio buffering regimes may apply when output goes to a file. (For example, the file will remain empty until the stdio buffers have accumulated at least 4096 bytes.)
[*] Then there's also Transfer-Encoding: chunked, as @roland-illig alludes to, which will ruin the exact equivalence between content_length (presumably derived from the eponymous HTTP header) and the actual number of bytes transferred over the socket.
You may be opening the output file in text mode. On Windows, text mode translates every \n byte written into \r\n, which corrupts binary data and changes the file size. Try opening the file in binary mode (for example with "wb"), and those errors in size should go away.

Saving data to a binary file

I would like to save a file as binary, because I've heard that it would probably be smaller than a normal text file.
Now I am trying to save a binary file with some text, but the problem is that the file just contains the text and NULL at the end. I would expect to see only zeros and ones inside the file.
Any explanation or suggestions are highly appreciated.
Here is my code
#include <iostream>
#include <stdio.h>
int main()
{
/*Temporary data buffer*/
char buffer[20];
/*Data to be stored in file*/
char temp[20]="Test";
/*Opening file for writing in binary mode*/
FILE *handleWrite=fopen("test.bin","wb");
/*Writing data to file*/
fwrite(temp, 1, 13, handleWrite);
/*Closing File*/
fclose(handleWrite);
/*Opening file for reading*/
FILE *handleRead=fopen("test.bin","rb");
/*Reading data from file into temporary buffer*/
fread(buffer,1,13,handleRead);
/*Displaying content of file on console*/
printf("%s",buffer);
/*Closing File*/
fclose(handleRead);
std::system("pause");
return 0;
}
All files contain only ones and zeroes, on binary computers that's all there is to play with.
When you save text, you are saving the binary representation of that text, in a given encoding that defines how each letter is mapped to bits.
So for text, a text file or a binary file almost doesn't matter; the savings in space that you've heard about generally come into play for other data types.
Consider a floating point number, such as 3.141592653589. If saved as text, that would take one character per digit (just count them), plus the period. If saved in binary as just a copy of the float's bits, it will take four characters (four bytes, or 32 bits) on a typical 32-bit system. The exact number of bits stored by a call such as:
FILE *my_file = fopen("pi.bin", "wb");
float x = 3.1415;
fwrite(&x, sizeof x, 1, my_file);
is CHAR_BIT * sizeof x; see <limits.h> for CHAR_BIT.
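The size difference can be measured directly. This is a minimal sketch with hypothetical helper names: one function writes a float with fwrite to a temporary binary file and reports how many bytes ended up there, the other reports how many characters the same kind of value needs when printed as text.

```c
#include <assert.h>
#include <stdio.h>

/* Number of bytes fwrite() stores for one float in a binary stream. */
long binary_size_of_float(float x)
{
    FILE *fp = tmpfile();          /* temporary file opened in binary mode */
    long size = -1;

    if (fp == NULL)
        return -1;
    if (fwrite(&x, sizeof x, 1, fp) == 1)
        size = ftell(fp);          /* position after the write == bytes written */
    fclose(fp);
    return size;
}

/* Number of characters needed to print `x` with `digits` significant digits. */
int text_size_of_double(double x, int digits)
{
    char text[64];
    int n = snprintf(text, sizeof text, "%.*g", digits, x);

    assert(n >= 0 && (size_t)n < sizeof text);
    return n;
}
```

The binary size always equals sizeof(float), whatever the value, while the text size grows with the number of digits you choose to keep.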
The problem you describe is a chain of (very common [1], unfortunately) mistakes and misunderstandings. Let me try to fully detail what is going on, hopefully you will take the time to read through all the material: it is lengthy, but these are very important basics that any programmer should master. Please do not despair if you do not fully understand all of it: just try to play around with it, come back in a week, or two, practice, see what happens :)
There is a crucial difference between the concepts of a character encoding and a character set. Unless you really understand this difference, you will never really get what is going on, here. Joel Spolsky (one of the founders of Stackoverflow, come to think of it) wrote an article explaining the difference a while ago: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Before you continue reading this, before you continue programming, even, read that. Honestly, read it, understand it: the title is no exaggeration. You must absolutely know this stuff.
After that, let us proceed:
When a C program runs, a memory location that is supposed to hold a value of type "char" contains, just like any other memory location, a sequence of ones and zeroes. "type" of a variable only means something to the compiler, not to the running program who just sees ones and zeroes and does not know more than that. In other words: where you commonly think of a "letter" (an element from a character set) residing in memory somewhere, what is actually there is a bit sequence (an element from a character encoding).
Every compiler is free to use whatever encoding it wishes to represent characters in memory. As a consequence, it is free to represent what we call a "newline" internally as any number it chooses. For example, say I write a compiler: I can agree with myself that every time I want to store a "newline" internally, I store it as the number six (6), which is 0x6 in hexadecimal (or 110 in binary).
Writing to a file is done by telling the operating system [2] four things at the same time:
The fact that you want to write to a file (fwrite())
Where the data starts that you want to write (first argument to fwrite)
How much data you want to write (second and third argument, multiplied)
What file you want to write to (last argument)
Note that this has nothing to do with the "type" of that data: your operating system has no idea, and does not care. It does not know anything about character sets: it just sees a sequence of ones and zeroes starting somewhere and copies that to a file.
Opening a file in "binary" mode is actually the normal, intuitive way of dealing with files that a novice programmer would expect: the memory location you specify is copied one-on-one to the file. If you write a memory location that used to hold variables that the compiler decided to store as type "char", those values are written one-on-one to the file. Unless you know how the compiler stores values internally (what value it associates with a newline, with a letter 'a', 'b', etc), THIS IS MEANINGLESS. Compare this to Joel's similar point about a text file being useless without knowing what its encoding is: same thing.
Opening a file in "text" mode is almost equal to binary mode, with one (and only one) difference: anytime a value is written that has value equal to what the compiler uses INTERNALLY for the newline (6, in our case), it writes something different to the file: not that value, but whatever the operating system you are on considers to be a newline. On Windows, this is two bytes (13 and 10, or 0x0d 0x0a). Note, again, if you do not know about the compiler's choice of internal representation of the other characters, this is STILL MEANINGLESS.
Note at this point that it is pretty clear that writing anything but data that the compiler designated as characters to a file in text mode is a bad idea: in our case, a 6 might just happen to be among the values you are writing, in which case the output is altered in a way that we absolutely do not mean to.
(Un)Luckily, most (all?) compilers actually use the same internal representation for characters: this representation is US-ASCII and it is the mother of all defaults. This is the reason you can write some "characters" to a file in your program, compiled with any random compiler, and then open it with a text editor: they all use/understand US-ASCII and it happens to work.
OK, now to connect this to your example: why is there no difference between writing "test" in binary mode and in text mode? Because there is no newline in "test", that is why!
And what does it mean when you "open a file", and then "see" characters? It means that the program you used to inspect the sequence of ones and zeroes in that file (because everything is ones and zeroes on your hard disk) decided to interpret that as US-ASCII, and that happened to be what your compiler decided to encode that string as, in its memory.
Bonus points: write a program that reads the ones and zeroes from a file into memory and prints every BIT (there's multiple bits to make up one byte, to extract them you need to know 'bitwise' operator tricks, google!) as a "1" or "0" to the user. Note that "1" is the CHARACTER 1, the point in the character set of your choosing, so your program must take a bit (number 1 or 0) and transform it to the sequence of bits needed to represent character 1 or 0 in the encoding that the terminal emulator uses that you are viewing the standard out of the program on oh my God. Good news: you can take lots of short-cuts by assuming US-ASCII everywhere. This program will show you what you wanted: the sequence of ones and zeroes that your compiler uses to represent "test" internally.
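As a starting point for that exercise, here is a minimal sketch of the bit-extraction step, with a hypothetical helper name byte_to_bits(). It shifts each bit down to the lowest position, masks it with 1, and turns the result into the character '0' or '1'.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Write the bits of `byte` into `out` as the characters '0' and '1',
 * most significant bit first, followed by a terminating NUL.
 * `out` must have room for CHAR_BIT + 1 characters.
 */
void byte_to_bits(unsigned char byte, char *out)
{
    assert(out != NULL);
    for (int i = 0; i < CHAR_BIT; i++) {
        int shift = CHAR_BIT - 1 - i;          /* bit position to extract */
        out[i] = ((byte >> shift) & 1u) ? '1' : '0';
    }
    out[CHAR_BIT] = '\0';
}
```

Reading a file byte by byte with getc() and passing each byte through this function gives the sequence of ones and zeroes stored on disk. Assuming US-ASCII and 8-bit bytes, the letter 'T' (0x54) comes out as 01010100.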
This stuff is really daunting for newbies, and I know that it took me a long time to even know that there was a difference between a character set and an encoding, let alone how all of this worked. Hopefully I did not demotivate you, if I did, just remember that you can never lose knowledge you already have, only gain it (ok not always true :P). It is normal in life that a statement raises more questions than it answered, Socrates knew this and his wisdom seamlessly applies to modern day technology 2.4k years later.
Good luck, do not hesitate to continue asking. To other readers: please feel welcome to improve this post if you see errors.
Hraban
[1] The person that told you that "saving a file in binary is probably smaller", for example, probably gravely misunderstands these fundamentals. Unless he was referring to compressing the data before you save it, in which case he just uses a confusing word ("binary") for "compressed".
[2] "telling the operating system something" is what is commonly known as a system call.
Well, the difference between text and binary mode is the way the end of line is handled.
If you write a string in binary mode, it stays the same string.
If you want to make it smaller, you'll have to somehow compress it (look for libz for example).
What is smaller is: when wanting to save binary data (like an array of bytes), it's smaller to save it as binary rather than putting it in a string (either in hexadecimal representation or base64). I hope this helps.
I think you're a bit confused here.
The ASCII-string "Test" will still be an ASCII-string when you write it to the file (even in binary mode). The cases when it makes sense to write binary are for other types than chars (e.g. an array of integers).
try replacing
FILE *handleWrite=fopen("test.bin","wb");
fwrite(temp, 1, 13, handleWrite);
with
FILE *handleWrite=fopen("test.bin","w");
fprintf(handleWrite, "%s", temp);
The call printf("%s",buffer); prints buffer as a zero-terminated string.
Try to use:
char temp[20]="Test\n\rTest";

What's a good coding style for reading different bits of data from a binary file in C?

I'm novice programmer and am writing a simple wav-player in C as a pet project. Part of the file loading process requires reading specific data (sampling rate, number of channels,...) from the file header.
Currently what I'm doing is similar to this:
Scan for a sequence of bytes and skip past it
Read 2 bytes into variable a
Check value and return on error
Skip 4 bytes
Read 4 bytes into variable b
Check value and return on error
...and so on. (code see: https://github.com/qgi/Player/blob/master/Importer.c)
I've written a number of helper functions to do the scanning/skipping/reading bit. Still, I'm repeating the reading, checking, skipping part several times, which seems to be neither very effective nor very smart. It's not a real issue for my project, but as this seems to be quite a common task when handling binary files, I was wondering:
Is there some kind of a pattern on how to do this more effectively with cleaner code?
Most often, people define structs (often with something like #pragma pack(1) to assure against padding) that matches the file's structures. They then read data into an instance of that with something like fread, and use the values from the struct.
The cleanest option that I've come across is the scanf-like function unpack presented by Kernighan & Pike on page 219 of The Practice of Programming, which can be used like
// assume we read the file header into buf
// and the header consists of magic (4 bytes), type (2) and length (4).
// "l" == 4 bytes (long)
// "s" == 2 bytes (short)
unpack(buf, "lsl", &magic, &type, &length);
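To show the idea without the book at hand, here is a simplified, hypothetical take on such a function. It is not Kernighan and Pike's code: unpack_le() is an invented name, it understands only "s" (2 bytes) and "l" (4 bytes), and it assumes little-endian fields because that is what WAV headers use.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode little-endian fields from `buf` according to `fmt`:
 *   's' -> uint16_t * argument, consumes 2 bytes
 *   'l' -> uint32_t * argument, consumes 4 bytes
 * Returns the number of bytes consumed, or -1 on an unknown format character.
 * The caller must make sure `buf` holds enough bytes for the whole format.
 */
int unpack_le(const unsigned char *buf, const char *fmt, ...)
{
    va_list ap;
    const unsigned char *p = buf;

    assert(buf != NULL && fmt != NULL);
    va_start(ap, fmt);
    for (; *fmt != '\0'; fmt++) {
        if (*fmt == 's') {
            uint16_t *dst = va_arg(ap, uint16_t *);
            *dst = (uint16_t)(p[0] | (p[1] << 8));
            p += 2;
        } else if (*fmt == 'l') {
            uint32_t *dst = va_arg(ap, uint32_t *);
            *dst = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                   ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
            p += 4;
        } else {
            va_end(ap);
            return -1;
        }
    }
    va_end(ap);
    return (int)(p - buf);
}
```

With the bytes of a WAV format chunk in buf, a call such as unpack_le(buf, "ssl", &format, &channels, &rate) reads three fields in one line, and the return value tells you how far to advance. Decoding byte by byte also sidesteps struct padding and host byte order.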
For efficiency, reading into a buffer of, say, 4096 bytes and then parsing the data in that buffer would be better, and of course a single forward-only parsing pass is the most efficient.

Resources