What could be the purpose of a header file with raw data? - c

Decided to check out some code other than my own, Quake I was the choice. The first file I click on is filled with nothing but raw data and the only comments are the GPL. I am guessing it is an array containing normal vectors? Regardless of what its purpose was, what confuses me is what it is doing in a header file anorms.h. I am wondering what could be the purpose of doing this?
The other source, actual code, feels fairly complicated to me. As a novice programmer I probably just need to spend more time at it.

Judging by the looks of it, it is indeed an array of normal vectors used somewhere in the game.
In the old days, pretty much all game content was hardcoded; now you can simply open a file and load the data, because HDDs (and, more often, SSDs) have become so much faster.
Older games were also compiled as plain C executables; in modern IDEs such as Visual Studio (or pretty much anything, really), you can easily compile arbitrary data into the .exe in the form of resources.
All that being said, it's simply legacy cruft and you shouldn't be very concerned with it.
Sample usage:
typedef struct { float x, y, z; } Vec;
Vec arr[] = {
#include "anorms.h"
};

Related

C: Using serialized data as type

So I've run into an interesting design pattern and I wanted to know if you guys had an opinion on it.
Basically, the design is passing everything around as a pre-serialized type. There are no "types" for the returns, for example; everything is passed as a simple uint8_t*. There is a defined header that "tells" you what is in the buffer, how big it is, what the version of the buffer is, etc. I call it "pre-serialized" because it forces flattening of all structures.
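To make the pattern concrete, here is a minimal sketch of such a self-describing buffer. The layout (one type byte, one version byte, a 16-bit little-endian payload length) and the names `msg_header` and `msg_parse_header` are assumptions made up for illustration, not the layout used by the code base in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-byte header: type, version, payload length (little-endian). */
enum { MSG_HEADER_SIZE = 4 };

typedef struct {
    uint8_t  type;
    uint8_t  version;
    uint16_t length;   /* number of payload bytes that follow the header */
} msg_header;

/* Decode the header from a raw buffer. Returns 0 on success, -1 if the
 * buffer is too short for the header or for the payload it announces. */
int msg_parse_header(const uint8_t *buf, size_t buf_len, msg_header *out)
{
    if (buf == NULL || out == NULL || buf_len < MSG_HEADER_SIZE)
        return -1;

    out->type    = buf[0];
    out->version = buf[1];
    out->length  = (uint16_t)(buf[2] | (buf[3] << 8));

    if ((size_t)out->length > buf_len - MSG_HEADER_SIZE)
        return -1;
    return 0;
}
```

Reading the fields byte by byte, rather than casting the pointer to a struct, keeps the sketch independent of struct padding and host endianness. It also shows the cost described in the cons: every consumer has to know this layout.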
The pros:
You can easily write it (or even a set of it) to whatever you want. Files, IO, whatever.
Can store arbitrary data.
The cons, IMHO:
No type safety, which is going to be a nightmare.
The programmer has to parse the code. Even if there is an enumerated type, the user would have to know what that type means. Even if there are functions to parse the type, the programmer has to know that is the function to call.
Version hell: changing code will cause a ripple effect of errors. Because everywhere is parsing it differently, you have no idea where the code works or where it is broken.
It is viral: because it is flat, you can't "insert" the header on the end of outside data. You could wrap the call if you copy your "data", but this could cause an unnecessary copy that would be SLOW. So either your code is slower than it needs to be, or you conform to this data structure.
It isn't human readable OR debug-able.
Have you seen this design pattern before? Is there a name for this design pattern? Things I missed?
Is there a name for this design pattern?
Well, Legacy Code? :) I have seen such designs in 30-year-old Cobol systems...
The pros you have stated are also easily achieved by using an XML (or JSON) format:
You can easily write it (or even a set of it) to whatever you want. Files, IO, whatever - most of all, web services!
Can store arbitrary data.
Furthermore, all your cons are eliminated.
The only pro I can see in your solution is conciseness - when every byte counts and you need to avoid any overhead as too expensive, then this is nice.
Added: Cobol has a feature to easily define the structure of such serialized data; see the PICTURE clause. Reading the data is then very easy: you read the fields as variables. (It is like having binary data in C, defining a struct, and casting the binary data to that struct.)
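The C analogue mentioned in that last remark could look roughly like the sketch below. The `record` struct and `record_from_bytes` are hypothetical names, and the approach assumes the writer and reader agree on compiler, padding and endianness. It uses memcpy instead of a direct pointer cast, which is a deliberate swap to avoid alignment and aliasing trouble.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-layout record, similar in spirit to a Cobol copybook. */
typedef struct {
    uint32_t id;
    int16_t  quantity;
    char     code[6];
} record;

/* Copy raw bytes into a record. memcpy sidesteps the alignment and
 * strict-aliasing problems of casting the buffer pointer directly.
 * Returns 0 on success, -1 if the buffer is too small. */
int record_from_bytes(const unsigned char *buf, size_t len, record *out)
{
    if (len < sizeof *out)
        return -1;
    memcpy(out, buf, sizeof *out);
    return 0;
}
```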
As Honza said this would be normal in Legacy Cobol/PL1 (was there a Cobol/PL1 conversion or interface to COBOL programs ???).
In COBOL this design pattern would make sense, not sure about C though (one of the binary serialization packages or JSON etc might be more sensible).
In Cobol, you would have a Cobol copybook which all programs would use and could edit the data using the Cobol Copybook (with something like file-aid or Microfocus Data Editor).
Why use this "design pattern" in Cobol:
Regression testing of modules; you can write a driver module like:
    Read Test-data-file
    while more-data
        Call Module
        write Result to output-file
        Read Test-data-file
    end
You can then compare the output of the pre-change program with the output of the changed program.
Testing - sometimes you can use a "production file" in testing.
A file provides trace or snapshot of what is going on, this can be very useful.
Easy to reorganize Batch streams:
Split a program up (and pass the data via a file). There are a variety of reasons for doing this, including:
the program has gotten too big and is hard to maintain.
Sorting the data
Performance (use a file rather than hitting the DB multiple times)
new uses for extracted data
While your cons are valid for C, they will be less of an issue in Cobol.
The key to using this "design pattern" is being able to edit/view/compare the format. If you cannot edit/view/compare a file, I do not see the point.

C theory/general practice related to "splitting" the system into x number of source files

I have a quick question about programming in C. I am writing a simple application in C and, as the title suggests, I find myself defining rather large functions in separate source files, which makes maintenance and debugging much easier. My question is: is there a standard number of lines in a C source file beyond which you should "split" it into multiple files, or does it depend heavily on the system/functions in question?
Say, for example, I have 20 source files with 1 function in each, and the functions are somewhat related but all do different things (e.g. they all manipulate the same struct in some way). Should you, in theory, have these 20 files, or 1 larger file with 20 functions, keeping the modification of that structure in the same file?
My idea is that the more "split" the code is, the better/easier the coding becomes, but then again I'm quite new to C.
Any input will be appreciated.
Cheers,
Chris.
It makes sense to put code related to the same conceptual area together. If you have functions which work on matrices, for example, it would seem to make sense to have a file called matrices.c containing X number of matrix functions. A function called render would obviously not belong there.
Yet if the number of matrix functions were to grow huge, it would start to feel wrong to shove them all into a single file. In such a situation I would look for sub-categories and create separate files for each, e.g. 2d_matrix.c, 3d_matrix.c, etc.
As for the number of functions you place in a file before you recategorize it, that is up to personal choice and sometimes the development rules of the team you work for.
The same consideration sometimes applies to the size of a function. One team I have worked for would not allow code which is over two screens high, feeling that such code should be broken up into a number of smaller functions which would make the code more readable.
To me, structure your code in a way that makes sense. Keep related code together and be sensible with sizes of functions, number of functions in a file (both too few or too many).
The larger a function gets, the easier it is to accidentally break it.
The more code you shove in one file, the more likely it will be for other people to be a little sloppy and shove more, and possibly unrelated code in the same file.
Splitting up a file is not function/system dependent; it depends entirely on the programmer. I have seen 1000-1500 or even more lines of code in a single C file. Keeping twenty functions in the same file makes sense if they are not very different from each other. However, if you split the functions among several files, make sure that you write the Makefile properly when compiling them. The phrase "the more split, the easier coding becomes" is debatable.
I liked alk's answer in the closed duplicate: If you follow an object oriented style in C, i.e. use structures and operations on them, the files separate quite naturally in the same way as they would in C++. Operations on the same data types, together forming a "poor man's class", go together.
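A minimal sketch of such a "poor man's class": one struct plus the operations on it, all sharing a prefix. The `vec2` type and its functions are hypothetical. In a real project the typedef and prototypes would sit in vec2.h and the definitions in vec2.c; they are shown together here only to keep the example self-contained.

```c
/* The data type: everything that operates on it lives in the same file pair. */
typedef struct {
    double x;
    double y;
} vec2;

/* "Constructor": build a value from its components. */
vec2 vec2_make(double x, double y)
{
    vec2 v = { x, y };
    return v;
}

/* Component-wise sum of two vectors. */
vec2 vec2_add(vec2 a, vec2 b)
{
    return vec2_make(a.x + b.x, a.y + b.y);
}

/* Multiply both components by a scalar. */
vec2 vec2_scale(vec2 v, double k)
{
    return vec2_make(v.x * k, v.y * k);
}

/* Dot product of two vectors. */
double vec2_dot(vec2 a, vec2 b)
{
    return a.x * b.x + a.y * b.y;
}
```

With this layout the question of "how many functions per file" mostly answers itself: a function that takes or returns a vec2 as its main subject goes into vec2.c, anything else goes elsewhere.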

c99 dynamic array

I'm writing a very small, project-specific OpenGL ES engine for iPhone and I really need to use a good, solid, and proven dynamic array library/macro in the C99 dialect. (No C++, Obj-C, STL whatsoever.)
It's strongly necessary for render batches and polygon meshes, so it should be able to handle various types of data and, additionally, cause minimal overhead when the array size changes and new data is inserted.
I've been searching around and found two candidates for my need.
The first one is ccCArray from Cocos2d.
The other one is utarray, written by Troy D. Hanson.
ccCArray IS rock solid, thoroughly proven by the community. utarray looks fine but I cannot find anyone who actually uses it.
Any more suggestions?
A library ?! A C++ template would be more than suitable for this need. I'd say about AT MOST 15 functions (excluding alternative constructors and const getters), and you're done. Also able to use it for ANY type, ANY size and ANY size type (byte, int etc.) And it's just one file: a .h or, better said, a .hpp
Any reason you're rejecting it ? Seems like you want to make life harder for yourself :)
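Staying within the C99 constraint from the question, a growable array can also be sketched by hand in a few functions. The names below (`dynarr`, `dynarr_push`, and so on) are hypothetical and are not the ccCArray or utarray API. The sketch assumes a fixed element size chosen at init time and doubles the capacity when full, so appends cost amortized constant time.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *data;  /* raw storage, elem_size bytes per element */
    size_t elem_size;
    size_t count;         /* elements in use */
    size_t capacity;      /* elements allocated */
} dynarr;

void dynarr_init(dynarr *a, size_t elem_size)
{
    a->data = NULL;
    a->elem_size = elem_size;
    a->count = 0;
    a->capacity = 0;
}

/* Append a copy of *elem. Returns 0 on success, -1 on allocation failure. */
int dynarr_push(dynarr *a, const void *elem)
{
    if (a->count == a->capacity) {
        /* Grow geometrically: 8, 16, 32, ... */
        size_t new_cap = a->capacity ? a->capacity * 2 : 8;
        unsigned char *p = realloc(a->data, new_cap * a->elem_size);
        if (p == NULL)
            return -1;
        a->data = p;
        a->capacity = new_cap;
    }
    memcpy(a->data + a->count * a->elem_size, elem, a->elem_size);
    a->count++;
    return 0;
}

/* Pointer to element i, or NULL if i is out of range. */
void *dynarr_at(const dynarr *a, size_t i)
{
    return i < a->count ? a->data + i * a->elem_size : NULL;
}

/* Release the storage and reset to the empty state. */
void dynarr_free(dynarr *a)
{
    free(a->data);
    dynarr_init(a, a->elem_size);
}
```

Storing raw bytes plus an element size is what lets one implementation serve vertices, indices and batch entries alike, at the price of losing compile-time type checking.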

Why should I use a human readable file format?

Why should I use a human readable file format in preference to a binary one? Is there ever a situation when this isn't the case?
EDIT:
I did have this as an explanation when initially posting the question, but it's not so relevant now:
When answering this question I wanted to refer the asker to a standard SO answer on why using a human readable file format is a good idea. Then I searched for one and couldn't find one. So here's the question
It depends
The right answer is it depends. If you are writing audio/video data, for instance, and you crowbar it into a human readable format, it won't be very readable! And Word documents are the classic example where people have wished they were human readable, and so more flexible, and by moving to XML MS are going that way.
Much more important than binary or text is a standard or not a standard. If you use a standard format, then chances are you and the next guy won't have to write a parser, and that's a win for everyone.
Following this are some opinionated reasons why you might want to choose one over the other, if you have to write your own format (and parser).
Why use human readable?
The next guy. Consider the maintaining developer looking at your code 30 years or six months from now. Yes, he should have the source code. Yes, he should have the documents and the comments. But he quite likely won't. And having been that guy, and having had to rescue or convert old, extremely valuable data, I'll thank you for making it something I can just look at and understand.
Let me read AND WRITE it with my own tools. If I'm an emacs user I can use that. Or Vim, or notepad or ... Even if you've created great tools or libraries, they might not run on my platform, or even run at all any more. Also, I can then create new data with my tools.
The tax isn't that big - storage is free. Nearly always disc space is free. And if it isn't you'll know. Don't worry about a few angle brackets or commas, usually it won't make that much difference. Premature optimisation is the root of all evil. And if you are really worried just use a standard compression tool, and then you have a small human readable format - anyone can run unzip.
The tax isn't that big - computers are quick. It might be faster to parse binary. Until you need to add an extra column, or data type, or support both legacy and new files. (Though this is mitigated with Protocol Buffers.)
There are a lot of good formats out there. Even if you don't like XML. Try CSV. Or JSON. Or .properties. Or even XML. Lots of tools exist for parsing these already in lots of languages. And it only takes 5mins to write them again if mysteriously all the source code gets lost.
Diffs become easy. When you check in to version control it is much easier to see what has changed. And view it on the Web. Or your iPhone. Binary, you know something has changed, but you rely on the comments to tell you what.
Merges become easy. You still get questions on the web asking how to append one PDF to another. This doesn't happen with Text.
Easier to repair if corrupted. Try and repair a corrupt text document vs. a corrupt zip archive. Enough said.
Every language (and platform) can read or write it. Of course, binary is the native language for computers, so every language will support binary too. But a lot of the classic little tool scripting languages work a lot better with text data. I can think of languages that work well with text and not with binary, but not the other way round (assembler, maybe). And that means your programs can interact with other programs you haven't even thought of, or that were written 30 years before yours. There are reasons Unix was successful.
Why not, and use binary instead?
You might have a lot of data - terabytes maybe. And then a factor of 2 could really matter. But premature optimization is still the root of all evil. How about use a human one now, and convert later? It won't take much time.
Storage might be free but bandwidth isn't (Jon Skeet in comments). If you are throwing files around the network then size can really make a difference. Even bandwidth to and from disc can be a limiting factor.
Really performance intensive code. Binary can be seriously optimised. There is a reason databases don't normally have their own plain text format.
A binary format might be the standard. So use PNG, MP3 or MPEG. It makes the next guys job easier (for at least the next 10 years).
There are lots of good binary formats out there. Some are global standards for that type of data. Or might be a standard for hardware devices. Some are standard serialization frameworks. A great example is Google Protocol Buffers. Another example: Bencode
Easier to embed binary. Some data already is binary and you need to embed it. This works naturally in binary file formats, but looks ugly and is very inefficient in human readable ones, and usually stops them being human readable.
Deliberate obscurity. Sometimes you don't want it obvious what your data is doing. Encryption is better than accidental security through obscurity, but if you are encrypting you might as well make it binary and be done with it.
Debatable
Easier to parse. People have claimed that both text and binary are easier to parse. Now clearly the easiest to parse is when your language or library supports parsing, and this is true for some binary and some human readable formats, so doesn't really support either. Binary formats can clearly be chosen so they are easy to parse, but so can human readable (think CSV or fixed width) so I think this point is moot. Some binary formats can just be dumped into memory and used as is, so this could be said to be the easiest to parse, especially if numbers (not just strings) are involved. However I think most people would argue human readable parsing is easier to debug, as it is easier to see what is going on in the debugger (slightly).
Easier to control. Yes, it is more likely someone will mangle text data in their editor, or will moan when one Unicode format works and another doesn't. With binary data that is less likely. However, people and hardware can still mangle binary data. And you can (and should) specify a text encoding for human-readable data, either flexible or fixed.
At the end of the day, I don't think either can really claim an advantage here.
Anything else
Are you sure you really want a file? Have you considered a database? :-)
Credits
A lot of this answer is merging together stuff other people wrote in other answers (you can see them there). And especially big thanks to Jon Skeet for his comments (both here and offline) for suggesting ways it could be improved.
It entirely depends on the situation.
Benefits of a human readable format:
You can read it in its "native" format
You can write it yourself, e.g. for unit tests - or even for real content, depending on what it's for
Probable benefits of a binary format:
Easier to parse (in terms of code)
Faster to parse
More efficient in terms of space
Easier to control (any time you need text in there, you can ensure it's UTF-8 encoded, and length prefixed etc)
Easier to include opaque binary data efficiently (images, etc - with a text format you'd be getting into base64)
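As an illustration of the "length prefixed" point above, here is a minimal sketch of writing and reading a string with a 16-bit little-endian length in front of it. The function names and the 2-byte prefix are assumptions chosen for the example, not part of any particular format; the bytes are treated as opaque, so UTF-8 text passes through unchanged.

```c
#include <stdint.h>
#include <string.h>

/* Write a 16-bit little-endian length followed by the string bytes (no NUL).
 * Returns the number of bytes written, or 0 if the string is too long
 * or the output buffer is too small. */
size_t put_lp_string(uint8_t *out, size_t out_cap, const char *s)
{
    size_t n = strlen(s);
    if (n > UINT16_MAX || out_cap < 2 || n > out_cap - 2)
        return 0;
    out[0] = (uint8_t)(n & 0xFF);
    out[1] = (uint8_t)(n >> 8);
    memcpy(out + 2, s, n);
    return n + 2;
}

/* Read a length-prefixed string into dst and NUL-terminate it.
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or dst is too small. */
size_t get_lp_string(const uint8_t *in, size_t in_len, char *dst, size_t dst_cap)
{
    if (in_len < 2)
        return 0;
    size_t n = (size_t)in[0] | ((size_t)in[1] << 8);
    if (n > in_len - 2 || n + 1 > dst_cap)
        return 0;
    memcpy(dst, in + 2, n);
    dst[n] = '\0';
    return n + 2;
}
```

The reader never has to scan for a delimiter or handle escaping, which is the sense in which this is easier to control than free-form text.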
Don't forget that you can always implement a binary format but produce tools to convert to/from a human-readable format as well. That's what the Protocol Buffers framework does - it's actually pretty rare IME to need to parse a text version of a protocol buffer, but it's really handy to be able to write it out as text.
EDIT: Just in case this ends up being an accepted answer, you should also bear in mind the point made by starblue: Human readable forms are much better for diffing. I suspect it would be feasible to design a binary format which is appropriate for diffing (and where a human-readable diff could be generated) but out-of-the-box support from existing diff tools will be better for text.
Version control is easier with text formats, because changes can easily be viewed and merged.
Especially MS-Word is giving us grief in this respect.
Open format -- no binary bit juggling
Readability :)
Interchange across platforms
Debugging aid
Easily parsed (and easily converted to any format)
One important point: you write a parser once, but read the output many times. That kind of tilts the balance in favor of HRF.
A major reason is that if someone needs to read the data say, 30 years from now, human readable format can be figured out. Binary is much more difficult.
If you have large data sets that are binary by nature (e.g. images), they obviously can't be stored in any other than binary form. But even then, the metadata could (and should!) be human-readable.
There's something called The Art of Unix Programming.
I won't say it's good or bad, but it's fairly famous. It has a whole chapter called Textuality in which the author asserts that human readable file format are an important part of the Unix way of programming.
They open the possibility to be created/edited with tools other than the original ones. New and better tools can be developed by others, integration into third party applications becomes possible. Think about binary iCal files, for example - would the format have been a success?
Apart from that: human readable files improve the ability to debug or, for the savvy user, at least to find the reason for an error.
Pros for binary:
fast to parse
generally smaller data
easy to write a parser for
Pros for human readable:
easier to understand while reading - no "field X is set to 4 487 which means that the reactor should be shut down NOW"
if using something like XML easy to write a tool that will parse any file
I have had to deal with both types. If you are sending data and you want to keep it small binary is good. If you expect people to read it then human readable is good.
Human readable is generally somewhat self-documenting as well. And with binary it is very easy to make mistakes - and hard to spot them.
Editable
Readable (duh!)
Printable
Notepad and vi enabled
Most importantly, their function can be deduced from the content (well, mostly)
Because you are a human, and sooner or later you (or one of your customers) will be able to read the data.
We only use binary format if speed is an issue. And even then debugging is troublesome so we added a human readable equivalent.
Interoperability is the standard argument, i.e. a human-readable form is easier for developers of disparate systems to deal with so therefore confers some advantage.
Personally I think that is not true, and the performance benefits of binary files ought to beat that argument, especially if you publish your protocol. However, the ubiquity of XML/HTTP based frameworks for machine interactions means that it is easier to adopt.
XML is way over-used.
Just a quick illustration where human-readable document format can be a better choice:
documents used for deploying application in production
We used to have our release notes in Word format, but that release notes document had to be opened in various environments (Linux, Solaris) on pre-production and production platforms.
It also had to be parsed in order to extract various data.
In the end, we switched to a wiki-based syntax, still displayed nicely in HTML through a wiki, but still usable as a simple text file in other situations.
As an adjunct to this, there are differing levels of human readability, and all are enhanced by using a good editor or viewer with code coloring, folding or navigation.
For example,
JSON is quite readable even in plaintext
XML has the angle bracket tax but is usable when using a good editor
INI is mostly human readable
CSV can be readable, but is best when loaded into a spreadsheet.
No one said, so I will: human-readability is not really a property of a file format (all files are binary after all), but rather of a file format and viewer app combination.
So called human readable formats are all based on top of additional abstraction layer of one of existing text encodings. And viewer programs (often also serving as an editor) that are capable of rendering these encodings in a form readable by humans are very common.
Text encoding standards are widespread and fairly mature, which means they're unlikely to evolve much in the foreseeable future.
Usually on top of the text encoding layer of the format we find a syntax layer that is reasonably intuitive given target user knowledge and cultural background.
Hence the benefits of "human-readable" formats:
Ubiquity of suitable viewers and editors.
Timelessness (given that cultural conventions won't change much).
Easiness-to-learn, read and modify.
Reliance on the extra abstraction layer makes text encoded files:
Space hungry.
Slower to process.
"Binary" files do not resort to text encoding abstraction layer as a base (or a common denominator), but they might or might not use some sort of an extra abstraction more suitable for their purpose and hence, they can be much better optimised for a specific task at hand meaning:
Faster processing.
Smaller footprint.
On the other hand:
Viewers and editors are specific for a particular binary format and make interoperability harder.
Viewers for any given format are less wide spread, because they are more specialised.
Formats might evolve significantly or go out of use over time: their main benefit in being very well suited for a particular task and as the task or task requirements evolve, so does the format.
Take a moment and think about applications OTHER than web development.
The assumption that:
A) It has a meaning that is "obvious" in text format is false.
Things like control systems for a steel mill, or manufacturing plant don't typically have any advantage in being human readable. The software for those types of environments will typically have routines to display data in a graphically meaningful manner.
B) Outputting it in text is easier. Unnecessary conversions that actually require more code make a system LESS robust. The fact of the matter is that if you are NOT using a language which treats all variables as strings, then human readable text is an extra conversion. I.e. extra code means more code to be verified and tested, and more opportunities to introduce errors into the application.
C) You have to parse it anyway. In many cases, for DSP systems I've worked on (i.e. NO human readable interface to start with), data is streamed out of the system in uniformly sized packets. Logging the data for analysis and later processing is simply a matter of pointing to the beginning of a buffer and writing a multiple of the block size to the data logger system. This allows me to analyze the data "untouched", as the customer's system would see it, whereas converting it to a different format could, once again, introduce errors. Not only that, if you only save the "converted data" you may lose information in the translation that may help you diagnose a problem.
D) Text is a natural format for the data. No hardware I've ever seen uses a "TEXT" interface. (My first job out of college was writing a device driver for a line scan camera.) The system built on top of it MIGHT, but for every "PC".
For web pages, the information has a "natural" meaning in text format, so sure, knock yourself out. For processing source code it's a no-brainer, of course. But in pervasive computing environments, where even your refrigerator and TOOTHBRUSH are going to have a processor built in, not so much. Simply burdening these types of systems with the overhead of adding the ability to process text introduces unnecessary complexity. You're not going to link "printf" into the software for an 8-bit micro that controls a mouse. (And yeah, somebody has to write that software too.)
The world is not a black and white place where the only forms of computing that need to be considered are PCs and Web servers.
Even on a PC, if I can load the data directly into a data structure using a single OS read call and be done with it, without writing serializing and deserializing routines, that's fantastic: check the block's CRC, job done, on to the next problem.
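A rough sketch of that "load the block, check its CRC" step. The `data_block` layout is hypothetical, and the read itself (a single fread or read into the struct) is left out. The checksum is the common bitwise CRC-32 with reflected polynomial 0xEDB88320; whether a given system uses this particular CRC is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zip and PNG). */
uint32_t crc32_bytes(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical fixed-size block: payload followed by its stored CRC. */
typedef struct {
    uint8_t  payload[60];
    uint32_t crc;
} data_block;

/* Nonzero if the stored CRC matches the payload. */
int block_is_valid(const data_block *b)
{
    return crc32_bytes(b->payload, sizeof b->payload) == b->crc;
}
```

A table-driven CRC would be faster; the bitwise form is used here only because it is short enough to read at a glance.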
Uhm… because human-readable file formats can be read by humans? Seems like a pretty good reason to me.
(Well, for configuration files it’s inevitable that they are read (and edited!) by humans. Files for persistent storage of some sort or the other don’t really need to be read or edited by humans.)
Why should I use a human readable file format in preference to a binary one? Is there ever a situation when this isn't the case?
Yes, compressed volumes (zip, jpeg, mp3, etc) would be suboptimal if they were human readable.
I guess it's not good in most situations, probably. I think the main reason for formats such as JSON and XML is web development and general use over the web, where you need to be able to process data on the user side and you can't necessarily read binary. A good example of a bad case for a human readable format would be anything non-textual, such as images, video, audio. I've noticed non-binary formats being used in web development where it does not make sense; I feel guilty!
Often files become part of your human interface thus they should be human friendly (not programmer only)
The only time that I use a binary stream for files that aren't archives is when I want to conceal things from the casual observer. For instance, if I'm making temporary files that only my application should be editing, I'll use binary.
It's not an attempt to obfuscate; rather, it's just discouraging the user from editing the file by hand (which could break the application).
One instance where this would be a good idea is storing/saving running data about some game, i.e. to save your game and continue later. Other scenarios would involve intermediate files, but those are typically binary/byte-compiled anyway.
Why should I use a human readable file format in preference to a binary one?
Depends on the content and context, i.e. where the data is coming from and going. If the data is typically written directly by a human, storing it in a format that can be manipulated through a text editor is a good idea. For example, program source code will normally be stored as human readable with good reason. However, if we are archiving it, or sharing it using a version control system, our storage strategy will change.
The human readable format is simpler to parse and debug if you have a problem with a field (example: a field contains a number where the spec says this field must be a string); the human readable format is also closer to the problem domain.
I prefer the binary format when there is a lot of data AND I'm sure that I have the software for parsing it :)
When reading Fielding's dissertation about REST, I really liked the concept of "Architectural Properties"; one that stuck was "Visibility". That's what we're talking about here: being able to 'see' the data. Huge benefits when debugging the system.
One aspect that I find missing in the other answers: enforcing semantics.
From the moment you go for human readable, you allow the silly notepad user to create data to be fed into the system. No way to guarantee this data makes sense. No way to guarantee the system will respond in a sensible way.
So in the case where you don't need to notepad-inspect your data, and you want to enforce valid data (e.g. by use of an API) rather than first validating it, you had better avoid human readable data. If debuggability is an issue (it most often is), inspection of the data can be done by using the API, too.
Human readable is not equal to easier to be parsed by machine code.
Take human natural language as an example. :) Machine parsing of human language is still a pending problem to be fully solved.
So I agree with https://stackoverflow.com/a/714111/2727173 which has much deeper insight on this question.

What should I know before poking around an unknown archive file for things?

A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers instead of as large empty blocks in whatever font notepad is using (or whatever text editor).
Try opening it in any archive extractors you have (e.g. zip, 7z, rar, gz, tar etc.) to see if it's just a renamed file format (.PK3 is something like that).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (e.g. search for the PNG signature, the byte 0x89 followed by "PNG", to find any (uncompressed) PNG files somewhere within).
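A small sketch of that kind of search, assuming the whole file has already been read into memory. The 8-byte PNG signature is standard; the function name `find_png` is made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* The 8-byte PNG file signature. */
static const unsigned char PNG_SIG[8] = {
    0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A
};

/* Return the offset of the first PNG signature at or after `start`,
 * or -1 if none is found. */
long find_png(const unsigned char *buf, size_t len, size_t start)
{
    if (len < sizeof PNG_SIG)
        return -1;
    for (size_t i = start; i <= len - sizeof PNG_SIG; i++) {
        if (memcmp(buf + i, PNG_SIG, sizeof PNG_SIG) == 0)
            return (long)i;
    }
    return -1;
}
```

Calling it repeatedly with `start` set just past the previous hit lists every embedded PNG, and the offsets it returns are the numbers to look for in a possible index near the start of the file.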
If you do find where a certain piece of data is stored, take a note of its location and length, and see if you can find numbers equal to either of those values near the beginning of the file, which usually act as pointers to the actual data.
Sometimes you just have to guess, or intuit what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
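A bare-bones version of the compare step might look like this. It is only a positional comparison of two equally sized buffers, so unlike the kind of tool recommended above it does not detect inserts or deletions; `diff_summary` and `diff_bytes` are hypothetical names.

```c
#include <stddef.h>

typedef struct {
    size_t count;  /* number of differing bytes */
    size_t first;  /* offset of the first difference (valid if count > 0) */
    size_t last;   /* offset of the last difference (valid if count > 0) */
} diff_summary;

/* Compare two buffers of the same length byte by byte. */
diff_summary diff_bytes(const unsigned char *a, const unsigned char *b, size_t len)
{
    diff_summary d = { 0, 0, 0 };
    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            if (d.count == 0)
                d.first = i;
            d.last = i;
            d.count++;
        }
    }
    return d;
}
```

A small count confined to a narrow range suggests a plain, uncompressed field; differences spread across most of the file point to compression or encryption.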
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self identifying" so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
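For example, a first pass over a block can simply test its leading bytes against a few well-known signatures. The sketch below checks only three (zip local file header, gzip, GIF); the function name `guess_format` and the choice of formats are assumptions for illustration, and a real tool would carry a much longer table.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char ZIP_SIG[4]  = { 'P', 'K', 0x03, 0x04 };
static const unsigned char GZIP_SIG[2] = { 0x1F, 0x8B };

/* Guess a format from the first bytes of a block.
 * Returns a short name, or "unknown" if nothing matches. */
const char *guess_format(const unsigned char *buf, size_t len)
{
    if (len >= sizeof ZIP_SIG && memcmp(buf, ZIP_SIG, sizeof ZIP_SIG) == 0)
        return "zip";
    if (len >= sizeof GZIP_SIG && memcmp(buf, GZIP_SIG, sizeof GZIP_SIG) == 0)
        return "gzip";
    if (len >= 6 && (memcmp(buf, "GIF87a", 6) == 0 || memcmp(buf, "GIF89a", 6) == 0))
        return "gif";
    return "unknown";
}
```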
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts), and a way to locate those elements within the archive via pointers in the index.
With the ubiquity of the standard compression algorithms, it's mostly a matter of finding where those blocks start, and trying to hunt down the index, or table of contents.
Some will have the index all in one spot (like a file system does), others will simply precede each element within the archive with its identity information. But in the end, somewhere, there is information about offsets from one block to another, there is information about data types (for example, if they're storing GIF files, GIFs have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if you could somehow get your hands on two versions of data using the same format. For example, with a game, you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.

Resources