Once long ago, out of curiosity, I've tried hex-editing the executable file of the game "Dangerous Dave".
I've looked around the file for any strings I could find, and made some random edits to see if it would actually change the text displayed within the game.
I was surprised to see the result, which I have now recreated using a hex-editor and DOSBox:
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program is reusing the two characters and prints them at the start and end of that string.
What is the cause of this? My first guess would be trying to make the executable smaller but just the code that reuses the characters would probably require more space than those 2 bytes to be saved.
Is it just a trick done by the author, or just some compiler voodoo?
Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms is common in older games as a way of saving disk space, but was not automatic - the implementation of this algorithm would likely have been something Romero added himself.
Related
I am generating long double float data in a C program on a Linux cluster. I need to export the data to Matlab, which is not installed on the cluster.
What is the best way? My advisor says to export using printf statements. I assume he means sending the data to a comma separated file (and fprintf). But it seems to me like that could be slow and use too much memory and we may lose a lot of precision.
I've found this web page for reading and writing .MAT files, but I don't really understand the page, or the example, which I copied to my cluster, but cannot compile (because it's missing libraries which, obviously, come from MATLAB.
What is the best, or easiest, or fastest way to export data from Linux/C to Windows/MATLAB? How do I get started with that method? Be advised when you answer that I am pretty new to C and will likely need help figuring out how to obtain, install, configure, and link any libraries. But once that's done, I think I'm pretty good at learning to use them.
Why do you think you would you lose precision? The only drawback with CSVs is that ASCII files require much more storage than binary files, but in this day and age where you get terabytes of storage for the price of a good haircut, that hardly seems like a problem.
It will only be noticeably slower if you're writing gigabytes upon gigabytes, but normally calculations take so much longer that the difference between ASCII and binary is completely negligible (and if the calculations don't take so long: why do you need a cluster then?)
In any case, I'd go for ASCII -- how to write and read some binary blob needs to be documented in two places, it's easier to create bugs in both the writing end and the reading end, it's harder to solve them since no human can read the file, etc. Also, MAT file formats may change in the next Matlab release (as they have in the past).
With ASCII, you have none of these problems, the only drawback I can think of is that you have to write a small cluster-specific file reader in Matlab (which is still a lot less work than working out all the bugs and maintaining a MAT file writer).
Anyway, there's tons of tools available in Matlab for ASCII: textread, dlmread, importdata, to name a few. On the C-end, indeed just use fprintf (documentation here).
I once had this problem as well (well, kind of...) and used a simple binary format to do the job.
If your data format is static, that means if it will never change, you can restrict yourself to exactly what you need and hard-code the exact format in your loading program. If you want to stay flexible to add and remove columns, however, you should define a kind of header to add information about the data format and evaluate that on reading.
The trick for simple importing of data is the following:
Let the MATLAB program know how longs your data records are and how they are composed.
Read the data with
rest = fread(fid, 'uchar=>uint8', 'b').';
in order to have a row vector of uint8s.
Reshape the data with
rest = reshape(rest, recordlength, []).';
in order to get your data in recordlength columns and as many rows as you need.
For each data column, combine the relevant uint8 rows into a "sub-matrix", using a combination of reshape, typecast, swapbytes to group your data appropriately and convert them into the wanted format.
The most important thing here is the typecast() function, which accepts the "byte-wise" data as 1st and the wanted data type as 2nd parameter. There is a wide range of accepted data types, such as intXX, uintXX (with XX one of 8, 16, 32 and (AFAIK) 64) as well as float and double.
For example, typecast([1, 1], 'uint16') gives you 257, while typecast([0, 0, 96, 64], 'float') gives you 3.5.
Once you do so, you can improve the reading speed - compared with a text file - by factor 20 or so. (At least, this was the case in the application I wrote this for: there were about 10 different measure values every 10 ms, one measurement could be of several minutes or even hours and I wanted to read in such a file as fast as possible. So I recoded the stuff from text to binary and gained about factor 20, or maybe 15 - don't know exactly. But it was a lot...)
I would save the workspace as a .MAT file, as you said. Then you have whatever values are contained in all the present variables saved as a workspace at that moment. However, if you are reading arrays (your data) that are gigabytes of long, then probably you read them chunk by chunk (due to RAM restrictions maybe?) and saving the workspace in that case might not help you.
I would never printf anything for transporting. In my work (on long time asymptotics, so I have huge outputs), I save everything as binary files using fwrite. Converting to text is slow and expensive, as far as I know.
I hope this helps a little bit!
When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons, encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size) which sometimes are prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project is a readily accessible, but compact way.
I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution, or is takes too long. Here is a list of the situation I'm working with:
All byte strings will be at least
two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs. (I.E. unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because:
smaller dictionary means smaller files, you have to write the dictionary to your file
smaller number of items can free up space to hold other, more rare sequences if you add a population cap and you hit it with a large file.
I am planning to do a text editor in c. So just wanted to know what data structure is good to save the text. I read using linked list was one way to do it, but not efficient. Please point me to some references where I can get a good idea of what need to use. I am planning to use ncurses library for getting the user input and capturing keys and UI.
Using the source code of existing editors is kind of too complex, all the text editors are huge, even console only editors. Any simple console editor source code for reference?
You will benefit by reading about Emacs buffers. Also see this blog, especially the last comment, quoted here for easy reference:
Many versions of Emacs, including GNU, use a single contiguous character array virtually split in two sections separated by a gap. To insert the gap is first moved to the insertion point. Inserted characters fill into the gap, reducing its size. If there’s insufficient space to hold the characters the entire buffer is reallocated to a new larger size and the gaps coalesced at the previous insertion point.
The naive look at this and say the performance must be poor because of all the copying involved. Wrong. The copy operation is incredibly quick and can be optimized in a variety of ways. Gap buffers also take advantage of usage patterns. You might jump all over the window before focusing and inserting text. The gap doesn’t move for display – only for insert (or delete).
On the other hand, inserting a character block at the head of a 500MB file then inserting another at the end is the worst case for the gap approach, especially if the gap’s size is exceeded. How often does that happen?
Contiguous memory blocks are prized in virtual memory environments because less paging is involved. Moreover, reads and writes are simplfied because the the file doesn’t have to be parsed and broken up into some other data structure. Rather, the file’s internal representation in the gap buffer is identical to disk and can be read into and written out optimally. Writes themselves can be done with a single system call (on *nix).
The gap buffer is the best algorithm for editing text in a general way. It uses the least memory and has the highest aggregate performance over a variety of use cases. Translating the gap buffer to a visual window is a bit trickier as line context must be constantly maintained.
If you want it to scale, you should use a form of balanced binary tree. It's possible to make it so basically all operations - insert, delete, seek to character, seek to line, etc. - are O(log n). If you only care about "sane" file sizes for text (a few megs maximum) it doesn't really matter what structures you use.
The (very old) book Software Tools in Pascal implements a complete ed-style (think vim) text editor, regexp search/replace included. It uses arrays to hold the edited text.
You should "save" the data as plain text. If you mean how to store the data in memory, I recommend a simple linked list.
If it's just a text editor (not a word processor), the approach I took was to store each line in it's own link node.
This is a nice simple approach that makes it easy to insert and delete lines. And inserting or deleting text is efficient because only the data within the current node needs to be shifted around when inserting or deleting text.
You said you don't want to look at source code but, nonetheless, you can download the version I wrote many, many years ago at http://www.softcircuits.com/sw_dos.aspx by downloading pictor.zip to see a simple text editor.
This link provides good information - A case study in the design of a "What-You-See-Is-What-You-Get" (or "WYSIWYG") document editor
My program is reading a text file containing various lines of text for a settings file. Some of the lines could get very large. Currently the buffer size is 4096 chars. It is possible that some lines could exceed this, whether through maliciousness or due to various factors operating within the program.
The current routines were rather tedious to write and now I want to expand the possible contents of the file which will require more of this tedious repetitive code. (This is for a settings type file, consisting of name value pairs and the occasional section header. Some numerical values need to be read as strings due to multiple precision).
The main thing I want is to read an arbitrary length line without buffer overflow. I've just discovered getline can do this for me, but, is there for heavens sake a library that will just do the whole lot of this tediousness for me?
edit:
I don't wish to be forced to place an = sign between the name and values, a blank space should suffice as separator.
By widespread, I mean the library should be available in the standard packages of the popular Linux distributions.
I'm aware of libconfig but it seems complete overkill for my requirements.
Look into libini, sounds about right. It is quite old and not exactly undergoing frantic development, but if it already works for your problem, that should be fine.
A more up to date library, with a bunch of other benefits, is glib, it has a key-value-parser API.
My suggestion is, DIY, since it's quite easy.
Read each line
count chars until your separator and after your separator
allocate buffers
and read name value pairs with sscanf
like:
sscanf(line, "%[^:]: %[^\n]", key, value);
You will be safe since you counted chars before sccanf.
I contributed an updated fork of libini at CCAN. It also contains a very useful dictionary implementation as well as some simple hashing algorithms. Rusty put it in the repo, so I guess I did a reasonably good job of bringing it up to date and fixing the few minor bugs.
The latest version of the library can be found if you poke through this tree, it contains basic token support as well as basic transaction support (useful for re-reading configuration files and reverting if there's a parsing error). It also contains a much more updated set of unit tests.
I don't actively maintain the fork any more, as the original author of libini became active again, however the module is maintained in CCAN.