How do I check if a file is text-based?

I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?

Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like a BOM [byte order mark])
user direction (just edit the file, goshdarnit)
Windows provides an API, IsTextUnicode, that performs a probabilistic examination, but it has well-known false positives.
My take is that trying to be smarter than the user has some issues...
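As a rough illustration of the "something in the stream itself" hint, here is a minimal sketch in C that looks for a byte order mark at the start of a buffer. The function name detect_bom is made up for this example. It only knows the UTF-8 and UTF-16 marks, and most text files carry no BOM at all, so a NULL result says nothing either way.

```c
#include <stddef.h>

/* Hypothetical helper: returns the encoding name suggested by a leading
 * byte order mark, or NULL if the buffer does not start with one.
 * Only the UTF-8 and UTF-16 marks are checked in this sketch. */
const char *detect_bom(const unsigned char *b, size_t len)
{
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";
    if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    return NULL; /* no BOM: could still be text, or could be binary */
}
```

A BOM is a hint that the file is probably text, not proof; it is best combined with the other hints (extension, user direction).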

Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, and instead ask the user for the go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long, assuming you're not performing Y2K conversion (a la sed 's/y/k/g').

It's pretty costly to determine whether a file is text-based or binary. You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.

Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".

Well, a text file contains text, right? So an easy way to check whether a file contains only text is to read it and check that it contains only alphanumeric characters.
So the first thing to do is check the file encoding. If it is pure ASCII, the task is easy: read the whole file into a char array (I'm assuming you are doing this in C/C++ or similar) and check every char in that array with functions such as isalpha and isdigit. Of course, you have to allow for special cases like the tab '\t', the space ' ', and the newline ('\n' on Linux, "\r\n" on Windows).
For a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric. Also note that for UTF-16 or wider encodings a plain char array is too small to hold one character per element... but if you are doing it in C#, for example, you don't have to worry about the size :)

You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
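A minimal sketch of such a function in C, under some loud assumptions: the caller has already read a sample (say the first 1K) into a buffer, the encoding is ASCII-compatible (ASCII, UTF-8, Latin-1), and the name looks_like_text and the 5% threshold are invented for illustration. UTF-16 text will be rejected by this sketch because it contains NUL bytes.

```c
#include <stddef.h>

/* Hypothetical heuristic: returns 1 if the sample looks like text,
 * 0 if it looks binary. Assumes an ASCII-compatible encoding. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t suspicious = 0;

    if (len == 0)
        return 1; /* an empty sample is treated as text here */

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == 0)
            return 0; /* NUL bytes almost never occur in 8-bit text */

        /* Count control characters other than tab, LF, CR and form feed. */
        if ((c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f')
            || c == 0x7F)
            suspicious++;
    }

    /* Assumed threshold: tolerate at most 5% suspicious bytes. */
    return suspicious * 20 <= len;
}
```

You would fill the buffer with something like fread on the first kilobyte of the file and treat the answer as a guess, then let the user override it.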

Related

Validation for user input on file system

I have written a bunch of web apps and know how to protect against MySQL injection and such. I am writing a log storage system for a project in C, and I was advised to make sure that it was hack free, in the sense that the user could not supply bad data like foo\b\b\b and try to hack into the OS with some rm -rf /* kind of crud. I looked online and found a similar question here: how to check for the "backspace" character in C
This is at least what I thought of, but I know there are probably other things I need to protect against. Can someone who has a bit more experience help me list out the things I need to validate when I am saving files onto a server using user input as part of the hierarchical file naming system?
Example file: /home/webapp/data/{User input}/{Machine-ID}/{hostname}/{tag} where all of these fields could be "faked" when submitted to our log storing system.

Instead of checking for bad characters, turn the problem on its head and specify the good characters. E.g. require {User Input} be a single directory name made of [[:alnum:]_] characters; {Machine-ID} must be made of [[:xdigit:]] to your liking, etc. That gets rid of all the injection stuff quickly.
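For what it's worth, a whitelist check like that is only a few lines of C. This is a sketch, not a vetted security routine: the names is_safe_name and is_hex_id and the max_len parameter are my own, and the explicit c > 127 test is there because isalnum can accept non-ASCII bytes in some locales.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical validator: non-empty, at most max_len bytes,
 * only ASCII letters, digits and '_' (i.e. [[:alnum:]_]). */
int is_safe_name(const char *s, size_t max_len)
{
    size_t i;

    if (s == NULL || s[0] == '\0')
        return 0;
    for (i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];

        if (i >= max_len)
            return 0; /* too long */
        if (c > 127 || !(isalnum(c) || c == '_'))
            return 0; /* anything not whitelisted is rejected */
    }
    return 1;
}

/* Same idea for a machine ID made only of hex digits ([[:xdigit:]]). */
int is_hex_id(const char *s, size_t max_len)
{
    size_t i;

    if (s == NULL || s[0] == '\0')
        return 0;
    for (i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];

        if (i >= max_len || c > 127 || !isxdigit(c))
            return 0;
    }
    return 1;
}
```

Since '/', '.', and control characters such as backspace are not on the list, inputs like ../etc or foo\b\b\b are rejected without any special-case code.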

If you're only ever using these inputs as file names inside your program, and you're storing them on a native Linux filesystem, then the critical things to watch for are:
Absolutely proscribe any file name that starts with ../, contains /../, or ends with /.. because such file names could allow the user to reach files outside the directory tree that you're working in.
Be wary of any file name containing / as these allow the user to name subdirectories, possibly with unintended consequences.
Other things that could cause trouble include:
Non-ASCII characters that may have a different meaning if used in a different locale.
Some ASCII punctuation characters may have a special meaning in parts of your processing system or may be invalid in some filesystems.
Some parts of your system may be case-sensitive with other parts being case-insensitive. Consider normalizing the case.
If applicable, restrict each field to something that isn't going to cause any trouble. For example:
A machine ID should probably consist of only ASCII lowercase letters and digits (or only ASCII uppercase letters and digits).
A hostname should consist of only ASCII lowercase letters and digits, plus - but not in an initial position (use Punycode for non-ASCII host names). If these are fully qualified host names, as opposed to host names in a network, then . is also valid, but not in initial position.
No field should be empty or contain a / or start with a . (an initial . could be . or .. — see above — and would be a dot file that ls doesn't show by default and isn't included in the pattern * in shells, so they're best avoided).
While control characters such as backspace aren't directly harmful, they can be indirectly harmful in that if you're investigating an issue on the command line, they can cause you to make mistakes. Do not allow them.
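The hostname rules above could be sketched in C roughly as follows. The function name is_valid_hostname_field is hypothetical, and this only enforces the rules listed in this answer (non-empty, lowercase ASCII letters, digits, '-' and '.', with neither '-' nor '.' in the initial position); it is not a full DNS name validator.

```c
#include <stddef.h>

/* Hypothetical check for the {hostname} path field. */
int is_valid_hostname_field(const char *s)
{
    size_t i;

    if (s == NULL || s[0] == '\0')
        return 0; /* no field may be empty */
    if (s[0] == '.' || s[0] == '-')
        return 0; /* rules out ".", ".." and dot files */

    for (i = 0; s[i] != '\0'; i++) {
        char c = s[i];
        int ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                 || c == '-' || c == '.';

        if (!ok)
            return 0; /* also rejects '/', uppercase, control characters */
    }
    return 1;
}
```

Because '/' is never accepted, a value that passes this check cannot introduce extra path components.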

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?

Because users change the file extension, or other programs steal the file extension, it allows the application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
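To make that concrete, here is a small C sketch of how an application might check the leading bytes before trying to parse anything. The "MZ" and "Rar!" values come from the question; the function names has_magic and guess_format are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer starts with the given magic bytes. */
int has_magic(const unsigned char *data, size_t len,
              const void *magic, size_t magic_len)
{
    return len >= magic_len && memcmp(data, magic, magic_len) == 0;
}

/* Hypothetical dispatcher: decide what the file is from its first bytes,
 * regardless of what the extension claims. */
const char *guess_format(const unsigned char *data, size_t len)
{
    if (has_magic(data, len, "MZ", 2))
        return "DOS/PE executable";
    if (has_magic(data, len, "Rar!", 4))
        return "RAR archive";
    return "unknown"; /* bail out early instead of parsing garbage */
}
```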

The concept of magic numbers goes back to Unix and predates the use of file extensions.
The original idea of the shell was that all executables would look the same: it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine the appropriate program. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things "nicer" for users, Microsoft chose to hide these extensions, and the era of trojan files, which look like they are of one type but really have a different extension and are processed by a different program, was born.

If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
BTW, trying to guess about the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to pretty well guess the format (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy) but if all dates are within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risks of misinterpretation could be reduced.

To quickly identify the type of the file, or the positions within it.

Your question should not be “why do file formats have magic numbers”, but rather “what are the advantages of file formats having magic numbers”!
Suggestions:
Programs that undelete files by reading disk free space may recognize file types
Your UNIX knows whether an executable file is to be interpreted (shebang) or is binary
When you lose extensions, programs like file can detect what your files are
Designers of file formats consider it safer when applications can easily verify that they are reading a file in the right format.
Since the file already has a header, it costs little to put the magic number at the start of it.

UTF-8 tuple storage using lowest common technological denominator, append-only

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.
What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.
As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.
However, at the moment, I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tends to become brittle when you try to store outlandish Unicode and unexpected whitespace (e.g. vertical tabs).
I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format where each string is preceded by a 4-byte signed integer indicating the number of bytes it contains, and an integer value of -1 indicates that this tuple is complete - the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
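For illustration, the length-prefixed layout could look something like this in C. This is only a sketch of the encoding side: the function names are made up, and it sidesteps the endianness question by always writing the length big-endian, byte by byte, so the on-disk format does not depend on the host CPU. It does nothing about recovering from a partially written tuple.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a 32-bit value in big-endian order, independent of the host. */
static void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Writes one length-prefixed string into out.
 * Returns the number of bytes written, or 0 if it does not fit. */
size_t put_string(unsigned char *out, size_t cap, const char *s)
{
    size_t n = strlen(s);

    if (n > 0x7FFFFFFF || cap < 4 || n > cap - 4)
        return 0;
    put_be32(out, (uint32_t)n);
    memcpy(out + 4, s, n);
    return n + 4;
}

/* Writes the end-of-tuple marker: -1 as a 32-bit two's-complement value. */
size_t put_end_of_tuple(unsigned char *out, size_t cap)
{
    if (cap < 4)
        return 0;
    put_be32(out, 0xFFFFFFFFu);
    return 4;
}
```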
Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.
Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.
Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .

I would recommend tab delimiting each field and carriage-return delimiting each record.
Within each string, replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U+2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).
They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.
This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.
This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.
Example:
U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED becomes ~LF
U+000D CARRIAGE RETURN becomes ~CR
U+007E TILDE becomes ~~~
etc.
There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.
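A minimal C sketch of the escaping step, covering only the four example mappings listed above (tab, line feed, carriage return, tilde). The function name escape_field is hypothetical, and a real version would also need sequences for the other control characters mentioned, plus the matching decoder.

```c
#include <stddef.h>
#include <string.h>

/* Escapes one field into out (capacity cap, including the final NUL).
 * Returns the escaped length, or (size_t)-1 if out is too small. */
size_t escape_field(const char *in, char *out, size_t cap)
{
    size_t o = 0;

    for (; *in != '\0'; in++) {
        const char *rep = NULL;

        switch (*in) {
        case '\t': rep = "~TB"; break;
        case '\n': rep = "~LF"; break;
        case '\r': rep = "~CR"; break;
        case '~':  rep = "~~~"; break; /* the escape character itself */
        default:   break;
        }

        if (rep != NULL) {
            if (cap - o < 4) /* 3 bytes plus room for the NUL */
                return (size_t)-1;
            memcpy(out + o, rep, 3); /* every sequence has length 3 */
            o += 3;
        } else {
            if (cap - o < 2)
                return (size_t)-1;
            out[o++] = *in;
        }
    }

    if (cap - o < 1)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

After escaping, a raw tab can only ever mean "next field" and a raw line terminator can only mean "next record", which is what makes appending safe to parse.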

Mostly thinking out loud here...
Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.
Perhaps one could use SCSU along with that.
Or it might be worth looking at the gzip format, and maybe aping it, if not using it:
A gzip file consists of a series of "members" (compressed data sets).
[...]
The members simply appear one after another in the file, with no additional information before, between, or after them.
Each of these members can have an optional "filename", comment, or the like, and I believe you can just keep appending members.
Or you could use bencode, used in torrent-files. Or BSON.
See also Wikipedia's Comparison of data serialization formats.
Otherwise I think your idea of preceding each string with its length is probably the simplest one.

Is there a widespread C library for reading name/value pairs from a file?

My program is reading a text file containing various lines of text for a settings file. Some of the lines could get very large. Currently the buffer size is 4096 chars. It is possible that some lines could exceed this, whether through maliciousness or due to various factors operating within the program.
The current routines were rather tedious to write and now I want to expand the possible contents of the file which will require more of this tedious repetitive code. (This is for a settings type file, consisting of name value pairs and the occasional section header. Some numerical values need to be read as strings due to multiple precision).
The main thing I want is to read an arbitrary length line without buffer overflow. I've just discovered getline can do this for me, but, is there for heavens sake a library that will just do the whole lot of this tediousness for me?
edit:
I don't wish to be forced to place an = sign between the name and values, a blank space should suffice as separator.
By widespread, I mean the library should be available in the standard packages of the popular Linux distributions.
I'm aware of libconfig but it seems complete overkill for my requirements.

Look into libini; it sounds about right. It is quite old and not exactly undergoing frantic development, but if it already works for your problem, that should be fine.
A more up-to-date library, with a bunch of other benefits, is glib, which has a key-value parser API.

My suggestion is, DIY, since it's quite easy.
Read each line
count chars until your separator and after your separator
allocate buffers
and read name value pairs with sscanf
like:
sscanf(line, "%[^:]: %[^\n]", key, value);
You will be safe, since you counted the characters before calling sscanf.
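Spelled out, the steps above might look like this in C. It is a sketch with an invented name (parse_pair), and it copies with memcpy instead of sscanf so that the sizes counted in the first step are the sizes actually used. The line itself could come from getline, which grows its buffer as needed.

```c
#include <stdlib.h>
#include <string.h>

/* Splits "key<sep> value" into two freshly allocated strings.
 * Returns 0 on success, -1 if there is no separator or malloc fails.
 * The caller frees *key and *value. */
int parse_pair(const char *line, char sep, char **key, char **value)
{
    const char *p, *v;
    size_t klen, vlen;

    if (sep == '\0')
        return -1;
    p = strchr(line, sep); /* count chars until the separator */
    if (p == NULL)
        return -1;
    klen = (size_t)(p - line);

    v = p + 1;
    while (*v == ' ' || *v == '\t') /* skip blanks after the separator */
        v++;
    vlen = strcspn(v, "\r\n"); /* count chars up to the end of line */

    *key = malloc(klen + 1); /* allocate exactly what was counted */
    *value = malloc(vlen + 1);
    if (*key == NULL || *value == NULL) {
        free(*key);
        free(*value);
        *key = *value = NULL;
        return -1;
    }

    memcpy(*key, line, klen);
    (*key)[klen] = '\0';
    memcpy(*value, v, vlen);
    (*value)[vlen] = '\0';
    return 0;
}
```

Passing ' ' as the separator gives the "blank space between name and value" behaviour asked for in the question.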

I contributed an updated fork of libini at CCAN. It also contains a very useful dictionary implementation as well as some simple hashing algorithms. Rusty put it in the repo, so I guess I did a reasonably good job of bringing it up to date and fixing the few minor bugs.
The latest version of the library can be found if you poke through this tree, it contains basic token support as well as basic transaction support (useful for re-reading configuration files and reverting if there's a parsing error). It also contains a much more updated set of unit tests.
I don't actively maintain the fork any more, as the original author of libini became active again, however the module is maintained in CCAN.

Reuse of characters in compiled .exe file

Once, long ago, out of curiosity, I tried hex-editing the executable file of the game "Dangerous Dave".
I've looked around the file for any strings I could find, and made some random edits to see if it would actually change the text displayed within the game.
I was surprised to see the result, which I have now recreated using a hex-editor and DOSBox:
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program is reusing the two characters and prints them at the start and end of that string.
What is the cause of this? My first guess would be an attempt to make the executable smaller, but the code that reuses the characters would probably take more space than the 2 bytes it saves.
Is it just a trick done by the author, or just some compiler voodoo?

Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms is common in older games as a way of saving disk space, but was not automatic - the implementation of this algorithm would likely have been something Romero added himself.
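To show why editing two bytes changes four characters, here is a toy LZ-style decoder in C. The token layout is entirely made up for this illustration and is certainly not the format used in the game; the point is only that a back-reference copies whatever was decoded earlier.

```c
#include <stddef.h>

/* Hypothetical token: either a literal byte or a back-reference. */
struct token {
    int is_ref;      /* 0: literal byte, 1: back-reference */
    char literal;    /* used when is_ref == 0 */
    size_t distance; /* how far back from the current output position */
    size_t length;   /* how many bytes to copy */
};

/* Decodes tokens into out as a NUL-terminated string.
 * Returns the decoded length, or 0 on error (or for empty output). */
size_t toy_lz_decode(const struct token *toks, size_t ntoks,
                     char *out, size_t cap)
{
    size_t o = 0;

    for (size_t t = 0; t < ntoks; t++) {
        if (!toks[t].is_ref) {
            if (o + 1 >= cap)
                return 0;
            out[o++] = toks[t].literal;
        } else {
            size_t d = toks[t].distance;

            if (d == 0 || d > o)
                return 0; /* reference points before the output start */
            for (size_t k = 0; k < toks[t].length; k++) {
                if (o + 1 >= cap)
                    return 0;
                out[o] = out[o - d]; /* copy already-decoded output */
                o++;
            }
        }
    }

    if (o >= cap)
        return 0;
    out[o] = '\0';
    return o;
}
```

With the tokens R, O, M, E followed by a reference (distance 4, length 2), the output is ROMERO. Change only the first two literals to Z and U and the same reference now copies them, giving ZUMEZU.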