Justification of BOM mark in file encoding [closed]

I would like to be confident that a BOM mark is absolutely needed for a file's encoding, for the following reasons.
The information in a file should be self-contained, and we have no clear algorithm for identifying which encoding is appropriate for a file.
As for the compatibility issue with the shebang line, that should be fixed inside the scripting language, because encoding is a much higher-level concept than the shebang line.
Regarding the first claim, I have a hard time determining which encoding is right for a file. As a result, files with differing encodings appear frequently, and I suspect that most new developers encounter this situation and simply ignore the weird characters caused by the differing encoding strategies.
I recognize that compatibility is an important aspect of software maintenance. However, I think an old rule that keeps confusing the system should be changed as a step toward the future.
Is there any thought or movement toward making the BOM mark official? Or is there a critical reason why we must not introduce the BOM mark (e.g. a clear algorithm for identifying a file's encoding already exists)?
My understanding comes from the following link, so additional links that might change my perspective would be very welcome.
What's the difference between UTF-8 and UTF-8 without BOM?
Thanks,

Your first assumption is wrong. We have protocols that define what a file (or a packet) contains and how to interpret that content, and we should always keep metadata separate from data. You are effectively pushing to use the BOM as metadata describing the bytes that follow, but that is not enough. "Text data" alone is not very useful information: we still need to understand and interpret what that text means. The most obvious example is interpreting U+0020 (space) either as a printed character or as control data; HTML treats it as the latter (two spaces, or a space and a newline, are nothing special, except inside <pre>). The same goes for a mail message, a mailbox, a Markdown file, an HTML file, and so on: the BOM alone doesn't help. So, for your first point, we would need to add more and more information, and at that point we have a general container format (metadata plus one or more data payloads), it is no longer plain text, and it is not the BOM that is helping us.
If you need a BOM, you have already lost the battle: you can encounter a "BOM" that is not really a BOM at all, but genuine data in some other encoding. Two or three bytes are simply not enough. (The shebang, which is old, used four bytes, #! /; the space is no longer required, but in any case it is an old protocol from a time when files were not exchanged so heavily and the path mattered: nobody executed random files, and if it wasn't a shebang it was an a.out file.)
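To make the ambiguity concrete, here is a minimal sketch of what BOM sniffing amounts to. The byte sequences are the standard Unicode BOMs, but the function and its return strings are purely illustrative, and, as noted above, a file whose real data merely happens to start with these bytes is indistinguishable from one carrying a genuine BOM.

#include <stdio.h>
#include <string.h>

/* Guess an encoding from a leading BOM only.  Note the ambiguity: a file
   whose real data simply begins with these bytes is indistinguishable
   from one carrying a genuine BOM, and FF FE could equally be the start
   of a UTF-32LE BOM (FF FE 00 00). */
static const char *guess_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8 with BOM";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE (or the start of a UTF-32LE BOM)";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return "no BOM: could be UTF-8, Latin-1, Shift-JIS, ...";
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;
    unsigned char buf[4] = {0};
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    printf("%s: %s\n", argv[1], guess_encoding(buf, n));
    return 0;
}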
You are also discussing old issues. Nowadays everything is UTF-8, and there is no need for a BOM. Microsoft is just making things more complex: Unix, Linux, and macOS went through a short transition without much pain (and without a "flag day"), and the web is UTF-8 by default. Your question is about programming languages, but there UTF-8 is fine: their syntax uses ASCII, and what is inside strings doesn't matter so much. It is standard to treat strings and Unicode as opaque objects except in a few cases, otherwise you will get something from Unicode wrong anyway (e.g. splitting combining characters, or splitting emoji in languages that work with UTF-16 code units).
UTF-16 is not something you will write programs in. It may be used by APIs (fixed-length code units may be, or seem, better), or possibly for data, but usually not for source code.
And a BOM doesn't help unless you modify all scripts and programs (and if we are going to do that, let's just declare "everything is UTF-8"). It is not rare to find program sources in multiple encodings, even within the same file: you may have copy-pasted the copyright notice (and so the author's name) from a simple editor in one encoding, the strings may be in another encoding, and a few comments (and committers' names) in yet another. Git (and other tools) operate on lines, so they may insert lines in the wrong encoding: Git has very little information to go on, and users often have an incorrect configuration. So you may end up breaking sources that were fine, just because the encoding problems happened to be confined to comments.
A short comment on the second assumption, which is also problematic.
You want to split layers, but that is very hard: we have scripts that contain binary data at the end, so the operating system should not try to transcode the script (and thus remove the BOM), because the first part may be plain text while other parts may require exactly the original bytes. (Some Unicode test files fall into this category too: they are text, possibly containing deliberately invalid sequences.)
Just use UTF-8 without a BOM and everything becomes much simpler.

Related

Read Thunderbird address MAB files content

I have several address lists in my TBIRD address book.
Every time I need to edit an address that is contained in several lists, it is a pain in the neck to find which list contains the address to be modified.
As a helper tool I want to read the several files and, in just one search, give the user a list of which xxx.MAB files include the searched address.
With the produced list, the user can simply go and edit just the right address lists.
I would like to know at least a minimum about the format of the mentioned MAB files, so I can OPEN + SEARCH for strings in the files.
Thanks in advance
juan
PS: I have asked on the Mozilla forum, but Mozilla has no plans to consolidate the addresses into one master file and have the different lists just contain links to the master. There is one individual thinking of doing that, but he has no idea when, due to lack of resources.
On this forum there is a similar question mentioning MORK files, but my current TBIRD seems to keep all addresses in MAB files.
I am afraid there is no answer that will give you a proper solution for this question.
MORK is a textual database format used for the Address Book Data (.mab) and Mail Folder Summary (.msf) files.
The format, written by David McCusker, is a mix of various numerical namespaces, is undocumented, and seems to no longer be developed, maintained, or supported. The only way to get to grips with it is to reverse-engineer it while studying source code that uses this format.
However, there have been experienced people trying to write parsers for this file format without any success. According to Wikipedia, former Netscape engineer Jamie Zawinski had this to say about the format:
...the single most brain-damaged file format that I have ever seen in my nineteen year career
This page states the following:
In brief, let's count its (Mork's) sins:
Two different numerical namespaces that overlap.
It can't decide what kind of character-quoting syntax to use: Backslash? Hex encoding with dollar-sign?
C++ line comments are allowed sometimes, but sometimes // is just a pair of characters in a URL.
It goes to all this serious compression effort (two different string-interning hash tables) and then writes out Unicode strings without using UTF-8: writes out the unpacked wchar_t characters!
Worse, it hex-encodes each wchar_t with a 3-byte encoding, meaning the file size will be 3x or 6x (depending on whether wchar_t is 2 bytes or 4 bytes.)
It masquerades as a "textual" file format when in fact it's just another binary-blob file, except that it represents all its magic numbers in ASCII. It's not human-readable, it's not hand-editable, so the only benefit there is to the fact that it uses short lines and doesn't use binary characters is that it makes the file bigger. Oh wait, my mistake, that isn't actually a benefit at all."
The frustration shines through here and it is obviously not a simple task.
Consequently, there apparently exist no parsers outside Mozilla products that are actually able to parse this format.
I have reverse-engineered complex file formats in the past and know it can be done with patience and the right amount of energy.
Sadly, this seems to be your only option as well. A good place to start would be to take a look at Thunderbird's source code.
I know this doesn't give you a straight-up solution, but I think it is the only answer to the question, considering the circumstances of this format.
And of course, you can always look into the extension API to see if that allows you to access the data you need in a more structured way than handling the file format directly.
Sample code which reads Mork:
Node.js: https://www.npmjs.com/package/mork-parser
Perl: http://metacpan.org/pod/Mozilla::Mork
Python: https://github.com/KevinGoodsell/mork-converter
More links: https://wiki.mozilla.org/Mork

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?
Because users change file extensions, and other programs steal file extensions, a magic number allows the application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
The concept of magic numbers goes back to Unix and pre-dates the use of file extensions.
The original idea of the shell was that all 'executables' would look the same: it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine the appropriate handler. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things 'nicer' for users, Microsoft chose to 'hide' these extensions, and the era of trojan files was born: files that look like they are of one type but really have a different extension and are processed by a different program.
If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
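As a sketch, checking magic numbers amounts to comparing a file's leading bytes against known signatures. The "MZ", "Rar!" and PNG signatures below are the well-known ones; the function and its labels are only illustrative.

#include <stdio.h>
#include <string.h>

/* Compare a file's first bytes against a few well-known signatures. */
static const char *identify(const unsigned char *buf, size_t len)
{
    if (len >= 2 && memcmp(buf, "MZ", 2) == 0)
        return "DOS/Windows executable";
    if (len >= 4 && memcmp(buf, "Rar!", 4) == 0)
        return "RAR archive";
    if (len >= 8 && memcmp(buf, "\x89PNG\r\n\x1a\n", 8) == 0)
        return "PNG image";
    return "unknown";
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;
    unsigned char buf[8] = {0};
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    printf("%s looks like: %s\n", argv[1], identify(buf, n));
    return 0;
}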
BTW, trying to guess about the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to pretty well guess the format (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy) but if all dates are within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risks of misinterpretation could be reduced.
To quickly identify the type of the file, or of structures at particular positions within it.
Your question should not be "why do file formats have magic numbers", but rather "what are the advantages of file formats having magic numbers"!
Suggestions:
Programs that undelete files by reading disk free space may recognize file types
Your UNIX knows whether an executable file is to be interpreted (shebang) or is binary
When you lose extensions, programs like file can detect what your files are
Designers of file formats consider it safer when applications can easily check that they are reading a file in the expected format.
Since you already have a header, it costs little to put a magic number at the start of it.

UTF-8 tuple storage using lowest common technological denominator, append-only

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.
What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.
As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.
However, at the moment, I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tends to become brittle when you try to store outlandish Unicode and unexpected whitespace (e.g. vertical tabs).
I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format where each string is preceded by a 4-byte signed integer indicating the number of bytes it contains, and an integer value of -1 indicates that this tuple is complete - the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.
Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.
Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .
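For reference, here is a minimal sketch of the length-prefixed layout described above (writer only; the big-endian byte order and the file name are arbitrary choices). As the first edit notes, a partial write still leaves the stream misaligned, so this is an illustration rather than a recommendation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write one length-prefixed string: a 4-byte big-endian byte count,
   followed by that many bytes of UTF-8. */
static void write_string(FILE *out, const char *s)
{
    uint32_t len = (uint32_t)strlen(s);
    unsigned char hdr[4] = {
        (unsigned char)(len >> 24), (unsigned char)(len >> 16),
        (unsigned char)(len >> 8),  (unsigned char)(len)
    };
    fwrite(hdr, 1, 4, out);
    fwrite(s, 1, len, out);
}

/* A length of -1 (0xFFFFFFFF) marks the end of the tuple. */
static void end_tuple(FILE *out)
{
    unsigned char hdr[4] = { 0xFF, 0xFF, 0xFF, 0xFF };
    fwrite(hdr, 1, 4, out);
    fflush(out);
}

int main(void)
{
    FILE *out = fopen("tuples.dat", "ab");   /* append-only */
    if (!out)
        return 1;
    write_string(out, "id-42");
    write_string(out, "some UTF-8 value");
    end_tuple(out);
    fclose(out);
    return 0;
}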
I would recommend tab delimiting each field and carriage-return delimiting each record.
Within each string, replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U+2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).
They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.
This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.
This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.
Example:
U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED becomes ~LF
U+000D CARRIAGE RETURN becomes ~CR
U+007E TILDE becomes ~~~
etc.
There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.
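A rough sketch of the writing side of this scheme, using the constant-length escapes listed above; the function names and the handful of characters handled here are only illustrative, and a full version would also cover the other code points mentioned (control characters, U+2028/U+2029, bidi controls, U+FEFF).

#include <stdio.h>

/* Write one field, escaping the characters that would break the
   tab/CR-delimited structure.  Every escape is exactly three bytes
   and starts with '~', matching the examples above. */
static void write_field(FILE *out, const char *s)
{
    for (; *s; s++) {
        switch (*s) {
        case '\t': fputs("~TB", out); break;
        case '\n': fputs("~LF", out); break;
        case '\r': fputs("~CR", out); break;
        case '~':  fputs("~~~", out); break;
        default:   fputc(*s, out);    break;
        }
    }
}

/* Write one record: fields separated by tabs, terminated by a CR,
   flushed so each appended record reaches the file as a unit. */
static void write_record(FILE *out, const char **fields, int n)
{
    for (int i = 0; i < n; i++) {
        if (i > 0)
            fputc('\t', out);
        write_field(out, fields[i]);
    }
    fputc('\r', out);
    fflush(out);
}

int main(void)
{
    const char *rec[] = { "title", "value with a ~ and a\ttab", "multi\nline" };
    write_record(stdout, rec, 3);
    return 0;
}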
Mostly thinking out loud here...
Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.
Perhaps one could use SCSU along with that.
Or it might be worth looking at the gzip format, and maybe aping it, if not using it outright:
A gzip file consists of a series of "members" (compressed data sets).
[...]
The members simply appear one after another in the file, with no additional information before, between, or after them.
Each of these members can have an optional "filename", comment, or the like, and I believe you can just keep appending members.
Or you could use bencode, as used in torrent files. Or BSON.
See also Wikipedia's Comparison of data serialization formats.
Otherwise, I think your idea of preceding each string with its length is probably the simplest one.

Is there a widespread C library for reading name/value pairs from a file?

My program is reading a text file containing various lines of text for a settings file. Some of the lines could get very large. Currently the buffer size is 4096 chars. It is possible that some lines could exceed this, whether through maliciousness or due to various factors operating within the program.
The current routines were rather tedious to write, and now I want to expand the possible contents of the file, which will require more of this tedious, repetitive code. (This is for a settings-type file, consisting of name/value pairs and the occasional section header. Some numerical values need to be read as strings due to multiple precision.)
The main thing I want is to read an arbitrary-length line without buffer overflow. I've just discovered getline can do this for me, but, for heaven's sake, is there a library that will just do the whole lot of this tediousness for me?
Edit:
I don't wish to be forced to place an = sign between the names and values; a blank space should suffice as a separator.
By widespread, I mean the library should be available in the standard packages of the popular Linux distributions.
I'm aware of libconfig but it seems complete overkill for my requirements.
Look into libini, sounds about right. It is quite old and not exactly undergoing frantic development, but if it already works for your problem, that should be fine.
A more up-to-date library, with a bunch of other benefits, is GLib; it has a key-value parser API (GKeyFile).
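If GLib is an option, a minimal sketch of its key-file API could look like the following. Note that GKeyFile expects ini-style key=value lines, which may clash with the no-'=' requirement; the file name, group and key used here are made up.

#include <glib.h>
#include <stdio.h>

int main(void)
{
    GError *error = NULL;
    GKeyFile *kf = g_key_file_new();

    /* "settings.ini", the group and the key names are just examples. */
    if (!g_key_file_load_from_file(kf, "settings.ini", G_KEY_FILE_NONE, &error)) {
        fprintf(stderr, "load failed: %s\n", error->message);
        g_error_free(error);
        g_key_file_free(kf);
        return 1;
    }

    gchar *value = g_key_file_get_string(kf, "general", "precision", &error);
    if (value != NULL) {
        printf("precision = %s\n", value);   /* still a string, as required */
        g_free(value);
    } else {
        fprintf(stderr, "lookup failed: %s\n", error->message);
        g_error_free(error);
    }

    g_key_file_free(kf);
    return 0;
}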
My suggestion is: DIY, since it's quite easy.
Read each line
Count the chars before and after your separator
Allocate buffers
Read the name/value pairs with sscanf
like:
sscanf(line, "%[^:]: %[^\n]", key, value);
You will be safe since you counted the chars before calling sscanf.
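A rough sketch along those lines, using getline (mentioned in the question) so lines of any length are handled, whitespace rather than ':' as the separator (as the question prefers), and plain splitting plus strdup instead of sscanf; the names, the '#' comment convention and the I/O choices are only illustrative.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <sys/types.h>

/* Read "name value" pairs, one per line, separated by whitespace.
   getline() grows its buffer as needed, so arbitrarily long lines
   cannot overflow anything. */
int main(void)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, stdin)) != -1) {
        if (len > 0 && line[len - 1] == '\n')   /* strip trailing newline */
            line[len - 1] = '\0';
        if (line[0] == '\0' || line[0] == '#')  /* skip blanks and comments */
            continue;

        /* split at the first run of whitespace */
        char *sep = line;
        while (*sep && !isspace((unsigned char)*sep))
            sep++;
        if (*sep == '\0')
            continue;                           /* no value on this line */
        *sep++ = '\0';
        while (*sep && isspace((unsigned char)*sep))
            sep++;

        char *name  = strdup(line);             /* exactly as long as needed */
        char *value = strdup(sep);
        printf("name=[%s] value=[%s]\n", name, value);
        free(name);
        free(value);
    }
    free(line);
    return 0;
}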
I contributed an updated fork of libini at CCAN. It also contains a very useful dictionary implementation as well as some simple hashing algorithms. Rusty put it in the repo, so I guess I did a reasonably good job of bringing it up to date and fixing the few minor bugs.
The latest version of the library can be found if you poke through this tree; it contains basic token support as well as basic transaction support (useful for re-reading configuration files and reverting if there's a parsing error). It also contains a much more up-to-date set of unit tests.
I don't actively maintain the fork any more, as the original author of libini became active again, however the module is maintained in CCAN.

Reuse of characters in compiled .exe file

Long ago, out of curiosity, I tried hex-editing the executable file of the game "Dangerous Dave".
I've looked around the file for any strings I could find, and made some random edits to see if it would actually change the text displayed within the game.
I was surprised to see the result, which I have now recreated using a hex-editor and DOSBox:
As can be seen, editing the two characters "RO" in the string "ROMERO" resulted in 4 characters being changed, with the result becoming "ZUMEZU". It seems as if the program is reusing the two characters and prints them at the start and end of that string.
What is the cause of this? My first guess would be an attempt to make the executable smaller, but the code that reuses the characters would probably require more space than the 2 bytes saved.
Is it just a trick done by the author, or just some compiler voodoo?
Tricky to say for sure without reverse-engineering, but my guess would be that a lot of the constant data in the program is compressed using an algorithm from the LZ family. These compression schemes work essentially in the way that you've observed: they encode repeated substrings as references to text that has previously been decoded.
These compression algorithms were probably used for more than just this one string, and not just for text either; it's quite possible that they were also used to compress other data, such as graphics or level layouts. In short, there were probably significant savings made by using this algorithm!
The use of these compression algorithms was common in older games as a way of saving disk space, but it was not automatic; the implementation of this algorithm would likely have been something Romero added himself.
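As a toy illustration of the back-reference idea (not the actual scheme this game uses), the sketch below decodes a stream of literals and (distance, length) references. Because the trailing "RO" is stored as a reference to the leading "RO", patching the stored literals changes both occurrences, giving exactly the ROMERO to ZUMEZU effect described above.

#include <stdio.h>

/* Toy LZ-style decoder: each token is either a literal byte or a
   back-reference (distance, length) that copies bytes already written
   to the output.  Real games used their own formats; this only shows
   the principle. */
struct token { int is_ref; char lit; int dist; int len; };

static void decode(const struct token *t, int n, char *out)
{
    int pos = 0;
    for (int i = 0; i < n; i++) {
        if (!t[i].is_ref) {
            out[pos++] = t[i].lit;
        } else {
            for (int j = 0; j < t[i].len; j++, pos++)
                out[pos] = out[pos - t[i].dist];   /* copy from earlier output */
        }
    }
    out[pos] = '\0';
}

int main(void)
{
    /* "ROMERO": literals R, O, M, E, then "copy 2 bytes from 4 back". */
    struct token word[] = {
        {0, 'R', 0, 0}, {0, 'O', 0, 0}, {0, 'M', 0, 0}, {0, 'E', 0, 0},
        {1, 0, 4, 2},                    /* the back-reference reuses "RO" */
    };
    char out[16];

    decode(word, 5, out);
    printf("%s\n", out);                 /* prints ROMERO */

    /* Patch the stored literals 'R' and 'O': both occurrences change. */
    word[0].lit = 'Z';
    word[1].lit = 'U';
    decode(word, 5, out);
    printf("%s\n", out);                 /* prints ZUMEZU */
    return 0;
}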
