Efficient Output Format for Huge Data Sets? - c

I have written a program that writes output to a file. The output is in 6-column, n-row format, and all values are double-precision float. It is very common in my code for n to become extremely large (1e20 or so) and, hence, the output data file also becomes extremely large.
I am currently storing everything in *.csv format, which obviously results in huge data files. Is there any more efficient way to store such values? Any new format of file or any new method that would decrease the file size significantly?
For clarification:
The data does not need to be human readable, binary would do just fine.
I will further process the data in the file to get some important parameters from the runs, probably the distance of travel, time of exit at a particular point etc.
The code is actually an astrophysical simulation of moving particles and for about 1e10 particles for a million time steps each, it gets quite high for size.

When designing a file format you have to consider various things, like:
a) Is there a chance that the file could have been corrupted, or maliciously tampered with (or is there any kind of confidentiality requirement)? The answer to this is almost always "yes". To guard against these things you need to consider some kind of checksum and/or encryption. You may also need to consider whether partial recovery is desirable (e.g. is it beneficial to split the file into multiple blocks/sections where each block has its own checksum/encryption, so that if 4 bytes in one block/section are corrupted you can still recover the majority of the data).
b) Is there a portability concern? For example, if you store raw double values in the file will it create a problem on other computers that have a different binary format for "double"?
c) For each type of value; what is the range that actually needs to be represented and what is the precision requirements? Typically software uses "larger and more precise" than necessary (often because its faster to select the next largest type the CPU supports); but for file formats this causes an unnecessary increase in file sizes. For a simple example; maybe you could convert a (64-bit) double into a 32-bit fixed point format and halve the space used while still achieving the range and precision that's actually needed.
d) Are there "clever" ways to reduce the range and precision required for some of the values? For a simple example; maybe you have "starting value" and "ending value" which both do need 64 bits; but you could convert it into "starting value" and "difference" (so that "ending value" can be calculated as "starting value + difference") where "difference" values have less range and only needs 32 bits to store.
e) Is any kind of indexing beneficial? For a simple example; if the file might contain 1 million entries and you only want to find one, then you might be able to use the index to find the offset for the entry you want and only load that one entry (and avoid loading all 1 million entries).
f) What other meta-data might you want? This can be things like a "magic signature" (so that software can check if a file is supposed to comply with the file format and the user didn't give your program the wrong type of file), a "file format version number" (so that the program can do "auto-update to new file format" or at least detect when the file uses an obsolete/deprecated file format that is no longer supported). It can also include information to identify things like who the author was, where the data came from, when the data was obtained, which program created/prepared the file, etc. Sometimes there is also optional data and flags to say if the optional data is/isn't included in the file. You might also want things like "number of entries" and "offset in file for each different area", etc.
g) What kind of allowances do you need to make for extensibility (and backward compatibility, and forward compatibility)? Typically people leave things like (e.g.) "reserved for future use" fields in headers so that they can add/change/extend the file format in future without breaking everything. Sometimes this is even more specific about what software should do when it sees values in reserved fields that it doesn't support - e.g. "reserved for future use, should be zero, if non-zero software should ignore this value" vs. "reserved for future use, should be zero, if non-zero (due to future use) software should generate an error and not use the file"
h) Are any kind of compression techniques useful? For a simple example, if you have "6-columns, N-rows" with an index, and sometimes the data for 2 or more rows happens to be the same; then maybe you can only store one copy of the data for those rows and then use the index to figure out which row uses which data (a bit like "row[n] = unique_row_data[ index[n] ]").

Related

What do files actually contain, and how are they "read"? What is a "format" and why should I worry about them?

As it becomes ever easier to use computers in general and get into programming in particular, an increasing fraction of beginners seem to lack certain fundamental understanding that was once taken for granted in programming circles. Meanwhile as technology advances, the details of that understanding have grown more complex (I personally was programming before Unicode existed, let alone, say, JSON or XML). So, for the sake of having a solid reference, it seems apropos to ask:
What exactly is in a file, anyway? What do we mean when we say that we "open" and "read" a file - what are we getting out of it? I know the term "data", but just giving a name to something is not a real explanation.
More importantly, how do we make sense of data? If I try simply reading some data from a file and outputting it to the console, why does it often look like garbage? Why do some other files appear to have some text scattered among that garbage, while yet others seem to be mostly or completely text? Why isn't it sufficient to ask the program to read, say, an image file, in order to display the image? Again, I know the term "format", but this doesn't explain the concept. If we say, for example, that we make sense of data according to its format, then that only raises two more questions - how do we determine the format, and how does it actually help?
Related: What exactly causes binary file "gibberish"?.
Data, bits and bytes
Everyone who has had to buy hardware, or arrange a network connection, should have some familiarity with the concept of a "bit" and of a "byte". They're used to measure the capacity of storage devices and transfer rates. In short, they measure data: the amount of data that can be stored on a disk, or the amount of data transferred along a cable (or via a wireless connection) per second.
Data is essentially information - a record of some kind of knowledge. The bit is the fundamental unit of information, representing the smallest possible amount of knowledge: the answer to a yes-or-no question, a choice between two options, a record of a decision between two alternatives. (There would need to be at least two possibilities; with only one, there was no answering, choice or decision necessary, and thus nothing is learned by seeing that single possibility arise.)
A byte is simply a grouping of bits in a standard size. Almost everyone nowadays defines a byte to mean 8 bits, mainly because all contemporary consumer hardware is designed around that concept. In some very specific technical contexts (such as certain C or C++ language standard documents), "byte" may have a broader meaning, and octet is used to be precise about 8-bit groupings. We will stick with "byte" here, because we don't need to worry about ancient hardware or idiosyncratic compiler implementations for now.
Data storage devices - both permanent ones like HDDs and SSDs, and temporary ones like RAM - use a huge amount of individual components (depending on the device) to represent data, each of which can conceptually be in either of two states (we commonly use "on or off", "1 or 0" etc. as metaphors). Because there's a decision to be made between those two states, the component thus represents one bit of data. The data isn't a physical thing - it's not the component itself. It's the state of that component: the answer to the question "which of the two possible ways is this component configured right now?".
How data is made useful
It's clear to see how we can use a bit to represent a number, if there are only two possible numbers we are interested in. Suppose those numbers are 0 and 1; then we can ask, "is the number 1?", and according to the bit that tells us the answer to that question, we know which number is represented.
It turns out that in fact this is all we need in order to represent all kinds of numbers. For example, if we need to represent a number from {0, 1, 2, 3}, we can use two bits: one that tells us whether the represented number is in {0, 1} or {2, 3}, and one that tells us whether it's in {0, 2} or {1, 3}. If we can answer those two questions, we can identify the number. This technique generalizes, using base two arithmetic, to represent any integer: essentially, each bit corresponds to a value from the geometric sequence 1, 2, 4, 8, 16..., and then we just add up (implicitly) the values that were chosen by the bits. By tweaking this convention slightly, we can represent negative integers as well. If we let some bits correspond to binary fractions as well (1/2, 1/4, 1/8...), we can approximate real numbers (including the rationals) as closely as we want, depending on how many bits we use for the fractional part. Alternately, we can just use separate groups of bits to represent the numerator and denominator of a rational number - or, for that matter, the real and imaginary parts of a complex number.
Furthermore, once we can represent numbers, we can represent all kinds of answers to questions. For example, we can agree on a sequence of symbols that are used in text; and then, implicitly, a number represents the symbol at that position in the sequence. So we can use some amount of bits to represent a symbol; and by representing individual symbols repeatedly, we can represent text.
Similarly, we can represent the height of a sound wave at a given instant in time; by repeating this process a few tens of thousands of times per second, we can represent sound audible to humans.
Similarly, having studied how the human eye works, we find that we can analyze colours as combinations of three intensity values (i.e., numbers) representing "components" of the colour. By describing colours at many points a small distance apart (like with the sound wave, but in a two-dimensional grid), we can represent images. By considering images across time (a few tens of times per second), we can represent animations.
And so on, and so on.
Choosing an interpretation
There's a problem, here, though. All of this simply talks about possibilities for what data could represent. How do we know what it does represent?
Plainly, the raw data stored by a computer doesn't inherently represent anything specific. Because it's all in the same regular, sequence-of-bits form, nothing stops us from taking any arbitrary chunk of data and interpreting it by any of the schemes described above.
It just... isn't likely to appear like anything meaningful, that way.
However, the choice of interpretations is a choice... which means it can be encoded and recorded in this raw-data form. We say that such data is metadata: data that tells us about the meaning of other data. This could take many forms: the names of our files and the folder structure (telling us how those files relate to each other, and how the user intends to keep track of them); extensions on file names, special data at the beginning of files or other notes made within the file system (telling us what type of file it is, corresponding to a file format - keep reading); documentation (something that humans can read in order to understand how another file is intended to work); and computer programs (data which tells the computer what steps to take, in order to present the file's contents to the user).
What is a (file) format?
Quite simply, a format is the set of rules that describes a way to interpret some data (typically, the contents of a file). When we say that a file is "in" a particular format, we mean that it a) has a valid interpretation according to that format (not every possible chunk of data will meet the requirements, in general) and b) is intended to be interpreted that way.
Put another way: a format is the meaning represented by some metadata.
A format can be a subset or refinement of some other format. For example, JSON documents are also text documents, using UTF-8 encoding. The JSON format adds additional meaning to the text that was represented, by describing how specific text sequences are used to represent structured data. A programming language can also be thought of as this kind of format: it gives additional meaning to text, by explaining how that text can be translated into instructions a computer can follow. (A computer's "machine code" is also a kind of format, that gets interpreted directly by the hardware rather than by a program.)
(Recall: we established that a computer program can be a kind of metadata, and that a programming language can be a kind of format, and that metadata represents a format. To close the loop: of course, one can have a computer program that implements a programming language - that's what a compiler is.)
A format can also involve multiple steps, explained by separate standards. For example, Unicode is the de facto standard text format, but it only describes how abstract numbers correspond to text symbols. It doesn't directly say how to convert the bits into numbers (and this does need to be specified; "treat each byte as a number from 0..255" a) would still be making a choice of many possible ways to do it; b) isn't really sufficient, because there are a lot more possible text symbols than that). To represent text, we also need an encoding, i.e. the rest of the rules for the data format, specifically to convert bits to numbers. UTF-8 is one such encoding, and has become dominant.
What actually happens when we read the file?
Raw data is transferred from the file on disk, into the program's memory.
That's it.
Some languages offer convenience functionality, for the common case of treating the data like text. This might mean doing some light processing on the data (because operating systems disagree about which text symbols, in what order represent "the end of a line"), and loading the data into the language's built-in "string" data structure, using some kind of encoding. (Yes, even if the encoding is "each byte represents a number from 0 to 255 inclusive, which represents the corresponding Unicode code point", that is an encoding - even if it doesn't represent all text and thus isn't a proper Unicode encoding - and it is being used even if the programmer did nothing to specify it; there is no such thing as "plain text", and ignoring this can have all kinds of strange consequences.)
But fundamentally, the reading is really just a transfer of data. Text conversion is often treated as special because, for a long time, programmers were sloppy about treating text properly as an interpretation of data; for decades there was an interpretation of data as text - one byte per text symbol (incidentally, "character" does not mean the same thing as a Unicode code point) - so well established that everyone started forgetting they were actually using it. Programmers forgot about this even though it only actually specifies what half the possible values of a byte mean and leaves the other half up to a local interpretation, and even though that scheme is still woefully inadequate for many world languages, such that programmers in many other countries came up with their own solutions. The solution - the Unicode standard, mentioned several times above - had its first release in 1991, but there are still a few programmers today blithely ignoring it.
But enough ranting.
How does interpreting a file work?
In order to display an image, render a web page, play sound or anything else from a file, we need to:
Have data that is actually intended to represent the corresponding thing;
Know the format that is used by the data to represent the thing;
Load the data (read the file, or read data from a network connection, or create the data by some other process);
Process the data according to the format.
This happens for even the simplest cases, and it can involve multiple programs. For example, a simple command-line program that inputs text from the user (from the "standard input stream") and outputs text back (to the "standard output stream"), generally, is not actually causing the text to appear on screen, or figuring out what keys were pressed on the keyboard. Instead: the operating system interprets signals from the keyboard, in order to create readable data; after the program writes out its response to the input, another program (the terminal) will translate the text into pixel colour values (getting help from the operating system to choose images from a font); then the operating system will arrange to send the appropriate data to the monitor (according to the terminal window's position etc.).

What is checkSumAdjustment in TTF/OTF head table used for?

The specifications for TrueType and OpenType specify a checkSumAdjustment in the 'head' or 'bhed' table of an Sfnt. Both specifications describe how to calculate this value but I can't find any information on why this value exists and what it is used for.
Bonus question: Why do I have to subtract from 0xB1B0AFBA?
The point of this value is to allow font engines to detect corruption in the font without actually having to parse all the font data first. Ideally, the checksum would be all the way at the start of the file, but thanks to needing to unify various font formats, isn't. Instead it's in the head table. Silly, but we're stuck with it.
Each table in a font has its own checksum value, so that the engine can verify that parts of a font are correct "as is", but to make things easier the font itself also has a master checksum that's even easier to compute (find its value offset in the byte stream by parsing a minimal amount of data, then sum the entire bytestream as LONGs while treating the four bytes where this checksum is located as 0x00000000) and can be used to determine whether the font followed the OpenType spec when it was encoded without needing to look up what every table says its checksum is, where it starts, how long it is, and then running the same checksum computation several times for different parts of the byte stream. If the master checksum fails, it doesn't even matter if the checksums for individual tables turn out to be correct: there's something wonky about this font.
The subtraction from 0xB1B0AFBA is pretty much just "for historical reasons" because OpenType unified several specs, rather than starting from scratch, so there's some baggage from older formats left in it (the "OS/2" table, for instance, is a general metadata table and has nothing to do with the OS/2 (Warp) operating system anymore, nor has it for a very long time).

How to export data from C to MATLAB (on different machines)

I am generating long double float data in a C program on a Linux cluster. I need to export the data to Matlab, which is not installed on the cluster.
What is the best way? My advisor says to export using printf statements. I assume he means sending the data to a comma separated file (and fprintf). But it seems to me like that could be slow and use too much memory and we may lose a lot of precision.
I've found this web page for reading and writing .MAT files, but I don't really understand the page, or the example, which I copied to my cluster, but cannot compile (because it's missing libraries which, obviously, come from MATLAB.
What is the best, or easiest, or fastest way to export data from Linux/C to Windows/MATLAB? How do I get started with that method? Be advised when you answer that I am pretty new to C and will likely need help figuring out how to obtain, install, configure, and link any libraries. But once that's done, I think I'm pretty good at learning to use them.
Why do you think you would you lose precision? The only drawback with CSVs is that ASCII files require much more storage than binary files, but in this day and age where you get terabytes of storage for the price of a good haircut, that hardly seems like a problem.
It will only be noticeably slower if you're writing gigabytes upon gigabytes, but normally calculations take so much longer that the difference between ASCII and binary is completely negligible (and if the calculations don't take so long: why do you need a cluster then?)
In any case, I'd go for ASCII -- how to write and read some binary blob needs to be documented in two places, it's easier to create bugs in both the writing end and the reading end, it's harder to solve them since no human can read the file, etc. Also, MAT file formats may change in the next Matlab release (as they have in the past).
With ASCII, you have none of these problems, the only drawback I can think of is that you have to write a small cluster-specific file reader in Matlab (which is still a lot less work than working out all the bugs and maintaining a MAT file writer).
Anyway, there's tons of tools available in Matlab for ASCII: textread, dlmread, importdata, to name a few. On the C-end, indeed just use fprintf (documentation here).
I once had this problem as well (well, kind of...) and used a simple binary format to do the job.
If your data format is static, that means if it will never change, you can restrict yourself to exactly what you need and hard-code the exact format in your loading program. If you want to stay flexible to add and remove columns, however, you should define a kind of header to add information about the data format and evaluate that on reading.
The trick for simple importing of data is the following:
Let the MATLAB program know how longs your data records are and how they are composed.
Read the data with
rest = fread(fid, 'uchar=>uint8', 'b').';
in order to have a row vector of uint8s.
Reshape the data with
rest = reshape(rest, recordlength, []).';
in order to get your data in recordlength columns and as many rows as you need.
For each data column, combine the relevant uint8 rows into a "sub-matrix", using a combination of reshape, typecast, swapbytes to group your data appropriately and convert them into the wanted format.
The most important thing here is the typecast() function, which accepts the "byte-wise" data as 1st and the wanted data type as 2nd parameter. There is a wide range of accepted data types, such as intXX, uintXX (with XX one of 8, 16, 32 and (AFAIK) 64) as well as float and double.
For example, typecast([1, 1], 'uint16') gives you 257, while typecast([0, 0, 96, 64], 'float') gives you 3.5.
Once you do so, you can improve the reading speed - compared with a text file - by factor 20 or so. (At least, this was the case in the application I wrote this for: there were about 10 different measure values every 10 ms, one measurement could be of several minutes or even hours and I wanted to read in such a file as fast as possible. So I recoded the stuff from text to binary and gained about factor 20, or maybe 15 - don't know exactly. But it was a lot...)
I would save the workspace as a .MAT file, as you said. Then you have whatever values are contained in all the present variables saved as a workspace at that moment. However, if you are reading arrays (your data) that are gigabytes of long, then probably you read them chunk by chunk (due to RAM restrictions maybe?) and saving the workspace in that case might not help you.
I would never printf anything for transporting. In my work (on long time asymptotics, so I have huge outputs), I save everything as binary files using fwrite. Converting to text is slow and expensive, as far as I know.
I hope this helps a little bit!

C: creating an archive file header

I am creating a file archiver/extractor (like tar), using POSIX API system calls in C. I have done part of the archiving bit.
I would like to know if any one could help me with some C source code (using above) to create a file header for a file in C (where header acts as an index), which describes the files attributes/meta data (name,date time,etc). All I have done so far is understand (not sure if that's even correct) that to create a file header it needs
a struct to hold the meta data, and lseek is needed to seek to beginning/end of file
like:
FileName=file.txt FileSize=0
FileDir=./blah/blah
FilePerms=000
\n\n
The archiving part of program has this process:
Get a list of all the files from the command line. (I can do this part)
Create a structure to hold the meta data about each file: name (255 char), size (64-bit int), date and time, and permissions.
For each file, get its stats.
Store the stats of each file within an array of structures.
Open the archive for writing. (I can do this part)
Write the header structure.
For each file, append its content to the archive file (at the end/start of each file).
Close the archive file. (I can do this part)
I am having difficulty in creating the header file as a whole even though I know what it needs to do, as mentioned in numbered points above the bits I cant do are stated (2,3,4,6,7).
Any help will be appreciated.
Thanks.
As ijw notes, there are several ways to create an archive file header. If cross-platform portability is going to be an issue at all - or if you need to switch between 32-bit and 64-bit builds of the software on the same platform, even - then you need to ensure that the sizes and layouts of the fields are fully understood on all platforms.
Per-file Metadata
One way to do that is to use a fixed format binary header with types of known size and endianness. This is what ijw suggested. However, you will need to handle long file names, and so you will need to store a length (probably in a 2-byte unsigned integer) and then follow that with the actual pathname.
The alternative, and generally now favoured technique, is to use printable fields (often called ASCII format, though that is something of a misnomer). The time is recorded as the decimal number of seconds since the Epoch converted to a string, etc. This is what modern ar archives use; it is what GNU tar does (more or less; there are some historical quirks that make that more confusing); it is what cpio -c (which is usually the default these days) does. The fields might be separated by nulls or spaces; there is an easy way to detect the end of the header; the header contains information about the file name (not necessarily as directly as you'd like or expect, but again, that is usually because the format has evolved over the years), and then is followed by the actual data. Somehow, you know the size of each field, and the file which the header describes, so that you can read the data reliably.
Efficiency is a red herring. The conversion to/from the text format is so swift by comparison with the first disk access that there is essentially no measurable performance issue. And the guaranteed portability typically far outweighs the (microscopic) performance benefit from using binary data format instead - doubly so when the binary data has to be transformed on input or output anyway to get it into an architecture-neutral format.
Central Index vs Distributed Index
The other issue to consider is whether the index of files in the archive is centralized (at the front, or at the end) or distributed (the metadata for each file immediately precedes the data for the file). There are some advantages to each format - generally, systems use the distributed version because you can write the information for each file without knowing how many files there are to process in total (for example, because you are recursively archiving a directory's contents). Having a central index up front means you can list the files without reading the whole archive - distributed metadata means you have to read the whole file. However, the central index complicates the building of the archive.
Note that even with a distributed index, you will normally need a header for the archive as a whole so that you can detect that the file is in the format you expect. Typically, there is some sort of marker information (!<arch>\n for an ar archive, usually; %PDF-1.2\n at the start of a PDF file, etc) to reassure you that the file contains what you expect. There might be some overall (archive-level) metadata. Then you will have the first file metadata followed by the file data, repeating until the end of archive (which might, or might not, have a formal end marker - more metadata).
[H]ow would I go about implementing it in the 'fixed format binary header' you suggested. I am having trouble with deciding what commands/functions are needed.
I intended to suggest that you do not go with a fixed format binary header; you should use a text-based header format. If you can work out how to do the binary format, be my guest (I've done it numerous times over the years - that doesn't mean I think it is a good idea).
So, some pointers here towards the 'text header' format.
For the file metadata, you might define that you include:
size
mode (permissions, type)
owner
group
modification time
length of name
name
You might reasonably decide that your file sizes are limited to 64-bit unsigned integer quantities, which means 20 decimal digits. The mode might be printed as a 16-bit octal number, requiring 6 octal digits. The owner and group might be printed as UID and GID values (rather than name), in which case you could use 10 digits for each. Alternatively, you could decide to use names, but you should then allow for names up to say 32 characters each. Note that names are typically more portable than numbers. Neither name nor number is of much relevance on the receiving machine unless you extract the data as root (but why would you want to do that?). The modification time is classically a 32-bit signed integer, representing the number of seconds since the Epoch (1970-01-01 00:00:00Z). You should allow for the Y2038 bug by allowing the number of seconds to grow bigger than the 32-bit quantity; you might decide that 12 leading digits will take you beyond the Y10K crisis (by a factor of 4 or so), and this is good enough; you might decide to allow for fractional seconds too. Together, this suggests that 26 spaces for the timestamp should be overkill. You can decide that each field will be separated from the next by a space (for legibility - think 'ease of debugging'!). You might reasonably decide that all file names will be restricted to 4 decimal digits in total length.
You need to know how to format the types portably - #include <inttypes.h> is your friend.
You then devise a format string for printing (writing) the file metadata, and a parallel string for scanning (reading) the file metadata.
Printing:
"%20" PRIu64 " %06o %-.32s %-.32s %26" PRIu64 " %-4d %s\n"
This prints the name too. It terminates the header with a newline. The total size is 127 bytes plus the length of the file name. That's probably excessive, but you can tweak the numbers to suit yourself.
Scanning:
"%" SCNu64 " %o %.32s %.32s %" SCNu64 "%d"
This does not scan the name; you need to create the scanner for the name carefully, not least because you need to read spaces in the name. In fact, the code for scanning the user name and group name both assume no spaces too. If that is not acceptable (that is, names may contain spaces), then you need a more complex scan format, or something other than sscanf() to process the input data.
I'm assuming a 64-bit integer for the time field, rather than mixing up fractional seconds, etc, even though there's enough space to allow for fractional seconds. You'd likely save some space here.
The getting of information for each file you can do with the stat() system call.
For the writing of the header, here's two solutions.
Trivial but evil:
struct file_header {
... data you want to put in
} fhdr;
fwrite(file, fhdr, sizeof(fhdr));
This is evil because structure packing varies from machine to machine, as does byte order and the size of basic types like 'int'. A file written by your program may not be readable by your program when it's compiled on another machine, or even with another compiler on the same machine in some cases.
Non-trivial but safe:
char name[xxx];
uint32_t length; /* Fixed byte length across architectures */
...
fwrite(file, name, sizeof(name));
length=htonl(length); /* Or something else that converts
the length to a known endianness */
fwrite(file, &length, sizeof(length);
Personally I'm not a fan of htonl() and friends, I prefer to write something that converts a uint32_t to a uchar[4] using shift operators (which can be written trivially using shift operators) because C doesn't pin down the format of even an integer in memory. In practice you'd be hard pushed to find something that doesn't store a uint32_t as 4 bytes of 8 bits, but it's a thing to consider.
The variables listed above can be structure members in your structure. Reversing the process on read is left as an exercise to the reader.

How do I choose a good magic number for my file format?

I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?
Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string for identifier. Follow that by an EOF char, and as an added bonus people who cat or type your binary file won't get a mangled terminal.
There is no universally correct way. Best practices can be suggested, but these often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate many 0s or 1s in a sequence (i.e. FFF0 00FF F000) which can stand out against random noise.
If the file is mostly binary, a popular choice is using a text encoding like ASCII which stands out among the binary data in a hex editor. For example, GIF uses GIF89a, FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in a random text file, so invalid/control characters might be incorporated.
In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
Also, it doesn't have to be at offset 0. The file signature has conventionally been at offset zero, since it made sense to store it first if it will be processed first.
That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.

Resources