File and networking portability among different byte sizes (C)

In C, the fread function is declared like this:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
Usually char arrays are used as the buffer, and people usually assume that char = 8 bits. But what if that isn't true? What happens when files written on 8-bit-byte systems are read on 10-bit-byte systems? Is there any single standard covering the portability of files and network streams between systems whose bytes differ in size? And most importantly, how do you write portable code in this regard?

With regard to network communications, the physical access protocols (like Ethernet) define how many bits go into a "unit of information", and it is up to the implementation to map this to an appropriate type. So for network communications there is no problem with supporting weird architectures.
For file access, things get more interesting if you want to support weird architectures, because there are no standards to refer to, and even the method of putting the files on the system may influence how you can access them.
Fortunately, the only systems currently in use that don't have 8-bit bytes are DSPs and similar small embedded systems that don't support a filesystem at all, so the issue is essentially moot.

Systems with byte sizes other than 8 bits are pretty rare these days. But such machines do exist, and files are not guaranteed to be portable to them.
If uberportability is required, then you will need some sort of encoding in your file that copes with char != 8 bits.
Do you have something in mind where this may have to run on a DEC-10, a really old IBM mainframe, a DSP, or some such, or are you just asking out of "I want to know" curiosity? If the latter, I would just ignore the case. It is pretty special machines that don't have 8-bit chars, and you will most likely have other problems than bits-per-char in using your "files" on such a system, like how to get the file there in the first place: you probably can't plug in a USB stick or transfer it with FTP (although the latter is perhaps the most likely to work).
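If you do decide to ignore the case, a minimal sketch of making that decision explicit is to refuse to compile where the assumption fails (CHAR_BIT comes from the standard header limits.h):

#include <limits.h>

/* Refuse to build on any platform where char is not 8 bits,
 * so the 8-bit-byte assumption is at least stated explicitly. */
#if CHAR_BIT != 8
#error "This program assumes 8-bit bytes."
#endif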

Related

What is the least memory usage for a binary file on UNIX-like OS?

I'm currently attending an operating systems course at university.
The professor has told us about the fread(), fwrite(), ... C functions and the read(), write(), ... system calls.
My doubt arose when I had to define the block size and the number of blocks, since the documentation says these functions return the exact number of blocks read or written.
So my question is: is it possible to have a file on the hard disk smaller than one byte, or to have a file whose size is not a multiple of a byte?
Thank you in advance.
EDIT: as someone suggested, I had not posted a practical example. This is the exercise I'm working on. It's just a program that clones a file:
https://gitlab.com/clementefnc/laboratori_so/blob/master/Lab01/Es4/Es4p4.c
Is it possible to have a file on the hard disk smaller than one byte, or a file whose size is not a multiple of a byte?
Yes, in theory this is entirely possible. "Files" are an abstraction and nothing prevents the existence of an OS that has different limitations or a completely different abstraction for "files". In fact, the minimum unit supported by hardware is typically a block of many bytes (e.g. a 512 byte sector) and the OS is already providing "smaller than minimum size supported by hardware" abstractions.
In practice, no operating system has ever supported this; and it's hard to see a use case for it (so it's unlikely that any operating system will support it in future).
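For the exercise itself (cloning a file), a minimal sketch of the usual fread()/fwrite() loop looks like this; the buffer size of 4096 is an arbitrary choice, and with an item size of 1 a short count from fread() simply means end-of-file or an error, never a fraction of a byte:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s source dest\n", argv[0]);
        return EXIT_FAILURE;
    }

    FILE *in = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "wb");
    if (!in || !out) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    unsigned char buf[4096];
    size_t n;
    /* Copy whole bytes until fread() reports a short (possibly zero) count. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            perror("fwrite");
            return EXIT_FAILURE;
        }
    }

    fclose(in);
    fclose(out);
    return EXIT_SUCCESS;
}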

Is it a bad practice to use uint64_t in this context?

I've been playing with C sockets recently, and I managed to exchange files between a client and a server. However, I stumbled upon this problem: when sending the file size between my Mac (64-bit) and a Raspberry Pi (32-bit), it fails, since size_t is different between the two. I solved it by switching to uint64_t.
I'm wondering: is it bad practice to use it in place of size_t, which appears in all the prototypes of fread(), fwrite(), read(), write(), and in stat's st_size?
Is uint64_t going to be slower on the raspberry pi?
This is not only good practice, but ultimately a necessity. You can't exchange data between computers of different architectures without defining the format and size of your data and coming up with portable ways to interpret it. These fixed-width types are literally designed for the purpose.
Will it be slower to use a uint64_t than a uint32_t on a 32-bit platform? Probably, yes. Noticeably? I doubt it. But you can measure it and find out.
Don't forget to account for differences in endianness, too.
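Putting both points together, a sketch of one way to do it (assuming 8-bit bytes; the function names here are just illustrative): define the wire format explicitly as eight big-endian octets and assemble/disassemble the value with shifts, which sidesteps the host's endianness entirely:

#include <stdint.h>

/* Illustrative helpers: serialize a uint64_t as 8 big-endian bytes. */
void encode_u64(uint64_t value, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(value >> (8 * (7 - i)));
}

uint64_t decode_u64(const unsigned char in[8])
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value = (value << 8) | in[i];
    return value;
}

Both ends then send and receive the 8-byte buffer, and neither the width of size_t nor the native byte order of either machine matters.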

fwrite portability

Is fwrite portable? I'm not really faced to the problem described below but I'd like to understand the fundamentals of C.
Let's assume we have two machines, A8 (byte = 8 bits) and B16 (byte = 16 bits).
Will the following code produce the same output on both machines?
unsigned char chars[10];
...
fwrite(chars,sizeof(unsigned char),10,mystream);
I guess A8 will produce 80 bits (10 octets) and B16 will produce 160 bits (20 octets).
Am I wrong?
This problem wouldn't appear if only uintN_t types were used, since their lengths in bits are independent of the size of the byte. But uint8_t won't exist on B16, because the exact-width types must have exactly that many bits.
What is the solution to this problem?
I guess building an array of uint32_t, putting my bytes into this array (with smart shifts and masks depending on the machine's architecture), and writing this array would solve the problem. But this is not really satisfactory: there is again an assumption that uint32_t exists on all platforms, and the filling of this array would be very dependent on the current machine's architecture.
Thanks for any response.
fwrite() is a standard library function, so a conforming implementation must provide it for every machine it supports.
That is, it must be defined in that compiler's C standard library to support your machine.
So 8-bit, 16-bit, and 32-bit machines all give you the same high-level operation.
But if you had to implement those library functions yourself, you would have to consider the machine's architecture and memory organization.
As a user of the C compiler you should not have to bother about the internal behavior.
I think you just want to use those C library functions, so there is no difference in the behavior of the function across machines.
A byte is 8 bits on almost every modern computer. But there is another reason fwrite isn't portable:
a file that was written on a little-endian machine can't be read as-is by a big-endian machine, and vice versa.
In C, char is defined as "smallest addressable unit of the machine". That is, char is not necessarily 8 bits.
In most cases, it's safe enough to rely on a fact that char is 8 bits, and not to deal with some extreme cases.
Generally speaking, you probably won't be able to write "half of a byte" to a file on storage. Additionally, there will be portability issues at the hardware level between devices designed to work with different byte sizes. If you are dealing with such devices (telecom equipment, for instance), you will have to implement bit streams.
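As a concrete sketch of the shift-and-mask idea from the question (the function name is illustrative): write each value as a fixed number of octets, masking with 0xFF so only the low 8 bits of each char are used even where CHAR_BIT > 8. Note that whether a filesystem on such a machine actually stores one octet per char is implementation-defined; this only pins down the in-memory encoding:

#include <stdint.h>
#include <stdio.h>

/* Illustrative: write a uint32_t as exactly four big-endian octets. */
int write_u32(uint32_t v, FILE *f)
{
    unsigned char octets[4] = {
        (unsigned char)((v >> 24) & 0xFF),
        (unsigned char)((v >> 16) & 0xFF),
        (unsigned char)((v >> 8) & 0xFF),
        (unsigned char)(v & 0xFF),
    };
    return fwrite(octets, 1, 4, f) == 4 ? 0 : -1;
}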

Writing a portable C program - which things to consider?

For a project at university I need to extend an existing C application, which shall in the end run on a wide variety of commercial and non-commercial unix systems (FreeBSD, Solaris, AIX, etc.).
Which things do I have to consider when I want to write a C program which is most portable?
The best advice I can give, is to move to a different platform every day, testing as you go.
This will make the platform differences stick out like a sore thumb, and teach you the portability issues at the same time.
Saving the cross-platform testing for the end will lead to failure.
That aside:
Integer sizes can vary.
Floating-point numbers might be represented differently.
Integers can have different endianness.
Compilation options can vary.
Include file names can vary.
Bit-field implementations will vary.
It is generally a good idea to set your compiler's warning level as high as possible, to see the sorts of things the compiler can complain about.
I used to write C utilities that I would then support on 16 bit to 64 bit architectures, including some 60 bit machines. They included at least three varieties of "endianness," different floating point formats, different character encodings, and different operating systems (though Unix predominated).
Stay as close to standard C as you can. For functions/libraries not part of the standard, use as widely supported a code base as you can find. For example, for networking, use the BSD socket interface, with zero or minimal use of low level socket options, out-of-band signalling, etc. To support a wide number of disparate platforms with minimal staff, you'll have to stay with plain vanilla functions.
Be very aware of what's guaranteed by the standard versus what's typical implementation behavior. For instance, pointers are not necessarily the same size as integers, and pointers to different data types may have different lengths. If you must make implementation-dependent assumptions, document them thoroughly. Lint, or --strict, or whatever your development toolset has as an equivalent, is vitally important here.
Header files are your friend. Use implementation-defined macros and constants. Use header definitions and #ifdef to help isolate those instances where you need to cover a small number of alternatives (see the sketch after this answer).
Don't assume the current platform uses EBCDIC characters and packed decimal integers. There are a fair number of ASCII - two's complement machines out there as well. :-)
With all that, if you avoid the temptation to write things multiple times and #ifdef major portions of code, you'll find that coding and testing across disparate platforms helps find bugs sooner. You'll end up producing more disciplined, understandable, maintainable programs.
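A minimal sketch of that #ifdef isolation, using a few predefined platform macros (the header name and the set of platforms covered are just illustrative):

/* portable.h (illustrative): keep all platform dispatch in one place. */
#ifndef PORTABLE_H
#define PORTABLE_H

#if defined(_AIX)
#define PLATFORM_NAME "AIX"
#elif defined(__sun)
#define PLATFORM_NAME "Solaris"
#elif defined(__FreeBSD__)
#define PLATFORM_NAME "FreeBSD"
#else
#define PLATFORM_NAME "unknown"
#endif

#endif /* PORTABLE_H */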
Use at least two compilers.
Have a continuous build system in place, which preferably builds on the various target platforms.
If you do not need to work very low-level, try to use a library that provides abstraction. You will likely find third-party libraries that provide good abstractions for the things you need. For example, for networking and communication there is ACE. Boost (e.g. Boost.Filesystem) is also ported to several platforms. These are C++ libraries, but there may be C libraries too (like curl).
If you have to work at the low level, be aware that platforms occasionally differ in behavior even on things like POSIX interfaces where they are supposed to behave the same. You can have a look at the source code of the libraries above.
One particular issue that you may need to stay abreast of (for instance, if your data files are expected to work across platforms) is endianness.
Numbers are represented differently at the binary level on different architectures. Big-endian systems order the most significant byte first and little-endian systems order the least-significant byte first.
If you write some raw data to a file in one endianness and then read that file back on a system with a different endianness you will obviously have issues.
You should be able to get the endianness at compile time on most systems from sys/param.h. If you need to detect it at runtime, one method is to use a union of an int and a char array: set the int to 1 and look at the first char.
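A minimal sketch of that union trick (the function name is illustrative):

#include <stdio.h>

/* Illustrative: store 1 in the int and look at the first byte.
 * On a little-endian machine the least significant byte comes
 * first, so it will be 1. */
int is_little_endian(void)
{
    union {
        unsigned int i;
        unsigned char c[sizeof(unsigned int)];
    } u = { 1u };
    return u.c[0] == 1;
}

int main(void)
{
    printf("%s-endian\n", is_little_endian() ? "little" : "big");
    return 0;
}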
It is a very long list. The best thing to do is to read examples. The source of perl, for example. If you look at the source of perl, you will see a gigantic process of constructing a header file that deals with about 50 platform issues.
Read it and weep, or borrow.
The list may be long, but it's not nearly as long as also supporting Windows and MSDOS. Which is common with many utilities.
The usual technique is to separate core algorithm modules from those which deal with the operating system, basically a strategy of layered abstraction.
Differentiating amongst several flavors of Unix is rather simple in comparison. Either stick to features they all use the same RTL names for, or look at the majority convention among the platforms supported and #ifdef in the exceptions.
Continually refer to the POSIX standards for any library functions you use. Parts of the standard are ambiguous and some systems return different styles of error codes. This will help you to proactively find those really hard to find slightly different implementation bugs.

When to worry about endianness?

I have seen countless references about endianness and what it means, and I have no problem with that...
However, my coding project is a simple game to run on linux and windows, on standard "gamer" hardware.
Do I need to worry about endianness in this case? When should I need to worry about it?
My code is simple C and SDL+GL, the only complex data are basic media files (png+wav+xm), and the game data is mostly strings, integer booleans (for flags and such) and static-sized arrays. So far no user has had issues, so I am wondering whether adding checks is necessary (it will be done eventually, but there are more urgent issues IMO).
The times when you need to worry about endianness:
you are sending binary data between machines or processes (over a network or in a file). If the machines may have different byte orders, or the protocol used specifies a particular byte order (which it should), you'll need to deal with endianness.
you have code that accesses memory through pointers of different types (say, you access an unsigned int variable through a char*).
If you do these things, you're dealing with byte order whether you know it or not; it might be that you're dealing with it by assuming it's one way or the other, which may work fine as long as your code doesn't have to deal with a different platform.
In a similar vein, you generally need to deal with alignment issues in those same cases and for similar reasons (see the sketch after this list). Once again, you might be dealing with it by doing nothing and having everything work fine because you don't have to cross platform boundaries (which may come back to bite you down the road if that does become a requirement).
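A sketch of handling the alignment half of this (the function name is illustrative): copy bytes out of the buffer with memcpy instead of casting the possibly misaligned char* to uint32_t*, which can trap or silently misbehave on strict-alignment architectures:

#include <stdint.h>
#include <string.h>

/* Illustrative: alignment-safe load of a uint32_t from a byte buffer.
 * memcpy has no alignment requirement, and compilers routinely
 * optimize this to a single load where that is safe. */
uint32_t read_u32(const unsigned char *buf)
{
    uint32_t v;
    memcpy(&v, buf, sizeof v); /* note: still native byte order */
    return v;
}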
If you mean a PC by "standard gamer hardware", then you don't have to worry about endianness as it will always be little endian on x86/x64. But if you want to port the project to other architectures, then you should design it endianness-independently.
Whenever you receive/transmit data over a network, remember to convert to/from network and host byte order. The C functions htons, htonl, etc., or the equivalents in your language, should be used here.
Whenever you read multi-byte values (like UTF-16 characters or 32-bit ints) from a file, since that file might have originated on a system with a different endianness. If the file is UTF-16 or UTF-32 it probably has a BOM (byte-order mark). Otherwise, the file format will have to specify the endianness in some way.
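A minimal sketch of the htonl conversion for the network case, assuming a POSIX socket descriptor (the function name is illustrative):

#include <arpa/inet.h> /* htonl; on Windows use winsock2.h */
#include <stdint.h>
#include <unistd.h>

/* Illustrative: send a 32-bit length prefix in network byte order. */
int send_length(int fd, uint32_t len)
{
    uint32_t wire = htonl(len); /* host order -> big-endian wire order */
    ssize_t n = write(fd, &wire, sizeof wire);
    return n == (ssize_t)sizeof wire ? 0 : -1;
}

The receiver reads the same four bytes and applies ntohl to get back the host-order value.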
You only need to worry about it if your game needs to run on different hardware architectures. If you are positive that it will always run on Intel hardware, then you can forget about it. If it runs on Linux, though, many people use architectures other than Intel, and you may end up having to think about it.
Are you distributing your game in source code form?
Because if you are distributing your game as a binary only, then you know exactly which processor families your game will run on. Also, the media files: are they user-generated (possibly via a level editor), or are they really only meant to be supplied by yourself?
If this is a truly closed environment (you distribute binaries and the game assets are not intended to be customized), then you know your own endianness risks and I personally wouldn't fool with it.
However, if you are either distributing source and/or hoping people will customize their game, then you have a potential concern. That said, with most of the desktop/laptop computers around these days moving to x86, I would think this is a diminishing concern.
The problem occurs with networking, with how the data is sent, and when you are doing bit fiddling on different processors, since different processors may store the data differently in memory.
I believe PowerPC has the opposite endianness of the Intel boards. You might be able to have a routine that sets the endianness depending on the architecture. I'm not sure if you can actually tell what the hardware architecture is in code... maybe someone smarter than me knows the answer to that question.
Now, in reference to your statement about "standard" gamer hardware: I would say you're typically looking at consumer off-the-shelf solutions, which is really what most any standard gamer is using, so you're almost certainly going to get the same endianness across the board. I'm sure someone will disagree with me, but that's my $.02.
Ha... I just noticed there is a related question covering exactly the suggestion I had above:
Find Endianness through a c program
