Alignment of char array struct members in C standard - c

Let us suppose I would like to read/write a tar file header.
Considering standard C (C89, C99, or C11),
do char arrays have any special treatment in structs, regarding padding? Can the compiler add padding to such a struct:
struct header {
char name[100];
char mode[8];
char uid[8];
char gid[8];
char size[12];
char mtime[12];
char chksum[8];
char typeflag;
char linkname[100];
char tail[255];
};
I've seen it used in code on the web as well: just freading and fwriting this struct to the file in one chunk, assuming there will not be any padding. Of course, this also assumes CHAR_BIT == 8.
I'm thinking such C code is so common that the standard would deal with this case, but I just can't find it there. Maybe I would not make a good lawyer.
EDIT
The accepted answer would give a strict, or the strictest possible, portable implementation according to one of the C standards, that lets me treat these fields with standard library string functions. Considering CHAR_BIT and all. I'm thinking one needs to read an array of 512 uint8_t for this, and after that maybe convert them to chars, one by one. Any easier way?

C11 (the latest freely available draft) says only "There may be unnamed padding within a structure object, but not at its beginning" (§6.7.2.1 ¶15) and "There may be unnamed padding at the end of a structure or union" (§6.7.2.1 ¶17). It gives no further restriction on padding within a structure.
The platform ABI may have more stringent requirements on padding, but depending on this will be platform-specific, as other platforms may have other padding requirements. The x86-64 ABI for Unix/Linux gives char 1 byte alignment, and specifies:
Structures and unions assume the alignment of their most strictly aligned component. Each member is assigned to the lowest available offset with the appropriate alignment. The size of any object is always a multiple of the object’s alignment.
An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes. [4]
Structure and union objects can require padding to meet size and alignment constraints. The contents of any padding is undefined.
[4] The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.
This seems to imply that on this platform, there will be no padding within the struct. However, there are cases in which array variables have stricter alignment restriction in order to be able to be used with vector instructions; other platforms may impose such restrictions on array structure members as well.
If you would like to be portable, while reading the structure in a single call, you might want to look at readv. This is a vectored or scatter/gather I/O operation, which allows you to specify an array of arrays and lengths to read into. For instance, for this case you might write:
struct header h;
struct iovec iov[10];
iov[0].iov_base = &h.name;
iov[0].iov_len = sizeof(h.name);
iov[1].iov_base = &h.mode;
iov[1].iov_len = sizeof(h.mode);
/* ... etc ... */
ssize_t bytes_read = readv(fd, iov, 10);
Note that readv is defined in POSIX/Single Unix Specification, not in the C standard. In standard C, the easiest thing to do is just read each of these elements individually (and even with vectored I/O available, just reading and writing each element individually will probably be more clear unless you absolutely need to use a single call for the whole I/O operation).
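A minimal sketch of the "read each element individually" approach in standard C follows. The helper names read_field and read_header are hypothetical (they are not from the question), and the sketch assumes the stream was opened in binary mode and that one char holds one octet of the file.

```c
#include <assert.h>
#include <stdio.h>

/* Same layout as the struct in the question. */
struct header {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char tail[255];
};

/* Read exactly len bytes into dst; 1 on success, 0 on a short read. */
static int read_field(FILE *fp, char *dst, size_t len)
{
    return fread(dst, 1, len, fp) == len;
}

/* Hypothetical helper: fills *h one member at a time, so any padding
   the compiler may have inserted between members is never touched. */
int read_header(FILE *fp, struct header *h)
{
    assert(fp != NULL && h != NULL);
    return read_field(fp, h->name, sizeof h->name)
        && read_field(fp, h->mode, sizeof h->mode)
        && read_field(fp, h->uid, sizeof h->uid)
        && read_field(fp, h->gid, sizeof h->gid)
        && read_field(fp, h->size, sizeof h->size)
        && read_field(fp, h->mtime, sizeof h->mtime)
        && read_field(fp, h->chksum, sizeof h->chksum)
        && read_field(fp, &h->typeflag, sizeof h->typeflag)
        && read_field(fp, h->linkname, sizeof h->linkname)
        && read_field(fp, h->tail, sizeof h->tail);
}
```

Because every fread targets a member directly, the result does not depend on how the compiler laid out the struct. Note that the fields are not necessarily NUL-terminated, so string functions still need a length bound.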
In your edit, you write:
The accepted answer would give a strict, or the strictest possible, portable implementation according to one of the C standards, that lets me treat these fields with standard library string functions. Considering CHAR_BIT and all. I'm thinking one needs to read an array of 512 uint8_t for this, and after that maybe convert them to chars, one by one. Any easier way?
The C specification does not guarantee that uint8_t is available: "The typedef name uintN_t designates an unsigned integer type with width N and no padding bits.... These types are optional." (C11 draft, §7.20.1.1, ¶2–3). However, if 8 bit values are available, then char is guaranteed to be an 8 bit value, as it is guaranteed to be at least 8 bits and is guaranteed to be the smallest object that is not a bit-field (§5.2.4.2.1 ¶1):
The values given below shall be replaced by constant expressions suitable for use in #if preprocessing directives. Moreover, except for CHAR_BIT and MB_LEN_MAX, the following shall be replaced by expressions that have the same type as would an expression that is an object of the corresponding type converted according to the integer promotions. Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— number of bits for smallest object that is not a bit-field (byte)
CHAR_BIT 8
So, if you don't have 8-bit bytes available, you won't be able to read these fields in directly and access octets from them as individual array elements; you would have to manually split out individual bytes using bit shifting and masking. However, there are no modern architectures that I know of which lack 8-bit bytes (for general purpose computing, where file I/O is at all a concern; some DSPs might, but they probably won't have standard C file I/O).
If you do have 8-bit bytes, then char is guaranteed to be 8 bits, so there's not much benefit other than clarity in using uint8_t rather than char. If you're really concerned, I would just ensure that you have a check somewhere in your build process that CHAR_BIT is 8 and call it good.
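One way such a check might look is sketched below. It uses C11's static_assert for CHAR_BIT and a hypothetical runtime helper, header_is_unpadded, that compares each member offset with the offset expected for a 512-byte header with no padding. The expected offsets are simply the running sums of the array sizes in the question's struct.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* offsetof */

struct header {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char tail[255];
};

/* Refuse to build on a platform whose bytes are not octets. */
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Hypothetical helper: returns 1 when every member starts exactly where
   the previous one ends and there is no trailing padding, 0 otherwise. */
int header_is_unpadded(void)
{
    return offsetof(struct header, name) == 0
        && offsetof(struct header, mode) == 100
        && offsetof(struct header, uid) == 108
        && offsetof(struct header, gid) == 116
        && offsetof(struct header, size) == 124
        && offsetof(struct header, mtime) == 136
        && offsetof(struct header, chksum) == 148
        && offsetof(struct header, typeflag) == 156
        && offsetof(struct header, linkname) == 157
        && offsetof(struct header, tail) == 257
        && sizeof(struct header) == 512;
}
```

If the helper returns 0 on some platform, reading the whole struct in one fread is not safe there and a member-by-member read is the fallback.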

Actually, padding, name mangling and the like are governed not by the C standard but by the specific ABI: http://en.wikipedia.org/wiki/Application_binary_interface.
There are clear standards for how to pad data types so that they can be shared between different compilers. Your compiler's man page will most likely list the switches that change the ABI.

The draft C99 and C11 standards say, in section 6.7.2.1 Structure and union specifiers, paragraph 13 (paragraph 15 in C11):
[...]There may be unnamed padding within a structure object, but not at its beginning.
and in paragraph 15 (paragraph 17 in C11):
There may be unnamed padding at the end of a structure or union.

Related

Misalignment of members in structures [duplicate]

In C, certain members of a structure sometimes end up at misaligned offsets, as in the case of this thread in the HPUX community.
In such a case, it is suggested to use a zero-width bit field to align the (misaligned) next member.
Under what circumstances does misalignment of structure members happen? Is it not the job of the compiler to align member offsets at word boundaries?
"Misalignment" of a structure member can only occur if the alignment requirements of the structure member are deliberately hidden. (Or if some implementation-specific mechanism is used to suppress alignment, such as gcc's packed attribute.)
For example, in the referenced problem, the issue is that there is a struct:
struct {
// ... stuff
int val;
unsigned char data[DATA_SIZE];
// ... more stuff
}
and the programmer attempts to use data as though it were a size_t:
*(size_t*)s->data
However, the programmer has declared data as unsigned char and the compiler therefore only guarantees that it is aligned for use as an unsigned char.
As it happens, data follows an int and is therefore also aligned for an int. On some architectures this would work, but on the target architecture a size_t is bigger than an int and requires a stricter alignment.
Obviously the compiler cannot know that you intend to use a structure member as though it were some other type. If you do that and compile for an architecture which requires proper alignment, you are likely to experience problems.
The referenced thread suggests inserting a zero-length size_t bit-field before the declaration of the unsigned char array in order to force the array to be aligned for size_t. While that solution may work on the target architecture, it is not portable and should not be used in portable code. There is no guarantee that a 0-length bit-field will occupy 0 bits, nor is there any guarantee that a bit-field based on size_t will actually be stored in a size_t or be appropriately aligned for any non bit-field use.
A better solution would be to use an anonymous union:
// ...
int val;
union {
size_t dummy;
unsigned char data[DATA_SIZE];
};
// ...
With C11, you can specify a minimum alignment explicitly:
// ...
int val;
_Alignas(size_t) unsigned char data[DATA_SIZE];
// ...
In this case, if you #include <stdalign.h>, you can spell _Alignas in a way which will also work with C++11:
int val;
alignas(size_t) unsigned char data[DATA_SIZE];
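The following sketch checks the effect of the _Alignas approach above. The struct name packet, the DATA_SIZE value of 16 and both helper names are assumptions made for illustration. It also shows a different technique that the answer does not mention: copying the bytes out with memcpy, which reads a size_t from the array without relying on its alignment at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_SIZE 16  /* hypothetical size, not from the original thread */

struct packet {
    int val;
    _Alignas(size_t) unsigned char data[DATA_SIZE];  /* C11 */
};

/* 1 if data starts at an offset suitable for a size_t object. */
int data_is_aligned_for_size_t(void)
{
    return offsetof(struct packet, data) % _Alignof(size_t) == 0;
}

/* Alternative that needs no alignment guarantee: copy the bytes into a
   properly aligned local instead of dereferencing a cast pointer. */
size_t load_size(const unsigned char *bytes)
{
    size_t out;
    assert(bytes != NULL);
    memcpy(&out, bytes, sizeof out);
    return out;
}
```

Compilers commonly turn a small fixed-size memcpy like this into a plain load where the target allows it, so the copy usually costs nothing extra.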
Q: Why does misalignment happen? Is it not the job of the compiler to align offsets of members at word boundary?
You are probably aware that structure fields are aligned to specific boundaries in order to improve performance. A properly aligned field may require only a single memory fetch operation by the CPU, whereas a misaligned field will require at least two memory fetch operations (twice the CPU time).
As you indicated, it is the compiler's job to align structure fields for fastest CPU access, unless a programmer overrides the compiler's default behavior.
Then the question might be: why would the programmer override the compiler's default alignment of structure fields?
One example of why a programmer would want to override the default alignment is when sending a structure 'over the wire' to another computer. Generally, a programmer wants to pack as much data as possible into the fewest number of bytes.
Hence, the programmer will disable the default alignment when structure density is more important than the CPU performance of accessing structure fields.

Why aren't bitfields allowed with normal variables?

I wonder why bitfields work with unions/structs but not with a normal variable like int or short.
This works:
struct foo {
int bar : 10;
};
But this fails:
int bar : 10; // "Expected ';' at end of declaration"
Why is this feature only available in unions/structs and not with variables? Isn't it technically the same?
Edit:
If it were allowed, you could, for instance, make a variable with 3 bytes without using the struct/union member each time. This is how I would do it with a struct:
struct int24_t {
int x : 24 __attribute__((packed));
};
struct int24_t var; // sizeof(var) is now 3
// access the value would be easier:
var.x = 123;
This is a subjective question, "Why does the spec say this?" But I'll give it my shot.
Variables in a function normally have "automatic" storage, as opposed to one of the other durations (static duration, thread duration, and allocated duration).
In a struct, you are explicitly defining the memory layout of some object. But in a function, the compiler automatically allocates storage in some unspecified manner to your variables. Here's a question: how many bytes does x take up on the stack?
// sizeof(unsigned) == 4
unsigned x;
It could take up 4 bytes, or it could take up 8, or 12, or 0, or it could get placed in three different registers at the same time, or the stack and a register, or it could get four places on the stack.
The point is that the compiler is doing the allocation for you. Since you are not doing the layout of the stack, you should not specify the bit widths.
Extended discussion: Bitfields are actually a bit special. The spec states that adjacent bitfields get packed into the same storage unit. Bitfields are not actually objects.
You cannot sizeof() a bit field.
You cannot malloc() a bit field.
You cannot &addressof a bit field.
All of these things you can do with objects in C, but not with bitfields. Bitfields are a special thing made just for structures and nowhere else.
About int24_t (updated): It works on some architectures, but not others. It is not even slightly portable.
typedef struct {
int x : 24 __attribute__((packed));
} int24_t;
On Linux ELF/x64, OS X/x86, OS X/x64, sizeof(int24_t) == 3. But on OS X/PowerPC, sizeof(int24_t) == 4.
Note the code GCC generates for loading int24_t is basically equivalent to this:
int result = (((char *) ptr)[0] << 16) |
(((unsigned char *) ptr)[1] << 8) |
((unsigned char *)ptr)[2];
It's 9 instructions on x64, just to load a single value.
Members of a structure or union have relationships between their storage location. A compiler cannot reorder or pack them in clever ways to save space due to strict constraints on the layout; basically the only freedom a compiler has in laying out structures is the freedom to add extra padding beyond the amount that's needed for alignment. Bitfields allow you to manually give the compiler more freedom to pack information tightly by promising that (1) you don't need the address of these members, and (2) you don't need to store values outside a certain limited range.
If you're talking about individual variables rather than structure members, in the abstract machine they have no relationship between their storage locations. If they're local automatic variables in a function and their addresses are never taken, the compiler is free to keep them in registers or pack them in memory however it likes. There would be little or no benefit to providing such hints to the compiler manually.
Because it's not meaningful. Bitfield declarations are used to share and reorganize bits between fields of a struct. If you have no members, just a single variable, that variable has a constant (implementation-defined) size. For example, it's a contradiction to declare a char, which is almost certainly 8 bits wide, as a one-bit or twelve-bit variable.
If one has a struct QBLOB which combines four 2-bit bitfields into a single byte, every use of that struct will represent a savings of three bytes as compared with a struct that simply contained four fields of type unsigned char. If one declares an array QBLOB myArray[1000000], such an array will take only 1,000,000 bytes; if QBLOB had been a struct with four unsigned char fields, it would have needed 3,000,000 bytes more. Thus, the ability to use bitfields may represent a big memory savings.
By contrast, on most architectures, declaring a simple variable to be of an optimally-sized bitfield type could save at most 15 bits as compared with declaring it to be the smallest suitable standard integral type. Since accessing bitfields generally requires more code than accessing variables of standard integral types, there are few cases where declaring individual variables as bit fields would offer any advantage.
There is one notable exception to this principle, though: some architectures include features which can set, clear, and test individual bits even more efficiently than they can read and write bytes. Compilers for some such architectures include a bit type, and will pack eight variables of that type into each byte of storage. Such variables are often restricted to static or global scope, since the specialized instructions that handle them may be restricted to using certain areas of memory (the linker can ensure any such variables get placed where they have to go).
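The four-values-per-byte saving described for QBLOB can also be had without bit-fields, using shifts and masks on a plain unsigned char. The sketch below swaps in that technique; the type name qblob and the accessor names are hypothetical and are not from the answer.

```c
#include <assert.h>

/* Four 2-bit values (each in the range 0..3) packed into one byte. */
typedef unsigned char qblob;

/* Return the 2-bit value stored in slot index (0..3). */
unsigned qblob_get(qblob q, unsigned index)
{
    assert(index < 4);
    return (q >> (2 * index)) & 0x3u;
}

/* Return a copy of q with slot index (0..3) replaced by value (0..3). */
qblob qblob_set(qblob q, unsigned index, unsigned value)
{
    assert(index < 4 && value < 4);
    q = (qblob)(q & ~(0x3u << (2 * index)));     /* clear the slot */
    return (qblob)(q | (value << (2 * index)));  /* write the new value */
}
```

Unlike a bit-field struct, whose size and bit order are implementation-defined, sizeof(qblob) is 1 by definition, so an array of a million of them takes exactly a million bytes.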
All objects must occupy one or more contiguous bytes or words, but a bitfield is not an object; it's simply a user-friendly way of masking out bits in a word. The struct containing the bitfield must occupy a whole number of bytes or words; the compiler just adds the necessary padding in case the bitfield sizes don't add up to a full word.
There's no technical reason why you couldn't extend C syntax to define bitfields outside of a struct (AFAIK), but they'd be of questionable utility for the amount of work involved.

are pad lengths different for each element in a struct? [duplicate]

For the following program, I'd like to obtain the size of a struct. However, it turns out that its size is 12 rather than 4*4=16. Does this mean that each element can be aligned to a different boundary, like int to 4 and short to 2? But in that case char should be aligned to 1.
Thanks.
#include <stdio.h>
struct test{
int a;
char b;
short c;
int d;
};
struct test A={1,2,3,4};
int main()
{
printf("0X%08X\n",&A.a);
printf("0X%08X\n",&A.b);
printf("0X%08X\n",&A.c);
printf("0X%08X\n",&A.d);
printf("%d\n",sizeof(A));
}
And the result is:
0X00424A30
0X00424A34
0X00424A36
0X00424A38
12
Yes, not every type has the same alignment. Each of your variables must be aligned correctly, i.e. its address must be a multiple of a certain size. The usual rule (for Intel and AMD, among others) is that every data type is aligned to its own size. Assuming the x86 architecture, that seems to be the case here:
0X00424A30: first address of the structure.
0X00424A34: 4 bytes (presumably sizeof(int)) after the first member. char requires an alignment of 1, so it doesn't need padding here.
0X00424A36: 2 bytes after the second member. short requires an alignment of 2, so there is 1 byte of padding.
0X00424A38: 2 bytes after the third member. int requires an alignment of 4, but the address is already a multiple of 4, so there is no padding byte.
Anyway, this is not a portable assumption: the C standard doesn't mandate anything here. It just allows padding bytes between your members and at the end of the structure.
By the way, you should use the following formats instead:
%p with a cast to void * for pointers;
%zu, or %u with a cast to unsigned, for sizeof.
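Applied to the program in the question, the format advice might look like the sketch below. The function name print_layout is hypothetical, and the addresses printed will of course differ from run to run and from platform to platform.

```c
#include <assert.h>
#include <stdio.h>

struct test {
    int a;
    char b;
    short c;
    int d;
};

/* Print each member address with %p and the struct size with %zu.
   Returns the number of characters written, or a negative value on error. */
int print_layout(FILE *out, struct test *t)
{
    assert(out != NULL && t != NULL);
    return fprintf(out, "%p\n%p\n%p\n%p\n%zu\n",
                   (void *)&t->a, (void *)&t->b,
                   (void *)&t->c, (void *)&t->d,
                   sizeof *t);
}
```

Calling print_layout(stdout, &A) with the question's global A prints four addresses followed by the size. %p requires a void * argument, hence the casts, and %zu matches the size_t that sizeof yields.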
Yes. Note that padding is up to the implementation, so it may end up differently on various platforms. C99 spec section 6.7.2.1 only states that there may be padding between members of the structure and at its end. To write portable programs, you should not make any assumptions about the length of the padding.
Yes, each type has its own alignment restrictions.
The alignment restrictions of type T can never be stricter than requiring alignment to addresses that are a multiple of sizeof(T), as the two elements of the array T arr[2] are required to follow each other immediately without additional padding to make arr[1] correctly aligned.
It is allowed for a compiler to use less strict alignment requirements.
For example,
a char object must be byte-aligned (as sizeof(char) == 1 by definition)
a short object will typically be two-byte aligned (with sizeof(short) == 2), but could also be byte-aligned on some architectures
an int object will typically be four-byte aligned (with sizeof(int) == 4), but could also be two or even one byte-aligned on some architectures
a struct type will typically require an alignment equal to the alignment requirements of the most strictly aligned type among its members (sometimes with a minimum alignment > 1).
When building a struct, the members must all be correctly aligned, relative to the start of the struct, with the first member being at offset 0. To achieve this, the compiler may have to insert padding after a member to get the next member correctly aligned.
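The rules above can be checked mechanically with offsetof and C11's _Alignof. The sketch below reuses the struct from the question; the two helper names are hypothetical. The conditions it tests hold on any conforming implementation, while the amount of padding itself is implementation-specific.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

struct test {
    int a;
    char b;
    short c;
    int d;
};

/* The standard guarantees there is no padding before the first member. */
static_assert(offsetof(struct test, a) == 0, "first member is at offset 0");

/* Bytes of padding the compiler placed between b and c. */
size_t padding_before_c(void)
{
    return offsetof(struct test, c)
         - (offsetof(struct test, b) + sizeof(char));
}

/* 1 if every member offset is a multiple of that member's alignment and
   the struct size is a multiple of the struct's own alignment. */
int members_are_aligned(void)
{
    return offsetof(struct test, b) % _Alignof(char) == 0
        && offsetof(struct test, c) % _Alignof(short) == 0
        && offsetof(struct test, d) % _Alignof(int) == 0
        && sizeof(struct test) % _Alignof(struct test) == 0;
}
```

On the platform in the question, padding_before_c() would be 1, matching the jump from ...34 to ...36 in the printed addresses.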
Yes, because of packing and byte alignment.
The general answer is that compilers are free to add padding between members for alignment purpose.

Would this union work if char had stricter alignment requirements than int?

Recently I came across the following snippet, which is an attempt to ensure all bytes of i (and no more) are accessible as individual elements of c:
union {
int i;
char c[sizeof(int)];
};
Now this seems a good idea, but I wonder if the standard allows for the case where the alignment requirements for char are more restrictive than that for int.
In other words, is it possible to have a four-byte int which is required to be aligned on a four-byte boundary with a one-byte char (it is one byte, by definition, see below) required to be aligned on a sixteen-byte boundary?
And would this stuff up the use of the union above?
Two things to note.
I'm talking specifically about what the standard allows here, not what a sane implementor/architecture would provide.
I'm using the term "byte" in the ISO C sense, where it's the width of a char, not necessarily 8 bits.
No type can ever have stricter alignment requirements than its size (because of how arrays work), and sizeof(char) is 1.
In case it's not obvious:
sizeof(T [N]) is sizeof(T)*N.
sizeof is in units of char; all types are represented as a fixed number of bytes (char), that number being their size. See 6.2.6 (Representation of Types) for details.
Given T A[2];, (char *)&A[1] - (char *)&A[0] is equal to sizeof A[0].
Therefore the alignment requirement for T is no greater than sizeof(T) (in fact it divides sizeof(T))
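The argument above can be spot-checked for int and for the union in the question. This is a sketch only: the union tag int_bytes and the function name are made up for the example, and _Alignof needs C11.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */

/* The union from the question, given a tag so it can be named. */
union int_bytes {
    int i;
    char c[sizeof(int)];
};

/* sizeof is measured in units of char, so this holds by definition. */
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

/* 1 if the properties used in the argument hold for int and the union. */
int alignment_argument_holds(void)
{
    int pair[2];
    return (size_t)((char *)&pair[1] - (char *)&pair[0]) == sizeof(int)
        && sizeof(int) % _Alignof(int) == 0   /* alignment divides size */
        && _Alignof(char) == 1                /* so char cannot be stricter */
        && _Alignof(union int_bytes) >= _Alignof(int)
        && sizeof(union int_bytes) >= sizeof(int);
}
```

Since the alignment of char divides sizeof(char), which is 1, char can never be the member that forces a stricter alignment on the union.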
Have a look at this thread. There, I questioned the usefulness of C Unions and there are some interesting insights. The important thing is that the Standard does not ensure the alignment of the different fields at all!
EDIT: paxdiablo, just noticed you were one of the guys answering that question, so you should probably be familiar with this limitation.

Can C arrays contain padding in between elements?

I heard a rumor that, in C, arrays that are contained inside structs may have padding added in between elements of the array. Now obviously, the amount of padding could not vary between any pair of elements or calculating the next element in an array is not possible with simple pointer arithmetic.
This rumor also stated that arrays which are not contained in structures are guaranteed to contain no padding. I know at least that part is true.
So, in code, the rumor is:
{
// Given this:
struct { int values[20]; } foo;
int values[20];
// This may be true:
sizeof(values) != sizeof(foo.values);
}
I'm pretty certain that sizeof(values) will always equal sizeof(foo.values). However, I have not been able to find anything in the C standard (specifically C99) that explicitly confirms or denies this.
Does anyone know if this rumor is addressed in any C standard?
edit: I understand that there may be padding between the end of the array foo.values and the end of the struct foo and that the standard states that there will be no padding between the start of foo and the start of foo.values. However, does anyone have a quote from or reference to the standard where it says there is no padding between the elements of foo.values?
No, there will never be padding in between elements of an array. That is specifically not allowed. The C99 standard defines array types as follows: "An array type describes a contiguously allocated nonempty set of objects...". By contrast, a structure is "sequentially", not "contiguously", allocated.
There might be padding before or after an array within a structure; that is another animal entirely. The compiler might do that to aid alignment of the structure, but the C standard doesn't say anything about that.
Careful here. Padding may be added at the end of the struct, but will not be added between the elements of the array as you state in your question. Arrays will always reference contiguous memory, though an array of structures may have padding added to each element as part of the struct itself.
In your example, the values and foo.values arrays will have the same size. Any padding will be part of the struct foo instead.
Here's the explanation as to why a structure may need padding between its members or even after its last member, and why an array doesn't:
Different types might have different alignment requirements. Some types need to be aligned on word boundaries, others on double or even quad word boundaries. To accomplish this, a structure may contain padding bytes between its members. Trailing padding bytes might be needed because the memory location directly after a structure must also conform to the structure's alignment requirements, i.e. if bar is of type struct foo *, then
(struct foo *)((char *)bar + sizeof(struct foo))
yields a valid pointer to struct foo (ie doesn't fail due to mis-alignment).
As each 'member' of an array has the same alignment requirement, there's no reason to introduce padding. This holds true for arrays contained in structures as well: if an array's first element is correctly aligned, so are all following elements.
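The rumor from the question can be tested directly. The sketch below gives the question's anonymous struct a tag (foo) so it can be named, and the function name is hypothetical. Every condition is expected to hold on any conforming implementation.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, ptrdiff_t */

struct foo {
    int values[20];
};

/* An array of N elements is exactly N times the element size. */
static_assert(sizeof(int[20]) == 20 * sizeof(int), "no padding in arrays");

/* 1 if the array inside the struct behaves exactly like a bare array. */
int arrays_are_contiguous(void)
{
    struct foo foo;
    int values[20];
    return sizeof values == sizeof foo.values
        && (char *)&foo.values[1] - (char *)&foo.values[0]
               == (ptrdiff_t)sizeof(int)
        && offsetof(struct foo, values) == 0;
}
```

Any padding the compiler adds would show up as sizeof(struct foo) being larger than sizeof foo.values, never as a gap between elements.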
Yes, sort of. Variables are often aligned to some boundary, depending on the variable. Take the following, for instance:
typedef struct
{
double d;
char c;
} a_type_t;
double and char are 8 and 1 bytes, on my system, respectively. Total of 9. That structure, however, will be 16 bytes, so that the doubles will always be 8-byte aligned. If I had just used ints, chars, etc, then the alignment might be 1, 2, 4, or 8.
For some type T, sizeof(T) may or may not equal sizeof(T.a) + sizeof(T.b) + sizeof(T.c) ... etc.
Generally, this is entirely compiler and architecture dependent. In practice, it never matters.
Consider:
struct {
short s;
int i;
} s;
Assuming shorts are 16 bits and you're on a 32-bit platform, the size will probably be 8 bytes, as each struct member tends to be aligned to a word (32-bit in this case) boundary. I say "probably" because it is implementation-specific behaviour that can be varied by compiler flags and the like.
It's worth stressing that this is implementation behaviour not necessarily defined by the C standard. Much like the size of shorts, ints and longs (the C standard simply says shorts won't be larger than ints and longs won't be smaller than ints, which can end up as 16/32/32, 16/32/64, 32/32/64 or a number of other configurations).

Resources