Disjoint unions in C, Pascal and SML

I was studying disjoint unions in programming. I came across the statement that Pascal, SML and C each have their own version of a union: variant record, construction and union, respectively. It also said that Pascal has a "tag" that you don't have to use, SML has a tag that you are required to use, and C does not have a tag. Furthermore, SML will throw an exception if we use it wrongly, Pascal allows checks at runtime, and C has no feature for checking at runtime, so the programmer has to add a field for a "tag" manually.
First of all, I don't understand what a "tag" is. I tried looking at some examples of those unions but didn't understand what the "tag" represents. If "tags" are important, how come C doesn't have one? What is the difference between those unions?
Also, I didn't find any material related to the "tag" of unions.
Furthermore, what does "checking during runtime" mean? Checking what? It would be great to see simple examples that show those features.

One could call such disjoint unions a very early form of polymorphism. You have one type that can have several forms. In some languages, which of these forms is being used (is active) is distinguished by a member of the type, called a tag. This can be a boolean, a byte, an enum, or some other ordinal.
In some (older?) versions of Pascal, the tag is actually required to contain the correct value. A Pascal "union" (or, as they are called in Pascal, variant record) contains a value that distinguishes which of the branches is currently "active".
An example:
type
MyUnion = record // Pascal's version of a struct -- or union
case Tag: Byte of // This doesn't have to be called Tag, it can have any name
0: (B0, B1, B2, B3: Byte); // only one of these branches is present
1: (W0, W1: Word); // they overlap each other in memory
2: (L: Longint);
end;
In such versions of Pascal, if Tag has the value 0, you can only access B0, B1, B2 or B3 and not the other variants. If Tag is 1, you can only access W0 and W1, etc...
In most Pascal versions, there is no such restriction and the tag value is purely informative. In many of those, you don't even need an explicit tag value anymore:
MyUnion = record
case Byte of // no tag, just a type, to keep the syntax similar
etc...
Note that Pascal variant records are not pure unions, where each part is an alternative:
type
MyVariantRec = record
First: Integer; // the non-variant part begins here
Second: Double;
case Byte of // only the following part is a "union", the variant part.
0: ( B0, B1, B2, B3: Byte; );
1: ( W0, W1: Word; );
2: ( L: Longint);
end;
In C, you would have to nest a union in a struct to get something nearly the same:
// The following is more or less the equivalent of the Pascal record above
struct MyVariantRec
{
int first;
double second;
union
{
struct { unsigned char b0, b1, b2, b3; };
struct { unsigned short w0, w1; };
struct { long l; };
};
};

First of all, I don't understand what a "tag" is.
Wikipedia has a reasonably nice discussion of the overall concept, which starts off with a list of synonyms, including "tagged union". Actually, "tagged union" is the main heading of the article, and disjoint union is one of the synonyms. It starts with a pretty succinct explanation:
a data structure used to hold a value that could take on several
different, but fixed, types. Only one of the types can be in use at
any one time, and a tag field explicitly indicates which one is in
use.
You go on to ask,
If "tags" are important, how come C doesn't have one?
How important tags are in this context is a language-design question on which C, Pascal, and SML take different positions. Inasmuch as C is inclined to take a rather low-level approach to most things, and to allow users a great deal of control, it is not surprising that it does not force tag usage. Users who want tags can implement them themselves with comparative ease, as indeed I have done myself on occasion.
Alternatively, it might be easier to say that C doesn't have tagged unions as a built-in language feature at all, only plain, untagged unions. From that perspective, if you want a tagged union in C then you have to implement it yourself. This is probably the most consistent view, but I gather that it differs from the one presented in the material you have been studying.
what is the difference between those unions.
They are different implementations of a similar concept, provided by different languages. A full analysis would be beyond the reasonable scope of an SO answer. Like many things in Computer Science and elsewhere, the abstract idea of disjoint unions can be realized in a great many different ways.
Also, I didn't find any material related to the "tag" of unions.
See above, and the linked Wikipedia article. I'm sure you could turn up a lot more material, too, especially with WP's synonym list to work with.
Furthermore, what does "checking during runtime" mean? Checking what?
I'd have to see the context and exact statement to be sure, but it seems likely that your source was talking about checking one or more of these things:
that the tag of a particular instance of the union is one of those defined for that union type, or
that the contents of the union are of the type indicated by the tag, or
that a list of alternative actions (see below) covers all possible alternatives.
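To make the second kind of check concrete (that the contents match the tag), here is a minimal C sketch. The names `checked_value`, `value_tag`, `make_int`, `make_double` and `get_int` are made up for this illustration; the check is written by hand because C performs none on its own.

```c
#include <stdbool.h>

/* Hypothetical tag type: one enumerator per alternative. */
enum value_tag { VALUE_INT, VALUE_DOUBLE };

struct checked_value {
    enum value_tag tag;      /* records which union member is active */
    union {
        int    as_int;
        double as_double;
    } u;
};

struct checked_value make_int(int i)
{
    struct checked_value v = { .tag = VALUE_INT, .u = { .as_int = i } };
    return v;
}

struct checked_value make_double(double d)
{
    struct checked_value v = { .tag = VALUE_DOUBLE, .u = { .as_double = d } };
    return v;
}

/* Runtime check: succeed only if the tag says an int is stored. */
bool get_int(const struct checked_value *v, int *out)
{
    if (v->tag != VALUE_INT)
        return false;        /* wrong alternative: refuse the access */
    *out = v->u.as_int;
    return true;
}
```

A language with built-in tagged unions does roughly this comparison for you; in C, nothing stops a caller from bypassing `get_int` and reading `u.as_int` directly.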
It would be great to see simple examples that show those features.
My Pascal is too rusty to be of any use, and I do not know SML. Even just a C example may be instructive, however:
enum my_tag { INT_TAG, STRING_TAG, DOUBLE_TAG };
union disjoint_union {
struct {
enum my_tag tag;
int an_int;
};
struct {
enum my_tag tag_s;
char *a_string;
};
struct {
enum my_tag tag_d;
double a_double;
};
};
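// Each initializer names the tag that sits in the same anonymous struct as its payload,
// so that tag and payload belong to the same union member.
union disjoint_union u = { .tag = INT_TAG, .an_int = 42 };
union disjoint_union u2 = { .tag_s = STRING_TAG, .a_string = "hello" };
union disjoint_union u3 = { .tag_d = DOUBLE_TAG, .a_double = 3.14159 };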
This being C, the tag is provided manually and explicitly, and the language does not distinguish it specially. Also, it is up to the programmer to ensure that the union's content bears the correct tag.
You might use such a thing with a function like this, which relies on the tag to determine how to handle instances of the union type:
void print_union(union disjoint_union du) {
switch (du.tag) {
case INT_TAG:
printf("%d", du.an_int);
break;
case STRING_TAG:
printf("%s", du.a_string);
break;
case DOUBLE_TAG:
printf("%f", du.a_double);
break;
}
}

The tag is anything that tells you what member of the union is currently being used.
It's usually an enum but could be an integer, a boolean, or a bitfield based on one of these.
Example:
union my_union { char *string; void *void_ptr; long integer; };
struct my_tagged_union {
union my_union the_union;
enum { is_string, is_void_ptr, is_integer } the_tag;
};
C not forcing you to use a builtin-tag means you have more control over the layout and size of your data. For example you can use a bitfield tag and place it next to other bitfield information you store in your structure so that the bitfields get merged, yielding you space savings; or sometimes the union member currently in use is implicit from the context your code is in, in which case no tag is needed at all.
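As a sketch of the bitfield idea above: the tag below is a 2-bit field declared next to other bitfields, so a compiler may pack them into one storage unit. The struct `packed_node`, its field names and `make_integer_node` are hypothetical, and bitfield layout is implementation-defined, so any space saving depends on the compiler and target.

```c
/* Hypothetical kinds; the values must fit in the 2-bit tag below. */
enum node_kind { NODE_STRING, NODE_VOID_PTR, NODE_INTEGER };

struct packed_node {
    union {
        char *string;
        void *void_ptr;
        long  integer;
    } value;
    unsigned int kind     : 2;   /* the tag: which member of value is in use */
    unsigned int is_dirty : 1;   /* unrelated flag stored alongside the tag */
    unsigned int refcount : 13;  /* unrelated counter, also a bitfield */
};

struct packed_node make_integer_node(long n)
{
    struct packed_node node = {
        .value    = { .integer = n },
        .kind     = NODE_INTEGER,
        .is_dirty = 0,
        .refcount = 1
    };
    return node;
}
```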

SML has a tag that you are required to use [...]. Furthermore, SML will throw an exception if we use it wrongly,
Standard ML has algebraic data types which contain sum types and product types. The sum types build on top of unions (and the product types build on top of structs), but handle what you call a tagged or disjoint union automatically in the compiler; you specify the constructors, and the compiled code figures out how to differentiate between the different constructors via pattern matching. For example,
datatype pokemon = Pikachu of int
| Bulbasaur of string
| Charmander of bool * char
| Squirtle of pokemon list
So a sum type can have different constructors with different parameters, and the parameters can themselves be a product of other types, including sum types, and including the type being defined itself, making the data type definition recursive. This is implemented with tagged unions, but the abstractions on top provide for more syntactic convenience.
To clarify, Standard ML will not throw an exception when a value is used at the wrong type; the compiler reports a type error instead. This is because of Standard ML's static type system. So you can't accidentally have a (void *)-pointer that you cast to something it isn't, which is possible in C.
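For comparison, this is roughly what the pokemon datatype above might look like if written by hand as a tagged union in C. It is only a sketch: the names `pokemon_tag`, `payload`, `pokemon_list` and `count_pokemon` are invented here, and an actual SML compiler is free to choose a different representation.

```c
#include <stdbool.h>
#include <stddef.h>

struct pokemon_list;   /* forward declaration for the recursive case */

enum pokemon_tag { PIKACHU, BULBASAUR, CHARMANDER, SQUIRTLE };

struct pokemon {
    enum pokemon_tag tag;                       /* which constructor was used */
    union {
        int pikachu;                            /* Pikachu of int */
        const char *bulbasaur;                  /* Bulbasaur of string */
        struct { bool first; char second; } charmander; /* bool * char */
        struct pokemon_list *squirtle;          /* Squirtle of pokemon list */
    } payload;
};

struct pokemon_list {
    struct pokemon head;
    struct pokemon_list *tail;   /* NULL marks the empty list */
};

/* Hand-written stand-in for an SML case expression: counts p and,
   for a Squirtle, every pokemon nested inside its list. */
int count_pokemon(const struct pokemon *p)
{
    int total = 1;
    if (p->tag == SQUIRTLE) {
        for (const struct pokemon_list *l = p->payload.squirtle; l != NULL; l = l->tail)
            total += count_pokemon(&l->head);
    }
    return total;
}
```

Unlike in SML, nothing here stops code from reading `payload.pikachu` on a value whose tag is `BULBASAUR`.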

Related

Good use cases for declaring enum type tags and variable names in C?

If you need some constants in your code you can declare them with enums:
enum {
DOG,
CAT,
FISH,
};
enum {
CAR,
BUS,
TRAIN,
};
and then use DOG, BUS, etc. as needed. But enums may be declared in a more verbose style as well:
enum animals {
DOG,
CAT,
FISH,
} pets;
enum transport {
CAR,
BUS,
TRAIN,
} vehicles;
Given that enum constants have global scope and cannot be referred by pets.DOG in the way that structs and unions can, are there any good use cases for the verbose style? To me the type tags and variable names for enums look quite redundant, even offputting as they look like structs but can't be used like structs. I hope I'm missing something and they do have a good use.
There is a related SO Q&A where the overriding assumption is that one would use the type tags and variable names when using enums. So my question can be restated as "In what tasks would I fail if I use anonymous enums only?" Because to me, the whole point of enums are the DOG, CAT, CAR constants and I see no use for assigning one of these to an enum variable. I'm still learning, so I'm sure I must be missing something.
You can give enum types a name in case you want to declare variables of that type:
enum animals a1 = DOG;
enum animals a2 = CAT;
Or have them as function arguments:
void foo(enum animals a);
While enums are considered integer types, and you could also use an int to store one of these values, using a variable of an enum type helps to document your code and make your intent clear to the reader.
"To me the type tags and variable names for enums look quite redundant..."
The value in using a sequential collection of named integer values in the form of an enumerated list ( enum ) might seem subtle at first glance, but becomes very apparent when used in C projects for a couple of reasons:
The names associated with an enumerated list provide self-documenting code, particularly when the collection of names chosen to represent a set of enumerated values forms a theme related to the task at hand. (Your animal enum is a good example, as would be one used to enumerate e.g. a large list of commands, or position types within a company.)
The default assignment of values in an enumerated list is sequential, starting from 0 and incrementing by one until the end of the list, resulting in a list of unique values, very well suited for use when indexing through an array of strings with particular meaning, or when used in a switch statement as the constant integer value for each of the case statements.
And regarding comment: "...but the advantage of using the ANML type instead of int is minimal,..."
enum lists also provide a documented constraint. For example, using ANML anml; rather than int anml; as a struct member will quickly indicate to those who will maintain or update the source code (in the months or years to come) that there is an associated list of related values that this member is constrained to use, rather than any random integer value. This is important when an enumerated list will be used, e.g. in a switch statement designed only to handle a set of case statements that correspond with the constant integer values in that enum.
These two together are part of the use case I have found particularly useful, i.e. using enumerations in conjunction with string arrays for selecting content for a user interface, or for sub-string search, etc.
eg:
#include <stdbool.h> // bool, true, false
#include <string.h> // strstr, strcpy
typedef enum {
CAT,
DOG,
FISH,
MAX_ANML
}ANML;//for use in struct
char *strings[MAX_ANML] = {"cat","dog","fish"};
typedef struct {
char content[80];
ANML anml;
}SEARCH;
Where for example, the two constructs then can be used in conjunction with a switch statement:
bool searchBuf(SEARCH *animal)
{
bool res = false;
switch (animal->anml) {
case CAT:
//use strings[animal->anml] for a search, or user interface content, etc.
if(strstr(animal->content, strings[CAT]))
res = true;
break;
case DOG:
if(strstr(animal->content, strings[DOG]))
res = true;
break;
case FISH:
if(strstr(animal->content, strings[FISH]))
res = true;
break;
default: //MAX_ANML or an out-of-range value: nothing to search for
break;
}
return res;
}
int main(void)
{
char buffer[] = {"this is a string containing cat."};
SEARCH search;
strcpy(search.content, buffer);
search.anml = CAT;
bool res = searchBuf(&search);
//use res...
return 0;
}
I like enums instead of #defines when I debug. The debugger shows me not only the numeric value, but also the enum name, which is very handy.
I am posting my own answer after a number of others have posted useful answers and comments, and we got as far as establishing that enum typenames and variable names are useful as self-documenting code and making the intent of the code clear (thanks to answers by ryyker and dbush).
As I was experimenting and looking for stronger reasons to use non-anonymous enums, I established that there cannot be any by definition. Enums have no scoping and bounds checks, not at compile time (GCC 6) nor at runtime. Here is a snippet demonstrating the weakness:
enum withType { // enum with type name and variable name
ONE,
TWO,
THREE,
} wtEnum;
enum { // Anonymous enum
TINY,
SMALL,
MID,
LARGE,
BIGGEST,
};
int main(void) {
enum withType wt1 = LARGE; // Overflow!
wtEnum = BIGGEST; // Overflow!
printf("Enum test values: %d, %d, %d\n", THREE, wt1, wtEnum);
return 0;
}
The example makes clear that any "scoping" you may want to do with enum type names and variable names is by convention only, and relies on coder discipline not to cross enum "domains". I would go so far as to claim that given this reality, enum functionality in C is "mis-designed". It creates the impression of the kind of utility we see with structs and unions, but provides nothing of the kind. After this insight, I consider anonymous enums the only safe enums to use!
All of that said, I have accepted ryyker's answer as it provides a nice demonstration of mainstream usage of enums. But I am leaving my own answer here as well, because the points I have raised are valid.

Typechecking in const anonymous union

First off, typechecking is not exactly the correct term I'm looking for, so I'll explain:
Say I want to use an anonymous union, I make the union declaration in the struct const, so after initialization the values will not change. This should allow for statically checking whether the uninitialized member of the union is being accessed. In the below example the instances are initialized for either a (int) or b (float). After this initialization I would like to not be able to access the other member:
struct Test{
const union{
const int a;
const float b;
};
};
int main(){
struct Test intContainer = { .a=5 };
struct Test floatContainer = { .b=3.0 };
int validInt = intContainer.a;
float validFloat = floatContainer.b;
// For these, it could be statically determined that these values are not in use (therefore invalid access)
int invalidInt = floatContainer.a;
float invalidFloat = intContainer.b;
return 0;
}
I'd hope to have the last two assignments to give an error (or at least a warning), but it gives none (using gcc 4.9.2). Is C designed to not check for this, or is it actually a shortcoming of the language/compiler? Or is it just plain stupid to want to use such a pattern?
In my eyes it looks like it has a lot of potential if this was a feature, so can someone explain to me why I can't use this as a way to differentiate between two "sub-types" of a same struct (one for each union value). (Potentially any suggestions how I can still do something like this?)
EDIT:
So apparently it is not in the language standard, and also compilers don't check it. Still I personally think it would be a good feature to have, since it's just eliminating manually checking for the union's contents using tagged unions. So I wonder, does anyone have an idea why it is not featured in the language (or it's compilers)?
I'd hope to have the last two assignments to give an error (or at least a warning), but it gives none (using gcc 4.9.2). Is C designed to not check for this, or is it actually a shortcoming of the language/compiler?
This is a correct behavior of the compiler.
float invalidInt = floatContainer.a;
float invalidFloat = intContainer.b;
In the first declaration you are initializing a float object with an int value and in the second you are initializing a float object with a float value. In C you can assign (or initialize) any arithmetic types to any arithmetic types without any cast required. So no diagnostic required.
In your specific case you are also reading union members other than the member last used to store a value. Assuming the union members are of the same size (e.g., float and int here), this is specified behavior and no diagnostic is required. If the sizes of the union members differ, the behavior is unspecified (but still, no diagnostic is required).
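A small sketch of what reading "the other member" does in practice. The union `float_bits` and the function `bits_of` are illustrative names, and the code assumes float is a 32-bit IEEE-754 type, which the C standard does not guarantee.

```c
#include <stdint.h>

/* Assumption: float and uint32_t have the same size on this platform. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

union float_bits {
    float    f;
    uint32_t u;
};

/* Store through one member, read through the other. The compiler accepts
   this silently; the bytes of the float are reinterpreted as an integer. */
uint32_t bits_of(float value)
{
    union float_bits pun;
    pun.f = value;
    return pun.u;
}
```

On an IEEE-754 platform `bits_of(1.0f)` yields `0x3F800000`, the bit pattern of the float rather than the number 1. This is why the compiler cannot treat such an access as an error: it is a legitimate and commonly used technique.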

How to define constructor in C

How can one create a Haskell/C++ style constructor in C? Also, once I have this new object how can I define operations in it (such as LinkedList/Tree)?
C has no language support for object-oriented concepts such as constructors.
You could manually implement this, along the lines of:
typedef struct
{
int field1;
int field2;
} MyObject;
MyObject *MyObject_new(int field1, int field2)
{
MyObject *p = malloc(sizeof(*p));
if (p != NULL)
{
// "Initialise" fields
p->field1 = field1;
p->field2 = field2;
}
return p;
}
void MyObject_delete(MyObject *p)
{
// Free other resources here too
if (p != NULL)
{
free(p);
}
}
void MyObject_doSomethingInteresting(const MyObject *p)
{
printf("How interesting: %d\n", p->field1 * p->field2);
}
If you want to get really advanced, you can use hideously-complex macros and function pointers to emulate polymorphism and all sorts (at the expense of type-safety); see this book.
See also this question.
See also this FAQ answer if you're interested in compiling object-oriented C++ into C.
I suspect that @Oli Charlesworth's answer is what you really wanted here, but for completeness' sake, if you really want Haskell-style constructors:
Nullary data constructors like True and False can be represented easily by regular enum types. To stay accurate to Haskell you'll want to pretend that you can't treat them as integer values, unless you're also pretending that your Haskell type is an instance of Enum.
Data constructors for product types like (True, "abc") can be represented by functions that take appropriate arguments and return an appropriate struct. Haskell's "record syntax" can be imitated using a struct because C-style structs are already being imitated by Haskell's record syntax. Yay, recursion!
Data constructors for sum types like Nothing and Just 5 can be imitated by a union. To stay accurate to Haskell you'll need to "tag" the union to distinguish the cases safely, such as using a struct and an enum. Think of the | in Haskell's data declarations as indicating an untagged union, with the individual constructors like Nothing and Just being nullary constructors required to be the first element of a product type forming each branch. This is slightly more complicated to implement than you'd think after realizing that it's simpler than it sounds at first.
Type constructors don't really translate well to C. You might be able to fake it badly, if you want to, using macros and such. But you probably don't want to, so stick to monomorphic code.
If you want to put any of the above into practice in actual code, well... good luck and godspeed.
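For the sum-type case described above (Nothing and Just 5), a minimal sketch in C could look like this. All names (`maybe_int`, `maybe_tag`, `nothing`, `just`, `from_maybe`) are made up for the illustration, and the type is monomorphic: it only models Maybe Int.

```c
enum maybe_tag { NOTHING, JUST };

struct maybe_int {
    enum maybe_tag tag;            /* plays the role of the data constructor */
    union { int just; } payload;   /* Nothing carries no payload */
};

/* "Constructors": plain functions that build a correctly tagged value. */
struct maybe_int nothing(void)
{
    struct maybe_int m = { .tag = NOTHING };
    return m;
}

struct maybe_int just(int x)
{
    struct maybe_int m = { .tag = JUST, .payload = { .just = x } };
    return m;
}

/* Rough analogue of Haskell's fromMaybe: "pattern match" on the tag. */
int from_maybe(int fallback, struct maybe_int m)
{
    switch (m.tag) {
    case JUST:    return m.payload.just;
    case NOTHING: return fallback;
    }
    return fallback;   /* only reached if the tag holds an invalid value */
}
```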

union versus void pointer

What would be the differences between using simply a void* as opposed to a union? Example:
struct my_struct {
short datatype;
void *data;
};
struct my_struct {
short datatype;
union {
char* c;
int* i;
long* l;
};
};
Both of those can be used to accomplish the exact same thing, is it better to use the union or the void* though?
I had exactly this case in our library. We had a generic string mapping module that could use different sizes for the index, 8, 16 or 32 bit (for historic reasons). So the code was full of code like this:
if(map->idxSiz == 1)
return ((BYTE *)map->idx)[Pos] = ...whatever
else
if(map->idxSiz == 2)
return ((WORD *)map->idx)[Pos] = ...whatever
else
return ((LONG *)map->idx)[Pos] = ...whatever
There were 100 lines like that. As a first step, I changed it to a union and I found it to be more readable.
switch(map->idxSiz) {
case 1: return map->idx.u8[Pos] = ...whatever
case 2: return map->idx.u16[Pos] = ...whatever
case 3: return map->idx.u32[Pos] = ...whatever
}
This allowed me to see more clearly what was going on. I could then decide to completely remove the idxSiz variants using only 32-bit indexes. But this was only possible once the code got more readable.
PS: That was only a minor part of our project which is about several 100’000 lines of code written by people who do not exist any more. The changes to the code have to be gradual, in order not to break the applications.
Conclusion: Even if people are less used to the union variant, I prefer it because it can make the code much lighter to read. On big projects, readability is extremely important, even if it is just you yourself, who will read the code later.
Edit: Added the comment, as comments do not format code:
The change to switch came before (this is now the real code as it was)
switch(this->IdxSiz) {
case 2: ((uint16_t*)this->iSort)[Pos-1] = (uint16_t)this->header.nUz; break;
case 4: ((uint32_t*)this->iSort)[Pos-1] = this->header.nUz; break;
}
was changed to
switch(this->IdxSiz) {
case 2: this->iSort.u16[Pos-1] = this->header.nUz; break;
case 4: this->iSort.u32[Pos-1] = this->header.nUz; break;
}
I shouldn't have combined all the beautification I did in the code and only show that step. But I posted my answer from home where I had no access to the code.
In my opinion, the void pointer and explicit casting is the better way, because it is obvious for every seasoned C programmer what the intent is.
Edit to clarify: If I see the said union in a program, I would ask myself if the author wanted to restrict the types of the stored data. Perhaps some sanity checks are performed which make sense only on integral number types.
But if I see a void pointer, I directly know that the author designed the data structure to hold arbitrary data. Thus I can use it for newly introduced structure types, too.
Note that it could be that I cannot change the original code, e.g. if it is part of a 3rd party library.
It's more common to use a union to hold actual objects rather than pointers.
I think most C developers that I respect would not bother to union different pointers together; if a general-purpose pointer is needed, just using void * certainly is "the C way". The language sacrifices a lot of safety in order to allow you to deliberately alias the types of things; considering what we have paid for this feature we might as well use it when it simplifies the code. That's why the escapes from strict typing have always been there.
The union approach requires that you know a priori all the types that might be used. The void * approach allows storing data types that might not even exist when the code in question is written (though doing much with such an unknown data type can be tricky, such as requiring passing a pointer to a function to be invoked on that data instead of being able to process it directly).
Edit: Since there seems to be some misunderstanding about how to use an unknown data type: in most cases, you provide some sort of "registration" function. In a typical case, you pass in pointers to functions that can carry out all the operations you need on an item being stored. It generates and returns a new index to be used for the value that identifies the type. Then when you want to store an object of that type, you set its identifier to the value you got back from the registration, and when the code that works with the objects needs to do something with that object, it invokes the appropriate function via the pointer you passed in. In a typical case, those pointers to functions will be in a struct, and it'll simply store (pointers to) those structs in an array. The identifier value it returns from registration is just the index into the array of those structs where it has stored this particular one.
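The registration scheme described above can be sketched roughly as follows. Everything here is hypothetical (`type_ops`, `register_type`, `item`, `item_to_int`, the single `to_int` operation, the fixed `MAX_TYPES` limit); a real container would register whatever operations it actually needs. For brevity the array stores copies of the structs rather than pointers to them.

```c
#include <stddef.h>

/* Operations the container needs for each stored type. */
struct type_ops {
    int (*to_int)(const void *data);   /* example operation */
};

#define MAX_TYPES 16

static struct type_ops registry[MAX_TYPES];
static int registered_count = 0;

/* Returns the new type identifier (an index into registry), or -1 if full. */
int register_type(struct type_ops ops)
{
    if (registered_count >= MAX_TYPES)
        return -1;
    registry[registered_count] = ops;
    return registered_count++;
}

struct item {
    int   type_id;   /* value returned by register_type */
    void *data;      /* object of a type the container knows nothing about */
};

/* Dispatch through the function pointer registered for the item's type. */
int item_to_int(const struct item *it)
{
    return registry[it->type_id].to_int(it->data);
}

/* A client type that did not exist when the code above was written. */
struct point { int x, y; };

static int point_to_int(const void *data)
{
    const struct point *p = data;
    return p->x + p->y;
}
```

A client would call `register_type` once with `{ .to_int = point_to_int }`, keep the returned identifier, and store it in every `item` that wraps a `struct point`.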
Although unions are not common nowadays, a union is more definitive for your usage scenario, so it suits it well. In the first code sample, it is not clear what data contains.
My preference would be to go the union route. The cast from void* is a blunt instrument and accessing the datum through a properly typed pointer gives a bit of extra safety.
Toss a coin. Union is more commonly used with non-pointer types, so it looks a bit odd here. However the explicit type specification it provides is decent implicit documentation. void* would be fine so long as you always know you're only going to access pointers. Don't start putting integers in there and relying on sizeof(void*) == sizeof (int).
I don't feel like either way has any advantage over the other in the end.
It's a bit obscured in your example, because you're using pointers and hence indirection. But union certainly does have its advantages.
Imagine:
struct my_struct {
short datatype;
union {
char c;
int i;
long l;
};
};
Now you don't have to worry about where the allocation for the value part comes from. No separate malloc() or anything like that. And you might find that accesses to ->c, ->i, and ->l are a bit faster. (Though this might only make a difference if there are lots of these accesses.)
It really depends on the problem you're trying to solve. Without that context it's really impossible to evaluate which would be better.
For example, if you're trying to build a generic container like a list or a queue that can handle arbitrary data types, then the void pointer approach is preferable. OTOH, if you're limiting yourself to a small set of primitive data types, then the union approach can save you some time and effort.
If you build your code with -fstrict-aliasing (gcc) or similar options on other compilers, then you have to be very careful with how you do your casting. You can cast a pointer as much as you want, but when you dereference it, the pointer type that you use for the dereference must match the original type (with some exceptions). You can't for example do something like:
void foo(void * p)
{
short * pSubSetOfInt = (short *)p ;
*pSubSetOfInt = 0xFFFF ;
}
void goo()
{
int intValue = 0 ;
foo( &intValue ) ;
printf( "0x%X\n", intValue ) ;
}
Don't be surprised if this prints 0 (say) instead of 0xFFFF or 0xFFFF0000 as you may expect when building with optimization. One way to make this code work is to do the same thing using a union, and the code will probably be easier to understand too.
The union reserves enough space for its largest member, and the members don't have to be the same size. A void* has a fixed size, whereas a union can be of arbitrary size.
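A sketch of the union variant mentioned above, under the assumption that int is exactly twice as wide as short. The names `int_or_shorts` and `set_first_half` are made up for the illustration.

```c
/* Assumption: int is exactly twice as wide as short on this platform. */
_Static_assert(sizeof(int) == 2 * sizeof(short), "layout assumption");

union int_or_shorts {
    int   whole;
    short halves[2];
};

/* Write one short-sized half through the union, then read back the whole
   int. Both accesses go through the same union object, so the compiler
   cannot assume they refer to unrelated memory. */
int set_first_half(void)
{
    union int_or_shorts v;
    v.whole = 0;
    v.halves[0] = -1;   /* all bits set in the first half */
    return v.whole;
}
```

With a 16-bit short and a 32-bit int, the result has the bit pattern 0x0000FFFF on a little-endian machine and 0xFFFF0000 on a big-endian one; which half comes "first" still depends on the platform.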
#include <stdio.h>
#include <stdlib.h>
struct m1 {
union {
char c[100];
};
};
struct m2 {
void * c;
};
int
main()
{
printf("sizeof m1 is %zu ",sizeof(struct m1)); /* %zu is the format for size_t */
printf("sizeof m2 is %zu",sizeof(struct m2));
exit(EXIT_SUCCESS);
}
Output:
sizeof m1 is 100 sizeof m2 is 4
EDIT: assuming you only use pointers of the same size as void*, I think the union is better, as you will gain a bit of error detection when trying to set .c with an integer pointer, etc.
void*, unless you're creating your own allocator, is definitely quick and dirty, for better or for worse.

Enforce strong type checking in C (type strictness for typedefs)

Is there a way to enforce an explicit cast for typedefs of the same type? I have to deal with UTF-8 and sometimes I get confused between the indices for the character count and the byte count. So it would be nice to have some typedefs:
typedef unsigned int char_idx_t;
typedef unsigned int byte_idx_t;
With the addition that you need an explicit cast between them:
char_idx_t a = 0;
byte_idx_t b;
b = a; // compile warning
b = (byte_idx_t) a; // ok
I know that such a feature doesn't exist in C, but maybe you know a trick or a compiler extension (preferable gcc) that does that.
EDIT
I still don't really like the Hungarian notation in general. I couldn't use it for this problem because of project coding conventions, but I used it now in another similar case, where also the types are the same and the meanings are very similar. And I have to admit: it helps. I never would go and declare every integer with a starting "i", but as in Joel's example for overlapping types, it can be life saving.
For "handle" types (opaque pointers), Microsoft uses the trick of declaring structures and then typedef'ing a pointer to the structure:
#define DECLARE_HANDLE(name) struct name##__ { int unused; }; \
typedef struct name##__ *name
Then instead of
typedef void* FOOHANDLE;
typedef void* BARHANDLE;
They do:
DECLARE_HANDLE(FOOHANDLE);
DECLARE_HANDLE(BARHANDLE);
So now, this works:
FOOHANDLE make_foo();
BARHANDLE make_bar();
void do_bar(BARHANDLE);
FOOHANDLE foo = make_foo(); /* ok */
BARHANDLE bar = foo; /* won't work! */
do_bar(foo); /* won't work! */
You could do something like:
typedef struct {
unsigned int c_idx;
} char_idx;
typedef struct {
unsigned int b_idx;
} byte_idx;
Then you would see when you are using each:
char_idx a;
byte_idx b;
b.b_idx = a.c_idx;
Now it is more clear that they are different types but would still compile.
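Building on the struct wrappers above, here is a sketch of how the two index types can be kept apart while still offering one explicit, named conversion. It uses `size_t` instead of `unsigned int` and a common member name `value`; the function `byte_idx_of` is hypothetical and assumes the string is valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

typedef struct { size_t value; } char_idx;
typedef struct { size_t value; } byte_idx;

/* Converts a character index into a byte index by walking the string.
   If c is past the end, the index of the terminating NUL is returned. */
byte_idx byte_idx_of(const char *s, char_idx c)
{
    byte_idx b = { 0 };
    size_t chars_seen = 0;
    while (s[b.value] != '\0') {
        /* Bytes of the form 10xxxxxx continue the previous character. */
        if (((unsigned char)s[b.value] & 0xC0) != 0x80) {
            if (chars_seen == c.value)
                break;          /* b now points at the start of character c */
            chars_seen++;
        }
        b.value++;
    }
    return b;
}
```

Because `char_idx` and `byte_idx` are distinct struct types, `byte_idx b = c;` with `c` of type `char_idx` is rejected by the compiler, and so is passing a `byte_idx` where `byte_idx_of` expects a `char_idx`. Only member-by-member copies such as `b.value = c.value;` still get through.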
What you want is called "strong typedef" or "strict typedef".
Some programming languages [Rust, D, Haskell, Ada, ...] give some support for this at the language level; C[++] does not. There was a proposal to include it in the language under the name "opaque typedef", but it was not accepted.
The lack of language support is really not a problem though. Just wrap the type to be aliased into a new class having exactly 1 data member, of type T. Much of the repetition can be factored out by templates and macros. This simple technique is just as convenient as in the programming languages with direct support.
Use a lint. See Splint:Types and strong type check.
Strong type checking often reveals
programming errors. Splint can check
primitive C types more strictly and
flexibly than typical compilers (4.1)
and provides support a Boolean type
(4.2). In addition, users can define
abstract types that provide
information hiding (0).
In C, the only distinction between user-defined types that is enforced by the compiler is the distinction between structs. Any typedef involving distinct structs will work. Your major design question is should different struct types use the same member names? If so, you can simulate some polymorphic code using macros and other scurvy tricks. If not, you are really committed to two different representations. E.g., do you want to be able to
#define INCREMENT(s, k) ((s).n += (k))
and use INCREMENT on both byte_idx and char_idx? Then name the fields identically.
You asked about extensions. Jeff Foster's CQual is very nice, and I think it could do the job you want.
With C++11 you can use an enum class, e.g.
enum class char_idx_t : unsigned int {};
enum class byte_idx_t : unsigned int {};
The compiler will enforce an explicit cast between the two types; it is like a thin wrapper class. Unfortunately you won't have operator overloading, e.g. if you want to add two char_idx_t together you will have to cast them to unsigned int.
If you were writing C++, you could make two identically defined classes with different names that were wrappers around an unsigned int. I don't know of a trick to do what you want in C.
Use strong typedef as defined in BOOST_STRONG_TYPEDEF
