If we declare char *p = "hello"; then, since the string is stored in the data section, we cannot modify the contents p points to, but we can modify the pointer itself. But I found this example in C Traps and Pitfalls
Andrew Koenig
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
the example is
char *p, *q;
p = "xyz";
q = p;
q[1] = 'Y';
q would point to memory containing the string xYz. So would p, because p and q point to the same memory.
How can that be true if the first statement I mentioned is also true?
Similarly, I ran the following code
main()
{
char *p="hai friends",*p1;
p1=p;
while(*p!='\0') ++*p++;
printf("%s %s",p,p1);
}
and got the output as
ibj!gsjfoet
Please explain how we are able to modify the contents in both these cases.
thanks in advance
Your same example causes a segmentation fault on my system.
You're running into undefined behavior here. .data (note that the string literal might be in .text too) is not necessarily immutable - there is no guarantee that the machine will write protect that memory (via page tables), depending on the operating system and compiler.
Only your OS can guarantee that stuff in the data section is read-only, and even that involves setting segment limits and access flags and using far pointers and such, so it's not always done.
C itself has no such limitation; in a flat memory model (which almost all 32-bit OSes use these days), any bytes in your address space are potentially writable, even stuff in your code section. If you had a pointer to main(), and some knowledge of machine language, and an OS that had stuff set up just right (or rather, failed to prevent it), you could potentially rewrite it to just return 0. Note that this is all black magic of a sort, and is rarely done intentionally, but it's part of what makes C such a powerful language for systems programming.
Even if you can do this and it seems that there are no errors, it's a bad idea. Depending on the program in question, you could end up making it very easy for buffer overflow attacks. A good article explaining this is:
https://www.securecoding.cert.org/confluence/display/seccode/STR30-C.+Do+not+attempt+to+modify+string+literals
It'll depend on the compiler as to whether that works or not.
x86 is a von Neumann architecture (as opposed to Harvard), so there's no clear difference between the 'data' and 'program' memory at the basic level (i.e. the compiler isn't forced into having different types for program vs data memory, and so won't necessarily restrict any variable to one or the other).
So one compiler may allow modification of the string while another does not.
My guess is that a more lenient compiler (e.g. cl, the MS Visual Studio C++ compiler) would allow this, while a more strict compiler (e.g. gcc) would not. If your compiler allows it, chances are it's effectively changing your code to something like:
...
char p[] = "hai friends";
char *p1 = p;
...
// (some disassembly required to really see what it's done though)
perhaps with the 'good intention' of allowing new C/C++ coders to code with less restriction / fewer confusing errors. (whether this is a 'Good Thing' is up to much debate and I will keep my opinions mostly out of this post :P)
Out of interest, what compiler did you use?
In olden days, when C as described by K & R in their book "The C Programming Language" was the ahem "standard", what you describe was perfectly OK. In fact, some compilers jumped through hoops to make string literals writable. They'd laboriously copy the strings from the text segment to the data segment on initialisation.
Older versions of gcc had a flag to restore this behaviour: -fwritable-strings (it was removed in GCC 4.0).
main()
{
int i = 0;
char *p= "hai friends", *p1;
p1 = p;
while(*(p + i) != '\0')
{
*(p + i);
i++;
}
printf("%s %s", p, p1);
return 0;
}
This code will give output: hai friends hai friends
Modifying string literals is a bad idea, but that doesn't mean it might not work.
One really good reason not to: your compiler is allowed to take multiple instances of the same string literal and make them point to the same block of memory. So if "xyz" was defined somewhere else in your code, you could inadvertently break other code that was expecting it to be constant.
Your program also works on my system (Windows + Cygwin). However, the standard says you shouldn't do that, and the consequence is undefined.
Following excerpt from the book C: A Reference Manual 5/E, page 33,
You should never attempt to modify the memory that holds the characters of a string constant since it may be read-only
char p1[] = "Always writable";
char *p2 = "Possibly not writable";
const char p3[] = "Never writable";
Writing through p1 will always work; writing through p2 may work or may cause a run-time error; writing to p3 will always cause a compile-time error.
While modifying a string literal may be possible on your system, that's a quirk of your platform, rather than a guarantee of the language. The actual C language doesn't know anything about .data sections, or .text sections. That's all implementation detail.
On some embedded systems, you won't even have a filesystem to contain a file with a .text section. On some such systems, your string literals will be stored in ROM, and trying to write to the ROM will just crash the device.
If you write code that depends on undefined behavior, and only works on your platform, you can be guaranteed that sooner or later, somebody will think it is a good idea to port it to some new device that doesn't work the way you expected. When that happens, an angry pack of embedded developers will hunt you down and stab you.
p is effectively pointing to read-only memory. Assigning to the array p points to is undefined behavior. Just because the compiler lets you get away with it doesn't mean it's OK.
Take a look at this question from the C-FAQ: comp.lang.c FAQ list · Question 1.32
Q: What is the difference between these initializations?
char a[] = "string literal";
char *p = "string literal";
My program crashes if I try to assign a new value to p[i].
A: A string literal (the formal term for a double-quoted string in C source) can be used in two slightly different ways:
1. As the initializer for an array of char, as in the declaration of char a[], it specifies the initial values of the characters in that array (and, if necessary, its size).
2. Anywhere else, it turns into an unnamed, static array of characters, and this unnamed array may be stored in read-only memory, and which therefore cannot necessarily be modified. In an expression context, the array is converted at once to a pointer, as usual (see section 6), so the second declaration initializes p to point to the unnamed array's first element.
Some compilers have a switch controlling whether string literals are writable or not (for compiling old code), and some may have options to cause string literals to be formally treated as arrays of const char (for better error catching).
I think you are confused about a very important general concept in C, C++ and other low-level languages. In a low-level language there is an implicit assumption that the programmer knows what s/he is doing and makes no programming errors.
This assumption allows the implementers of the language to simply ignore what should happen if the programmer violates the rules. The end effect is that in C or C++ there is no "runtime error" guarantee... if you do something bad, what is going to happen is simply NOT DEFINED ("undefined behaviour" is the legalese term). Maybe a crash (if you're very lucky), or maybe apparently nothing (unfortunately, most of the time... with maybe a crash in a perfectly valid place one million executed instructions later).
For example, if you access outside of an array, MAYBE you will get a crash, maybe not, maybe even a daemon will come out of your nose (this is the "nasal daemon" you may find on the internet). It's just not something that whoever wrote the compiler had to think about.
Just never do that (if you care about writing decent programs).
An additional burden on whoever uses low-level languages is that you must learn all the rules very well and you must never violate them. If you violate a rule you cannot expect a "runtime error angel" to help you... only "undefined behaviour daemons" are present down there.
Related
In K&R (The C Programming Language 2nd Edition) chapter 5 I read the following:
First, pointers may be compared under certain circumstances.
If p and q point to members of the same array, then relations like ==, !=, <, >=, etc. work properly.
Which seems to imply that only pointers pointing to the same array can be compared.
However when I tried this code
char t = 't';
char *pt = &t;
char x = 'x';
char *px = &x;
printf("%d\n", pt > px);
1 is printed to the screen.
First of all, I thought I would get undefined or some type or error, because pt and px aren't pointing to the same array (at least in my understanding).
Also is pt > px because both pointers are pointing to variables stored on the stack, and the stack grows down, so the memory address of t is greater than that of x? Which is why pt > px is true?
I get more confused when malloc is brought in. Also in K&R in chapter 8.7 the following is written:
There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which the general pointer comparison is meaningful.
I had no issue comparing pointers that pointed to space malloced on the heap to pointers that pointed to stack variables.
For example, the following code worked fine, with 1 being printed:
char t = 't';
char *pt = &t;
char *px = malloc(10);
strcpy(px, pt);
printf("%d\n", pt > px);
Based on my experiments with my compiler, I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.
Still, I am confused by what I am reading in K&R.
The reason I'm asking is because my prof. actually made it an exam question. He gave the following code:
struct A {
char *p0;
char *p1;
};
int main(int argc, char **argv) {
char a = 0;
char *b = "W";
char c[] = { 'L', 'O', 'L', 0 };
struct A p[3];
p[0].p0 = &a;
p[1].p0 = b;
p[2].p0 = c;
for(int i = 0; i < 3; i++) {
p[i].p1 = malloc(10);
strcpy(p[i].p1, p[i].p0);
}
}
What do these evaluate to:
p[0].p0 < p[0].p1
p[1].p0 < p[1].p1
p[2].p0 < p[2].p1
The answer is 0, 1, and 0.
(My professor does include the disclaimer on the exam that the questions are for a Ubuntu Linux 16.04, 64-bit version programming environment)
(editor's note: if SO allowed more tags, that last part would warrant x86-64, linux, and maybe assembly. If the point of the question / class was specifically low-level OS implementation details, rather than portable C.)
According to the C11 standard, the relational operators <, <=, >, and >= may only be used on pointers to elements of the same array or struct object. This is spelled out in section 6.5.8p5:
When two pointers are compared, the result depends on the
relative locations in the address space of the objects pointed to.
If two pointers to object types both point to the same object, or
both point one past the last element of the same array
object, they compare equal. If the objects pointed to are
members of the same aggregate object, pointers to structure
members declared later compare greater than pointers to
members declared earlier in the structure, and pointers to
array elements with larger subscript values compare greater than
pointers to elements of the same array with lower subscript values.
All pointers to members of the same union object compare
equal. If the expression P points to an element of an array
object and the expression Q points to the last element of the
same array object, the pointer expression Q+1 compares greater than P.
In all other cases, the behavior is undefined.
Note that any comparisons that do not satisfy this requirement invoke undefined behavior, meaning (among other things) that you can't depend on the results to be repeatable.
In your particular case, for both the comparison between the addresses of two local variables and between the address of a local and a dynamic address, the operation appeared to "work", however the result could change by making a seemingly unrelated change to your code or even compiling the same code with different optimization settings. With undefined behavior, just because the code could crash or generate an error doesn't mean it will.
As an example, an x86 processor running in 8086 real mode has a segmented memory model using a 16-bit segment and a 16-bit offset to build a 20-bit address. So in this case an address doesn't convert exactly to an integer.
The equality operators == and != however do not have this restriction. They can be used between any two pointers to compatible types or NULL pointers. So using == or != in both of your examples would produce valid C code.
However, even with == and != you could get some unexpected yet still well-defined results. See Can an equality comparison of unrelated pointers evaluate to true? for more details on this.
Regarding the exam question given by your professor, it makes a number of flawed assumptions:
A flat memory model exists where there is a 1-to-1 correspondence between an address and an integer value.
That the converted pointer values fit inside an integer type.
That the implementation simply treats pointers as integers when performing comparisons without exploiting the freedom given by undefined behavior.
That a stack is used and that local variables are stored there.
That a heap is used to pull allocated memory from.
That the stack (and therefore local variables) appears at a higher address than the heap (and therefore allocated objects).
That string constants appear at a lower address than the heap.
If you were to run this code on an architecture and/or with a compiler that does not satisfy these assumptions then you could get very different results.
Also, both examples exhibit undefined behavior when they call strcpy, since the source argument (in some cases) points to a single character and not a null-terminated string, resulting in the function reading past the bounds of the given variable.
The primary issue with comparing pointers to two distinct arrays of the same type is that the arrays themselves need not be placed in a particular relative positioning--one could end up before or after the other.
First of all, I thought I would get undefined or some type or error, because pt and px aren't pointing to the same array (at least in my understanding).
No, the result is dependent on implementation and other unpredictable factors.
Also is pt>px because both pointers are pointing to variables stored on the stack, and the stack grows down, so the memory address of t is greater than that of x? Which is why pt>px is true?
There isn't necessarily a stack. When it exists, it need not grow down. It could grow up. It could be non-contiguous in some bizarre way.
Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.
Let's look at the C specification, §6.5.8 on page 85 which discusses relational operators (i.e. the comparison operators you're using). Note that this does not apply to direct != or == comparison.
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. ... If the objects pointed to are members of the same aggregate object, ... pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values.
In all other cases, the behavior is undefined.
The last sentence is important. While I cut down some unrelated cases to save space, there's one case that's important to us: two arrays, not part of the same struct/aggregate object [1], and we're comparing pointers to those two arrays. This is undefined behavior.
While your compiler just inserted some sort of CMP (compare) machine instruction which numerically compares the pointers, and you got lucky here, UB is a pretty dangerous beast. Literally anything can happen--your compiler could optimize out the whole function including visible side effects. It could spawn nasal demons.
[1] Pointers into two different arrays that are part of the same struct can be compared, since this falls under the clause where the two arrays are part of the same aggregate object (the struct).
Then asked what
p[0].p0 < p[0].p1
p[1].p0 < p[1].p1
p[2].p0 < p[2].p1
Evaluate to. The answer is 0, 1, and 0.
These questions reduce to:
1. Is the heap above or below the stack?
2. Is the heap above or below the string literal section of the program?
3. Same as 1.
And the answer to all three is "implementation defined". Your prof's questions are bogus; they are based on the traditional Unix layout:
<empty>
text
rodata
rwdata
bss
< empty, used for heap >
...
stack
kernel
but several modern unices (and alternative systems) do not conform to those traditions. Unless they prefaced the question with "as of 1992", make sure to give a -1 on the eval.
On almost any remotely-modern platform, pointers and integers have an isomorphic ordering relation, and pointers to disjoint objects are not interleaved. Most compilers expose this ordering to programmers when optimizations are disabled, but the Standard makes no distinction between platforms that have such an ordering and those that don't and does not require that any implementations expose such an ordering to the programmer even on platforms that would define it. Consequently, some compiler writers perform various kinds of optimizations and "optimizations" based upon an assumption that code will never use relational operators on pointers to different objects.
According to the published Rationale, the authors of the Standard intended that implementations extend the language by specifying how they will behave in situations the Standard characterizes as "Undefined Behavior" (i.e. where the Standard imposes no requirements) when doing so would be useful and practical, but some compiler writers would rather assume programs will never try to benefit from anything beyond what the Standard mandates, than allow programs to usefully exploit behaviors the platforms could support at no extra cost.
I'm not aware of any commercially-designed compilers that do anything weird with pointer comparisons, but as compilers move to the non-commercial LLVM for their back end, they're increasingly likely to process nonsensically code whose behavior had been specified by earlier compilers for their platforms. Such behavior isn't limited to relational operators, but can even affect equality/inequality. For example, even though the Standard specifies that a comparison between a pointer to one object and a "just past" pointer to an immediately-preceding object will compare equal, gcc and LLVM-based compilers are prone to generate nonsensical code if programs perform such comparisons.
As an example of a situation where even equality comparison behaves nonsensically in gcc and clang, consider:
extern int x[],y[];
int test(int i)
{
int *p = y+i;
y[0] = 4;
if (p == x+10)
*p = 1;
return y[0];
}
Both clang and gcc will generate code that will always return 4 even if x is ten elements, y immediately follows it, and i is zero resulting in the comparison being true and p[0] being written with the value 1. I think what happens is that one pass of optimization rewrites the function as though *p = 1; were replaced with x[10] = 1;. The latter code would be equivalent if the compiler interpreted *(x+10) as equivalent to *(y+i), but unfortunately a downstream optimization stage recognizes that an access to x[10] would only be defined if x had at least 11 elements, which would make it impossible for that access to affect y.
If compilers can get that "creative" with a pointer equality scenario which is described by the Standard, I would not trust them to refrain from getting even more creative in cases where the Standard doesn't impose requirements.
It's simple: Comparing pointers does not make sense as memory locations for objects are never guaranteed to be in the same order as you declared them.
The exception is arrays. &array[0] is lower than &array[1]. That's what K&R points out. Struct members are a second exception: the standard guarantees that pointers to members declared later compare greater than pointers to members declared earlier in the same struct.
Another exception is comparing pointers for equality. When one pointer is equal to another you know it's pointing to the same object. Whatever it is.
Bad exam question if you ask me. Depending on Ubuntu Linux 16.04, 64-bit version programming environment for an exam question ? Really ?
Pointers are just integers, like everything else in a computer. You absolutely can compare them with < and > and produce results without causing a program to crash. That said, the standard does not guarantee that those results have any meaning outside of array comparisons.
In your example of stack allocated variables, the compiler is free to allocate those variables to registers or stack memory addresses, and in any order it chooses. Comparisons such as < and > therefore won't be consistent across compilers or architectures. However, == and != aren't so restricted; comparing pointers for equality is a valid and useful operation.
What A Provocative Question!
Even cursory scanning of the responses and comments in this thread will reveal how emotive your seemingly simple and straight forward query turns out to be.
It should not be surprising.
Inarguably, misunderstandings around the concept and use of pointers represent a predominant cause of serious failures in programming in general.
Recognition of this reality is readily evident in the ubiquity of languages designed specifically to address, and preferably to avoid the challenges pointers introduce altogether. Think C++ and other derivatives of C, Java and its relations, Python and other scripts -- merely as the more prominent and prevalent ones, and more or less ordered in severity of dealing with the issue.
Developing a deeper understanding of the underlying principles must therefore be pertinent to every individual who aspires to excellence in programming -- especially at the systems level.
I imagine this is precisely what your teacher means to demonstrate.
And the nature of C makes it a convenient vehicle for this exploration. Less clearly than assembly -- though perhaps more readily comprehensible -- and still far more explicitly than languages based on deeper abstraction of the execution environment.
Designed to facilitate deterministic translation of the programmer’s intent into instructions that machines can comprehend, C is a system level language. While classified as high-level, it really belongs in a ‘medium’ category; but since none such exists, the ‘system’ designation has to suffice.
This characteristic is largely responsible for making it a language of choice for device drivers, operating system code, and embedded implementations. Furthermore, a deservedly favoured alternative in applications where optimal efficiency is paramount; where that means the difference between survival and extinction, and therefore is a necessity as opposed to a luxury. In such instances, the attractive convenience of portability loses all its allure, and opting for the lack-lustre performance of the least common denominator becomes an unthinkably detrimental option.
What makes C -- and some of its derivatives -- quite special, is that it allows its users complete control -- when that is what they desire -- without imposing the related responsibilities upon them when they do not. Nevertheless, it never offers more than the thinnest of insulations from the machine, wherefore proper use demands exacting comprehension of the concept of pointers.
In essence, the answer to your question is sublimely simple and satisfyingly sweet -- in confirmation of your suspicions. Provided, however, that one attaches the requisite significance to every concept in this statement:
The acts of examining, comparing and manipulating pointers are always and necessarily valid, while the conclusions derived from the result depend on the validity of the values contained, and thus need not be.
The former is both invariably safe and potentially proper, while the latter can only ever be proper when it has been established as safe. Surprisingly -- to some -- so establishing the validity of the latter depends on and demands the former.
Of course, part of the confusion arises from the effect of the recursion inherently present within the principle of a pointer -- and the challenges posed in differentiating content from address.
You have quite correctly surmised,
I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.
And several contributors have affirmed: pointers are just numbers. Sometimes something closer to complex numbers, but still no more than numbers.
The amusing acrimony in which this contention has been received here reveals more about human nature than programming, but remains worthy of note and elaboration. Perhaps we will do so later...
As one comment begins to hint; all this confusion and consternation derives from the need to discern what is valid from what is safe, but that is an oversimplification. We must also distinguish what is functional and what is reliable, what is practical and what may be proper, and further still: what is proper in a particular circumstance from what may be proper in a more general sense. Not to mention; the difference between conformity and propriety.
Toward that end, we first need to appreciate precisely what a pointer is.
You have demonstrated a firm grip on the concept, and like some others may find these illustrations patronizingly simplistic, but the level of confusion evident here demands such simplicity in clarification.
As several have pointed out: the term pointer is merely a special name for what is simply an index, and thus nothing more than any other number.
This should already be self-evident in consideration of the fact that all contemporary mainstream computers are binary machines that necessarily work exclusively with and on numbers. Quantum computing may change that, but that is highly unlikely, and it has not come of age.
Technically, as you have noted, pointers are more accurately addresses; an obvious insight that naturally introduces the rewarding analogy of correlating them with the ‘addresses’ of houses, or plots on a street.
In a flat memory model: the entire system memory is organized in a single, linear sequence: all houses in the city lie on the same road, and every house is uniquely identified by its number alone. Delightfully simple.
In segmented schemes: a hierarchical organization of numbered roads is introduced above that of numbered houses so that composite addresses are required.
Some implementations are still more convoluted, and the totality of distinct ‘roads’ need not sum to a contiguous sequence, but none of that changes anything about the underlying.
We are necessarily able to decompose every such hierarchical link back into a flat organization. The more complex the organization, the more hoops we will have to hop through in order to do so, but it must be possible. Indeed, this also applies to ‘real mode’ on x86.
Otherwise the mapping of links to locations would not be bijective, as reliable execution -- at the system level -- demands that it MUST be.
multiple addresses must not map to singular memory locations, and
singular addresses must never map to multiple memory locations.
Bringing us to the further twist that turns the conundrum into such a fascinatingly complicated tangle. Above, it was expedient to suggest that pointers are addresses, for the sake of simplicity and clarity. Of course, this is not correct. A pointer is not an address; a pointer is a reference to an address, it contains an address. Like the envelope sports a reference to the house. Contemplating this may lead you to glimpse what was meant with the suggestion of recursion contained in the concept. Still; we have only so many words, and talking about the addresses of references to addresses and such, soon stalls most brains at an invalid op-code exception. And for the most part, intent is readily garnered from context, so let us return to the street.
Postal workers in this imaginary city of ours are much like the ones we find in the ‘real’ world. No one is likely to suffer a stroke when you talk or enquire about an invalid address, but every last one will balk when you ask them to act on that information.
Suppose there are only 20 houses on our singular street. Further pretend that some misguided, or dyslexic soul has directed a letter, a very important one, to number 71. Now, we can ask our carrier Frank, whether there is such an address, and he will simply and calmly report: no. We can even expect him to estimate how far outside the street this location would lie if it did exist: roughly 2.5 times further than the end. None of this will cause him any exasperation. However, if we were to ask him to deliver this letter, or to pick up an item from that place, he is likely to be quite frank about his displeasure, and refusal to comply.
Pointers are just addresses, and addresses are just numbers.
Verify the output of the following:
void foo( void *p ) {
    printf("%p\t%zu\t%d\n", p, (size_t)p, p == (void *)(size_t)p);
}
Call it on as many pointers as you like, valid or not. Please do post your findings if it fails on your platform, or your (contemporary) compiler complains.
Now, because pointers are simply numbers, it is inevitably valid to compare them. In one sense this is precisely what your teacher is demonstrating. All of the following statements are perfectly valid -- and proper! -- C, and when compiled will run without encountering problems, even though neither pointer need be initialized and the values they contain therefore may be undefined:
We are only calculating result explicitly for the sake of clarity, and printing it to force the compiler to compute what would otherwise be redundant, dead code.
void foo( size_t *a, size_t *b ) {
    size_t result;
    result = (size_t)a;
    printf("%zu\n", result);
    result = a == b;
    printf("%zu\n", result);
    result = a < b;
    printf("%zu\n", result);
    result = a - b;
    printf("%zu\n", result);
}
Of course, the program is ill-formed when either a or b is undefined (read: not properly initialized) at the point of testing, but that is utterly irrelevant to this part of our discussion. These snippets, as too the following statements, are guaranteed -- by the ‘standard’ -- to compile and run flawlessly, notwithstanding the IN-validity of any pointer involved.
Problems only arise when an invalid pointer is dereferenced. When we ask Frank to pick up or deliver at the invalid, non-existent address.
Given any arbitrary pointer:
int *p;
While this statement must compile and run:
printf("%p", (void *)p);
... as must this:
size_t foo( int *p ) { return (size_t)p; }
... the following two, in stark contrast, will still readily compile, but fail in execution unless the pointer is valid -- by which we here merely mean that it references an address to which the present application has been granted access:
printf("%d", *p);
size_t foo( int *p ) { return *p; }
How subtle the change? The distinction lies in the difference between the value of the pointer -- which is the address, and the value of the contents: of the house at that number. No problem arises until the pointer is dereferenced; until an attempt is made to access the address it links to. In trying to deliver or pick up the package beyond the stretch of the road...
By extension, the same principle necessarily applies to more complex examples, including the aforementioned need to establish the requisite validity:
int* validate( int *p, int *head, int *tail ) {
return p >= head && p <= tail ? p : NULL;
}
Relational comparison and arithmetic offer identical utility to testing equivalence, and are equivalently valid -- in principle. However, what the results of such computation would signify, is a different matter entirely -- and precisely the issue addressed by the quotations you included.
In C, an array is a contiguous buffer, an uninterrupted linear series of memory locations. Comparison and arithmetic applied to pointers that reference locations within such a singular series are naturally, and obviously meaningful in relation both to each other, and to this ‘array’ (which is simply identified by the base). Precisely the same applies to every block allocated through malloc, or sbrk. Because these relationships are implicit, the compiler is able to establish valid relationships between them, and therefore can be confident that calculations will provide the answers anticipated.
Performing similar gymnastics on pointers that reference distinct blocks or arrays does not offer any such inherent and apparent utility. The more so since whatever relation exists at one moment may be invalidated by a subsequent reallocation, after which the relative order is highly likely to change, or even be inverted. In such instances the compiler is unable to obtain the information needed to establish the confidence it had in the previous situation.
You, however, as the programmer, may have such knowledge! And in some instances are obliged to exploit that.
There ARE, therefore, circumstances in which EVEN THIS is entirely VALID and perfectly PROPER.
In fact, that is exactly what malloc itself has to do internally when time comes to try merging reclaimed blocks -- on the vast majority of architectures. The same is true for the operating system allocator, like that behind sbrk; if more obviously, frequently, on more disparate entities, more critically -- and relevant also on platforms where this malloc may not be. And how many of those are not written in C?
The validity, security and success of an action is inevitably the consequence of the level of insight upon which it is premised and applied.
In the quotes you have offered, Kernighan and Ritchie are addressing a closely related, but nonetheless separate issue. They are defining the limitations of the language, and explaining how you may exploit the capabilities of the compiler to protect you by at least detecting potentially erroneous constructs. They are describing the lengths the mechanism is able -- is designed -- to go to in order to assist you in your programming task. The compiler is your servant, you are the master. A wise master, however, is one that is intimately familiar with the capabilities of his various servants.
Within this context, undefined behaviour serves to indicate potential danger and the possibility of harm; not to imply imminent, irreversible doom, or the end of the world as we know it. It simply means that we -- ‘meaning the compiler’ -- are not able to make any conjecture about what this thing may be, or represent and for this reason we choose to wash our hands of the matter. We will not be held accountable for any misadventure that may result from the use, or mis-use of this facility.
In effect, it simply says: ‘Beyond this point, cowboy: you are on your own...’
Your professor is seeking to demonstrate the finer nuances to you.
Notice what great care they have taken in crafting their example; and how brittle it still is. By taking the address of a, in
p[0].p0 = &a;
the compiler is coerced into allocating actual storage for the variable, rather than placing it in a register. Because it is an automatic variable, however, the programmer has no control over where it is placed, and so is unable to make any valid conjecture about what follows it. Which is why a must be set equal to zero for the code to work as expected.
Merely changing this line:
char a = 0;
to this:
char a = 1; // or ANY other value than 0
causes the behaviour of the program to become undefined. At minimum, the first answer will now be 1; but the problem is far more sinister.
Now the code invites disaster.
While still perfectly valid and even conforming to the standard, it now is ill-formed and although sure to compile, may fail in execution on various grounds. For now there are multiple problems -- none of which the compiler is able to recognize.
strcpy will start at the address of a, and proceed beyond this to consume -- and transfer -- byte after byte, until it encounters a null.
The p1 pointer has been initialized to a block of exactly 10 bytes.
If a happens to be placed at the end of a block and the process has no access to what follows, the very next read -- of p0[1] -- will elicit a segfault. This scenario is unlikely on the x86 architecture, but possible.
If the area beyond the address of a is accessible, no read error will occur, but the program still is not saved from misfortune.
If a zero byte happens to occur within the ten starting at the address of a, it may still survive, for then strcpy will stop and at least we will not suffer a write violation.
If it is not faulted for reading amiss, but no zero byte occurs in this span of 10, strcpy will continue and attempt to write beyond the block allocated by malloc.
If this area is not owned by the process, the segfault should immediately be triggered.
The still more disastrous -- and subtle -- situation arises when the following block is owned by the process, for then the error cannot be detected, no signal can be raised, and so it may ‘appear’ still to ‘work’, while it actually will be overwriting other data, your allocator’s management structures, or even code (in certain operating environments).
This is why pointer related bugs can be so hard to track. Imagine these lines buried deep within thousands of lines of intricately related code, that someone else has written, and you are directed to delve through.
Nevertheless, the program must still compile, for it remains perfectly valid and standard conformant C.
These kinds of errors, no standard and no compiler can protect the unwary against. I imagine that is exactly what they are intending to teach you.
Paranoid people constantly seek to change the nature of C to dispose of these problematic possibilities and so save us from ourselves; but that is disingenuous. This is the responsibility we are obliged to accept when we choose to pursue the power and obtain the liberty that more direct and comprehensive control of the machine offers us. Promoters and pursuers of perfection in performance will never accept anything less.
Portability and the generality it represents is a fundamentally separate consideration and all that the standard seeks to address:
This document specifies the form and establishes the interpretation of programs expressed in the programming language C. Its purpose is to promote portability, reliability, maintainability, and efficient execution of C language programs on a variety of computing systems.
Which is why it is perfectly proper to keep it distinct from the definition and technical specification of the language itself. Contrary to what many seem to believe, generality is antithetical to the exceptional and exemplary.
To conclude:
Examining and manipulating pointers themselves is invariably valid and often fruitful. Interpretation of the results, may, or may not be meaningful, but calamity is never invited until the pointer is dereferenced; until an attempt is made to access the address linked to.
Were this not true, programming as we know it -- and love it -- would not have been possible.
In K&R (The C Programming Language 2nd Edition) chapter 5 I read the following:
First, pointers may be compared under certain circumstances.
If p and q point to members of the same array, then relations like ==, !=, <, >=, etc. work properly.
Which seems to imply that only pointers pointing to the same array can be compared.
However when I tried this code
char t = 't';
char *pt = &t;
char x = 'x';
char *px = &x;
printf("%d\n", pt > px);
1 is printed to the screen.
First of all, I thought I would get undefined or some type of error, because pt and px aren't pointing to the same array (at least in my understanding).
Also is pt > px because both pointers are pointing to variables stored on the stack, and the stack grows down, so the memory address of t is greater than that of x? Which is why pt > px is true?
I get more confused when malloc is brought in. Also in K&R in chapter 8.7 the following is written:
There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which the general pointer comparison is meaningful.
I had no issue comparing pointers that pointed to space malloced on the heap to pointers that pointed to stack variables.
For example, the following code worked fine, with 1 being printed:
char t = 't';
char *pt = &t;
char *px = malloc(10);
strcpy(px, pt);
printf("%d\n", pt > px);
Based on my experiments with my compiler, I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.
Still, I am confused by what I am reading in K&R.
The reason I'm asking is because my prof. actually made it an exam question. He gave the following code:
struct A {
char *p0;
char *p1;
};
int main(int argc, char **argv) {
char a = 0;
char *b = "W";
char c[] = { 'L', 'O', 'L', 0 };
struct A p[3];
p[0].p0 = &a;
p[1].p0 = b;
p[2].p0 = c;
for(int i = 0; i < 3; i++) {
p[i].p1 = malloc(10);
strcpy(p[i].p1, p[i].p0);
}
}
What do these evaluate to:
p[0].p0 < p[0].p1
p[1].p0 < p[1].p1
p[2].p0 < p[2].p1
The answer is 0, 1, and 0.
(My professor does include the disclaimer on the exam that the questions are for an Ubuntu Linux 16.04, 64-bit version programming environment)
(editor's note: if SO allowed more tags, that last part would warrant x86-64, linux, and maybe assembly. If the point of the question / class was specifically low-level OS implementation details, rather than portable C.)
According to the C11 standard, the relational operators <, <=, >, and >= may only be used on pointers to elements of the same array or struct object. This is spelled out in section 6.5.8p5:
When two pointers are compared, the result depends on the
relative locations in the address space of the objects pointed to.
If two pointers to object types both point to the same object, or
both point one past the last element of the same array
object, they compare equal. If the objects pointed to are
members of the same aggregate object, pointers to structure
members declared later compare greater than pointers to
members declared earlier in the structure, and pointers to
array elements with larger subscript values compare greater than
pointers to elements of the same array with lower subscript values.
All pointers to members of the same union object compare
equal. If the expression P points to an element of an array
object and the expression Q points to the last element of the
same array object, the pointer expression Q+1 compares greater than P.
In all other cases, the behavior is undefined.
Note that any comparisons that do not satisfy this requirement invoke undefined behavior, meaning (among other things) that you can't depend on the results to be repeatable.
In your particular case, for both the comparison between the addresses of two local variables and between the address of a local and a dynamic address, the operation appeared to "work"; however, the result could change by making a seemingly unrelated change to your code or even compiling the same code with different optimization settings. With undefined behavior, just because the code could crash or generate an error doesn't mean it will.
As an example, an x86 processor running in 8086 real mode has a segmented memory model using a 16-bit segment and a 16-bit offset to build a 20-bit address. So in this case an address doesn't convert exactly to an integer.
The equality operators == and != however do not have this restriction. They can be used between any two pointers to compatible types or NULL pointers. So using == or != in both of your examples would produce valid C code.
However, even with == and != you could get some unexpected yet still well-defined results. See "Can an equality comparison of unrelated pointers evaluate to true?" for more details on this.
Regarding the exam question given by your professor, it makes a number of flawed assumptions:
A flat memory model exists where there is a 1-to-1 correspondence between an address and an integer value.
That the converted pointer values fit inside an integer type.
That the implementation simply treats pointers as integers when performing comparisons without exploiting the freedom given by undefined behavior.
That a stack is used and that local variables are stored there.
That a heap is used to pull allocated memory from.
That the stack (and therefore local variables) appears at a higher address than the heap (and therefore allocated objects).
That string constants appear at a lower address than the heap.
If you were to run this code on an architecture and/or with a compiler that does not satisfy these assumptions then you could get very different results.
Both examples also exhibit undefined behavior when they call strcpy, since the right operand (in some cases) points to a single character and not a null terminated string, resulting in the function reading past the bounds of the given variable.
The primary issue with comparing pointers to two distinct arrays of the same type is that the arrays themselves need not be placed in a particular relative positioning--one could end up before or after the other.
First of all, I thought I would get undefined or some type of error, because pt and px aren't pointing to the same array (at least in my understanding).
No, the result is dependent on implementation and other unpredictable factors.
Also is pt>px because both pointers are pointing to variables stored on the stack, and the stack grows down, so the memory address of t is greater than that of x? Which is why pt>px is true?
There isn't necessarily a stack. When it exists, it need not grow down. It could grow up. It could be non-contiguous in some bizarre way.
Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.
Let's look at the C specification, §6.5.8 on page 85 which discusses relational operators (i.e. the comparison operators you're using). Note that this does not apply to direct != or == comparison.
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. ... If the objects pointed to are members of the same aggregate object, ... pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values.
In all other cases, the behavior is undefined.
The last sentence is important. While I cut down some unrelated cases to save space, there's one case that's important to us: two arrays, not part of the same struct/aggregate object [1], and we're comparing pointers to those two arrays. This is undefined behavior.
While your compiler just inserted some sort of CMP (compare) machine instruction which numerically compares the pointers, and you got lucky here, UB is a pretty dangerous beast. Literally anything can happen--your compiler could optimize out the whole function including visible side effects. It could spawn nasal demons.
[1] Pointers into two different arrays that are part of the same struct can be compared, since this falls under the clause where the two arrays are part of the same aggregate object (the struct).
Then asked what
p[0].p0 < p[0].p1
p[1].p0 < p[1].p1
p[2].p0 < p[2].p1
Evaluate to. The answer is 0, 1, and 0.
These questions reduce to:
Is the heap above or below the stack?
Is the heap above or below the string literal section of the program?
Same as [1].
And the answer to all three is "implementation defined". Your prof's questions are bogus; they are based on the traditional Unix layout:
<empty>
text
rodata
rwdata
bss
< empty, used for heap >
...
stack
kernel
but several modern unices (and alternative systems) do not conform to those traditions. Unless they prefaced the question with "as of 1992", make sure to give a -1 on the eval.
On almost any remotely-modern platform, pointers and integers have an isomorphic ordering relation, and pointers to disjoint objects are not interleaved. Most compilers expose this ordering to programmers when optimizations are disabled, but the Standard makes no distinction between platforms that have such an ordering and those that don't and does not require that any implementations expose such an ordering to the programmer even on platforms that would define it. Consequently, some compiler writers perform various kinds of optimizations and "optimizations" based upon an assumption that code will never use relational operators on pointers to different objects.
According to the published Rationale, the authors of the Standard intended that implementations extend the language by specifying how they will behave in situations the Standard characterizes as "Undefined Behavior" (i.e. where the Standard imposes no requirements) when doing so would be useful and practical, but some compiler writers would rather assume programs will never try to benefit from anything beyond what the Standard mandates, than allow programs to usefully exploit behaviors the platforms could support at no extra cost.
I'm not aware of any commercially-designed compilers that do anything weird with pointer comparisons, but as compilers move to the non-commercial LLVM for their back end, they're increasingly likely to process nonsensically code whose behavior had been specified by earlier compilers for their platforms. Such behavior isn't limited to relational operators, but can even affect equality/inequality. For example, even though the Standard specifies that a comparison between a pointer to one object and a "just past" pointer to an immediately-preceding object will compare equal, gcc and LLVM-based compilers are prone to generate nonsensical code if programs perform such comparisons.
As an example of a situation where even equality comparison behaves nonsensically in gcc and clang, consider:
extern int x[],y[];
int test(int i)
{
int *p = y+i;
y[0] = 4;
if (p == x+10)
*p = 1;
return y[0];
}
Both clang and gcc will generate code that will always return 4 even if x is ten elements, y immediately follows it, and i is zero, resulting in the comparison being true and p[0] being written with the value 1. I think what happens is that one pass of optimization rewrites the function as though *p = 1; were replaced with x[10] = 1;. The latter code would be equivalent if the compiler interpreted *(x+10) as equivalent to *(y+i), but unfortunately a downstream optimization stage recognizes that an access to x[10] would only be defined if x had at least 11 elements, which would make it impossible for that access to affect y.
If compilers can get that "creative" with a pointer equality scenario that is described by the Standard, I would not trust them to refrain from getting even more creative in cases where the Standard doesn't impose requirements.
It's simple: Comparing pointers does not make sense as memory locations for objects are never guaranteed to be in the same order as you declared them.
The exception is arrays. &array[0] is lower than &array[1]. That's what K&R points out. In my experience struct member addresses are also in the order you declare them, and the standard text quoted in this thread does guarantee that ordering for members of the same struct.
Another exception is comparing pointers for equality. When one pointer is equal to another you know it's pointing to the same object, whatever it is.
Bad exam question if you ask me. Depending on an Ubuntu Linux 16.04, 64-bit programming environment for an exam question? Really?
Pointers are just integers, like everything else in a computer. You absolutely can compare them with < and > and produce results without causing a program to crash. That said, the standard does not guarantee that those results have any meaning outside of array comparisons.
In your example of stack allocated variables, the compiler is free to allocate those variables to registers or stack memory addresses, and in any order it chooses. Comparisons such as < and > therefore won't be consistent across compilers or architectures. However, == and != aren't so restricted; comparing pointers for equality is a valid and useful operation.
What A Provocative Question!
Even a cursory scan of the responses and comments in this thread will reveal how emotive your seemingly simple and straightforward query turns out to be.
It should not be surprising.
Inarguably, misunderstandings around the concept and use of pointers represent a predominant cause of serious failures in programming in general.
Recognition of this reality is readily evident in the ubiquity of languages designed specifically to address, and preferably to avoid the challenges pointers introduce altogether. Think C++ and other derivatives of C, Java and its relations, Python and other scripts -- merely as the more prominent and prevalent ones, and more or less ordered in severity of dealing with the issue.
Developing a deeper understanding of the underlying principles must therefore be pertinent to every individual who aspires to excellence in programming -- especially at the systems level.
I imagine this is precisely what your teacher means to demonstrate.
And the nature of C makes it a convenient vehicle for this exploration. Less clearly than assembly -- though perhaps more readily comprehensible -- and still far more explicitly than languages based on deeper abstraction of the execution environment.
Designed to facilitate deterministic translation of the programmer’s intent into instructions that machines can comprehend, C is a system level language. While classified as high-level, it really belongs in a ‘medium’ category; but since none such exists, the ‘system’ designation has to suffice.
This characteristic is largely responsible for making it a language of choice for device drivers, operating system code, and embedded implementations. Furthermore, a deservedly favoured alternative in applications where optimal efficiency is paramount; where that means the difference between survival and extinction, and therefore is a necessity as opposed to a luxury. In such instances, the attractive convenience of portability loses all its allure, and opting for the lack-lustre performance of the least common denominator becomes an unthinkably detrimental option.
What makes C -- and some of its derivatives -- quite special, is that it allows its users complete control -- when that is what they desire -- without imposing the related responsibilities upon them when they do not. Nevertheless, it never offers more than the thinnest of insulations from the machine, wherefore proper use demands exacting comprehension of the concept of pointers.
In essence, the answer to your question is sublimely simple and satisfyingly sweet -- in confirmation of your suspicions. Provided, however, that one attaches the requisite significance to every concept in this statement:
The acts of examining, comparing and manipulating pointers are always and necessarily valid, while the conclusions derived from the result depend on the validity of the values contained, and thus need not be.
The former is both invariably safe and potentially proper, while the latter can only ever be proper when it has been established as safe. Surprisingly -- to some -- so establishing the validity of the latter depends on and demands the former.
Of course, part of the confusion arises from the effect of the recursion inherently present within the principle of a pointer -- and the challenges posed in differentiating content from address.
You have quite correctly surmised,
I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.
And several contributors have affirmed: pointers are just numbers. Sometimes something closer to complex numbers, but still no more than numbers.
The amusing acrimony in which this contention has been received here reveals more about human nature than programming, but remains worthy of note and elaboration. Perhaps we will do so later...
As one comment begins to hint; all this confusion and consternation derives from the need to discern what is valid from what is safe, but that is an oversimplification. We must also distinguish what is functional and what is reliable, what is practical and what may be proper, and further still: what is proper in a particular circumstance from what may be proper in a more general sense. Not to mention; the difference between conformity and propriety.
Toward that end, we first need to appreciate precisely what a pointer is.
You have demonstrated a firm grip on the concept, and like some others may find these illustrations patronizingly simplistic, but the level of confusion evident here demands such simplicity in clarification.
As several have pointed out: the term pointer is merely a special name for what is simply an index, and thus nothing more than any other number.
This should already be self-evident in consideration of the fact that all contemporary mainstream computers are binary machines that necessarily work exclusively with and on numbers. Quantum computing may change that, but that is highly unlikely, and it has not come of age.
Technically, as you have noted, pointers are more accurately addresses; an obvious insight that naturally introduces the rewarding analogy of correlating them with the ‘addresses’ of houses, or plots on a street.
In a flat memory model: the entire system memory is organized in a single, linear sequence: all houses in the city lie on the same road, and every house is uniquely identified by its number alone. Delightfully simple.
In segmented schemes: a hierarchical organization of numbered roads is introduced above that of numbered houses so that composite addresses are required.
Some implementations are still more convoluted, and the totality of distinct ‘roads’ need not sum to a contiguous sequence, but none of that changes anything about the underlying principle.
We are necessarily able to decompose every such hierarchical link back into a flat organization. The more complex the organization, the more hoops we will have to hop through in order to do so, but it must be possible. Indeed, this also applies to ‘real mode’ on x86.
Otherwise the mapping of links to locations would not be bijective, as reliable execution -- at the system level -- demands that it MUST be.
multiple addresses must not map to singular memory locations, and
singular addresses must never map to multiple memory locations.
Bringing us to the further twist that turns the conundrum into such a fascinatingly complicated tangle. Above, it was expedient to suggest that pointers are addresses, for the sake of simplicity and clarity. Of course, this is not correct. A pointer is not an address; a pointer is a reference to an address, it contains an address. Like the envelope sports a reference to the house. Contemplating this may lead you to glimpse what was meant with the suggestion of recursion contained in the concept. Still; we have only so many words, and talking about the addresses of references to addresses and such, soon stalls most brains at an invalid op-code exception. And for the most part, intent is readily garnered from context, so let us return to the street.
Postal workers in this imaginary city of ours are much like the ones we find in the ‘real’ world. No one is likely to suffer a stroke when you talk or enquire about an invalid address, but every last one will balk when you ask them to act on that information.
Suppose there are only 20 houses on our singular street. Further pretend that some misguided, or dyslexic soul has directed a letter, a very important one, to number 71. Now, we can ask our carrier Frank, whether there is such an address, and he will simply and calmly report: no. We can even expect him to estimate how far outside the street this location would lie if it did exist: roughly 2.5 times further than the end. None of this will cause him any exasperation. However, if we were to ask him to deliver this letter, or to pick up an item from that place, he is likely to be quite frank about his displeasure, and refusal to comply.
Pointers are just addresses, and addresses are just numbers.
Verify the output of the following:
void foo( void *p ) {
printf(“%p\t%zu\t%d\n”, p, (size_t)p, p == (size_t)p);
}
Call it on as many pointers as you like, valid or not. Please do post your findings if it fails on your platform, or your (contemporary) compiler complains.
Now, because pointers are simply numbers, it is inevitably valid to compare them. In one sense this is precisely what your teacher is demonstrating. All of the following statements are perfectly valid -- and proper! -- C, and when compiled will run without encountering problems, even though neither pointer need be initialized and the values they contain therefore may be undefined:
We are only calculating result explicitly for the sake of clarity, and printing it to force the compiler to compute what would otherwise be redundant, dead code.
void foo( size_t *a, size_t *b ) {
size_t result;
result = (size_t)a;
printf(“%zu\n”, result);
result = a == b;
printf(“%zu\n”, result);
result = a < b;
printf(“%zu\n”, result);
result = a - b;
printf(“%zu\n”, result);
}
Of course, the program is ill-formed when either a or b is undefined (read: not properly initialized) at the point of testing, but that is utterly irrelevant to this part of our discussion. These snippets, as too the following statements, are guaranteed -- by the ‘standard’ -- to compile and run flawlessly, notwithstanding the IN-validity of any pointer involved.
Problems only arise when an invalid pointer is dereferenced. When we ask Frank to pick up or deliver at the invalid, non-existent address.
Given any arbitrary pointer:
int *p;
While this statement must compile and run:
printf(“%p”, p);
... as must this:
size_t foo( int *p ) { return (size_t)p; }
... the following two, in stark contrast, will still readily compile, but fail in execution unless the pointer is valid -- by which we here merely mean that it references an address to which the present application has been granted access:
printf(“%p”, *p);
size_t foo( int *p ) { return *p; }
How subtle the change? The distinction lies in the difference between the value of the pointer -- which is the address, and the value of the contents: of the house at that number. No problem arises until the pointer is dereferenced; until an attempt is made to access the address it links to. In trying to deliver or pick up the package beyond the stretch of the road...
By extension, the same principle necessarily applies to more complex examples, including the aforementioned need to establish the requisite validity:
int* validate( int *p, int *head, int *tail ) {
return p >= head && p <= tail ? p : NULL;
}
Relational comparison and arithmetic offer identical utility to testing equivalence, and are equivalently valid -- in principle. However, what the results of such computation would signify, is a different matter entirely -- and precisely the issue addressed by the quotations you included.
In C, an array is a contiguous buffer, an uninterrupted linear series of memory locations. Comparison and arithmetic applied to pointers that reference locations within such a singular series are naturally, and obviously meaningful in relation both to each other, and to this ‘array’ (which is simply identified by the base). Precisely the same applies to every block allocated through malloc, or sbrk. Because these relationships are implicit, the compiler is able to establish valid relationships between them, and therefore can be confident that calculations will provide the answers anticipated.
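A minimal sketch of the single-array case described above. The helper name index_in_array is hypothetical and not part of the original discussion; it assumes p points into the given array or one past its end, which is what makes the comparison and the subtraction meaningful.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of p within the n-element array
 * starting at base, or -1 if p does not address one of its elements.
 * Assumes p points into that array or one past its end. */
ptrdiff_t index_in_array(const int *base, size_t n, const int *p)
{
    if (p < base || p >= base + n)
        return -1;
    return p - base;   /* difference is counted in elements, not bytes */
}
```

For an array int arr[4], passing &arr[3] yields 3, while arr + 4 (one past the end) yields -1.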
Performing similar gymnastics on pointers that reference distinct blocks or arrays does not offer any such inherent and apparent utility. The more so since whatever relation exists at one moment may be invalidated by a subsequent reallocation, which is highly likely to change that relation, or even invert it. In such instances the compiler is unable to obtain the necessary information to establish the confidence it had in the previous situation.
You, however, as the programmer, may have such knowledge! And in some instances are obliged to exploit that.
There ARE, therefore, circumstances in which EVEN THIS is entirely VALID and perfectly PROPER.
In fact, that is exactly what malloc itself has to do internally when time comes to try merging reclaimed blocks -- on the vast majority of architectures. The same is true for the operating system allocator, like that behind sbrk; if more obviously, frequently, on more disparate entities, more critically -- and relevant also on platforms where this malloc may not be. And how many of those are not written in C?
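As a rough illustration of the kind of check an allocator might make before merging two free regions, here is a sketch. The name regions_adjacent is hypothetical, and the sketch assumes a flat address space in which converting a pointer to uintptr_t yields its numeric address; that conversion is implementation-defined, so this is not a portable guarantee.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the region [a, a + a_size)
 * ends exactly where region b begins.
 * Assumption: uintptr_t values compare like addresses on this platform. */
bool regions_adjacent(const void *a, size_t a_size, const void *b)
{
    return (uintptr_t)a + a_size == (uintptr_t)b;
}
```

An allocator would typically run such a test on its own bookkeeping headers, where it knows how the regions were laid out; that knowledge is exactly what the compiler lacks.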
The validity, security and success of an action is inevitably the consequence of the level of insight upon which it is premised and applied.
In the quotes you have offered, Kernighan and Ritchie are addressing a closely related, but nonetheless separate issue. They are defining the limitations of the language, and explaining how you may exploit the capabilities of the compiler to protect you by at least detecting potentially erroneous constructs. They are describing the lengths the mechanism is able -- is designed -- to go to in order to assist you in your programming task. The compiler is your servant, you are the master. A wise master, however, is one that is intimately familiar with the capabilities of his various servants.
Within this context, undefined behaviour serves to indicate potential danger and the possibility of harm; not to imply imminent, irreversible doom, or the end of the world as we know it. It simply means that we -- ‘meaning the compiler’ -- are not able to make any conjecture about what this thing may be, or represent and for this reason we choose to wash our hands of the matter. We will not be held accountable for any misadventure that may result from the use, or mis-use of this facility.
In effect, it simply says: ‘Beyond this point, cowboy: you are on your own...’
Your professor is seeking to demonstrate the finer nuances to you.
Notice what great care they have taken in crafting their example; and how brittle it still is. By taking the address of a, in
p[0].p0 = &a;
the compiler is coerced into allocating actual storage for the variable, rather than placing it in a register. It being an automatic variable, however, the programmer has no control over where that is assigned, and so is unable to make any valid conjecture about what would follow it. Which is why a must be set equal to zero for the code to work as expected.
Merely changing this line:
char a = 0;
to this:
char a = 1; // or ANY other value than 0
causes the behaviour of the program to become undefined. At minimum, the first answer will now be 1; but the problem is far more sinister.
Now the code is inviting disaster.
While still perfectly valid and even conforming to the standard, it now is ill-formed and although sure to compile, may fail in execution on various grounds. For now there are multiple problems -- none of which the compiler is able to recognize.
strcpy will start at the address of a, and proceed beyond this to consume -- and transfer -- byte after byte, until it encounters a null.
The p1 pointer has been initialized to a block of exactly 10 bytes.
If a happens to be placed at the end of a block and the process has no access to what follows, the very next read -- of p0[1] -- will elicit a segfault. This scenario is unlikely on the x86 architecture, but possible.
If the area beyond the address of a is accessible, no read error will occur, but the program still is not saved from misfortune.
If a zero byte happens to occur within the ten starting at the address of a, it may still survive, for then strcpy will stop and at least we will not suffer a write violation.
If it is not faulted for reading amiss, but no zero byte occurs in this span of 10, strcpy will continue and attempt to write beyond the block allocated by malloc.
If this area is not owned by the process, the segfault should immediately be triggered.
The still more disastrous -- and subtle -- situation arises when the following block is owned by the process, for then the error cannot be detected, no signal can be raised, and so it may ‘appear’ still to ‘work’, while it actually will be overwriting other data, your allocator’s management structures, or even code (in certain operating environments).
This is why pointer related bugs can be so hard to track. Imagine these lines buried deep within thousands of lines of intricately related code, that someone else has written, and you are directed to delve through.
Nevertheless, the program must still compile, for it remains perfectly valid and standard conformant C.
These kinds of errors, no standard and no compiler can protect the unwary against. I imagine that is exactly what they are intending to teach you.
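One way a programmer might guard against the strcpy scenarios listed above is to state the size of both objects explicitly. The following is only a sketch; bounded_copy is a hypothetical helper, not something from the professor's example, and it relies on the caller passing honest sizes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the string at src into dst only if a
 * terminating null is found within the first src_size bytes AND the
 * string, including that null, fits in dst_size bytes. */
bool bounded_copy(char *dst, size_t dst_size, const char *src, size_t src_size)
{
    const char *end = memchr(src, '\0', src_size);
    if (end == NULL)
        return false;                      /* no terminator inside the source object */
    size_t len = (size_t)(end - src) + 1;  /* length including the null */
    if (len > dst_size)
        return false;                      /* would overrun the destination */
    memcpy(dst, src, len);
    return true;
}
```

With char a = 1 and a 10-byte destination, bounded_copy(dst, 10, &a, sizeof a) refuses to copy instead of reading past a.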
Paranoid people constantly seek to change the nature of C to dispose of these problematic possibilities and so save us from ourselves; but that is disingenuous. This is the responsibility we are obliged to accept when we choose to pursue the power and obtain the liberty that more direct and comprehensive control of the machine offers us. Promoters and pursuers of perfection in performance will never accept anything less.
Portability and the generality it represents is a fundamentally separate consideration and all that the standard seeks to address:
This document specifies the form and establishes the interpretation of programs expressed in the programming language C. Its purpose is to promote portability, reliability, maintainability, and efficient execution of C language programs on a variety of computing systems.
Which is why it is perfectly proper to keep it distinct from the definition and technical specification of the language itself. Contrary to what many seem to believe, generality is antithetical to the exceptional and exemplary.
To conclude:
Examining and manipulating pointers themselves is invariably valid and often fruitful. Interpretation of the results may or may not be meaningful, but calamity is never invited until the pointer is dereferenced; until an attempt is made to access the address linked to.
Were this not true, programming as we know it -- and love it -- would not have been possible.
To create a string that I can modify, I can do something like this:
// Creates a variable string via array
char string2[] = "Hello";
string2[0] = 'a'; // this is ok
And to create a constant string that cannot be modified:
// Creates a constant string via a pointer
char *string1 = "Hello";
string1[0] = 'a'; // This will give a bus error
My question then is how would one modify a constant string (for example, by casting)? And, is that considered bad practice, or is it something that is commonly done in C programming?
By definition, you cannot modify a constant. If you want to get the same effect, make a non-constant copy of the constant and modify that.
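A minimal sketch of that approach, assuming heap allocation is acceptable. The helper name writable_copy is made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated, writable copy of s,
 * or NULL if allocation fails. The caller must free() the result. */
char *writable_copy(const char *s)
{
    size_t n = strlen(s) + 1;   /* include the terminating null */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```

Calling writable_copy("Hello") and then setting element 0 of the result to 'a' changes only the copy; the literal itself is never written to.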
how would one modify a constant string (for example, by casting)?
If by this you mean, how would one attempt to modify it, you don't even need a cast. Your sample code was:
char *string1 = "Hello";
string1[0] = 'a'; // This will give a bus error
If I compile and run it, I get a bus error, as expected, and just like you did. But if I compile with -fwritable-strings, which causes the compiler to put string constants in read/write memory, it works just fine.
I suspect you were thinking of a slightly different case. If you write
const char *string1 = "Hello";
string1[0] = 'a'; // This will give a compilation error
the situation changes: you can't even compile the code. You don't get a Bus Error at run-time, you get a fatal error along the lines of "read-only variable is not assignable" at compile time.
Having written the code this way, one can attempt to get around the const-ness with an explicit cast:
((char *)string1)[0] = 'a';
Now the code compiles, and we're back to getting a Bus Error. (Or, with -fwritable-strings, it works again.)
is that considered bad practice, or is it something that is commonly done in C programming
I would say it is considered bad practice, and it is not something that is commonly done.
I'm still not sure quite what you're asking, though, or if I've answered your question. There's often confusion in this area, because there are typically two different kinds of "constness" that we're worried about:
whether an object is stored in read-only memory
whether a variable is not supposed to be modified, due to the constraints of a program's architecture
The first of these is enforced by the OS and by the MMU hardware. It doesn't matter what programming-language constructs you did or didn't use -- if you attempt to write to a readonly location, it's going to fail.
The second of these has everything to do with software engineering and programming style. If a piece of code promises not to modify something, that promise may let you make useful guarantees about the rest of the program. For example, the strlen function promises not to modify the string you hand it; all it does is inspect the string in order to compute its length.
Confusingly, in C at least, the const keyword has mostly to do with the second category. When you declare something as const, it doesn't necessarily (and in fact generally does not) cause the compiler to put the something into read-only memory. All it does is let the compiler give you warnings and errors if you break your promise -- if you accidentally attempt to modify something that elsewhere you declared as const. (And because it's a compile-time thing, you can also readily "cheat" and turn off this kind of constness with a cast.)
But there is read-only memory, and these days, compilers typically do put string constants there, even though (equally confusingly, but for historical reasons) string constants do not have the type const char [] in C. But since read-only memory is a hardware thing, you can't "turn it off" with a cast.
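The two kinds of constness can be seen side by side in a small sketch. Both function names are hypothetical. The cast in overwrite_first is well defined only when the object actually pointed to is modifiable, such as a char array; applied to a string literal it would be undefined behavior.

```c
#include <stddef.h>

/* Promises, via const, not to modify the string; like strlen, it only
 * inspects the characters. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Breaks the compile-time promise by casting const away. This is fine
 * only if view really points at writable storage. */
void overwrite_first(const char *view, char c)
{
    ((char *)view)[0] = c;
}
```

Given char buf[] = "Hello", calling overwrite_first(buf, 'J') works because buf is an ordinary writable array; the const in the parameter type was only ever a promise checked by the compiler.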
You cannot modify the contents of a string literal in a safe or reliable manner in C; it results in undefined behavior. From the C11 standard draft section 6.4.5 p7 concerning string literals:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
Attempting to modify a string literal is undefined behavior. You may get a bus error, as in your case, or the program may not even indicate that the write failed at all. This is undefined behavior for you - the language makes no promises at this point.
You could reassign the pointer (losing your reference to the string "Hello"):
char *s1 = "Hello";
printf("%s ", s1);
s1 = "World";
printf("%s\n", s1);
If I have a pointer,
int* p;
p = (int*)2; // just for test
*p = 3; // it will be crack, right?
In general, accessing a pointer with the value 2 will crash. But real crashes are not so simple: an invalid pointer value may come from a runtime error. I'd like to find a way to check the pointer before accessing it.
In standard C99, this (dereferencing (int*)2 in your *p = 3; statement) is undefined behavior (UB). Read Chris Lattner's blog on that. You should be very scared of UB. This is why programming in C is so hard (other programming languages like OCaml and Common Lisp have much less UB).
A C program may have undefined behavior but might not always crash.
In practice, when coding in C, be very careful about pointers. Initialize all of them explicitly (often to NULL). Be very careful about pointer arithmetic. Avoid buffer overflows and memory leaks. Static source code analysis may help (e.g. with Frama-C) but is limited (read about the halting problem & Rice's theorem). You can often use flexible array members and check pointers and indexes at runtime.
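As a sketch of the last suggestion, here is a length-prefixed array built on a C99 flexible array member, with accessors that check the pointer and the index at runtime. All names (int_vec, int_vec_new, int_vec_set, int_vec_get) are hypothetical, and the size computation does not guard against overflow for very large lengths.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical length-prefixed array using a flexible array member. */
struct int_vec {
    size_t len;
    int data[];
};

/* Allocates a zero-filled vector of len elements, or returns NULL. */
struct int_vec *int_vec_new(size_t len)
{
    struct int_vec *v = calloc(1, sizeof *v + len * sizeof v->data[0]);
    if (v != NULL)
        v->len = len;
    return v;
}

/* Stores value at index i; refuses NULL vectors and out-of-range indexes. */
bool int_vec_set(struct int_vec *v, size_t i, int value)
{
    if (v == NULL || i >= v->len)
        return false;
    v->data[i] = value;
    return true;
}

/* Reads index i into *out; refuses NULL pointers and out-of-range indexes. */
bool int_vec_get(const struct int_vec *v, size_t i, int *out)
{
    if (v == NULL || out == NULL || i >= v->len)
        return false;
    *out = v->data[i];
    return true;
}
```

The checks catch NULL and out-of-range indexes, but they cannot detect a dangling or garbage non-NULL pointer; that still has to be prevented by discipline.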
On some embedded freestanding C implementations (e.g. coding for Arduino like devices), some addresses might have particular meanings (e.g. be some physical IO devices), hence UB could be very scary.
(I am focusing on Linux below)
On some implementations, and some operating systems, you might test if an address is valid. For example on Linux you might parse /proc/self/maps to compute if some given address is valid (see proc(5) for more about /proc/).
(Therefore, on Linux, you could write a function bool isreadableaddress(void*) that parses /proc/self/maps and tells whether an address is readable in the virtual address space of your process; but it won't be very efficient, since it needs several system calls.)
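A rough, Linux-only sketch of such a function follows. It assumes /proc is mounted and that each line of /proc/self/maps starts with "start-end perms", as described in proc(5). The parameter is declared const void * here, a small deviation from the signature mentioned above. The answer is only a snapshot: a mapping can appear or disappear right after the check.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Linux-only sketch: scans /proc/self/maps for a mapping that contains
 * addr and has the 'r' permission flag. Returns false if the file cannot
 * be opened or no mapping contains addr. */
bool isreadableaddress(const void *addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return false;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    bool readable = false;

    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        char perms[5];
        /* Lines that do not begin with "start-end perms" are skipped. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (target >= start && target < end) {
            readable = (perms[0] == 'r');
            break;
        }
    }
    fclose(maps);
    return readable;
}
```

Even when this returns true, it says nothing about whether the address holds the object you expect, only that a read would not fault at the moment of the check.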
You should also use valgrind, compile with all warnings and debug options (gcc -Wall -Wextra -g), use the debugger (gdb), and try further debugging compiler options such as -fsanitize=address.
You might perhaps handle the SIGSEGV signal, but it is very tricky and highly non-portable (operating system, processor, and ABI specific). Unless you are a guru, you should not even try.
Yes, "it will be crack".
Your program has no way to know whether an arbitrarily-initialised pointer will practically "work" at runtime. Firstly, you compile your code before you run it, potentially on a completely different computer. The compiler cannot predict the future.
The language deals with this by saying any pointer not explicitly made to point to an object or array that you created in the program cannot exist as long as you want your program to have well-defined behaviour.
Basically, the only way to be sure is to not do this.
Actually it is possible... sort of....
Not with plain C, but most environments allow you to test whether you can write to a pointer...
for example
Unfortunately you have no way of knowing if a pointer points at what you intend. Meaning you can be pointing at another valid address, different from what you expect, and unintentionally corrupt a piece of your own memory...
char a[2];
int b; // assuming they are stored on the stack sequentially and aligned one right after the other...
char *ptr = (char*)a;
ptr += 3;
*ptr = 'b'; // valid pointer, but probably an error....
Can a single char be made read-only in C ? (I would like to make '\0' read-only to avoid buffer overflows.)
char var[5 + 1] = "Hello";
var[5] = '\0'; // specify this char as read-only ?
You can't make a mixed const/non-const array or string. The \0 not being overwritten should be guaranteed by the invariants in your program.
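One way to express such an invariant is to route every write through a function that never touches the last slot. This is only a sketch; fixed_str and fixed_str_set are hypothetical names, and nothing stops other code from writing to buf directly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-capacity string whose last byte is reserved for '\0'. */
struct fixed_str {
    char buf[5 + 1];
};

/* Writes c at index i unless that would touch the reserved terminator
 * slot or fall outside the buffer. */
bool fixed_str_set(struct fixed_str *s, size_t i, char c)
{
    if (i >= sizeof s->buf - 1)
        return false;
    s->buf[i] = c;
    return true;
}
```

With struct fixed_str s = { "Hello" }, a call such as fixed_str_set(&s, 5, 'X') is rejected, so s.buf[5] stays '\0' as long as all writes go through the setter.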
Can't do it, you have to ensure this through proper practice.
strictly speaking, no.
On some systems you could obtain something similar by playing with the linker, e.g. by specifying the address of "var" and forcing it to be 5 bytes before a "read only" section. But it works only in very few cases, and anyway it's not part of the C language.
No, such a concept of a read-only memory location does not exist at all in C. C programmers are left on their own when allocating/accessing/manipulating memory and they are assumed to know what they are doing :D. Such is the responsibility that comes with raw power :D.
If you can switch to C++ then you may consider using std::vector. While buffer overflows are still possible with std::vector, they are less likely when you use the methods in the class interface. These methods abstract element access and insertion, so you won't need to manage memory explicitly. Note, though, that you can still directly access an element outside the vector's size if you don't use iterators.
Since overflows are recurring problems in C/C++, there are several tools to help the programmer with these kinds of mistakes. Tools range from static language analyzers to runtime detection of unprotected memory access in debug mode.
Make use of a const char * ptr which keeps track of your \0.