char *c="1234". Address stored in c is always the same - c

This was a question asked by an interviewer:
#include<stdio.h>
int main()
{
char *c="123456";
printf("%d\n",c);
return 0;
}
This piece of code always prints a fixed number (e.g. 13451392), no matter how many times you execute it. Why?

Your code contains undefined behavior: printing a pointer needs to be done using %p format specifier, and only after converting it to void*:
printf("%p\n", (void*)c);
This would produce a system-dependent number, which may or may not be the same on different platforms.
The reason that it is fixed on your platform is probably that the operating system always loads your executable into the same spot of virtual memory (which may be mapped to different areas of physical memory, but your program would never know). The string literal, which is part of the executable, would end up in the same spot as well, so the printout would be the same all the time.

To answer your question, the character string "123456" is a static constant in memory, and when the .exe is loaded, it always goes into the same memory location.
What c is (or rather what it contains) is the memory address of that character string which, as I said, is always at the same location. If you print the address as a decimal number, you see the address, in decimal.
Of course, as @dasblinkenlight said, you should print it as a pointer, because different machines/languages have different conventions about the size of pointers versus the size of ints.

Most executable file formats have a field that tells the OS loader at which virtual address to load the executable. For example, the PE format used by Windows has an ImageBase field for this, which is usually set to 0x00400000 for applications.
When the loader first loads the executable, it tries to load it at that address. If the address is not in use, which is usually the case, the image is loaded there; if it is in use, the image is loaded at a different address chosen by the system.
The case here is that the offset of your "123456" in the data section is the same, and the OS loads the image at the same base address, so you always get the same virtual address: base + offset.
But this is not always the case. One reason is the one given above: the base address may already be in use. A lot of Windows DLLs compiled with MSVC set their base address to 0x10000000, so at most one of them is actually loaded at that address.
Another case is address space layout randomization (ASLR), a security feature. If it is supported and enabled by the system (MSVC has the linker option /DYNAMICBASE for this), the system ignores the specified image base and picks a random address on its own.
Two things to conclude:
You should not depend on this behavior; the system can load your program at any address, and thus you will get a different address.
Use %p for printing addresses. On some systems, for example, int is 4 bytes and pointers are 8 bytes, so part of your address will be chopped off.

Related

Is it possible to allocate a single byte of memory at a specific address?

Is it possible to allocate a single byte of memory at a specific desired address, say 0x123?
This suggests follow up questions:
Is it possible to know if a specific address has already been malloced?
Some complications could be:
The byte at the desired address 0x123 was already malloc'ed. In this case, can I move the byte value elsewhere and notify the compiler (or whatever's keeping track of these things) of the new address of the byte?
The byte at the desired address 0x123 was malloc'ed along with other bytes. E.g. char *str = malloc(8); and str <= 0x123 < str + 8, or in other words, 0x123 overlaps some portion of already malloc'ed memory. In this case, is it possible to move the portion of malloc'ed memory elsewhere and notify the compiler (or whatever's keeping track of these things)?
There are also several variations:
Is this possible if the desired address is known at compile time?
Is this possible if the desired address is known at run time?
I know mmap takes a hint addr, but it allocates in multiples of the pagesize and may or may not allocate at the given hint addr.
It is possible to assign a specific value to a pointer as follows:
unsigned char *p = (unsigned char *)0x123;
However dereferencing such a pointer will almost certainly result in undefined behavior on any hosted system.
The only time such a construct would be valid is on an embedded system where it is allowed to access an arbitrary address and the implementation documents specific addresses for specific uses.
As for trying to manipulate the inner workings of a malloc implementation, such a task is very system specific and not likely to yield any benefit.
There are operating-system-specific ways to do this. On Windows, you can use VirtualAlloc (with the MEM_COMMIT | MEM_RESERVE flags). On Linux, you can use mmap (with the MAP_FIXED_NOREPLACE flag). These are the operating system functions which give you full control over your own address space.
In either case, you can only map entire pages. Addresses only become valid and invalid a page at a time. You can't have a page that is only half valid, and you can't have a page where only one address is valid. This is a CPU limitation.
If the page you want is already allocated, then obviously you can't allocate it again.
On both Windows and Linux, you can't allocate the first page. This is so that accesses to NULL pointers (which point to the first page) will always crash.
Is it possible to allocate a single byte of memory at a specific desired address, say 0x123?
Generally: no. The C language doesn't cover allocation at specific addresses; it only covers how to access a specific address. Many compilers do provide non-standard language extensions for allocating at a fixed address. When sticking to standard C, the actual allocation must either be done:
In hardware, for example by having an MCU which provides a memory-mapped register map, or
By the system-specific linker, through custom linker scripts.
See How to access a hardware register from firmware? for details.
malloc doesn't make any sense in either case, since it exclusively uses heap allocation and the heap sits inside a pre-designated address space.

Static and Dynamic Memory Addresses in C

printf("address of literal: %p \n", "abc");
char alpha[] = "abcdef";
printf("address of alpha: %p \n", alpha);
Above, literal is stored in static memory, alpha is stored in dynamic memory. I read in a book that some compilers show these two addresses using different number of bits (I only tried using gcc on Linux, and it does show different number of bits). Does it depend on the compiler, or the operating system and hardware?
I only tried using gcc on Linux, and it does show different number of bits
It's not that it "uses a different number of bits". As far as I know, Linux – at least when running the major platforms I know of (e.g. x86, x64, ARM32) – doesn't have "near" and "far" pointers. For instance, on x86, every pointer is 32 bits wide and on x64, every pointer is 64 bits wide.
It's just that…
the compiler probably allocates the alpha array on the stack (which it is allowed to do since it has automatic storage duration. It is most probably not stored in "dynamic memory", that would be stupid since that would involve a superfluous dynamic allocation, which is one of the slowest things you can do with memory.)
Meanwhile the literals themselves, having static storage duration, are stored elsewhere (usually in the data segment of the executable);
and on top of that, the OS's memory manager happens to place these two things (the stack and the executable image) far apart, so one of them has addresses that start with a lot of zeroes, while addresses in the other one don't have that many leading zeroes.
Furthermore, the default behavior of %p in your libc implementation happens to not print leading zeroes.
alpha is stored e.g. in the stack or another dynamic memory segment. The literal is stored inside the code segment. These are different address ranges.
The addresses are platform dependent. In most cases the pointer size is 4 bytes long, but the addresses for the different segments are in different ranges.
The linker is responsible for the address assignment. You may want to enable an option to let the linker produce an address map file.
The dynamic parts are also called data segments. The static parts are code segments. You will find a lot of literature by searching for these terms for your platform (e.g. search for "x86 memory segmentation").

Questions about loading executables into memory

So I'm working with the linux 0.11 kernel on a virtual machine, and I need to write a program that analyses executable files that are run on that kernel. The files are in the a.out format. What I want to know is, how does the operating system decide where to load the file in (virtual?) memory? Is it decided by something called "base address", and if so, how come I can't seem to find any mention of it in the a.out header?
//where is base address?
struct exec {
unsigned long a_magic; /* Use macros N_MAGIC, etc for access */
unsigned a_text; /* length of text, in bytes */
unsigned a_data; /* length of data, in bytes */
unsigned a_bss; /* length of uninitialized data area for file, in bytes */
unsigned a_syms; /* length of symbol table data in file, in bytes */
unsigned a_entry; /* start address */
unsigned a_trsize; /* length of relocation info for text, in bytes */
unsigned a_drsize; /* length of relocation info for data, in bytes */
};
I tried looking for documentation about the format, but the only information I found just explains what each of these fields is, what values a_magic can have, etc.
I need to know about it because the program needs to print out file and line numbers when given an address in memory of an instruction in the executable, and the debug symbols only have their addresses as offsets (e.g. relative to the start of text section, etc).
Also, out of curiosity, I know that in C, "(void*)0" is NULL, which you can't dereference. How then would you get the content of memory address 0?
As you see, I know very little about linux kernel and operating systems in general, so please start from the basics...
I appreciate any help you can give, thanks.
The base address is the a_entry field.
Also, out of curiosity, I know that in C, "(void*)0" is NULL, which you can't dereference. How then would you get the content of memory address 0?
Any system that puts memory usable by a C program at address zero would have to make it work, somehow. While one can imagine possible ways to do this, I don't know of anyone who bothers. Virtual address zero is, for all intents and purposes, never used.
The operating system can load the application at any location it chooses and then relocate the embedded addresses to be relative to that point. This relocation information is recorded in the a.out file. The base address depends on the architecture and other details and is often non-zero.
If you look at a linker map file, you should see a symbol that is either at the beginning of the memory image, or at a fixed offset from it. At runtime, subtract this value from the actual addresses you note for debugging to get to the relative address of the position you are interested in.
Note, the symbols will not be present in the executable if your linker script strips them.
Also, out of curiosity, I know that in C, "(void*)0" is NULL, which you can't dereference. How then would you get the content of memory address 0?
Actually you can dereference NULL, but the results are not defined. For convenience, most operating systems trap the access to help you debug pointer problems.
Also, the memory location with address 0 in a process's address space is different from the memory location with address 0 in the 'hardware space'. Paging support in the CPU and the operating system decouples physical memory from virtual memory. A virtual page could be mapped at address 0, although there you usually have interrupt vectors and other special device memory and not real RAM anyway.

How do you know the exact address of a variable?

So I'm looking through my C programming text book and I see this code.
#include <stdio.h>
int j, k;
int *ptr;
int main(void)
{
j = 1;
k = 2;
ptr = &k;
printf("\n");
printf("j has the value %d and is stored at %p\n", j, (void *)&j);
printf("k has the value %d and is stored at %p\n", k, (void *)&k);
printf("ptr has the value %p and is stored at %p\n", (void *)ptr, (void *)&ptr);
printf("The value of the integer pointed to by ptr is %d\n", *ptr);
return 0;
}
I ran it and the output was:
j has the value 1 and is stored at 0x4030e0
k has the value 2 and is stored at 0x403100
ptr has the value 0x403100 and is stored at 0x4030f0
The value of the integer pointed to by ptr is 2
My question is, if I had not run this through a compiler, how would you know the addresses of those variables by just looking at this code? I'm just not sure how to get the actual address of a variable. Thanks!
Here's my understanding of it:
The absolute addresses of things in memory in C is unspecified. It's not standardised into the language. Because of this, you can't know the locations of things in memory by looking at just the code. (However, if you use the same compiler, code, compiler options, runtime and operating system, the addresses may be consistent.)
When you're developing applications, this is not behaviour you should rely on. You may rely on the difference between the locations of two things in some contexts, however. For example, you can determine the difference between the addresses of pointers to two array elements to determine how many elements apart they are.
By the way, if you are considering using the memory locations of variables to solve a particular problem, you may find it helpful to post a separate question asking how to do so without relying on this behaviour.
There is no other way to "know the exact address" of a variable in Standard C than to print it with "%p". The actual address is determined by many factors not under control of the programmer writing code. It's a matter of OS, the linker, the compiler, options used and probably others.
That said, in the embedded systems world, there are ways to express this variable must reside at this address, for example if registers of external devices are mapped into the address space of a running program. This usually happens in what is called a linker file or map file or by assigning an integral value to a pointer (with a cast). All of these methods are non-standard.
For the purpose of your everyday garden-variety programs though, the point of writing C programs is that you need not and should not care where your variables are stored.
You can't.
Different compilers can put the variables in different places. On some machines the address is not a simple integer anyway.
The compiler only knows things like "the third integer global variable" and "the four bytes allocated 36 bytes down from the stack pointer." It refers to global vars, pointers to subroutines (functions), subroutine arguments and local vars only in relative terms. (Never mind the extra stuff for polymorphic objects in C++, yikes!) These relative references are saved in the object file (.o or .obj) as special codes and offset values.
The Linker can fill in some details. It may modify some of these sketchy location references when joining several object files. Global variable locations will share a space (the Data Section) when globals from multiple compilation units are merged; the linker decides what order they all go in, but still describing them as relative to the start of the entire set of global vars. The result is an executable file with the final opcodes, but addresses still being sketchy and based on relative offsets.
It's not until the executable is loaded that the Loader replaces all the relative addresses with actual addresses. This is possible now, because the loader (or some part of the operating system it depends on) decides where in the whole virtual address space of the process to store the program's opcodes (Text Section), global variables (BSS, Data Sections) and call stack, and other things. The loader can do the math, and write the actual address into every spot in the executable, typically as part of "load immediate" opcodes and all opcodes involving memory access.
Google "relocation table" for more. See http://www.iecc.com/linker/linker07.html (somewhat old) for a more detailed explanation for particular platforms.
In real life, it's all complicated by the fact that virtual addresses are mapped to physical addresses by a virtual memory system, using segments or some other mechanism to keep each process in a separate address space.
I would like to further build upon the answers already provided by pointing out that some toolchains, such as Visual Studio's, support a feature called Address Space Layout Randomization (ASLR), which makes programs load at a random memory address as a security measure. Given the addresses that you have in your output, I'd say that you compiled without it (programs without it start at address 0x400000, I think). My source for this information is an answer to this question.
That said, the compiler is what determines the memory addresses at which local variables will be stored. The addresses will most likely change from compiler to compiler, and probably also with each version of the source code.
Every process has its own logical address space starting from zero. Addresses your program can access are all relative to zero. The absolute address of any memory location is decided only after loading the process into main memory. This is done using dynamic relocation by modern operating systems. Hence every time a process is loaded into memory it may be loaded at a different location according to the availability of memory. Hence allowing user processes to know the exact address of data stored in memory does not make any sense. What your code is printing is a logical address and not the exact or physical address.
Continuing on the answers described above, please do not forget that processes would run in their own virtual address space (process isolation). This ensures that when your program corrupts some memory, the other running processes are not affected.
Process Isolation:
http://en.wikipedia.org/wiki/Process_isolation
Inter-Process Communication
http://en.wikipedia.org/wiki/Inter-process_communication

Can a 32-bit processor really address 2^32 memory locations?

I feel this might be a weird/stupid question, but here goes...
In the question Is NULL in C required/defined to be zero?, it has been established that the NULL pointer points to an unaddressable memory location, and also that NULL is 0.
Now, supposedly a 32-bit processor can address 2^32 memory locations.
2^32 is only the number of distinct numbers that can be represented using 32 bits. Among those numbers is 0. But since 0, that is, NULL, is supposed to point to nothing, shouldn't we say that a 32-bit processor can only address 2^32 - 1 memory locations (because the 0 is not supposed to be a valid address)?
If a 32-bit processor can address 2^32 memory locations, that simply means that a C pointer on that architecture can refer to 2^32 - 1 locations plus NULL.
the NULL pointer points to an unaddressable memory location
This is not true. From the accepted answer in the question you linked:
Notice that, because of how the rules for null pointers are formulated, the value you use to assign/compare null pointers is guaranteed to be zero, but the bit pattern actually stored inside the pointer can be any other thing
Most platforms of which I am aware do in fact handle this by marking the first few pages of address space as invalid. That doesn't mean the processor can't address such things; it's just a convenient way of making low values a non valid pointer. For instance, several Windows APIs use this to distinguish between a resource ID and a pointer to actual data; everything below a certain value (65k if I recall correctly) is not a valid pointer, but is a valid resource ID.
Finally, just because C says something doesn't mean that the CPU needs to be restricted that way. Sure, C says accessing the null pattern is undefined -- but there's no reason someone writing in assembly need be subject to such limitations. Real machines typically can do much more than the C standard says they have to. Virtual memory, SIMD instructions, and hardware IO are some simple examples.
First, let's note the difference between the linear address (AKA the value of the pointer) and the physical address. While the linear address space is, indeed, 32 bits (AKA 2^32 different bytes), the physical address that goes to the memory chip is not the same. Parts ("pages") of the linear address space might be mapped to physical memory, or to a page file, or to an arbitrary file, or marked as inaccessible and not backed by anything. The zeroth page happens to be the latter. The mapping mechanism is implemented on the CPU level and maintained by the OS.
That said, the zero address being unaddressable memory is just a C convention that's enforced by every protected-mode OS since the first Unices. In MS-DOS-era real-mode operating systems, the null far pointer (0000:0000) was perfectly addressable; however, writing there would ruin system data structures and bring nothing but trouble. The null near pointer (DS:0000) was also perfectly accessible, but the run-time library would typically reserve some space around zero to protect from accidental null pointer dereferencing. Also, in real mode (like in DOS) the address space was not a flat 32-bit one; it was effectively 20-bit.
It depends upon the operating system. It is related to virtual memory and address spaces.
In practice (at least on Linux x86 32 bits), addresses are byte numbers, but most are for 4-byte words and so are often multiples of 4.
And more importantly, as seen from a Linux application, at most 3 GB out of the 4 GB is visible. A whole gigabyte of address space (including the first and last pages, near the null pointer) is unmapped. In practice the process sees much less than that. See its /proc/self/maps pseudo-file (e.g. run cat /proc/self/maps to see the address map of the cat command on Linux).

Resources