"Garbage" values are determined from previous programs to my knowledge. For examples if variable x = 10 and you terminate the program, then 10 is still stored at the address of x, but your access to it is lost. Hence, if in a new program you were to have an integer generate coincidentally at that same address it should generate 10 as its "garbage" value.
Mockup notes I made on the topic out of curiosity:
If you have a string of characters in succession which are all equal to NULL, they would each be stored as 0.
So, if in a new program you were to generate an integer at the starting address of that first character, what would its garbage value be? I have reason to believe it would be 0, but I'm unsure as to why exactly. Would it add each ASCII value together to get a number, or would some sort of binary conversion occur?
I recognize many compilers will auto-initialize variables to zero, but some do not, and those produce wildly interesting "garbage" values.
"Garbage" values are determined from previous programs to my knowledge.
No. In general-purpose multi-user systems, the operating system does not provide memory used by a previous process to a new process (except for intentionally shared data). When memory is provided to a new process, the operating system will ensure any potentially sensitive data is overwritten (usually with zeros). “Embedded” or special-purpose systems might behave differently.
Mockup notes I made on the topic out of curiosity
If you took the space of a, b, c, & d and created an int, would it be 0? Why?
In C, each object except a bit-field is composed of a contiguous sequence of bytes. The contents of those bytes represent a value.
An int is an object. (An object is a selected region of memory whose contents may be used to represent values.) So the bytes in it determine its value. For integer types, C uses binary encodings to represent values. So, when all the bytes of an int are zero, the value represented is zero. (There are additional details about handling negative numbers, not addressed here.)
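As a small illustration (a sketch, not tied to any particular compiler), forcing every byte of an int to zero with memset and then reading it back yields zero:

#include <stdio.h>
#include <string.h>

int main(void)
{
    int x;
    memset(&x, 0, sizeof x);  /* set every byte of x to zero */
    printf("%d\n", x);        /* prints 0: all-zero bytes represent the value zero */
    return 0;
}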
So, if in a new program you were to generate an integer at the starting address of that first character, what would its garbage value be? I have reason to believe it would be 0, but I'm unsure as to why exactly. Would it add each ASCII value together to get a number, or would some sort of binary conversion occur?
This question is unclear. Characters are encoded as numbers; ASCII is commonly used as the encoding scheme. If you put characters in the memory of an int, the numbers that encode them will be in the bytes of the int, and they will represent whatever number their bits form in binary (possibly modified by interpretation of the sign bit). Note that some C implementations form the binary numeral for an int using its bytes in order from low address to high address, while others use them in the other order (and other permutations are possible as well). So writing the same characters to an int in different C implementations may produce different values for the int.
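To make that last point concrete, here is a sketch assuming a 4-byte int and ASCII. It places the byte encoding 'A' (65) at the lowest address of an int and zeroes the rest:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char bytes[sizeof(int)] = { 'A' };  /* 'A' is 65 in ASCII; remaining bytes are zero */
    int n;
    memcpy(&n, bytes, sizeof n);
    printf("%d\n", n);  /* 65 on a little-endian implementation;
                           65 * 2^24 = 1090519040 on a big-endian one */
    return 0;
}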
I recognize many compilers will auto-initialize variables to zero, but some do not, and those produce wildly interesting "garbage" values.
It is rare that automatic objects (those defined inside functions without an explicit storage duration) or dynamic objects (those allocated with malloc or related routines) will be initialized to zero by a compiler, except for debugging or security purposes, and except that some of the allocation routines, notably calloc, include initialization.
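A brief sketch of the distinction (error checking omitted): the bytes malloc returns are indeterminate, while calloc zeroes them:

#include <stdlib.h>

int main(void)
{
    int *m = malloc(4 * sizeof *m);  /* contents indeterminate ("garbage") */
    int *c = calloc(4, sizeof *c);   /* all bytes zeroed, so c[0]..c[3] read as 0 */
    free(m);
    free(c);
    return 0;
}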
I'm taking Introduction to CS (CS50, Harvard) and we're learning type declaration in C. When we declare a variable and assign a type, the computer's allocating a specific amount of bits/bytes (1 byte for char, 4 bytes for int, 8 bytes for doubles etc...).
For instance, if we declare the string "EMMA", we're using 5 bytes, 1 for each "char" and 1 extra for the \0 null byte.
Well, I was wondering why 2 M's are allocated separate bytes. Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?
Would love some education on the matter (without getting too deep, as I'm fairly new to the field).
Edit: Fixed some bits into bytes — my bad
1 byte for char, 4 bytes for int, 8 bytes for doubles etc...
These are typical values, but they depend on the architecture (per this answer, there are even 9-bit-per-byte architectures still being sold these days).
Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?
While this idea is certainly feasible in theory, in practice the overhead is way too big for simple data like characters: one character is usually a single byte.
If we were to set up a system in which we allocate memory for the character value and only refer to it from the string, the string would be made of a series of elements which would be used to store which character should be there: in C this would be a pointer (you will encounter them at some point in your course) and is usually either 4 or 8 bytes long (32 or 64 bits). Assuming you use a 32-bit pointer, you would use 24 bytes of memory to store the string in this complex manner instead of 5 bytes using the simpler method (to expand on this answer, you would need even more metadata to be able to properly modify the string during your program's execution).
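Here is a sketch of that comparison (pool and shared are hypothetical names; the printed sizes assume 8-byte pointers on a 64-bit system):

#include <stdio.h>

int main(void)
{
    /* The simple layout: 5 bytes, one per character plus the '\0'. */
    char emma[] = "EMMA";

    /* The hypothetical "shared character" layout: one pointer per
       character position, each pointing into a pool of unique characters. */
    char pool[] = { 'E', 'M', 'A', '\0' };
    const char *shared[5] = { &pool[0], &pool[1], &pool[1], &pool[2], &pool[3] };

    printf("%zu bytes vs %zu bytes\n", sizeof emma, sizeof shared);
    /* e.g. "5 bytes vs 40 bytes", plus the 4-byte pool itself */
    return 0;
}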
Your idea of storing a chunk of data and referring to it multiple times does however exist in several cases:
virtual memory (you will encounter this if you go towards OS development), where copy-on-write is used
higher level languages (like C++)
filesystems which implement a copy-on-write feature, like BTRFS
some backup systems (like borg or rsync) which deduplicate the files/chunks they store
Facebook's zstandard compression algorithm, where a dictionary of small common chunks of data is used to improve compression ratio and speed
In such settings, where lots of data are stored, the cost of storing the data once and referring to it from multiple places is small relative to the savings (and copying gets faster too), so it is worth the added complexity.
For instance if we declare the string "EMMA", we're using 5 bits
I am sure you are speaking about 5 bytes instead of 5 bits.
Well, I was wondering why 2 M's are allocated separate bits. Can't the
computer make use of the chars or integers currently taking space in
the memory and refer to that specific slot when it wants to reuse it?
A pointer to a "slot" usually occupies 4 or 8 bytes. So there is no sense to spend 8 bytes to point to an object that occupies only one byte
Moreover "EMMA" is a character array that consists from adjacent bytes. So all elements of the array has the same type and correspondingly size.
The compiler can reduce memory usage by avoiding duplicated string literals. For example, it can store identical string literals as a single literal. This depends on a compiler option.
So if the same string literal occurs in the program two times, as in these statements,
char *s = malloc( sizeof( "EMMA" ) );
strcpy( s, "EMMA" );
then the compiler can store only one copy of the string literal.
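A way to observe this, as a sketch (the standard leaves it unspecified whether identical literals are shared, so either output is conforming):

#include <stdio.h>

int main(void)
{
    const char *p = "EMMA";
    const char *q = "EMMA";
    /* With literal pooling enabled, p and q may print the same address;
       without it, they may differ. */
    printf("%p\n%p\n", (void *)p, (void *)q);
    return 0;
}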
The compiler is not supposed to be clever; it does something minimal, and it has to behave in a way that is easy for programmers to understand and reason about. In other words, it has to be general.
As a programmer you can make your program save data in the suggested way, but it won't be general.
E.g., I am making a database for my school, I entered a wrong name, and now I want to change the 2nd 'M' in "EMMA". This would be troublesome if the system worked as you suggest, because editing the shared character would alter every string that refers to it.
Would love to clarify further if needed. :)
Consider the following situations:
The National Semiconductor SC/MP has pointers which, when you keep incrementing them, will roll from 0x0FFF to 0x0000 because the increment circuit does not propagate the carry past the lower nybble of the higher byte. So if, for example, I want to do while(*ptr++) to traverse a null-terminated string, then I might wind up with ptr pointing outside of the array.
On the PDP-10, because a machine word is longer than an address[1], a pointer may have tags and other data in the upper half of the word containing the address. In this situation, if incrementing a pointer causes an overflow, that other data might get altered. The same goes for very early Macintoshes, before the ROMs were 32-bit clean.
So my question is about whether the C standard says what incrementing a pointer really means. As far as I can tell, the C standard assumes that it should work the same way, bit for bit, as incrementing an integer. But that doesn't always hold, as we have seen.
Can a standards-conforming C compiler emit a simple adda #12, a0[2] to increment a pointer, without checking that the presence or lack of carry propagation will not lead to weirdness?
[1]: On the PDP-10, an address is 18 bits wide, but a machine word is 36 bits wide. A machine word may hold either two pointers (handy for Lisp) or one pointer plus bitfields which mean things like "add another level of indirection", segments, offsets, etc. Or a machine word may of course contain no pointers, but that's not relevant to this question.
[2]: Add a constant to an address register. That's 68000 assembler.
Behavior of pointer arithmetic is specified by the C standard only as long as the result points to a valid object or just past a valid object. More than that, the standard does not say what the bits of a pointer look like; an implementation may arrange them to suit its own purposes.
So, no, the standard does not say what happens when a pointer is incremented so far that the address rolls over.
If the while loop you refer to only proceeds one element past the end of the array, it is safe in C. (Per the standard, if ptr has been incremented to one element beyond the end of the array, and x points to any element in the array, including the first, then x < ptr must be true. So, if ptr has rolled over internally, the C implementation is responsible for ensuring the comparison still works.)
If your while loop may increment ptr more than one element beyond the end of the array, the C standard does not define the behavior.
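A sketch of the boundary the standard draws (the function and its name are mine, for illustration):

#include <stddef.h>

void scan(const char *s, size_t n)
{
    const char *end = s + n;  /* one past the end: valid to form and compare */
    for (const char *p = s; p != end; p++)
    {
        /* *p may be read here; *end may not be dereferenced */
    }
    /* Forming s + n + 1 (more than one past the end) would already be
       undefined behavior, even without dereferencing it. */
}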
People often ask, "Why does C have undefined behavior, anyway?" And this is a great example of one of the big reasons why.
Let's stick with the NS SC/MP example. If the hardware dictates that incrementing the pointer value 0x0FFF doesn't work quite right, we have two choices:
Translate the code p++ to the equivalent of if(p == 0x0FFF) p = 0x1000; else p++;.
Translate p++ to a straight increment, but rig things up so that no properly-allocated object ever overlaps an address involving 0x0FFF, such that if anyone ever writes code that ends up manipulating the pointer value 0x0FFF and adding one to it and getting a bizarre answer, you can say "that's undefined, so anything can happen".
If you take approach #1, the generated code is bigger and slower. If you take approach #2, the generated code is maximally efficient. And if someone complains about the bizarre behavior and asks why the compiler couldn't have emitted code that did something "more reasonable", you can simply say, "Our mandate was to be as efficient as possible."
A significant number of platforms have addressing methods which cannot index "easily" across certain boundaries. The C Standard allows implementations two general approaches for handling this (which may be, but typically aren't, used together):
Refrain from having the compiler, linker, or malloc-style functions place any objects in a way that would straddle any problematic boundaries.
Perform address computations in a way that can index across arbitrary boundaries, even when it would be less efficient than address-computation code that can't.
In most cases, approach #1 will lead to code which is faster and more compact, but code may be limited in its ability to use memory effectively. For example, if code needs many objects of 33,000 bytes each, a machine with 4 MiB of heap space subdivided into "rigid" 64 KiB chunks would be limited to creating 64 of them (one per chunk, since two such objects will not fit in 65,536 bytes), even though there would be space for 127 of them (4,194,304 / 33,000 ≈ 127). Approach #2 will often yield much slower code, but such code may be able to make more effective use of heap space.
Interestingly, imposing 16-bit or 32-bit alignment requirements would allow many 8-bit processors to generate more efficient code than allowing arbitrary alignment (since they could omit page-crossing logic when indexing between the bytes of a word) but I've never seen any 8-bit compilers provide an option to impose and exploit such alignments even on platforms where it could offer considerable advantages.
The C standard does not know or care anything about the implementation. It only says what the effect of pointer arithmetic is.
C allows something called undefined behavior (UB). C does not check whether the result of pointer arithmetic makes any sense (i.e., whether it went out of bounds or the actual implementation-defined storage wrapped around). If that happens, it is UB. It is up to the programmer to prevent UB, and C does not have any standard mechanism for detecting or preventing it.
I am not getting the point of knowing the byte size of a variable when I know its address. For example, let's say I know that an int variable is stored at address 0x8C729A09; if I want the int stored at that address, I can just dereference it and get the number stored there.
So what exactly is the purpose of knowing the byte size of the variable? Why does it matter whether the variable has 4 bytes (being an int) or 8 bytes, if I can get its value just by dereferencing the address? I am asking because I am dereferencing some addresses, and I thought I needed to go through a for loop to get the variable (knowing the start address, which is the address of the variable, and its size in bytes), but whenever I do this I just get other variables that are also declared.
A little bit of context: I am working on a tool called Pin and getting the addresses of the global variables declared in another program.
The for case looks something like this:
for (address1 = (char *) 0x804A03C, limit = address1 + bytesize; address1 < limit; address1++)
    cout << *address1 << "\n";
Michael Krelin gave a very compact answer but I think I can expand on it a bit more.
In any language, not just C, you need to know the size for a variety of reasons:
This determines the maximum value that can be stored
The memory space an array of those values will take (with 4-byte ints and 8-byte longs, 1000 bytes will get you 250 ints or 125 longs).
When you want to copy one array of values into another, you need to know how many bytes are used to allocate enough space.
While you may dereference a pointer and get the value, you could dereference the pointer at a different portion of that value, but only if you know how many bytes it is composed of. You could get the high half of an int by grabbing just the first two bytes, and the low half by getting the last two bytes (which half lives first depends on the machine's byte order).
Different architectures may have different sizes for different variables, which would impact all the above points.
Edit:
Also, there are certainly reasons why you may need to know the number of bits that a given variable is made of. If you want 32 booleans, what better variable to use than a single int, which is made of 32 bits? Then you can use some constant masks to address each bit, and now you have an "array" of booleans; this technique is usually called bit flags or bit masks (C's struct bit-fields serve a related purpose). In programming, every detail can matter, just not all the time for every application. Just figured that might be an interesting thought exercise.
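A minimal sketch of that idea using masks (assuming a 32-bit unsigned int; the names are illustrative):

#include <stdio.h>

int main(void)
{
    unsigned int flags = 0;        /* 32 "booleans" in one object */
    flags |= 1u << 5;              /* set boolean number 5 */
    flags &= ~(1u << 3);           /* clear boolean number 3 */
    int bit5 = (flags >> 5) & 1u;  /* test boolean number 5 */
    printf("%d\n", bit5);          /* prints 1 */
    return 0;
}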
The answer is simple: the internal representation of most types needs more than one byte. In order to dereference a pointer, you (either you or the compiler) need to know how many bytes should be read.
Also consider it when working with strings: you cannot always rely on the terminating \0, hence you need to know how many bytes you have to read. Examples of this are functions like memcpy or strncmp.
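A short sketch of why the byte count matters (the buffer holds embedded \0 bytes, so string functions would stop early):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char src[6] = { 'a', '\0', 'b', '\0', 'c', '\0' };
    char dst[6];
    memcpy(dst, src, sizeof src);  /* copies all 6 bytes, embedded '\0's included */
    printf("%c\n", dst[2]);        /* prints 'b'; strcpy would have stopped at src[1] */
    return 0;
}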
Suppose you have an array of variables. Where do you find the variable at a non-zero index without knowing its size? And how many bytes do you allocate for a non-zero-length array of such variables?
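A sketch of the arithmetic behind indexing (the explicit char * cast only makes the byte math visible; a[2] does this for you):

#include <stdio.h>

int main(void)
{
    int a[4] = { 10, 20, 30, 40 };
    /* a[2] lives at (char *)a + 2 * sizeof a[0]; without the element
       size, no element past the first could be located. */
    int *p = (int *)((char *)a + 2 * sizeof a[0]);
    printf("%d\n", *p);  /* prints 30 */
    return 0;
}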
Today I learned that if you declare a char variable (which is 1 byte), the assembler actually uses 4 bytes in memory so that the boundaries lie on multiples of the word size.
If a char variable uses 4 bytes anyway, what is the point of declaring it as a char? Why not declare it as an int? Don't they use the same amount of memory?
When you are writing in assembly language and declare space for a character, the assembler allocates space for one character and no more. (I write in regard to common assemblers.) If you want to align objects in assembly language, you must include assembler directives for that purpose.
When you write in C, and the compiler translates it to assembly and/or machine code, space for a character may be padded. Typically this is not done because of alignment benefits for character objects but because you have several things declared in your program. For example, consider what happens when you declare:
char a;
char b;
int i;
char c;
double d;
A naïve compiler might do this:
Allocate one byte for a at the beginning of the relevant memory, which happens to be aligned to a multiple of, say, 16 bytes.
Allocate the next byte for b.
Then it wants to place the int i which needs four bytes. On this machine, int objects must be aligned to multiples of four bytes, or a program that attempts to access them will crash. So the compiler skips two bytes and then sets aside four bytes for i.
Allocate the next byte for c.
Skip seven bytes and then set aside eight bytes for d. This makes d aligned to a multiple of eight bytes, which is beneficial on this hypothetical machine.
So, even with a naïve compiler, a character object does not require four whole bytes to itself. It can share with neighbor character objects, or other objects that do not require greater alignment. But there will be some wasted space.
A smarter compiler will do this:
Sort the objects it has to allocate space for according to their alignment requirements.
Place the most restrictive object first: Set aside eight bytes for d.
Place the next most restrictive object: Set aside four bytes for i. Note that i is aligned to a multiple of four bytes because it follows d, which is an eight-byte object aligned to a multiple of eight bytes.
Place the least restrictive objects: Set aside one byte each for a, b, and c.
This sort of reordering avoids wasting space, and any decent compiler will use it for memory that it is free to arrange (such as automatic objects on the stack or static objects in global memory).
When you declare members inside a struct, the compiler is required to use the order in which you declare the members, so it cannot perform this reordering to save space. In that case, declaring a mixture of character objects and other objects can waste space.
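A sketch of the effect on struct sizes (the numbers assume a 4-byte int, an 8-byte double, and typical alignment rules; the struct names are mine):

#include <stdio.h>

struct Mixed  { char a; int i; char c; double d; };  /* declaration order forces padding */
struct Sorted { double d; int i; char a; char c; };  /* most-aligned member first: less padding */

int main(void)
{
    printf("%zu %zu\n", sizeof(struct Mixed), sizeof(struct Sorted));
    /* typically prints "24 16" on such an implementation */
    return 0;
}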
Q: Does a program allocate four bytes for every "char" you declare?
A: No - absolutely not ;)
Q: Is it possible that, if you allocate a single byte, the program might "pad" with extra bytes?
A: Yes - absolutely yes.
The issue is "alignment". Some computer architectures must access a data value with respect to a particular offset: 16 bits, 32 bits, etc. Other architectures perform better if they always access a byte with respect to an offset. Hence "padding":
http://en.wikipedia.org/wiki/Byte_padding#Data_structure_padding
There may indeed not be any point in declaring a single char variable.
There may however be many good reasons to want a char-array, where an int-array really wouldn't do the trick!
(Try padding a data structure with ints...)
Others have for the most part answered this. Assuming a char is a single byte, does declaring a char mean that it always pads to an alignment? No: some compilers do by default, some don't, and with many you can change the default via some option. Does this mean you shouldn't use a char? It depends. First off, the padding doesn't always happen, so the few wasted bytes don't always happen either; and if you think you have only 3 wasted bytes in your whole binary from a high-level language compiler... think again. Depending on the architecture, using chars can bring savings; for example, loading immediates saves you three bytes or more on some architectures. On other architectures, extra instructions are required just to make a wider register behave like a byte-sized one (sign-extending or clipping it) for simple operations. If you are on a 32-bit computer and you are using an 8-bit char because you are only counting from 1 to 100, you might as well use a full-sized int; in the long run you are probably not saving anyone anything by using the char. Now, if this is an 8086-based PC running DOS, that is a different story. Or an 8-bit microcontroller; then you want to lean toward the 8-bit variables as much as possible.
I had a query regarding one of the answers of this question.
The answer says:
If a 32-bit processor can address 2^32 memory locations, that simply
means that a C pointer on that architecture can refer to 2^32 - 1
locations plus NULL
Isn't it 2^32 plus NULL? Why is the -1?
EDIT: sorry for not being clear. Edited question.
2^32 - 1 locations plus NULL
That equals 2^32.
In most programming languages and operating systems, NULL is a dedicated pointer value (usually 0) that means "invalid pointer"; that's why it cannot be used to point to a valid memory location.
Because pointers are just integer numbers, there's no way to signal an invalid pointer other than with a dedicated value. Because 32-bit integers can have 2^32 possible values, if you don't count this NULL value, you get 2^32 - 1 valid memory locations.
The author of that text is distinguishing NULL as not being a memory location. So you use one of the 2^32 available values for NULL which leaves 2^32-1 available for memory locations.
NULL is actually a value, usually 0, so if you add it back in you get 2^32.
I believe the concept of NULL originally comes from memory allocators, to signal that they have failed at allocating a slot of memory. By comparing the returned value with NULL, the programmer can determine whether the pointer may be used as such or not. I don't believe that NULL has any place in the discussion; it is simply a form of alias for position 0 (0x00000000).
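The classic use, as a sketch: malloc returns NULL on failure, and the caller must check before dereferencing:

#include <stdlib.h>

int main(void)
{
    int *p = malloc(100 * sizeof *p);
    if (p == NULL)  /* the dedicated "invalid pointer" value came back */
        return 1;   /* do not dereference p */
    p[0] = 42;      /* safe: the allocation succeeded */
    free(p);
    return 0;
}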
In real-mode systems (lacking a memory protection scheme), the memory at location 0 may usually be read from and sometimes written to (if it isn't read-only). This of course depends on whether physical memory chips are actually wired to that location. In many real-mode systems with a writable 0 location you get funny results if you manipulate that location. In x86 (16-bit real mode) PC-DOS systems the location holds the first vector in the interrupt vector table, the vector that points to the interrupt handler for divide by 0. A programmer could write there by mistake or for some valid reason.
In protected-mode systems a program accessing the 0 position will usually cause a memory protection fault that will terminate it. I should clarify this and say that location 0 to the application is almost certainly not the physical location 0 since most protected mode OSes have remapped an application's memory area using a virtual addressing mechanism provided by the host processor. The OS itself may, under certain circumstances, give itself or another program permission to access whatever memory location it pleases but such circumstances are few and so restricted that an application developer will never encounter them in his or her lifetime.
With this somewhat lengthy backgrounder, I agree with those who say that for a 32-bit processor there are usually 2^32 addressable locations, ranging from 0 (0x00000000) to 2^32 - 1 (0xffffffff). Exceptions are early 32-bit processors (such as Intel's 386SX and Motorola's 68000) that implemented fewer than 32 address lines.
The author is saying that the C implementation must reserve one of the addresses that the processor can use, so that in C that address represents a null pointer. Hence, C can address one fewer memory location than the processor can. The convention is to use address 0 for null pointers.
In practice, a 32-bit processor can't really address 2^32 memory locations anyway, since various parts of the address space will be reserved for special purposes rather than mapped to memory.
There's also no actual requirement that the C implementation represent pointers using the same sized addresses that the processor uses. It might be horribly inefficient to do so, but the C implementation could use 33-bit pointers (which would therefore require at least 5 bytes of storage, and wouldn't fit in a CPU register), enabling it to use a value for null pointers that isn't one of the 2^32 addresses the processor can handle.
But assuming nothing like that, it's true that a 32 bit pointer can represent any of 2^32 addresses (whether those addresses refer to memory locations is another matter), and it's also true that since C requires a null pointer, one of those addresses must be used to mean "a null pointer".