Computer Memory Allocation for Duplicate Inputs - c

I'm taking Introduction to CS (CS50, Harvard) and we're learning about type declarations in C. When we declare a variable and give it a type, the computer allocates a specific number of bytes for it (1 byte for a char, 4 bytes for an int, 8 bytes for a double, etc.).
For instance, if we declare the string "EMMA", we're using 5 bytes, 1 for each "char" and 1 extra for the \0 null byte.
Well, I was wondering why 2 M's are allocated separate bytes. Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?
Would love some education on the matter (without getting too deep, as I'm fairly new to the field).
Edit: Fixed some bits into bytes — my bad

1 byte for char, 4 bytes for int, 8 bytes for doubles etc...
These are typical values, but they depend on the architecture (as noted in another answer, there are even 9-bit-per-byte architectures still being sold these days).
Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?
While this idea is certainly feasible in theory, in practice the overhead is way too big for simple data like characters: one character is usually a single byte.
If we set up a system that allocates memory for each character value once and only refers to it from the string, the string would become a series of references telling us which character belongs at each position. In C such a reference is a pointer (you will encounter them at some point in your course), and a pointer is usually 4 or 8 bytes long (32 or 64 bits). Assuming 32-bit pointers, you would use about 24 bytes of memory to store the string this way (20 bytes of pointers plus the unique characters) instead of 5 bytes with the simple method, and you would need even more metadata to be able to modify the string properly while your program runs.
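To make the overhead concrete, here is a small sketch (the names plain and shared are just illustrative) contrasting the ordinary 5-byte string with a hypothetical pointer-per-character version:
#include <stdio.h>

int main(void) {
    /* Plain representation: one byte per character plus the '\0'. */
    char plain[] = "EMMA";

    /* Hypothetical "shared character" representation: every position in
       the string becomes a pointer to a character stored once elsewhere. */
    char E = 'E', M = 'M', A = 'A', nul = '\0';
    char *shared[] = { &E, &M, &M, &A, &nul };

    printf("plain : %zu bytes\n", sizeof plain);   /* 5 */
    printf("shared: %zu bytes\n", sizeof shared);  /* 20 with 4-byte pointers,
                                                      40 with 8-byte pointers */
    return 0;
}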
Your idea of storing a chunk of data and referring to it multiple times does however exist in several cases:
virtual memory (you will encounter this if you go towards OS development), where copy-on-write is used
higher level languages (like C++)
filesystems which implement a copy-on-write feature, like BTRFS
some backup systems (like borg or rsync) which deduplicate the files/chunks they store
Facebook's Zstandard compression algorithm, where a dictionary of small common chunks of data is used to improve compression ratio and speed
In such settings, where large amounts of data are stored, the relatively small overhead of storing the data once and referring to it many times (which also speeds up copying) is worth the added complexity.

For instance if we declare the string "EMMA", we're using 5 bits
I am sure you mean 5 bytes rather than 5 bits.
Well, I was wondering why 2 M's are allocated separate bits. Can't the
computer make use of the chars or integers currently taking space in
the memory and refer to that specific slot when it wants to reuse it?
A pointer to such a "slot" usually occupies 4 or 8 bytes, so it makes no sense to spend 8 bytes to point at an object that occupies only one byte.
Moreover, "EMMA" is a character array made of adjacent bytes, so all elements of the array have the same type and, correspondingly, the same size.
The compiler can reduce memory usage by avoiding duplicated string literals: for example, it can store identical string literals as a single literal. This usually depends on a compiler option.
So if the same string literal occurs in the program more than once, for example twice as in these statements
char *s = malloc( sizeof( "EMMA" ) );  /* sizeof "EMMA" is 5: four characters plus the '\0' */
strcpy( s, "EMMA" );
then the compiler can store only one copy of the string literal.
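If you're curious whether your compiler actually does this, a tiny (non-authoritative) check is to compare the addresses of two identical literals; the standard allows either outcome:
#include <stdio.h>

int main(void) {
    /* Whether both literals share one copy is up to the compiler
       (GCC, for instance, merges identical literals by default);
       either printout below is permitted by the standard. */
    const char *a = "EMMA";
    const char *b = "EMMA";
    printf(a == b ? "one shared copy\n" : "two separate copies\n");
    return 0;
}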

The compiler is not supposed to be tailored to one particular program; it has to do the minimal, general thing so that programs stay easy for programmers to understand and manipulate. In other words, it has to be general.
As a programmer you can make your own program save data in the way you suggest, but it won't be general.
E.g., I am making a database for my school and I entered a wrong name, and now I want to change the 2nd 'M' in "EMMA"; this would be troublesome if the system worked the way you suggest, because that 'M' would be shared with every other string that uses it.
Would love to clarify further if needed. :)

Related

(C) What ways can we manipulate memory length and addresses?

I'm having a tough time proposing this question to a lot of people. If anyone here can help me shed some light on this, I would greatly appreciate it, as this has been the ultimate roadblock for almost three years with my problem.
If you know your arrays and memory, then skip to the last paragraph for the big question, but if you read through this it may help you understand why, and with which concepts, I am struggling.
So we have an initialized array:
int main(void) {
    int x = 10;
    /* note: an array sized by a variable (int arr[x]) cannot take an
       initializer, so the size is spelled out here */
    int arr[10] = { 3, 5, 1, 9, 4, 17, 2, 12, 6, 8 };
To access the 5th element of the array (index 4), we would print it out as follows:
printf("%d",arr[4]);
Now, my first question revolves around this process. The printf format says that I want to print an integer, and as a "directory" I give it the array name (the address of the first element) and specify that I want the element at index 4.
The information I have so far leads me to believe that each integer occupies 4 bytes of memory. Usually, in a classroom setting, when a professor explains this they draw something like this on the board:
index:     0      1      2      3      4      5      6      7      8      9
address: [1000] [1004] [1008] [1012] [1016] [1020] [1024] [1028] [1032] [1036]
value:     3      5      1      9      4      17     2      12     6      8
Now, this is all basic, but during my time learning how computers work I've realized one very important thing: computers are very stupid, and that is why we program them, so please bear with me.
Question 1: Is this an accurate representation of memory? This is how almost everyone on YouTube, in classrooms, etc. paints the picture: a sequential list starting at 1000 and increasing by 4. But what if I have two arrays of the same size? They can't occupy the same memory addresses, so how do I keep track of where each one lives?
Question 2: I understand that arr[4] refers to the value at position 4 of the array. But this arr[4], what is it? The 4: is this 4 stored in memory somewhere? What data type is this 4? Is it a pointer as well? My understanding is that it is an extension of the array pointer, which confuses me, because the array points to the 1st element at position 0, so how does a "4" coexist inside of a pointer? What does the process look like for the computer? I'm assuming it's not a problem because 4 is used together with the pointer and can therefore be applied to the indicated array.
But the process? Is it: go to arr[0] and then count 4 bytes from position 0, 4 times? What address does position 0 hold? I know for teaching purposes it's visualized as starting at 1000, but is that always the case? Why not address 1036 and count 4 up from there? I also read somewhere that memory addresses compartmentalize storage: if a char (1 byte) is stored in memory next to an int (4 bytes), gaps are left between them so that everything stays properly aligned.
So now on to my final question, which I can't find anything about on the interwebz. Can I somehow tell the computer to assign the length of memory from index 0 to n to a variable? Maybe I'm asking in the wrong way, so let me rephrase: is there a data type that defines a length of memory rather than a position? I know we can access the amount of memory a certain variable takes, but to do that we reference the variable and receive its size as a result. I want to assign a length of memory to a variable.
is there a data type that defines length of memory and not position?
ALL data types in C do this. Every type defines how many bytes are needed to hold that type and what the individual bits in each byte mean. This is implementation defined, so different compilers and different targets define them differently, but the language gives you the tools to write portable code that will work in a well-defined manner whatever that may be.
sizeof tells you the size of any type in bytes. So sizeof(int) will tell you how large an int is on your target -- generally 4, but some targets use 8 or 2.
CHAR_BIT tells you the size of a byte -- how many bits are in a byte. You don't often need this, but when you do, it is available.
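A minimal program along those lines (nothing here is specific to any one compiler) just prints what the current target uses:
#include <stdio.h>
#include <limits.h>

int main(void) {
    /* sizeof reports sizes in bytes; CHAR_BIT reports bits per byte.
       The exact numbers depend on the implementation; 8-bit bytes and
       a 4-byte int are merely the most common case. */
    printf("bits per byte  : %d\n", CHAR_BIT);
    printf("sizeof(char)   : %zu\n", sizeof(char));    /* always 1 */
    printf("sizeof(int)    : %zu\n", sizeof(int));
    printf("sizeof(double) : %zu\n", sizeof(double));
    return 0;
}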

Pointer Meta Information

An interesting feature of realloc() is that it somehow knows how long your data is when it is copying or extending your allocated memory.
I read that what happens is that behind the scenes there is some meta information stored about a pointer (which contains its allocated memory size), usually immediately before the address the pointer points at (but of course, subject to implementation).
So my question is, if there is such data stored why isn't it exposed via an API, so things like the C string for example, won't have to look for a \0 to know where the end of a string is.
There could be hundreds of other uses as well.
As you said yourself, it's subject to the implementation of the standard libraries' memory manager.
C doesn't have a standard for this type of functionality, probably because generally speaking C was designed to be simple (but also provide a lot of capabilities).
With that said, there isn't really much use for this type of functionality outside of features for debugging memory allocation.
Some compilers do provide this type of functionality (specifically memory block size), but it's made pretty clear that it's only for the purpose of debugging.
It's not uncommon to write your own memory manager, allowing you to have full control over allocations, and even make assumptions for other library components. As you mentioned, your implementation can store the size of an allocation somewhere in a header, and a string implementation can reference that value rather than walking the bytes for \0 termination.
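A minimal sketch of that approach, assuming a simple size_t-sized header (the names my_alloc, my_size and my_free are invented for illustration, and this header ignores types with stricter alignment than size_t):
#include <stdlib.h>

void *my_alloc(size_t n) {
    size_t *p = malloc(sizeof(size_t) + n);
    if (p == NULL) return NULL;
    *p = n;          /* header: remember the requested size */
    return p + 1;    /* the caller only ever sees the payload */
}

size_t my_size(const void *payload) {
    return ((const size_t *)payload)[-1];   /* read the size back from the header */
}

void my_free(void *payload) {
    if (payload != NULL) free((size_t *)payload - 1);
}

int main(void) {
    char *s = my_alloc(19);
    if (s == NULL) return 1;
    size_t n = my_size(s);   /* 19, the size my_alloc recorded */
    my_free(s);
    return n == 19 ? 0 : 1;
}
A string type built on such an allocator could then ask my_size for the stored length instead of scanning for \0, at the cost of every allocation carrying the extra header.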
This information is not always accurate.
On Windows, the size is rounded up to a multiple of at least 8.
realloc does fine with this because it cares only about the allocated size, but you won't get the requested size.
Here it's worth understanding how allocators work a bit. For example, if you take a buddy allocator, it allocates chunks in predetermined sizes. For simplicity, let's say powers of 2 (real world allocators usually use tighter sizes).
In this kind of simplified case, if you call malloc and request 19 bytes of memory for your personal use, then the allocator is actually going to give you a 32-byte chunk worth of memory. It doesn't care that it gave you more than you needed, as that still satisfies your basic request by giving you something as large or larger than what you needed.
These chunk sizes are usually stored somewhere, somehow, for a general-purpose allocator that can handle variable-sized requests to be able to free chunks and do things like merge free chunks together and split them apart. Yet they're identifying the chunk size, not the size of your request. So you can't use that chunk size to see where a string should end, e.g., since according to you, you wanted one with 19 bytes, not 32 bytes.
realloc also doesn't care about your requested 19-byte size. It's working at the level of bits and bytes, and so it copies the whole chunk's worth of memory to a new, larger chunk if it has to.
So these kinds of chunk sizes generally aren't useful for implementing things like data structures. For those, you want to work with the size you actually requested, which is something many allocators don't even bother to store (they don't need it in order to do their job efficiently). So it's often up to you to keep track of that size, in whatever form fits the logic of your software (a sentinel like a null terminator is one such form).
As for why querying these sizes is not available in the standard, it's probably because the need to know it would be a rather obscure, low-level kind of need given that it has nothing to do with the amount of memory you actually requested to work with. I have often wished for it for some very low-level debugging needs, but there are platform/compiler-specific alternatives that I have used instead, and I found they weren't quite as handy as I thought they would be (even for just low-level debugging).
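For what it's worth, one such platform-specific alternative is glibc's malloc_usable_size(), which reports the chunk size rather than the requested size, exactly as described above:
/* glibc-specific: malloc_usable_size() exposes the size of the chunk backing
   an allocation, which is usually larger than what was requested. It is a
   debugging aid, not a portable substitute for tracking your own lengths
   (other platforms have their own equivalents, e.g. _msize or malloc_size). */
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

int main(void) {
    char *p = malloc(19);
    if (p == NULL) return 1;
    printf("requested 19 bytes, chunk provides %zu\n", malloc_usable_size(p));
    free(p);
    return 0;
}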
You asked:
if there is such data stored why isn't it exposed via an API, so things like the C string for example, won't have to look for a \0 to know where the end of a string is.
You can allocate enough memory to hold 100 characters but use that memory to hold a string that is only 5 characters long. If you relied on the pointer metadata, you would get the wrong length.
char* cp = malloc(100);
strcpy(cp, "hello");
The second reason you would need the terminating null character is when you use stack memory to create a string.
char str[100] = "hello";
Without the terminating null character, you won't be able to determine the length of the string held in str.
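A short illustration of the difference between the buffer size and the length of the string it holds:
#include <stdio.h>
#include <string.h>

int main(void) {
    char str[100] = "hello";
    /* The buffer is 100 bytes, but the string it holds is 5 characters;
       only the '\0' terminator lets strlen recover that length. */
    printf("sizeof str  = %zu\n", sizeof str);    /* 100 */
    printf("strlen(str) = %zu\n", strlen(str));   /* 5   */
    return 0;
}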
You said:
There could be hundreds of other uses as well.
I don't have a good response to that other than to say that custom memory allocators often provide access to the metadata. The reason for not including such APIs in the standard library is not obvious to me.

Sizeof pointer for 16 bit and 32 bit

I was just curious to know what sizeof a pointer would return on a 16-bit and a 32-bit system:
printf("%zu\n", sizeof(int16_t *));  /* int16_t/int32_t come from <stdint.h>; %zu is the specifier for size_t */
printf("%zu\n", sizeof(int32_t *));
Thank you.
Short answer: on a 32-bit Intel 386 you will likely see these returning 4, while targeting a 16-bit 8086 you will most likely see either 2 or 4 depending on the memory model you selected.
The details
First, standard C does not mandate anything in particular about pointers, only that they must be able to "point to" the given object and that pointer arithmetic must work within that object's data area. Even a C interpreter with some exotic representation of pointers is possible, and given this flexibility pointers truly might be of any size depending on what you target.
Usually, however, compilers do represent pointers as memory addresses, which makes several operations that the C standard leaves undefined "usually work". How a compiler chooses to represent a pointer depends on the targeted architecture: compiler writers obviously choose representations that are useful, efficient, or both.
An example of a useful representation is a generic pointer on a Harvard-architecture micro, which lets you address both code and data RAM. On an 8-bit micro such a pointer might be encoded as one type byte plus two address bytes, which implies that whenever you dereference it, more complex code has to be emitted to load the contents from the proper place.
That also points at an efficient representation: why not have specific pointer types instead? One that points to code memory and another that points to data memory: just 2 bytes each (assuming a 16-bit address space, as is usual for 8-bit micros such as the 8051), with no need to select by type at run time.
But then you have multiple types of pointers (again the 8051: you will likely have at least one additional pointer type for its internal RAM too...), and the programmer needs to think about which particular pointer type to use.
And of course the sizes also differ. On this hypothetical compiler targeting the 8051, you would have a generic pointer type of 3 bytes, an external data memory pointer type of 2 bytes, a code memory pointer of 2 bytes, and an internal RAM pointer type of 1 byte.
Also note that these are types of pointers, not the types of the data they point to (function pointers are a little odd here: the fact that a pointer is a function pointer already implies it may be represented differently from data pointers, even though the only syntactic difference is that the pointed-to type is a function type).
Back to your 16bit machine, assuming it is a 8086:
If you use some memory model where the compiler assumes you have a single data segment, you will likely get 2 byte data pointers if you don't specifically declare one near or far. Otherwise you will get 4 byte pointers by default. The representation of 2 byte pointers is usually simply the 16bit offset, while for 4 byte pointers it is a segment:offset pair. You can always apply a near or far specifier to explicitly make your pointers one or another type.
(How do near pointers work in a program that also uses far pointers? Simply put, the compiler generates a default data segment and everything reachable through near pointers is located within it. The compiler can keep the ds segment register loaded with that default data segment permanently, or at least most of the time, so accessing data through near pointers can be faster.)
The size of a pointer depends on the architecture. More precisely, it depends on the size of the addresses used on that architecture, which reflects the width of the address bus used to access memory.
For example, on a 32-bit architecture an address is 4 bytes:
sizeof(void *) == 4
On a 64-bit architecture, addresses are 8 bytes:
sizeof(void *) == 8
Note that on common platforms all object pointers have the same size regardless of the type they point to, so if you run your code, the size of an int16_t pointer and the size of an int32_t pointer will be the same.
The size of a pointer on a 16-bit system, however, is typically 2 bytes. A plain 16-bit pointer can address at most 2^16 bytes, i.e. 64 KB, which is why such systems have so little directly addressable memory compared with today's computers (segmented schemes such as the 8086's, described above, can reach somewhat more).
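If you want to check this on your own machine, a small test like the following (assuming your implementation provides the optional uintptr_t type) prints an address along with the pointer sizes; note that %zu, not %d, is the right conversion for sizeof:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int x = 42;
    int *p = &x;
    /* uintptr_t, where provided, is an unsigned integer wide enough to hold
       a data pointer, so its width matches the platform's address size. */
    printf("address        : %p\n", (void *)p);
    printf("as an integer  : %ju\n", (uintmax_t)(uintptr_t)p);
    printf("sizeof(int *)  : %zu bytes\n", sizeof p);
    printf("sizeof(void *) : %zu bytes\n", sizeof(void *));
    return 0;
}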

What is the purpose of the byte size of the type of a variable if I know the address of the variable?

I don't quite get the purpose of knowing the byte size of a variable when I already know its address. For example, let's say I know where an int variable is stored, say at address 0x8C729A09; if I want the int stored at that address I can just dereference the address and read the number stored in it.
So, what exactly is the purpose of knowing the byte size of the variable? Why does it matter whether the variable has 4 bytes (being an int) or 8 bytes, if I am able to get its value just by dereferencing the address? I am asking this because I am working on dereferencing some addresses, and I thought I needed to go through a for loop to read the variable (knowing the start address, which is the address of the variable, and its size in bytes), but whenever I do this I just get other variables that are also declared.
A little bit of context: I am working on a tool called Pin and getting the addresses of the global variables declared in another program.
The for case looks something like this:
for (address1 = (char *) 0x804A03C, limit = address1 + bytesize; address1 < limit; address1++)
    cout << *address1 << "\n";
Michael Krelin gave a very compact answer but I think I can expand on it a bit more.
In any language, not just C, you need to know the size for a variety of reasons:
This determines the maximum value that can be stored
The memory space an array of those values will take (1000 bytes will get you 250 ints or 125 longs).
When you want to copy one array of values into another, you need to know how many bytes are used to allocate enough space.
While you may dereference a pointer and get the value, you could also dereference the pointer at a different portion of that value, but only if you know how many bytes it is composed of. For example, you could get the high half of an int by grabbing just two of its bytes and the low half from the other two (which bytes hold which half depends on the machine's byte order).
Different architectures may have different sizes for different variables, which would impact all the above points.
Edit:
Also, there are certainly cases where you need to know the number of bits a given variable is made of. If you want 32 booleans, what better variable to use than a single int, which is made of 32 bits? Then you can use some constant masks to address each bit, and now you have an "array" of booleans; this is usually called using bit flags or bit masks (struct bit-fields are a closely related feature). In programming, every detail can matter, just not all the time for every application. Just figured that might be an interesting thought exercise.
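A small sketch of that idea, using an unsigned 32-bit integer as 32 boolean flags manipulated with shifts and masks:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* One 32-bit integer used as 32 boolean flags. */
    uint32_t flags = 0;

    flags |= (1u << 3);              /* set flag 3    */
    flags ^= (1u << 7);              /* toggle flag 7 */
    flags &= ~(1u << 3);             /* clear flag 3  */

    int flag7 = (flags >> 7) & 1u;   /* test flag 7   */
    printf("flag 7 is %s\n", flag7 ? "set" : "clear");
    return 0;
}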
The answer is simple: the internal representation of most types needs more than one byte. In order to dereference a pointer you (either you or the compiler) need to know how much bytes should be read.
Also consider this when working with strings: you cannot always rely on the terminating \0, hence you need to know how many bytes to read. Examples of this are functions like memcpy or strncmp.
Suppose you have an array of variables. Where do you find the variable at a non-zero index without knowing its size? And how many bytes do you allocate for a non-zero-length array of variables?
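That is exactly what indexing does under the hood; a quick sketch showing that arr[3] is just the element 3 * sizeof(int) bytes past the start of the array:
#include <stdio.h>

int main(void) {
    int arr[5] = { 10, 20, 30, 40, 50 };
    /* arr[3] lives 3 * sizeof(int) bytes past the start of the array;
       indexing (and pointer arithmetic) does that scaling for you. */
    printf("%d\n", arr[3]);        /* 40 */
    printf("%d\n", *(arr + 3));    /* the same element */
    printf("%p == %p\n",
           (void *)&arr[3],
           (void *)((char *)arr + 3 * sizeof(int)));
    return 0;
}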

Assemblers and word alignment

Today I learned that if you declare a char variable (which is 1 byte), the assembler actually uses 4 bytes in memory so that the boundaries lie on multiples of the word size.
If a char variable uses 4 bytes anyway, what is the point of declaring it as a char? Why not declare it as an int? Don't they use the same amount of memory?
When you are writing in assembly language and declare space for a character, the assembler allocates space for one character and no more. (I write in regard to common assemblers.) If you want to align objects in assembly language, you must include assembler directives for that purpose.
When you write in C, and the compiler translates it to assembly and/or machine code, space for a character may be padded. Typically this is not done because of alignment benefits for character objects but because you have several things declared in your program. For example, consider what happens when you declare:
char a;
char b;
int i;
char c;
double d;
A naïve compiler might do this:
Allocate one byte for a at the beginning of the relevant memory, which happens to be aligned to a multiple of, say, 16 bytes.
Allocate the next byte for b.
Then it wants to place the int i which needs four bytes. On this machine, int objects must be aligned to multiples of four bytes, or a program that attempts to access them will crash. So the compiler skips two bytes and then sets aside four bytes for i.
Allocate the next byte for c.
Skip seven bytes and then set aside eight bytes for d. This makes d aligned to a multiple of eight bytes, which is beneficial on this hypothetical machine.
So, even with a naïve compiler, a character object does not require four whole bytes to itself. It can share with neighbor character objects, or other objects that do not require greater alignment. But there will be some wasted space.
A smarter compiler will do this:
Sort the objects it has to allocate space for according to their alignment requirements.
Place the most restrictive object first: Set aside eight bytes for d.
Place the next most restrictive object: Set aside four bytes for i. Note that i is aligned to a multiple of four bytes because it follows d, which is an eight-byte object aligned to a multiple of eight bytes.
Place the least restrictive objects: Set aside one byte each for a, b, and c.
This sort of reordering avoids wasting space, and any decent compiler will use it for memory that it is free to arrange (such as automatic objects on stack or static objects in global memory).
When you declare members inside a struct, the compiler is required to use the order in which you declare the members, so it cannot perform this reordering to save space. In that case, declaring a mixture of character objects and other objects can waste space.
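You can watch this happen with offsetof and sizeof; the offsets in the comments below are what a typical 64-bit ABI is likely to produce, and other targets may differ:
#include <stdio.h>
#include <stddef.h>

/* Inside a struct the compiler must keep the members in declaration order,
   so a mixed layout like this forces padding. */
struct mixed {
    char   a;   /* offset 0         */
    int    i;   /* likely offset 4  */
    char   c;   /* likely offset 8  */
    double d;   /* likely offset 16 */
};              /* sizeof likely 24, although only 14 bytes carry data */

int main(void) {
    printf("offsetof i : %zu\n", offsetof(struct mixed, i));
    printf("offsetof d : %zu\n", offsetof(struct mixed, d));
    printf("sizeof     : %zu\n", sizeof(struct mixed));
    return 0;
}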
Q: Does a program allocate four bytes for every "char" you declare?
A: No - absolutely not ;)
Q: Is it possible that, if you allocate a single byte, the program might "pad" with extra bytes?
A: Yes - absolutely yes.
The issue is "alignment". Some computer architectures must access a data value at a particular alignment: a multiple of 16 bits, 32 bits, etc. Other architectures can access data at any offset but perform better when it is aligned. Hence "padding":
http://en.wikipedia.org/wiki/Byte_padding#Data_structure_padding
There may indeed not be any point in declaring a single char variable.
There may however be many good reasons to want a char-array, where an int-array really wouldn't do the trick!
(Try padding a data structure with ints...)
Others have for the most part answered this. Assuming a char is a single byte, does declaring a char mean that it always pads to an alignment? No: some compilers do by default, some don't, and on many you can change the default with some sort of option. Does this mean you shouldn't use a char? It depends. First, the padding doesn't always happen, so the few wasted bytes don't always happen. You are programming in a high-level language using a compiler, so if you think you have only 3 wasted bytes in your whole binary... think again. Depending on the architecture, using chars can bring some savings; for example, loading immediates saves you three bytes or more on some architectures. On other architectures, even simple operations on a char require extra instructions to sign-extend or mask the larger register so it behaves like a byte-sized one. If you are on a 32-bit computer and you are using an 8-bit character because you are only counting from 1 to 100, you might as well use a full-sized int; in the long run you are probably not saving anyone anything by using the char. Now, if this is an 8086-based PC running DOS, that is a different story. Or an 8-bit microcontroller: then you want to lean toward the 8-bit variables as much as possible.
