How does a C compiler convert a constant to binary - c

For the sake of specifics, let's consider the latest version of the GCC compiler.
Consider the statement int i = 7;.
In assembly it will be something like
MOV 7, R1
This will insert the value seven into register R1. The exact instruction may not be important here.
In my understanding, now the compiler will convert the MOV instruction to processor specific OPCODE. Then it will allocate a (possibly virtual) register. Then the constant value 7 needs to go in the register.
My question:
How is the 7 actually converted to binary?
Does the compiler actually repeatedly divide by 2 to get the binary representation? (Maybe afterwards it will convert to hex, but let's stay with the binary step.)
Or, considering that the 7 is written as a character in a text file, is there a clever look-up-table-based technique to convert any string (representing a number) to a binary value?
If the current GCC compiler uses a built-in function to convert a string 7 to a binary 0111, then how did the first compiler convert a text-based string to a binary value?
Thank you.

How is the 7 actually converted to binary?
First of all, there's a distinction between the binary base-2 number format and what professional programmers call "a binary executable", meaning generated machine code, most often expressed in hex for convenience. Addressing the latter meaning:
Disassemble some binaries (for example at https://godbolt.org/) and see for yourself:
int main (void)
{
    int i = 7;
    return i;
}
Does indeed get translated to something like
mov eax,0x7
ret
Translated to binary op codes:
B8 07 00 00 00
C3
Where B8 = mov eax, B9 = mov ecx and so on. The 7 is translated into 07 00 00 00, since mov expects a 4-byte immediate and this is a little-endian CPU.
And this is the point where the compiler/linker stops caring. The code was generated according to the CPU's ABI (Application Binary Interface) and how to deal with this machine code from here on is up to the CPU.
As for how this makes it into the hardware in the actual form of base 2 binary... it's already in that form. Everything we see in a PC is a translated convenience for the human users, who have an easier time reading decimal or hex than raw binary.
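If you want to see that little-endian layout for yourself, here is a minimal sketch (assuming a typical machine with a 4-byte int) that dumps the raw bytes of an int holding 7:

#include <stdio.h>
#include <string.h>

int main(void)
{
    int i = 7;
    unsigned char bytes[sizeof i];
    memcpy(bytes, &i, sizeof i);          /* copy the raw storage of i */
    for (size_t n = 0; n < sizeof i; n++)
        printf("%02X ", bytes[n]);
    putchar('\n');
    return 0;
}

On a little-endian x86 this prints 07 00 00 00, matching the immediate bytes that followed B8 above.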

If the current GCC compiler uses a built-in function to convert a string 7 to a binary 0111, then how did the first compiler convert a text-based string to a binary value?
This is a chicken-and-egg problem, but simply put, compilers are created step by step, and at some point a compiler is written in its own language, e.g. a C compiler is written in C (bootstrapping).
Before answering your question we should define what we mean by "compilation", or what a compiler does. Simply put, compilation is a pipeline: it takes your high-level code, does some operations, and generates assembly code (specific to the machine); then the machine-specific assembler takes that assembly code and converts it into a binary object file.
At the compiler level, all that happens is the creation of the corresponding assembly code in a text file.
The assembler is another program that takes this text file and converts it into "binary" format. The assembler can also be written in C; here we also need a mapping, i.e. movl -> (0000110101110...), but this one is binary, not ASCII, and we need to write this binary into a file as-is.
Converting numbers into binary format is also redundant, because numbers are already in binary form when they are loaded into memory.
How they are converted and placed into memory is a job for the operating system's loader, which exceeds my knowledge.
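To make the answer to the original question concrete: a compiler's lexer typically converts the characters of a decimal literal one digit at a time, and there is no division by 2 anywhere. A minimal sketch (the function name parse_decimal is purely illustrative, not anything GCC actually uses):

#include <stdio.h>

/* Classic digit-by-digit conversion: the multiplications and additions are
   done by a CPU whose registers are already binary, so the result is in
   binary automatically, with no explicit base conversion. */
static int parse_decimal(const char *s)
{
    int value = 0;
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (*s - '0');  /* '7' - '0' == 7 */
        s++;
    }
    return value;
}

int main(void)
{
    printf("%d\n", parse_decimal("7"));   /* prints 7 */
    return 0;
}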

Related

Is memory stored in ARM Assembly in binary?

Let's say I executed an instruction such as MOV R1, #9. Would this store the number 9 in binary in memory in ARM assembly? Would the ARM assembly code see the number 00001001 stored in the location of R1 in memory? If not, how would it store the decimal number 9 in binary?
The processor's storage and logic are composed of gates, which manipulate values as voltages. For simplicity, only two voltages are recognized: high and low, making a binary system.
The processor can manipulate (compute with) these values using simple boolean logic (AND, OR, etc..) — this is called combinational logic.
The processor can also store binary values for a time using cycles in logic (here think circular, as in cycles in a graph) called sequential logic. A cycle creates a feedback loop that will preserve values until told to preserve a new one (or power is lost).
Most anything meaningful requires multiple binary values, which we can call bit strings. We can think of these bit strings as numbers, characters, etc.., usually depending on context.
The processor's registers are an example of sequential logic used to store bit strings, here strings of 32 bits per register.
Machine code is the language of the processor. The processor interprets bit strings as instruction sequences to carry out. Some bit strings command the processor to store values in registers, which are composed of gates using strings of sequential logic, here 32 bits per register: each bit stores a value as a voltage in a binary system.
The registers are storage, but the only way we can actually visualize them is to retrieve their values using machine code instructions to send these values to an output device for viewing as numbers, characters, colors, etc..
In short, if you send a number #9 to a register, it will store a certain bit string, and some subsequent machine code instruction will be able to retrieve that same bit pattern. Whether that 9 represents a numeric 9 or tab character or something else is a matter of interpretation, which is done by programming, which is source code ultimately turned into sequences of machine code instructions.
Everything is always binary in a computer; decimal and hex are just human-friendly notations for writing the values that a binary bit-pattern represents.
Even after you convert a binary integer to a string of decimal digits (e.g. by repeated division by 10 to get digit=remainder), those digits are stored one per byte in ASCII.
For example the digit '0' is represented by the ASCII / UTF-8 bit-pattern 00110000 aka hex 0x30 aka decimal 48.
We call this "converting to decimal" because the value is now represented in a way that's ready to copy into a text file or into video memory where it will be shown to humans as a series of decimal digits.
But really all the information is still represented in terms of binary bits which are either on or off. There are no single wires or transistors inside a computer that can represent a single 10-state value using e.g. one of ten different voltage levels. It's always either all the way on or all the way off.
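As a concrete illustration of that repeated division by 10, here is a minimal sketch of an integer-to-decimal-string conversion (a simplified, hypothetical version of what a printf-style routine might do):

#include <stdio.h>

/* Repeatedly divide by 10 and turn each remainder into an ASCII digit.
   The input and the output are both just bytes in memory; only the
   interpretation changes. */
static void to_decimal_string(unsigned value, char *out)
{
    char tmp[16];
    int  n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);  /* remainder -> ASCII digit */
        value /= 10;
    } while (value != 0);
    while (n > 0)                             /* digits come out backwards */
        *out++ = tmp[--n];
    *out = '\0';
}

int main(void)
{
    char buf[16];
    to_decimal_string(48, buf);
    printf("%s\n", buf);                      /* prints 48 */
    return 0;
}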
https://en.wikipedia.org/wiki/Three-valued_logic aka Ternary logic is the usual example of non-binary logic. True decimal logic would be one instance of n-valued logic, where a "bit" can have one of 10 values instead of one of 3.
In practice CPUs are based on strictly binary (Boolean) logic. I mention ternary logic only to illustrate what it would really mean for computers to truly use decimal logic/numbers instead of just using binary to represent decimal characters.
A processor has a number of places to store content; you can call it a memory hierarchy: „Memories“ that are very close to the processor are very fast and small, e.g. registers, and „memories“ that are far from the processor are slow and large, like a disk. In between are cache memories at various levels and the main memory.
Since these memories are technologically very different, they are accessed in different ways: registers are accessed by specifying them in assembly code directly (as R1 in MOV R1, #9), whereas the main memory storage locations are consecutively numbered with „addresses“.
So if you execute MOV R1, #9, the binary number 9 is stored in register 1 of the CPU, not in main memory, and an assembly code reading out R1 would read back number 9.
All numbers are stored in a binary format. The best known are integers (as in your example, binary 00001001 for decimal 9) and floating point.
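If you want to look at that bit pattern from C, a small sketch like the following prints the low byte of the value 9 bit by bit:

#include <stdio.h>

int main(void)
{
    unsigned v = 9;                        /* stored as ...00001001 */
    for (int bit = 7; bit >= 0; --bit)     /* print the low byte, MSB first */
        putchar((v >> bit) & 1 ? '1' : '0');
    putchar('\n');                         /* prints 00001001 */
    return 0;
}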

Adding new instructions to binutils 2.25

I am new to binutils development. I am trying to add a new custom instruction that takes two operands (size, base virtual address). I am using binutils 2.25. The opcode is 3 bytes long and I am running it on an x86 machine. Here is what I did:
Used i386-opc.tbl to add an entry as follows.
enclsecreate, 2, 0x0f01cf, None, 3, Cpu386, No_bSuf|No_wSuf|No_lSuf|No_sSuf|No_qSuf|No_ldSuf, {Reg64, Reg64}
My understanding: the second field states the number of operands, followed by the opcode, followed by None, followed by the number of bytes in the opcode.
Then use i386-gen:
./i386-gen --srcdir=
which creates i386-tbl.h
i386-gen was not built by default; I built it using make i386-gen and then ran the above step.
To enable the disassembler, we need to update i386-dis.c. We need to add an entry to a table. I am lost at this point as to which table I need to add to, as there are so many of them and I don't understand their format.
It would be great if someone could guide me through the further steps I need to take, or point me to some documentation that contains the necessary information. Looking forward to your kind help.
Thanks
You could have shown the actual encoding for this instruction, as it is not yet in the official Intel instruction set reference (January 2015 version).
I find it strange that you say it has 2 operands, because I don't see a place for encoding them (maybe they are implicit). So I'll just assume no operands for the following.
The comment at the top of i386-dis.c says:
/* The main tables describing the instructions is essentially a copy
of the "Opcode Map" chapter (Appendix A) of the Intel 80386
Programmers Manual. Usually, there is a capital letter, followed
by a small letter. The capital letter tell the addressing mode,
and the small letter tells about the operand size. Refer to the
Intel manual for details. */
The applicable table of the manual for your opcode is Table A-6. Opcode Extensions for One- and Two-byte Opcodes by Group Number. If you have an updated copy, you should find your instruction in the 0F 01 11B row, column 001 with low 3 bits of (111). This is of course the breakdown of the CF.
The first thing for binutils is the 0F 01 group and the 001 column. This means you have to edit the table RM_0F01_REG_1. That lists the 8 possible instructions in order of their low bits. My copy currently has monitor and mwait there, for 000 and 001 respectively. You might have others too. Since the low 3 bits of our new instruction are 111 which is 7 in decimal, it has to go in the last slot in that table. Pad the table with Bad_Opcode as necessary, then insert the new entry.
If you need special decoding of the operands, you can add a new function to handle it. Use an existing one (e.g. OP_Monitor) as template. By the way, it's also an easy way to locate the required table: just look for an existing instruction that is in the same encoding group as your new instruction.
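One way to check the disassembler change is a sketch like the following, assuming a standard binutils build tree where the freshly built assembler is gas/as-new and the freshly built objdump lives in binutils/ (adjust paths to your tree):

# test.s contains just the raw encoding of the new instruction (0F 01 CF):
        .byte 0x0f, 0x01, 0xcf
# assemble and disassemble with the freshly built tools:
./gas/as-new test.s -o test.o
./binutils/objdump -d test.o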
Disassembly of section .text:
0000000000000000 <.text>:
0: 0f 01 cf enclsecreate
Yay, success!

assembler 68K symbol '!' in source operand

How to interpret assembly 68K instruction:
MOVE.W #5100!$13ec,-(A7)
What is the meaning of the symbol '!' between decimal 5100 and hexadecimal 13ec?
I have noticed that 5100 is equal to $13ec.
This is your disassembler being "helpful" and showing you two possible interpretations of the same value. Sometimes the decimal view is what you want (e.g. it's a loop counter, a fixed size, or a decimal constant), and sometimes the hex view is what you want (e.g. it's an address, a block size, flags, or a hex constant). By printing both, the disassembler saves you the conversion.
If you were going to assemble this instruction, you'd only use one interpretation, e.g.
MOVE.W #5100,-(A7)
or
MOVE.W #$13ec,-(A7)
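The two spellings denote the same bit pattern; a trivial C check, just for illustration:

#include <stdio.h>

int main(void)
{
    printf("%d\n", 0x13ec);   /* prints 5100   */
    printf("%#x\n", 5100);    /* prints 0x13ec */
    return 0;
}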

How does C know what type to expect?

If all values are nothing more than one or more bytes, and no byte can contain metadata, how does the system keep track of what sort of number a byte represents? Looking into Two's Complement and Single Precision on Wikipedia reveals how these numbers can be represented in base two, but I'm still left wondering how the compiler or processor (not sure which I'm really dealing with here) determines that a given byte must be a signed integer.
It is analogous to receiving an encrypted letter and, looking at my shelf of cyphers, wondering which one to grab. Some indicator is necessary.
If I think about what I might do to solve this problem, two solutions come to mind. Either I would claim an additional byte and use it to store a description, or I would allocate sections of memory specifically for numerical representations; a section for signed numbers, a section for floats, etc.
I'm dealing primarily with C on a Unix system but this may be a more general question.
how does the system keep track of what sort of number a byte represents?
"The system" doesn't. During translation, the compiler knows the types of the objects it's dealing with, and generates the appropriate machine instructions for dealing with those values.
Ooh, good question. Let's start with the CPU - assuming an Intel x86 chip.
It turns out the CPU does not know whether a byte is "signed" or "unsigned." So when you add two numbers - or do any operation - a "status register" flag is set.
Take a look at the "sign flag." When you add two numbers, the CPU does just that - adds the numbers and stores the result in a register. But the CPU says "if instead we interpreted these numbers as twos complement signed integers, is the result negative?" If so, then that "sign flag" is set to 1.
So if your program cares about signed vs unsigned, writing in assembly, you would check the status of that flag and the rest of your program would perform a different task based on that flag.
So when you use signed int versus unsigned int in C, you are basically telling the compiler how (or whether) to use that sign flag.
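To see this from C, here is a small sketch: the two functions below differ only in signedness, and compiling them with gcc -O1 -S on x86 typically shows a signed condition (e.g. setl/jl) chosen for the first and an unsigned condition (e.g. setb/jb) for the second, even though the bits being compared look the same.

/* compile with: gcc -O1 -S flagdemo.c   (file name is just an example) */
int less_signed(int a, int b)
{
    return a < b;              /* signed compare: uses sign/overflow flags */
}

int less_unsigned(unsigned a, unsigned b)
{
    return a < b;              /* unsigned compare: uses the carry flag */
}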
The code that is executed has no information about the types. The only tool that knows the types is the compiler
at the time it compiles the code. Types in C are solely a restriction at compile time to prevent you
from using the wrong type somewhere. While compiling, the C compiler keeps track of the type
of each variable and therefore knows which type belongs to which variable.
This is the reason why you need to use format strings in printf, for example: printf has no chance of knowing what type it will get in the parameter list, as this information is lost. In languages like Go or Java you have a runtime with reflection capabilities, which makes it possible to get the type.
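A small example of why the format string matters: the same bits are passed either way, and only the conversion specifier tells printf how to interpret them (output shown assumes a typical 32-bit int).

#include <stdio.h>

int main(void)
{
    int x = -1;                    /* all bits set in two's complement */
    printf("%d\n", x);             /* interpreted as signed   -> -1 */
    printf("%u\n", (unsigned)x);   /* same bits, as unsigned  -> 4294967295 with 32-bit int */
    return 0;
}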
Suppose your compiled C code still had type information in it; then the resulting assembly language would need to check for types. It turns out that the only thing close to types in assembly is the size of the operands of an instruction, determined by suffixes (in GAS). So what is left of your type information is the size and nothing more.
One example of an assembly-like language that does support types is Java VM bytecode, which has typed instructions for its primitives (e.g. iadd vs. dadd).
It is important to remember that C and C++ are high level languages. The compiler's job is to take the plain text representation of the code and build it into the platform specific instructions the target platform is expecting to execute. For most people using PCs this tends to be x86 assembly.
This is why C and C++ are so loose with how they define the basic data types. For example, most people say there are 8 bits in a byte. The standard only recognizes that a byte is the smallest addressable unit of data and requires it to have at least 8 bits (CHAR_BIT >= 8); nothing stops some machine out there from having, say, 9 or 16 bits per byte as its native interpretation of data.
So the interpretation of data is up to the instruction set of the processor. In many modern languages there is another abstraction on top of this, the Virtual Machine.
If you write your own scripting language it is up to you to define how you interpret your data in software.
In C, besides the compiler, which knows perfectly well about the types of the given values, there is no system that knows about the type of a given value.
Note that C by itself doesn't bring any runtime type information system with it.
Take a look at the following example:
int i_var;
double d_var;

int main () {
    i_var = -23;
    d_var = 0.1;
    return 0;
}
In the code there are two different types of values involved one to be stored as an integer and one to be stored as a double value.
The compiler that analyzes the code knows the exact types of both of them. Here is a dump of a short fragment of the type information gcc held while generating code, obtained by passing -fdump-tree-all to gcc:
#1 type_decl name: #2 type: #3 srcp: <built-in>:0
chan: #4
#2 identifier_node strg: int lngt: 3
#3 integer_type name: #1 size: #5 algn: 32
prec: 32 sign: signed min : #6
max : #7
...
#5 integer_cst type: #11 low : 32
#6 integer_cst type: #3 high: -1 low : -2147483648
#7 integer_cst type: #3 low : 2147483647
...
#3805 var_decl name: #3810 type: #3 srcp: main.c:3
chan: #3811 size: #5 algn: 32
used: 1
...
#3810 identifier_node strg: i_var lngt: 5
Hunting down the #links you should clearly see that there really is a lot of information stored about the memory size, alignment constraints and allowed min and max values for the type "int" in nodes #1-3 and #5-7. (I left out the #4 node, as the mentioned "chan" entry is just used to chain up the type definitions in the generated tree.)
Regarding the variable declared at main.c line 3, it is known that it holds a value of type int, as seen by the type reference to node #3.
You will surely be able to hunt down the double entries and the ones for d_var in an experiment of your own, if you don't trust that they are there as well.
Taking a look at the generated assembler code (pass the -S switch to gcc), we can see the way the compiler used this information in code generation:
.file "main.c"
.comm i_var,4,4
.comm d_var,8,8
.text
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
movl $-23, i_var
fldl .LC0
fstpl d_var
movl $0, %eax
popl %ebp
ret
.size main, .-main
.section .rodata
.align 8
.LC0:
.long -1717986918
.long 1069128089
.ident "GCC: (Debian 4.4.5-8) 4.4.5"
.section .note.GNU-stack,"",#progbits
Taking a look at the assignment instructions, you will see that the compiler figured out the right instructions: movl to assign our int value and fldl/fstpl to assign our double value.
Nevertheless, besides the instructions chosen, at the machine level there is no indication of the type of those values. Taking a look at the value stored at .LC0, the double value 0.1 was even broken down into two consecutive storage locations, each holding a long, to match the "types" the assembler knows about.
As a matter of fact, breaking the value up this way was just one choice among other possibilities; using 8 consecutive values of "type" .byte would have done equally well.
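You can check that those two .long values really are the raw bits of 0.1; a minimal sketch, assuming a little-endian machine with a 32-bit int as in the listing above:

#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 0.1;
    int words[2];                            /* the two 32-bit halves, like the .long pair */
    memcpy(words, &d, sizeof d);             /* reinterpret the raw storage of the double */
    printf("%d\n%d\n", words[0], words[1]);  /* -1717986918 and 1069128089 on x86 */
    return 0;
}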

The machine code in LC-3 are also known as assembly?

I'm a little confused over the concept of machine code...
Is machine code synonymous to assembly language?
What would be an example of the machine code in LC-3?
Assembly instructions (LD, ST, ADD, etc. in the case of the LC-3 simulator) correspond to binary instructions that are loaded and executed as a program. In the case of the LC-3, these "opcodes" are assembled into 16-bit strings of 1s and 0s that the LC-3 architecture is designed to execute accordingly.
For example, the assembly "ADD R4 R4 #10" corresponds to the LC-3 "machine code":
0001100100101010
Which can be broken down as:
0001 - ADD
100 - 4 in binary (destination register R4)
100 - 4 in binary (source register R4)
1 - indicates that we are adding an immediate value rather than a register
01010 - 10 in binary
Note that each opcode has a distinct binary equivalent, so there are 2^4=16 possible opcodes.
The LC-3 sneakily deals with this limitation by introducing a mode flag bit in certain instructions. For ADD, bit 5 changes depending on what we're adding: it is 1 when adding an immediate value and 0 when adding two registers (i.e. "ADD R4 R4 R7" as opposed to a register and a value), in which case bits 4:3 are 00 and bits 2:0 name the second source register.
This machine code instructs the LC-3 to add decimal 10 to the value in register 4, and store the result in register 4.
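As a quick cross-check of that breakdown, the sketch below rebuilds the 16-bit word from its fields:

#include <stdio.h>

/* Assemble LC-3 "ADD R4, R4, #10" from its fields:
   opcode[15:12] DR[11:9] SR1[8:6] 1 imm5[4:0] */
int main(void)
{
    unsigned opcode = 0x1;                 /* 0001 = ADD           */
    unsigned dr     = 4;                   /* destination register */
    unsigned sr1    = 4;                   /* first source         */
    unsigned imm5   = 10 & 0x1F;           /* 5-bit immediate      */
    unsigned word   = (opcode << 12) | (dr << 9) | (sr1 << 6) | (1u << 5) | imm5;
    printf("0x%04X\n", word);              /* prints 0x192A = 0001100100101010 */
    return 0;
}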
An "Assembly language" is a symbolic (human-readable) language, which a program known as the "assembler" translates into the binary form, "machine code", which the CPU can load and execute. In LC-3, each instruction in machine code is a 16-bit word, while the corresponding instruction in the assembly language is a line of human-readable text (other architectures may have longer or shorter machine-code instructions, but the general concept it the same).
The above stands true for any high level language (such as C, Pascal, or Basic). The differece between HLL and assembly is that each assembly language statement corresponds to one machine operation (macros excepted). Meanwhile, in a HLL, a single statement can compile into a whole lot of machine code.
You can say that assembly is a thin layer of mnemonics on top of machine code.

Resources