I am attempting to (one step at a time) build my own copy of Forth to run on Mac OS X.
I currently have a version of Forth running on Apache and localhost in PHP, Ruby, and Python.
I want to make a version of Forth in C that will create a native executable version of Forth that can make its own native executable files of any compiled Forth code. Sorry about the semi-recursive sentence. My goal is to start in C and end up with my own Forth compiler (no longer running in any C code).
My starting point is to attempt to get a minimal test program to run as a binary executable for Terminal. Once I can understand what the existing C compiler is doing, I can modify its methods for my own purposes.
I created a small "hello world" program in C and ended up with an executable file of 8,497 bytes, which consisted mostly of 0x00 arrays (presumably buffers). My guess is that the entire stdio library was included.
Next, I created the smallest possible C program I could think of (other than a completely null program -- I wanted to be able to find my code in the resulting hex):
int main(void)
{
    char testitem;
    testitem = 'A';
    return -1;
}
That should have given me the barest possible overhead, with the stored ASCII 'A' and the all-ones return value being easy to find.
Instead, I ended up with a file of 4,313 bytes. There were four locations with the 0x41 (ASCII 'A'), but none were part of a MOV immediate byte instruction. Presumably the 0x41 was stored as constant data and loaded with a different MOV instruction.
Again there were a lot of 0x00 arrays (3,731 bytes, or all but 402 bytes). Presumably there is some kind of header data in the object file (which does run correctly in Terminal and does signal -1) and who knows what else.
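If the goal is just to make that byte easy to spot, a variant along these lines might help; this is only a guess at what any particular compiler will emit, and volatile is there to discourage the compiler from folding the assignment away:

int main(void)
{
    volatile char testitem = 'A';   /* volatile: keep the store from being optimized out */
    return testitem;                /* returning it forces the value to be materialized */
}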
At the moment I am not concerned with having an application bundle -- running in Terminal is my short term goal. Once I have this first step working, I can move on to a full application.
Any suggestions on how to determine what I need to have in the object file for it to correctly work as a Terminal tool?
This turns out to be a common challenge. You might want to check the following links, which provide rather in-depth information:
Let's Build A Mach-O Executable
Hello Mach-O
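To make those articles concrete, a small program along these lines (a sketch only: it assumes a thin 64-bit Mach-O, uses the macOS <mach-o/loader.h> header, and skips error handling) will list the load commands that the kernel and dyld actually act on when you run a file from Terminal:

#include <stdint.h>
#include <stdio.h>
#include <mach-o/loader.h>

int main(int argc, char *argv[])
{
    FILE *f = fopen(argc > 1 ? argv[1] : "a.out", "rb");
    if (!f) { perror("fopen"); return 1; }

    struct mach_header_64 hdr;                 /* assumes a thin 64-bit Mach-O */
    fread(&hdr, sizeof hdr, 1, f);
    printf("magic 0x%08x, filetype %u, %u load commands (%u bytes)\n",
           hdr.magic, hdr.filetype, hdr.ncmds, hdr.sizeofcmds);

    for (uint32_t i = 0; i < hdr.ncmds; i++) {
        long pos = ftell(f);
        struct load_command lc;
        fread(&lc, sizeof lc, 1, f);
        printf("  cmd 0x%08x  size %u\n", lc.cmd, lc.cmdsize);
        fseek(f, pos + lc.cmdsize, SEEK_SET);  /* skip to the next load command */
    }
    fclose(f);
    return 0;
}

Comparing its output with otool -l for the same binary is a quick way to see which segments and load commands account for the file's size.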
In my project, we are building an application to run on Linux on both x86 and ARM. I accidentally ran the x86 binary on ARM, and to my surprise the binary launched - sort of. It wrote one of the string literals to stdout and immediately ended with a segfault.
No meaningful message along the lines of "This binary cannot be run on this platform" was shown, which is something I had assumed would happen.
Is it technically possible to set up my compiler/linker/anything in such a way that the output binary will not run at all if launched on the wrong architecture, or so that some meaningful message will be displayed?
What you want is FatELF.
Since that isn't really supported, you could write a shell script, embed your executables' contents in it (base64-encoded), have it write the executable matching the detected architecture to /tmp and run that, and display an error message if the architecture is not supported.
That way, you'd have one executable for all Unix/Linux/Mac platforms for all processor architectures, with no dependency on the user making a (wrong) decision.
I am trying to generate an incrementing value at load time to be used to "serialize" a PCB with a unique code value. Not an expert in ld or preprocessor commands, so looking for some help.
The value will be used in a unique ID for each board that the code is loaded on and will also be used as a counter for boards in the field.
I have no preconceived idea of how I might accomplish this, so any workable answer to get me started, including a pre-preprocessor macro, is fine. In my olden days, I recollect adding a couple of lines to the linker file that would accomplish this, but I have been unable to resurrect that information anywhere (including from my brain's memory cells).
The simpler the answer, the better.
My solution to the problem was remarkably simple.
The binary contained
const char *serial = "XY-00000";
I then wrote a short program that boiled down to:
char uniqueserial[8];
/* Generate serial - this was an SQL call to the manufacturing DB */
char *array;
size_t arraylen;
/* Read binary into array; arraylen is the file size */
/* memmem() (a GNU/BSD extension) needs the haystack length:
   find the template in the image and overwrite it in place */
memcpy(memmem(array, arraylen, "XY-00000", 8), uniqueserial, 8);
/* Write array to temp bin file for flashing */
This depends on the serial template string being unique in the binary; use the strings command to check. I disable CRC protection on object files as a matter of taste - I like my embedded binaries to be exact memory dumps.
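Expanded into a complete, if unhardened, sketch (the file names and the replacement serial here are placeholders; the real serial would come from the manufacturing database as described above), the patching step might look like this:

#define _GNU_SOURCE              /* for memmem() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char token[]  = "XY-00000";   /* must be unique in the image */
    const char serial[] = "XY-00042";   /* placeholder; really comes from the DB */

    /* Read the whole firmware image into memory */
    FILE *f = fopen("firmware.bin", "rb");
    if (!f) { perror("firmware.bin"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *image = malloc(size);
    fread(image, 1, size, f);
    fclose(f);

    /* Locate the template string and overwrite it in place */
    char *hit = memmem(image, size, token, sizeof token - 1);
    if (!hit) { fprintf(stderr, "serial template not found\n"); return 1; }
    memcpy(hit, serial, sizeof serial - 1);

    /* Write the patched image out for flashing */
    f = fopen("firmware_patched.bin", "wb");
    fwrite(image, 1, size, f);
    fclose(f);
    free(image);
    return 0;
}

Because the replacement is exactly the same length as the template, nothing else in the image moves, so none of the offsets produced by the linker are disturbed.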
The linker is not the right place for two reasons:
The executable could be loaded, with the same ID, into several devices, which defeats the purpose.
You would have to relink the executable for each device you program, which wastes build time and CPU resources.
The best approach is to patch the executable with the serial number at load (programming) time.
Select a data pattern as a token for the device ID (a pattern unlikely to occur elsewhere in your program binary) and initialize your serial number variable to that pattern (preferably by statically initializing an array variable or something similar).
Make a program, to be executed on each download to a device, that searches for the pattern in the executable file before the binary is loaded into the device and writes the correct value to be programmed (beware that you are patching a binary, so you cannot use variable-length strings or the like, which would trash all the work done by the linker).
Once the binary executable has been patched, you can download it to the device.
Another solution is to reserve a fixed area in your linker script for all of this kind of information and put all your device-information variables there. Then get the exact positions in ROM of the individual variables and include the proper data in the loaded image. In this case the linker is your friend: it reserves a fixed segment of your device's ROM for storing the device's individual data (you can put MAC addresses, serial numbers, default configuration, etc. there).
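For that linker-script variant, one common pattern (sketched here for GCC/GNU ld; the section name, layout, and values are only illustrative, and the linker script still needs a matching entry that pins .device_info to a fixed ROM address) is to gather the per-device data into one structure placed in a named section:

#include <stdint.h>

/* All per-device data lives in one structure */
struct device_info {
    char     serial[16];      /* patched with the real serial at programming time */
    uint8_t  mac[6];          /* example of other per-device data */
    uint32_t default_config;
};

/* Placed in its own section so the linker script can pin it to a known address */
__attribute__((section(".device_info")))
const struct device_info device_info = {
    .serial         = "XY-00000",   /* template value, replaced per board */
    .mac            = {0},
    .default_config = 0,
};

The programming tool then only needs the section's fixed address (or can read it from the map file) to write the real per-device values into the image before flashing.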
I am doing an ongoing project to write a simplified OS for hobby/learning purposes. I can generate hex files, and now I want to write a script on the chip to accept them via serial, load them into RAM, then execute them. For simplicity I'm writing in assembly so that all of the startup code is up to me. Where do I start here? I know that the hex file format is well documented, but is it as simple as reading the headers for each line, aligning the addresses, then putting the data into RAM and jumping to the address? It sounds like I need a lot more than that, but this is a problem that most people don't even try to solve. Any help would be great.
Way too vague - there are many different file formats, and at least two really popular ones that use text with the data in hex, so "hex files" doesn't really narrow it down for us.
Writing a "script" on the chip implies you have an operating system running on your microcontroller - what operating system is it, what does the command line look like, etc.?
Assembly is not required to completely control everything (basically bare metal); you can use asm to bootstrap C and then write the rest in C, not a problem.
Do you want to download to RAM and run, or download and then burn to flash so the device can reset into it in some way?
Basically, what you are making is a bootloader. And yes, we write bootloaders all the time, one for each platform, sometimes borrowing code from a prior platform and sometimes not.
First off, on your development computer (Windows, Mac, Linux, whatever), write a program - in C or Pascal ideally, something you can easily port to the microcontroller - that reads the whole file into an array. Then write some code that accepts one byte at a time, the way you would if you were receiving it serially, and parse that file format, whichever one you choose (you can change formats later if you decide you no longer like it). Take real programs you have built - the disassembler or other tools should have output options that show what bytes or words should land at what addresses - parse the data, printf the address/byte or address/word items you find, and compare that against what the toolchain showed. Then carve the parsing tool out, replace the printf with a write to memory at that address, and yes, jump to the entry point if you can figure it out and/or you, as the designer, decide that all programs must have a specific entry point.
Intel hex and Motorola S-record are good choices (-O ihex or -O srec; my current favorite is --srec-forceS3 -O srec); both are documented on Wikipedia, and at least the GNU tools will produce both. You can use a dumb terminal program (minicom) to dump the file into your microcontroller and, hopefully, parse and write to RAM as it comes in. If you can't handle that flow, you might consider a raw binary (-O binary) and implement an XMODEM receiver in your bootloader.
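As a concrete starting point for the parse-as-it-comes-in idea above, here is a minimal sketch of an Intel HEX data-record parser; checksum verification and the extended-address record types are deliberately left out, and mem is assumed to be a 64 KB buffer indexed by the record's 16-bit address:

#include <stdint.h>

static uint8_t hexval(char c)            /* '0'-'9', 'A'-'F', 'a'-'f' -> 0-15 */
{
    return (c <= '9') ? c - '0' : (c & 0x0F) + 9;
}

/* line points at a record like ":10010000214601360121470136007EFE09D2190140"
   Returns 1 on the end-of-file record, 0 otherwise, -1 on a malformed line. */
static int parse_record(const char *line, uint8_t *mem)
{
    if (line[0] != ':')
        return -1;

    uint8_t  count = hexval(line[1]) << 4 | hexval(line[2]);
    uint16_t addr  = (hexval(line[3]) << 12) | (hexval(line[4]) << 8) |
                     (hexval(line[5]) << 4)  |  hexval(line[6]);
    uint8_t  type  = hexval(line[7]) << 4 | hexval(line[8]);

    if (type == 0x01)                    /* end-of-file record */
        return 1;
    if (type != 0x00)                    /* ignore other record types here */
        return 0;

    for (uint8_t i = 0; i < count; i++) {
        uint8_t byte = hexval(line[9 + 2*i]) << 4 | hexval(line[10 + 2*i]);
        mem[addr + i] = byte;            /* "write to memory at that address" */
    }
    return 0;
}

A real loader would also track the type-04 extended linear address records to build the full 32-bit address and verify the trailing checksum byte of each record.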
Why does inserting characters into an executable binary file cause it to "break"?
And, is there any way to add characters without breaking the compiled program?
Background
I've known for a long time that it is possible to use a hex editor to change code in a compiled executable file and still have it run as normal...
Example
As an example, in one application the string Facebook could be changed to Lacebook, and the program would still execute just fine.
But it breaks with new characters
I'm also aware that if new characters are added, it will break the program and it won't run, or it will crash immediately. For example, adding My in front of Facebook would break it.
What I know
I've done some work with C and understand that code is written in human-readable form, compiled, and linked into an executable file.
I've done introductory studies of assembly language and understand the concepts of data, instructions, and pointers being moved around.
I've written small programs for Windows, Mac, and Linux.
What I don't know
I don't quite understand the relationship between the operating system and the executable file. I'd guess that when you type in the name of the program and press return you are basically instructing the operating system to "execute" that file, which basically means loading the file into memory, setting the processor's pointer to it, and telling it 'Go!'
I don't understand why having extra characters in a text string of the binary file would cause problems
What I'd like to know
Why do the extra characters cause the program to break?
What thing determines that the program is broken? The OS? Does the OS also keep this program sandboxed so that it doesn't crash the whole system nowadays?
Is there any way to add in extra characters to a text string of a compiled program via a hex editor and not have the application break?
I don't quite understand the relationship between the operating system and the executable file. I'd guess that when you type in the name of the program and press return you are basically instructing the operating system to "execute" that file, which basically means loading the file into memory, setting the processor's pointer to it, and telling it 'Go!'
Modern operating systems just map the file into memory. They don't bother loading its pages until they're needed.
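Roughly speaking, "map the file into memory" means something like the following sketch, which uses mmap() directly and omits error handling; the loader does the equivalent for each segment of the executable, and pages are only faulted in from disk when first touched:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int fd = open(argc > 1 ? argv[1] : "./a.out", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Nothing is read from disk yet; pages fault in on first access */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    printf("first bytes: %02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}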
Why do the extra characters cause the program to break?
Because they shift all the other information in the file to the wrong place, so the loader winds up loading the wrong things. Also, jumps in the code wind up going to the wrong place, perhaps into the middle of an instruction.
What thing determines that the program is broken? The OS? Does the OS also keep this program sandboxed so that it doesn't crash the whole system nowadays?
It depends on exactly what gets screwed up. It may be that you move a header and the loader notices that some parameters in the header have invalid data.
Is there any way to add in extra characters to a text string of a compiled program via a hex editor and not have the application break?
Probably not reliably. At a minimum, you'd need to reliably identify sections of code that need to be adjusted. That can be surprisingly difficult, particularly if someone has attempted to make it so deliberately.
When a program is compiled into machine code, it includes many references to the addresses of instructions and data in the program memory. The compiler determines the layout of all the memory of the program, and puts these addresses into the program. The executable file is also organized into sections, and there's a table of contents at the beginning that contains the number of bytes in each section.
If you insert something into the program, the address of everything after it is shifted up. But the parts of the program that contain references to code and data locations are not updated; they continue to point to the original addresses. Also, the table that contains the sizes of all the sections is no longer correct, because you increased the size of whatever section you modified.
The format of a machine-language executable file is based on hard offsets, rather than on parsing a byte stream (like textual program source code). When you insert a byte somewhere, the file format continues to reference information which follows the insertion point at the original offsets.
Offsets may occur in the file format itself, such as the header which tells the loader where things are located in the file and how big they are.
Hard offsets also occur in machine language itself, such as in instructions which refer to the program's data or in branch instructions.
Suppose an instruction says "branch 200 bytes down from where we are now", and you insert a byte into those 200 bytes (because a character string you want to alter happens to be there). Oops; the branch still covers 200 bytes, so it now lands one byte short of its intended target.
On some machines, the branch couldn't even be 201 bytes even if you fixed it up because it would be misaligned and cause a CPU exception; you would have to add, say, four bytes to patch it to 204 (along with a myriad other things needed to make the file sane).
I've compiled a C file that does absolutely nothing (just a main that returns... not even a "Hello, world" gets printed), and I've compiled it with various compilers (MinGW GCC, Visual C++, Windows DDK, etc.). All of them link with the C runtime, which is standard.
But what I don't get is: When I open up the file in a hex editor (or a disassembler), why do I see that almost half of the 16 KB is just huge sections of either 0x00 bytes or 0xCC bytes? It seems rather ridiculous to me... is there any way to prevent these from occurring? And why are they there in the first place?
Thank you!
Executables in general contain a code segment and at least one data segment. I guess each of these has a standard minimum size, which may be 8K, and unused space is filled with zeros. Note also that an EXE written in a higher-level language (higher than assembly) contains some extra stuff on top of the direct translation of your own code and data:
startup and termination code (in C and its successors, this handles the input arguments, calls main(), then cleans up after exiting from main())
stub code and data (e.g. Windows executables contain a small DOS program stub whose only purpose is to display a message along the lines of "This program cannot be run in DOS mode").
Still, since executables are usually supposed to do something (i.e. their code and data segment(s) do contain useful stuff), and storage is cheap, by default no one optimizes for your case :-)
However, I believe most of the compilers have command line parameters with which you can force them to optimize for space - you may want to check the results with that setting.
Here are more details on the EXE file formats.
As it turns out, I should've been able to guess this beforehand... the answer was the debug symbols and code; those were taking up most of the space. Not compiling with /DEBUG and /PDB (which I always do by default) reduced the 13 K down to 3 K.