Porting MASM5 code in Quake 2 to GAS -- Unexpected rendering results - c

I am porting Quake 2's inline assembly code for MSVC to MASM then finally to GAS (for use with MinGW). The specific code in question is for the Skin drawing (R_PolysetCalcGradients for those who want to look it up). The code almost "works" what happens is the skin seems to stretched over the model incorrectly.
A few interesting things I noticed is when I run objdump -dwrC r_polysa.obj > r_polysa.masm and the same for the GAS version the code is mostly similar except for the fact fsubp and fsubrp have been swapped in MASM. Please note not the operands (I already know about this issue in GAS).
In the picture the left side is the GAS version, the right side is the MASM version. The original MASM code (and therefore what I have in GAS) is what should be on the left side. I am unsure why MASM is apparently swapping this or if objdump is reporting it incorrectly. However, if I swap the two this does not fix the issue. It just gets inverted in another wrong way.
I mention this FSUBP/FSUBRP swap because this was a problem in porting the Particle blending inline ASM code. This had one call to use an FSUBRP in the MASM version, objdump reported it now being FSUBP and I had to change it to FSUBP in the GAS version for it to work! I don't understand why this is happening?
In any case, I am new to assembly, but understand some basics and have been doing some reading. Obviously the math here is not quite right, but it seems as though it should be. I don't know how or what to do next. How do I fix and debug this problem?
The code repository to what I am working on is at: https://bitbucket.org/neozeed/q2dos/commits/branch/win32_asm (specifically the Win32_ASM branch). The files I am working with are gas\r_polysa.s and ref_soft\r_polysa.asm.

Opcodes were reversed so there is a quirk with GAS and fsubrp following an fsubp. There was one other additional gotcha, unrelated to that issue. For those interested see: https://bitbucket.org/neozeed/q2dos/commits/f5bf93e3a78e112ae1f766606471a6c5e67283d4

Related

Float-point values doesn't work in uC-OS-III

Float-point variables defined with float doesn't seem to work in µC-OS-III.
A simple code like this:
float f1;
f1 = 3.14f;
printf("\nFLOAT:%f", f1);
Would produce an output like this:
FLOAT:2681561605....
When I test this piece of code in the main() before the µC-OS-III initialization, it works just fine. However, after the multitasking begins, it doesn't work. It doesn't work in the tasks or in the startup task.
I've searched the Internet for the similar problem but I couldn't find anything. However, there is this article that says "The IAR C/C++ Compiler for ARM requires the Stack Pointer to be aligned at 8 bytes..."
https://www.iar.com/support/tech-notes/general/problems-with-printf-floating-point-f-on-arm/
I located the stacks at an 8-byte aligned locations. Then the code worked in the task but the OS crashed right after the printf.
My compiler tool chain is IAR EWARM Version 8.32.1 and I am using µC-OS-III V3.07.03 with STM32F103.
I might miss some OS or compiler configuration. I don't know! I had the same problem few years ago with µC-OS-II, but finally I decided to use Fixed-point mathematics instead of floats.
Could someone shed a light on this...
Locating the RTOS stacks at an 8-byte alignment will solve the problem, according to the IAR article.
I located the stacks at fixed locations:
static CPU_STK task_stk_startup[TASK_CFG_STACK_SIZE_STARTUP] # (0x20000280u);

x64 inline assembly in c to align instructions

I've got a very hot instruction loop which needs to be properly aligned on 32-bytes boundaries to maximize Intel's Instruction Fetcher effectiveness.
This issue is specific to Intel not-too-old line of CPU (from Sandy Bridge onward). Failure to align properly the beginning of the loop results in up to 20 % speed loss, which is definitely too noticeable.
This issue is pretty rare, one needs a highly optimized set of instructions for the instruction fetcher to become the bottleneck. But fortunately, it's not a unique case. Here is a nice article explaining in details how such a problem can be detected.
The problem is, gcc nor clang would care aligning properly this instruction loop. It makes compiling this code a nightmare producing random outcome, depending on how "good" the hot loop is aligned by chance. It also means that modifying a totally unrelated function can nonetheless highly impact performance of the hot loop.
Already tried several compiler flags, none of them gives satisfying result.
[Edit] More detailed description of tried compilation flags :
-falign-functions=32 : no impact or negative impact
-falign-jumps=32 : no impact
-falign-loops=32 : works fine when the hot loop is isolated into a tiny piece of test code. But in normal build, the compilation flag is applied across the entire source, and in this case it is detrimental : aligning all loops on 32-bytes is bad for performance. Only the very hot ones benefit from it.
Also attempted to use __attribute__((optimize("align-loops=32"))) in the function declaration. Doesn't produce any effect (identical binary generated, as if the the statement wasn't there). Later confirmed by gcc support team to be effectively ignored. Edit : #Jester indicates in comment that the statement works with gcc 5+. Unfortunately, my dev station uses primarily gcc 4.8.4, and this is more a problem of portability, since I don't control the final compiler used in the build process.
Only building using PGO can reliably produce expected performance, but PGO cannot be accepted as a solution since this piece of code will be integrated into other programs using their own build chain.
So, I'm considering inline assembly.
This would be specific to x64 instruction set, so no portability required.
If my understanding is correct, assembly like NASM allows the use of statements such as : ALIGN 32 which would force the next instruction to be aligned on 32 bytes boundaries.
Since the target source code is in C, it would be necessary to include this statement. For example, something like asm("ALIGN 32");
(which of course doesn't work).
I hope it's mostly a matter of knowing the right instruction to write, and not something deeper such as "it's impossible".
Similarly to NASM, the GNU assembler supports the .align pseudo OP for alignment:
volatile asm (".align 32");
For a non-assembly solution, you could try to supply -falign-loops=32 and possibly -falign-functions=32, -falign-jumps=32 as needed.

carry-less multiplication optimization for ECC over GF(2^m) in MIRACL

Link to MIRACL crypto library by CertiVox
Following the instructions in fastgf2m.txt, I've been able to get everything to compile. However, after execution, the benchmark (bmark.exe) program halts when evaluating curves over GF(2^m) with error, "This is not a point on the curve!"
I am able to get everything to work without the optimization but I'm unsure where the problem exists. I haven't modified any curve parameters and followed instructions in the distribution. I'm compiling on 64-bit Windows 8.1, on an Intel i7-3520M.
If anyone has any advice on how to correct this, it would be greatly appreciated.
Thanks!!
The method outlined in fastgf2m.txt is for generating unrolled code associated with a fixed m value determined at compile time. The bmark program changes m at runtime, and so the unrolled code will often not be correct in this case. The documentation could be clearer on this point.
Also make sure your processor does support the PCLMULQDQ instruction - many older processors will not.
It might be better to test the method on the ecsgen2/ecssign2/ecsver2 programs to implement ECDSA over GF(2^283) for example.

Reverse engineering a firmware - what's up with every fourth byte?

So I decided to grab my tools and analyze a router firmware. It went pretty okay up to the point where I had to find segments manually. I wouldn't bother you with it and i really don't want to ask about hacking anything or to do a favor for me. There is a pattern I'm sure someone could explain to me.
Looking at the hexdump, all i see is this:
There are strings that break the pattern but it goes all the way down almost to the end of the file.
what on earth can cause this pattern?
(if anyone's willing to help but needs more info: VxWorks 5.5.1 / probably ARM-9E CPU)
it is an arm, go look at the arm documentation you will see that for the 32 bit (non-thumb) arm instructions the first four bits are the condition code. The code 0b1110 is "ALWAYS" most of the time you dont do conditional execution so most arm instructions start with 0xE. makes it very easy to pick out an arm binary. the 16 bit thumb instructions also have a similar pattern but for different reasons, then if you add in thumb2 it changes that some...
Thats just due to how ARMs op codes are mapped and is actually helps me "eyeball" a dump to see if its ARM code.
I would suggest you go through part of the ARM Architecture Manual to see how op codes are generated. particularly conditionals. the E is created when you always want something to happen

I want to create a simple assembler in C. Where should I begin? [duplicate]

This question already has answers here:
Building an assembler
(4 answers)
How Do You Make An Assembler? [closed]
(4 answers)
Closed 9 years ago.
I've recently been trying to immerse myself in the world of assembly programming with the eventual goal of creating my own programming language. I want my first real project to be a simple assembler written in C that will be able to assemble a very small portion of the x86 machine language and create a Windows executable. No macros, no linkers. Just assembly.
On paper, it seems simple enough. Assembly code comes in, machine code comes out.
But as soon as I thinking about all the details, it suddenly becomes very daunting. What conventions does the operating system demand? How do I align data and calculate jumps? What does the inside of an executable even look like?
I'm feeling lost. There aren't any tutorials on this that I could find and looking at the source code of popular assemblers was not inspiring (I'm willing to try again, though).
Where do I go from here? How would you have done it? Are there any good tutorials or literature on this topic?
I have written a few myself (assemblers and disassemblers) and I would not start with x86. If you know x86 or any other instruction set you can pick up and learn the syntax for another instruction set in short order (an evening/afternoon), at least the lions share of it. The act of writing an assembler (or disassembler) will definitely teach you an instruction set, fast, and you will know that instruction set better than many seasoned assembly programmers for that instruction set who have not examined the microcode at that level. msp430, pdp11, and thumb (not thumb2 extensions) (or mips or openrisc) are all good places to start, not a lot of instructions, not overly complicated, etc.
I recommend a disassembler first, and with that a fixed length instruction set like arm or thumb or mips or openrisc, etc. If not then at least use a disassembler (definitely choose an instruction set for which you already have an assembler, linker, and disassembler) and with pencil and paper understand the relationship between the machine code and the assembly, in particular the branches, they usually have one or more quirks like the program counter is an instruction or two ahead when the offset is added, to gain another bit they sometimes measure in whole instructions not bytes.
It is pretty easy to brute force parse the text with a C program to read the instructions. A harder task but perhaps as educational, would be to use bison/flex and learn that programming language to allow those tools to create (an even more extreme brute force) parser which then interfaces to your code to tell you what was found where.
The assembler itself is pretty straight forward, just read the ascii and set the bits in the machine code. Branches and other pc relative instructions are a little more painful as they can take multiple passes through the source/tables to completely resolve.
mov r0,r1
mov r2 ,#1
the assembler begins parsing the text for a line (being defined as the bytes that follow a carriage return 0xD or line feed 0xA), discard the white space (spaces and tabs) until you get to something non white space, then strncmp that with the known mnemonics. if you hit one then parse the possible combinations of that instruction, in the simple case above after the mov skip over the white space to non-white space, perhaps the first thing you find must be a register, then optional white space, then a comma. remove the whitespace and comma and compare that against a table of strings or just parse through it. Once that register is done then go past where the comma is found and lets say it is either another register or an immediate. If immediate lets say it has to have a # sign, if register lets say it has to start with a lower or upper case 'r'. after parsing that register or immediate, then make sure there is nothing else on the line that shouldnt be on the line. build the machine code for this instruciton or at least as much as you can, and move on to the next line. It may be tedious but it is not difficult to parse ascii...
at a minimum you will want a table/array that accumulates the machine code/data as it is created, plus some method for marking instructions as being incomplete, the pc-relative instructions to be completed on a future pass. you will also want a table/array that collects the labels you find and the address/offset in the machine code table where found. As well as the labels used in the instruction as a destination/source and the offset in the table/array holding the partially complete instruction they go with. after the first pass, then go back through these tables until you have matched up all the label definitions with the labels used as a source or destination, using the label definition address/offset to compute the distance to the instruction in question and then finish creating the machine code for that instruction. (some disassembly may be required and/or use some other method for remembering what kind of encoding it was when you come back to it later to finish building the machine code).
The next step is allowing for multiple source files, if that is something you want to allow. Now you have to have labels that dont get resolved by the assembler so you have to leave placeholders in the output and make some flavor of the longest jump/branch instruction because you dont know how far away the destination will be, expect the worse. Then there is the output file format you choose to create/use, then there is the linker which is mostly simple, but you have to remember to fill in the machine code for the final pc relative instructions, no harder than it was in the assembler itself.
Note, writing an assembler is not necessarily related to creating a programming language and then writing a compiler for it, separate thing, different problems. Actually if you want to make a new programming language just use an existing assembler for an existing instruction set. Not required of course, but most teachings and tutorials are going to use the bison/flex approach for programming languages, and there are many college course lecture notes/resources out there for beginning compiler classes that you can just use to get you started then modify the script to add the features of your language. The middle and back ends are the bigger challenge than the front end. there are many books on this topic and many online resources as well. As mentioned in another answer llvm is not a bad place to create a new programming language the middle and backends are done for you, you only need to focus on the programming language itself, the front end.
You should look at LLVM, llvm is a modular compiler back end, the most popular front end is Clang for compiling C/C++/Objective-C. The good thing about LLVM is that you can pick the part of the compiler chain that you are interested in and just focus on that, ignoring all of the others. You want to create your own language, write a parser that generates the LLVM internal representation code, and for free you get all of the middle layer target independent optimisations and compiling to many different targets. Interesting in a compiler for some exotic CPU, write a compiler backend that takes the LLVM intermediated code and generates your assemble. Have some ideas about optimisation technics, automatic threading perhaps, write a middle layer which processes LLVM intermediate code. LLVM is a collection of libraries not a standalone binary like GCC, and so it is very easy to use in you own projects.
What you're looking for is not a tutorial or source code, it's a specification. See http://msdn.microsoft.com/en-us/library/windows/hardware/gg463119.aspx
Once you understand the specification of an executable, write a program to generate one. The executable you build should be as simple as possible. Once you have mastered that, then you can write a simple line-oriented parser that reads instruction names and numeric arguments to generate a block of code to plug into the exe. Later you can add symbols, branches, sections, whatever you want, and that's where something like http://www.davidsalomon.name/assem.advertis/asl.pdf will come in.
P.S. Carl Norum has a good point in the comment above. If your goal is create your own programming language, learning to write an assembler is irrelevant and is very much not the right way to start (unless the language you want to create is an assembly language). There are already assemblers that produce executables from assembler source, so your compiler could produce assembler source and you could avoid the work of recreating the assembler ... and you should. Or you could use something like LLVM, which will solve many other daunting problems of compiler construction. The odds are very small that you will ever actually produce your own programming language, but they're much smaller if you start from scratch and there's no need to. Decide what your goal is and use the best tools available to achieve it.

Resources