How Debuggers Find Expressions From Code Lines - c

A debugger gets a line number of an expression and translates it into an program address, what does the implementation look like? I want to implement this in a program I'm writing and the most promising library I've found to accomplish this is libbfd. All I would need is the address of the expression, and I can wait for it with ptrace(2). I can imagine that the debugger looks for the function name from the C file within the executable, but after that I'm lost.
Does anyone know? I don't need a code example, just enough info so that I can get an idea.
And I don't mind architecture-specific answers, the only ones I really care about are Arm and x86-64.

You should take a look at the DWARF2 format to try to understand how the mapping is done. Do consider how DWARF2 is vast and complex. It's not for everyone, but reading about it might satisfy your curiosity faster and more easily than reading the source for GCC/GDB.

Related

How can I convert an .abs or .s19 to a C file?

I am trying to run some MC9S12DP256 example files, but I want to see the code to understand it. Are there any ways to convert a .s19 or .abs file to a C code?
An ".s19" or an ".abs" file contains mainly the machine code of the application. The source code of it is not included, independent of the language used to write it. Even if it were written in assembly language, all symbolic informations and comments are excluded.
However, you can try to de-compile the machine code. This is not a trivial or quick task, you need to know the target really well. I did this with software for other processors, it is feasible for code up to some KB.
These are the steps I recommend:
Get a disassembler and an assembler for the target processor, optimally from the vendor.
Let it disassemble the machine code into assembly source code. You might need to convert the ".s19" file into a binary file, one possible tool for this is "srecord".
Assemble the resulting source code again into ".s19" or ".abs", and make sure that it generates the same contents as your original.
Insert labels for the reset and interrupt entry points. Start at the reset entry point with your analysis.
Read the source code, think about what it does.
You will quickly "dive" into subroutines that execute small functions, like reading ADC or sending data. Place a label and replace the numerical value at the call sites with the label.
Expect sections of (constant) data mixed with executable code.
Repeat often from point 3. If you have a difference, undo your last step and redo it in another way until you produce the same contents.
If you want C source, it is commonly much more difficult. You need a lot of experience how C is compiled into machine code. Be aware that variables or even functions are commonly placed in another sequence than they are declared. If you want to go that route, you usually also have to use the exact version of the compiler used to generate the original machine code.
Be aware that the original might be produced with any other language.

Counting the number of functions and data structures in a C codebase

Is there a way to take a C file (or a directory/project) and count the number of functions + data structures? This is similar to counting the LOC but instead is focused on counting the number of "conceptual units" the program handles as a way to measure its complexity.
It sounds like you are in need of perusing your source code. Doxygen is an excellent tool for summarizing just about every aspect of a C project. (and many other languages). It is OpenSource, and easily downloaded. Additionally, the list of features is extensive.
On a Linux environment, you'll want to look at a tool like objdump, that will show you a bunch of information about the compiled output.
There are pages that explain some of its complicated output, such as this.
But perhaps one of the simplest is objdump -T.

Accurately count number of keywords "if", "while" in a c file

Are there any libraries out there that I can pass my .c files through and will count the visible number of, of example, "if" statements?
We don't have to worry about "if" statement in other files called by the current file, just count of the current file.
I can do a simple grep or regex but wanted to check if there is something better (but still simple)
If you want to be sure it's done right, I'd probably make use of clang and walk the ast. A URL to get you started:
http://clang.llvm.org/docs/IntroductionToTheClangAST.html
First off, there is no way to use regular expressions or grep to give you the correct answer you are looking for. There are lots of ways that you would find those strings, but they could be buried in any amount of escape characters, quotations, comments, etc.
As some commenters have stated, you will need to use a parser/lexer that understands the C language. You want something simple, you said, so you won't be writing this yourself :)
This seems like it might be usable for you:
http://wiki.tcl.tk/3891
From the page:
lexes a string containing C source into a list of tokens
That will probably get you what you want, but even then it's not going to be trivial.
What everyone has said so far is correct; it seems a lot easier to just grep the shit out of your file. The performance hit of this is neglagible compared to the alternative which is to go get the gcc source code (or whichever compiler you're using), and then go through the parsing code and hook in what you want to do while it parses the syntax tree. This seems like a pain in the ass, especially when all you're worried about is the conditional statements. If you actually care about the branches, you could actually just take a look at the object code and count the number of if statements in the assembly, which would correctly tell you the number of branches (rather than just relying on how many times you typed a conditional, which will not translate exactly to the branching of the program).

Parsing: library functions, FSM, explode() or lex/yacc?

When I have to parse text (e.g. config files or other rather simple/descriptive languages), there are several solutions that come to my mind:
using library functions, e.g. strtok(), sscanf()
a finite state machine which processes one char at a time, tokenizing and parsing
using the explode() function I once wrote out of pure boredom
using lex/yacc (read: flex/bison) to generate an appropriate parser
I don't like the "library functions" approach. It feels clumsy and awkward. explode(), while it doesn't take much new code, feels even more blown up. And flex/bison often seems like sheer overkill.
I usually implement a FSM, but at the same time I already feel sorry for the poor guy that may have to maintain my code at a later point.
Hence my question:
What is the best way to parse relatively simple text files?
Does it matter at all?
Is there a commonly agreed-upon approach?
I'm going to break the rules a bit and answer your questions out of order.
Is there a commonly agreed-upon approach?
Absolutely not. IMHO the solution you choose should depend on (to name a few) your text, your timeframe, your experience, even your personality. If the text is simple enough to make flex and bison overkill, maybe C is itself overkill. Is it more important to be fast, or robust? Does it need to be maintained, or can it start quick and dirty? Are you a passionate C user, or can you be enticed away with the right language features? &c., &c.
Does it matter at all?
Again, this is something only you can answer. If you're working closely with a team of people, with particular skills and abilities, and the parser is important and needs to be maintained, it sure does matter! If you're writing something "out of pure boredom," I would suggest that it doesn't matter at all, no. :-)
What is the best way to parse relatively simple text files?
Well, I don't know that you're going to like my answer. Maybe first read some of the other fine answers here.
No, really, go ahead. I'll wait.
Ah, you're back and relaxed. Let's ease into things, shall we?
Never write it in 'C' if you can do it in 'awk';
Never do it in 'awk' if 'sed' can handle it;
Never use 'sed' when 'tr' can do the job;
Never invoke 'tr' when 'cat' is sufficient;
Avoid using 'cat' whenever possible.
-- Taylor's Laws of Programming
If you're writing it in C, but C feels like the wrong tool...it really might be the wrong tool. awk or perl will likely do what you're trying to do without all the aggravation. You may even be able to do it with cut or something similar.
On the other hand, if you're writing it in C, you probably have a good reason to write it in C. Maybe your parser is a tiny part of a much larger system, which, for the sake of argument, is embedded, in a refrigerator, on the moon. Or maybe you loooove C. You may even hate awk and perl, heaven forfend.
If you don't hate awk and perl, you may want to embed them into your C program. This is doable, in principle--I've never done it myself. For awk, try libmawk. For perl, there are proably a few ways (TMTOWTDI). You can run perl separately using popen to start it, or you can actually embed a Perl interpreter into your C program--see man perlembed.
Anyhow, as I've said, "the best way to parse" entirely depends on you and your team, the problem space, and your approach to the issue. What I can offer is my opinion.
I'm going to assume that in your C-only solutions (library functions and FSM (considering your explode to essentially be a library function)) you've already done your best at isolating the relevant code, designing the code and files well, and so forth.
Even so, I'm going to recommend lex and yacc.
Library functions feel "clumsy and awkward." A state machine seems unmaintainable. But you say that lex and yacc feel like overkill.
I think you should approach your complaints differently. What you're really doing is specifying a FSM. However, you're also hiring someone to write and maintain it for you, thereby solving most of the maintainability problem. Overkill? Did I mention they'll work for free?
I suspect, but do not know, that the reason lex and yacc originally felt like overkill was that your config / simple files just felt too, well, simple. If I'm right (a big if), you may be able to do most of your work in the lexer. (It's even conceivable that you can do all of your work in the lexer, but I know nothing about your input.) If your input is not only simple but widespread, you may be able to find a lexer/parser combination freely available for what you need.
In short: if you can do this not in C, try something else. If you want C, use lex and yacc--they have a little overhead, but they're a very good solution.
If you can get it to work, I'd go with an FSM, but with a huge assist from Perl-compatible regular expressions. This library is easy to understand, and you ought to be able to trim back sufficient extraneous spaghetti to give your monster that aerodynamic flair to which all flying monsters aspire. That, and plenty of comments in well-structured spaghetti, ought to make your code-maintaining successor comfortable. (And, as I'm sure you know, that code-maintaining successor is you after six months, when you've moved on to something else and the details of this code have slipped your mind.)
My short answer is to use the right too for the problem. If you have configuration files use existing standards and formats e.g. ini Files and parse them using Boost program_options.
If you enter the world of "own" languages use lex/yacc, since they provide you with the required features, but you have to consider the cost of maintaining the grammar and language implementation.
As a result I would recommend to further narrow you problem scope to find the right tool.

Building a Control-flow Graph using results from Objdump

I'm attempting to build a control-flow graph of the assembly results that are returned via a call to objdump -d . Currently the best method I've come up with is to put each line of the result into a linked list, and separate out the memory address, opcode, and operands for each line. I'm separating them out by relying on the regular nature of objdump results (the memory address is from character 2 to character 7 in the string that represents each line) .
Once this is done I start the actual CFG instruction. Each node in the CFG holds a starting and ending memory address, a pointer to the previous basic block, and pointers to any child basic blocks. I'm then going through the objdump results and comparing the opcode against an array of all control-flow opcodes in x86_64. If the opcode is a control-flow one, I record the address as the end of the basic block, and depending on the opcode either add two child pointers (conditional opcode) or one (call or return ) .
I'm in the process of implementing this in C, and it seems like it will work but feels very tenuous. Does anyone have any suggestions, or anything that I'm not taking into account?
Thanks for taking the time to read this!
edit:
The idea is to use it to compare stack traces of system calls generated by DynamoRIO against the expected CFG for a target binary, I'm hoping that building it like this will facilitate that. I haven't re-used what's available because A) I hadn't really though about it and B) I need to get the graph into a usable data structure so I can do path comparisons. I'm going to take a look at some of the utilities on the page you lined to, thanks for pointing me in the right direction. Thanks for your comments, I really appreciate it!
You should use an IL that was designed for program analysis. There are a few.
The DynInst project (dyninst.org) has a lifter that can translate from ELF binaries into CFGs for functions/programs (or it did the last time I looked). DynInst is written in C++.
BinNavi uses the ouput from IDA (the Interactive Disassembler) to build an IL out of control flow graphs that IDA identifies. I would also recommend a copy of IDA, it will let you spot check CFGs visually. Once you have a program in BinNavi you can get its IL representation of a function/CFG.
Function pointers are just the start of your troubles for statically identifying the control flow graph. Jump tables (the kinds generated for switch case statements in certain cases, by hand in others) throw a wrench in as well. Every code analysis framework I know of deals with those in a very heuristics-heavy approach. Then you have exceptions and exception handling, and also self-modifying code.
Good luck! You're getting a lot of information out of the DynamoRIO trace already, I suggest you utilize as much information as you can from that trace...
I found your question since I was interested in looking for the same thing.
I found nothing and wrote a simple python script for this and threw it on github:
https://github.com/zestrada/playground/blob/master/objdump_cfg/objdump_to_cfg.py
Note that I have some heuristics to deal with functions that never return, the gcc stack protector on 32bit x86, etc... You may or may not want such things.
I treat indirect calls similar to how you do (basically have a node in the graph that is a source when returning from an indirect).
Hopefully this is helpful for anyone looking to do similar analysis with similar restrictions.
I was also facing a similar issue in the past and wrote asm2cfg tool for this purpose: https://github.com/Kazhuu/asm2cfg. Tool has support for GDB disassembly and objdump inputs and spits out CFG as a dot or pdf.
Hopefully someone finds this helpful!

Resources