Generating Define-Use Paths for C code coverage analysis

How is it possible to generate uncovered Define-Use paths for C code (using e.g. GCC)?
As far as I can tell, this subject is only academic (unlike line coverage).
resource: http://whiteboxtest.com/Data-Flow-Testing.php

You need a tool that can determine, for every definition, all the possible uses (i.e., compute Def-Use pairs) in the code, and associate with each Def-Use pair the variable defined and the program locations (file, line, column) of the Def and Use points.
Then, for each def-use pair, you need to add instrumentation ("probes") to the program that records, usually near the use, that the pair was exercised, typically as a boolean variable specific to that def-use pair.
Because there are a lot of these, it is useful to organize the individual booleans as a boolean array. (An obvious optimization to minimize the number of inserted probes: a basic block, when executed, will satisfy many def-use pairs, so a boolean representing execution of the basic block [block coverage] can stand in for a set of def-use pairs. I'm sure there are other similar optimizations.)
After running the program, one has to dump these boolean variables, compute the actual def-use information (e.g., including using the block coverage data), and then display it.
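As a hedged illustration (not from the paper; the array name, pair indices, and dump helper below are hypothetical), instrumented C code for a couple of def-use pairs might look roughly like this:

/* Illustrative sketch only: one boolean probe per def-use pair.       */
/* The array name, indices, and dump function are hypothetical.        */
#include <stdio.h>

#define NUM_DU_PAIRS 2
static unsigned char du_visited[NUM_DU_PAIRS];

int compute(int a) {
    int x = a * 2;              /* def of x */
    if (a > 0) {
        du_visited[0] = 1;      /* probe: def of x reached this use */
        return x + 1;           /* use of x (pair 0) */
    }
    du_visited[1] = 1;          /* probe: def of x reached this other use */
    return x - 1;               /* use of x (pair 1) */
}

void dump_du_coverage(void) {   /* called at program exit to dump the probes */
    for (int i = 0; i < NUM_DU_PAIRS; i++)
        printf("du-pair %d: %s\n", i, du_visited[i] ? "covered" : "not covered");
}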
A standard scheme for modifying the program to do this is source-to-source program transformation. My paper Branch Coverage for Arbitrary Languages Made Easy shows how to do this with our DMS Software Reengineering Toolkit and its style of rewrite rules. The paper focuses on branch (block) coverage, but the instrumentation aspect applies here as well. A typical transformation rule looks like this:
rule mark_if_then_else(condition:expression; tstmt:statement; estmt:statement) =
  "if (\condition) \tstmt else \estmt;"
  rewrites to
  "if (\condition)
     { visited[\new_place\(\tstmt\)]=1;
       \tstmt }
   else
     { visited[\new_place\(\estmt\)]=1;
       \estmt
     };"
This rule modifies an if-then-else construct to collect "visited" booleans for each conditionally executed block (the then and else clauses), generating a new index for each new block. The \xxxx notation means "an arbitrary code structure of syntax type sss" when the transformation rule signature (first line) declares xxxx:sss. You can see more information on the precise syntax and meaning of the DMS rewrite rules here.
It turns out that getting def-use information is hard; you need what amounts to a compiler front end, thus the OP's mention of GCC. GCC won't do source-to-source transformations, but you can get essentially the same effect with source-to-binary transformations, as gcov does, by modifying the GCC sources with procedural code to add the probes. In general, though, GCC doesn't want to help you do this kind of custom instrumentation.
I don't know for certain, but I'm pretty sure Clang computes def-use information. It is possible to do source-to-source transformations with Clang, but I have no experience with that.
I do know that our DMS does compute def-use information for C and C++. That, and its ability to do source-to-source transformations, would make building a def-use coverage tool technically straightforward.
(Not asked, but DMS also computes control flows, so one could also do path coverage straightforwardly.)
Then there is the problem of building a display tool. You need something that can show the def-use pairs and their status, probably associated with/superimposed on the code, so it is easy to understand each def and use. That means you need to record line- and column-precise information about the location of each def and use. I don't think you can get that from GCC; it doesn't have that information in the binary, but maybe it has it in its constructed AST. You can get column information from DMS and Clang (I think).

Related

When should we care about cache misses?

I want to explain my question through a practical problem I met in my project.
I am writing a C library (which behaves like a programmable vi editor), and I plan to provide a series of APIs (more than 20 in total):
void vi_dw(struct vi *vi);
void vi_de(struct vi *vi);
void vi_d0(struct vi *vi);
void vi_d$(struct vi *vi);
...
void vi_df(struct vi *, char target);
void vi_dd(struct vi *vi);
These APIs do not perform core operations; they are just wrappers. For example, I can implement vi_de() like this:
void vi_de(struct vi *vi){
    vi_v(vi); // enter visual mode
    vi_e(vi); // press key 'e'
    vi_d(vi); // press key 'd'
}
However, if the wrappers are as simple as this, I have to write more than 20 similar wrapper functions.
So I considered implementing more complex wrappers to reduce their number:
void vi_d_move(struct vi *vi, vi_move_func_t move){
    vi_v(vi);
    move(vi);
    vi_d(vi);
}

static inline void vi_dw(struct vi *vi){
    vi_d_move(vi, vi_w);
}

static inline void vi_de(struct vi *vi){
    vi_d_move(vi, vi_e);
}
...
The function vi_d_move() is a better wrapper: it can convert part of the similar move operations into APIs, but not all of them; vi_f(), for example, needs another wrapper with a third argument, char target.
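For illustration, a hypothetical second wrapper shape for character-taking motions such as vi_f() (the typedef and names below are assumptions, not part of the real library) could look like:

/* Hypothetical sketch: a wrapper for motions that take a character argument. */
typedef void (*vi_move_char_func_t)(struct vi *, char);

static void vi_d_move_char(struct vi *vi, vi_move_char_func_t move, char target){
    vi_v(vi);          /* enter visual mode */
    move(vi, target);  /* e.g. vi_f(vi, target) */
    vi_d(vi);          /* delete the selection */
}

static inline void vi_df(struct vi *vi, char target){
    vi_d_move_char(vi, vi_f, target);
}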
I have finished explaining the example picked from my project. The pseudo code above is simpler than the real case, but it is enough to show that:
the more complex the wrapper is, the fewer wrappers we need, and the slower they will be (they become more indirect or have to consider more conditions).
There are two extremes:
1. Use only one wrapper, complex enough to accommodate all move operations and convert them into the corresponding APIs.
2. Use more than twenty small and simple wrappers; one wrapper per API.
For case 1, the wrapper itself is slow, but it has a better chance of being resident in the cache, because it is executed often (all APIs share it). It's a slow but hot path.
For case 2, the wrappers are simple and fast, but each has less chance of being resident in the cache. At the very least, the first time any API is called, a cache miss will happen (the CPU needs to fetch the instructions from memory, not from L1 or L2).
Currently I have implemented five wrappers, each of them relatively simple and fast. This seems to be a balance, but only seems. I chose five just because I felt the move operations could be divided into five groups naturally. I have no idea how to evaluate it. I don't mean with a profiler; I mean, in theory, what main factors should be considered in such a case?
At the end of this post, I want to add more detail about these APIs:
1. These APIs need to be fast, because this library is designed as a high-performance virtual editor. The delete/copy/paste operations are designed to approach the speed of bare C code.
2. A user program based on this library seldom calls all of these APIs, only some of them, and usually no more than 10 times each.
3. In the real case, these simple wrappers are about 80 bytes each, and would be no more than 160 bytes even if merged into a single complex one (but that would introduce more if-else branches).
4. As for the situations in which the library is used, I will take lua-shell as an example (a little off-topic, but some friends want to know why I care so much about its performance):
lua-shell is a *nix shell which uses Lua as its scripting language. Its command-execution unit (which does the fork(), exec(), ...) is just a C module registered into the Lua state machine.
lua-shell treats everything as Lua.
So, when the user inputs:
local files = `ls -la`
and presses Enter, the string is first sent to lua-shell's preprocessor, which converts the mixed syntax into pure Lua code:
local file = run_command("ls -la")
run_command() is the entry point of lua-shell's command-execution unit, which, as I said before, is a C module.
We can talk about libvi now. lua-shell's preprocessor is the first user of the library I am writing. Here is the relevant code (pseudo):
#include"vi.h"
vi_loadstr("local files = `ls -la`");
vi_f(vi, '`');
vi_x(vi);
vi_i(vi, "run_command(\"");
vi_f(vi, '`');
vi_x(vi);
vi_a(" \") ");
The code above is part of lua-shell's preprocessor implementation.
After generating the pure Lua code, it feeds that code to the Lua state machine and runs it.
The shell user is sensitive to the time between pressing Enter and seeing a new prompt, and in most cases lua-shell needs to preprocess scripts of larger size and with more complicated mixed syntax.
This is a typical situation in which libvi is used.
I wouldn't care that much about cache misses (especially in your case), unless your benchmarks (with compiler optimizations enabled, i.e. compiled with gcc -O2 -mtune=native if using GCC...) indicate that they matter.
If performance matters that much, enable more optimizations (perhaps compiling and linking your entire application or library with gcc -flto -O2 -mtune=native, that is, with link-time optimizations), and hand-optimize only what is critical. You should trust your optimizing compiler.
If you are in the design phase, consider perhaps making your application multi-threaded or somehow concurrent and parallel. With care, this could speed it up more than cache optimizations.
It is unclear what your library is about and what your design goals are. One way to add flexibility might be to embed some interpreter (like Lua or Guile or Python, etc.) in your application, hence configuring it through scripts. In many cases, such an embedding could be fast enough (especially when the application-specific primitives are of a high enough level). Another (more complex) possibility is to provide metaprogramming abilities, perhaps through some JIT-compiling library like libjit or libgccjit (so you would sort-of "compile" user scripts into dynamically produced machine code).
BTW, your question seems to focus on instruction cache misses. I would believe that data cache misses are more important (and less optimizable by the compiler), which is why you would prefer e.g. vectors to linked lists (and more generally care about low-level data structures, focusing on sequential, cache-friendly accesses).
(You could find a good video by Herb Sutter which explains that last point; I forgot the reference.)
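As a minimal sketch of that point (illustrative code, not taken from the question's library): the same reduction over contiguous storage and over a linked list touches memory very differently.

/* Illustrative only: sequential array access vs. pointer chasing.      */
/* The array walk is cache-friendly; each list hop may be a cache miss. */
#include <stddef.h>

struct node { int value; struct node *next; };

long sum_array(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                 /* sequential, prefetch-friendly access */
    return s;
}

long sum_list(const struct node *head) {
    long s = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;             /* each hop may land on a cold cache line */
    return s;
}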
In some very specific cases, with recent GCC or Clang, adding a few __builtin_prefetch calls might slightly improve performance (by decreasing cache misses), but it could also harm it significantly, so I don't recommend using it in general; but see this.
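For completeness, a hedged sketch of what a __builtin_prefetch call can look like (a GCC/Clang builtin; the lookahead distance of 16 elements is an arbitrary guess that would need measuring, and a bad value can make things slower):

/* Sketch only: prefetch a future element while processing the current one. */
#include <stddef.h>

long sum_with_prefetch(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
        s += a[i];
    }
    return s;
}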

I want to create a simple assembler in C. Where should I begin? [duplicate]

This question already has answers here: Building an assembler (4 answers); How Do You Make An Assembler? [closed] (4 answers). Closed 9 years ago.
I've recently been trying to immerse myself in the world of assembly programming with the eventual goal of creating my own programming language. I want my first real project to be a simple assembler written in C that will be able to assemble a very small portion of the x86 machine language and create a Windows executable. No macros, no linkers. Just assembly.
On paper, it seems simple enough. Assembly code comes in, machine code comes out.
But as soon as I start thinking about all the details, it suddenly becomes very daunting. What conventions does the operating system demand? How do I align data and calculate jumps? What does the inside of an executable even look like?
I'm feeling lost. There aren't any tutorials on this that I could find and looking at the source code of popular assemblers was not inspiring (I'm willing to try again, though).
Where do I go from here? How would you have done it? Are there any good tutorials or literature on this topic?
I have written a few myself (assemblers and disassemblers) and I would not start with x86. If you know x86 or any other instruction set, you can pick up and learn the syntax for another instruction set in short order (an evening/afternoon), at least the lion's share of it. The act of writing an assembler (or disassembler) will definitely teach you an instruction set, fast, and you will know that instruction set better than many seasoned assembly programmers for it who have not examined the microcode at that level. msp430, pdp11, and thumb (not thumb2 extensions) (or mips or openrisc) are all good places to start: not a lot of instructions, not overly complicated, etc.
I recommend a disassembler first, and with that a fixed-length instruction set like arm or thumb or mips or openrisc, etc. If not, then at least use a disassembler (definitely choose an instruction set for which you already have an assembler, linker, and disassembler) and, with pencil and paper, understand the relationship between the machine code and the assembly, in particular the branches; they usually have one or more quirks, like the program counter being an instruction or two ahead when the offset is added, and to gain another bit they sometimes measure in whole instructions rather than bytes.
It is pretty easy to brute-force parse the text with a C program to read the instructions. A harder task, but perhaps as educational, would be to use bison/flex, learn that language, and let those tools create an (even more extreme brute-force) parser which then interfaces with your code to tell you what was found where.
The assembler itself is pretty straightforward: just read the ASCII and set the bits in the machine code. Branches and other pc-relative instructions are a little more painful, as they can take multiple passes through the source/tables to completely resolve. For example, take:
mov r0,r1
mov r2 ,#1
The assembler begins by parsing the text one line at a time (a line being defined as the bytes up to a carriage return 0x0D or line feed 0x0A). Discard the whitespace (spaces and tabs) until you get to something that is not whitespace, then strncmp() that against the known mnemonics. If you hit one, then parse the possible combinations of that instruction. In the simple case above, after the mov, skip over the whitespace to the next non-whitespace; perhaps the first thing you find must be a register, then optional whitespace, then a comma. Remove the whitespace and comma and compare the register against a table of strings, or just parse through it. Once that register is done, go past where the comma was found, and let's say what follows is either another register or an immediate. If it is an immediate, let's say it has to have a # sign; if it is a register, let's say it has to start with a lower- or upper-case 'r'. After parsing that register or immediate, make sure there is nothing else on the line that shouldn't be there. Build the machine code for this instruction, or at least as much of it as you can, and move on to the next line. It may be tedious, but it is not difficult to parse ASCII...
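A minimal sketch of that kind of brute-force line parser (illustrative only; the mnemonic handling and the 16-bit encodings below are invented, not a real instruction set):

/* Illustrative brute-force parse of "mov rD,rS" / "mov rD,#imm" lines. */
/* The output encodings are made up for demonstration purposes.         */
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

static const char *skip_ws(const char *p) {
    while (*p == ' ' || *p == '\t') p++;
    return p;
}

/* Returns 0 on success and fills *out with a fake 16-bit encoding. */
int assemble_line(const char *line, unsigned *out) {
    line = skip_ws(line);
    if (strncmp(line, "mov", 3) != 0) return -1;          /* unknown mnemonic */
    line = skip_ws(line + 3);
    if (tolower((unsigned char)*line) != 'r') return -1;  /* expect a register */
    int rd = atoi(line + 1);
    while (*line && *line != ',') line++;                 /* scan forward to the comma */
    if (*line != ',') return -1;
    line = skip_ws(line + 1);
    if (tolower((unsigned char)*line) == 'r') {           /* register-to-register form */
        int rs = atoi(line + 1);
        *out = 0x1000u | (unsigned)(rd << 4) | (unsigned)rs;
        return 0;
    }
    if (*line == '#') {                                   /* immediate form */
        int imm = atoi(line + 1);
        *out = 0x2000u | (unsigned)(rd << 4) | ((unsigned)imm & 0xFu);
        return 0;
    }
    return -1;
}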
At a minimum you will want a table/array that accumulates the machine code/data as it is created, plus some method for marking instructions as incomplete, i.e. the pc-relative instructions to be completed on a future pass. You will also want a table/array that collects the labels you find and the address/offset in the machine-code table where each was found, as well as the labels used in instructions as a destination/source and the offset in the table/array of the partially complete instruction they go with. After the first pass, go back through these tables until you have matched up all the label definitions with the labels used as a source or destination, using the label definition's address/offset to compute the distance to the instruction in question, and then finish creating the machine code for that instruction. (Some disassembly may be required, and/or you may use some other method for remembering what kind of encoding it was, when you come back later to finish building the machine code.)
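A hedged sketch of the bookkeeping just described (the structures and field names are assumptions, only to make the idea concrete):

/* Illustrative data structures for the two-pass approach; names are made up. */
#include <stdint.h>

struct label_def {              /* a label definition found in pass 1 */
    char     name[32];
    uint32_t offset;            /* where in the output it was defined */
};

struct fixup {                  /* a label use left unresolved in pass 1 */
    char     name[32];
    uint32_t insn_offset;       /* which partially built instruction to patch */
    int      encoding_kind;     /* remembers how to re-encode it in pass 2 */
};

struct program {
    uint8_t          code[65536];  /* accumulated machine code/data */
    uint32_t         code_len;
    struct label_def labels[256];
    int              nlabels;
    struct fixup     fixups[256];
    int              nfixups;
};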
The next step is allowing for multiple source files, if that is something you want to support. Now you have labels that don't get resolved by the assembler, so you have to leave placeholders in the output and emit some flavor of the longest jump/branch instruction, because you don't know how far away the destination will be; expect the worst. Then there is the output file format you choose to create/use, and then there is the linker, which is mostly simple, but you have to remember to fill in the machine code for the final pc-relative instructions; that is no harder than it was in the assembler itself.
Note, writing an assembler is not necessarily related to creating a programming language and then writing a compiler for it; they are separate things, different problems. Actually, if you want to make a new programming language, just use an existing assembler for an existing instruction set. Not required, of course, but most teachings and tutorials are going to use the bison/flex approach for programming languages, and there are many college course lecture notes/resources out there for beginning compiler classes that you can just use to get started, then modify the scripts to add the features of your language. The middle and back ends are a bigger challenge than the front end. There are many books on this topic and many online resources as well. As mentioned in another answer, LLVM is not a bad place to create a new programming language; the middle and back ends are done for you, and you only need to focus on the programming language itself, the front end.
You should look at LLVM. LLVM is a modular compiler back end; the most popular front end is Clang, for compiling C/C++/Objective-C. The good thing about LLVM is that you can pick the part of the compiler chain that you are interested in and focus on just that, ignoring all the others. If you want to create your own language, write a parser that generates the LLVM internal representation, and for free you get all of the middle-layer target-independent optimisations and compilation to many different targets. Interested in a compiler for some exotic CPU? Write a compiler back end that takes the LLVM intermediate code and generates your assembly. Have some ideas about optimisation techniques, automatic threading perhaps? Write a middle layer which processes LLVM intermediate code. LLVM is a collection of libraries, not a standalone binary like GCC, so it is very easy to use in your own projects.
What you're looking for is not a tutorial or source code, it's a specification. See http://msdn.microsoft.com/en-us/library/windows/hardware/gg463119.aspx
Once you understand the specification of an executable, write a program to generate one. The executable you build should be as simple as possible. Once you have mastered that, then you can write a simple line-oriented parser that reads instruction names and numeric arguments to generate a block of code to plug into the exe. Later you can add symbols, branches, sections, whatever you want, and that's where something like http://www.davidsalomon.name/assem.advertis/asl.pdf will come in.
P.S. Carl Norum has a good point in the comment above. If your goal is create your own programming language, learning to write an assembler is irrelevant and is very much not the right way to start (unless the language you want to create is an assembly language). There are already assemblers that produce executables from assembler source, so your compiler could produce assembler source and you could avoid the work of recreating the assembler ... and you should. Or you could use something like LLVM, which will solve many other daunting problems of compiler construction. The odds are very small that you will ever actually produce your own programming language, but they're much smaller if you start from scratch and there's no need to. Decide what your goal is and use the best tools available to achieve it.

Find functional changes between two revisions of a file (compile diff?)

I'm looking for a tool that checks whether two (C) source code files generate the same binary so that I can find actual functional changes between two files and ignore mere coding style changes.
It would be great if this worked even within a file for different changesets, so that a file may have changed in coding style in some places but also had one functional patch added.
It's very, very hard to write a program that figures out the "functional" result of another program, and such a program sounds like it would be necessary for this. I would guess that computer programs themselves are just about the most compact and machine-readable way we have to even describe functionality, so it's hard to write a program that analyses a program and generates a "better" description.
Somehow abstracting out and "understanding" that coding-style differences don't affect functionality also sounds very, very hard. I find it hard even when manually reading other people's code, because the differences in style can be pretty large, even though the end result might be the same as it would be in "my style".
I would be surprised if a solution wouldn't also require a solution to the halting problem, which is proven impossible for the general case.
The only way is to compile both with the same compiler options and do a binary diff.
It's not only style changes you'd have to look out for; someone may have extracted code into a function that gets inlined in an optimised build. This may or may not give the same binary, depending on compiler options and version.
Mapping binary back to source to "high level functionality" - unlikely.
Comparing two source files with respect to "high level functionality" (ignoring coding style) - possible:
http://cscope.sourceforge.net/
Alternative suggestion:
Write a tool that "normalizes" your source files - by applying the same formatting to both sets of code.
This can easily be automated.
For example:
1) checkout both from version control,
2) apply "standard format",
3) compare
If all you're interested in is whether they both "generate the same binary", then the easiest solution is simply to generate both binaries, and compare.
Note, however, that there are things that would result in binaries that are bitwise different, even though they're functionally identical:
Change in external function names
Optimisations
Reordering non-dependent code snippets
etc.
There is a branch of computer science that deals with concurrency and parallel processes.
One of its applications is deciding whether two systems are behaviorally equivalent (in some bisimulation relation, weak or strong).
However, it is computationally very difficult to decide whether two large systems are behaviorally equivalent. The usage is mainly for verification of small critical applications where we can't afford failure.

Is there a static invariant discovery tool for C programs?

I'm looking for a tool that can statically discover invariants in C programs. I checked out Daikon but it discovers invariants only dynamically.
Is there a tool available for what I'm looking for? Thanks!
See The SLAM project: debugging system software via static analysis. It claims to infer invariants statically, for just what you asked for, the C language. The author, Tom Ball, is widely known for stellar work in program analysis.
If you mean "invariant" in the widest sense, as the linked page to Daikon is using, then the work of many static analysis tools can be described as "discovering invariants", just perhaps not the expressive invariants you were looking for.
Frama-C's value analysis accumulates its results, the possible values of all variables, for each statement. At the end of the analysis, it can thus present non-relational information about the domain variation of each variable in the program, at each statement. In this screenshot, an invariant is that S is always 0, 1, 3 or 6 just before the selected instruction, for all executions of this deterministic program.
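For intuition, an assumed example (not necessarily the program behind that screenshot) of the kind of deterministic code for which such a value-set invariant would be reported:

/* Assumed illustration: just before the printf call, across all loop */
/* iterations, S takes exactly the values 0, 1, 3 and 6.              */
#include <stdio.h>

int main(void) {
    int S = 0;
    for (int i = 0; i <= 3; i++) {
        S += i;
        printf("%d\n", S);   /* value analysis would report S in {0, 1, 3, 6} here */
    }
    return 0;
}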
The two hidden parameters in your question are the shape of the invariants you are interested in, and the shape of the programs you want to find these invariants for. For instance, SLAM, mentioned in Ira's answer, was designed to work on device driver code, and to infer invariants that just contain the necessary information for verifying proper use of system APIs. Another tool, Astrée, is famous for doing a very good job at inferring just the right invariants to demonstrate runtime safety of flight control software.
The two degrees of freedom make for a very large design space. You won't find anything that works for all kinds of C programs and infers all the invariants you might be interested in, but if you refine your question for specific application domains and kinds of invariants, you will have better chances to find relevant answers.

Automated tracing use of variables within source code

I'm working with a set of speech-processing routines (written in C) meant to be compiled with the mex command in MATLAB. There is a C function which I'm interested in accelerating using an FPGA.
The hardware takes in specified input parameters through input ports, treats the rest of the inputs as constants to be hard-coded, and passes a particular variable from somewhere within the C function, say foo, to the output port.
I am interested in tracing the computation graph (unsure if this is the right term to use) of foo, i.e. how foo relates to intermediate computed variables, which in turn eventually depend on the input parameters and hard-coded constants. This is to allow me to flatten the logic so it can be coded in a hardware description language, as well as to remove irrelevant logic which does not affect the value of foo. The catch is that some intermediate variables are global, so tracing is a headache.
Is there an automated tool which analyzes a given set of C headers and source files and provide a means of tracing how a specified variable is altered, with some kind of dependency graph of all variables used?
I think what you are looking for is a tool to do value analysis.
Among the tools available to do this, I think CodeSurfer is probably the best out there. Of course, it is also quite expensive, but if you are a student, they do have an academic license program. On the open-source side, Frama-C can also do this in a more limited fashion and has a much, much steeper learning curve. But it is free and will get you where you want to go.
