cprover

Topics

 analyses
 ansi-c
 assembler
 big-int
 cbmc
 cpp
 cprover
 goto-analyzer
 goto-cc
 goto-checker
 goto-diff
 goto-harness
 goto-inspect
 goto-instrument
 goto-programs
 goto-symex
 goto-synthesizer
 json
 langapi
 linking
 pointer-analysis
 solvers
 util
 xmllang
 janalyzer
 java_bytecode
 jbmc
 jdiff
 miniz

Detailed Description

Background Concepts

Author
Martin Brain, Peter Schrammel, Johannes Kloos

The purpose of this section is to explain several key concepts used throughout the CPROVER framework at a high level, ignoring the details of the actual implementation. In particular, we will discuss different ways to represent programs in memory, three important analysis methods and some commonly used terms.

Representations

One of the first questions we should consider is how to represent programs in such a way that we can easily analyze and reason about them.

As it turns out, the best way to do this is to use a variety of different representations, each representing a different level of abstraction. These representations are designed in such a way that for each analysis we want to perform, there is an appropriate representation, and it is easy to go from representations that are close to the source code to representations that focus on specific semantic aspects of the program.

The representations that the CPROVER framework uses mirror those used in modern compilers such as LLVM and gcc. I will point out those places where the CPROVER framework does things differently, attempting to give rationales wherever possible.

One in-depth resource for most of this section is the classic compiler construction textbook ''Compilers: Principles, Techniques and Tools'' by Aho, Lam, Sethi and Ullman.

To illustrate the different concepts, we will consider a small example program. While the program is in C, the general ideas apply to other languages as well - see later sections of this manual to understand how the specific features of those languages are handled. Our running example will be a program that calculates factorials.

/* factorial.c
 *
 * For simplicity's sake, we just give the forward
 * declarations of atoi and printf.
 */
int atoi(const char *);
int printf(const char *, ...);

unsigned long factorial(unsigned n) {
  unsigned long fac = 1;
  for (unsigned int i = 1; i <= n; i++) {
    fac *= i;
  }
  return fac;
}

/* Error handling elided - this is just for illustration. */
int main(int argc, const char **argv) {
  unsigned n = atoi(argv[1]);
  printf("%u! = %lu\n", n, factorial(n));
  return 0;
}

The question of this first section is: how do we represent this program in memory so that we can do something useful with it?

One possibility would be to just store the program as a string, but this is clearly impractical: even finding whether there is an assignment to a specific variable would require significant parsing effort. For this reason, the first step is to parse the program text into a more abstract representation.

AST

The first step in representing a program in memory is to parse the program, at the same time checking for syntax errors, and store the parsing result in memory.

The key data structure that stores the result of this step is known as an Abstract Syntax Tree, or AST for short (cf. Wikipedia). ASTs are still relatively close to the source code, and represent the structure of the source code while abstracting away from syntactic details, e.g., dropping parentheses, semicolons and braces as long as those are only used for grouping.

Considering the example of the C program given above, we first notice that the program describes (in C terms) a single translation unit, consisting of four top-level declarations (the two function forward declarations of atoi and printf, and the function definitions of factorial and main). Let us start considering the specification of atoi. This gives rise to a subtree modeling that we have a function called atoi whose return type is int, with an unnamed argument of type const char *. We can represent this using a tree that has, for instance, the following structure (this is a simplified version of the tree that the CPROVER framework uses internally):

AST for the `atoi` declaration

This graph shows the (simplified) AST structure for the atoi function. The top level says that this is a global entity, namely one that has code (i.e., a function), called atoi and yielding int. Furthermore, it has a child node initiating a parameter list, and there is a node in the parameter list for each parameter, giving its type and name, if a name is given.

Extending this idea, we can represent the structure of the factorial function using ASTs. The idea here is that the code itself has a hierarchical structure. In the case of C, this starts with the block structure of the code: at the top, we start with a block of code, having three children, each being a ''statement'' node:

  1. unsigned long fac = 1
  2. for (unsigned int i = 1; i <= n; i++) { fac *= i }
  3. return fac

The first statement is already a basic statement: we represent it as a local declaration (similar to the global declarations above) of a variable.

AST for `unsigned long fac = 1`

The second statement is a compound statement, which we can decompose further. At the top level, we have a node stating that this is a for statement, with four child nodes:

  1. Another declaration node, declaring variable i.
  2. An expression node, with operator <= and two children giving the LHS as variable i and the RHS as variable n.
  3. An expression node with post-fix operator ++ and a child giving the variable i as argument.
  4. A block node, starting a new code block. This node has one child:
    1. An expression node with top-level operator *= and two child nodes giving the LHS as variable fac and the RHS as variable i.

All in all, the AST for this piece of code looks like this:

AST for the `for` loop

Finally, the third statement is again simple: it consists of a return statement node, with a child node for the variable expression fac. Since the AST is very similar to the first AST above, we omit it here. All in all, the AST for the function body looks like this:

AST for the body of `factorial`

Using the AST for the function body, we can easily produce the definition of the factorial function:

AST for `factorial` (full definition)

In the end, we produce a sequence of trees modeling each declaration in the translation unit (i.e., the file factorial.c).

This data structure is already useful: at this level, we can easily derive simple information such as ''which functions are being defined?'', ''what are the arguments to a given function'' and so on.

Symbol tables and variable disambiguation

Nevertheless, for many analyses, this representation needs to be transformed further. In general, the first step is to resolve variable names. This is done using an auxiliary data structure known as the symbol table.

The issue that this step addresses is that the meaning of a variable name depends on a given scope - for instance, in a C program, we can define the variable i as a local variable in two different functions, so that the name i refers to two different objects, depending on which function is being executed at a given point in time.

To resolve this problem, a first transformation step is performed, changing (short) variable names to some form of unique identifier. This ensures that each variable identifier is used only once, and all accesses to a given variable can be easily found by just looking for the variable identifier. For instance, we could change each variable name into a pair of the name and a serial number. The serial number gets increased whenever a variable of that name is declared. In the example, the ASTs for factorial and main after resolving variable names would look roughly like this:

ASTs with resolved variables

Note that the parameter n in factorial and the local variable n in main are now disambiguated as n.1 and n.2; furthermore, we leave the names of global objects as-is. In the CPROVER framework, a more elaborate system is used: local variables are prefixed with the function names, and further disambiguation is performed by adding indices. For brevity, we use indices only.

Further information on ASTs can be found in the Wikipedia page and the materials linked there. Additionally, there is an interesting discussion on StackOverflow about abstract versus concrete syntax trees.

At this level, we can already perform a number of interesting analyses, such as basic type checking. But for more advanced analyses, other representations are better: the AST contains many different kinds of nodes (essentially, one per type of program construct), and has a rather intricate structure.

Intermediate Representations (IR)

The ASTs in the previous section represent the syntax of a program, including all the features of a given programming language. But in practice, most programming languages have a large amount of ''syntactic sugar'': constructs that are, technically speaking, redundant, but make programming a lot easier. For analysis, this means that if we immediately try to work on the initial AST, we would have to handle all these various cases.

To simplify analysis, it pays off to bring the AST into simpler forms, known as intermediate representations (short IR). An IR is usually given as some form of AST, using a more restricted subset or a variant of the original language that is easier to analyze than the original version.

Taking the example from above, we rewrite the program into a simpler form of the C language: instead of allowing powerful control constructs such as while and for, we reduce everything to if and goto. In fact, we even restrict if statements: an if statement should always be of the form if (*condition*) goto *target* else goto *target*;. As it turns out, this is sufficient to represent every C program. The factorial function in our example program can then be rewritten as follows:

unsigned long factorial(unsigned n) {
  unsigned long fac = 1;
  // Replace the for loop with if and goto
  unsigned int i = 1;
for_loop_start:
  if (i <= n) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
  fac *= i;
  i++;
  goto for_loop_start;
for_loop_end:
  return fac;
}

We leave it up to the reader to verify that both versions of the function behave the same way, and to draw the function as an AST.

In the CPROVER framework, a number of different IRs are employed to simplify the program under analysis into a simple core language step-by-step. In particular, expressions are brought into much simpler forms. This sequence of transformations is described in later chapters.

TODO: Link to corresponding sections.

Control Flow Graphs (CFG)

Another important representation of a program can be gained by transforming the program structure into a control flow graph, short CFG. While the AST focuses more on the syntactic structure of the program, keeping constructs like while loops and similar forms of structured control flow, the CFG uses a unified graph representation for control flow.

In general, for analyses based around Abstract Interpretation (see Abstract Interpretation), it is usually preferable to use a CFG representation, while other analyses, such as variable scope detection, may be easier to perform on ASTs.

The general idea is to present the program as a graph. The nodes of the graph are instructions or sequences of instructions. In general, the nodes are basic blocks: a basic block is a sequence of statements that is always executed in order from beginning to end. The edges of the graph describe how the program execution may move from one basic block to the next. Note that single statements are always basic blocks; this is the representation used inside the CPROVER framework. In the examples below, we try to use maximal basic blocks (i.e., basic blocks that are as large as possible); this can be advantageous for some analyses.

Let us consider the factorial function as an example. As a reminder, here is the code, in IR:

unsigned long fac = 1;
unsigned int i = 1;
for_loop_start:
  if (i <= n) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
  fac *= i;
  i++;
  goto for_loop_start;
for_loop_end:
  return fac;

We rewrite the code with disambiguated variables (building the AST from it is left as an exercise):

unsigned long fac.1 = 1;
unsigned int i.1 = 1;
for_loop_start:
  if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
  fac.1 *= i.1;
  i.1++;
  goto for_loop_start;
for_loop_end:
  return fac.1;

This function consists of four basic blocks:

  1. unsigned long fac.1 = 1; unsigned int i.1 = 1;
  2. if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end (this block has a label, for_loop_start).
  3. fac.1 *= i.1; i.1++; goto for_loop_start (this block has a label, for_loop_entry).
  4. return fac.1 (this block has a label, for_loop_end).

One way to understand which instructions form basic blocks is to consider the successors of each instruction. If we have two instructions A and B, we say that B is a successor of A if, after executing A, we can execute B without any intervening instructions. For instance, in the example above, the loop initialization statement unsigned int i.1 = 1 is a successor of unsigned long fac.1 = 1. On the other hand, return fac.1 is not a successor of unsigned long fac.1 = 1: we always have to execute some other intermediate statements to reach the return statement.

Now, consider the if statement, if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end. This statement has two successors: fac.1 *= i.1 and return fac.1.

Similarly, we say that A is a predecessor of B if B is a successor of A. We find that the if statement has two predecessors, unsigned int i.1 = 1 and goto for_loop_start.

A basic block is a sequence of instructions with the following property:

  1. If B comes directly after A in the sequence, B must be the sole successor of A.
  2. If A comes directly before B in the sequence, A must be the sole predecessor of B.

In particular, each member of the sequence but the first must have exactly one predecessor, and each member of the sequence but the last must have exactly one successor. These criteria explain why we have the basic blocks described above.

Putting everything together, we get a control flow graph like this:

Control flow graph for factorial

The graph can be read as follows: each node corresponds to a basic block. The initial basic block (where the function is entered) is marked with a double border, while those basic blocks that leave the function have a gray background. An edge from a basic block B to a basic block B' means that if the execution reaches the end of B, execution may continue in B'. Some edges are labeled: the edge leaving the comparison basic block with the label true, for instance, can only be taken if the comparison did, in fact, return true.

Note that this representation makes it very easy to interpret the program, keeping just two pieces of state: the current position (which basic block and which line), and the values of all variables (in real software, this would also include parts of the heap and the call stack). Execution proceeds as follows: as long as there are still instructions to execute in the current basic block, run the next instruction and move the current position to right after that instruction. At the end of a basic block, take one of the available outgoing edges to another basic block.

In the CPROVER framework, we often do not construct CFGs explicitly, but instead use an IR that is constructed in a way very similar to CFGs, known as ''GOTO programs''. This IR is used to implement a number of static analyses, as described in sections Frameworks: and Specific analyses:.

SSA

While control flow graphs are already quite useful for static analysis, some techniques benefit from a further transformation to a representation known as static single assignment, short SSA. The point of this step is to ensure that we can talk about the entire history of assignments to a given variable. This is achieved by renaming variables again: whenever we assign to a variable, we clone this variable by giving it a new name. This ensures that each variable appearing in the resulting program is written to exactly once (but it may be read arbitrarily many times). In this way, we can refer to earlier values of the variable by just referencing the name of an older incarnation of the variable.

We illustrate this transformation by first showing how the body of the for loop of factorial is transformed. We currently have:

expression: fac.1 *= i.1
expression: i.1++

We now give a second number to each variable, counting the number of assignments so far. Thus, the SSA version of this code turns out to be

expression: fac.1.2 = fac.1.1 * i.1.1
expression: i.1.2 = i.1.1 + 1

This representation now allows us to state facts such as ''i is increasing'' by writing i.1.1 < i.1.2.

At this point, we run into a complication. Consider the following piece of code for illustration:

// Given some integers a, b
int x = a;
if (a < b)
x = b;
return x;

The corresponding control flow graph looks like this:

CFG for maximum function

When we try to transform to SSA, we get:

CFG for maximum function - SSA, attempt

Depending on which path the execution takes, we have to return either x.1 or x.2! The way to make this work is to introduce a function Φ that selects the right instance of the variable; in the example, we would have

CFG for maximum function - SSA using Φ

In the CPROVER framework, we provide a precise implementation of Φ, using explicitly tracked information about which branches were taken by the program. There are also some differences in how loops are handled (finite unrolling in CPROVER, versus a Φ-based approach in compilers); this approach will be discussed in a later chapter.

For the time being, let us come back to factorial. We can now give an SSA using Φ functions:

Control flow graph in SSA for factorial

The details of SSA construction, plus some discussion of how it is used in compilers, can be found in the original paper.

The SSA is an extremely helpful representation when one wishes to perform model checking on the program (see next section), since it is much easier to extract the logic formulas used in this technique from an SSA compared to a CFG (or, even worse, an AST). That being said, the CPROVER framework takes a different route, opting to convert to an intermediate representation known as GOTO programs instead.

Field Sensitivity

Control granularity of object accesses.

Field sensitivity is a transformation of the goto-program instructions which mainly replaces some expressions by symbols, but can also add assignments to the target equations produced by symbolic execution. The main goal of this transformation is to allow more constants to be propagated during symbolic execution. Note that field sensitivity is not applied as a single pass over the whole goto program but instead applied as the symbolic execution unfolds.

On a high level, field sensitivity replaces member operators and array accesses with atomic symbols representing a field when possible. In cases where this is not immediately possible, such as struct assignments, additional assignments are introduced instead. The possible cases are described below.

Member access

A member expression struct_expr.field_name is replaced by the symbol struct_expr..field_name; note the use of .. to visually distinguish the symbol from the member expression. This is valid for both lvalues and rvalues. See field_sensitivityt::apply.

Symbols representing structs

In an rvalue, a symbol struct_expr which has a struct type with fields field_name1, field_name2, etc., will be replaced by {struct_expr..field_name1; struct_expr..field_name2; …}. See field_sensitivityt::get_fields.

Assignment to a struct

When the symbol is on the left-hand-side, for instance for an assignment struct_expr = other_struct, the assignment is replaced by a sequence of assignments: struct_expr..field_name1 = other_struct..field_name1; struct_expr..field_name2 = other_struct..field_name2; etc. See field_sensitivityt::field_assignments.

Array access

An index expression array[index] when index is constant and array has constant size is replaced by the symbol array[[index]]; note the use of [[ and ]] to visually distinguish the symbol from the index expression. When index is not a constant, array[index] is replaced by {array[[0]]; array[[1]]; …}[index]. Note that this process does not apply to arrays whose size is not constant, nor to arrays whose size exceeds the bound max_field_sensitivity_array_size. See field_sensitivityt::apply.

Symbols representing arrays

In an rvalue, a symbol array which has array type will be replaced by {array[[0]]; array[[1]]; …}[index]. See field_sensitivityt::get_fields.

Assignment to an array

When the array symbol is on the left-hand-side, for instance for an assignment array = other_array, the assignment is replaced by a sequence of assignments: array[[0]] = other_array[[0]]; array[[1]] = other_array[[1]]; etc. See field_sensitivityt::field_assignments.
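Putting the assignment cases together, here is an illustrative expansion using made-up names. The `..` and `[[ ]]` forms are the symbol-naming conventions described above, not valid C syntax.

```
/* Original goto-program assignments (illustrative): */
struct S { int a; int b; } s, t;
int arr[2], other[2];

s = t;          /* struct assignment */
arr = other;    /* array assignment  */

/* After field sensitivity: */
s..a = t..a;
s..b = t..b;
arr[[0]] = other[[0]];
arr[[1]] = other[[1]];
```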

Analysis techniques

Bounded model checking

One of the most important analysis techniques provided by the CPROVER framework, implemented in the CBMC (and JBMC) tools, is bounded model checking, a specific instance of a method known as Model Checking.

The basic question that model checking tries to answer is: given some system (in our case, a program) and some property, can we find an execution of the system such that it reaches a state where the property holds? If yes, we would like to know how the program reaches this state - at the very least, we want to see what inputs are required, but in general, we would prefer having a trace, which shows what statements are executed and in which order.

In general, a trace describes which statements of the program were executed, and which intermediate states were reached. Often, it is sufficient to only provide part of the intermediate states (omitting some entirely, and only mentioning parts that cannot be easily reconstructed in others).

As it turns out, model checking for programs is, in general, a hard problem. Part of the reason for this is that many model checking algorithms strive for a form of ''completeness'' where they either find a trace or return a proof that such a trace cannot possibly exist.

Since we are interested in generating test cases, we prefer a different approach: it may be that a certain target state is reachable only after a very long execution, or not at all, but this information does not help us in constructing test cases. For this reason, we introduce an execution bound that describes how deep we go when analyzing a program.

Model checking techniques using such execution bounds are known as bounded model checking; they will return either a trace, or a statement that says ''the target state could not be reached in n steps'', for a given bound n. Thus, for a given bound, we always get an underapproximation of all states that can be reached: we can certainly find those reachable within the given bound, but we may miss states that can be reached only with more steps. Conversely, we will never claim that a state is not reachable within a certain bound if there is, in fact, a way of reaching this state.

The bounded model checking techniques used by the CPROVER framework are based on symbolic model checking, a family of model checking techniques that work on sets of program states and use advanced tools such as SAT solvers (more on that below) to calculate the set of reachable states. The key step here is to encode both the program and the set of states using an appropriate logic, mostly propositional logic and (fragments of) first-order logic.

In the following, we will quickly discuss propositional logic, in combination with SAT solving, and show how to build a simple bounded model checker. Actual bounded model checking for software requires a number of additional steps and concepts, which will be introduced as required later on.

Propositional logic and SAT solving

Many of the concepts in this section can be found in more detail in the Wikipedia article on Propositional logic.

Let us start by looking at propositional formulas. A propositional formula consists of propositional variables, say x, y and z, that can take the Boolean values true and false, connected together with logical operators (often called junctors), namely and, or and not. Sometimes, one introduces additional junctors, such as xor or implies, but these can be defined in terms of the three basic junctors just described.

Examples of propositional formulas would be ''x and y'' or ''not x or y or z''. We can evaluate formulas by setting each variable to a Boolean value and reducing using the following rules:

  • x and false = false; false and x = false; false or false = false; not true = false
  • x or true = true; true or x = true; true and true = true; not false = true

An important related question is: given a propositional formula, is there a variable assignment that makes it evaluate to true? This is known as the SAT problem. The most important things to know about SAT are:

  1. It forms the basis for bounded model checking algorithms;
  2. It is a very hard problem to solve in general: it is NP-complete, meaning that it is easy to check a solution, but (as far as we know) hard to find one;
  3. There has been impressive research in SAT solvers that work well in practice for the kinds of formulas that we encounter in model checking. A commonly-used SAT solver is minisat.
  4. SAT solvers use a specific input format for propositional formulas, known as the conjunctive normal form. For details, see the linked Wikipedia page; roughly, a conjunctive normal form formula is a propositional formula with a specific shape: at the lowest level are atoms, which are propositional variables ''x'' and negated propositional variables ''not x''; the next layer above are clauses, which are sequences of atoms connected with ''or'', e.g. ''not x or y or z''. The top layer consists of sequences of clauses, connected with ''and''.

As an example in how to use a SAT solver, consider the following formula (in conjunctive normal form):

''(x or y) and (x or not y) and x and y''

We can represent this formula in the minisat input format as:

p cnf 2 4
1 2 0
1 -2 0
1 0
2 0

Compare the Minisat user guide. Try to run minisat on this example. What would you expect, and what result do you get?

Next, try running minisat on the following formula: ''(x or y) and (x or not y) and (not x) and y'' What changed? Why?

Using SAT to reason about data: Bit vectors

So far, we have seen how to reason about simple propositional formulas. This seems to be quite far from our goal of reasoning about the behavior of programs. As a first step, let us see how we can lift SAT-based reasoning to reasoning about data such as machine integers.

Remember the factorial function. It starts with the line

unsigned long fac = 1;

Now, suppose that fac will be represented as a binary number with 64 bits (this is standard on, e.g., Linux). So, if we wish to reason about the contents of the variable fac, we might as well represent it as a vector of 64 propositional variables, say fac0 to fac63, where fac = 2^63 * fac63 + ... + 2^0 * fac0. We can then assert that fac = 1 using the propositional formula fac63 = 0 and ... and fac1 = 0 and fac0 = 1, where we define the formula A = B as ''(A or not B) and (B or not A)''.

We call this a bit vector representation. Compare the Wikipedia page on Binary numbers.

Bit vector representations can also be used to describe operations on binary numbers. For instance, suppose we have two four-bit numbers A_3, ..., A_0 (representing a number A) and B_3, ..., B_0 (representing a number B) and wish to add them. As detailed on the page on Binary adders, we define three additional bit vectors: the carries C_0, ..., C_3, the partial sums P_0, ..., P_4 and the sum S_0, ..., S_4, representing a number S such that S = A + B. Note that the sum vector has one bit more - why? How is this related to arithmetic overflow in C?

For convenience, we first define the half-adder. This is given as two formulas, HA_S(A,B) = (A and not B) or (B and not A), which gives the sum of the bits A and B, and HA_C(A,B) = A and B, which indicates whether the result of the sum was too big to fit into one bit (so we carry a one to the next bit).

Using the half-adder formulas, we can define the full adder, again given as two formulas, one for sum and the other for carry. We have FA_S(A,B,C_in) = HA_S(HA_S(A,B),C_in), giving the sum of A, B and C_in, and FA_C(A,B,C_in) = HA_C(A,B) or (C_in and HA_S(A,B)), which states whether the result is too big to fit into a bit. Note that the full adder has an additional input for carries; in the following step, we will use it to chain the full adders together to compute the actual sum.

Using the helper variables we have above, we can give the sum of the four-bit numbers as

C_0 = FA_C(A_0,B_0,0) and C_1 = FA_C(A_1,B_1,C_0) and C_2 = FA_C(A_2,B_2,C_1) and C_3 = FA_C(A_3,B_3,C_2) and S_0 = FA_S(A_0,B_0,0) and S_1 = FA_S(A_1,B_1,C_0) and S_2 = FA_S(A_2,B_2,C_1) and S_3 = FA_S(A_3,B_3,C_2) and S_4 = FA_S(0,0,C_3).

Other arithmetic operations on binary numbers can be expressed using propositional logic as well; the details can be found in the linked articles, as well as Two's complement for handling signed integers and IEEE 754 for floating point numbers.

In the following, we will simply write formulas such as S=A+B, with the understanding that this is internally represented using the appropriate bit vectors.

How bounded model checking works

At this point, we can start considering how to express program behavior in propositional logic.

Let us start with a very simple program:

int sum(int a, int b)
{
  return a + b;
}

To describe the behavior of this program, we introduce the appropriately-sized bit vectors A and B, and an additional helper vector return. The A and B bit vectors reflect the values of the parameters a and b, while return contains the return value of the function. As we have seen above, we can describe the value of a+b as A+B – remember that this is an abbreviation for a moderately complex formula on bit vectors!

From the semantics of the return instruction, we know that this program will return the value a+b, so we can describe its behavior as return = A+B.

Let us consider a slightly more complex program.

int calculate(int x)
{
  int y = x * x;
  y = y + x;
  return y;
}

We again introduce several bit vectors. For the parameter x, we introduce a bit vector X, and for the return value, return. But we also have to deal with the (local) variable y, which gets two assignments. Furthermore, we now have a program with three instructions.

Thinking about the approach so far, we find that we cannot directly translate y=y+x into a propositional formula: On the left-hand side of the =, we have the ''new'' value of y, while the right-hand side uses the ''old'' value. But propositional formulas do not distinguish between ''old'' and ''new'' values, so we need to find a way to distinguish between the two. The easiest solution is to use the Static Single Assignment form described in section SSA. We transform the program into SSA (slightly simplifying the notation):

int calculate(int x.1)
{
  int y.1 = x.1 * x.1;
  y.2 = y.1 + x.1;
  return y.2;
}

In this form, we know that each variable is assigned to at most once. To capture the behavior of this program, we translate it statement-by-statement into a propositional formula. We introduce two bit vectors Y1 and Y2 to stand for y.1 and y.2 (we map X to x.1 and return to the return value). int y.1 = x.1 * x.1 becomes Y1 = X * X, y.2 = y.1 + x.1 becomes Y2 = Y1 + X and return y.2 becomes return = Y2.

To tie the three formulas together into a description of the whole program, we note that the three instructions form a single basic block, so we know they are always executed as a unit. In this case, it is sufficient to simply connect them with ''and'': Y1 = X * X and Y2 = Y1 + X and return = Y2. Note that thanks to SSA, we do not need to pay attention to control flow here.
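To make the translation concrete, here is a small sketch in plain C (the names calculate_ssa, x1, y1 and y2 are ours): the SSA form written out as straight-line code, where each assignment reads directly as one of the equations above.

```c
/* The original function from the text. */
int calculate(int x)
{
  int y = x * x;
  y = y + x;
  return y;
}

/* The same function in SSA style: every variable is assigned exactly
   once, so each assignment corresponds to one conjunct of the formula. */
int calculate_ssa(int x1)
{
  int y1 = x1 * x1; /* Y1 = X * X */
  int y2 = y1 + x1; /* Y2 = Y1 + X */
  return y2;        /* return = Y2 */
}
```

Because the three assignments form a single basic block, the conjunction of the three comments is exactly the formula given above.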

One example of non-trivial control flow is the if statement. Consider the following example:

int max(int a, int b)
{
int result;
if (a < b)
result = b;
else
result = a;
return result;
}

Bringing this into SSA form, we have the following program (we write Phi for Φ):

int max(int a, int b)
{
int result;
if (a < b)
result.1 = b;
else
result.2 = a;
return Phi(result.1,result.2);
}

We again introduce bit vectors A (for parameter a), B (for parameter b), R1 (for result.1), R2 (for result.2) and return (for the return value). The interesting question in this case is how we can handle the Φ node: so far, it is a ''magic'' operator that selects the correct value.

As a first step, we modify the SSA form slightly by introducing an additional propositional variable C that tracks which branch of the if was taken. We call this variable the code guard variable, or guard for short. Additionally, we add C to the Φ node as a new first parameter, describing which input to use as a result. The corresponding program looks something like this:

int max(int a, int b)
{
int result;
bit C; /* Track which branch was taken */
C = a < b;
/* if (C) - not needed anymore thanks to SSA */
result.1 = b;
/* else */
result.2 = a;
return Phi(C,result.1,result.2);
}

For the encoding of the program, we introduce the implication junctor, written ⇒, where ''A ⇒ B'' is equivalent to ''(not A) or B''. It can be understood as ''if A holds, B must hold as well''.
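As a quick sanity check, the junctor can be written out directly with C's Boolean operators (the helper implies is ours):

```c
/* Truth-table definition of implication: A => B is (not A) or B. */
int implies(int a, int b)
{
  return !a || b;
}
```

The only row of the truth table where the implication is false is a true premise with a false conclusion.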

With these ingredients, we can encode the program. First of all, we translate the basic statements of the program:

  • C = a<b maps to C = A<B, for an appropriate formula A<B.
  • result.1 = b becomes R1 = B, and result.2 = a becomes R2 = A.

Finally, the Φ node is again resolved using the ⇒ junctor: we can encode the return statement as (C ⇒ return = R1) and ((not C) ⇒ return = R2).

At this point, it remains to tie the statements together; we find that we can again simply connect them with ''and''. We get:

C = A&lt;B and R1 = B and R2 = A and (C ⇒ return = R1) and ((not C) ⇒ return = R2).
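The encoding can again be mirrored in plain C as a sketch (max_ssa and the lower-case variable names are ours): both branch results are computed unconditionally, and the guard selects between them, playing the role of the Φ node.

```c
/* The original function from the text. */
int max_orig(int a, int b)
{
  int result;
  if (a < b)
    result = b;
  else
    result = a;
  return result;
}

/* Straight-line version: compute both branch results unconditionally,
   then let the guard select one, as the Phi node does. */
int max_ssa(int a, int b)
{
  int c = a < b;      /* guard: which branch was taken */
  int r1 = b;         /* result.1 */
  int r2 = a;         /* result.2 */
  return c ? r1 : r2; /* Phi(C, result.1, result.2) */
}
```

The ternary operator here is the executable analogue of the pair of implications in the formula.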

So far, we have only discussed how to encode the behavior of programs as propositional formulas. To actually reason about programs, we also need a way to describe the property we want to prove. To do this, we introduce a primitive ASSERT. Let e be some expression; then ASSERT(e) is supposed to do nothing if e evaluates to true, and to abort the program if e evaluates to false.

For instance, we can check whether max(a,b) <= a always holds by modifying the max function into

int max(int a, int b)
{
int result;
if (a < b)
result = b;
else
result = a;
ASSERT(result <= a);
return result;
}

The corresponding SSA would be

int max(int a, int b)
{
int result;
bit C; /* Track which branch was taken */
C = a < b;
result.1 = b;
result.2 = a;
ASSERT(Phi(C,result.1,result.2) <= a);
return Phi(C,result.1,result.2);
}

We translate ASSERT(Phi(C,result.1,result.2) <= a) into (C ⇒ R1 <= A) and ((not C) ⇒ R2 <= A).

The resulting formula would be C = A&lt;B and R1 = B and R2 = A and (C ⇒ R1 <= A) and ((not C) ⇒ R2 <= A) and (C ⇒ return = R1) and ((not C) ⇒ return = R2).

We can extend this approach quite straightforwardly to other constructs, but one obvious problem remains: we have not described how to handle loops. This turns out to be a substantial issue for theoretical reasons: programs without loops (and without recursive functions) always run for a bounded number of execution steps, so we can easily write down formulas that, essentially, track their entire execution history. But programs with loops can take arbitrarily many steps, or even run into infinite loops, so we cannot describe their behavior using a finite propositional formula in the way we have done above.

There are multiple approaches to deal with this problem, all with different trade-offs. CBMC chooses bounded model checking as the underlying approach. The idea is that we only consider program executions that are, in some measure, ''shorter'' than a given bound. This bound then implies an upper bound on the size of the required formulas, which makes the problem tractable.

To make this concrete, let's go back to the factorial example. We consider the function in the following form (in CPROVER, we actually use a goto-based IR, but the while-based version is a bit simpler to understand):

unsigned long factorial(unsigned n) {
unsigned long fac = 1;
unsigned int i = 1;
while (i <= n) {
fac *= i;
i = i+1;
}
return fac;
}

We set the following ''length bound'': The loops in the program are allowed to be executed at most three times; we will ignore all other executions. With this in mind, we observe that the following two classes of programs behave exactly the same:

/* Execute this loop at most n+1 times */
while (e) {
body;
}

and

if (e) {
body;
/* Execute this loop at most n times */
while (e) {
body;
}
}

This transformation is known as loop unrolling or unwinding: we can always replace a loop by an if that checks whether the loop should execute at all, followed by a copy of the loop body and the loop statement itself.

So, to reason about the factorial function, we unroll the loop three times and then replace the loop with ASSERT(!(condition)). We get:

unsigned long factorial(unsigned n) {
unsigned long fac = 1;
unsigned int i = 1;
if (i <= n) {
fac *= i;
i = i+1;
if (i <= n) {
fac *= i;
i = i+1;
if (i <= n) {
fac *= i;
i = i+1;
ASSERT(!(i <= n));
}
}
}
return fac;
}

Since this program is now in a simpler form without loops, we can use the techniques we learned above to transform it into a propositional formula. First, we transform into SSA (with code guard variables already included):

unsigned long factorial(unsigned n) {
unsigned long fac.1 = 1;
unsigned int i.1 = 1;
bit C1 = i.1 <= n;
/* if (C1) { */
fac.2 = fac.1 * i.1;
i.2 = i.1+1;
bit C2 = i.2 <= n;
/* if (C2) { */
fac.3 = fac.2 * i.2;
i.3 = i.2+1;
bit C3 = i.3 <= n;
/* if (C3) { */
fac.4 = fac.3 * i.3;
i.4 = i.3+1;
ASSERT(!(i.4 <= n));
/* }}} */
return Phi(C1, Phi(C2, Phi(C3, fac.4, fac.3), fac.2), fac.1);
}

Note that we may be missing possible executions of the program due to this translation; we come back to this point later.

The corresponding propositional formula can then be written as follows (check that this is equivalent to the formula you would get by following the translation procedure described above):

fac.1 = 1 and i.1 = 1 and C1 = i.1 <= n and fac.2 = fac.1 * i.1 and i.2 = i.1 + 1 and C2 = i.2 <= n and fac.3 = fac.2 * i.2 and i.3 = i.2 + 1 and C3 = i.3 <= n and fac.4 = fac.3 * i.3 and i.4 = i.3 + 1 and not (i.4 <= n) and ((C1 and C2 and C3) ⇒ result = fac.4) and ((C1 and C2 and not C3) ⇒ result = fac.3) and ((C1 and not C2) ⇒ result = fac.2) and ((not C1) ⇒ result = fac.1)

In the following, we reference this formula as FA(n, result).

At this point, we know how to encode programs as propositional formulas. Our goal was to reason about programs, and in particular, to check whether a certain property holds. Suppose, for example, that we want to check if there is a way that the factorial function returns 6. One way to do this is to look at the following propositional formula: FA(n, result) and result = 6. If this formula has a model (i.e., if we can find a satisfying assignment to all variables, and in particular, to n), we can extract the required value for the parameter n from that model. As we have discussed above, this can be done using a SAT solver: If you run, say, MiniSAT on this formula, you will get a model that translates to n=3.
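The following sketch is a toy stand-in for that query (all names are ours, and a real SAT solver works nothing like this internally): it unwinds the factorial loop three times, then enumerates candidate values of n, returning the first ''model'' found, or -1 for ''unsatisfiable'' within the search range.

```c
/* Bounded factorial: the loop unwound three times, as in the text. */
unsigned long factorial_unwound3(unsigned n)
{
  unsigned long fac = 1;
  unsigned i = 1;
  if (i <= n) { fac *= i; i = i + 1;
    if (i <= n) { fac *= i; i = i + 1;
      if (i <= n) { fac *= i; i = i + 1; }
    }
  }
  return fac;
}

/* Toy "solver" for the query "FA(n, result) and result = target":
   enumerate candidate values of n and return the first model found,
   or -1 for "unsatisfiable" within the search range. */
int find_n_for_result(unsigned long target)
{
  for (unsigned n = 0; n < 16; n++)
    if (factorial_unwound3(n) == target)
      return (int)n;
  return -1;
}
```

Note that find_n_for_result(120) yields -1 here, which mirrors the bound-related incompleteness discussed next.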

Be aware that this method has very clear limits: We know that the factorial of 5 is 120, but with the formula above, evaluating ''FA(n, result) and result=120'' would yield ''unsatisfiable''! This is because we limited the number of loop iterations, and to reach 120, we have to execute the loop more than three times. In particular, a ''VERIFICATION SUCCESSFUL'' message, as output by CBMC, must always be interpreted keeping the bounds that were used in mind.

In the case that we found a model, we can get even more information: we can reconstruct the program execution that would lead to the requested result. This is achieved by tracing the SSA, using the guard variables to decide which branches of if statements to take. In this way, we can simulate the behavior of the program. Writing down the steps of this simulation yields a trace, which describes what the corresponding program execution would look like.

Where to go from here

The above section gives only a superficial overview on how SAT solving and bounded model checking work. Inside the CPROVER framework, we use a significantly more advanced engine, with numerous optimizations to the basic algorithms presented above. One feature that stands out is that we do not reduce everything to propositional logic, but instead use a more powerful logic, namely quantifier-free first-order logic. The main difference is that instead of propositional variables, we allow expressions that return Boolean values, such as comparisons between numbers or string matching expressions. This gives us a richer logic to express properties. Of course, a simple SAT solver cannot deal with such formulas, which is why we go to SMT solvers instead - these solvers can deal with specific classes of first-order formulas (like the ones we produce). One well-known SMT solver is Z3.

Static analysis

While BMC analyzes the program by transforming everything to logic formulas and, essentially, running the program on sets of concrete states, another approach to learn about a program is based on the idea of interpreting an abstract version of the program. This is known as abstract interpretation. Abstract interpretation is one of the main methods in the area of static analysis.

Abstract Interpretation

The key idea is that instead of looking at concrete program states (e.g., ''variable x contains value 1''), we look at some sufficiently-precise abstraction (e.g., ''variable x is odd'', or ''variable x is positive''), and perform interpretation of the program using such abstract values. Coming back to our running example, we wish to prove that the factorial function never returns 0.

An abstract interpretation is made up from four ingredients:

  1. An abstract domain, which represents the analysis results.
  2. A family of transformers, which describe how basic programming language constructs modify the state.
  3. A map that takes a pair of a program location (e.g., a position in the program code) and a variable name and yields a value in the abstract domain.
  4. An algorithm to compute a ''fixed point'', computing a map as described in the previous step that describes the behavior of the program.

The first ingredient we need for abstract interpretation is the abstract domain. The domain allows us to express what we know about a given variable or value at a given program location; in our example, whether it is zero or not. We use the abstract domain as follows: for each program point, we have a map from visible variables to elements of the abstract domain, describing what we know about the values of the variables at that point.

For instance, consider the factorial example again. After running the first basic block, we know that fac and i both contain 1, so we have a map that associates both fac and i to "not 0".

An abstract domain is a set $D$ (or, if you prefer, a data type) with the following properties:

  • There is a function merge that takes two elements of $D$ and returns an element of $D$. This function is:
    • associative - (merge(x, merge(y,z)) = merge(merge(x,y), z)),
    • commutative - (merge(x,y) = merge(y,x)) and
    • idempotent (merge(x,x) = x).
  • There is an element bottom of $D$ such that merge(x, bottom) = x.

Algebraically speaking, $D$ needs to be a semi-lattice.

For our example, we use the following domain:

  • D contains the elements "bottom" (nothing is known), "equals 0", "not 0" and "could be 0".
  • merge is defined as follows:
    • merge(bottom, x) = x
    • merge("could be 0", x) = "could be 0"
    • merge(x,x) = x
    • merge("equals 0", "not 0") = "could be 0"
  • bottom is bottom, obviously.

It is easy but tedious to check that all conditions hold.

The second ingredient we need are the abstract state transformers. An abstract state transformer describes how a specific expression or statement processes abstract values. For the example, we need to define abstract state transformers for multiplication and addition.

Let us start with multiplication, so let us look at the expression x*y. We know that if x or y are 0, x*y is zero. Thus, if x or y are "equals 0", the result of x*y is also "equals 0". If x or y are "could be 0" (but neither is "equals 0"), we simply don't know what the result is - it could be zero or not. Thus, in this case, the result is "could be 0".

What if x and y are both "not 0"? In a mathematically ideal world, we would have x*y be non-zero as well. But in a C program, multiplication could overflow, yielding 0. So, to be correct, we have to yield "could be 0" in this case.

Finally, when x is bottom, we can just return whatever value we had assigned to y, and vice versa.

For addition, the situation looks like this: consider x+y. If neither x nor y is "not 0", but at least one is "could be 0", the result is "could be 0". If both are "equals 0", the result is "equals 0". What if both are "not 0"? It seems that, since the variables are unsigned and not zero, it should be "not 0". Sadly, overflow strikes again, and we have to make do with "could be 0". The bottom cases can be handled just like for multiplication.
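The domain and the two transformers described above can be written down directly in C; the following is a sketch with enum and function names of our choosing. It also makes explicit one case the text leaves implicit: adding a value known to be 0 simply preserves the other operand.

```c
/* The example abstract domain: what do we know about a value? */
typedef enum { BOTTOM, EQUALS_0, NOT_0, COULD_BE_0 } absval_t;

/* merge as defined in the text: associative, commutative, idempotent,
   with bottom as the neutral element. */
absval_t merge(absval_t x, absval_t y)
{
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == y) return x;
  return COULD_BE_0; /* e.g. merge("equals 0", "not 0") */
}

/* Abstract multiplication: 0 * y = 0, but "not 0" * "not 0"
   may overflow to 0, so it yields "could be 0". */
absval_t abs_mul(absval_t x, absval_t y)
{
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == EQUALS_0 || y == EQUALS_0) return EQUALS_0;
  return COULD_BE_0;
}

/* Abstract addition: adding 0 preserves the other value; any other
   combination may overflow to 0. */
absval_t abs_add(absval_t x, absval_t y)
{
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == EQUALS_0) return y; /* 0 + y = y */
  if (y == EQUALS_0) return x;
  return COULD_BE_0;
}
```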

The way of defining the transformation functions above showcases another important property: they must reflect the actual behavior of the underlying program instructions. There is a formal description of this property, using Galois connections; for the details, it is best to look at the literature.

The third ingredient is straightforward: we use a simple map from program locations and variable names to values in the abstract domain. In more complex analyses, more involved forms of maps may be used (e.g., to handle arbitrary procedure calls, or to account for the heap).

At this point, we have almost all the ingredients we need to set up an abstract interpretation. To actually analyze a function, we take its CFG and perform a fixpoint algorithm.

Concretely, let us consider the CFG for factorial again. This time, we have named the basic blocks, and simplified the variable names.

Control flow graph for factorial - named basic blocks

We provide an initial variable map: n is "could be 0" before BB1. As a first step, we analyze BB1 and find that the final variable map should be:

  • n "could be 0".
  • fac "not 0" (it is, in fact, 1, but our domain does not allow us to express this).
  • i "not 0".

Let us call this state N.

At this point, we look at all the outgoing edges of BB1 - we wish to propagate our new results to all blocks that follow BB1. There is only one such block, BB2. We analyze BB2 and find that it doesn't change any variables. Furthermore, the result of i <= n does not allow us to infer anything about the values in the variable map, so we get the same variable map at the end of BB2 as before.

Again, we look at the outgoing edges. There are two successor blocks, BB3 and BB4. The information in our variable map does not allow us to rule out either of the branches, so we need to propagate to both blocks. We start with BB3 and remember that we need to visit BB4.

At this point, we know that n "could be 0", while fac and i are "not 0". Applying the abstract transformers, we learn that afterwards, both fac and i "could be 0" (fac ends up as "could be 0" since both fac and i were "not 0" initially, i ends up as "could be 0" because both i and 1 are "not 0"). So, the final variable map at the end of BB3 is

  • n, fac and i "could be 0". Let us call this state S.

At this point, we propagate again, this time to BB2. But wait, we have propagated to BB2 before! This is handled as follows: we first calculate the result of running BB2 from S; this yields S again. Now we know that at the end of BB2, we can be in either state S or N. To get a single state out of these two, we merge: for each variable, merge the mapping of that variable in S and N. We get:

  • n maps to merge("could be 0", "could be 0") = "could be 0"
  • fac maps to merge("could be 0", "not 0") = "could be 0"
  • i maps to merge("could be 0", "not 0") = "could be 0"

In other words, we arrive at S again.

At this point, we propagate to BB3 and BB4 again. Running BB3, we again end up with state S at the end of BB3, so things don't change. We detect this situation and figure out that we do not need to propagate from BB3 to BB2 - this would not change anything! Thus, we can now propagate to BB4. The state at the end of BB4 is also S.

Now, since we know the variable map at the end of BB4, we can look up the properties of the return value: in S, fac maps to "could be 0", so we could not prove that the function never returns 0. In fact, this is correct: calculating factorial(200) will yield 0, since the 64-bit integer fac overflows.

Nevertheless, let us consider what would happen in a mathematically ideal world (e.g., if we used big integers). In that case, we would get "not 0" * "not 0" = "not 0", "not 0" + x = "not 0" and x + "not 0" = "not 0". Running the abstract interpretation with these semantics, we find that if we start BB3 with variable map N, we get variable map N at the end as well, so we end up with variable map N at the end of BB4 - but this means that fac maps to "not 0"!

The algorithm sketched above is called a fixpoint algorithm because we keep propagating until we find that applying the transformers does not yield any new results (which, in a mathematically precise way, can be shown to be equivalent to calculating, for a specific function f, an x such that f(x) = x).
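The propagate-and-merge procedure can be sketched as a small fixpoint loop in C (all names are ours; the domain is the one defined in the text, and only the loop-head state for fac and i is modelled, since n never changes):

```c
typedef enum { BOTTOM, EQUALS_0, NOT_0, COULD_BE_0 } absval_t;

static absval_t merge(absval_t x, absval_t y)
{
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  return x == y ? x : COULD_BE_0;
}

/* Transformers from the text; the bottom cases are omitted, as they
   never arise in this run. */
static absval_t abs_mul(absval_t x, absval_t y)
{
  if (x == EQUALS_0 || y == EQUALS_0) return EQUALS_0;
  return COULD_BE_0; /* possible overflow */
}

static absval_t abs_add(absval_t x, absval_t y)
{
  if (x == EQUALS_0) return y;
  if (y == EQUALS_0) return x;
  return COULD_BE_0; /* possible overflow */
}

/* Fixpoint for the factorial loop: start from state N at the loop head
   (fac and i "not 0"), run the abstract loop body, merge the result
   back along the back edge, and stop once nothing changes.
   Returns the final abstract value for fac. */
absval_t factorial_loop_fixpoint(void)
{
  absval_t fac = NOT_0, i = NOT_0;
  for (;;) {
    absval_t fac_out = abs_mul(fac, i);     /* fac *= i */
    absval_t i_out = abs_add(i, NOT_0);     /* i = i + 1; 1 is "not 0" */
    absval_t fac_new = merge(fac, fac_out); /* merge along the back edge */
    absval_t i_new = merge(i, i_out);
    if (fac_new == fac && i_new == i)
      return fac; /* fixpoint reached */
    fac = fac_new;
    i = i_new;
  }
}
```

Running this terminates after two iterations with fac mapped to "could be 0", matching the analysis result described above.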

This overview only describes the most basic way of performing abstract interpretation. For one, it only works on the procedure level (we say it is an intra-procedural analysis); there are various ways of extending it to work across function boundaries, yielding inter-procedural analysis. Additionally, there are situations where it can be helpful to make a variable map more abstract (widening) or to use information gained from various sources to make it more precise (narrowing); these advanced topics can be found in the literature as well.

Glossary

Instrument

To instrument a piece of code means to modify it by (usually) inserting new fragments of code that, when executed, tell us something useful about the code that has been instrumented.

For instance, imagine you are given the following function:

int aha (int a)
{
if (a > 10)
return 1;
else
return -1;
}

and you want to design an analysis that figures out which lines of code will be covered when aha is executed with the input 5.

We can instrument the code so that the function will look like:

int aha (int a)
{
__runtime_line_seen(0);
if (a > 10) {
__runtime_line_seen(1);
return 1;
} else {
__runtime_line_seen(2);
return -1;
}
__runtime_line_seen(3);
}

All we have to do now is to implement the function void __runtime_line_seen(int line) so that when executed it logs somewhere what line is being visited. Finally we execute the instrumented version of aha and collect the desired information from the log.
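A minimal, purely hypothetical implementation of the logging hook is enough to run the instrumented aha and read off the coverage. (Identifiers starting with a double underscore are reserved in real C; we keep the name only to match the text.)

```c
/* Coverage log: seen[k] is set once line marker k has been executed. */
int seen[4];

void __runtime_line_seen(int line)
{
  seen[line] = 1;
}

/* The instrumented function from the text. */
int aha(int a)
{
  __runtime_line_seen(0);
  if (a > 10) {
    __runtime_line_seen(1);
    return 1;
  } else {
    __runtime_line_seen(2);
    return -1;
  }
  __runtime_line_seen(3); /* unreachable, as in the text */
}
```

Calling aha(5) takes the else branch, so afterwards only markers 0 and 2 are set in the log.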

More generally speaking, and especially within the CPROVER code base, instrumenting the code often refers to modifying its behavior in a manner that makes it easier for a given analysis to do its job, regardless of whether that is achieved by executing the instrumented code or by just analyzing it in some other way.

Flattening and Lowering

As we have seen above, we often operate on many different representations of programs, such as ASTs, control flow graphs, SSA programs, logical formulas in BMC and so on. Each of these forms is good for certain kinds of analyses, transformations or optimizations.

One important kind of step in dealing with program representations is going from one representation to the other. Often, such steps are going from a more ''high-level'' representation (closer to the source code) to a more ''low-level'' representation. Such transformation steps are known as flattening or lowering steps, and tend to be more-or-less irreversible.

An example of a lowering step is the transformation from ASTs to the GOTO IR, given above.

Verification Condition

In the CPROVER framework, the term verification condition is used in a somewhat non-standard way. Let a program and a set of assertions be given. We transform the program into an (acyclic) SSA (i.e., an SSA with all loops unrolled a finite number of times) and turn it into a logical formula, as described above. Note that in this case, the formula will also contain information about what the program does after the assertion is reached: this part of the formula is, in fact, irrelevant for deciding whether the program can satisfy the assertion or not. The verification condition is the part of the formula that only covers the program execution until the line that checks the assertion has been executed, with everything that comes after it removed.

CBMC Architecture

Author
Martin Brain, Peter Schrammel

This section provides a graphical overview of how CBMC fits together. CBMC takes C code or a goto-binary as input and tries to emit traces of executions that lead to crashes or undefined behaviour. The diagram below shows the intermediate steps in this process.

The CPROVER User Manual describes CBMC from a user perspective. Each node in the diagram above links to the appropriate class or module documentation, describing that particular stage in the CBMC pipeline. CPROVER is structured in a similar fashion to a compiler. It has language-specific front-ends which perform limited syntactic analysis and then convert to an intermediate format. The intermediate format can be output to files (this is what goto-cc does) and is (informally) referred to as “goto binaries” or “goto programs”. The back-ends are tools that process this format, either directly from the front-end or from its saved output. These include a wide range of analysis and transformation tools (see Other Tools).

Concepts

Central data structures

For an explanation of the data structures involved in the modeling of a GOTO program (the GOTO Intermediate Representation used by CBMC and assorted tools) please see Central Data Structures .

{C, java bytecode} → Parse tree → Symbol table → GOTO programs → GOTO program transformations → BMC → counterexample (goto_tracet) → printing

To be documented.

Instrumentation: goto functions → goto functions

For an overview of the transformations applied to goto programs after the generation of the initial goto program and before BMC, see Goto Program Transformations .

To be documented.

Goto functions → BMC → Counterexample (trace)

For an explanation of part of how the BMC (Symex) process works, please refer to Symex and GOTO program instructions

Trace → interpreter → memory map

To be documented.

Goto functions → abstract interpretation

To be documented.

Executables (flow of transformations):

goto-cc

To be documented.

goto-instrument

To be documented.

cbmc

To be documented.

goto-analyzer

To be documented.

Central Data Structures


The following is some light technical documentation of the major data structures used in the input transformation to Intermediate Representation (IR) inside CBMC and the assorted CProver tools.

GOTO models

The goto_modelt class is the top-level data structure that CBMC (and the other tools) use for holding the GOTO intermediate representation. The following diagram is a simplified view of how the data structures relate to each other -

erDiagram
goto_modelt {
symbol_tablet symbol_table
goto_functionst goto_functions
}
goto_modelt ||--|| symbol_tablet : ""
goto_modelt ||--|| goto_functionst : ""
symbol_tablet {
symbol_map symbols
}
symbol_tablet ||--o{ symbolt : ""
symbolt {
string name
}
goto_functionst {
function_map functions
}
goto_functionst ||--o{ goto_functiont : ""
goto_functiont {
goto_programt body
ordered_collection_of parameter_identifiers
}
goto_functiont ||--|| goto_programt : ""
goto_programt {
ordered_collection_of instructions
}
goto_programt ||--o{ instructiont : ""
instructiont {
enumeration instruction_type
goto_program_codet code
boolean_expression guard
source_locationt source_location
}
instructiont ||--|| goto_program_instruction_typet : ""
instructiont ||--o| goto_program_codet : ""
instructiont ||--o| source_locationt : ""
goto_program_codet ||--o| source_locationt : ""

A goto_modelt is effectively a pair, consisting of:

  • A collection of GOTO functions.
  • A symbol table containing symbol definitions which may be referred to by symbol expressions. Symbol expressions are found in the goto functions and the (sub-)expressions for the definitions of symbols.

In pseudocode, the type looks like this:

type goto_modelt {
type goto_functionst = map<identifier, goto_functiont>
type symbol_tablet = map<identifier, symbolt>
}

There is an abstract interface for goto models provided by the abstract_goto_modelt class. This is defined and documented in the file src/goto-programs/abstract_goto_model.h. Ideally code which takes a goto model as input should accept any implementation of the abstract interface rather than accepting the concrete goto_modelt exclusively. This helps with compatibility with jbmc which uses lazy loading. See the lazy_goto_modelt class which is defined and documented in jbmc/src/java_bytecode/lazy_goto_model.h for details of lazy loading.

For further information about symbols see the symbolt class which is defined and documented in src/util/symbol.h and the symbol_exprt (symbol expression) class which is defined and documented in src/util/std_expr.h.

goto_functiont

A goto_functiont is also defined as a pair. It's designed to represent a function at the goto level, and effectively it's the following data type (in pseudocode):

type goto_functiont {
goto_programt body
list<identifiers> parameters
}

The goto_programt denoting the body of the function will be the subject of a more elaborate explanation in the next section. The in-memory structure of a goto function allows the body to be optional, but only functions with bodies are included in the serialised goto binaries.

The parameters subcomponent is a list of identifiers that are to be looked-up in the symbol-table for their definitions.

goto_programt

A goto program is a sequence of GOTO instructions (goto_instructiont). For details of the goto_programt class itself see the documentation in goto_program.h. For further details of the life cycle of goto programs, see goto-programs.

An instruction (goto_instructiont) is a triple (an element with three subcomponents), describing a GOTO level instruction with the following 3 component subtypes, again in pseudocode:

type goto_instructiont {
instruction_type instr_type
instruction statement
boolean_expr guard
}

The pseudocode subtypes above require some more elaboration, so we will provide some extra detail to illuminate some details at the code level:

  • The instruction_type above corresponds to the goto_program_instruction_typet type listed in goto_program.h
    • Some illustrative instruction types are assign, function_call, return, etc.
  • The instruction is a statement represented by a goto_instruction_codet.
    • A goto_instruction_codet is an alias of codet (a data structure broadly representing a statement inside CBMC) that contains the actual code to be executed.
    • You can distinguish different statements by using the result of the get_statement member function of the codet class.
  • The guard is an exprt (a data structure broadly representing an expression inside CBMC) that is expected to have type bool.
    • This is optional - not every instruction is expected to have a guard associated with it.

source_locationt

Another important data structure that needs to be discussed at this point is source_locationt.

source_locationt objects are attached to various exprts (the data structure representing various expressions, usually the result of some early processing, e.g. the result of the frontend parsing a file).

This means that codets, representing GOTO instructions, also have a source_locationt attached to them, by virtue of inheriting from exprt.

For the most part, source_locationt is only used when we print various nodes (for property listing, counterexample/trace printing, etc.).

A specific source location might point to a CBMC instrumentation primitive (which might be reported as a built-in addition), or there might be no source location at all (for example, for generated harness code that has no counterpart in the user code).

irept

ireps are the underlying data structure used to implement many of the data structures in CBMC and the assorted tools. These include the exprt, typet, codet and source_locationt classes. This is a tree data structure where each node is expected to contain a string/ID and may have child nodes stored in both a sequence of child nodes and a map of strings/IDs to child nodes. This enables the singular irept data structure to model tree- and graph-shaped structures such as ASTs, CFGs, etc.

The classes extending irept define how the higher level concepts are mapped onto the underlying tree data structure. For this reason it is usually advised that the member functions of the sub-classes should be used to access the data held, rather than the member functions of the base irept class. This aids potential future restructuring and associates accesses of the data with the member functions which have the doxygen explaining what the data is.

The strings/IDs held within irept are of type irep_idt. These can be converted to std::string using the id2string function. There is a mechanism provided for casting expressions and types in src/util/expr_cast.h. In depth documentation of the irept class itself can be found in src/util/irep.h.

Code Walkthrough

Author
Cesar Rodriguez, Owen Jones

Data structures: core structures and AST

The core structures used for representing abstract syntax trees are all documented in util.

Data structures: from AST to GOTO program

See goto-programs, goto_programt and instructiont.

Front-end languages: generating codet from multiple languages

language_filest, languaget classes:

See langapi.

C

See ansi-c.

C++

See cpp.

Java bytecode

See java_bytecode.

Bmct class

equation

See Overview.

Symbolic executors

Symbolic execution

See Overview.

Solvers infrastructure

See Overview.

Static analysis APIs

See analyses and pointer-analysis.

Compilation and Development

Author
Martin Brain, Peter Schrammel, Owen Jones

Compilation

The CBMC source code is available on its GitHub page.

Makefiles

Instructions for compiling CBMC using makefiles are available in COMPILING.md in the root of the CBMC repository. They cover Linux, Solaris 11, FreeBSD 11, MacOS X and Windows.

CMake files

There is also support for compiling using CMake. Instructions are available in COMPILING.md in the root of the CBMC repository.

Personal configuration

config.inc

Two files, config.inc and common, are included in every makefile. config.inc contains configuration options relating to compilation so that they can be conveniently edited in one place. common contains commands that are needed in every makefile but which the developer is not expected to edit. (There is a second config.inc in the jbmc folder, which is included in every makefile there.)

Note, these files are not involved in the CMake build.

Macro DEBUG

If the macro DEBUG is defined during compilation of CBMC (for example by using a compiler flag) then extra debug code will be included. This includes print statements and code checking that data structures are as expected.

Running tests

Regression tests

The regression tests are contained in regression/ and jbmc/regression/. Inside these folders there is a directory for each of the modules. Each of these contains multiple test directories, with names describing what they test. When there are multiple tests in a test directory then they should all test very similar aspects of the program's behaviour. Each test directory contains input files and one or more test description files, which have the ending .desc. The test description files specify what command to run, what output is expected and so on. The test framework is a Perl script, test.pl, located in regression/test.pl.

The --help option to test.pl outlines the format of the test description files. Most importantly, the first word in a test description file is its level, which is one of: CORE (should be run in CI, should succeed), THOROUGH (takes too long to be run in CI, should succeed), FUTURE (will succeed when a planned feature is added) or KNOWNBUG (will succeed when a bug is fixed).
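
For concreteness, a minimal test description file might look like the following (an illustrative sketch; test.pl --help documents the exact format). The first line is the level, the second the input file, the third the command line options; the lines after that are regular expressions which must match the tool's output, and patterns after the -- separator must not match:

```
CORE
main.c
--pointer-check
^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
```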

Test descriptions may also include a number of tags. test.pl -I<tag> will only run tests with a particular <tag>, and test.pl -X<tag> will run all tests except for those with a particular <tag>. See regression/README.md for the current set of tags and their intended use.

Running regression tests with make

If you have compiled using make then you can run the regression tests using make test. Run it from regression/ to run all the regression tests, or any of its subfolders to just run the tests for that module. The number of tests that are run in parallel can be controlled through the environment variable TESTPL_JOBS.

If you have not compiled using make then this won't work, because the makefile is expecting to find binaries like cbmc and jbmc in the source folders.

Running regression tests with CTest

If you have compiled using CMake then you can run the regression tests using CTest. (Note: this will also run the unit tests.)

Here are two example commands, to be run from the build/ directory:

ctest -V -L CORE -R cpp
ctest -V -L CORE -R cpp -E cbmc-cpp

-V makes it print out more useful output. -L CORE makes it only run tests that have been tagged CORE. -R regular_expression limits which tests are run to those which match the given regular expression, and -E regular_expression excludes tests which match the given regular expression. So the first command will run all the CORE tests in regression/cbmc/cpp and regression/cbmc/cbmc-cpp, and the second will run all the CORE tests in regression/cbmc/cpp only. Another useful option is -N, which makes CTest list which tests it would run without actually running them.

Running individual regression tests directly with test.pl

It can be useful to run a single test folder in isolation. This can be done by running regression/test.pl directly. The way that test.pl is run varies between the different modules, and can be ascertained by looking at the test target in the makefile. The simple case is when there isn't a file called chain.sh. Then you can directly run test.pl on a single test folder with the following command from the module directory (note that it is recommended to use absolute paths as it avoids many issues, e.g. the path to the binary should be relative to <test-folder>):

<absolute-path-to-test.pl> -p -c <absolute-path-to-binary> <test-folder>

-p makes it print a log of failed tests and -c tells it where to find the binary to run, e.g. cbmc, jbmc or goto-analyzer. If <test-folder> is not provided then all test directories are run. The --help option lists all command line options, including -j for running multiple tests in parallel and -C, -T, -F and -K for controlling whether CORE, THOROUGH, FUTURE or KNOWNBUG tests are run.

When there is a file called chain.sh then the command should look like

<absolute-path-to-test.pl> -p -c '<absolute-path-to-chain-sh> <arg-1> ... <arg-n>' <test-folder>

Note that the test binary and its initial command line arguments are a single value for the -c option, so they must be wrapped in quotes if they contain any unescaped spaces. What to put for the arguments <arg-1> to <arg-n> varies from module to module. To find out, look in chain.sh and see what arguments it expects. You can also look in the Makefile and see how it calls chain.sh in the test target.

Unit tests

The unit tests are contained in the unit/ folder. They are written using the Catch unit test framework.

If you have compiled with make, you can run the unit tests for CBMC directly by going to unit/, running make to compile the unit tests and then make test to run them. You can run the unit tests for JBMC directly by going to jbmc/unit/ and running the same commands.

If you have compiled with CMake, you can run the unit tests for CBMC directly by going to unit/ and running

../<build-folder>/bin/unit

and you can run the unit tests for JBMC directly by going to jbmc/unit/ and running

../../<build-folder>/bin/java-unit

If you have compiled with CMake you can also run the unit tests through CTest, with the names unit and java-unit. So, from the build/ directory, run

ctest -V -L CORE -R ^unit
ctest -V -L CORE -R java-unit

to run the CBMC unit tests and the JBMC unit tests respectively. (The ^ is needed to make it a regular expression that matches unit but not java-unit.)

Note that some tests which are expected to fail are also run - see the summary at the end of the run to see how many tests passed, how many failed as expected, and how many failed unexpectedly.

For more information on the structure of unit/ and how to tag tests, see the section on unit tests in CODING_STANDARD.md in the root of the CBMC repository.

Test coverage

On Unix-style systems you can automatically generate a code coverage report. To obtain an HTML report for the regression and unit tests, first build the dedicated coverage configuration using CMake (setting enable_coverage and building the coverage target):

cmake -S . -Bcov-build -Denable_coverage=1 -Dparallel_tests=2
make -C cov-build coverage

This configures a build environment in the cov-build/ folder with coverage recording enabled at runtime. The actual build (using make in the above case) will run the test suite, running as many tests concurrently as specified by parallel_tests (in the above case: 2). The HTML report is generated using lcov and stored in cov-build/html/.

Using a different SAT solver

By default, CBMC will assume MiniSat 2 as the SAT back-end. Several other solvers are supported (see also the section on config.inc above). As a more general option, which is not limited to a single SAT solver, you may use the IPASIR interface. For example, to use the SAT solver RISS, proceed as follows:

1) Build RISS (in particular its IPASIR configuration):

git clone https://github.com/conp-solutions/riss riss.git
cd riss.git
mkdir build
cd build
cmake ..
make riss-coprocessor-lib-static
cd ../..

2) Build CBMC while enabling the IPASIR back-end:

make -C src IPASIR=../../riss.git/riss LIBS="../../riss.git/build/lib/libriss-coprocessor.a -lpthread"

3) Run CBMC - note that RISS is highly configurable via the RISSCONFIG environment variable: export RISSCONFIG=VERBOSE:BMC1 ... run CBMC ...

Documentation

Apart from the (user-orientated) CBMC user manual and this document, most of the rest of the documentation is inline in the code as doxygen and some comments. A man page for CBMC, goto-cc and goto-instrument is contained in the doc/ directory and gives some options for these tools. All of these could be improved and patches are very welcome. In some cases the algorithms used are described in the relevant papers.

The doxygen documentation can be accessed online. To build it locally, run scripts/run_doxygen.sh. HTML output will be created in doc/html/. The index page is doc/html/index.html. This script will filter out expected warning messages from doxygen, so that new problems are more obvious. It is important to use the correct version of doxygen, as specified in run_doxygen.sh, so that there are no unexpected changes to the list of expected warnings. In the event that any change fixes an old warning, then the corresponding line(s) should be deleted from scripts/expected_doxygen_warnings.txt. We want to avoid adding any more warnings to this list of expected warnings, but that can be done to work around limitations in Doxygen (where the code and documentation are completely correct).

Formatting

The CODING_STANDARD.md file in the root of the CBMC repository contains guidance on how to write code for the CBMC repository. This includes which language features can be used and formatting rules.

C++ code can be automatically reformatted in the correct way by running clang-format. There are more details in CODING_STANDARD.md.

Linting

There is also a linting script, scripts/cpplint.py. There is a wrapper script to run cpplint.py only on lines that differ from another branch, e.g. to run it on lines that have been changed from develop:

scripts/run_lint.sh develop

There are also instructions for adding this as a git pre-commit hook in CODING_STANDARD.md.

Time profiling

To do time profiling with a tool like gprof, the flags -g (build with debug symbols) and -pg (enable profiling information) must be used when compiling, and -pg must be used when linking. If you are building with cmake you can just add "-Denable_profiling=1" to your cmake invocation, and reload cmake before building your desired binary. Note that these flags require everything to be rebuilt, so it will take a long time even if you are using ccache.

Run your binary as normal. A file called gmon.out will be created in the working directory of the binary at the end of execution. In most instances this will be the same as the working directory at the beginning of execution. It is also possible to choose the output location by setting the environment variable GMON_OUT_PREFIX - the output file location is then whatever you set it to with the process id appended to the end.

Make sure gprof is installed by running gprof -v. If it is not installed then run sudo apt install binutils.

Run gprof <path-to-binary> <path-to-gmon.out> and redirect the output to a file. This will take a while to run - e.g. 12 minutes for test-gen run on a trivial function.

The output file will now be a large text file. There are two sections: the "flat profile", which ignores context, and just tells you how much time was spent in each function; and the "call graph", which includes context, and tells you how much time was spent within each call stack. For more information see online tutorials, like https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter/gprof_5.html

Folder Walkthrough

Author
Martin Brain, Peter Schrammel

src/

The source code is divided into a number of sub-directories, each containing the code for a different part of the system.

In the top level of src there are only a few files:

  • config.inc: The user-editable configuration parameters for the build process. The main use of this file is setting the paths for the various external SAT solvers that are used. As such, anyone building from source will likely need to edit this.
  • Makefile: The main Makefile for the system. Parallel builds are supported and encouraged; please don't break them!
  • common: System specific magic required to get the system to build. This should only need to be edited if porting CBMC to a new platform / build environment.
  • doxygen.cfg: The config file for doxygen

doc/

Contains the CBMC man page. Doxygen HTML pages are generated into the doc/html directory when running doxygen from src.

regression/

The regression/ directory contains the regression test suites. See Compilation and Development for information on how to run and develop regression tests.

unit/

The unit/ directory contains the unit test suites. See Compilation and Development for information on how to run and develop unit tests.

Directory dependencies

This diagram shows intended directory dependencies. Arrows should be read transitively - dependencies of dependencies are often used directly.

There are module_dependencies.txt files in each directory, which are checked by the linter. Where directories in module_dependencies.txt are marked with comments such as 'dubious' or 'should go away', these dependencies have generally not been included in the diagram.

Goto Program Transformations

Core Transformation Passes

This section lists the transformation passes that must have been applied for the goto model to be in symex ready goto format.

Note that the passes are listed below in the order they are currently called in CBMC. While not all dependencies on the ordering are fully documented, the following order should be used.

Removal/Lowering of Assembly

This pass goes through the goto model and removes or lowers instances of assembly instructions. Assembly instructions are stored in instances of the other instruction type.

The implementation of this pass is called via the remove_asm(goto_modelt &, message_handlert &) function. For full documentation of this pass see remove_asm.h

This pass has no predecessor.

Linking of Standard Libraries

This pass links the function definitions from the Cprover standard library models to the current goto model.

The implementation of this pass is called via the link_to_library function.

Predecessor pass is Removal/Lowering of Assembly.

Removal/Lowering of Function Pointers

This pass finds all the instructions which are function calls to the value of a function pointer. Each instruction is then replaced with a switch block. The switch block contains a case for each of the potential targets of the function pointer. The targets are found by looking for all functions in the goto program that approximately match the type of the function pointer.

Note that for this pass to work as intended, all potential targets of calls to function pointers must be in the model. Otherwise they will not be added to the relevant switch blocks, which may cause the assertion that the switch block's fallthrough is unreachable to be violated via a spurious counterexample.

The implementation of this pass is called via the remove_function_pointers function. Detailed documentation of this pass belongs in remove_function_pointers.h

Predecessor pass is the Linking of Standard Libraries or the optional String Instrumentation if it is being used.

Memory Mapped IO Instrumentation

This pass finds pointer dereferences and adds corresponding calls to the __CPROVER_mm_io_r and __CPROVER_mm_io_w modelling functions if they exist. See the device behaviour section of modeling-mmio.md for details of modelling memory-mapped I/O regions of device interfaces. This pass is always carried out but will only make changes if one of the modelling functions exists.

The implementation of this pass is called via the mm_io function. Further documentation of this pass can be found in mm_io.h

Predecessor pass is Removal/Lowering of Function Pointers.

Instrument/Remove Preconditions

This pass moves preconditions associated with functions to their call sites rather than leaving them at the head of the function. Note that functions may have multiple call sites, in which case the instrumentation is copied to each of them.

The implementation of this is called via instrument_preconditions(goto_modelt &goto_model) . Documentation of this pass belongs in instrument_preconditions.h

Predecessor pass is Memory Mapped IO Instrumentation.

Removal/Lowering of Return Statements

This pass replaces returns of values with writes and reads to global variables.

The implementation of this is called via remove_returns(goto_modelt &) . Detailed documentation of this pass can be found in remove_returns.h

The predecessor pass is Instrument/Remove Preconditions, or the optional Partial Inlining if it is being used.

Remove/Lower Vector Typed Expressions

This pass removes/lowers expressions corresponding to CPU instruction sets such as MMX, SSE and AVX. For more details on how this is done see vector_typet and remove_vector.cpp. The implementation of this is called via remove_vector(goto_modelt &goto_model)

Predecessor pass is Removal/Lowering of Return Statements.

Remove/Lower Complex Typed Expressions

This pass is for the handling of complex numbers in the mathematical sense of a number consisting of paired real and imaginary parts. These are removed/lowered in this pass. The implementation of this is called via remove_complex(goto_modelt &) . Documentation for this pass belongs in remove_complex.h

Predecessor pass is Remove/Lower Vector Typed Expressions.

Rewrite Unions

This pass converts union member access into byte extract and byte update operations.

The implementation of this pass is called via rewrite_union(goto_modelt &)

Predecessor pass is Remove/Lower Complex Typed Expressions.

goto_check_c

This is a pass that can add many checks and instrumentation, largely related to the definition of the C language. Note that this transformation pass is mandatory in the current implementation; however, it may have no effect depending on the command line options specified.

The implementation of this pass is called via goto_check_c(const optionst &, goto_modelt &, message_handlert &)

Predecessor pass is Rewrite Unions.

Adjust Float Expressions

This is a transform from general mathematical operations over floating point types into floating point specific operator variations which include a rounding mode.

The implementation of this pass is called via adjust_float_expressions(goto_modelt &) . Documentation of this pass can be found in adjust_float_expressions.h

Predecessor pass is goto_check_c or the optional Transform Assertions Assumptions if it is being used.

Goto Functions Update

This transformation updates goto program fields whose in-memory data may have been invalidated by previous passes. The following are updated by this pass:

  • Incoming edges
  • Target numbers
  • Location numbers
  • Loop numbers

The implementation of this pass is called via goto_functionst::update()

Predecessor pass is Adjust Float Expressions or the optional String Abstraction if it is being used.

Add Failed Symbols

This pass adds failed symbols to the symbol table. See src/pointer-analysis/add_failed_symbols.h for details. The implementation of this pass is called via add_failed_symbols(symbol_table_baset &) . The purpose of failed symbols is explained in the documentation of the function goto_symext::dereference(exprt &, goto_symex_statet &, bool)

Predecessor pass is Goto Functions Update or the optional Add Non-Deterministic Initialisation of Global Scoped Variables if it is being used.

Remove Skip Instructions

This transformation removes skip instructions. Note that this transformation is called in many places and may be called as part of other transformations. This instance here is part of the mandatory transformation to reach symex ready goto format. If there is a use case where it is desirable to preserve a "no operation" instruction, a LOCATION type instruction may be used in place of a SKIP instruction. The LOCATION instruction has the same semantics as the SKIP instruction, but is not removed by the remove skip instructions pass.

The implementation of this pass is called via remove_skip(goto_modelt &)

Predecessor pass is Add Failed Symbols or the optional Remove Unused Functions if it is being used.

Label Properties

This transformation adds newly generated unique property identifiers to assert instructions. The property identifiers are stored in the location data structure associated with each instruction.

The implementation of this pass is called via label_properties(goto_modelt &)

Predecessor pass is Remove Skip Instructions or the optional Add Coverage Goals if it is being used.

Optional Transformation Passes

This section lists the optional transformation passes that may modify a goto model. Note that these are documented here for completeness, but are not required for symex ready goto format.

Note that for each optional pass there is a listed predecessor pass: the pass currently called before the listed pass in CBMC. While the ordering may not be fully documented, it should be followed.

String Instrumentation

This (optional) pass adds checks on calls to some of the C standard library string functions. It uses the tracking information added by the String Abstraction pass. It is switched on by the --string-abstraction command line option.

The implementation of this pass is called via the string_instrumentation(goto_modelt &goto_model) function. Detailed documentation of this pass belongs in string_instrumentation.h

The predecessor pass is Linking of Standard Libraries.

Partial Inlining

This pass does partial inlining when the option is switched on. Partial inlining is the inlining of functions which are either marked as inline or smaller than a specified limit. For further detail see the implementation function goto_partial_inline(goto_modelt &, message_handlert &, unsigned, bool)

Predecessor pass is Instrument/Remove Preconditions.

Transform Assertions Assumptions

This pass removes user provided assertions and assumptions when the user has opted to do so. Note that this pass removes skip instructions if any other changes are made. The implementation of this pass is called via the transform_assertions_assumptions(const optionst &, goto_modelt &) function.

Predecessor pass is goto_check_c.

String Abstraction

This (optional) transformation pass adds tracking of the length of C strings. It is switched on by the --string-abstraction command line option. The changes made by this pass are documented as part of the documentation for the string_abstractiont class. The implementation of this pass is called via the string_abstraction(goto_modelt &, message_handlert &) function.

Predecessor pass is Adjust Float Expressions.

Add Non-Deterministic Initialisation of Global Scoped Variables

This transformation adds non-deterministic initialisation of global scoped variables, including static variables. For details see src/goto-instrument/nondet_static.h. The initialisation code is added to the __CPROVER_initialize function in the goto model. The implementation of this pass is called via the nondet_static(goto_modelt &) function.

Predecessor pass is Goto Functions Update.

Remove Unused Functions

This pass removes unused functions from the goto model. In practice this builds a collection of all the functions that are potentially called, and then removes any function not in this collection.

This pass cannot handle function calls via function pointers. Attempting to run this pass against a goto model which contains such a function call will result in an invariant violation. Therefore the function pointer removal pass must always be applied to the goto model before the remove unused functions pass.

The implementation of this pass is called via the remove_unused_functions(goto_modelt &, message_handlert &) function.

Predecessor pass is Add Failed Symbols.

Add Coverage Goals

This transformation adds coverage goals instrumentation and is only applied if the --cover option has been specified. The implementation of this pass is called via the instrument_cover_goals(const cover_configt &, goto_modelt &, message_handlert &) function.

Predecessor pass is Remove Skip Instructions.

Slicing

This includes several slicing passes as specified by various slicing command line options. The reachability slicer is enabled by the --reachability-slice command line option. The implementation of this pass is called via the reachability_slicer(goto_modelt &, message_handlert &) function. The full slicer is enabled by the --full-slice command line option. The implementation of this pass is called via the full_slicer(goto_modelt &, message_handlert &) function.

Predecessor pass is Label Properties.

Tutorials

CBMC Developer Tutorial

Author
Kareem Khazem

This is an introduction to hacking on the cprover codebase. It is not intended as a user guide to CBMC or related tools. It is structured as a series of programming exercises that aim to acclimatise the reader to the basic data structures and workflow needed for contributing to CBMC.

Initial setup

Clone the CBMC repository and build it. The build instructions are written in a file called COMPILING.md in the top level of the repository. I recommend that you build using CMake; this will place all of the CBMC tools in a single directory, which you can add to your $PATH. For example, if you built the codebase with the following two commands at the top level of the repository:

cmake -DCMAKE_BUILD_TYPE=Debug                  \
      -DCMAKE_CXX_COMPILER=/usr/bin/clang++     \
      -DCMAKE_C_COMPILER=/usr/bin/clang         \
      -B build -S .
cmake --build build

then all the CBMC binaries will be built into build/bin, and you can run the following commands to make CBMC invokable from your terminal.

# Assuming you cloned CBMC into ~/code
export PATH=~/code/cbmc/build/bin:$PATH
# Add to your shell's startup configuration file so that you don't have to run that command every time.
echo 'export PATH=~/code/cbmc/build/bin:$PATH' >> .bashrc

If you are using Make instead of CMake, the built binaries will be scattered throughout the source tree. This tutorial uses the binaries src/cbmc/cbmc, src/goto-instrument/goto-instrument, and src/goto-cc/goto-gcc, so you will need to add each of those directories to your $PATH, or symlink to those binaries from a directory that is already in your $PATH.

Ensure that graphviz is installed on your system (in particular, you should be able to run a program called dot). Install Doxygen and generate doxygen documentation:

# In the src directory
doxygen doxyfile
# View the documentation in a web browser
firefox ~/code/cbmc/doc/html/index.html

If you've never used doxygen documentation before, get familiar with the layout. Open the generated HTML page in a web browser; search for the class goto_programt in the search bar, and jump to the documentation for that class; and read through the copious documentation.

Whirlwind tour of the tools

CBMC's code is located under the src directory. Even if you plan to contribute only to CBMC, it is important to be familiar with several other of cprover's auxiliary tools.

Compiling with goto-cc

If you built using CMake on Unix, you should be able to run the goto-gcc command. Find or write a moderately-interesting C program; we'll call it main.c. Run the following commands:

goto-gcc -o main.gb main.c
cc -o main.exe main.c

Invoke ./main.gb and ./main.exe and observe that they run identically. The version that was compiled with goto-gcc is larger, though:

du -hs main.gb main.exe

Programs compiled with goto-gcc are mostly identical to their clang- or gcc-compiled counterparts, but contain additional object code in cprover's intermediate representation. The intermediate representation is (informally) called a goto-program.

Viewing goto-programs

goto-instrument is a Swiss army knife for viewing goto-programs and performing single program analyses on them. Run the following command:

goto-instrument --show-goto-functions main.gb

Many of the instructions in the goto-program intermediate representation are similar to their C counterparts. if and goto statements replace structured programming constructs.

Find or write a small C program (2 or 3 functions, each containing a few varied statements). Compile it using goto-gcc as above into a goto binary called main.gb. You can write the diagram to a file and then view it:

goto-instrument --dot main.gb | tail -n +2 | dot -Tpng > main.png
open main.png

(the invocation of tail is used to filter out the first line of the goto-instrument output. If goto-instrument writes more or less debug output by the time you read this, read the output of goto-instrument --dot main.gb and change the invocation of tail accordingly.)

There are a few other views of goto-programs. Run goto-instrument -h and try the various switches under the "Diagnosis" section.

Learning about goto-programs

In this section, you will learn about the basic goto-program data structures. Reading from and manipulating these data structures form the core of writing an analysis for CBMC.

First steps with goto-instrument

Task: Write a simple C program with a few functions, each containing a few statements. Compile the program with goto-gcc into a binary called main.

The entry point of goto-instrument is in goto_instrument_main.cpp. Follow the control flow into goto_instrument_parse_optionst::doit(), located in goto_instrument_parse_options.cpp. At some point in that function, there will be a long sequence of if statements.

Task: Add a --greet switch to goto-instrument, taking an optional argument, with the following behaviour:
$ goto-instrument --greet main.gb
hello, world!
$ goto-instrument --greet Leperina main.gb
hello, Leperina!

You will also need to add the greet option to the goto_instrument_parse_options.h file in order for this to work. Notice that in the .h file, options that take an argument are followed by a colon (like (property):), while simple switches have no colon. Make sure that you return 0; after printing the message.

The idea behind goto-instrument is that it parses a goto-program, performs one single analysis on that goto-program, and then returns. Each of the switches in the doit function of goto_instrument_parse_options does something different with the goto-program that was supplied on the command line.

Goto-program basics

At this point in goto_instrument_parse_options (where the if statements are), the goto-program will have been loaded into the object goto_functions, of type goto_functionst. This has a field called function_map, a map from function names to functions.

Task: Add a --print-function-names switch to goto-instrument that prints out the name of every function in the goto-binary. Are there any functions that you didn't expect to see?

The following is quite difficult to follow from doxygen, but: the value type of function_map is goto_function_templatet<goto_programt>.

Task: Read the documentation for goto_function_templatet<bodyT> and goto_programt.

Each goto_programt object contains a list of goto_programt::instructiont called instructions. Each instruction has a field called code, which has type codet.

Task: Add a --pretty-program switch to goto-instrument. This switch should use the codet::pretty() function to pretty-print every codet in the entire program. The strings that pretty() generates for a codet look like this:
index
  * type: unsignedbv
      * width: 8
      * #c_type: char
  0: symbol
      * type: array
          * size: nil
              * type:
          * #source_location:
            * file: src/main.c
            * line: 18
            * function:
            * working_directory: /some/dir
          0: unsignedbv
              * width: 8
              * #c_type: char
...

The sub-nodes of a particular node in the pretty representation are numbered, starting from 0. They can be accessed through the op0(), op1() and op2() methods in the exprt class.

Every node in the pretty representation has an identifier, accessed through the id() function. The file util/irep_ids.def lists the possible values of these identifiers; have a quick scan through that file. In the pretty representation above, the following facts are true of that particular node:

  • node.id() == ID_index
  • node.type().id() == ID_unsignedbv
  • node.op0().id() == ID_symbol
  • node.op0().type().id() == ID_array

The fact that the op0() child has a symbol ID means that you could cast it to a symbol_exprt (which is a subtype of exprt) using the function to_symbol_expr.

Task: Add flags to goto-instrument to print out the following information:
  • the name of every function that is called in the program;
  • the value of every constant in the program;
  • the value of every symbol in the program.

Memory Bounds Checking

cbmc provides means to verify that pointers are within the bounds of (statically or dynamically) allocated memory blocks. When the option --pointer-check is used, cbmc checks the safety of each pointer dereference in the program. Similarly, the primitive __CPROVER_r_ok(pointer, size) returns true when it is safe to read from the memory segment starting at pointer and extending for size bytes. __CPROVER_w_ok(pointer, size) indicates whether writing to the given segment is safe. At present, both primitives behave the same.

Each memory segment is referred to internally in cbmc via an object id. Pointers are represented as bitvectors (typically of length 32 or 64). The topmost n bits encode the id of the memory segment the pointer is pointing to, and the remaining bits encode the offset within the segment. The number of bits used to represent the object id can be specified via --object-bits n. Object offsets are signed. This makes it possible to distinguish the case where a pointer has been incremented too much from the case where it has been decremented too much: in the latter case, a negative value is visible in the offset portion of the pointer in an error trace.

In the following, we describe how the memory bounds checks are implemented in cbmc for dynamically allocated memory. Dynamic memory is allocated in C via the malloc() library function. Its model in cbmc looks as follows (see src/ansi-c/library/stdlib.c; details not related to memory bounds checking are omitted):
inline void *malloc(__CPROVER_size_t malloc_size)
{
  void *malloc_res;
  malloc_res = __CPROVER_allocate(malloc_size, 0);
  return malloc_res;
}

The actual allocation is performed by the intrinsic __CPROVER_allocate, declared as:

void *__CPROVER_allocate(__CPROVER_size_t size, __CPROVER_bool zero);
When the option --pointer-check is used, cbmc generates the following verification condition for each pointer dereference expression (e.g., *pointer):

__CPROVER_POINTER_OFFSET(pointer) >= 0 &&
__CPROVER_POINTER_OFFSET(pointer) < __CPROVER_OBJECT_SIZE(pointer)

The primitives __CPROVER_POINTER_OFFSET() and __CPROVER_OBJECT_SIZE(), declared as

__CPROVER_size_t __CPROVER_OBJECT_SIZE(const void *);
__CPROVER_size_t __CPROVER_POINTER_OFFSET(const void *);

extract the pointer offset and the size of the object pointed to, respectively. Similar conditions are generated for assert(__CPROVER_r_ok(pointer, size)) and assert(__CPROVER_w_ok(pointer, size)).

To illustrate how the bounds checking works, consider the following example program, which is unsafe:
int main()
{
  char *p1 = malloc(1);
  p1++;
  char *p2 = malloc(2);
  *p1 = 1;
}
Here the verification condition generated for the pointer dereference fails: for p1 in *p1, __CPROVER_POINTER_OFFSET evaluates to 1 (owing to the increment p1++), and __CPROVER_OBJECT_SIZE also evaluates to 1. Consequently, the less-than comparison in the verification condition evaluates to false.

Other Tools

Author
Martin Brain, Peter Schrammel


FIXME: The text in this section is a bit outdated.

The CPROVER subversion archive contains a number of separate programs. Others are developed separately as patches or separate branches. Interfaces have been and are continuing to stabilise, but older code may require work to compile and function correctly.

In the main archive:

  • CBMC: A bounded model checking tool for C and C++. See cbmc.
  • goto-cc: A drop-in, flag compatible replacement for GCC and other compilers that produces goto-programs rather than executable binaries. See goto-cc.
  • goto-instrument: A collection of functions for instrumenting and modifying goto-programs. See goto-instrument.

Model checkers and similar tools:

  • SatABS: A CEGAR model checker using predicate abstraction. It is roughly 10,000 lines of code (on top of the CPROVER code base) and is developed in its own subversion archive. It uses an external model checker to find potentially feasible paths. Key limitations are related to code with pointers, and there is scope for significant improvement.
  • Scratch: Alistair Donaldson’s k-induction based tool. The front-end is in the old project CVS and some of the functionality is in goto-instrument.
  • Wolverine: An implementation of Ken McMillan’s IMPACT algorithm for sequential programs. In the old project CVS.
  • C-Impact: An implementation of Ken McMillan’s IMPACT algorithm for parallel programs. In the old project CVS.
  • LoopFrog: A loop summarisation tool.
  • TAN: Christoph’s termination analyser.

Test case generation:

  • cover: A basic test-input generation tool. In the old project CVS.
  • FShell: A test-input generation tool that allows the user to specify the desired coverage using a custom language (which includes regular expressions over paths). It uses incremental SAT and is thus faster than the naïve “add assertions one at a time and use the counter-examples” approach. It is developed in its own subversion archive.

Alternative front-ends and input translators:

  • Scoot: A System-C to C translator. Probably in the old project CVS.
  • ???: A Simulink to C translator. In the old project CVS.
  • ???: A Verilog front-end. In the old project CVS.
  • ???: A converter from Codewarrior project files to Makefiles. In the old project CVS.

Other tools:

  • ai: Leo’s hybrid abstract interpretation / CEGAR tool.
  • DeltaCheck?: Ajitha’s slicing tool, aimed at locating changes and differential verification. In the old project CVS.

There are tools based on the CPROVER framework from other research groups which are not listed here.