Taint Analysis with Datalog



Objective
Taint analysis consists of determining which parts of the program are dependent on user inputs. The goal of this project is to implement a Datalog based taint analysis for a simple language.

Problem
We are given a source code with no loops, pointers or floating point variables. The code contains assignments, conditionals, op() method (representing a binary operator) and multiple calls to source() and sink() methods. The output of the source() method is controlled by the user and the call to the sink() method represents a sensitive region of the program. We say that there is a flow from a source to a sink on a program path if the value of the argument to the sink() method is derived explicitly or implicitly from the return value of the source() method (as in the examples below) on that path. The flow from a source to a sink on a path can be made secure by sanitizing either the argument of the sink() method or the variables used for deriving its value on that path. This is done by calling the sanitize() method.

The problem is thus to determine the program locations at which sanitization should occur while avoiding undersanitization and oversanitization. Undersanitization occurs when there is a flow on a program path from a source to a sink with unsanitized user inputs. Oversanitization occurs when there is no flow from any source to a sink on a program path and yet there is unnecessary sanitization. Additionally, the number of calls to the sanitize() method should be minimal.

Example
Consider the following examples:
Example1 PredicatesExample2Predicates
x:=source(); source(l1,x) x:=source(); source(l1,x)
y:=op(x,z); follows(l2,l1),assign(l2,y,x,z) if (x > 0){ follows(l2,l1),if(l2,l3,x,0)
if(z+1>0){ follows(l3,l2),if(l3,l4,z,1) z := 0; follows(l3,l2),assign(l3,z,0,0)
y:=5; follows(l4,l3),assign(l4,y,5,0) } follows(l4,l3)
} follows(l5,l4) else{ follows(l5,l1),if(l5,l6,x,0)
else{ follows(l6,l2),,if(l6,l7,z,1) z:=1; follows(l6,l5),assign(l6,z,1,0)
z:=2; follows(l7,l6),assign(l7,z,2,0) } follows(l7,l6)
} follows(l8,l7) sink(z); follows(l8,l7),follows(l8,l4),join(l8,l4,l7),sink(l8,z)
sink(y); follows(l9,l8),follows(l9,l5),join(l9,l5,l8),sink(l9,y)

In Example1, the value of the argument y to the sink() method is derived using the user input x in the else branch. The code can be secured against this explicit dependency by inserting a call to the sanitize() method on y in the else branch.
In Example2, the value of the argument z to the sink() method is implicitly dependent on the user input x which determines its exact value through the if statement on both branches. The code can be guarded against implicit dependency by inserting call to the sanitize() method on x after the call to source() or on z before calling sink () method.

Language
The grammar for the language is formalized below. Note that c is an integer constant and x,y are variables.

expression: c | x | opc(x,c) | opv(x,y)
conditional: if(expression<=0) | if(expression<0) | if(expression==0) | if(expression >=0) | if(expression > 0)
assignment: x:=expression | x:=source()
sink method: sink(x)

We assume that there is always an else statement after an If and the two branches do not share any common statement. We also assume that all calls to the sink() method are at the end of the code.

Predicates
The input code is encoded using the following Datatlog predicates:
  • source(l,x): assign output of source() to variable x at label l
  • sink(l,x): call sink() with x as argument at label l
  • follows(l2,l1): label l2 follows label l1
  • if(l1,l2,x,y): if expression involving x and y at label l1 holds then label l2 follows l1. Here, x and y can be variables or constants. For unary expressions, y is 0.
  • join(l1,l2,l3): join labels l2 and l3 at l1
  • assign(l,x,y,z): assign the expression involving y and z to variable x at label l. Here, y and z can be variables or constants. For unary expressions, z is 0.

The predicate derived for the examples are shown in the second and the fourth columns of the table above.

Input/Output
The input to the analysis is the encoding of the code using the predicates described above. The analysis should produce the list of locations and arguments at which the sanitize() method should be called. An output of (l8,y) or (l7,y) gets perfect grade fot Example 1. For Example 2, (l2,x) or (l8,z) will get full grades.

Framework
The taint analysis should be implemented using the Souffle Datalog frame work.

Template and tests
We provide template containing the definition of input and output predicates as well as 20 test cases for testing the analysis. These can be downloaded from here. Note that there are no unique solutions for many tests. The test files contain one of the correct outputs.

Update
We have fixed a number of typos in the test cases.

Grading
The solutions will be graded based on the following criteria:
  • The analysis should determine the locations at which the inputs should be sanitized so that there is no explicit or implicit dependency of the arguments to the sink() method on the user input.
  • There should be no undersanitization or oversanitization.
  • The number of calls to the sanitize() method should be minimal.

Submission
Please send a zip file containing your code and a README to gsingh@inf.ethz.ch. The README file should describe the input format expected by the solution as well as the instructions to run it.


Deadline
Friday, 23:59:59, 01.06.2018.