Learn-ocaml code classification

Introduction

I am currently doing a summer internship at IRIF under the supervision of Yann Régis-Gianas about Learn-OCaml. It is a great platform designed to learn the OCaml language:

Professors are able to write exercises and grade them automatically by reasoning on the student’s code AST.
Students can experiment OCaml in their web-browser thanks to the js-of-ocaml compiler and a nice UI. It means that grading is performed mostly on the student side, and the teacher can grade files himself only once when he is correcting student’s exercises.

One of my goals this month is to develop a tool to help teachers to determine how exercises were solved by students, that is, to identify “similar” codes, and split the whole exercise database into equivalence classes.

The term “exercises” is quite vague, and for the sake of simplicity, we will consider all exercises as just “defining a function”.

First steps

The first step is to divide answers by their automatic grade: 2 codes with different grades are certainly different. Note that we make the hypothesis that teachers have correctly written their tests (for example, they do not give a random mark for each exercise). If it is not the case, professors will have a bigger problem than partitioning code anyway. Hence, we will focus on how to determine equivalence classes between codes with the same grade.

We have a very strong hypothesis:

All codes are valid OCaml fragments: it is guaranteed that they pass the OCaml type-checker.
Every code has (almost) same semantics because they share the same grade.

Because I am focusing on how an exercise was solved, I have decided to study code’s Abstract Syntax Tree (AST). Indeed, it allows us to abstract from spaces and indentation differences between user codes, and will introduce some semantic into the analysis.

The question becomes now: how do we identify two “similar” AST?

Lambda

Speaking of AST, it is important to say the language it represents. I choose not to work directly on the whole OCaml language, but on an intermediate language used in the compilation pipeline called Lambda. Mainly, Lambda is partially optimized (with inlining of some values, compiled pattern-matching…) and removes syntactic sugar (such as the function key-word), which allows us to abstract even more from the syntax and try to focus on semantic.

Note that, to construct a Lambda expression from an OCaml parse-tree, we need to type-check it. This method is not available for all use-cases, but is fine here since all codes are guaranteed to type-check.

Applied AST fingerprinting

A cool paper

After some research on the Internet, I came across a cool paper: Syntax tree fingerprinting: a foundation for source code similarity detection by Michel Chilowicz, Étienne Duris and Gilles Roussel. 2009, hal-00627811.

The technique presented is pretty simple, while very smart: just hash recursively the AST. It is fast and identical ASTs are guaranteed to share the same hash. But we are not very interested in the direct hash of the structure: for example, two students might have implemented the same code, but with different variable names: we still want to mark the two codes as similar.

The whole point is to find a “good” hash function, that is a function that captures only the interesting part of the AST. The paper rightly suggests that this could be done simply by defining a recursive hash function which doesn’t take into account unneeded data (such as variable names), and only the “shape” of the AST (for example, by hashing a constant string for each node).

In the next paragraph, I will explain what I understood of this paper; we enter after in an original work, about the choice of the clustering analysis of the obtained hashes and an analysis of this choice on a real-world example.

A story of sub-ASTs

Suppose that we have such a “good” hash function. It is still not sufficient, because some codes may share substantial sub-AST, but not be totally equal, and thus the two ASTs will have different hashes. For example, we want to recognize that

if b then x else y

is very close to

if not b then y else x

As suggested in the previously mentioned paper, a possible idea is to keep the hash of “big” sub-ASTs (where the weight of an AST is the number of nodes and leaves it has).

Then, the hash function will now be of type hash : int -> (int * string) list: it takes a weight threshold, and for every node of the AST that is over the threshold, it keeps the weight plus the hash.

The obtained list is after sorted. Because we made the hypothesis that all codes shared the same semantic, we want to identify two codes with a hash list that have the same elements but in a different order (for example, where the order of pattern-matching cases was not the same).

To say it more clearly, we are not interested in little sub-AST alone (because for example, a (partial) function application ( + ) 1 does not give enough information), but in how they are composed (for example, map (( + ) 1) xs). Do not mistake yourself, we do not forget about little sub-AST: because the hash is computed recursively, their hash will be used to compute the hash of their parents. We just don’t take them into account when measuring the distance between two ASTs.

Clustering

We are now able to identify exactly similar ASTs (so, same ASTs up to variable names and syntactic sugar) because they will share the same hash list. But how do we identify “less exactly” similar ASTs, that is, to identify “close” ASTs?

In a mathematical world, this starts by defining something like a distance between two ASTs:

A possible semimetric

let sum_of_fst =
  List.fold_left (fun acc (x,_) -> acc + x) 0

(* Note: None is used to represent infinity *)
let semimetric x y =
  let hx = hash x in
  let hy = hash y in
  if is_intersection_empty hx hy
  then None
  else Some (sum_of_fst (symmetric_difference hx hy))

There is two cases:

If the intersection is empty, then the two set are declared infinitely distant.
Else, we sum the weight of all hashes in the symmetric difference (the symmetric difference of two sets A and B is defined as: (A ∪ B) \ (A ∩ B))

Note that this semimetric is well-defined:

distance x y = 0 <=> x = y, because the symmetric difference of two sets is the empty set if and only if the two sets are the same.
distance x y = distance y x, because the symmetric difference is commutative.

The fact that our hash lists are sorted allows an implementation of symmetric_difference in linear time.

Meaning

What does this “distance” mean? If two ASTs don’t share at least a sub-AST, then they are totally distinct (ie. infinitely distant). Else, we see how many of sub-ASTs they share, compared to all their recorded sub-ASTs. If they share big sub-ASTs, then the sum of the weight of the symmetric difference will be small. Else, if they share only a little sub-AST, then the sum of the weight of the symmetric difference will be bigger.

Cluster analysis

Now that we have a “distance”, we want to group ASTs that are close of each-others. This operation is called a cluster analysis of our hash lists.

I chose to use a hierarchical clustering (which produces a dendrogram) because, mainly, we don’t know how many classes we will obtain, and classical algorithms such as k-means require this information.

The idea is simple: start with every hash list as a class with a single element, then at each step, merge the two closest classes, if it is possible.

The question is: “how do we determine the distance between two classes?”. There are several approaches, but I choose to use the maximum of the distances between 2 elements of the sets, that is, for two classes X and Y: max {distance x y | x ∈ X, y ∈ Y}. This is called a complete-linkage clustering. If the maximum is infinite, then I choose not to merge the two classes.

Mainly, this approach renders the idea that: “if there are two elements that don’t share at least one sub-AST, then the classes cannot be merged”.

The clustering phase returns a list of binary trees representing the classes and how they were made. It allows the end-user to see the distance between two ASTs, and thus get an idea of how good is the relation linking two implementations in the same class.

At the end

Let me sum up the process:

First, we split all codes by their grade.
Then, for each obtained class:
- First we transform all codes into lambda AST.
- Then we hash all of these AST to obtain a list of hash of their sub-ASTs.
- Then we make the classes using the presented hierarchical clustering.

There is only one parameter that needs to be taken into account: the weight threshold for sub-ASTs

Some results

I have tested all of this on the exercises from 3rd year student’s first course of OCaml in my University (a course that I have followed myself this year). The weight threshold is relative to the whole AST weight, and after some experiments, setting it to 30% (that is, only keeping sub-ASTs that have a weight superior to 30% of the whole AST weight) gives good results.

Of course, the results are normally nicely rendered into the Learn-OCaml UI. Here is a simplified version, where is displayed the code of one element in each sub-classes (constituted only of codes with the same hash, thus very close one of each other).

Rev

One interesting exercise was to define rev : 'a list -> 'a list which is reversing a list. My tool split 154 implementations (which all had 10 out of 10) into 9 classes, let us study the main ones:

A main class not so satisfying

The biggest equivalence class has 85 students, split in 7 sub-classes. (The number next to the Node constructor is the actual distance between the two sub-classes).

Node 352.

Leaf with 6 student(s):

let rec rev l = match l with
| [] -> []
| [a] -> a::[]
| a::b -> (rev b) @ [a]

Leaf with 1 student(s)

let rec rev l = match l with
[] | [_] -> l
| t::q -> rev q@[t]

Node 299.

Node 254.

Node 246.

Leaf with 25 student(s)

let rec rev l =
match l with
[]->[]
|a::q -> append (rev q) [a]

Leaf with 1 student(s)

let rec rev l =
match l with
| [] -> l
| t::q -> append (rev q) [t]

Node 196.

Leaf with 49 student(s)

let rec rev l = match l with
|[] -> []
|h::t -> (rev t)@[h]

Leaf with 2 student(s)

let rec rev l = match l with
|[]->l
|t::l' -> (rev l')@[t]

Leaf with 1 student(s)

let rec rev l =
match l with
|[]->[]
|[e]-> l
|e::q -> append (rev q) [e]

I think we can say that the tool effectively captured “close” codes. Even if sometimes there is a not needed pattern-matching, or the use of append (a function defined in a prior exercise) instead of @, the idea is the same.

Note that such a result directly inform the teacher that he needs to explain more the time complexity of the concatenation of lists in his course :)

A second class

The second class is constituted of 38 students:

Node 102.

Leaf with 22 student(s)

let rev l =
let rec rev_acc acc = function
| [] -> acc
| hd::tl -> rev_acc (hd::acc) tl
in
rev_acc [] l

Leaf with 14 student(s

let rec rev ls =
let rec aux xs rs =
match xs with
| [] -> rs
| hd :: tl -> aux tl (hd :: rs)
in aux ls []

Leaf with 2 student(s)

let rev l = match l with
    [] -> []
  | h :: q -> let rec aux new_l l = match l with
        [] -> new_l
      | h :: q -> aux (h :: new_l) q in aux [] l

Again here, the tool captured the idea of an auxiliary function.

The others

The third class is constituted with 12 students who used List.rev or List.fold_left, and the 6 last of other solutions (mainly some variations, using local let bindings, List.hd instead of a pattern-matching, et caetera).

Conclusion

Obtained results are surprisingly good and I think that such a tool can help teachers in their work.

There are many things to do:

The ideas presented here are (almost) purely syntactical and it would be great if we were able to, for example, distinguish the use of a function from the standard library and a function made by the student.
Another possibility is to develop an AI, trained by teachers, to refine the partitioning made. For example, a teacher could say: these two AST looks syntactically equivalent, but their semantic is not the same, and they should not be in the same class.
The implementation I made focus on one toplevel definition, thus it does not take into account auxiliary definition made outside of the function.
Other clustering approach can maybe give better results
A better study of the weight threshold could also help to optimize the results.

Thanks

I want to thank Yann Régis-Gianas for his proofreading and his wise advise on this article.