As software grows steadily more complex, finding security vulnerabilities in operating systems has become a necessity. Well-known vulnerability detection techniques such as static analysis, symbolic execution, and fuzzing can be very costly to apply to a large number of test cases.
In this lab session, we present an approach that uses Machine Learning to predict whether a test case shows patterns associated with vulnerable behavior. It requires only a lightweight analysis to extract data, is fully automatic, and is adaptable enough to be trained using different vulnerability detection techniques. Additionally, it works directly on test cases, without requiring source code.
We will explain our predictive approach to vulnerability discovery and show how our open-source tool, VDiscover, performs this analysis on a large number of test cases. The tool is available at vdiscover.org, together with an academic paper (in revision) explaining how it works.
This lab is divided into two parts: theory and practice (with a break in the middle, of course!). In the theoretical part, we will explain how and why Machine Learning can be used for vulnerability detection:
* Motivation: finding vulnerabilities and bugs faster.
* Previous work: using Machine Learning to detect vulnerabilities from source code.
* Overview of our technique.
* Basic concepts of Machine Learning (training, recall, evaluation, etc.).
* A predictive approach: is this test case interesting or not?
* Dynamic analysis of programs: collecting traces.
* Augmented traces using primitive types.
* Preprocessing and vectorization.
* Our results.
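To give a flavor of the pipeline outlined above (collect traces, vectorize them, train a predictor, evaluate it), here is a minimal sketch in plain Python. It is an illustration only, not VDiscover's implementation: the trace tokens such as `strcpy:HeapPtr`, the tiny data set, and all function names are hypothetical, and a simple nearest-centroid classifier stands in for whatever model the tool actually uses.

```python
from collections import Counter
import math

def build_vocabulary(traces):
    """Collect every distinct token seen in the training traces."""
    return sorted({token for trace in traces for token in trace.split()})

def vectorize(trace, vocabulary):
    """Bag-of-words: count how often each vocabulary token occurs in a trace.
    Tokens that are not in the vocabulary are ignored."""
    counts = Counter(trace.split())
    return [counts[token] for token in vocabulary]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(traces, labels):
    """Nearest-centroid training: one mean vector per label (a stand-in model)."""
    vocabulary = build_vocabulary(traces)
    vectors = [vectorize(t, vocabulary) for t in traces]
    centroids = {
        label: centroid([v for v, l in zip(vectors, labels) if l == label])
        for label in set(labels)
    }
    return vocabulary, centroids

def predict(model, trace):
    """Return the label whose centroid is closest to the vectorized trace."""
    vocabulary, centroids = model
    vector = vectorize(trace, vocabulary)
    return min(centroids, key=lambda label: distance(vector, centroids[label]))

def recall(true_labels, predicted_labels, positive=1):
    """Fraction of truly positive cases that were predicted positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0

# Hypothetical traces: library calls tagged with a made-up primitive argument type.
# Label 1 = test case flagged as vulnerable, 0 = not flagged.
train_traces = [
    "strcpy:HeapPtr strcpy:HeapPtr free:HeapPtr",
    "memcpy:StackPtr strcpy:HeapPtr free:HeapPtr",
    "strlen:GlobalPtr puts:GlobalPtr",
    "fopen:GlobalPtr fclose:FilePtr puts:GlobalPtr",
]
train_labels = [1, 1, 0, 0]

model = train(train_traces, train_labels)
test_traces = ["strcpy:HeapPtr free:HeapPtr", "puts:GlobalPtr strlen:GlobalPtr"]
predictions = [predict(model, t) for t in test_traces]
print(predictions, recall([1, 0], predictions))
```

The point of the sketch is the shape of the workflow: traces become fixed-length count vectors, a model is fitted on labeled vectors, and recall measures how many truly vulnerable cases the predictor catches.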
In the second part, we focus on the practical aspects. In particular:
* Installing VDiscover.
* Creating a test case.
* Your first trace.
* Training and testing a bug predictor.
* Trace visualization and semantic clustering.
* Exercises and questions to answer.
* Preliminary results and future research.
Only basic knowledge of fuzzing, crash analysis, and common vulnerabilities in binaries (buffer overflows, use-after-free, etc.) is required. All the Machine Learning background will be explained during the first part of the presentation.