HITB Lab: Practical Machine Learning in InfoSecurity

PRESENTATION SLIDES (PDF)

This lab session is designed to give attendees a quick introduction to ML concepts and get them up and running with the popular machine learning library scikit-learn.

We start by building a basic understanding of how to integrate ML into an email spam identification system. We look at the inner workings and discuss the components involved in the system. Using the training data, we train our system to tell genuine messages from spam, and the system automatically learns from these examples. Different classifiers are tuned to squeeze the best performance we can out of this setup.
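To make the idea concrete, here is a minimal sketch of such a spam filter. It assumes a bag-of-words representation (CountVectorizer) feeding a Naive Bayes classifier; the handful of messages is made up for illustration and is not the dataset used in the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written training set (illustrative only, not the lab's data).
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free lottery prize",
    "cheap pills limited offer",
    "meeting moved to monday morning",
    "please review the attached report",
    "lunch tomorrow with the team",
    "notes from the project meeting",
]
train_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Word counts go straight into a Naive Bayes classifier.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(train_messages, train_labels)

# Classify two unseen messages.
predictions = spam_filter.predict([
    "free prize offer",
    "project meeting on monday",
])
print(predictions)
```

With a real corpus the same two-step structure applies; only the data loading and the choice of classifier change.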

Once we have an efficient system, we take a deep dive into how one can trick it into failing, again by using ML techniques.
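One simple way to trick a word-count based filter is to pad a spam message with words that are typical of genuine mail. The sketch below illustrates this on a toy Naive Bayes filter; the messages and the padding words are assumptions chosen for illustration, and the lab may well use a different attack.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only).
messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free lottery prize",
    "cheap pills limited offer",
    "meeting moved to monday morning",
    "please review the attached report",
    "lunch tomorrow with the team",
    "notes from the project meeting",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)


def spam_probability(text):
    """Return the filter's estimated probability that `text` is spam."""
    spam_index = list(spam_filter.classes_).index("spam")
    return spam_filter.predict_proba([text])[0][spam_index]


original = "free prize offer"
# The attacker appends words that the filter associates with genuine mail.
padded = original + " the meeting report for the project team on monday"

p_before = spam_probability(original)
p_after = spam_probability(padded)
verdict_before = spam_filter.predict([original])[0]
verdict_after = spam_filter.predict([padded])[0]

print(f"before padding: {verdict_before} (p_spam={p_before:.3f})")
print(f"after padding:  {verdict_after} (p_spam={p_after:.3f})")
```

Each padding word is more likely under the "ham" class than under "spam", so it pulls the spam probability down until the verdict flips.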

Machine Learning (ML) is the future. Systems we use today use ML extensively, whether it is powering an e-commerce website or fraud detection in banking. However, it takes the average developer and security professional some level of skill and experience to apply machine learning and get useful results. It is a skill that anyone can learn, but we feel that material in this space is greatly lacking.

We give students a gentle introduction to the topic with the classic binary classification problem and introduce classifiers, which are at the core of many of the most common ML systems. We deal with some easy-to-implement classifiers in scikit-learn (linear classifiers, decision trees, etc.) and show visualizations of how they work.
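As a rough illustration, the sketch below fits a linear classifier and a decision tree to the same toy binary problem and prints the rule the tree learned. The single feature (a count of suspicious words) and its values are invented for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# One made-up feature: number of suspicious words in a message.
X = [[0], [1], [2], [3], [10], [11], [12], [13]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = genuine, 1 = spam

linear_model = LogisticRegression().fit(X, y)
tree_model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Both models should separate these two clearly different inputs.
new_points = [[1.5], [11.5]]
linear_predictions = linear_model.predict(new_points)
tree_predictions = tree_model.predict(new_points)
print("linear:", linear_predictions)
print("tree:  ", tree_predictions)

# A text rendering of the tree shows the threshold it learned.
tree_rules = export_text(tree_model, feature_names=["suspicious_words"])
print(tree_rules)
```

The printed rules are a quick text substitute for a plot: the tree places its split halfway between the two groups of training points.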

We then dive into training our classifiers with a labelled dataset. Trying different classifiers on the problem and verifying their accuracy against the test data helps us choose a suitable algorithm for the problem at hand. This lab serves as a quick and practical introduction to the world of machine learning.
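The model-selection step described above might look like the following sketch. It assumes synthetic data from make_classification in place of a real labelled dataset, and the three candidate classifiers are only examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a labelled dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
}

# Score every candidate with 5-fold cross-validation on the training split.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")

# Refit the best candidate and check it once against the held-out test data.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print(f"selected {best_name}, test accuracy = {test_accuracy:.3f}")
```

Keeping the test split out of the comparison means the final accuracy figure is not biased by the selection itself.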

In addition, we guide the student through a simple example of deploying security machine learning systems in production pipelines in a distributed and scalable fashion using Apache Spark. Lastly, we will touch on ways that such systems can be poisoned, misguided, and utterly broken if the architects and implementers are not careful.

Overview of the topics covered in the workshop:

  • Introduction to machine learning
  • Hands-on guided exploration of Python machine learning libraries:
    • Data-wrangling using Numpy and Pandas
    • Scikit-learn’s functions and capabilities
    • Data visualization using Matplotlib/Seaborn
  • Walkthrough of the most commonly used machine learning algorithms (with quick hands-on examples/visualizations for select algorithms)
    • Supervised learning algorithms
      • Linear/logistic regression
      • Support Vector Machines
    • Unsupervised learning algorithms
      • Hierarchical/k-Means clustering
    • Decision trees/Random forests
    • Semi-supervised learning
  • Lecture on application of machine learning in the security/abuse space
    • Spam, fraud, malware, phishing, and intrusion detection short examples
    • Principles behind selecting the best machine learning models for different use-cases
    • Considerations when using machine learning in an adversarial/malicious environment
  • Streaming pipelines for machine learning using Apache Spark MLlib (PySpark)
    • Apache Spark
      • General architecture
      • Distributed, scalable machine learning deployments with Spark
    • Guided example of a streaming architecture for network anomaly detection using reinforcement learning on Spark
  • Evaluating the security of machine learning systems
    • Techniques and guided example of fuzzing a classifier and regressor to find blind spots in the model
    • Evaluation of intelligent learning system architecture that is resilient to model poisoning by an adversary
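As a taste of the last topic, fuzzing a classifier can be sketched as follows: feed the model many random inputs and record those where it disagrees with a trusted reference. The labelling rule, the feature ranges, and the choice of a decision tree are all assumptions made for this example; a regressor could be probed the same way by comparing numeric error against a threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def ground_truth(points):
    """Hypothetical reference rule: label 1 when the two features sum to more than 1."""
    return (points[:, 0] + points[:, 1] > 1).astype(int)


rng = np.random.default_rng(0)

# The training data only covers the unit square.
X_train = rng.uniform(0.0, 1.0, size=(200, 2))
model = DecisionTreeClassifier(random_state=0).fit(X_train, ground_truth(X_train))

# Fuzzed inputs probe a much wider range than the model was trained on.
X_fuzz = rng.uniform(-3.0, 3.0, size=(5000, 2))
disagrees = model.predict(X_fuzz) != ground_truth(X_fuzz)
blind_spots = X_fuzz[disagrees]

print(f"{len(blind_spots)} of {len(X_fuzz)} fuzzed inputs are misclassified")
print(blind_spots[:5])
```

In practice there is rarely a perfect reference rule; a second model, a rule-based detector, or manual review of a sample can play that role instead.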

The following key components are explored in detail while developing the filter:

  1. CountVectorizer – transforming text data and tuning its parameters
  2. SVC() / Naive Bayes (NB) – different algorithms that one could use
  3. Why are pipelines in scikit-learn useful?
  4. DataFrame in Pandas / Numpy arrays
  5. K-Fold – for easy dataset splitting
  6. confusion_matrix – for assessing accuracy on the held-out folds during cross-validation
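The components listed above could fit together roughly as in this sketch. The messages, the CountVectorizer parameters, and the choice of MultinomialNB (which could be swapped for SVC()) are assumptions for illustration, not the lab's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up labelled messages held in a pandas DataFrame.
data = pd.DataFrame({
    "text": [
        "win a free prize now", "meeting moved to monday morning",
        "free cash offer click now", "please review the attached report",
        "claim your free lottery prize", "lunch tomorrow with the team",
        "cheap pills limited offer", "notes from the project meeting",
        "exclusive offer win cash today", "can you send the budget report",
        "free pills click to claim", "team lunch moved to friday",
    ],
    "label": ["spam", "ham"] * 6,
})

# The pipeline keeps vectorizing and classifying as one unit, so the
# vocabulary is learned from the training folds only.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("classifier", MultinomialNB()),
])

label_order = ["ham", "spam"]
total = np.zeros((2, 2), dtype=int)
folds = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(data):
    train, test = data.iloc[train_idx], data.iloc[test_idx]
    pipeline.fit(train["text"], train["label"])
    predicted = pipeline.predict(test["text"])
    # Rows are true labels, columns are predicted labels.
    total = total + confusion_matrix(test["label"], predicted, labels=label_order)

report = pd.DataFrame(
    total,
    index=[f"true_{name}" for name in label_order],
    columns=[f"predicted_{name}" for name in label_order],
)
print(report)
```

Summing the per-fold matrices gives one table in which every message has been scored exactly once by a model that never saw it during training.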

Prerequisite Knowledge:

  • Basic familiarity with Linux
  • Python scripting knowledge is a plus, but not essential

Technical Requirements

  • Latest version of VirtualBox installed
  • Administrative access on your laptop with external USB allowed
  • At least 20 GB free hard disk space
  • At least 4 GB RAM (the more the better)

CONFERENCE
Location: Track 3 / HITB Labs
Date: April 13, 2017
Time: 10:45 am - 12:45 pm
Clarence Chio, Anto Joseph