Machine Learning Overview
The ML module is the intelligence layer of the VNTD project. Its purpose is to detect anomalous network behaviour directly from Suricata-generated logs, without relying on predefined signatures or rules.
The module is based on an Isolation Forest model trained exclusively on benign traffic. Anything that deviates significantly from that baseline is flagged as an anomaly.
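The benign-only training idea described above can be illustrated with a minimal scikit-learn sketch. It is not the project's training code: the synthetic data, the three feature columns, the hyperparameters, and the percentile-based threshold rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for benign flow features (hypothetical columns:
# packets, bytes, duration). The real feature set may differ.
benign = rng.normal(
    loc=[50.0, 4000.0, 2.0],
    scale=[5.0, 300.0, 0.3],
    size=(500, 3),
)

# Fit the scaler and the model on benign traffic only.
scaler = StandardScaler()
X_benign = scaler.fit_transform(benign)
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_benign)

# Assumed threshold rule: the 5th percentile of the benign scores.
# decision_function returns lower values for more anomalous samples.
benign_scores = model.decision_function(X_benign)
threshold = float(np.percentile(benign_scores, 5))


def is_anomaly(samples):
    """Flag samples whose score falls below the benign-derived threshold."""
    scores = model.decision_function(scaler.transform(samples))
    return scores < threshold


typical = np.array([[50.0, 4000.0, 2.0]])        # close to the benign baseline
extreme = np.array([[5000.0, 900000.0, 0.01]])   # far outside the baseline
print(is_anomaly(typical)[0], is_anomaly(extreme)[0])
```

Because the model never sees attack traffic during fitting, the threshold is the only place where the trade-off between false positives and missed anomalies is set. That may be why a threshold is stored alongside the model files.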
Architecture
flowchart LR
SURICATA[Suricata\neve.json]
DATA[Training Data\nbenign.json / attacks.json]
NOTEBOOK[Jupyter Notebook\nVNTD_ML.ipynb]
MODEL[Trained Model\nisolation_forest.pkl\nscaler.pkl]
DETECT[Real-time Detector\ndetect.py]
ALERTS[Anomaly Alerts\nTerminal UI]
DATA --> NOTEBOOK
NOTEBOOK --> MODEL
SURICATA --> DETECT
MODEL --> DETECT
DETECT --> ALERTS
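As a rough illustration of the SURICATA --> DETECT edge in the diagram, the sketch below turns eve.json lines (one JSON object per line) into numeric feature vectors. The helper name `flow_features` and the choice of features are hypothetical. The counters used are standard fields of Suricata flow events, but the features that pipeline.py actually extracts may be different.

```python
import json

# Illustrative feature choice, not necessarily the project's feature set.
FLOW_FIELDS = ("pkts_toserver", "pkts_toclient", "bytes_toserver", "bytes_toclient")


def flow_features(line):
    """Return a feature list for a Suricata flow event, or None for any other line."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # skip partially written or corrupt lines
    if not isinstance(event, dict) or event.get("event_type") != "flow":
        return None  # only flow events carry the counters used here
    flow = event.get("flow", {})
    return [float(flow.get(name, 0)) for name in FLOW_FIELDS]


sample_lines = [
    '{"event_type": "flow", "src_ip": "10.0.0.5", "flow": {"pkts_toserver": 12, '
    '"pkts_toclient": 9, "bytes_toserver": 1400, "bytes_toclient": 5200}}',
    '{"event_type": "dns", "src_ip": "10.0.0.5"}',
    '{"event_type": "flow", "flow": {"pkts_tos',  # truncated line, as seen mid-write
]

features = [f for f in map(flow_features, sample_lines) if f is not None]
print(features)  # [[12.0, 9.0, 1400.0, 5200.0]]
```

Skipping unparsable lines instead of raising matters for a live detector, since the log may be read while Suricata is still writing the last line.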
Directory Structure
ml/
├── data/ # (Git LFS)
│ ├── attacks.json # ~300,000 attack events (DoS, port scan, SSH brute force...)
│ └── benign.json # ~240 normal traffic events used for training
├── models/
│ ├── isolation_forest.pkl # Trained Isolation Forest model (Git LFS)
│ ├── scaler.pkl # StandardScaler fitted on benign data (Git LFS)
│ ├── model_threshold.txt # Anomaly score threshold derived at training time
│ ├── confussion_matrix.png
│ ├── feature_distributions.png
│ ├── roc_curve.png
│ ├── score_distribution.png
│ └── time_window_distributions.png
├── notebooks/
│ └── VNTD_ML.ipynb # Jupyter notebook: full training and evaluation pipeline
├── realtime/
│ ├── detect.py # Interactive terminal detector (curses UI)
│ └── pipeline.py # Data processing and model inference logic
└── requirements.txt # Python dependencies
Git LFS
The .pkl model files and .json data files are tracked with Git Large File Storage (LFS) because of their size. Make sure Git LFS is installed and the files have been fetched (git lfs install, then git lfs pull) before using them.
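When the LFS objects have not been fetched, the working tree holds small text pointer files in place of the real artifacts, and unpickling them fails with an unhelpful error. The following sketch shows one way to detect that case up front. The helper `is_lfs_pointer` is hypothetical and not part of the project. It relies only on the fact that Git LFS pointer files begin with a fixed `version` line.

```python
import tempfile
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is an unfetched Git LFS pointer, not real content."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Demonstration with temporary stand-in files (not the real artifacts).
with tempfile.TemporaryDirectory() as tmp:
    pointer_file = Path(tmp) / "isolation_forest.pkl"
    pointer_file.write_bytes(
        LFS_POINTER_PREFIX + b"\noid sha256:" + b"0" * 64 + b"\nsize 123\n"
    )
    real_file = Path(tmp) / "scaler.pkl"
    real_file.write_bytes(b"\x80\x04 stand-in binary payload")

    pointer_detected = is_lfs_pointer(pointer_file)
    real_detected = is_lfs_pointer(real_file)

print(pointer_detected, real_detected)  # True False
```

A check like this could run before the model is loaded, so that a missing git lfs pull produces a clear message rather than a pickle error.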
Section Contents
- Data: Training and evaluation datasets extracted from the Suricata IDS. (Explore Data)
- Models: Pre-trained model objects and evaluation charts saved during the last training run. (Explore Models)
- Notebook: Step-by-step walkthrough of the full ML pipeline: data loading, feature engineering, training, and evaluation. (Explore Notebook)
- Real-Time Detector: How the live detector works: main loop, threading, terminal UI, and the ML inference pipeline. (Explore Detector)
Setup & Usage
Before using the ML module (both for training and real-time detection), the required Python environment must be configured.
See the ML Environment Setup for full instructions on installing Python, creating the virtual environment, and installing the dependencies listed in requirements.txt.
To launch the real-time detector, use the run.sh main menu or the dedicated ML scripts. See Scripts - ML for more details.