Data
The ml/data/ directory contains the two datasets used for training and evaluating the Isolation Forest model. Both files are Suricata eve.json logs captured directly from the virtual laboratory.
ml/data/
├── attacks.json # Network traffic captured during attack session
└── benign.json # Network traffic captured during normal session
Git LFS
Both .json files are tracked via Git Large File Storage (LFS) due to their size. After cloning, run git lfs pull to download them if they are not present.
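A quick way to tell whether the real data has been downloaded is to check if a file is still an LFS pointer stub. The sketch below relies on the fact that Git LFS pointer files begin with a fixed `version https://git-lfs.github.com/spec/v1` line; the helper name `is_lfs_pointer` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is still a Git LFS pointer stub, not the real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


if __name__ == "__main__":
    for name in ("benign.json", "attacks.json"):
        dataset = Path("ml/data") / name
        if not dataset.exists():
            print(f"{dataset}: missing")
        elif is_lfs_pointer(dataset):
            print(f"{dataset}: LFS pointer only, run `git lfs pull`")
        else:
            print(f"{dataset}: OK")
```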
benign.json
| Property | Value |
|---|---|
| Purpose | Train the model |
| Events | ~240 records |
| Source | Normal lab traffic (no attacks running) |
| Event types | dns, flow, http, fileinfo, anomaly, smtp |
This file is the sole input used for training. Because Isolation Forest is an unsupervised anomaly detection algorithm, it needs no labels: it only has to see examples of normal behaviour to learn a baseline. Any traffic that deviates significantly from that baseline is later flagged as suspicious.
Why only benign data for training?
The attacks.json file contains around 300,000 events, the vast majority of which are flow events generated by the SYN flood. Training on that data would teach the model that DoS floods are normal. By training exclusively on the ~240 benign events, the model learns what legitimate traffic looks like; everything else becomes suspect.
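The idea can be sketched with scikit-learn's `IsolationForest`. This is not the project's notebook code: the three feature columns, the synthetic benign values, and the hyperparameters are all assumptions chosen for illustration. Only the flood-like row borrows its values (2 packets, 128 bytes, age 0) from the example flow event in this document.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for ~240 benign flows (NOT the real benign.json).
# Assumed columns: flow_pkts_toserver, flow_bytes_toserver, flow_age
benign = np.column_stack([
    rng.integers(5, 40, size=240),
    rng.integers(500, 20000, size=240),
    rng.integers(1, 60, size=240),
]).astype(float)

# Fit on benign traffic only; no labels are involved.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(benign)

# A flow shaped like a SYN flood record: 2 packets, 128 bytes, age 0.
flood_like = np.array([[2.0, 128.0, 0.0]])
typical = np.median(benign, axis=0, keepdims=True)

# score_samples: the lower the score, the more anomalous the sample.
flood_score = model.score_samples(flood_like)[0]
typical_score = model.score_samples(typical)[0]
print(f"flood-like: {flood_score:.3f}  typical benign: {typical_score:.3f}")
```

The flood-like flow lies outside the benign range on every feature, so it is isolated in fewer splits and receives a lower score than a typical benign flow. `model.predict` turns the scores into labels: `1` for inliers and `-1` for anomalies.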
attacks.json
| Property | Value |
|---|---|
| Purpose | Evaluate the trained model |
| Events | ~300,000 records |
| Source | Lab traffic during simulated attacks |
| Attacks | TCP SYN flood (DoS), port scan (nmap), SSH brute force (hydra), Slowloris DoS, SMTP recon and relay abuse |
The dominant event type is flow, representing the individual TCP connections generated by the SYN flood. The remaining events (DNS, HTTP, SMTP, SSH, TLS, and fileinfo) correspond to the other attack types.
Attack Event Breakdown
| Event type | Count | Origin |
|---|---|---|
| flow | 300,000 | SYN flood DoS (hping3) |
| dns | 100 | Port scan / general DNS activity |
| http | 50 | Port scan / HTTP requests |
| fileinfo | 50 | File transfer events |
| anomaly | 25 | Protocol anomalies (Suricata) |
| ssh | 25 | SSH brute force (hydra) |
| smtp | 15 | SMTP reconnaissance |
| tls | 5 | TLS events |
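A breakdown like the one above can be recomputed from the raw log with the standard library alone. The sketch below assumes the file holds one JSON object per line, as Suricata writes it; the function name `count_event_types` and the three sample events are illustrative, not taken from the repository or the dataset.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Count Suricata events per event_type from an iterable of JSON Lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("event_type", "unknown")] += 1
    return counts


# Tiny made-up sample standing in for the real log.
sample = [
    '{"event_type": "flow", "proto": "TCP"}',
    '{"event_type": "flow", "proto": "TCP"}',
    '{"event_type": "dns", "proto": "UDP"}',
]
print(count_event_types(sample).most_common())
# [('flow', 2), ('dns', 1)]
```

Because the function accepts any iterable of lines, an open file handle for ml/data/attacks.json can be passed directly, which streams the log instead of loading it into memory at once.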
Data Origin
Both datasets were collected from the logwatch node inside the Containerlab topology. Suricata inspects all mirrored traffic and writes structured JSON events to:
The files were extracted manually from the container after running the respective traffic generation scripts:
- Dataset generation | Attacks only: uses some of the attack scripts found in scripts/attacks/.
- Dataset generation | Benign traffic only: normal lab activity (benign browsing, DNS queries, mail).
Event Format
Suricata writes one JSON object per line (JSON Lines format). Each event always contains a set of common fields, and then additional nested fields depending on the event type.
Example flow event (before processing, pretty-printed here for readability; in the file it occupies a single line):
{
  "timestamp": "2026-04-10T14:03:15.502440+0000",
  "flow_id": 1099387108089453,
  "in_iface": "eth1",
  "event_type": "flow",
  "src_ip": "10.0.0.2",
  "src_port": 38018,
  "dest_ip": "192.168.10.10",
  "dest_port": 22,
  "proto": "TCP",
  "flow": {
    "pkts_toserver": 2,
    "pkts_toclient": 0,
    "bytes_toserver": 128,
    "bytes_toclient": 0,
    "start": "2026-04-10T13:58:36.548461+0000",
    "end": "2026-04-10T13:58:36.549139+0000",
    "age": 0,
    "state": "new",
    "reason": "timeout",
    "alerted": false
  },
  "community_id": "1:T/hpCH2HdAaBxpHOZeIqbBSQwqo=",
  "tcp": {
    "tcp_flags": "06",
    "tcp_flags_ts": "06",
    "tcp_flags_tc": "00",
    "syn": true,
    "rst": true,
    "state": "syn_sent"
  }
}
After flattening (see Notebook – Step 1), the nested keys become flat column names: flow_pkts_toserver, flow_age, tcp_syn, etc. Fields that do not apply to a given event type become NaN and are later filled with 0.
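The flattening step can be approximated as follows. This is a minimal standard-library sketch, not the notebook's actual code: the helpers `flatten_event` and `to_rows` are hypothetical, the two events are trimmed-down illustrations, and missing fields are filled with 0 directly instead of passing through NaN first.

```python
def flatten_event(event, parent_key="", sep="_"):
    """Recursively turn nested dicts into flat keys such as flow_pkts_toserver."""
    flat = {}
    for key, value in event.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_event(value, column, sep))
        else:
            flat[column] = value
    return flat


def to_rows(events, fill=0):
    """Flatten all events and give every row the same columns, filling gaps."""
    flat_events = [flatten_event(e) for e in events]
    columns = sorted({c for e in flat_events for c in e})
    return [{c: e.get(c, fill) for c in columns} for e in flat_events]


# Two illustrative events of different types.
events = [
    {"event_type": "flow", "flow": {"pkts_toserver": 2, "age": 0}, "tcp": {"syn": True}},
    {"event_type": "dns", "dns": {"rrname": "example.com"}},
]
rows = to_rows(events)
print(rows[0]["flow_pkts_toserver"])  # 2
print(rows[1]["flow_pkts_toserver"])  # 0 (field does not apply to dns events)
```

With pandas available, `pandas.json_normalize(events, sep="_")` yields the same column names and leaves missing fields as NaN, which matches the two-stage description above more closely.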