Data

The ml/data/ directory contains the two datasets used for training and evaluating the Isolation Forest model. Both files are Suricata eve.json logs captured directly from the virtual laboratory.

ml/data/
├── attacks.json   # Network traffic captured during the attack session
└── benign.json    # Network traffic captured during the normal session

Git LFS

Both .json files are tracked via Git Large File Storage (LFS) due to their size. After cloning, run git lfs pull to download them if they are missing or contain only small LFS pointer stubs.
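
A quick way to tell whether the real data has been downloaded is to look at the first bytes of each file: a Git LFS pointer stub starts with a fixed version line. The helper below is a minimal sketch; the function names and the default directory argument are illustrative and not part of the repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file (per the Git LFS pointer spec).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` holds an LFS pointer stub instead of real data."""
    with Path(path).open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def check_datasets(data_dir="ml/data", names=("benign.json", "attacks.json")):
    """Map each dataset name to 'missing', 'pointer', or 'ok' (hypothetical helper)."""
    status = {}
    for name in names:
        path = Path(data_dir) / name
        if not path.exists():
            status[name] = "missing"
        elif is_lfs_pointer(path):
            status[name] = "pointer"
        else:
            status[name] = "ok"
    return status
```

If check_datasets() reports "missing" or "pointer" for either file, git lfs pull should fetch the actual content.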

benign.json

Property	Value
Purpose	Train the model
Events	~240 records
Source	Normal lab traffic (no attacks running)
Event types	`dns`, `flow`, `http`, `fileinfo`, `anomaly`, `smtp`

This file is the sole input used for training. Isolation Forest is an unsupervised anomaly detection algorithm, so it needs no labels: it is fitted on examples of normal behaviour only, and any traffic that deviates significantly from that baseline is later flagged as suspicious.

Why only benign data for training?

The attacks.json file contains around 300,000 events, the vast majority of which are SYN flood flow records. Isolation Forest treats points that are rare and easy to isolate as anomalies, so training on data dominated by the flood would teach the model that DoS floods are normal. By training exclusively on the ~240 benign events, the model learns what legitimate traffic looks like; everything else becomes suspect.
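
The effect of fitting on benign traffic only can be sketched with scikit-learn's IsolationForest. Everything below is an assumption for illustration: the data is synthetic, the three feature columns are hypothetical, and the hyperparameters are not taken from the project's notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for ~240 benign flows (NOT the real dataset).
# Hypothetical columns: pkts_toserver, bytes_toserver, age in seconds.
benign = np.column_stack([
    rng.uniform(10, 50, size=240),
    rng.uniform(2_000, 20_000, size=240),
    rng.uniform(5, 60, size=240),
])

# One SYN-flood-like flow: two tiny packets, zero age.
flood = np.array([[2.0, 128.0, 0.0]])

# Fit on benign traffic only; the flood never appears during training.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(benign)

flood_label = model.predict(flood)[0]            # -1 = anomaly, 1 = normal
flood_score = model.decision_function(flood)[0]  # negative = anomalous
benign_scores = model.decision_function(benign)
```

Because the flood-like flow lies outside the range of every benign feature, it is isolated after few splits and receives a lower score than typical benign flows.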

attacks.json

Property	Value
Purpose	Evaluate the trained model
Events	~300,000 records
Source	Lab traffic during simulated attacks
Attacks	TCP SYN flood (DoS), port scan (nmap), SSH brute force (hydra), Slowloris DoS, SMTP recon and relay abuse

The dominant event type is flow: each record represents an individual TCP connection generated by the SYN flood. The smaller portion (DNS, HTTP, SMTP, SSH, TLS, and fileinfo events) corresponds to the other attack types.

Attack Event Breakdown

Event type	Count	Origin
`flow`	300,000	SYN flood DoS (hping3)
`dns`	100	Port scan / general DNS activity
`http`	50	Port scan / HTTP requests
`fileinfo`	50	File transfer events
`anomaly`	25	Protocol anomalies (Suricata)
`ssh`	25	SSH brute force (hydra)
`smtp`	15	SMTP reconnaissance
`tls`	5	TLS events
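
A breakdown like the one above can be reproduced by tallying the event_type field line by line. The sketch below uses only the standard library; count_event_types is a hypothetical helper name, and the inline sample stands in for the real file.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Tally `event_type` over an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("event_type", "unknown")] += 1
    return counts


# Tiny inline sample; real usage would pass an open file instead, e.g.
#   with open("ml/data/attacks.json") as fh: count_event_types(fh)
sample = [
    '{"event_type": "flow", "proto": "TCP"}',
    '{"event_type": "flow", "proto": "TCP"}',
    '{"event_type": "ssh", "proto": "TCP"}',
    '',
]
sample_counts = count_event_types(sample)
```

Iterating over the file handle keeps memory use flat, which matters for a file with hundreds of thousands of events.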

Data Origin

Both datasets were collected from the logwatch node inside the Containerlab topology. Suricata inspects all mirrored traffic and writes structured JSON events to:

/var/log/suricata/eve.json

The files were extracted manually from the container after running the respective traffic generation scripts:

[-] Dataset generation | Attacks only - Uses some of the attack scripts found in scripts/attacks/.
[+] Dataset generation | Benign traffic only - Normal lab activity (benign browsing, DNS queries, mail).

Event Format

Suricata writes one JSON object per line (JSON Lines format). Every event carries a set of common fields (such as timestamp and event_type), plus nested fields that depend on the event type.

Example flow event (before processing):

{
   "timestamp":"2026-04-10T14:03:15.502440+0000",
   "flow_id":1099387108089453,
   "in_iface":"eth1",
   "event_type":"flow",
   "src_ip":"10.0.0.2",
   "src_port":38018,
   "dest_ip":"192.168.10.10",
   "dest_port":22,
   "proto":"TCP",
   "flow":{
      "pkts_toserver":2,
      "pkts_toclient":0,
      "bytes_toserver":128,
      "bytes_toclient":0,
      "start":"2026-04-10T13:58:36.548461+0000",
      "end":"2026-04-10T13:58:36.549139+0000",
      "age":0,
      "state":"new",
      "reason":"timeout",
      "alerted":false
   },
   "community_id":"1:T/hpCH2HdAaBxpHOZeIqbBSQwqo=",
   "tcp":{
      "tcp_flags":"06",
      "tcp_flags_ts":"06",
      "tcp_flags_tc":"00",
      "syn":true,
      "rst":true,
      "state":"syn_sent"
   }
}

After flattening (see Notebook – Step 1), each nested key is prefixed with its parent key to form a flat column name: flow_pkts_toserver, flow_age, tcp_syn, etc. Fields that do not apply to a given event type become NaN and are later filled with 0.
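
The flattening step can be sketched in plain Python as follows. This is not the notebook's code: flatten and to_rows are hypothetical helpers, the sample events are made up, and missing fields are filled with 0 directly rather than passing through NaN first. With pandas, pandas.json_normalize(events, sep="_") produces the same column names.

```python
def flatten(event, parent="", sep="_"):
    """Flatten nested dicts: {"flow": {"age": 0}} -> {"flow_age": 0}."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def to_rows(events, fill=0):
    """Flatten events and give every row the same columns, filling gaps."""
    flat_events = [flatten(e) for e in events]
    columns = sorted({name for row in flat_events for name in row})
    return [{col: row.get(col, fill) for col in columns} for row in flat_events]


# Illustrative events, trimmed to a few fields each.
events = [
    {"event_type": "flow", "flow": {"pkts_toserver": 2, "age": 0}, "tcp": {"syn": True}},
    {"event_type": "dns", "dns": {"rrname": "example.com"}},
]
rows = to_rows(events)
```

Here the dns row gains flow_pkts_toserver, flow_age, and tcp_syn columns set to 0, and the flow row gains dns_rrname set to 0, so every row ends up with the same set of columns.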