Data

The ml/data/ directory contains the two datasets used for training and evaluating the Isolation Forest model. Both files are Suricata eve.json logs captured directly from the virtual laboratory.

ml/data/
├── attacks.json   # Network traffic captured during attack session
└── benign.json    # Network traffic captured during normal session

Git LFS

Both .json files are tracked via Git Large File Storage (LFS) due to their size. After cloning, run git lfs pull to download them if they are not present.
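When LFS content has not been pulled, the .json files still exist but hold only a small text pointer instead of the events. The following is a minimal sketch for detecting that case. It assumes the standard Git LFS pointer format, whose first line starts with "version https://git-lfs.github.com/spec/v1". The helper name is_lfs_pointer is hypothetical and not part of the project.

```python
from pathlib import Path

# First line of a Git LFS pointer file (standard pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` holds an LFS pointer instead of real data."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


if __name__ == "__main__":
    for name in ("ml/data/benign.json", "ml/data/attacks.json"):
        p = Path(name)
        if not p.exists():
            print(f"{name}: missing")
        elif is_lfs_pointer(p):
            print(f"{name}: LFS pointer only, run `git lfs pull`")
        else:
            print(f"{name}: data present")
```

Checking the first bytes avoids loading a 300,000-event file just to find out whether it is real data.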


benign.json

Property     Value
Purpose      Train the model
Events       ~240 records
Source       Normal lab traffic (no attacks running)
Event types  dns, flow, http, fileinfo, anomaly, smtp

This file is the sole input used for training. Isolation Forest is an unsupervised anomaly detection algorithm, so it needs no labels: fitting it on normal traffic alone gives it a baseline of normal behaviour. Any traffic that deviates significantly from that baseline is later flagged as suspicious.

Why only benign data for training?

The attacks.json file contains around 300,000 events, the vast majority of which are flow events generated by the SYN flood. Training on that data would teach the model that DoS floods are normal. By training exclusively on the ~240 benign events, the model learns what legitimate traffic looks like; everything else becomes suspect.
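The idea of fitting on benign traffic only can be sketched as follows. This is an illustration under stated assumptions, not the project's training code. It assumes scikit-learn's IsolationForest as the implementation, uses synthetic arrays in place of the real flattened features, and the hyperparameters shown are placeholders.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-ins for flattened features (NOT the real datasets).
# The two columns play the role of e.g. packets and bytes sent to the server.
benign = rng.normal(loc=[20.0, 4000.0], scale=[5.0, 800.0], size=(240, 2))
flood = rng.normal(loc=[2.0, 128.0], scale=[0.2, 5.0], size=(1000, 2))

# Fit on benign traffic only, so the trees model the normal baseline.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(benign)

# score_samples: lower means more anomalous. predict: -1 anomaly, 1 normal.
benign_scores = model.score_samples(benign)
flood_scores = model.score_samples(flood)
flood_labels = model.predict(flood)

print("mean benign score:", benign_scores.mean())
print("mean flood score: ", flood_scores.mean())
print("flood flagged:    ", (flood_labels == -1).mean())
```

Because the flood-like points sit far outside the region covered by the training data, they are isolated in fewer splits and receive lower scores than the benign points.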


attacks.json

Property  Value
Purpose   Evaluate the trained model
Events    ~300,000 records
Source    Lab traffic during simulated attacks
Attacks   TCP SYN flood (DoS), port scan (nmap), SSH brute force (hydra), Slowloris DoS, SMTP recon and relay abuse

The dominant event type is flow, representing the individual TCP connections generated by the SYN flood. The much smaller remainder (DNS, HTTP, SMTP, SSH, TLS and fileinfo events) corresponds to the other attack types.

Attack Event Breakdown

Event type  Count    Origin
flow        300,000  SYN flood DoS (hping3)
dns         100      Port scan / general DNS activity
http        50       Port scan / HTTP requests
fileinfo    50       File transfer events
anomaly     25       Protocol anomalies (Suricata)
ssh         25       SSH brute force (hydra)
smtp        15       SMTP reconnaissance
tls         5        TLS events
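To make the imbalance concrete, the short calculation here turns the counts from the breakdown table into percentages. The counts are the approximate figures from the table, not values recomputed from the file.

```python
# Approximate counts copied from the breakdown table.
attack_counts = {
    "flow": 300_000,
    "dns": 100,
    "http": 50,
    "fileinfo": 50,
    "anomaly": 25,
    "ssh": 25,
    "smtp": 15,
    "tls": 5,
}

total = sum(attack_counts.values())
for event_type, count in attack_counts.items():
    share = 100.0 * count / total
    print(f"{event_type:<9}{count:>8}  {share:7.3f}%")

flow_share = attack_counts["flow"] / total
print(f"flow share of all events: {flow_share:.2%}")
```

With these figures, flow events account for about 99.9% of attacks.json, which is why a model fitted on that file would treat flood traffic as the norm.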

Data Origin

Both datasets were collected from the logwatch node inside the Containerlab topology. Suricata inspects all mirrored traffic and writes structured JSON events to:

/var/log/suricata/eve.json

The files were extracted manually from the container after running the respective traffic generation scripts:

  • [-] Dataset generation | Attacks only - Uses some of the attack scripts found in scripts/attacks/.
  • [+] Dataset generation | Benign traffic only - Normal lab activity (benign browsing, DNS queries, mail).

Event Format

Suricata writes one JSON object per line (JSON Lines format). Every event contains a set of common fields, such as timestamp and event_type, plus nested fields that depend on the event type.
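Because each line is an independent JSON object, a file can be read line by line without loading it whole. The following is a minimal standard-library sketch. The helper names iter_events and count_event_types are hypothetical and are not taken from the project's notebook.

```python
import json
from collections import Counter


def iter_events(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_event_types(path):
    """Count events per `event_type` value."""
    return Counter(ev.get("event_type", "unknown") for ev in iter_events(path))
```

For example, count_event_types("ml/data/benign.json").most_common() would list the event types in the benign capture, most frequent first.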

Example flow event (before processing):

{
   "timestamp":"2026-04-10T14:03:15.502440+0000",
   "flow_id":1099387108089453,
   "in_iface":"eth1",
   "event_type":"flow",
   "src_ip":"10.0.0.2",
   "src_port":38018,
   "dest_ip":"192.168.10.10",
   "dest_port":22,
   "proto":"TCP",
   "flow":{
      "pkts_toserver":2,
      "pkts_toclient":0,
      "bytes_toserver":128,
      "bytes_toclient":0,
      "start":"2026-04-10T13:58:36.548461+0000",
      "end":"2026-04-10T13:58:36.549139+0000",
      "age":0,
      "state":"new",
      "reason":"timeout",
      "alerted":false
   },
   "community_id":"1:T/hpCH2HdAaBxpHOZeIqbBSQwqo=",
   "tcp":{
      "tcp_flags":"06",
      "tcp_flags_ts":"06",
      "tcp_flags_tc":"00",
      "syn":true,
      "rst":true,
      "state":"syn_sent"
   }
}

After flattening (see Notebook – Step 1), the nested keys become flat column names: flow_pkts_toserver, flow_age, tcp_syn, etc. Fields that do not apply to a given event type become NaN and are later filled with 0.
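The flattening rule can be sketched with the standard library alone. This is an illustration of the naming scheme, not the notebook's actual code, which may rely on a library helper such as pandas.json_normalize with sep="_". The functions flatten and to_rows are hypothetical, the two sample events are made up, and the sketch fills missing columns with 0 directly instead of passing through NaN first.

```python
def flatten(event, parent_key="", sep="_"):
    """Flatten nested dicts: {"flow": {"age": 0}} -> {"flow_age": 0}."""
    flat = {}
    for key, value in event.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def to_rows(events, fill=0):
    """Flatten events and fill columns missing from an event with `fill`."""
    flat_events = [flatten(ev) for ev in events]
    columns = sorted({col for ev in flat_events for col in ev})
    return [{col: ev.get(col, fill) for col in columns} for ev in flat_events]


# Made-up sample events, trimmed for brevity.
events = [
    {"event_type": "flow", "flow": {"pkts_toserver": 2, "age": 0}, "tcp": {"syn": True}},
    {"event_type": "dns", "dns": {"type": "query"}},
]
rows = to_rows(events)
print(rows[1]["flow_pkts_toserver"])  # 0, because the dns event has no flow block
print(rows[0]["tcp_syn"])             # True
```

Every row ends up with the same set of columns, which is the shape a tabular model input needs.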