Task 4: Safety Shield Training In this task, you will train a multi-class neural network classifier to predict risk levels for state-action pairs. The dataset created in [Task 3] contains all necessary features and labels in a ready-to-use format.

November 19, 2025 · 28 min read · By adminPro

    COMP3411/COMP9814 Artificial Intelligence
    Assignment 2: Safe Interactive Reinforcement Learning
    Term 3, 2025
    Due: Friday, 14 November 2025, 5:00 PM AEST
    Worth: 21 marks + 4 marks tutorial participation (25% of final grade)
    1 Introduction
    This assignment explores the critical challenge of safe reinforcement learning with human safety interventions. As reinforcement learning agents are increasingly deployed in safety-critical applications, from autonomous vehicles to medical treatment recommendation systems, ensuring their safety during learning becomes paramount [5]. Unlike traditional RL where agents learn
    through trial and error (including catastrophic errors), safe RL requires agents to learn optimal behaviour whilst maintaining safety constraints at all times. You will build and evaluate a Safe
    Interactive RL system where a human monitor observes an agent’s behaviour and intervenes when the agent is about to take an unsafe action. The agent must learn both to complete its
    task (maximise rewards) and to predict which actions are unsafe (learn a safety shield) from sparse human intervention signals. Starting with a baseline unsafe agent and progressing to
    sophisticated safety-aware systems, you will gain hands-on experience with constrained reinforcement learning, multi-class risk prediction, and the fundamental trade-offs between task
    performance, safety, and intervention efficiency.
    2 Background
    2.1 The Safe Reinforcement Learning Problem
    Reinforcement learning has achieved remarkable success in domains ranging from game playing
    to robotics. However, standard RL algorithms optimise purely for task reward without explicit
    safety considerations. This “reward maximisation at all costs” approach can lead to catastrophic
    failures when agents are deployed in real-world safety-critical applications. Consider an
    autonomous vehicle learning to drive. A standard RL agent might learn that speeding through
    intersections maximises efficiency (reaches destination faster), only discovering the danger when
    a collision occurs. In safety-critical domains, such catastrophic exploration is unacceptable: we
    cannot allow the agent to “learn from mistakes” when those mistakes cause serious harm.
    2.2 Human-in-the-Loop Safety
    One promising approach to safe RL is human-in-the-loop learning, where a human monitor
    observes the agent’s behaviour and provides safety guidance [4]. In the intervention-based
    paradigm, the human does not provide continuous supervision but instead intervenes only when
    the agent is about to violate safety constraints. These interventions serve dual purposes: immediate
    prevention (stop the unsafe action from occurring) and learning signal (provide training
    data for the agent to learn what “unsafe” means). The key challenge is safety constraint
    generalisation: after observing a few human interventions at specific state-action pairs, the agent
    must learn to predict which other state-action pairs are also unsafe. This is particularly difficult due to sparse data (human interventions are hopefully infrequent), class imbalance (most
    actions are safe, few are unsafe), high stakes (false negatives, i.e. missed unsafe actions, can be catastrophic), and credit assignment (human reaction delays make it unclear which past action
    triggered the intervention).
    2.3 Safety Shields and Constrained MDPs
    A safety shield is a learned component that predicts whether a proposed action is safe before execution [1]. Formally, a shield is a function $\hat{S} : S \times A \to [0, 1]$ that outputs the probability that state-action pair $(s, a)$ is unsafe. The safe RL problem can be formulated as a Constrained Markov Decision Process (CMDP) [2]. In a standard MDP, we maximise $\mathbb{E}\left[\sum_t \gamma^t R(s_t, a_t)\right]$.
    In a CMDP, we maximise $\mathbb{E}\left[\sum_t \gamma^t R(s_t, a_t)\right]$ subject to $C(s_t, a_t) \le \delta$, where $C$ is a cost function
    representing safety violations, and $\delta$ is the safety budget (ideally zero). In this assignment, you
    will learn the cost function $C$ (via the safety shield) from human interventions, rather than
    specifying it manually.
    3 Assignment Specification
    3.1 Overview
    This assignment builds a complete safe interactive reinforcement learning system through five progressive tasks. First, you will implement a grid-world environment supporting both fixed and random start position modes [Task 1]. Second, you will train a baseline Q-learning agent and
    experimentally compare two step penalty configurations to understand reward shaping [Task 2]. Third, you will create a complete training dataset with all four risk classes through systematic path discovery, feature extraction, and stratified train/validation/test splitting [Task 3]. Fourth, you will train a multi-class neural network shield that predicts risk levels across
    four safety categories using the dataset from [Task 3] [Task 4]. Finally, you will integrate the shield with your RL agent to achieve safe learning [Task 5]. Throughout the assignment, you will conduct hands-on experiments with three critical design choices (step penalty magnitude, start
    position strategy, and risk threshold), analysing how each affects learning dynamics, task performance, and safety guarantees. These experiments will reveal important insights about the
    interplay between exploration, safety, and generalisation in reinforcement learning. Section 4 provides detailed specifications for each task.
    3.2 Evaluation Metrics
    Your system will be evaluated using multiple metrics across three dimensions:
    Safety Metrics: Total safety violations during training must be zero for a successful safe RL system. False negative rate of the shield on test interventions is the critical safety metric,
    indicating how often the shield fails to detect genuinely unsafe actions. Safety violation rate
    when the shield is disabled demonstrates the shield’s effectiveness by showing baseline unsafe
    behaviour.
    Task Performance Metrics: Average episode reward over the last 25% of training episodes
    gauges overall performance. Success rate measures the percentage of episodes reaching the
    goal. Average episode length indicates policy efficiency, with shorter paths being more efficient.
    Training speed in episodes per second demonstrates computational efficiency.
    Intervention Efficiency Metrics: Total interventions over the entire training period provides
    overall intervention count. Interventions in the last 100 episodes should decrease to near zero
    as the agent learns safe behaviour. Average interventions per episode tracked over time reveals
    the learning trajectory of safety constraint acquisition.
    These metrics will be computed at appropriate stages throughout the implementation tasks,
    with detailed evaluation protocols provided in each task specification.
    4 Implementation Tasks
    4.1 Task 1: Safe Grid World Environment Setup
    Begin by implementing the safe grid-world environment from scratch. The environment must support standard gym-like methods: reset() returns the initial state, step(action) executes an action and returns the next state, reward, and done flag, and render() visualises the current state. You must use the following exact configuration to ensure consistency across all student
    submissions:
    Environment Configuration:
    • Grid size: 10 × 10
    • Start position: (0, 0) (top-left corner)
    • Goal position: (9, 9) (bottom-right corner)
    • Hazard cells (15 total): (0, 3), (1, 1), (1, 7), (2, 4), (2, 8), (3, 2), (3, 6), (4, 5), (5, 3), (5, 8), (6, 1), (6, 6), (7, 4), (8, 2), (8, 7)
    • Walls: Grid boundaries only (no internal walls)
    • Action space: UP, DOWN, LEFT, RIGHT (4 discrete actions)
    • Reward structure: +10 for reaching goal, −0.1 per time step (step penalty), −10 for entering hazard (episode terminates immediately)
    Figure 1: Safe Grid World environment configuration showing the 10×10 grid with start position (S, blue, top-left), goal position (G, green, bottom-right), and 15 hazard cells (red) strategically distributed throughout the grid. Safe cells are shown in white.
    [Figure 1 plot: "Safe Grid World Environment"; axes: Column Index (0–9) and Row Index (0–9); legend: Start Position (0,0), Goal Position (9,9), Hazard Cells (n=15), Safe Cells (n=84).]
    This configuration strategically distributes hazards throughout the grid to test safety shield
    generalisation and safety-aware navigation. The distributed placement prevents the agent from
    learning simple avoidance rules (such as “avoid certain columns”) and instead requires learning
    the true underlying safety constraints whilst maintaining multiple solvable paths to the goal.
    Implement collision detection for walls (agent stays in place if attempting to move out of bounds)
    and hazard detection (episode terminates immediately upon entering a hazard cell).
    Random Start Position (Required Feature). Your environment must support two start position
    modes controlled by a boolean parameter in the environment constructor. In Fixed Start Mode
    (default), every episode begins at position (0, 0), matching the standard configuration above. In
    Random Start Mode, each episode begins at a randomly sampled safe position (any cell that is
    not a hazard and not the goal), ensuring broader state-space exploration during training.
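The configuration above could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the class name `SafeGridWorld`, the integer action encoding, the `(row, col)` state tuples, and the constructor arguments `random_start`, `step_penalty` and `seed` are assumptions of this sketch (the first two names are inferred from the settings quoted in Task 2). It also assumes that the terminal rewards (+10, −10) replace the step penalty on the final step, which the specification does not state explicitly.

```python
import random

# Action encoding is an assumption of this sketch; the spec only names the actions.
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
MOVES = {UP: (-1, 0), DOWN: (1, 0), LEFT: (0, -1), RIGHT: (0, 1)}  # (d_row, d_col)

HAZARDS = frozenset({
    (0, 3), (1, 1), (1, 7), (2, 4), (2, 8), (3, 2), (3, 6), (4, 5),
    (5, 3), (5, 8), (6, 1), (6, 6), (7, 4), (8, 2), (8, 7),
})


class SafeGridWorld:
    """10x10 grid world with hazards; states are (row, col) tuples."""

    def __init__(self, random_start=False, step_penalty=-0.1, seed=None):
        self.size = 10
        self.start = (0, 0)
        self.goal = (9, 9)
        self.hazards = HAZARDS
        self.random_start = random_start
        self.step_penalty = step_penalty
        self.rng = random.Random(seed)
        # Valid random starts: every cell that is neither a hazard nor the goal.
        self.safe_cells = [
            (r, c) for r in range(self.size) for c in range(self.size)
            if (r, c) not in self.hazards and (r, c) != self.goal
        ]
        self.state = self.start

    def reset(self):
        """Start a new episode and return the initial state."""
        if self.random_start:
            self.state = self.rng.choice(self.safe_cells)
        else:
            self.state = self.start
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        d_row, d_col = MOVES[action]
        row, col = self.state[0] + d_row, self.state[1] + d_col
        # Wall collision: the agent stays in place if the move leaves the grid.
        if 0 <= row < self.size and 0 <= col < self.size:
            self.state = (row, col)
        if self.state in self.hazards:
            return self.state, -10.0, True   # hazard ends the episode
        if self.state == self.goal:
            return self.state, 10.0, True    # assumption: no step penalty on goal step
        return self.state, self.step_penalty, False

    def render(self):
        """Print the grid: A = agent, G = goal, X = hazard, . = safe cell."""
        for r in range(self.size):
            row = []
            for c in range(self.size):
                cell = (r, c)
                if cell == self.state:
                    row.append("A")
                elif cell == self.goal:
                    row.append("G")
                elif cell in self.hazards:
                    row.append("X")
                else:
                    row.append(".")
            print(" ".join(row))
```

With this sketch, `SafeGridWorld(random_start=True, seed=0).reset()` would return one of the 84 cells that are neither hazards nor the goal.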
    4.2 Task 2: Q-Learning Baseline Agent
    Implement a tabular Q-learning agent as the baseline for safe RL. Tabular Q-learning is a natural choice for this discrete 10×10 grid (100 states, 4 actions) because the state space is small enough to
    store exact Q-values for every state-action pair in a table. This baseline demonstrates why safety
    mechanisms are essential: the agent will frequently violate safety constraints during exploration
    as it learns the optimal policy.
    Q-Learning Algorithm. Maintain a Q-table $Q : S \times A \to \mathbb{R}$ that stores the expected return
    for each state-action pair. The standard Q-learning update rule is:
    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
    where $\alpha$ is the learning rate, $r$ is the immediate reward, $\gamma$ is the discount factor, and $\max_{a'} Q(s', a')$
    is the maximum Q-value for the next state (representing the best future return).
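As a rough illustration of the update rule, the sketch below keeps the Q-table in a plain dictionary (which can be saved with pickle). The helper names `q_values`, `epsilon_greedy` and `q_update` are hypothetical, and dropping the bootstrap term on terminal transitions is a common convention assumed here rather than something the specification states.

```python
import random

N_ACTIONS = 4  # UP, DOWN, LEFT, RIGHT


def q_values(q_table, state):
    """Return the list of Q-values for a state, creating zeros on first visit."""
    return q_table.setdefault(state, [0.0] * N_ACTIONS)


def epsilon_greedy(q_table, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q_values(q_table, state)
    return values.index(max(values))  # first maximal action on ties


def q_update(q_table, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    # Assumption: terminal transitions do not bootstrap from the next state.
    target = r if done else r + gamma * max(q_values(q_table, s_next))
    current = q_values(q_table, s)
    current[a] += alpha * (target - current[a])


q_table = {}
rng = random.Random(0)
action = epsilon_greedy(q_table, (0, 0), epsilon=1.0, rng=rng)
q_update(q_table, (0, 0), action, -0.1, (0, 1), done=False)
```

The default `alpha` and `gamma` match the baseline values in Table 1; the per-episode epsilon decay would be applied outside these helpers.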
    Hyperparameter Configuration. Select values from the following ranges (see Table 1):
    Parameter Range Baseline
    Learning rate (α) [0.05, 0.2] 0.1
    Discount factor (γ) [0.95, 0.999] 0.99
    Epsilon start (ϵstart) Fixed 1.0
    Epsilon min (ϵmin) [0.01, 0.05] 0.01
    Epsilon decay [0.99, 0.997] 0.995
    Training episodes (N) [1,500, 3,000] 2,000
    Max steps per episode Fixed 200
    Table 1: Hyperparameters for [Task 2] Q-learning baseline agent
    Evaluation Metrics. Compute and report the following metrics (Table 2), where $M = \lfloor 0.25 \times N \rfloor$ is the number of episodes in the last 25% of training.
    Metric | Formula | Description
    Performance Metrics (computed over last M episodes)
    Success Rate | $\frac{1}{M} \sum_{i=N-M+1}^{N} \mathbb{1}[\text{reward}_i > 0] \times 100\%$ | Percentage reaching goal
    Average Reward | $\bar{R} = \frac{1}{M} \sum_{i=N-M+1}^{N} R_i$ | Mean cumulative reward
    Episode Length | $\bar{L} = \frac{1}{M} \sum_{i=N-M+1}^{N} L_i$ | Mean steps
    Training Metrics (computed over all N episodes)
    Safety Violations | $V_{total} = \sum_{i=1}^{N} V_i$ | Total hazard entries
    Training Speed | $N / T_{total}$ | Episodes per second (eps/sec)
    Table 2: Evaluation metrics for [Task 2] Q-learning baseline agent
    where $\mathbb{1}[\cdot]$ is the indicator function (1 if condition true, 0 otherwise), $R_i$ is cumulative reward
    for episode $i$, $L_i$ is steps in episode $i$, $V_i$ is hazard entries in episode $i$, and $T_{total}$ is total training
    time in seconds.
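The formulas in Table 2 translate fairly directly into code. The following sketch assumes that per-episode rewards, lengths and hazard entries were logged into three equal-length lists; the function name `summarise_training` and the dictionary keys are hypothetical.

```python
import math


def summarise_training(rewards, lengths, violations, total_seconds):
    """Compute the Table 2 metrics from per-episode logs (needs N >= 4)."""
    n = len(rewards)
    m = math.floor(0.25 * n)            # M = floor(0.25 * N)
    last_rewards = rewards[n - m:]      # episodes N-M+1 .. N
    last_lengths = lengths[n - m:]
    return {
        # Success is defined by the spec as cumulative episode reward > 0.
        "success_rate_pct": 100.0 * sum(r > 0 for r in last_rewards) / m,
        "avg_reward": sum(last_rewards) / m,
        "avg_length": sum(last_lengths) / m,
        "total_violations": sum(violations),
        "episodes_per_sec": n / total_seconds,
    }


metrics = summarise_training(
    rewards=[-10, -10, -10, -10, 5, 5, -1, 8],
    lengths=[3, 3, 3, 3, 20, 20, 30, 18],
    violations=[1, 1, 1, 1, 0, 0, 0, 0],
    total_seconds=2.0,
)
print(metrics)
```

Because success is defined as cumulative reward above zero, a harsh step penalty can register a goal-reaching episode as a failure: a path of 18 or more steps at −1.0 per step outweighs the +10 goal reward.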
    Required Visualisations. Generate the following plots using 100-episode sliding window
    smoothing where appropriate: Training Reward Curve – episode number vs smoothed episode
    reward; Episode Length Over Time – episode number vs smoothed episode length (steps);
    Cumulative Safety Violations – episode number vs cumulative sum of hazard entries; Success Rate
    Over Time – episode number vs success rate (%) with 100-episode rolling window.
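One way to prepare the plotted series is shown below. This is a sketch only: it uses a trailing window that averages over however many episodes are available at the start of training, which is an assumption (a centred or "valid-only" window would be equally defensible), and the helper names are hypothetical.

```python
from itertools import accumulate


def sliding_mean(values, window=100):
    """Trailing moving average; early points average the episodes seen so far."""
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def cumulative_violations(violations_per_episode):
    """Running total of hazard entries, one point per episode."""
    return list(accumulate(violations_per_episode))


def rolling_success_rate(rewards, window=100):
    """Percentage of episodes in the trailing window with reward > 0."""
    flags = [100.0 if r > 0 else 0.0 for r in rewards]
    return sliding_mean(flags, window)


print(sliding_mean([1, 2, 3, 4], window=2))
print(cumulative_violations([0, 1, 0, 2]))
```

Each returned list has one entry per episode, so it can be plotted directly against the episode index.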
    Step Penalty Experimentation (Required). Train the Q-learning agent with two different
    step penalty values using fixed start mode (random_start=False). This experiment explores
    reward shaping in safety-critical reinforcement learning [3]:
    1. Configuration 1: step_penalty=-1.0
    2. Configuration 2: step_penalty=-0.1
    Deliverables: For each configuration:
    • Compute and report all 5 metrics (success rate, average reward, episode length, total
    violations, training speed)
    • Generate all 4 plots (reward curve, episode length, cumulative violations, success rate).
    Figure 2 shows sample plots for reference.
    • Create a comparison table showing side-by-side metric differences
    [Figure 2 panels (SAMPLE), each comparing Penalty = -1.0 (harsh) with Penalty = -0.1 (gentle): (a) Training Reward Comparison, Reward (smoothed) vs Episode; (b) Episode Length Comparison, Steps (smoothed) vs Episode; (c) Safety Violations Comparison (No Shield Yet), Cumulative Violations vs Episode; (d) Success Rate Comparison (100-episode rolling window), Success Rate (%) vs Episode, with Penalty = -1.0 at 0%, Penalty = -0.1 at ~97%, and a target line at 100%.]
    Figure 2: Sample training plots comparing penalty=-1.0 (red) vs penalty=-0.1 (blue) over 2000 episodes.
    Shows the four required plots: training reward curve, episode length, cumulative safety violations, and
    success rate.
    Save the Q-table and metrics from the better-performing configuration to disk using pickle for
    later use in safety shield integration [Task 5]. The baseline agent will violate safety constraints
    frequently; this is expected and demonstrates why safety mechanisms are necessary. You will
    create a comprehensive safety dataset [Task 3], train a neural network safety shield [Task 4], and
    integrate the shield with your RL agent [Task 5] to prevent these violations while maintaining
    task performance.
    4.3 Task 3: Complete Dataset Creation
    In this task, you will create a complete training dataset for the safety shield classifier by systematically
    labelling all possible state-action pairs in the environment. Unlike reactive approaches
    that learn from observed safety violations, you will proactively construct a comprehensive safety
    dataset by computing the danger profile of every state in the grid. This exhaustive approach
    ensures the safety shield has complete knowledge of all possible situations it may encounter
    during deployment.
    The dataset creation pipeline consists of three algorithmic steps: (1) computing a global “danger
    map” that records the minimum steps to hazard for every state using multi-source BFS, (2)
    generating labelled samples by iterating through all state-action pairs and classifying them based
    on the danger map, and (3) feature extraction and train/validation/test splitting. The final
    deliverable is complete_dataset.pkl, a ready-to-use dataset containing 336 labelled samples
    with 10-dimensional feature vectors.
    Key Concept Definitions. To understand the dataset creation process, we first establish the
    following formal definitions:
    • Hazard Set: Let H ⊂ S be the set of 15 designated hazard states in the grid world. These are fixed grid positions that represent unsafe states which the agent must not enter.
    • Distance to Nearest Hazard: For any state s ∈ S, let d(s) denote the minimum number of actions required to reach the nearest hazard from s. For hazard states, d(h) = 0 for all h ∈ H.
    • Danger Map: A complete mapping D : S → N where D(s) = min{d(s, h) | h ∈ H} gives the minimum steps from state s to the nearest hazard, with d(s, h) denoting the number of steps from s to the individual hazard h. This is computed efficiently using multi-source BFS starting from all hazards simultaneously.
    • Risk Class: The safety classification of a state-action pair (s, a) based on the next state s′ = δ(s, a), where δ here denotes the environment's transition function (not the safety budget of Section 2.3): Class 0 if D(s′) = 0 (immediate hazard), Class 1 if D(s′) = 1 (1-step danger), Class 2 if D(s′) = 2 (2-step danger), or Class 3 if D(s′) ≥ 3 (safe).
    Algorithmic Insight – Multi-Source BFS: Rather than performing a separate BFS search from
    each individual state (a naive approach costing O(n² · (|V| + |E|)), i.e. one O(|V| + |E|) search for each of the n² states), you will
    compute the minimum steps to hazard for all states in a single, efficient graph traversal with
    O(|V| + |E|) complexity. Initialise a BFS queue with all 15 hazard positions simultaneously
    (multi-source BFS). As the search expands outward from these hazards, each state is labelled
    with its distance to the nearest hazard. This creates a complete “danger map” of the environment
    in one pass, which is then used to label all state-action pairs.
    Step 1: Compute Global Danger Map. Your first step is to compute the minimum number
    of steps from every state in the grid to the nearest hazard using a single multi-source
    Breadth-First Search. This creates a “danger map” that will be the foundation for labelling all
    state-action pairs.
    Multi-Source BFS Algorithm:
    Algorithm 1 ComputeDangerMap – Multi-source BFS to compute minimum steps to hazard for all states
    Require: Environment env with hazard set H
    Ensure: Danger map D : S → N where D(s) = minimum steps from state s to nearest hazard
        Initialise empty map D ← ∅
        Initialise empty queue Q ← ∅

        ▷ Initialise: All hazards have distance 0
        for each h ∈ H do
            D[h] ← 0
            Enqueue (h, 0) into Q
        end for

        ▷ Multi-source BFS: Expand outward from all hazards simultaneously
        while Q ≠ ∅ do
            (s_current, d) ← Dequeue from Q
            for each action a ∈ {UP, DOWN, LEFT, RIGHT} do
                s_next ← ComputeNextPosition(s_current, a)
                if s_next ∉ D then
                    D[s_next] ← d + 1    ▷ Label with distance
                    Enqueue (s_next, d + 1) into Q
                end if
            end for
        end while
        return D
    This algorithm performs a single BFS traversal that computes the minimum steps to hazard for
    all 100 states in the grid. States closer to hazards are discovered first, ensuring each state is
    labelled with the shortest distance. The result is a complete map of the environment’s danger
    profile.
    Example: Consider state (5, 5) and assume, purely for illustration, that the nearest hazard is at (5, 8).
    (In the required configuration the hazard at (4, 5) is adjacent to (5, 5), so the actual values differ.)
    After running the multi-source BFS, danger_map[(5, 5)] would store the value 3 (three steps to the nearest
    hazard). If you take action RIGHT from (5, 5), you move to (5, 6), and danger_map[(5, 6)]
    would be 2. This means the state-action pair ((5, 5), RIGHT) leads to a next state that is 2
    steps from a hazard, making it a Class 2 (2-Step Danger) sample.
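Algorithm 1 might be written in Python along the following lines. The function names `next_position` and `compute_danger_map`, the `(row, col)` tuples and the module-level constants are choices of this sketch; the queue stores only states, because the distance of the current state can be read back from the map.

```python
from collections import deque

GRID_SIZE = 10
HAZARDS = {
    (0, 3), (1, 1), (1, 7), (2, 4), (2, 8), (3, 2), (3, 6), (4, 5),
    (5, 3), (5, 8), (6, 1), (6, 6), (7, 4), (8, 2), (8, 7),
}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT as (d_row, d_col)


def next_position(state, move, size=GRID_SIZE):
    """Apply a move; stay in place if it would leave the grid (wall)."""
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < size and 0 <= col < size:
        return (row, col)
    return state


def compute_danger_map(hazards, size=GRID_SIZE):
    """Multi-source BFS: minimum number of steps from each cell to a hazard."""
    danger_map = {h: 0 for h in hazards}   # every hazard starts at distance 0
    queue = deque(hazards)                 # all sources enqueued at once
    while queue:
        current = queue.popleft()
        for move in MOVES:
            nxt = next_position(current, move, size)
            if nxt not in danger_map:      # first visit is the shortest distance
                danger_map[nxt] = danger_map[current] + 1
                queue.append(nxt)
    return danger_map


danger_map = compute_danger_map(HAZARDS)
# (4, 5) is a hazard adjacent to (5, 5) in the specified layout.
print(danger_map[(5, 5)], danger_map[(0, 0)], danger_map[(9, 9)])
```

Because the grid has no internal walls, each BFS distance coincides with the smallest Manhattan distance to any hazard, which gives a convenient independent check of the result.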
    Step 2: Generate and Label Complete Dataset. With the danger map from Step 1, you
    can now generate all samples in a single pass by iterating through every possible state-action
    pair in the environment. For each pair, look up the pre-computed danger value and assign the
    appropriate class label.
    Algorithm 2 GenerateLabelledDataset – Generate complete dataset using pre-computed danger map
    Require: Environment env with grid size n, hazard set H, goal state s_g
    Require: Danger map D from Algorithm 1
    Ensure: Complete dataset X = {(s, a, c)} where c ∈ {0, 1, 2, 3} is the risk class
        Initialise empty dataset X ← ∅

        ▷ Iterate through all state-action pairs
        for row ← 0 to n − 1 do
            for col ← 0 to n − 1 do
                s ← (row, col)
                if s ∈ H or s = s_g then
                    continue    ▷ Skip hazards and goal
                end if

                for each action a ∈ {UP, DOWN, LEFT, RIGHT} do
                    s′ ← ComputeNextPosition(s, a)
                    d ← D[s′]    ▷ Look up pre-computed distance

                    ▷ Assign class label based on minimum steps to hazard
                    if d = 0 then
                        c ← 0    ▷ Immediate hazard
                    else if d = 1 then
                        c ← 1    ▷ 1-step danger
                    else if d = 2 then
                        c ← 2    ▷ 2-step danger
                    else
                        c ← 3    ▷ Safe (d ≥ 3)
                    end if

                    Add (s, a, c) to X
                end for
            end for
        end for
        return X
    This approach is simple, efficient, and complete. You iterate through all 84 valid states (every
    cell except the 15 hazards and the goal) × 4 actions = 336 state-action pairs, performing only a
    dictionary lookup for each (no BFS needed). The result is a complete dataset covering all possible
    state-action pairs in the environment, automatically labelled by risk class.
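A compact Python rendering of Algorithm 2 could look like this. To keep the block self-contained, the danger map is built with a Manhattan-distance shortcut, which matches the BFS result only because the grid has no internal walls; that shortcut and the names `manhattan_danger_map`, `risk_class` and `generate_labelled_dataset` are assumptions of this sketch, not part of the specification.

```python
from collections import Counter

GRID_SIZE = 10
GOAL = (9, 9)
HAZARDS = {
    (0, 3), (1, 1), (1, 7), (2, 4), (2, 8), (3, 2), (3, 6), (4, 5),
    (5, 3), (5, 8), (6, 1), (6, 6), (7, 4), (8, 2), (8, 7),
}
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def manhattan_danger_map(hazards, size):
    """Stand-in for the BFS danger map; valid only for an obstacle-free grid."""
    return {
        (r, c): min(abs(r - hr) + abs(c - hc) for hr, hc in hazards)
        for r in range(size) for c in range(size)
    }


def next_position(state, move, size):
    """Apply a move; stay in place if it would leave the grid (wall)."""
    row, col = state[0] + move[0], state[1] + move[1]
    return (row, col) if 0 <= row < size and 0 <= col < size else state


def risk_class(distance):
    """Classes 0, 1, 2 for distances 0, 1, 2; Class 3 for distance >= 3."""
    return min(distance, 3)


def generate_labelled_dataset(danger_map, hazards, goal, size):
    """Label every (state, action) pair by the danger value of the next state."""
    dataset = []
    for row in range(size):
        for col in range(size):
            state = (row, col)
            if state in hazards or state == goal:
                continue  # skip hazards and the goal
            for name, move in ACTIONS.items():
                s_next = next_position(state, move, size)
                dataset.append((state, name, risk_class(danger_map[s_next])))
    return dataset


danger_map = manhattan_danger_map(HAZARDS, GRID_SIZE)
dataset = generate_labelled_dataset(danger_map, HAZARDS, GOAL, GRID_SIZE)
print(len(dataset), Counter(label for _, _, label in dataset))
```

The printed counter gives the class distribution, which is worth inspecting before splitting the data.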
    Step 3: Feature Extraction and Final Assembly. Now that you have all labelled samples
    from Step 2, you must extract feature vectors for each sample and prepare the final dataset with
    train/validation/test splits.
    Feature Vector Construction. For each state-action pair (s, a), construct a 10-dimensional
    feature vector (Table 3). Let s = (x, y) be the current state and s′ = (x′, y′) be the next state
    after taking action a:
    Feature Description Dims
    1–2 Current position (x, y) normalised to [0, 1] (divide by grid size) 2
    3–6 One-hot encoded action [UP, DOWN, LEFT, RIGHT] 4
    7–8 Next position (x′, y′) normalised to [0, 1] (divide by grid size) 2
    9 Min steps to hazard from current state, D(s), normalised by 10 1
    10 Min steps to hazard from next state, D(s′), normalised by 10 1
    Total 10
    Table 3: Feature vector construction for [Task 3] dataset creation
    Features 9 and 10 are the critical safety features: they encode the safety trajectory by providing
    both the current risk level D(s) and the next risk level D(s′). Together, these features allow the
    classifier to understand whether an action moves the agent closer to or further from hazards.
    Both features require only simple lookups from the pre-computed danger map.
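The feature layout in Table 3 can be assembled as in the sketch below. The function name `extract_features`, the action index order (UP, DOWN, LEFT, RIGHT) and the tiny hand-written danger map in the usage lines are illustrative assumptions.

```python
ACTION_NAMES = ["UP", "DOWN", "LEFT", "RIGHT"]  # one-hot order from Table 3


def extract_features(state, action, next_state, danger_map, grid_size=10):
    """Build the 10-dimensional feature vector for one (state, action) pair."""
    one_hot = [0.0] * len(ACTION_NAMES)
    one_hot[ACTION_NAMES.index(action)] = 1.0
    return [
        state[0] / grid_size,             # feature 1: current x, normalised
        state[1] / grid_size,             # feature 2: current y, normalised
        *one_hot,                         # features 3-6: one-hot action
        next_state[0] / grid_size,        # feature 7: next x, normalised
        next_state[1] / grid_size,        # feature 8: next y, normalised
        danger_map[state] / 10.0,         # feature 9: D(s) / 10
        danger_map[next_state] / 10.0,    # feature 10: D(s') / 10
    ]


# Hypothetical two-cell danger map, just to exercise the function.
toy_danger_map = {(0, 0): 2, (0, 1): 1}
features = extract_features((0, 0), "RIGHT", (0, 1), toy_danger_map)
print(features)
```

Here feature 10 is smaller than feature 9, reflecting an action that moves the agent one step closer to a hazard.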
    Class Distribution. After generating the complete dataset in Step 2 (all 336 state-action
    pairs from 84 valid states × 4 actions), you will observe a significant class imbalance. Due to
    the strategic placement of 15 hazards throughout the 10×10 grid, the environment is highly
    constrained: most states are within 2 steps of a hazard. Classes 0–2 (unsafe actions) will
    significantly outnumber Class 3 (safe actions), with Class 1 being the most common. This imbalance
    reflects the genuine difficulty of the environment: very few actions are truly “safe” (far from all
    hazards). You will use ALL 336 samples from all four classes in the final dataset.
    Train/Validation/Test Splits. Split the complete dataset into train (70%), validation (15%),
    and test (15%) sets using stratified sampling (stratify=y) so that each split preserves the
    class proportions of the full dataset.
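The `stratify=y` hint suggests scikit-learn's `train_test_split`, applied twice. The dependency-free sketch below is a stand-in that illustrates the same idea by splitting each class separately; the function name `stratified_split`, the rounding rule and the default seed are assumptions of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):          # fixed class order for reproducibility
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test split
    return train, val, test


labels = [0] * 20 + [1] * 60 + [2] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

For very small classes the rounding can leave a split without any sample of that class, so it is worth checking per-split class counts after splitting.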
    Output. Your dataset must contain train, validation, and test splits with 10-dimensional feature
    vectors and corresponding class labels (0–3), in a format suitable for loading into [Task 4].
    4.4 Task 4: Safety Shield Training
    In this task, you will train a multi-class neural network classifier to predict risk levels for state-action pairs. The dataset created in [Task 3] contains all necessary features and labels in a ready-to-use format. Your focus here is purely on model training and evaluation.
    Dataset Loading. Load the complete dataset you created in [Task 3]. The dataset contains
    train, validation, and test splits with 10-dimensional feature vectors and class labels (0–3).
    Network Architecture. Implement a multi-class neural network safety shield classifier with
    the following architecture (Table 4):
    Layer Neurons Activation
    Input Layer 10 –
    Hidden Layer 1 Tunable* ReLU
    Hidden Layer 2 Tunable* ReLU
    Output Layer 4 Softmax
    Table 4: Neural network architecture for [Task 4] safety shield classifier
    * See hyperparameter table for hidden layer size range and suggested value.
    The input layer accepts 10-dimensional feature vectors from [Task 3]. The output layer produces
    P(class | s, a) for classes 0–3 using softmax activation.
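To make the shape of the network concrete, the forward pass of Table 4 is sketched below in plain Python with untrained random weights. In practice a framework would normally be used for training; the helper names, the He-style initialisation and the hidden size of 64 (the baseline from the hyperparameter table) are choices of this sketch.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights (He-style scale, an assumption here) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine map: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shield_forward(features, layers):
    """10 inputs -> hidden (ReLU) -> hidden (ReLU) -> 4 class probabilities."""
    hidden1 = relu(dense(features, layers[0]))
    hidden2 = relu(dense(hidden1, layers[1]))
    return softmax(dense(hidden2, layers[2]))


rng = random.Random(42)
HIDDEN = 64
layers = [init_layer(10, HIDDEN, rng), init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, 4, rng)]
probs = shield_forward([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.1, 0.2, 0.1], layers)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted risk class is the index of the largest of the four output probabilities.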
    Risk Classes. The network predicts 4 risk classes (0–3) as defined in [Task 3]: Class 0 (immediate
    hazard), Class 1 (1-step danger), Class 2 (2-step danger), and Class 3 (safe states).
    Hyperparameters. Train your neural network using the following hyperparameters (Table 5).
    You may experiment within the specified ranges to optimise performance, but you must report
    the final values used.
    Hyperparameter Range Baseline
    Hidden layer size [32, 128] 64
    Learning rate (α) [0.0001, 0.01] 0.001
    Batch size [16, 64] 32
    Epochs [50, 200] 100
    Loss function Fixed Cross-entropy
    Optimiser Fixed Adam
    Random seed Fixed 42
    Table 5: Hyperparameters for [Task 4] safety shield training
    where cross-entropy loss is defined as $\text{Loss} = -\sum_{c=0}^{3} y_c \cdot \log(\hat{y}_c)$, with $y_c$ being 1 if the true class
    is $c$ (one-hot encoded) and 0 otherwise, and $\hat{y}_c$ the predicted probability for class $c$.
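Because the target is one-hot, the sum in the loss collapses to a single term, the negative log-probability of the true class. A small sketch (the function names and the clipping constant `eps` are assumptions added to avoid taking the log of zero):

```python
import math


def cross_entropy(probs, true_class, eps=1e-12):
    """Loss for one sample: -log of the probability given to the true class."""
    return -math.log(max(probs[true_class], eps))


def mean_cross_entropy(batch_probs, batch_labels):
    """Average loss over a batch of predictions."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)


uniform_loss = cross_entropy([0.25, 0.25, 0.25, 0.25], true_class=2)
confident_loss = cross_entropy([0.01, 0.01, 0.97, 0.01], true_class=2)
print(uniform_loss, confident_loss)
```

A uniform prediction over four classes gives a loss of ln 4 ≈ 1.386, which is a useful reference point when reading the first epochs of a loss curve.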
    Evaluation Metrics. Evaluate your trained model on both validation and test sets and report
    overall accuracy as the fraction of correctly classified samples (target: > 90%), per-class accuracy
    showing classification accuracy for each of the 4 risk classes separately (critical for safety: Class
    0 accuracy > 95% to correctly identify immediate hazards), confusion matrix as a 4 × 4 table
    showing true labels vs predicted labels, and training curves plotting training and validation loss
    vs epoch number showing smooth convergence.
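These evaluation quantities can all be derived from the confusion matrix, as sketched below. The sketch interprets per-class accuracy as the fraction of samples of each true class that are predicted correctly (diagonal entry divided by row sum); that reading and the function names are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes=4):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all samples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def per_class_accuracy(matrix):
    """Diagonal over row sum for each true class; NaN if a class is absent."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]


y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 1, 2, 2, 3]
cm = confusion_matrix(y_true, y_pred)
print(cm, overall_accuracy(cm), per_class_accuracy(cm))
```

For the safety target, the first entry of `per_class_accuracy` (Class 0) is the one to compare against the 95% threshold.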
    Required Deliverables. Report the following results for both validation and test sets (Table 6):
    Deliverable | Description
    Overall Accuracy | Report as percentage (e.g., 95.2%)
    Per-Class Accuracy | Accuracy for each of the 4 risk classes: Class 0, Class 1, Class 2, Class 3
    Confusion Matrix | 4 × 4 table showing true vs predicted labels (see Figure 3 for sample format)
    Loss Curves | Single plot showing both training and validation loss vs epoch (see Figure 3 for sample format)
    Table 6: Required deliverables for [Task 4] safety shield training
    Figure 3 shows sample confusion matrix and loss curves for reference.
    Output. Save your trained model weights for use in [Task 5] along with training metrics and
    visualisations.
    [Figure 3, left panel (SAMPLE): Sample Confusion Matrix (Test Set); rows are true labels, columns are predicted labels; colour scale: Number of Samples.]
                   Pred. Class 0   Pred. Class 1   Pred. Class 2   Pred. Class 3
    True Class 0         9               0               0               0
    True Class 1         0              23               1               0
    True Class 2         1               1              12               0
    True Class 3         0               0               1               3
    [Figure 3, right panel (SAMPLE): Sample Training and Validation Loss Curves; Cross-Entropy Loss (0.0 to 1.6) vs Epoch (0 to 100); series: Training Loss, Validation Loss.]
    Figure 3: Sample results for [Task 4] showing (left) confusion matrix on test set with strong diagonal
    indicating correct classifications, and (right) training and validation loss curves showing smooth
    convergence. These are sample results for illustration purposes only; your actual results may differ based on
    implementation and hyperparameter choices.
    Pedagogical Note: Oracle to Approximator. This task demonstrates a fundamental ML
    pattern for safety-critical systems. [Task 3] used exhaustive BFS as the oracle: it requires
    a full graph search, but it is deterministic and complete. [Task 4] trains
    a neural network to approximate the oracle’s behaviour: inference is fast (a single forward pass
    instead of a full BFS) and can generalise to unseen states. This mirrors real-world ML deployment: use an
    expensive oracle to generate high-quality training data, then train a fast model to approximate
    the oracle’s behaviour for real-time use.
    4.5 Task 5: Integration – Safe RL with Multi-Class Shield
    Integrate your trained 4-class safety shield with the Q-learning agent to enable risk-aware action
    selection during training. The shield predicts a risk class in {0, 1, 2, 3} for each state-action pair,
    where Class 0 indicates immediate hazard, Class 1 indicates 1-step danger, Class 2 indicates
    2-step danger, and Class 3 indicates safe states (as defined in [Task 3]).
    Risk Threshold Parameter. Define a threshold parameter θ ∈ {0, 1, 2, 3} as the intervention
    threshold. An action a in state s is acceptable if and only if its predicted class c(s, a) > θ.
    Actions with class ≤ θ trigger intervention. Set θ = 2 as the default value. For example, with
    θ = 2, only actions predicted as Class 3 are acceptable (class > 2); actions predicted as Class 0,
    1, or 2 trigger intervention. With θ = 0, only Class 0 triggers intervention; Classes 1, 2, and 3
    are acceptable.
    Intervention Policy. At each time step $t$, the agent must follow this intervention policy:
    1. The agent proposes an action $a_{prop}$ using ϵ-greedy selection from its Q-table.
    2. The shield predicts the risk class $c(s_t, a)$ for all four actions $a \in \{\text{UP}, \text{DOWN}, \text{LEFT}, \text{RIGHT}\}$ in the current state $s_t$.
    3. If $c(s_t, a_{prop}) > \theta$, execute $a_{prop}$ directly (no intervention needed).
    4. Otherwise, build the candidate set $C = \{a : c(s_t, a) > \theta\}$ of acceptable actions.
    5. If $C \neq \emptyset$, execute $a_t = \arg\max_{a \in C} Q(s_t, a)$ (choose the acceptable action with the highest Q-value).
    6. If $C = \emptyset$ (no acceptable actions exist), compute $c_{max} = \max_a c(s_t, a)$ and execute $a_t = \arg\max_{a : c(s_t, a) = c_{max}} Q(s_t, a)$ (choose the least risky action with the highest Q-value).
    If multiple actions tie on Q-value, use deterministic tie-breaking (e.g., select the first in the fixed order UP, RIGHT, DOWN, LEFT).
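The six steps of the policy can be condensed into one function, sketched below. It assumes the shield's predictions and the Q-values for the current state are already available as dictionaries keyed by action name; the names `shielded_action` and `best_by_q` are hypothetical, and the tie-break order is the example order given above.

```python
TIE_BREAK_ORDER = ["UP", "RIGHT", "DOWN", "LEFT"]


def best_by_q(candidates, q_values):
    """Highest-Q candidate; ties go to the earliest action in TIE_BREAK_ORDER."""
    best = None
    for action in TIE_BREAK_ORDER:
        if action in candidates and (best is None or q_values[action] > q_values[best]):
            best = action
    return best


def shielded_action(proposed, risk_classes, q_values, theta=2):
    """Return (executed_action, intervened) following the intervention policy."""
    if risk_classes[proposed] > theta:
        return proposed, False                      # step 3: proposal is acceptable
    candidates = {a for a, c in risk_classes.items() if c > theta}   # step 4
    if not candidates:                              # step 6: fall back to least risky
        c_max = max(risk_classes.values())
        candidates = {a for a, c in risk_classes.items() if c == c_max}
    executed = best_by_q(candidates, q_values)      # steps 5 and 6: best by Q-value
    return executed, executed != proposed


risk = {"UP": 0, "RIGHT": 3, "DOWN": 3, "LEFT": 1}
q = {"UP": 5.0, "RIGHT": 1.0, "DOWN": 1.0, "LEFT": 0.0}
print(shielded_action("UP", risk, q, theta=2))
```

The second return value is `True` only when the executed action differs from the proposed one, so in the fallback case a proposal that is itself the least risky, highest-Q action is not counted as an intervention.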
    Training Loop. Train your safe RL system for 1,000 episodes. For each episode, reset the
    environment to a random safe starting position and run for a maximum of 200 steps or until
    reaching a terminal state (goal or hazard). At each time step: (1) apply the intervention policy
    above to select action a_t, (2) execute a_t in the environment to observe reward r_{t+1} and next
    state s_{t+1}, (3) update the Q-table using the executed action a_t (not the proposed action):
    \[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]
    Log metrics per episode: number of interventions (count of steps where a_t ≠ a_prop), safety
    violations (count of transitions that enter a hazard cell according to ground-truth environment
    state), and episode return (sum of rewards).
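    A minimal sketch of the update and the per-episode log, assuming a NumPy Q-table indexed by integer state and action IDs. The `EpisodeLog` class and its field names are hypothetical helpers, not names required by the assignment.

```python
from dataclasses import dataclass

import numpy as np

ALPHA = 0.1   # learning rate from Table 7
GAMMA = 0.99  # discount factor from Table 7


def q_update(Q, s, a_executed, reward, s_next):
    """Apply the Q-learning update to the executed action (not the proposed one)."""
    td_target = reward + GAMMA * np.max(Q[s_next])
    Q[s, a_executed] += ALPHA * (td_target - Q[s, a_executed])


@dataclass
class EpisodeLog:
    """Per-episode metrics named in the text (hypothetical container)."""
    interventions: int = 0
    violations: int = 0
    episode_return: float = 0.0

    def record(self, proposed, executed, reward, entered_hazard):
        self.interventions += int(executed != proposed)  # a_t != a_prop
        self.violations += int(entered_hazard)           # ground-truth hazard entry
        self.episode_return += reward


if __name__ == "__main__":
    Q = np.zeros((3, 4))
    Q[1] = [0.0, 2.0, 0.0, 0.0]
    q_update(Q, s=0, a_executed=1, reward=-1.0, s_next=1)
    print(Q[0, 1])  # 0.1 * (-1 + 0.99 * 2 - 0)

    log = EpisodeLog()
    log.record(proposed=0, executed=1, reward=-1.0, entered_hazard=False)
    print(log)
```

    If the Q-table is zero-initialised and rows for terminal states are never updated, the bootstrap term is zero on terminal transitions without any special casing.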
    Hyperparameters. Use the hyperparameters specified in Table 7.
    Hyperparameter           Value/Range       Notes
    Learning rate (α)        0.1               Fixed
    Discount factor (γ)      0.99              Fixed
    Risk threshold (θ)       Test ≥ 2 values   e.g., θ = 0 and θ = 2
    Episodes                 1,000             Fixed
    Max steps per episode    200               Fixed
    Epsilon start (ϵ_start)  1.0               Fixed
    Epsilon min (ϵ_min)      0.01              Fixed
    Epsilon decay            0.995             Exponential decay per episode
    Random start position    Enabled           Use random safe starting positions
    Random seed              123               Suggested for reproducibility
    Table 7: Hyperparameters for [Task 5] safe RL training
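    The exploration schedule implied by Table 7 can be sketched as below. The sketch assumes the decay is applied after each episode, so the first episode runs with ϵ = 1.0. The table does not state the exact ordering, and the function name is illustrative.

```python
def epsilon_schedule(n_episodes=1000, start=1.0, minimum=0.01, decay=0.995):
    """Return the epsilon used in each episode under per-episode exponential decay."""
    epsilon = start
    schedule = []
    for _ in range(n_episodes):
        schedule.append(epsilon)
        # Multiply by the decay factor, but never drop below the floor.
        epsilon = max(minimum, epsilon * decay)
    return schedule


if __name__ == "__main__":
    eps = epsilon_schedule()
    for episode in (0, 100, 500, 999):
        print(f"episode {episode}: epsilon = {eps[episode]:.4f}")
```

    With these values the floor of 0.01 is only reached after roughly 920 episodes, so the agent is still exploring noticeably for most of the 1,000-episode run.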
    Required Deliverables. Generate comparison plots (with smoothing) showing: (1) episode
    rewards over training for baseline Q-learning (no shield) and safe RL with at least two different
    risk threshold values, (2) safety violations per episode for all approaches (target: zero or near-zero
    for safe RL), (3) shield interventions per episode for each threshold value tested, and (4) summary
    statistics comparing final performance metrics (success rate, violations, interventions, average
    reward). Report final metrics averaged over the last 100 episodes for each configuration tested.
    Figure 4 shows sample plots for reference.
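    A minimal sketch of the smoothing and final-window averaging, assuming per-episode metrics are stored in plain lists. The window size of 50 is an arbitrary illustrative choice; the text does not prescribe one.

```python
import numpy as np


def moving_average(values, window=50):
    """Smooth a per-episode series with a simple (unweighted) moving average."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values.copy()  # too short to smooth; return unchanged
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def final_mean(values, last_n=100):
    """Average a metric over the last `last_n` episodes."""
    return float(np.mean(values[-last_n:]))


if __name__ == "__main__":
    rewards = list(range(1000))  # placeholder series, not real results
    print(len(moving_average(rewards)), final_mean(rewards))
```

    With `mode="valid"` the smoothed series is `window - 1` points shorter than the input, so the x-axis of a plot needs to be offset accordingly.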
    Risk Threshold Experimentation (Required): In your report, you must compare the
    results of using at least two different risk threshold values (e.g., θ = 0 and θ = 2, or θ = 1 and
    θ = 2). Analyse the impact of this parameter on the trade-off between safety (violation rate),
    task performance (success rate, average reward), and intervention efficiency (total interventions,
    interventions per episode). Discuss which threshold value provides the best balance for this
    environment and explain your reasoning. Figure 4 demonstrates this threshold comparison
    showing baseline performance alongside two different threshold configurations.
    [Figure 4: image not reproduced. Four panels marked SAMPLE; each plot has Episode (0 to 2000) on the x-axis. "Episode Rewards Comparison": smoothed reward for Baseline (no shield), θ = 0, and θ = 2. "Safety Violations Comparison": smoothed safety violations for Baseline, θ = 0, and θ = 2. "Shield Interventions Comparison": interventions per episode for θ = 0 and θ = 2. The fourth panel is the text summary below.]
    THRESHOLD COMPARISON SUMMARY
    Episodes: 2000
    SAFETY VIOLATIONS:
      Baseline: 423
      θ = 0: 0 (100.0% reduction)
      θ = 2: 0 (100.0% reduction)
    SUCCESS RATE:
      Baseline: 68.5%
      θ = 0: 98.2%
      θ = 2: 3.5%
    INTERVENTIONS (total):
      θ = 0: 18 (avg: 0.0/ep)
      θ = 2: 185340 (avg: 185.7/ep)
    FINAL REWARD (last 100 episodes):
      Baseline: 6.8
      θ = 0: 8.7
      θ = 2: -17.9
    KEY INSIGHT:
      θ = 0: Fewer interventions, faster learning
      θ = 2: More interventions, maximum safety

    Figure 4: Sample results for [Task 5] comparing baseline Q-learning (no shield) with safe RL using
    different risk thresholds (θ = 0 and θ = 2). Top row: Episode rewards (left) and safety violations (right)
    across all three approaches. Bottom row: Shield interventions for θ = 0 and θ = 2 (left), and summary
    statistics (right) showing key metrics including success rate, violations, and intervention frequency. These
    are sample results for illustration purposes; your actual results may differ based on implementation and
    hyperparameter choices.
    4.6 Model Evaluation
    Evaluate your final safe RL system across multiple dimensions using clearly defined metrics.
    Safety Metrics. Measure (1) total safety violations over all 1,000 training episodes, where a
    violation is defined as any transition that results in the agent entering a hazard cell according
    to ground-truth environment state (target: 0 violations), (2) false negative rate of the shield on
    test data, computed as the fraction of Class 0 or Class 1 actions incorrectly predicted as Class
    2 or Class 3 (indicates shield failures that could allow unsafe actions), and (3) violation rate
    without shield by running the final Q-table for 100 episodes with the shield disabled to measure
    baseline safety.
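    The shield's false negative rate, as defined above, can be computed with a short sketch. It assumes true and predicted labels are integer arrays with values 0 to 3; the function name is illustrative.

```python
import numpy as np


def shield_false_negative_rate(y_true, y_pred):
    """Fraction of truly dangerous samples (Class 0 or 1) predicted as Class 2 or 3."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    dangerous = y_true <= 1                 # ground truth is Class 0 or Class 1
    if not dangerous.any():
        return 0.0                          # no dangerous samples in the test set
    missed = dangerous & (y_pred >= 2)      # shield predicted Class 2 or Class 3
    return float(missed.sum() / dangerous.sum())


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 3]
    y_pred = [0, 2, 1, 3, 2, 3]
    print(shield_false_negative_rate(y_true, y_pred))
```

    Confusions between Class 0 and Class 1 do not count as false negatives under this definition, since both remain at or below the dangerous boundary.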
    Task Performance. Measure (1) average episode return computed as mean total reward per
    episode over the last 100 episodes (compare to unsafe baseline from [Task 2]), (2) success rate
    as the percentage of episodes in the last 100 that reach the goal state without entering a hazard,
    and (3) average episode length as mean number of steps per episode in the last 100 episodes.
    Intervention Efficiency. Measure (1) total interventions over all 1,000 training episodes,
    where an intervention is defined as any step where the executed action a_t differs from the
    proposed action a_prop due to shield intervention, (2) interventions in last 100 episodes to assess
    whether the agent has learned a safe policy (target: near zero), and (3) average interventions
    per episode tracked over training to visualise the learning trajectory.
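    The task performance and intervention metrics above can be aggregated from per-episode logs along these lines. The log format (a list of dictionaries with the keys shown) is a hypothetical choice for illustration, not one mandated by the assignment.

```python
def summarise(logs, last_n=100):
    """Aggregate per-episode logs into the metrics described in the text.

    Each log is assumed to be a dict with keys:
    "return", "steps", "reached_goal", "violations", "interventions".
    """
    tail = logs[-last_n:]
    n = len(tail)
    # Success: reached the goal without entering a hazard during the episode.
    successes = sum(1 for e in tail if e["reached_goal"] and e["violations"] == 0)
    return {
        "avg_return": sum(e["return"] for e in tail) / n,
        "success_rate": 100.0 * successes / n,
        "avg_length": sum(e["steps"] for e in tail) / n,
        "interventions_last_n": sum(e["interventions"] for e in tail),
        "total_interventions": sum(e["interventions"] for e in logs),
        "total_violations": sum(e["violations"] for e in logs),
    }


if __name__ == "__main__":
    demo_logs = [  # placeholder values, not real results
        {"return": -5.0, "steps": 200, "reached_goal": False, "violations": 0, "interventions": 10},
        {"return": 8.0, "steps": 12, "reached_goal": True, "violations": 0, "interventions": 1},
        {"return": -10.0, "steps": 5, "reached_goal": False, "violations": 1, "interventions": 0},
    ]
    print(summarise(demo_logs, last_n=2))
```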
    Visualisations. Create a comparison summary displaying key metrics for both the unsafe
    baseline [Task 2] and the safe RL system [Task 5] with different threshold values, including
    average reward, success rate, safety violations, and total interventions. See Figure 4 (bottom
    right panel) for reference format showing how to present these comparative statistics alongside
    your training curves.
    5 Assessment Breakdown
    The assignment is marked out of 25, with marks distributed across implementation, understand-
    ing, experimental analysis, and tutorial participation components as shown in Table 8.
    Table 8: Assessment breakdown showing mark distribution across components
    Category                          Component                                                     Marks
    Implementation (36%)              Environment implementation with configurable step penalties   1
                                      Q-learning baseline with two-penalty comparison               1
                                      Systematic intervention generation                            1
                                      Safety shield learning and evaluation                         3
                                      Safe RL integration with shield                               2
                                      Code clarity and style                                        1
                                      Subtotal                                                      9
    Understanding & Discussion (48%)  Safe RL Concepts:
                                        Constrained MDPs and safety formulation                     1
                                        Shield design and safety constraint learning                1
                                        Multi-class risk stratification and advantages              1
                                        False positive vs. false negative trade-offs                1
                                      Implementation Understanding:
                                        Feature engineering for safety prediction                   1
                                        Shield training and class imbalance handling                1
                                        RL-shield integration strategy                              1
                                        Reward shaping analysis                                     1
                                      Experimental Analysis:
                                        Baseline vs. safe system comparison                         1
                                        Step penalty comparison and reward shaping                  1
                                        Multi-class safety shield performance analysis              1
                                        Risk-aware intervention strategy evaluation                 1
                                      Subtotal                                                      12
    Tutorial Participation (16%)      Practical work and engagement in tutorials                    4
    TOTAL                                                                                           25
    6 Discussion Session
    All students must attend a mandatory 15–20 minute discussion with a tutor (face-to-face or
    online) to demonstrate understanding of their implementation and the underlying concepts.
    You will be asked to explain your implementation choices, demonstrate your working system,
    discuss results and trade-offs, and answer conceptual questions about safe RL and constrained
    MDPs. The discussion is worth 12 marks (48% of the total assignment grade) and assesses
    your genuine understanding of the work submitted. Schedule your discussion session through
    the online booking system (link will be provided on Moodle). Failure to attend your scheduled
    discussion session without a valid reason will result in zero marks for the discussion component.
    7 Submission
    7.1 Deadline and Late Penalties
    Due Date: Friday, 14 November 2025, 5:00 PM AEST (Week 9)
    Late Penalty: UNSW applies a standard late submission penalty of 5% per day, deducted from
    your mark, for up to five days after the assessment deadline. Submissions are not accepted
    more than five days after the deadline.
    7.2 Required Components
    Your submission must include the following components:
    1. Jupyter Notebook. Submit a single Jupyter notebook containing all implementation code
    for Tasks 1–5. The notebook must be well-organised with clear markdown cells explaining your
    implementation decisions, design choices, and key observations. Your code cells should include
    appropriate comments for complex logic, but avoid over-commenting obvious operations. Use
    meaningful variable names and maintain consistent code structure throughout.
    2. Generated Dataset. Submit the complete intervention dataset generated in [Task 3], saved
    in a format that can be easily loaded (e.g., pickle file, CSV, or NumPy array). This dataset
    should contain all state-action-risk class tuples collected through systematic BFS exploration,
    properly labelled with risk classes 0, 1, 2, and 3. Include the total number of samples and class
    distribution in your notebook documentation.
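    One way to save the dataset and report its size and class distribution is sketched below. It assumes each sample is a (state, action, risk_class) tuple and uses pickle, one of the formats mentioned above. The tuple layout and function names are illustrative assumptions, and the samples shown are placeholders.

```python
import pickle
from collections import Counter


def save_dataset(samples, path):
    """Write the list of (state, action, risk_class) tuples to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(samples, f)


def load_dataset(path):
    """Load the dataset back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def class_distribution(samples):
    """Count how many samples fall into each risk class (0 to 3)."""
    return Counter(risk_class for _state, _action, risk_class in samples)


if __name__ == "__main__":
    import os
    import tempfile

    demo = [((0, 0), "UP", 3), ((0, 1), "RIGHT", 0), ((1, 1), "DOWN", 3)]  # placeholder data
    path = os.path.join(tempfile.mkdtemp(), "dataset_demo.pkl")
    save_dataset(demo, path)
    loaded = load_dataset(path)
    print(len(loaded), dict(class_distribution(loaded)))
```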
    7.3 How to Submit
    Submit your assignment electronically via Moodle. Your submission must be a single zip file
    named zID_assignment2.zip (replace zID with your student ID) containing your Jupyter notebook
    (.ipynb file) and the generated intervention dataset (e.g., complete_dataset.pkl).
    Important: Test thoroughly before submission. If your models fail to load or run during
    evaluation, you may lose up to 50% of the marks for that component. You can submit as many
    times as you like before the deadline; later submissions overwrite earlier ones. After submitting,
    take a screenshot for your records.
    7.4 Getting Help
    Use the Moodle forum for assignment-related questions. We prioritise forum questions, but
    avoid sharing code publicly to prevent plagiarism issues. For code-specific questions, email
    [email protected]. We aim to respond quickly, but may take up to 1–2 business days,
    so avoid last-minute questions that might not receive timely responses. For questions about
    discussion sessions, contact your tutor directly (see Section 8 for tutor information).
    8 Tutor Information
    Table 9 lists the tutors for this course along with their assigned class IDs and contact email
    addresses. Please contact your tutor directly for questions about discussion sessions or class-
    specific matters.
    No. Class ID(s) Tutor Email
    1 13192, 13193, 13198, 13199 Adam Stucci [email protected]
    2 4198, 4202, 4204, 6344, 6348, 6350 Hadha Afrisal [email protected]
    3 4212, 4215, 6358, 6361 Haitao Gao [email protected]
    4 4205, 4214, 6351, 6360 Ishan Dubey [email protected]
    5 4197, 4213, 6343, 6359 Joffrey Ji [email protected]
    6 4199, 4209, 6345, 6355 John Chen [email protected]
    7 4219, 6365, 13076, 13077, 13078, 13079 Jonas Macken [email protected]
    8 4206, 4216, 6352, 6362 Leman Kirme [email protected]
    9 4201, 4211, 6347, 6357 Maher Mesto [email protected]
    10 4223, 4224, 6369, 6370 Marium Malik [email protected]
    11 4203, 4208, 6349, 6354 Peter Ho [email protected]
    12 4210, 4220, 6356, 6366 Trishika Abrol [email protected]
    13 4200, 4207, 6346, 6353 Xiongyu Xie [email protected]
    14 4217, 4218, 6363, 6364 Yixin Kang [email protected]
    15 4221, 4222, 6367, 6368 Zahra Donyavi [email protected]
    Table 9: Course tutors and their assigned classes.
    9 Academic Integrity
    This assignment is individual work. You may discuss high-level concepts with classmates, but all
    code and written work must be your own. Do not share code with other students. Large language
    models and AI assistants (such as ChatGPT, GitHub Copilot) may be used for learning concepts,
    understanding syntax, debugging assistance, and clarifying documentation. However, you must
    NOT use AI tools to generate complete solutions for entire tasks or to write substantial portions
    of your implementation, which might lead to poor understanding of the code. The submitted
    code must be your own work that you have written and fully understand.
    10 References
    [1] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum,
    and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the AAAI
    Conference on Artificial Intelligence, volume 32, 2018.
    [2] Eitan Altman. Constrained Markov Decision Processes. Chapman & Hall/CRC, 1999.
    [3] Fatemeh Yousefinejad Ravari and Saeed Jalili. Reward shaping in reinforcement learning
    of multi-objective safety critical systems. In 2024 20th CSI International Symposium on
    Artificial Intelligence and Signal Processing (AISP), pages 1–6. IEEE, 2024.
    [4] William Saunders, Girish Sastry, Andreas Stuhlmüller, and Owain Evans. Trial without
    error: Towards safe reinforcement learning via human intervention. In Proceedings of the
    17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS),
    pages 2067–2069, 2018.
    [5] Brijen Thananjeyan, Ashwin Balakrishna, Suraj Nair, Michael Luo, Krishnan Srinivasan,
    Minho Hwang, Joseph E Gonzalez, Julian Ibarz, Chelsea Finn, and Ken Goldberg. Recovery
    RL: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation
    Letters, 6(3):4915–4922, 2021.