Experiment 20251230-KARL

Experiment design

Agents play the knowledge-based agreement game using their symbolic knowledge bases, and adapt these knowledge bases by selecting adaptation operators through reinforcement learning.

Date: 2025-12-30

Designer: Richard Trézeux

Hypotheses: [Learned policies are independent of the environment, Learned policies enable agents to reach consensus efficiently]

10 agents; 5 environments; 5 runs per environment; 40 generations per run.

Experiment

Date: 2025-12-30 (Richard Trézeux)

Link to code

Simulator code hash: c3fab184644a56f1783a5fbb10d20555997bfe64

Parameter file: params.sh

Executed command (script.sh):

#!/bin/bash

# retrieve code

. params.sh

git clone https://gitlab.inria.fr/moex/karlOperators.git code
cd code
git checkout $LLHASH
cd ..

pip install -r code/requirements.txt

# run
        
METHODS=("softmax" "thompson" "a3c" "random+")
SEEDS=(24 25 27 2626 2727)

MAX_JOBS=5

for method in "${METHODS[@]}"; do
  for seed in "${SEEDS[@]}"; do
    # throttle: keep at most MAX_JOBS simulations running in parallel
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do
      sleep 1
    done
    echo "Running $method seed $seed"

    python code/main.py \
        --config params.sh \
        --method "$method" \
        --seed "$seed" &
  done
done

wait

Experimental Plan

The independent variables have been varied as follows:
LEARNING_METHOD = [random+, thompson, softmax, a3c]
SEED = [24, 25, 27, 2626, 2727] (randomly selected)

Constants of the experiment:
NBINTERACTIONS = 100000
NBPROPERTIES = 6
NBDECISIONS = 4
KB_INIT_METHOD = random
ADAPTING_AGENT_SELECTION = accuracy

Raw results

Full results are available on Zenodo

DOI

Key parameters description

Parameter | Description
NBAGENTS | Size of the agent population.
NBGENERATIONS | Number of generations (iterations).
NBINTERACTIONS | Maximum number of interactions for a population (a safeguard in case agents do not converge to a consensus).
NBPROPERTIES | Number of properties in the environment (influences complexity).
NBDECISIONS | Number of decisions used to discriminate objects.
KB_INIT_METHOD | KB initialization method. Two different methods are implemented (see paper).
ADAPTING_AGENT_SELECTION | The score used to select which agent adapts its knowledge (accuracy or successrate).
LEARNING_METHOD | Type of RL update and policy for agents.
SEED | Random seed of the experiment. Controls the generation of objects in the environment.

NBPROPERTIES, NBCLASSES, NBAGENTS, and NBGENERATIONS have a strong impact on computational time.
Large values may lead to long simulations.

Analysis

The directory structure used for analysis is:

results
│
├───method1
│   ├───seed1
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   ├───seed2
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   └───seed3...
│
└───method2
    ├───seed1
    │       logs1_gen1.jsonl
    │       logs1_gen2.jsonl
    │       ...
    │       metrics.json
    └───seed2...

Hypothesis 1: Policies are environment-independent

Kruskal-Wallis test on operator use

$H_0:$ agents use each operator as frequently as agents from other environments

$H_1:$ at least one environment has a different distribution
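
As a pointer, a minimal sketch of the per-operator test run below for each method, using scipy.stats.kruskal; usage_by_env_op is a hypothetical name for the per-environment, per-agent operator-use frequencies extracted from the logs:

# Kruskal-Wallis test per operator: does the use frequency of an operator
# differ across the 5 environments? (sketch; the data structure is illustrative)
from scipy.stats import kruskal
import pandas as pd

OPERATORS = ["Wait", "Remove-Mirror", "Spe Max", "Spe Min",
             "Gen Max", "Gen Min", "Mirror"]

def kruskal_per_operator(usage_by_env_op):
    """usage_by_env_op[env][op] -> per-agent use frequencies of operator op."""
    rows = []
    for op in OPERATORS:
        samples = [usage_by_env_op[env][op] for env in usage_by_env_op]
        _, pval = kruskal(*samples)  # H0: same distribution in every environment
        rows.append({"Operator": op, "p-value": pval})
    return pd.DataFrame(rows)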

-- KRUSKAL-WALLIS TEST RESULTS METHOD thompson --
        Operator   p-value
0           Wait  0.501126
1  Remove-Mirror  0.719813
2        Spe Max  0.661678
3        Spe Min  0.855492
4        Gen Max  0.704059
5        Gen Min  0.537978
6         Mirror  0.000785
-- KRUSKAL-WALLIS TEST RESULTS METHOD softmax --
        Operator   p-value
0           Wait  0.899129
1  Remove-Mirror  0.079606
2        Spe Max  0.593257
3        Spe Min  0.570662
4        Gen Max  0.150979
5        Gen Min  0.539263
6         Mirror  0.752636
-- KRUSKAL-WALLIS TEST RESULTS METHOD a3c --
        Operator   p-value
0           Wait  0.683018
1  Remove-Mirror  0.056482
2        Spe Max  0.009028
3        Spe Min  0.668697
4        Gen Max   0.88204
5        Gen Min  0.289114
6         Mirror  0.146813

Verifying policy environment independence with a PERMANOVA analysis

1) Boltzmann/Softmax method

Printing the learnt $Q$ vectors averaged over agents.

        Operator  Avg Q-value  Std Q-value
0           Wait        -0.50         0.00
1  Remove-Mirror         0.38         0.23
2        Spe Max         0.53         0.52
3        Spe Min         0.38         0.53
4        Gen Max         0.15         0.31
5        Gen Min         0.14         0.29
6         Mirror        -0.02         0.12
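
For reference, a minimal sketch of how a Boltzmann/Softmax policy turns such a $Q$ vector into operator probabilities, assuming the temperature schedule from the hyperparameters table at the end of this document; names and the decay step count are illustrative:

import numpy as np

TAU_START, TAU_END, TAU_DECAY = 0.5, 0.05, 0.9995  # schedule from the table

def boltzmann_probs(q, tau):
    """Softmax over Q-values at temperature tau (lower tau = greedier)."""
    z = (q - q.max()) / tau      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([-0.50, 0.38, 0.53, 0.38, 0.15, 0.14, -0.02])  # averages above
tau = max(TAU_END, TAU_START * TAU_DECAY ** 10000)          # decayed temperature
probs = boltzmann_probs(q, tau)
operator = np.random.default_rng().choice(len(q), p=probs)  # sample an operator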

PERMANOVA analysis
$H_0:$ the distribution of policies does not depend on the environment (same distributions across environments)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                     250
number of groups                  5
test statistic             1.221952
p-value                    0.268746
number of permutations         5000
Name: PERMANOVA results, dtype: object
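
This Series is the output format of scikit-bio's permanova function; a minimal sketch of the analysis, with random placeholder data standing in for the learned per-agent policy vectors:

# PERMANOVA on per-agent policy vectors (sketch; `policies` below is random
# placeholder data, the real input is the 250 learned policy vectors).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from skbio import DistanceMatrix
from skbio.stats.distance import permanova

rng = np.random.default_rng(0)
policies = rng.normal(size=(250, 7))      # 250 agents x 7 operators (placeholder)
env_labels = np.repeat(np.arange(5), 50)  # 5 environments x 50 agents

dm = DistanceMatrix(squareform(pdist(policies, metric="euclidean")))
print(permanova(dm, grouping=[str(e) for e in env_labels], permutations=5000))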

2) Thompson sampling method

Printing the expectation and standard deviation of the $\text{Beta}(\alpha_{op},\beta_{op})$ distribution for each operator, averaged over agents.

        Operator  Beta distribution mean value  Beta distribution std
0           Wait                          0.01                   0.00
1  Remove-Mirror                          0.85                   0.07
2        Spe Max                          0.50                   0.18
3        Spe Min                          0.22                   0.14
4        Gen Max                          0.46                   0.22
5        Gen Min                          0.95                   0.05
6         Mirror                          0.68                   0.04
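
A minimal sketch of the statistics above together with a Thompson-sampling draw; the $(\alpha_{op}, \beta_{op})$ values here are illustrative, the real ones being the agents' learned parameters:

import numpy as np

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, np.sqrt(var)

# Thompson sampling step: draw one sample per operator from its Beta
# posterior and apply the operator with the largest draw.
rng = np.random.default_rng()
alphas = np.array([1.0, 17.0, 5.0, 3.0, 4.0, 19.0, 34.0])  # illustrative
betas = np.array([99.0, 3.0, 5.0, 11.0, 5.0, 1.0, 16.0])   # illustrative
operator = int(np.argmax(rng.beta(alphas, betas)))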

PERMANOVA analysis
$H_0:$ the distribution of policies does not depend on the environment (same distributions across environments)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                     250
number of groups                  5
test statistic             1.029065
p-value                    0.420716
number of permutations         5000
Name: PERMANOVA results, dtype: object

p-values for both the Softmax (0.269) and Thompson (0.421) methods are above 0.05, so we cannot reject $H_0$: the learned policies appear to be environment-independent.

Hypothesis 2: Policies enable agents to reach consensus efficiently

Reading data from metrics.json files

For a given method, we have one metrics.json file per environment, containing data from 5 runs with agents playing the knowledge-based agreement game for 40 generations. Agents keep their RL policy from one generation to the next.
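
A minimal loading sketch following the directory layout above; we only assume the file locations, the internal structure of each metrics.json being whatever the simulator writes:

import json
from pathlib import Path

def load_metrics(results_dir="results"):
    """Collect all metrics.json files as a {(method, seed): dict} mapping."""
    metrics = {}
    for path in Path(results_dir).glob("*/*/metrics.json"):
        method, seed = path.parts[-3], path.parts[-2]
        with open(path) as f:
            metrics[(method, seed)] = json.load(f)
    return metrics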

Number of adaptations for each (agent, generation) pair, skipping the first ten generations, for a given (run, environment). Rows are generations 10-39; columns are agents 0 to 9.

Gen  A0  A1  A2  A3  A4  A5  A6  A7  A8  A9
10 333 191 256 241 268 161 261 141 161 104
11 298 169 189 276 148 281 276 300 380 269
12 273 255 142 203 315 197 211 211 251 330
13 264 259 175 276 264 150 285 186 205 236
14 254 217 277 240 155 251 296 277 179 127
15 306 233 150 309 263 170 233 145 171 196
16 248 219 258 109 139 188 229 229 86 254
17 202 214 143 120 252 270 167 192 163 136
18 260 163 155 268 219 295 232 269 262 89
19 149 111 212 207 215 207 243 237 247 179
20 224 265 234 203 179 123 162 245 177 131
21 259 158 233 271 118 271 170 157 161 254
22 292 150 166 302 173 218 171 211 197 224
23 240 101 244 295 208 44 248 304 189 243
24 202 175 257 202 163 155 197 134 193 227
25 103 244 183 184 201 209 246 146 292 211
26 175 203 191 241 217 239 198 121 200 169
27 186 236 160 213 289 223 197 259 182 260
28 237 134 293 261 165 167 119 242 269 224
29 139 321 186 207 214 271 177 224 138 157
30 184 266 165 307 158 120 217 195 266 130
31 201 250 277 303 213 169 188 246 187 330
32 241 259 167 256 188 184 277 309 142 110
33 193 182 216 190 178 211 255 244 242 189
34 183 306 253 191 179 168 204 279 193 197
35 219 210 213 263 182 233 159 273 147 170
36 210 253 193 278 168 229 129 146 199 269
37 170 147 292 310 121 279 159 164 149 253
38 184 232 283 250 223 244 220 250 154 218
39 140 245 228 267 241 261 131 199 263 120

Reinforcement learning policies decrease the number of adaptations required to reach consensus, while the Random+ strategy does not.


Reinforcement learning policies also achieve higher reward earnings than the Random+ strategy.

Plot of average number of adaptations before reaching consensus for each method and environment


We observe that RL strategies reach consensus using fewer adaptations than the Random+ strategy.

We also observe that incorporating interaction-specific information into the RL state (method RL A3C) does not accelerate convergence compared to the other RL methods (Softmax and Thompson).

Operator usage per learning method

Reading data from logs_gen*.jsonl files

Each line corresponds to an adaptation. The format is the following:

Episode | Adapting_Agent_ID | Partner_ID | Object | Partner decision | Operator_Used_ID | Action-Operator Probability | Probability distribution (operators choice) | KB modifications
0 1820 2 10 {'prop': ['p1', 'p2', 'p3'], 'class': 'c3'} c3 5 0.218551 [0.0, 0.62, 0.0, 0.0, 0.14, 0.22, 0.01] {'delete': [91], 'add': [90]}
1 1821 2 4 {'prop': ['p1', 'p2', 'p5', 'p6'], 'class': 'c3'} c1 2 0.425476 [0.0, 0.22, 0.43, 0.23, 0.05, 0.08, 0.0] {'delete': [], 'add': [188]}
2 1824 1 9 {'prop': ['p2', 'p3', 'p4', 'p5'], 'class': 'c1'} c2 4 0.283351 [0.0, 0.61, 0.0, 0.0, 0.28, 0.1, 0.01] {'delete': [208], 'add': [209]}
3 1825 7 5 {'prop': ['p1', 'p2', 'p3'], 'class': 'c3'} c3 1 0.482486 [0.0, 0.48, 0.0, 0.0, 0.16, 0.35, 0.0] {'delete': [91], 'add': [90]}
4 1826 4 5 {'prop': ['p1', 'p4', 'p6'], 'class': 'c3'} c1 5 0.733313 [0.0, 0.15, 0.0, 0.0, 0.11, 0.73, 0.0] {'delete': [122], 'add': [120]}
5 1827 1 5 {'prop': ['p3', 'p5', 'p6'], 'class': 'c2'} c2 1 0.604019 [0.0, 0.6, 0.0, 0.0, 0.28, 0.1, 0.01] {'delete': [162], 'add': [161]}
6 1832 1 6 {'prop': ['p1', 'p2', 'p4', 'p5'], 'class': 'c2'} c1 4 0.288929 [0.0, 0.6, 0.0, 0.0, 0.29, 0.1, 0.01] {'delete': [181], 'add': [180]}
7 1833 6 5 {'prop': ['p2', 'p3', 'p4', 'p6'], 'class': 'c2'} c2 4 0.166113 [0.0, 0.09, 0.0, 0.0, 0.17, 0.74, 0.01] {'delete': [215], 'add': [213]}
8 1835 1 5 {'prop': ['p1', 'p3', 'p5'], 'class': 'c2'} c2 1 0.597040 [0.0, 0.6, 0.0, 0.0, 0.29, 0.1, 0.01] {'delete': [108], 'add': [109]}
9 1839 2 5 {'prop': ['p2', 'p3'], 'class': 'c2'} c1 1 0.623034 [0.0, 0.62, 0.0, 0.0, 0.14, 0.22, 0.01] {'delete': [51], 'add': [48]}
Method Random+: 250 agents
Method Thompson: 250 agents
Method Softmax: 250 agents
Method RL A3C: 250 agents
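
A minimal sketch of how these logs can be loaded: each file holds one JSON object per line, which pandas.read_json handles with lines=True (file names follow the directory layout above):

import pandas as pd
from pathlib import Path

def load_adaptation_logs(seed_dir):
    """Concatenate the per-generation JSONL logs of one (method, seed) run."""
    frames = [pd.read_json(path, lines=True)
              for path in sorted(Path(seed_dir).glob("logs*_gen*.jsonl"))]
    return pd.concat(frames, ignore_index=True)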

Plot results


Environment generation and seeds used for experiments from the paper

The following trees represent the different $d^*$ functions (environment object generation) used.

SEEDS USED :  [24, 25, 27, 2626, 2727]
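
The generator itself lives in the repository; the following is only an illustrative sketch of a seeded, $p_{\text{stop}}$-controlled recursive tree construction, not the actual $d^*$ implementation:

import random

def generate_tree(n_props=6, n_classes=4, p_stop=0.6, seed=24):
    """Grow a random decision tree over properties (illustrative sketch).
    Each branch stops with probability p_stop and becomes a leaf labelled
    with a class, so p_stop controls the expected depth of the tree."""
    rng = random.Random(seed)

    def grow(free_props):
        if not free_props or rng.random() < p_stop:
            return {"class": f"c{rng.randrange(1, n_classes + 1)}"}
        prop = rng.choice(sorted(free_props))
        return {"prop": f"p{prop}",
                "yes": grow(free_props - {prop}),
                "no": grow(free_props - {prop})}

    return grow(set(range(1, n_props + 1)))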
(Five figures: the generated $d^*$ trees, one per seed.)

Hyperparameters table

Name | Symbol | Value | Description

Environment-related
Number of properties | $|\mathcal{P}|$ | (cfg) | Number of object types is $2^{|\mathcal{P}|}$
Number of decisions | $|\mathcal{D}|$ | (cfg) |
Number of agents | $N_{\text{agents}}$ | (cfg) |
Number of generations | $N_{\text{gen}}$ | (cfg) | Total training iterations
Interaction window size | -- | $500$ | Window for population success rate computation
Environment tree generation | $(N_{\text{props}}, N_{\text{classes}}, p_{\text{stop}})$ | $p_{\text{stop}} = 0.6$ | $p_{\text{stop}}$ controls the depth of the tree
KB initialization (opt. 1) | $N(K_{init})$ | $Bin(n = \frac{|\text{obj}|}{2}, p = 0.5)$ | Choose clauses among correct ones only
KB initialization (opt. 2) | $N(K_{init})$ | $Bin(n = |\text{obj}|, p = 0.5)$ | Choose clauses among correct and incorrect ones

RL setup
Reward function | $R$ | $\frac{3}{2} \cdot \text{mean(score)}$ | $\text{mean}\in[-1,1]$, amplified
Interactions before reward | -- | $30$ | Number of informative interactions before evaluating past actions
Episode type | -- | Single-step update | Policy updated after every reward
$\tau$ schedule (Softmax) | $(\tau_{start},\tau_{end},\tau_{decay})$ | $(0.5, 0.05, 0.9995)$ | Decreasing temperature over time to reduce exploration
Learning rate (Softmax) | $lr$ | $0.1$ | Learning rate for the Softmax RL update rule

A3C learning-related
Actor network architecture | -- | Input → 256 → 256 → 7 | Linear layers, empirical choice
Critic network architecture | -- | Input → 256 → 256 → 1 | Linear layers, empirical choice
Actor temperature | $\tau$ | $[1,10]$ | Controls distribution sharpness
Learning rate (Actor) | $\alpha_A$ | $1\times 10^{-3}$ | Empirical choice
Learning rate (Critic) | $\alpha_C$ | $1\times 10^{-3}$ | Empirical choice
Activation (Actor & Critic) | -- | LeakyReLU |
Gradient clipping | -- | $\|\nabla\|_\infty \leq 20$ | Stabilizes training
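
A minimal PyTorch sketch of the actor and critic described in this table; the state dimension and the choice of Adam as optimizer are our assumptions, not taken from the repository:

import torch
import torch.nn as nn

N_OPERATORS = 7  # one actor output logit per operator

def mlp(in_dim, out_dim):
    """Input -> 256 -> 256 -> out_dim with LeakyReLU, as in the table."""
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.LeakyReLU(),
        nn.Linear(256, 256), nn.LeakyReLU(),
        nn.Linear(256, out_dim),
    )

state_dim = 32  # placeholder for the real interaction-state encoding
actor, critic = mlp(state_dim, N_OPERATORS), mlp(state_dim, 1)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)   # lr from the table
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)  # lr from the table

# After each backward(): clip gradients by infinity norm at 20, as in the table.
torch.nn.utils.clip_grad_norm_(actor.parameters(), 20, norm_type=float("inf"))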