Experiment 20251230-KARL

Experiment design

Agents play the knowledge-based agreement game using their symbolic knowledge bases. Agents adapt their knowledge base by selecting adaptation operators through reinforcement learning.

Date: 2025-12-30

Designer: Richard Trézeux

Hypotheses:
- H1: Learned policies are independent of the environment.
- H2: Learned policies enable agents to reach consensus efficiently.

10 agents; 5 environments; 5 runs per environment; 40 generations per run.

Experiment

Date: 2025-12-30 (Richard Trézeux)

Link to code

Simulator code hash: c3fab184644a56f1783a5fbb10d20555997bfe64

Parameter file: params.sh

Executed command (script.sh):

#!/bin/bash

# retrieve code

. params.sh

git clone https://gitlab.inria.fr/moex/cke-adapt-operators-marl.git code
cd code
git checkout $LLHASH
cd ..

pip install -r code/requirements.txt

# run
        
METHODS=("softmax" "thompson" "a3c" "random+")
SEEDS=(24 25 27 2626 2727)

MAX_JOBS=5

for method in "${METHODS[@]}"; do
  for seed in "${SEEDS[@]}"; do
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do
      sleep 1
    done
    echo "Running $method seed $seed"

    python code/main.py \
        --config params.sh \
        --method "$method" \
        --seed "$seed" &

  done
done

wait

Experimental Plan

The independent variables have been varied as follows:
LEARNING_METHOD = [random+, thompson, softmax, a3c]
SEED = [24, 25, 27, 2626, 2727] (randomly selected)

Constants of the experiment:
NBINTERACTIONS = 100000
NBPROPERTIES = 6
NBDECISIONS = 4
KB_INIT_METHOD = random
ADAPTING_AGENT_SELECTION = accuracy

Raw results

Full results are available on Zenodo

DOI

Key parameters description

Parameter                 Description
NBAGENTS                  Size of the agent population.
NBGENERATIONS             Number of generations (iterations).
NBINTERACTIONS            Maximum number of interactions for a population (a safeguard in case agents do not converge to a consensus).
NBPROPERTIES              Number of properties in the environment (influences complexity).
NBDECISIONS               Number of decisions to discriminate objects.
KB_INIT_METHOD            KB initialization method. Two methods are implemented (see paper).
ADAPTING_AGENT_SELECTION  The score used to select which agent adapts its knowledge (accuracy or successrate).
LEARNING_METHOD           Type of RL updates and policy for agents.
SEED                      Random seed of the experiment. Controls the generation of objects in the environment.

NBPROPERTIES, NBCLASSES, NBAGENTS, and NBGENERATIONS have a strong impact on computational time.
Large values may lead to long simulations.

Analysis

The directory structure we used for analysis is:

results
│
├───method1
│   ├───seed1
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   ├───seed2
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   └───seed3...
│
└───method2
    ├───seed1
    │       logs1_gen1.jsonl
    │       logs1_gen2.jsonl
    │       ...
    │       metrics.json
    └───seed2...

Hypothesis 1: Policies are environment-independent

Kruskal-Wallis test on operator use

$H_0$: agents use each operator as frequently as agents from other environments (same distribution across environments)

$H_1$: at least one environment has a different distribution
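As a minimal sketch, this test can be run with `scipy.stats.kruskal`, one sample of operator-use frequencies per environment. The arrays below are illustrative made-up values, not experiment results:

```python
import numpy as np
from scipy.stats import kruskal

# One sample of per-run operator-use frequencies per environment
# (illustrative values only; the real counts come from the logs).
env_samples = [
    np.array([0.12, 0.15, 0.11, 0.14, 0.13]),  # environment 1
    np.array([0.13, 0.12, 0.16, 0.14, 0.12]),  # environment 2
    np.array([0.11, 0.14, 0.13, 0.15, 0.12]),  # environment 3
]

# H0: all environments share the same distribution of operator use.
stat, p_value = kruskal(*env_samples)
print(f"H = {stat:.4f}, p-value = {p_value:.4f}")
```

The test is repeated once per operator and per learning method, yielding the p-value tables below.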

-- KRUSKAL-WALLIS TEST RESULTS METHOD thompson --
        Operator   p-value
0           Wait  0.166549
1  Remove-Mirror  0.247012
2        Spe Max  0.390935
3        Spe Min  0.650062
4        Gen Max  0.375935
5        Gen Min  0.551871
6         Mirror  0.164023
-- KRUSKAL-WALLIS TEST RESULTS METHOD softmax --
        Operator   p-value
0           Wait  0.294654
1  Remove-Mirror  0.520859
2        Spe Max  0.751318
3        Spe Min  0.657099
4        Gen Max   0.18153
5        Gen Min  0.041302
6         Mirror   0.17907
-- KRUSKAL-WALLIS TEST RESULTS METHOD a3c --
        Operator   p-value
0           Wait  0.358397
1  Remove-Mirror  0.311148
2        Spe Max  0.370415
3        Spe Min  0.371088
4        Gen Max  0.938428
5        Gen Min  0.196124
6         Mirror  0.026506

Verifying policy environment independence with a PERMANOVA analysis

1) Boltzmann/Softmax method

The learned $Q$-vectors, averaged over agents:

        Operator  Avg Q-value  Std Q-value
0           Wait        -0.50         0.00
1  Remove-Mirror         0.38         0.23
2        Spe Max         0.53         0.52
3        Spe Min         0.38         0.53
4        Gen Max         0.15         0.31
5        Gen Min         0.14         0.29
6         Mirror        -0.02         0.12
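For reference, the Boltzmann/Softmax policy turns such $Q$-values into operator probabilities via a temperature $\tau$. A minimal sketch: the Q-vector copies the averaged values from the table above, and $\tau = 0.1$ is an assumed value inside the stated $[0.05, 0.5]$ range:

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Softmax over Q-values with temperature tau (lower tau = sharper)."""
    z = np.asarray(q_values) / tau
    z -= z.max()                 # shift for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

# Averaged Q-values from the table above (Wait ... Mirror).
q_avg = [-0.50, 0.38, 0.53, 0.38, 0.15, 0.14, -0.02]
probs = boltzmann_policy(q_avg, tau=0.1)
print(np.round(probs, 3))
```

With this temperature the policy concentrates on Spe Max (the highest average Q-value) while keeping some mass on the other positive-Q operators.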

PERMANOVA analysis
$H_0$: the distribution of policies does not depend on the environment (same distribution across environments)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                     250
number of groups                  5
test statistic             1.221952
p-value                    0.268746
number of permutations         5000
Name: PERMANOVA results, dtype: object
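The experiment may have used a library implementation of PERMANOVA (e.g. scikit-bio); as a self-contained sketch, Anderson's pseudo-F statistic with label permutations can be written as below. Rows of `X` are per-agent policy vectors and `labels` are environment ids; the data in the usage example is synthetic:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def permanova(X, labels, permutations=5000, seed=0):
    """One-way PERMANOVA (Anderson's pseudo-F) on Euclidean distances."""
    d2 = squareform(pdist(X)) ** 2       # squared pairwise distances
    labels = np.asarray(labels)
    n, groups = len(labels), np.unique(labels)

    def pseudo_f(lab):
        ss_within = 0.0
        for g in groups:
            idx = np.flatnonzero(lab == g)
            sub = d2[np.ix_(idx, idx)]
            ss_within += sub[np.triu_indices(len(idx), 1)].sum() / len(idx)
        ss_total = d2[np.triu_indices(n, 1)].sum() / n
        ss_between = ss_total - ss_within
        k = len(groups)
        return (ss_between / (k - 1)) / (ss_within / (n - k))

    f_obs = pseudo_f(labels)
    rng = np.random.default_rng(seed)
    f_perm = np.array([pseudo_f(rng.permutation(labels))
                       for _ in range(permutations)])
    # permutation p-value with the +1 continuity correction
    p_value = (np.sum(f_perm >= f_obs) + 1) / (permutations + 1)
    return f_obs, p_value

# Synthetic example: 5 environments x 10 policy vectors of length 7.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 7))
labels = np.repeat(np.arange(5), 10)
f_obs, p = permanova(X, labels, permutations=999)
print(f"pseudo-F = {f_obs:.3f}, p = {p:.3f}")
```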

2) Thompson sampling method

The expectation and standard deviation of the $\text{Beta}(\alpha_{op},\beta_{op})$ distribution for each operator, averaged over agents:

        Operator  Beta distribution mean value  Beta distribution std
0           Wait                          0.01                   0.00
1  Remove-Mirror                          0.85                   0.07
2        Spe Max                          0.50                   0.18
3        Spe Min                          0.22                   0.14
4        Gen Max                          0.46                   0.22
5        Gen Min                          0.95                   0.05
6         Mirror                          0.68                   0.04
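Thompson sampling maintains a $\text{Beta}(\alpha_{op},\beta_{op})$ posterior per operator and acts greedily on one draw from each posterior. A minimal sketch: the $(\alpha,\beta)$ pairs below are illustrative assumptions chosen only so that their means match the table above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (alpha, beta) posteriors per operator (Wait ... Mirror);
# only their means a/(a+b) are grounded in the table above.
posteriors = [(1, 99), (85, 15), (50, 50), (22, 78), (46, 54), (95, 5), (68, 32)]

# Closed-form mean and std of Beta(a, b).
for a, b in posteriors:
    mean = a / (a + b)
    std = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
    print(f"mean={mean:.2f} std={std:.3f}")

# One Thompson step: sample every posterior, pick the largest draw.
samples = [rng.beta(a, b) for a, b in posteriors]
chosen_op = int(np.argmax(samples))
```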

PERMANOVA analysis
$H_0$: the distribution of policies does not depend on the environment (same distribution across environments)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                     250
number of groups                  5
test statistic             1.029065
p-value                    0.420716
number of permutations         5000
Name: PERMANOVA results, dtype: object

The PERMANOVA p-values for both the Softmax (0.269) and Thompson (0.421) methods are above 0.05, so we cannot reject $H_0$: the learned policies appear to be environment-independent.

Hypothesis 2: Policies enable agents to reach consensus efficiently

Reading data from metrics.json files

For a given method, each environment has a metrics.json file containing data from 5 runs in which agents play the knowledge-based agreement game for 40 generations. Agents keep their RL policy from one generation to the next.
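Given the directory layout above, the metrics.json files can be gathered per (method, seed) as in this sketch; the structure of the JSON content itself is not documented here, so the loader stays generic:

```python
import json
from pathlib import Path

def load_metrics(results_dir="results"):
    """Collect metrics.json files keyed by (method, seed) directory names,
    following the results/<method>/<seed>/metrics.json layout."""
    metrics = {}
    for path in sorted(Path(results_dir).glob("*/*/metrics.json")):
        method, seed = path.parent.parent.name, path.parent.name
        with path.open() as f:
            metrics[(method, seed)] = json.load(f)
    return metrics
```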

Number of adaptations for each (agent, generation) pair, skipping the first ten generations, for a given (run, environment):

Gen  A0  A1  A2  A3  A4  A5  A6  A7  A8  A9
10  181 209 227 233 118 263 263 245 294 258
11 194 259 207 190 206 141 135 256 129 158
12 218 132 243 239 187 333 270 192 245 244
13 170 241 160 162 128 261 229 191 260 203
14 360 183 177 253 268 196 208 254 156 149
15 187 280 259 257 284 286 160 229 133 237
16 234 295 218 128 240 205 186 231 201 145
17 191 217 193 214 109 257 206 229 280 217
18 188 180 265 156 232 205 221 136 217 149
19 250 186 259 198 181 202 160 209 203 232
20 243 213 219 191 222 163 109 283 168 213
21 254 356 293 248 140 260 334 121 280 241
22 131 169 124 213 148 250 223 317 106 250
23 203 142 258 281 288 223 290 177 255 123
24 158 163 142 237 215 237 304 219 143 177
25 210 190 240 115 160 205 218 127 152 128
26 179 174 191 230 288 214 222 232 222 182
27 238 259 288 182 206 169 261 240 112 149
28 273 133 277 188 202 202 174 169 215 215
29 255 151 206 238 224 158 94 324 131 187
30 201 227 230 180 299 219 176 92 276 191
31 241 372 211 336 173 198 201 216 162 166
32 146 142 149 118 238 218 194 237 161 246
33 214 157 202 184 228 307 200 168 196 170
34 260 156 266 224 192 211 217 333 275 136
35 173 223 280 185 109 324 232 263 213 364
36 210 341 326 165 181 227 195 235 316 249
37 182 207 147 187 208 254 230 152 191 178
38 159 198 218 226 307 302 235 232 173 208
39 147 216 205 257 248 162 234 223 214 305

Reinforcement learning policies decrease the number of adaptations required to reach consensus, while the Random+ strategy does not.


Reinforcement learning policies achieve higher reward earnings than the Random+ strategy.

Plot of average number of adaptations before reaching consensus for each method and environment


We observe that the RL strategies reach consensus using fewer adaptations than the Random+ strategy.

We also observe that the incorporation of interaction-specific information into the RL state (method RL A3C) does not accelerate convergence compared to the other RL methods (Softmax and Thompson).

Operators usage per learning method

Reading data from logs_gen*.jsonl files

Each line corresponds to an adaptation. The format is the following:

Episode | Adapting_Agent_ID | Partner_ID | Object | Partner decision | Operator_Used_ID | Action-Operator Probability | Probability distribution (operator choice) | KB modifications
0 1820 2 10 {'prop': ['p1', 'p2', 'p3'], 'class': 'c3'} c3 5 0.218551 [0.0, 0.62, 0.0, 0.0, 0.14, 0.22, 0.01] {'delete': [91], 'add': [90]}
1 1821 2 4 {'prop': ['p1', 'p2', 'p5', 'p6'], 'class': 'c3'} c1 2 0.425476 [0.0, 0.22, 0.43, 0.23, 0.05, 0.08, 0.0] {'delete': [], 'add': [188]}
2 1824 1 9 {'prop': ['p2', 'p3', 'p4', 'p5'], 'class': 'c1'} c2 4 0.283351 [0.0, 0.61, 0.0, 0.0, 0.28, 0.1, 0.01] {'delete': [208], 'add': [209]}
3 1825 7 5 {'prop': ['p1', 'p2', 'p3'], 'class': 'c3'} c3 1 0.482486 [0.0, 0.48, 0.0, 0.0, 0.16, 0.35, 0.0] {'delete': [91], 'add': [90]}
4 1826 4 5 {'prop': ['p1', 'p4', 'p6'], 'class': 'c3'} c1 5 0.733313 [0.0, 0.15, 0.0, 0.0, 0.11, 0.73, 0.0] {'delete': [122], 'add': [120]}
5 1827 1 5 {'prop': ['p3', 'p5', 'p6'], 'class': 'c2'} c2 1 0.604019 [0.0, 0.6, 0.0, 0.0, 0.28, 0.1, 0.01] {'delete': [162], 'add': [161]}
6 1832 1 6 {'prop': ['p1', 'p2', 'p4', 'p5'], 'class': 'c2'} c1 4 0.288929 [0.0, 0.6, 0.0, 0.0, 0.29, 0.1, 0.01] {'delete': [181], 'add': [180]}
7 1833 6 5 {'prop': ['p2', 'p3', 'p4', 'p6'], 'class': 'c2'} c2 4 0.166113 [0.0, 0.09, 0.0, 0.0, 0.17, 0.74, 0.01] {'delete': [215], 'add': [213]}
8 1835 1 5 {'prop': ['p1', 'p3', 'p5'], 'class': 'c2'} c2 1 0.597040 [0.0, 0.6, 0.0, 0.0, 0.29, 0.1, 0.01] {'delete': [108], 'add': [109]}
9 1839 2 5 {'prop': ['p2', 'p3'], 'class': 'c2'} c1 1 0.623034 [0.0, 0.62, 0.0, 0.0, 0.14, 0.22, 0.01] {'delete': [51], 'add': [48]}
Method Random+: 250 agents
Method Thompson: 250 agents
Method Softmax: 250 agents
Method RL A3C: 250 agents
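A sketch of reading one of these .jsonl log files and tallying operator use per adaptation; the JSON key name `Operator_Used_ID` is taken from the table header above and may differ in the raw files:

```python
import json
from collections import Counter

def operator_usage(log_path):
    """Count how often each operator ID appears in one .jsonl log file
    (one JSON object per line, one line per adaptation)."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["Operator_Used_ID"]] += 1
    return counts
```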

Plot results


Environment generation and seeds used for experiments from the paper

The following trees represent the different $d^*$ functions (environment object generation) used.

SEEDS USED :  [24, 25, 27, 2626, 2727]
[Figures: $d^*$ trees, one per seed]
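The hyperparameters table below lists $p_{\text{stop}} = 0.6$ for environment tree generation. As a hypothetical sketch of such a generator (the actual $d^*$ construction is in the linked simulator code; the names and binary structure here are assumptions):

```python
import random

def generate_tree(rng, depth=0, max_depth=6, p_stop=0.6):
    """Recursively grow a binary decision tree; each node becomes a leaf
    with probability p_stop (higher p_stop => shallower trees)."""
    if depth == max_depth or rng.random() < p_stop:
        return {"leaf": True}
    return {
        "leaf": False,
        "left": generate_tree(rng, depth + 1, max_depth, p_stop),
        "right": generate_tree(rng, depth + 1, max_depth, p_stop),
    }

# Seeding with one of the experiment's seeds makes the tree reproducible.
tree = generate_tree(random.Random(24))
```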

Hyperparameters table

Name Symbol Value Description
Environment-related
Number of properties $|\mathcal{P}|$ (cfg) Number of object types is $2^{|\mathcal{P}|}$
Number of decisions $|\mathcal{D}|$ (cfg)
Number of agents $N_{\text{agents}}$ (cfg)
Number of generations $N_{\text{gen}}$ (cfg) Total training iterations
Interaction window size -- 500 Window for population success rate computation
Environment Tree generation $(N_{\text{props}}, N_{\text{classes}}, p_{\text{stop}})$ $p_{\text{stop}} = 0.6$ $p_{\text{stop}}$ controls depth of the tree
KB initialization (opt. 1) $N(K_{init})$ $Bin(n = \frac{|\text{obj}|}{2}, p = 0.5)$ Choose clauses among correct ones only
KB initialization (opt. 2) $N(K_{init})$ $Bin(n = |\text{obj}|, p = 0.5)$ Choose clauses among correct and incorrect ones
RL setup
Reward function $R$ $\text{mean(score)} \cdot \text{amplifier}$ $\text{mean}\in[-1,1]$, amplified
Episode type -- Single-step update Policy updated after every reward
$\epsilon$ schedule $(\epsilon_{start},\epsilon_{end},\epsilon_{decay})$ $(0.9,0.01,0.9984)$ Exploration stops after $\approx 2500$ adaptations
Learning-related
Actor network architecture -- Input → 256 → 256 → 7 Linear layers, empirical choice
Critic network architecture -- Input → 256 → 256 → 1 Linear layers, empirical choice
Actor temperature $\tau$ $[1,10]$ Controls distribution sharpness
Boltzmann temperature $\tau$ $[0.05,0.5]$ Controls distribution sharpness
Learning rate (Actor) $\alpha_A$ $1\times 10^{-3}$ Empirical choice
Learning rate (Critic) $\alpha_C$ $1\times 10^{-3}$ Empirical choice
Activation (Actor & Critic) -- LeakyReLU
Gradient clipping -- $\|\nabla\|_\infty \leq 20$ Stabilizes training
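The $\epsilon$ schedule in the table can be checked numerically. This sketch assumes the common multiplicative decay $\epsilon_t = \max(\epsilon_{\text{end}}, \epsilon_{\text{start}} \cdot \epsilon_{\text{decay}}^t)$; the exact schedule form is in the simulator code, so this form is an assumption:

```python
eps_start, eps_end, eps_decay = 0.9, 0.01, 0.9984

def epsilon(t):
    """Exploration rate after t adaptations (multiplicative decay, floored)."""
    return max(eps_end, eps_start * eps_decay ** t)

# Decays from 0.9 to roughly 0.016 after 2500 adaptations, consistent
# with the "exploration stops after ~2500 adaptations" note in the table.
print(epsilon(0), epsilon(2500))
```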