Experiment 20251230-KARL

Experiment design

Agents play the knowledge-based agreement game using their symbolic knowledge bases. Agents adapt their knowledge base by selecting adaptation operators through reinforcement learning.

Date: 2025-12-30

Designer: Richard Trézeux

Hypotheses:
- H1: Learned policies are independent of the environment.
- H2: Learned policies enable agents to reach consensus efficiently.

10 agents; 5 environments; 5 runs per environment; 40 generations per run.

Experiment

Date: 2025-12-30 (Richard Trézeux)

Link to code

Simulator code hash: c3fab184644a56f1783a5fbb10d20555997bfe64

Parameter file: params.sh

Executed command (script.sh):

#!/bin/bash

# retrieve code

. params.sh

git clone https://gitlab.inria.fr/moex/cke-adapt-operators-marl.git code
cd code
git checkout $LLHASH
cd ..

pip install -r code/requirements.txt

# run
        
METHODS=("softmax" "thompson" "a3c" "random+")
SEEDS=(24 25 27 2626 2727)

MAX_JOBS=5

for method in "${METHODS[@]}"; do
  for seed in "${SEEDS[@]}"; do
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do
      sleep 1
    done
    echo "Running $method seed $seed"

    python code/main.py \
        --config params.sh \
        --method "$method" \
        --seed "$seed" &

  done
done

wait

Experimental Plan

The independent variables have been varied as follows:
LEARNING_METHOD = [random+, thompson, softmax, a3c]
SEED = [24, 25, 27, 2626, 2727] (randomly selected)

Constants of the experiment:
NBINTERACTIONS = 100000
NBPROPERTIES = 6
NBDECISIONS = 4
KB_INIT_METHOD = random
ADAPTING_AGENT_SELECTION = accuracy

Raw results

Full results are available on Zenodo

DOI

Key parameters description

Parameter                 Description
NBAGENTS                  Size of the agent population.
NBGENERATIONS             Number of generations (iterations).
NBINTERACTIONS            Maximum number of interactions for a population (a safeguard in case agents do not converge to a consensus).
NBPROPERTIES              Number of properties in the environment (influences complexity).
NBDECISIONS               Number of decisions to discriminate objects.
KB_INIT_METHOD            KB initialization method. Two methods are implemented (see paper).
ADAPTING_AGENT_SELECTION  The score used to select which agent adapts its knowledge (accuracy or successrate).
LEARNING_METHOD           Type of RL updates and policy for agents.
SEED                      Random seed of the experiment. Controls the generation of objects in the environment.

NBPROPERTIES, NBCLASSES, NBAGENTS, and NBGENERATIONS have a strong impact on computational time.
Large values may lead to long simulations.

Analysis

The directory structure we used for analysis is:

results
│
├───method1
│   ├───seed1
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   ├───seed2
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   └───seed3...
│
└───method2
    ├───seed1
    │       logs1_gen1.jsonl
    │       logs1_gen2.jsonl
    │       ...
    │       metrics.json
    └───seed2...

Hypothesis 1: Policies are environment-independent

Kruskal-Wallis test on operator use

$H_0$: agents use each operator as frequently as agents from other environments (same distribution across environments)

$H_1$: at least one environment has a different distribution
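As a minimal sketch, this test can be run with `scipy.stats.kruskal`, one sample of operator-use frequencies per environment. The arrays below are illustrative made-up values, not experiment results:

```python
import numpy as np
from scipy.stats import kruskal

# One sample of per-run operator-use frequencies per environment
# (illustrative values only; the real counts come from the logs).
env_samples = [
    np.array([0.12, 0.15, 0.11, 0.14, 0.13]),  # environment 1
    np.array([0.13, 0.12, 0.16, 0.14, 0.12]),  # environment 2
    np.array([0.11, 0.14, 0.13, 0.15, 0.12]),  # environment 3
]

# H0: all environments share the same distribution of operator use.
stat, p_value = kruskal(*env_samples)
print(f"H = {stat:.4f}, p-value = {p_value:.4f}")
```

The test is repeated once per operator and per learning method, yielding the p-value tables below.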

-- KRUSKAL-WALLIS TEST RESULTS METHOD thompson --
        Operator   p-value
0           Wait  0.166549
1  Remove-Mirror  0.247012
2        Spe Max  0.390935
3        Spe Min  0.650062
4        Gen Max  0.375935
5        Gen Min  0.551871
6         Mirror  0.164023
-- KRUSKAL-WALLIS TEST RESULTS METHOD softmax --
        Operator   p-value
0           Wait  0.294654
1  Remove-Mirror  0.520859
2        Spe Max  0.751318
3        Spe Min  0.657099
4        Gen Max   0.18153
5        Gen Min  0.041302
6         Mirror   0.17907
-- KRUSKAL-WALLIS TEST RESULTS METHOD a3c --
        Operator   p-value
0           Wait  0.358397
1  Remove-Mirror  0.311148
2        Spe Max  0.370415
3        Spe Min  0.371088
4        Gen Max  0.938428
5        Gen Min  0.196124
6         Mirror  0.026506

Verifying policy environment independence with a PERMANOVA analysis

1) Boltzmann/Softmax method

The learned $Q$-vectors, averaged over agents:

        Operator  Avg Q-value  Std Q-value
0           Wait        -0.50         0.00
1  Remove-Mirror         0.38         0.23
2        Spe Max         0.53         0.52
3        Spe Min         0.38         0.53
4        Gen Max         0.15         0.31
5        Gen Min         0.14         0.29
6         Mirror        -0.02         0.12
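For reference, the Boltzmann/Softmax policy turns such $Q$-values into operator probabilities via a temperature $\tau$. A minimal sketch: the Q-vector copies the averaged values from the table above, and $\tau = 0.1$ is an assumed value inside the stated $[0.05, 0.5]$ range:

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Softmax over Q-values with temperature tau (lower tau = sharper)."""
    z = np.asarray(q_values) / tau
    z -= z.max()                 # shift for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

# Averaged Q-values from the table above (Wait ... Mirror).
q_avg = [-0.50, 0.38, 0.53, 0.38, 0.15, 0.14, -0.02]
probs = boltzmann_policy(q_avg, tau=0.1)
print(np.round(probs, 3))
```

With this temperature the policy concentrates on Spe Max (the highest average Q-value) while keeping some mass on the other positive-Q operators.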

PERMANOVA analysis
$H_0$: the distribution of policies does not depend on the environment (same distribution across environments)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                     250
number of groups                  5
test statistic             1.221952
p-value                    0.268746
number of permutations         5000
Name: PERMANOVA results, dtype: object
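The experiment may have used a library implementation of PERMANOVA (e.g. scikit-bio); as a self-contained sketch, Anderson's pseudo-F statistic with label permutations can be written as below. Rows of `X` are per-agent policy vectors and `labels` are environment ids; the data in the usage example is synthetic:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def permanova(X, labels, permutations=5000, seed=0):
    """One-way PERMANOVA (Anderson's pseudo-F) on Euclidean distances."""
    d2 = squareform(pdist(X)) ** 2       # squared pairwise distances
    labels = np.asarray(labels)
    n, groups = len(labels), np.unique(labels)

    def pseudo_f(lab):
        ss_within = 0.0
        for g in groups:
            idx = np.flatnonzero(lab == g)
            sub = d2[np.ix_(idx, idx)]
            ss_within += sub[np.triu_indices(len(idx), 1)].sum() / len(idx)
        ss_total = d2[np.triu_indices(n, 1)].sum() / n
        ss_between = ss_total - ss_within
        k = len(groups)
        return (ss_between / (k - 1)) / (ss_within / (n - k))

    f_obs = pseudo_f(labels)
    rng = np.random.default_rng(seed)
    f_perm = np.array([pseudo_f(rng.permutation(labels))
                       for _ in range(permutations)])
    # permutation p-value with the +1 continuity correction
    p_value = (np.sum(f_perm >= f_obs) + 1) / (permutations + 1)
    return f_obs, p_value

# Synthetic example: 5 environments x 10 policy vectors of length 7.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 7))
labels = np.repeat(np.arange(5), 10)
f_obs, p = permanova(X, labels, permutations=999)
print(f"pseudo-F = {f_obs:.3f}, p = {p:.3f}")
```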

2) Thompson sampling method

The expectation and standard deviation of the $\text{Beta}(\alpha_{op},\beta_{op})$ distribution for each operator, averaged over agents:

        Operator  Beta distribution mean value  Beta distribution std
0           Wait                          0.01                   0.00
1  Remove-Mirror                          0.85                   0.07
2        Spe Max                          0.50                   0.18
3        Spe Min                          0.22                   0.14
4        Gen Max                          0.46                   0.22
5        Gen Min                          0.95                   0.05
6         Mirror                          0.68                   0.04
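Thompson sampling maintains a $\text{Beta}(\alpha_{op},\beta_{op})$ posterior per operator and acts greedily on one draw from each posterior. A minimal sketch: the $(\alpha,\beta)$ pairs below are illustrative assumptions chosen only so that their means match the table above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (alpha, beta) posteriors per operator (Wait ... Mirror);
# only their means a/(a+b) are grounded in the table above.
posteriors = [(1, 99), (85, 15), (50, 50), (22, 78), (46, 54), (95, 5), (68, 32)]

# Closed-form mean and std of Beta(a, b).
for a, b in posteriors:
    mean = a / (a + b)
    std = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
    print(f"mean={mean:.2f} std={std:.3f}")

# One Thompson step: sample every posterior, pick the largest draw.
samples = [rng.beta(a, b) for a, b in posteriors]
chosen_op = int(np.argmax(samples))
```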

PERMANOVA analysis
$H_0$: the distribution of policies does not depend on the environment (same distribution across environments)

method name               PERMANOVA
test statistic name        pseudo-F
sample size                     250
number of groups                  5
test statistic             1.029065
p-value                    0.420716
number of permutations         5000
Name: PERMANOVA results, dtype: object

The PERMANOVA p-values for both the Softmax (0.269) and Thompson (0.421) methods are above 0.05, so we cannot reject $H_0$: the learned policies appear to be environment-independent.

Hypothesis 2: Policies enable agents to reach consensus efficiently

Reading data from metrics.json files

For a given method, each environment has a metrics.json file containing data from 5 runs in which agents play the knowledge-based agreement game for 40 generations. Agents keep their RL policy from one generation to the next.
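Given the directory layout above, the metrics.json files can be gathered per (method, seed) as in this sketch; the structure of the JSON content itself is not documented here, so the loader stays generic:

```python
import json
from pathlib import Path

def load_metrics(results_dir="results"):
    """Collect metrics.json files keyed by (method, seed) directory names,
    following the results/<method>/<seed>/metrics.json layout."""
    metrics = {}
    for path in sorted(Path(results_dir).glob("*/*/metrics.json")):
        method, seed = path.parent.parent.name, path.parent.name
        with path.open() as f:
            metrics[(method, seed)] = json.load(f)
    return metrics
```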

Number of adaptations for each (agent, generation) pair, skipping the first ten generations, for a given (run, environment):

Gen  A0  A1  A2  A3  A4  A5  A6  A7  A8  A9
10  181 209 227 233 118 263 263 245 294 258
11 194 259 207 190 206 141 135 256 129 158
12 218 132 243 239 187 333 270 192 245 244
13 170 241 160 162 128 261 229 191 260 203
14 360 183 177 253 268 196 208 254 156 149
15 187 280 259 257 284 286 160 229 133 237
16 234 295 218 128 240 205 186 231 201 145
17 191 217 193 214 109 257 206 229 280 217
18 188 180 265 156 232 205 221 136 217 149
19 250 186 259 198 181 202 160 209 203 232
20 243 213 219 191 222 163 109 283 168 213
21 254 356 293 248 140 260 334 121 280 241
22 131 169 124 213 148 250 223 317 106 250
23 203 142 258 281 288 223 290 177 255 123
24 158 163 142 237 215 237 304 219 143 177
25 210 190 240 115 160 205 218 127 152 128
26 179 174 191 230 288 214 222 232 222 182
27 238 259 288 182 206 169 261 240 112 149
28 273 133 277 188 202 202 174 169 215 215
29 255 151 206 238 224 158 94 324 131 187
30 201 227 230 180 299 219 176 92 276 191
31 241 372 211 336 173 198 201 216 162 166
32 146 142 149 118 238 218 194 237 161 246
33 214 157 202 184 228 307 200 168 196 170
34 260 156 266 224 192 211 217 333 275 136
35 173 223 280 185 109 324 232 263 213 364
36 210 341 326 165 181 227 195 235 316 249
37 182 207 147 187 208 254 230 152 191 178
38 159 198 218 226 307 302 235 232 173 208
39 147 216 205 257 248 162 234 223 214 305

Reinforcement learning policies decrease the number of adaptations required to reach consensus, while the Random+ strategy does not.


Reinforcement learning policies achieve higher reward earnings than the Random+ strategy.

Plot of average number of adaptations before reaching consensus for each method and environment


We observe that the RL strategies reach consensus using fewer adaptations than the Random+ strategy.

We also observe that the incorporation of interaction-specific information into the RL state (method RL A3C) does not accelerate convergence compared to the other RL methods (Softmax and Thompson).

Operators usage per learning method

Reading data from logs_gen*.jsonl files

Each line corresponds to an adaptation. The format is the following:

Episode | Adapting_Agent_ID | Partner_ID | Object | Partner decision | Operator_Used_ID | Action-Operator Probability | Probability distribution (operator choice) | KB modifications
0 1820 2 10 {'prop': ['p1', 'p2', 'p3'], 'class': 'c3'} c3 5 0.218551 [0.0, 0.62, 0.0, 0.0, 0.14, 0.22, 0.01] {'delete': [91], 'add': [90]}
1 1821 2 4 {'prop': ['p1', 'p2', 'p5', 'p6'], 'class': 'c3'} c1 2 0.425476 [0.0, 0.22, 0.43, 0.23, 0.05, 0.08, 0.0] {'delete': [], 'add': [188]}
2 1824 1 9 {'prop': ['p2', 'p3', 'p4', 'p5'], 'class': 'c1'} c2 4 0.283351 [0.0, 0.61, 0.0, 0.0, 0.28, 0.1, 0.01] {'delete': [208], 'add': [209]}
3 1825 7 5 {'prop': ['p1', 'p2', 'p3'], 'class': 'c3'} c3 1 0.482486 [0.0, 0.48, 0.0, 0.0, 0.16, 0.35, 0.0] {'delete': [91], 'add': [90]}
4 1826 4 5 {'prop': ['p1', 'p4', 'p6'], 'class': 'c3'} c1 5 0.733313 [0.0, 0.15, 0.0, 0.0, 0.11, 0.73, 0.0] {'delete': [122], 'add': [120]}
5 1827 1 5 {'prop': ['p3', 'p5', 'p6'], 'class': 'c2'} c2 1 0.604019 [0.0, 0.6, 0.0, 0.0, 0.28, 0.1, 0.01] {'delete': [162], 'add': [161]}
6 1832 1 6 {'prop': ['p1', 'p2', 'p4', 'p5'], 'class': 'c2'} c1 4 0.288929 [0.0, 0.6, 0.0, 0.0, 0.29, 0.1, 0.01] {'delete': [181], 'add': [180]}
7 1833 6 5 {'prop': ['p2', 'p3', 'p4', 'p6'], 'class': 'c2'} c2 4 0.166113 [0.0, 0.09, 0.0, 0.0, 0.17, 0.74, 0.01] {'delete': [215], 'add': [213]}
8 1835 1 5 {'prop': ['p1', 'p3', 'p5'], 'class': 'c2'} c2 1 0.597040 [0.0, 0.6, 0.0, 0.0, 0.29, 0.1, 0.01] {'delete': [108], 'add': [109]}
9 1839 2 5 {'prop': ['p2', 'p3'], 'class': 'c2'} c1 1 0.623034 [0.0, 0.62, 0.0, 0.0, 0.14, 0.22, 0.01] {'delete': [51], 'add': [48]}
Method Random+: 250 agents
Method Thompson: 250 agents
Method Softmax: 250 agents
Method RL A3C: 250 agents
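A sketch of reading one of these .jsonl log files and tallying operator use per adaptation; the JSON key name `Operator_Used_ID` is taken from the table header above and may differ in the raw files:

```python
import json
from collections import Counter

def operator_usage(log_path):
    """Count how often each operator ID appears in one .jsonl log file
    (one JSON object per line, one line per adaptation)."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["Operator_Used_ID"]] += 1
    return counts
```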

Plot results


Environment generation and seeds used for experiments from the paper

The following trees represent the different $d^*$ functions (environment object generation) used.

SEEDS USED :  [24, 25, 27, 2626, 2727]
[Figures: $d^*$ trees, one per seed]
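The hyperparameters table below lists $p_{\text{stop}} = 0.6$ for environment tree generation. As a hypothetical sketch of such a generator (the actual $d^*$ construction is in the linked simulator code; the names and binary structure here are assumptions):

```python
import random

def generate_tree(rng, depth=0, max_depth=6, p_stop=0.6):
    """Recursively grow a binary decision tree; each node becomes a leaf
    with probability p_stop (higher p_stop => shallower trees)."""
    if depth == max_depth or rng.random() < p_stop:
        return {"leaf": True}
    return {
        "leaf": False,
        "left": generate_tree(rng, depth + 1, max_depth, p_stop),
        "right": generate_tree(rng, depth + 1, max_depth, p_stop),
    }

# Seeding with one of the experiment's seeds makes the tree reproducible.
tree = generate_tree(random.Random(24))
```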

Hyperparameters table

Name Symbol Value Description
Environment-related
Number of properties $|\mathcal{P}|$ (cfg) Number of object types is $2^{|\mathcal{P}|}$
Number of decisions $|\mathcal{D}|$ (cfg)
Number of agents $N_{\text{agents}}$ (cfg)
Number of generations $N_{\text{gen}}$ (cfg) Total training iterations
Interaction window size -- 500 Window for population success rate computation
Environment Tree generation $(N_{\text{props}}, N_{\text{classes}}, p_{\text{stop}})$ $p_{\text{stop}} = 0.6$ $p_{\text{stop}}$ controls depth of the tree
KB initialization (opt. 1) $N(K_{init})$ $Bin(n = \frac{|\text{obj}|}{2}, p = 0.5)$ Choose clauses among correct ones only
KB initialization (opt. 2) $N(K_{init})$ $Bin(n = |\text{obj}|, p = 0.5)$ Choose clauses among correct and incorrect ones
RL setup
Reward function $R$ $\text{mean(score)} \cdot \text{amplifier}$ $\text{mean}\in[-1,1]$, amplified
Episode type -- Single-step update Policy updated after every reward
$\epsilon$ schedule $(\epsilon_{start},\epsilon_{end},\epsilon_{decay})$ $(0.9,0.01,0.9984)$ Exploration stops after $\approx 2500$ adaptations
Learning-related
Actor network architecture -- Input → 256 → 256 → 7 Linear layers, empirical choice
Critic network architecture -- Input → 256 → 256 → 1 Linear layers, empirical choice
Actor temperature $\tau$ $[1,10]$ Controls distribution sharpness
Boltzmann temperature $\tau$ $[0.05,0.5]$ Controls distribution sharpness
Learning rate (Actor) $\alpha_A$ $1\times 10^{-3}$ Empirical choice
Learning rate (Critic) $\alpha_C$ $1\times 10^{-3}$ Empirical choice
Activation (Actor & Critic) -- LeakyReLU
Gradient clipping -- $\|\nabla\|_\infty \leq 20$ Stabilizes training
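The $\epsilon$ schedule in the table can be checked numerically. This sketch assumes the common multiplicative decay $\epsilon_t = \max(\epsilon_{\text{end}}, \epsilon_{\text{start}} \cdot \epsilon_{\text{decay}}^t)$; the exact schedule form is in the simulator code, so this form is an assumption:

```python
eps_start, eps_end, eps_decay = 0.9, 0.01, 0.9984

def epsilon(t):
    """Exploration rate after t adaptations (multiplicative decay, floored)."""
    return max(eps_end, eps_start * eps_decay ** t)

# Decays from 0.9 to roughly 0.016 after 2500 adaptations, consistent
# with the "exploration stops after ~2500 adaptations" note in the table.
print(epsilon(0), epsilon(2500))
```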