Agents play the knowledge-based agreement game using their symbolic knowledge bases. They adapt their knowledge bases by selecting adaptation operators through reinforcement learning.
Date: 2025-12-30
Designer: Richard Trézeux
Hypotheses: [Learned policies are independent of the environment, Learned policies enable agents to reach consensus efficiently]
10 agents; 5 environments; 5 runs per environment; 40 generations per run.
Date: 2025-12-30 (Richard Trézeux)
Link to code
Simulator code hash: c3fab184644a56f1783a5fbb10d20555997bfe64
Parameter file: params.sh
Executed command (script.sh):
#!/bin/bash
# retrieve code
. params.sh
git clone https://gitlab.inria.fr/moex/cke-adapt-operators-marl.git code
cd code
git checkout $LLHASH
cd ..
pip install -r code/requirements.txt
# run
METHODS=("softmax" "thompson" "a3c" "random+")
SEEDS=(24 25 27 2626 2727)
MAX_JOBS=5
for method in "${METHODS[@]}"; do
  for seed in "${SEEDS[@]}"; do
    # throttle: keep at most MAX_JOBS background runs
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do
      sleep 1
    done
    echo "Running $method seed $seed"
    python code/main.py \
      --config params.sh \
      --method "$method" \
      --seed "$seed" &
  done
done
wait
The independent variables have been varied as follows:
LEARNING_METHOD = [random+, thompson, softmax, a3c]
SEED = [24, 25, 27, 2626, 2727] (randomly selected)
Constants of the experiment:
NBINTERACTIONS = 100000
NBPROPERTIES = 6
NBDECISIONS = 4
KB_INIT_METHOD = random
ADAPTING_AGENT_SELECTION = accuracy
Full results are available on Zenodo
NBPROPERTIES, NBCLASSES, NBAGENTS, and NBGENERATIONS have a strong impact on computational time.
Large values may lead to long simulations.
The directory structure used for analysis is:
results
├───method1
│   ├───seed1
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   ├───seed2
│   │       logs1_gen1.jsonl
│   │       logs1_gen2.jsonl
│   │       ...
│   │       metrics.json
│   └───seed3...
└───method2
    ├───seed1
    │       logs1_gen1.jsonl
    │       logs1_gen2.jsonl
    │       ...
    │       metrics.json
    └───seed2...
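Given this layout, the metrics.json files can be collected per (method, seed) pair. A minimal sketch, assuming exactly the tree above (two directory levels below `results`, one metrics.json per seed):

```python
import json
from pathlib import Path

def collect_metrics(results_dir):
    """Collect every metrics.json under results/<method>/<seed>/,
    keyed by (method, seed), following the directory tree above."""
    metrics = {}
    root = Path(results_dir)
    for path in sorted(root.glob("*/*/metrics.json")):
        method, seed = path.parent.parent.name, path.parent.name
        with path.open() as f:
            metrics[(method, seed)] = json.load(f)
    return metrics
```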
$H_0:$ agents use each operator as frequently as agents from other environments
$H_1:$ at least one environment has a different operator-use distribution
Printing the learned $Q$ vectors, averaged over agents.
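As a hypothetical illustration of what is printed, the per-agent $Q$ vectors (one value per operator) can be averaged element-wise, and the corresponding Softmax operator-selection probabilities derived from the average; the temperature parameter `tau` is an assumed name, not a value from the experiment:

```python
import math

def average_q(q_vectors):
    """Element-wise mean of per-agent Q vectors (one entry per operator)."""
    n = len(q_vectors)
    return [sum(q[i] for q in q_vectors) / n for i in range(len(q_vectors[0]))]

def softmax_policy(q, tau=1.0):
    """Operator-selection probabilities under a softmax policy with temperature tau."""
    m = max(q)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp((v - m) / tau) for v in q]
    z = sum(exps)
    return [e / z for e in exps]
```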
PERMANOVA analysis
$H_0:$ the distribution of learned policies does not depend on the environment (same distribution across environments)
Printing the expectation and standard deviation of the $\text{Beta}(\alpha_{op},\beta_{op})$ distribution for each operator, averaged over agents.
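These statistics follow from the closed forms for a Beta distribution, $\mathbb{E}[\text{Beta}(\alpha,\beta)] = \alpha/(\alpha+\beta)$ and $\mathrm{Var} = \alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ (standard formulas, not the project's code). A minimal sketch:

```python
import math

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution,
    e.g. the per-operator posterior maintained by Thompson sampling."""
    s = alpha + beta
    mean = alpha / s
    var = alpha * beta / (s * s * (s + 1.0))
    return mean, math.sqrt(var)
```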
PERMANOVA analysis
$H_0:$ the distribution of learned policies does not depend on the environment (same distribution across environments)
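PERMANOVA can be sketched as a permutation test on the pseudo-F statistic computed from pairwise distances between policy vectors, with environments as groups. This is a minimal self-contained sketch, not the analysis code actually used (which may rely on a library such as scikit-bio); Euclidean distance and the label encoding are assumptions:

```python
import itertools, random

def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _pseudo_f(points, labels):
    """PERMANOVA pseudo-F from squared pairwise distances:
    F = (SS_between/(a-1)) / (SS_within/(n-a))."""
    n, groups = len(points), set(labels)
    ss_total = sum(_sq_dist(points[i], points[j])
                   for i, j in itertools.combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, l in enumerate(labels) if l == g]
        ss_within += sum(_sq_dist(points[i], points[j])
                         for i, j in itertools.combinations(idx, 2)) / len(idx)
    ss_between = ss_total - ss_within
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a))

def permanova(points, labels, permutations=999, seed=0):
    """Permutation p-value for H0: same distribution across groups."""
    rng = random.Random(seed)
    observed = _pseudo_f(points, labels)
    labels, count = list(labels), 0
    for _ in range(permutations):
        rng.shuffle(labels)  # break the point/group association
        if _pseudo_f(points, labels) >= observed:
            count += 1
    return observed, (count + 1) / (permutations + 1)
```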
The p-values for both the Softmax (0.265) and Thompson (0.390) methods are above 0.05, so $H_0$ cannot be rejected: the learned policies do not measurably depend on the environment.
Reading data from metrics.json files
For a given method, we have for each environment a metrics.json file that contains data from 5 runs with agents playing the knowledge-based agreement game for 40 generations.
Agents keep their RL policy from one generation to the next one.
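Per-generation series from the 5 runs of an environment can then be aggregated into an across-run mean and standard deviation per generation. A minimal sketch operating on plain lists, since the metrics.json schema is not reproduced here:

```python
import statistics

def mean_std_per_generation(runs):
    """Given one list of per-generation values per run, return the
    across-run mean and standard deviation for each generation.
    Assumes all runs cover the same number of generations."""
    per_gen = list(zip(*runs))  # transpose: one tuple per generation
    means = [statistics.fmean(g) for g in per_gen]
    stds = [statistics.stdev(g) if len(g) > 1 else 0.0 for g in per_gen]
    return means, stds
```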
Reinforcement learning policies decrease the number of adaptations required to reach consensus, while the Random+ strategy does not.
Reinforcement learning policies also improve reward earnings more than the Random+ strategy.
Plot of average number of adaptations before reaching consensus for each method and environment
We observe that the RL strategies reach consensus with fewer adaptations than the Random+ strategy.
We also observe that incorporating interaction-specific information into the RL state (the RL A3C method) does not accelerate convergence compared to the other RL methods (Softmax and Thompson).
Reading data from logs_gen*.jsonl files
Each line corresponds to an adaptation. The format is the following:
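Since the exact line format is not reproduced here, the sketch below only assumes one JSON object per line, with a hypothetical `operator` field naming the adaptation operator applied:

```python
import json
from collections import Counter

def count_operators(jsonl_path):
    """Count how often each adaptation operator appears in a
    logs_gen*.jsonl file (one adaptation per line).
    The 'operator' field name is an assumption about the log schema."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            counts[json.loads(line)["operator"]] += 1
    return counts
```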
Plot results
The following trees represent the different $d^*$ functions (environment object generation) used.