Experiment 20230505-MTOA

Experiment design

20230505-MTOA

Date: 2023-05-05 (Andreas Kalaitzakis)

Hypotheses

1) The more the decisions for different tasks rely on common properties, the more tackling additional tasks improves accuracy.

2) The more the decisions for different tasks rely on common properties, the higher the success rate.

Measures

Success rate evaluates the interoperability among agents. It is defined as the proportion of successful interactions over all interactions performed up to the $n^{th}$ interaction.
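The measure can be sketched in a few lines, assuming interaction outcomes are recorded as a list of booleans (an illustrative representation, not the experiment's actual data structure):

```python
# Sketch of the success-rate measure: proportion of successful interactions
# among the first n interactions (outcome encoding is an assumption).

def success_rate(outcomes, n):
    """Proportion of successful interactions among the first n interactions."""
    return sum(outcomes[:n]) / n

# 3 successful interactions out of the first 4
print(success_rate([True, False, True, True], 4))  # 0.75
```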

Task accuracy evaluates the quality of agent ontologies. It adapts the accuracy measure introduced in \cite{Bourahla2021a} to multiple tasks. It is defined as the proportion of object types for which a correct decision would be taken with respect to a task $ t $ by an agent $ \alpha $ at the $ n^{th} $ iteration of the experiment. Task accuracy is used to measure the average, worst, and best task accuracy of agents.

\begin{align*} tacc(\alpha,n,t) = \frac{\vert\{o \in \mathcal{I} : h_n^\alpha(o,t) = h^*(o,t) \}\vert}{\vert \mathcal{I} \vert} \end{align*}
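A minimal sketch of $tacc$, where the agent hypothesis $h_n^\alpha$ and the reference decision $h^*$ are given as callables mapping an object type and a task to a decision (the names and the toy decision functions below are illustrative assumptions):

```python
# Sketch of tacc(alpha, n, t): proportion of object types o for which the
# agent's decision h(o, t) matches the correct decision h*(o, t).

def task_accuracy(h, h_star, objects, task):
    """Proportion of object types o with h(o, task) == h_star(o, task)."""
    return sum(1 for o in objects if h(o, task) == h_star(o, task)) / len(objects)

# Toy example: the agent is correct on 3 of 4 object types for task "t"
h_star = lambda o, t: o % 2                   # ground-truth decision
h = lambda o, t: o % 2 if o < 3 else 0        # agent decision, wrong on o=3
print(task_accuracy(h, h_star, [0, 1, 2, 3], "t"))  # 0.75
```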

Experimental setting

The experiment is executed under 6 setups. Each setup is run 20 times and its results are averaged. One run consists of 80000 interactions, each taking place between two agents randomly selected out of a population of 18 agents. Their environment contains 64 different object types, each perceivable through 6 binary properties. The agents are initially trained with respect to all $|\mathcal{T}|=3$ tasks. Deciding with respect to each task relies on 2 of the 6 perceivable binary properties; these properties are either the same for all tasks or different for each task. Agents induce an initial ontology from a random 10\% of all existing labeled examples. Each agent is assigned 1 to 3 tasks ($|\mathcal{T}_{ass}| \in \{1,2,3\}$). For each task, 4 different decisions exist. Between two consecutive interactions, the environment attributes a score to each agent, calculated over 60\% of all samples.
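The object space described above can be sketched as follows: 64 object types encoded as vectors of 6 binary properties, with each task's decision determined by 2 of them, which yields the 4 possible decisions per task (the encoding and the `relevant` indices are illustrative assumptions, not the experiment's implementation):

```python
from itertools import product

# Sketch of the environment: 2^6 = 64 object types, each a tuple of 6
# binary properties; a task decision combines 2 task-relevant properties.

object_types = list(product([0, 1], repeat=6))  # 64 object types

def decision(obj, relevant=(0, 1)):
    """Combine the two task-relevant binary properties into one of 4 decisions."""
    return 2 * obj[relevant[0]] + obj[relevant[1]]

print(len(object_types))                         # 64
print(len({decision(o) for o in object_types}))  # 4
```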

Variables

independent variables: ['maxAdaptingRank']

dependent variables: ['avg_accuracy', 'avg_max_accuracy', 'success_rate']

Experiment

Date: 2023-04-01 (Andreas Kalaitzakis)

Computer: Dell Precision-5540 (CC: 12 * Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz with 16GB RAM OS: Linux 5.4.0-92-generic)

Duration: 720 minutes

Lazy lavender hash: ceb1c5d1ca8109373d293b687fc55953fce5241d

Parameter file: params.sh

Executed command (script.sh):

#!/bin/bash

. params.sh

CURRDIR=$(pwd)
OUTPUT=${CURRDIR}/${DIRPREF}
# cd ${LLPATH}
cd lazylav
# this sample runs ExperimentalPlan. It can be replaced with Monitor if parameters are not varied.
bash scripts/runexp.sh -p ${CURRDIR} -d ${DIRPREF} java -Dlog.level=INFO -cp ${JPATH} fr.inria.exmo.lazylavender.engine.ExperimentalPlan -Dexperiment=fr.inria.exmo.lazylavender.decisiontaking.multitask.SelectiveAcceptanceSpecializationExperiment ${OPT} -DresultDir=${OUTPUT}

Analysis

Raw data

Full results can be found at:

Zenodo DOI

Table 1: Final success rate values

Table 1 consists of the final achieved average success rate values, i.e., the average success rate after the last iteration. Each column corresponds to a different number of adapting tasks, while each row corresponds to a different run, for the same size of scope.

Out[7]:
sh_0 sh_1 sh_2 ind_0 ind_1 ind_2
0 0.834462 0.876687 0.554875 0.567388 0.399813 0.414062
1 0.740500 0.878650 0.508925 0.520713 0.472037 0.400987
2 0.754413 0.518100 0.686063 0.600625 0.438650 0.379437
3 0.619888 0.651950 0.849100 0.771500 0.439375 0.384700
4 0.966013 0.590538 0.593125 0.755687 0.417525 0.388238
5 0.794913 0.645938 0.602638 0.546937 0.461900 0.413212
6 0.679550 0.727775 0.788475 0.786062 0.368463 0.375287
7 0.776012 0.692937 0.601375 0.535563 0.396600 0.363762
8 0.871387 0.850237 0.852325 0.689750 0.469075 0.486350
9 0.901400 0.780375 0.712163 0.704412 0.408963 0.379850
10 0.784575 0.808063 0.866825 0.685837 0.447163 0.448437
11 0.811738 0.793500 0.802063 0.679288 0.424700 0.405162
12 0.907975 0.812763 0.708825 0.786937 0.406175 0.403388
13 0.937662 0.655050 0.638563 0.723250 0.390600 0.348750
14 0.959287 0.662900 0.830438 0.700762 0.399713 0.438337
15 0.721450 0.803225 0.850075 0.542013 0.482075 0.441575
16 0.807325 0.707050 0.604575 0.558250 0.469450 0.422237
17 0.836925 0.911587 0.814462 0.477237 0.465938 0.389525
18 0.785613 0.514525 0.712013 0.589575 0.378325 0.439175
19 0.927063 0.802037 0.561063 0.624912 0.514150 0.386775

Table 2: Final average worst task accuracy values

Table 2 consists of the final average minimum accuracy values with respect to the worst task, i.e., the accuracy after the last iteration on the task for which the agent scores the lowest accuracy. Each column corresponds to a different number of undertaken tasks, while each row corresponds to a different run, for the same size of scope.

Out[8]:
sh_0 sh_1 sh_2 ind_0 ind_1 ind_2
0 0.362847 0.425347 0.687500 0.283854 0.315104 0.366319
1 0.336806 0.435764 0.572917 0.277778 0.314236 0.348090
2 0.360243 0.426215 0.814236 0.279514 0.292535 0.335938
3 0.309028 0.436632 0.921007 0.244792 0.289931 0.325521
4 0.392361 0.383681 0.736979 0.258681 0.296875 0.373264
5 0.329861 0.407986 0.724826 0.265625 0.277778 0.316840
6 0.381944 0.411458 0.890625 0.270833 0.282118 0.324653
7 0.359375 0.407986 0.752604 0.315104 0.271701 0.315972
8 0.331597 0.460069 0.927083 0.252604 0.269097 0.335938
9 0.355903 0.440972 0.837674 0.267361 0.284722 0.325521
10 0.376736 0.447917 0.947049 0.259549 0.263021 0.352431
11 0.376736 0.444444 0.875868 0.258681 0.294271 0.337674
12 0.432292 0.425347 0.812500 0.258681 0.276042 0.302951
13 0.411458 0.407986 0.780382 0.282118 0.269097 0.353299
14 0.362847 0.380208 0.912326 0.295139 0.278646 0.343750
15 0.295139 0.491319 0.920139 0.281250 0.276910 0.336806
16 0.288194 0.399306 0.718750 0.262153 0.302083 0.353299
17 0.343750 0.428819 0.904514 0.253472 0.262153 0.322049
18 0.401042 0.403646 0.829861 0.251736 0.288194 0.349826
19 0.387153 0.453125 0.674479 0.298611 0.293403 0.336806

Table 3: Final average accuracy values

Table 3 consists of the final achieved average ontology accuracy with respect to all tasks, i.e., the accuracy after the last iteration averaged on all tasks and agents. Each column corresponds to a different number of adapting tasks, while each row corresponds to a different run, for the same size of scope.

Out[9]:
sh_0 sh_1 sh_2 ind_0 ind_1 ind_2
0 0.600984 0.769676 0.734086 0.459491 0.451100 0.522569
1 0.487558 0.776042 0.655093 0.437789 0.476273 0.495660
2 0.559317 0.588542 0.841725 0.460069 0.465278 0.483507
3 0.455729 0.691840 0.928819 0.462095 0.428819 0.453125
4 0.615741 0.623843 0.774016 0.450231 0.478588 0.513310
5 0.570891 0.676505 0.774306 0.471065 0.480035 0.500289
6 0.505498 0.712095 0.905671 0.438079 0.443576 0.497106
7 0.544560 0.696759 0.779803 0.466435 0.438657 0.463252
8 0.546296 0.776042 0.933449 0.451968 0.428530 0.533565
9 0.591146 0.744213 0.856771 0.486979 0.435475 0.450521
10 0.537616 0.753762 0.950521 0.454282 0.460359 0.525174
11 0.567708 0.748264 0.909433 0.466725 0.425637 0.505787
12 0.639178 0.750289 0.851273 0.476273 0.411748 0.508391
13 0.624421 0.680845 0.807870 0.412905 0.410012 0.491898
14 0.544560 0.605035 0.927083 0.495949 0.449363 0.492187
15 0.451389 0.765625 0.934028 0.414931 0.453704 0.528646
16 0.430266 0.700231 0.774595 0.464410 0.479456 0.483796
17 0.533854 0.785301 0.916088 0.400174 0.458623 0.477141
18 0.562789 0.594907 0.855035 0.448495 0.447338 0.554688
19 0.588252 0.756944 0.739294 0.452836 0.545718 0.466725

Table 4: Final average best task accuracy values

Table 4 consists of the final average best task accuracy values with respect to the best task, i.e., the accuracy after the last iteration on the task for which the agent scores the highest accuracy. Each column corresponds to a different number of undertaken tasks, while each row corresponds to a different run, for the same size of scope.

Out[10]:
sh_0 sh_1 sh_2 ind_0 ind_1 ind_2
0 0.921007 0.945312 0.769965 0.684028 0.615451 0.728299
1 0.683160 0.949653 0.731771 0.636285 0.654514 0.644965
2 0.876736 0.710938 0.861111 0.666667 0.656250 0.651042
3 0.679688 0.835938 0.934028 0.771701 0.592882 0.594618
4 0.906250 0.773438 0.806424 0.693576 0.687500 0.684896
5 0.889757 0.822917 0.822917 0.718750 0.688368 0.698785
6 0.683160 0.881944 0.915799 0.681424 0.656250 0.680556
7 0.809028 0.848090 0.809028 0.652778 0.649306 0.610243
8 0.862847 0.937500 0.937500 0.683160 0.598090 0.788194
9 0.926215 0.908854 0.879340 0.794271 0.626736 0.606771
10 0.767361 0.917535 0.953993 0.745660 0.681424 0.715278
11 0.822917 0.910590 0.934028 0.726562 0.565972 0.706597
12 0.960938 0.923611 0.883681 0.793403 0.577257 0.744792
13 0.888889 0.841146 0.834201 0.575521 0.580729 0.646701
14 0.781250 0.780382 0.940104 0.763889 0.635417 0.657986
15 0.644965 0.907118 0.944444 0.568576 0.653646 0.730903
16 0.655382 0.873264 0.822917 0.703993 0.728299 0.637153
17 0.809896 0.964410 0.928819 0.592882 0.680556 0.651042
18 0.804688 0.718750 0.881944 0.702257 0.593750 0.779514
19 0.867188 0.917535 0.802083 0.685764 0.822917 0.621528

Figures

Analysis of variance (ANOVA)

We perform one-way ANOVA, testing if the independent variable 'maxAdaptingRank' has a statistically significant effect on different dependent variables.
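The F values reported below were presumably obtained with a standard routine such as `scipy.stats.f_oneway` (an assumption); this plain-Python sketch shows how the one-way ANOVA F statistic is computed from k groups of run results (one group per setup column):

```python
# One-way ANOVA F statistic: ratio of between-group variance to
# within-group variance, for k groups of observations.

def f_statistic(*groups):
    """F = (SSB / (k - 1)) / (SSW / (n - k)) for k groups of n total points."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

print(f_statistic([1, 2, 3], [2, 3, 4]))  # 1.5
```

Each comparison below passes two or three columns of a results table (e.g. `sh_1` versus `ind_1`) to such a routine.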

One-way ANOVA on Table 1: Effect on final success rate

Effect of number of tasks

One-way Anova on final success rate values (independent properties)
F : 83.00705334166891
p : 1.3030522700631479e-17
One-way Anova on final success rate values (shared properties)
F : 5.857285497191952
p : 0.004858939185702237

Effect of number of common properties

1 task shared properties - 1 task independent properties
F : 34.68099773368894
p : 8.082376440143561e-07
2 task shared properties - 2 task independent properties
F : 120.84359132672233
p : 2.305024295736919e-13
3 task shared properties - 3 task independent properties
F : 121.14093772628819
p : 2.2239445624389512e-13

One-way ANOVA on Table 3: Effect on final average accuracy

Effect of number of tasks

One-way Anova on final average accuracy values (independent properties)
F : 17.01814564027129
p : 1.6025755772920014e-06
One-way Anova on final average accuracy values (shared properties)
F : 89.86202643303236
p : 2.3797171457264745e-18

Effect of number of common properties

1 task shared properties - 1 task independent properties
F : 45.287628035911744
p : 5.7233961278756755e-08
2 tasks shared properties - 2 tasks independent properties
F : 379.7536311684798
p : 2.224458642910822e-21
3 tasks shared properties - 3 tasks independent properties
F : 28.706198579708552
p : 4.321438278916253e-06

One-way ANOVA on Table 4: Effect on final average best task accuracy values

Effect of number of tasks

One-way Anova on final average best task accuracy values (independent properties)
F : 2.869076973389113
p : 0.06497988276010566
One-way Anova on final average best task accuracy values (shared properties)
F : 3.292035781134944
p : 0.044361463114824876

Effect of number of common properties

1 task shared properties - 1 task independent properties
F : 20.610189367523958
p : 5.525127164232047e-05
2 task shared properties - 2 task independent properties
F : 105.17371121385158
p : 1.6858266297022713e-12
3 task shared properties - 3 task independent properties
F : 96.09162977593141
p : 5.9239089297919325e-12

RESULTS DISCUSSION

The presented figure depicts the evolution of the agents' (a) average accuracy, (b) accuracy on their best task, and (c) success rate, for different numbers of tasks and common properties.

(a) shows that assigning more tasks to agents significantly improves their average accuracy. This improvement is higher when agents tackle tasks that rely on the same properties. On the one hand, when tasks rely on different properties, agents tackling 3 tasks are 9\% more accurate than agents tackling 1 task. On the other hand, when tasks rely on common properties, agents tackling 3 tasks are 55\% more accurate than agents tackling 1 task. This shows that when tasks rely on common properties, knowledge is transferable from one task to another. Put differently, agents tackling tasks that rely on a common set of properties may improve their accuracy on one task by carrying out another. These results support our hypothesis.

(b) shows two things. First, when tasks rely on different properties, the number of tasks does not affect the agents' accuracy on their best task. This indicates that when tasks rely on different properties, learning to decide with respect to one task is not related to learning to decide with respect to a different task. Second, when agents tackle tasks that rely on common properties, tackling additional tasks improves their accuracy on their best task. Finally, results show that even agents tackling only 1 task benefit from tasks that rely on common properties. This indicates that while agents abstain from all tasks not assigned to them, their ontologies contain general-purpose knowledge acquired during the initial ontology induction phase. These results agree with subfigure (a), further supporting our hypothesis.

(c) shows that tackling fewer tasks, or tackling tasks that rely on common properties, improves the success rate. This is due to two reasons. First, the fewer the assigned tasks, the fewer the decisions over which agents need to agree. Second, the more tasks rely on common properties, the less irrelevant knowledge is present in an agent's initially induced ontology. Furthermore, while the success rate improves over the course of the experiment, it does not converge to 1. This indicates that the final ontologies do not allow agents to reach consensus. This can be explained by the limitation of resources: agents may lack the resources required to learn to decide accurately for all assigned tasks and objects. As a result, at a given time they decide accurately for different subsets of the existing object types. The latter is true even when agents interact over a single task.

ANOVA

Analysis of variance shows that the number of common properties among different tasks has a statistically significant impact (p $\leq$ 0.01) on all measures. The number of assigned tasks has a statistically significant impact on (1) the success rate and (2) the average accuracy.

CONCLUSIONS

Based on the results, two conclusions are drawn. First, when agents tackle additional tasks relying on common properties, they may transfer knowledge from one task to another. Second, when agents tackle additional tasks relying on different properties, the number of assigned tasks does not affect their accuracy on their best task.