Emergent Behaviour, Cooperation, And Adversarial Environments In Multi-Agent Systems

Negoiţă D. D. Felix
11 min read · Aug 3, 2021

1. Introduction

Multi-Agent Systems (MAS) are a transdisciplinary research topic that has been closely linked to Game Theory since the ’80s and ’90s, when the latter became extremely popular. The field thus offers unique opportunities to study emergent behaviour, particularly in relation to cooperation and competitive environments, which is the topic I propose for this theoretical report. Not long ago, scientists studied the spread of infectious diseases by simulating real-world epidemic scenarios in the popular multi-player game “World of Warcraft”. Research in economics, video game development, autonomous driving, and the coordination of robots in heavy industrial processes all relies on the study of emergent behaviour in MAS. We therefore believe that the potential for applying such research is immense.

Although it is difficult to say how much or in what manner, it is clear that the social sciences may also benefit from the opportunity to study behaviour in simulated environments. In the following sub-sections we consider, in a more-or-less decoupled way, several vital aspects of cooperation in multi-agent systems drawn from the most recent literature at the time of writing.

2. Overview Of Multi-Agent Reinforcement Learning (MARL)

Multi-agent intelligence can be viewed as the intersection of three great fields: reinforcement learning, game theory, and deep learning. Game theory offers solution concepts and a theoretical framework for describing learning outcomes, while deep learning is used mainly as a function-approximation tool. Reinforcement learning, in turn, provides the trial-and-error machinery through which algorithms converge and reach stable equilibria. The cooperation aspect that is the focus of the present meta-analysis is studied within, or as a result of, the MARL paradigm.

Unlike “traditional” Machine Learning (ML), the system must not only extract knowledge, but also make decisions based on it. Agents need to consider the consequences of their actions, projected into the future, and, in order to be viable in real-world applications, learn strategies for interacting with other agents. This necessity leads to the paradigm of MARL, which has dominated the literature (and with good reason, since it has produced incredible results) ever since AlphaGo. MARL has been used primarily to play challenging video games; however, the authors of [1] report emergent behaviour that can be understood and interpreted by humans.

The Reinforcement Learning (RL)/Q-learning premise is that rationality[1] can be modelled as the maximization of an expected value, and this is what makes the agents intelligent. A regular Markov Decision Process (MDP) that involves multiple players thus turns into a Stochastic Game (SG). For example, at a given time t and environment state s_t, agent i executes its action a_t^i simultaneously with all other agents, resulting in the joint action

a_t = (a_t^1, a_t^2, …, a_t^N),

which leads to the next environment state. The transition to state s_{t+1} determines a reward R^i(s_t, a_t, s_{t+1}) for each agent. It is not an over-simplification to say that MARL amounts to solving an SG (a minimal sketch of this interaction loop is given after the challenge list below). However, during this process a cumulative reward, not an individual one, must be maximized, which means that agents must take the actions of other agents into account. Furthermore, compared to single-agent RL, MARL presents additional challenges, the most relevant of which are:

- Combinatorial complexity, which could theoretically be addressed by making various assumptions to simplify the reward function

- Goals are multi-dimensional

- The non-stationarity issue: an agent cannot tell whether a state transition is the result of its own action or of the actions of other agents

- Scalability
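To make the SG formalism concrete, the following is a minimal, self-contained sketch of the interaction loop described above: all agents act simultaneously, the environment transitions, and each agent receives its own reward. The environment (`SimpleGridGame`) and the random policy are purely illustrative assumptions, not taken from any of the cited papers.

```python
import random

class SimpleGridGame:
    """Toy stochastic game: N agents move along a 1-D line and are
    rewarded for reaching cell 0. Purely illustrative."""

    def __init__(self, n_agents=3, size=10):
        self.n_agents = n_agents
        self.size = size
        self.state = None

    def reset(self):
        # One position per agent constitutes the environment state s_t.
        self.state = [random.randrange(self.size) for _ in range(self.n_agents)]
        return self.state

    def step(self, joint_action):
        # joint_action a_t = (a_t^1, ..., a_t^N), each component in {-1, 0, +1}.
        next_state = [
            min(max(pos + move, 0), self.size - 1)
            for pos, move in zip(self.state, joint_action)
        ]
        # Per-agent reward R^i(s_t, a_t, s_{t+1}): +1 when agent i reaches cell 0.
        rewards = [1.0 if pos == 0 else 0.0 for pos in next_state]
        self.state = next_state
        return next_state, rewards


def random_policy(_observation):
    return random.choice([-1, 0, 1])


env = SimpleGridGame()
state = env.reset()
for t in range(5):
    joint_action = [random_policy(state) for _ in range(env.n_agents)]
    state, rewards = env.step(joint_action)
    print(t, joint_action, rewards)
```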

Since the incentive to reduce complexity is extremely high, novel approaches and assumptions keep being proposed in the literature for dealing with some of the challenges mentioned above. For example, there are algorithms in which each agent chooses its actions according to its beliefs about the other agents, echoing the game-theoretic notion of fictitious play. Failing to account for other agents can, moreover, lead to suboptimal results. We believe this area represents an incredible opportunity for further trans-disciplinary research.
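To illustrate the belief-based idea, here is a small fictitious-play sketch for a two-player common-interest matrix game: each player keeps empirical counts of the other's past actions and best-responds to that belief. The game and the payoffs are illustrative assumptions, not an algorithm from the cited literature.

```python
import numpy as np

# Payoff matrix for player 0 in a 2x2 coordination game; player 1 receives the
# same payoffs (a purely cooperative game, cf. footnote [2]).
PAYOFF = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

counts = [np.ones(2), np.ones(2)]  # each player's belief counts about the *other* player

for step in range(200):
    beliefs = [c / c.sum() for c in counts]
    # Each player best-responds to its belief about the opponent's mixed strategy.
    a0 = int(np.argmax(PAYOFF @ beliefs[0]))    # player 0 vs. its belief about player 1
    a1 = int(np.argmax(PAYOFF.T @ beliefs[1]))  # player 1 vs. its belief about player 0
    counts[0][a1] += 1   # player 0 observes player 1's action
    counts[1][a0] += 1   # player 1 observes player 0's action

print("empirical strategies:", counts[0] / counts[0].sum(), counts[1] / counts[1].sum())
```

Under these assumptions the two players quickly lock onto one of the coordination equilibria, which is exactly the kind of stable outcome belief-based learning is meant to reach.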

3. Aspects Of Cooperative[2] AI

It is no secret that the social sciences consider our success as a species a by-product of our ability to cooperate at scale. This idea is now largely accepted and has been popularized by bestselling non-fiction books such as Yuval Harari’s “Sapiens”. Also widely accepted, at least at the time of writing, is that AI agents will play an increasingly important role in human lives. It is thus more relevant than ever to equip them with the capability to cooperate, which, from a Machine Learning engineer’s or researcher’s standpoint, translates to developing training environments with tasks that can only be completed using cooperative abilities. We should note that “cooperation” does not necessarily mean only cooperation with other agents (although this is the focus of MARL), but also cooperation with humans and institutions (complex, non-AI agents) in our current, complex global climate.

According to [2], MARL seems to be reorienting right now towards games of pure common interest, after achieving great success in past years with two-player zero-sum games and two-team zero-sum games (Dota 2, variations of Capture The Flag, etc.). This shift of attention is pushing researchers to investigate social dilemmas much more deeply and to collaborate with the other sciences of cooperation (psychology, political theory, etc.).

[2] identifies and details three main cooperative capabilities which we will also present and briefly discuss below.

3.1 Understanding

Although, as humans, we intuitively grasp the meaning of the concept, for the rigorous research purposes of [2] it is defined as a combination of 1) predicting the consequences of one’s own actions, 2) predicting another’s behaviour, and 3) modifying one’s behaviour based on another’s preferences. By 1) what is meant is having an understanding of the world/environment, and this is essentially what single-agent RL reduces to. Needless to say, 2), the ability to predict the behaviour of other agents, is critical to achieving desired results; in psychology, this is termed having a “theory of mind”. Cooperation would be greatly helped by access to another agent’s private information about the state of the world/environment, although that access also introduces problems, leading us to the second pillar of cooperative capabilities.

3.2 Communication

Communication is critical to gaining insight into another’s behaviour. The alternative is a painful process of deciding what to trust and basing one’s knowledge of another agent’s intentions on the interpretation of so-called “costly signals”.[3] Even so, communication inevitably leads to the need to solve interest-related trust problems. It has to be based on common ground: messages need to be understood so that they can lead to actions. The study of how agents establish such common ground is also referred to as emergent communication.

Research in the social sciences shows that communication also raises problems of commitment, which can cause cooperation to fail. The canonical Prisoner’s Dilemma can be re-framed as a commitment problem, since no extra knowledge or understanding can help: everything is completely known. Much of what human societies do, from commerce to politics, amounts to solving commitment problems. We do it by using commitment devices, such as penalties. Within MAS, much as in real-life interactions between humans, reputation-like commitment devices can be created by taking games such as the Prisoner’s Dilemma and making them iterated (see the sketch below). Another possible solution is for agents to defer authority to a trusted third party. All of these are open problems and represent fertile ground for future research.
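As a toy illustration of how iteration acts as a commitment device, consider the following sketch of an iterated Prisoner’s Dilemma. The payoff values and the tit-for-tat strategy are standard textbook examples, not taken from [2].

```python
# Minimal iterated Prisoner's Dilemma sketch. 'C' = cooperate, 'D' = defect.
PAYOFF = {
    ('C', 'C'): (3, 3),   # mutual cooperation
    ('C', 'D'): (0, 5),   # being exploited vs. exploiting
    ('D', 'C'): (5, 0),
    ('D', 'D'): (1, 1),   # mutual defection
}

def tit_for_tat(history):
    """Cooperate first, then copy the opponent's previous move."""
    return 'C' if not history else history[-1]

def always_defect(history):
    return 'D'

def play(strategy_a, strategy_b, rounds=10):
    history_a, history_b = [], []   # what each player has seen the *other* do
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a)
        move_b = strategy_b(history_b)
        payoff_a, payoff_b = PAYOFF[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_b)
        history_b.append(move_a)
    return score_a, score_b

print(play(tit_for_tat, always_defect))   # repetition punishes defection
print(play(tit_for_tat, tit_for_tat))     # sustained cooperation
```

Against repetition, a retaliating strategy such as tit-for-tat makes defection unprofitable over the long run, which is precisely the reputation-like mechanism described above.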

3.3 Institutions

By “institutions”, [2] refers to meta-systems encompassing everything needed for cooperative behaviour to emerge among agents: systems of rules, patterns, conventions, social rewards and sanctions, roles, and the allocation of responsibility and power. Naturally, institutions vary in the extent to which they are emergent or designed. Although not an institution per se, at least not in the way those mentioned above are, we propose a look at research into common-pool resource appropriation to further explore cooperation and emergent behaviour.

4. Common Pool Resource (CPR) Appropriation

The theoretical prediction of [3] is that agents prefer to appropriate rather than show restraint. However, we must keep in mind that humans solve CPR problems constantly. The paper applies reinforcement learning to CPR appropriation, so that the incentives “felt” by the agents adjust over time. This added dimension results in the emergence of social outcomes through trial-and-error learning (over the long run).

In the single-agent case of common-pool resource appropriation, the agent can learn sustainable appropriation, that is, to avoid over-exploiting the resource. However, in any society, the interests of the individual can conflict with the interests of the group. For the multi-agent case, [3] defines special social metrics to measure the efficiency of the system as a whole: utility, sustainability, equality, and peace. The last one refers to the fact that, in the particular game devised by the authors, agents have the opportunity to tag one another out of the game for a set time.
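For a rough sense of how such population-level metrics can be computed from an episode’s per-agent returns, here is a simplified sketch. It only approximates the definitions in [3] (for example, equality is measured there via a Gini-style index and sustainability via the timing of resource collection), so treat the formulas below as illustrative assumptions.

```python
import numpy as np

def social_metrics(returns, tagged_steps, collection_times, episode_length, n_agents):
    """Rough analogues of the metrics discussed in [3] (simplified).

    returns          -- per-agent total reward for the episode
    tagged_steps     -- total agent time steps lost to being tagged out
    collection_times -- time steps at which resources were collected
    """
    returns = np.asarray(returns, dtype=float)

    utility = returns.sum() / n_agents                     # average return per agent

    # Equality as 1 minus the Gini coefficient of the returns.
    diffs = np.abs(returns[:, None] - returns[None, :]).sum()
    gini = diffs / (2 * n_agents * returns.sum() + 1e-8)
    equality = 1.0 - gini

    # Sustainability: average time step at which rewards were collected
    # (later collection means the stock was not exhausted early).
    sustainability = float(np.mean(collection_times)) if len(collection_times) else 0.0

    # Peace: fraction of agent-steps not lost to being tagged out.
    peace = 1.0 - tagged_steps / (n_agents * episode_length)

    return dict(utility=utility, equality=equality,
                sustainability=sustainability, peace=peace)

print(social_metrics(returns=[10, 12, 3, 0], tagged_steps=40,
                     collection_times=[5, 40, 200, 450],
                     episode_length=1000, n_agents=4))
```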

The authors of [3] identify three phases over the course of training in their game:

- The first phase, also termed the “naivety phase”, spans roughly the first 900 episodes and is characterized by agents acting randomly at first and then learning to move towards regions of greater resource density. They do not use the tagging option described above (so “peace” is high) and the CPR stock remains healthy.

- In the following phase, also called the “tragedy phase”, a catastrophic depletion of the CPR stock makes sustainability and utility decline.

- Lastly, after around episode 1500, a “maturity phase” kicks in: sustainability starts to recover, but agents tag one another, leading to a decline in peace. Since the effective population size is reduced, pressure on the CPR stock is relieved and utility increases as well. Tagging agents learn to harvest sustainably.

We believe the important contribution that [3] brings to the field of MARL is its highlighting of the emergent behaviour observed among agents. The first aspect is exclusion: agents who tag (thereby excluding others) end up protecting a smaller part of the game map and achieving better returns. The consequence of this, and the second emergent aspect, is inequality: when exclusion is made easier, it leads to much greater inequality.

5. Game Theory For MARL In Adversarial Environments

We have mentioned in the sub-sections above that many relationships within MAS can be represented using game theory. The present meta-analysis includes an overview of [4], an incredible paper which proposes a new model, termed Game-Theoretic Utility Tree (GUT): a network for achieving cooperative decision-making in MAS operating in adversarial environments.

Past work has focused on agents dealing with non-intentional adversaries, such as obstacles in the environment; little research has addressed how to deal with intentional adversaries. To counter the threats posed by such antagonists, cooperative decision-making among the agents is essential: the system needs to achieve optimal evasion strategies, task allocation, and so on. Without going into too much detail, GUT, which combines game theory, utility theory, probabilistic graphical modelling, and a tree structure, helps mitigate the threats posed by adversarial agents. The evaluation is done by letting GUT solve a game proposed by [4], entitled “Explorers and Monsters”, in which the agents need to find a treasure and collaborate to evade the deliberately malicious “monsters” hunting them down. GUT overcomes the challenges of this game by decomposing high-level strategies into lower-level executable actions.
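To convey the flavour of decomposing a high-level strategy through a tree of small games, here is a deliberately loose toy sketch. It is not the GUT algorithm from [4]: the node structure, the maximin choice rule, and the example payoffs are all assumptions made purely for illustration.

```python
import numpy as np

class GameNode:
    """Toy decision node: a small payoff matrix over (our option, adversary option),
    plus child nodes that refine each of our options. Illustrative only."""

    def __init__(self, name, payoffs, children=None):
        self.name = name
        self.payoffs = np.asarray(payoffs, dtype=float)  # rows: our options
        self.children = children or {}                   # row index -> child GameNode

    def choose(self):
        # Pessimistic (maximin) choice: assume the adversary minimizes our utility.
        worst_case = self.payoffs.min(axis=1)
        return int(np.argmax(worst_case))

def decompose(node):
    """Walk the tree, picking one branch per level: high-level strategy first,
    then progressively lower-level executables."""
    plan = []
    while node is not None:
        choice = node.choose()
        plan.append((node.name, choice))
        node = node.children.get(choice)
    return plan

low = GameNode("evade-direction", [[2, 1], [0, 3]])
root = GameNode("explore-or-evade", [[1, -2], [0, 1]], children={1: low})
print(decompose(root))   # [('explore-or-evade', 1), ('evade-direction', 0)]
```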

6. Delay Awareness

[5] makes the vital contribution of observing that action and observation delays exist in real-world systems. Although this may pose problems for learning, the paper proposes a framework for dealing with delay. Most deep reinforcement learning algorithms are designed for systems where the action-observation sequence is instantaneous. Real-world situations, however, introduce a delay problem, and, what is more, the delay in the action-observation sequence of one agent can spread to other agents.

The proposed solution works within the paradigm of “centralized training, decentralized execution” (CTDE), which addresses the problem of non-stationarity. Since agents need to cooperate, and unlike in single-agent RL where the Markov assumption holds, transitions and rewards depend on the joint action of agents, who must keep changing their policies to adapt to one another. According to [6], “each agent can enter an endless cycle of adapting to other agents”. In CTDE this is mitigated by giving learners access to the full state of the environment during training (global observations), while the learnt policies, at execution time, depend only on local observations, so they are applied in a decentralized fashion.
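A minimal sketch of the CTDE split, in the spirit of actor-critic methods with a centralized critic (e.g., MADDPG-style approaches) rather than the specific architecture of [5]: the critic sees the global state and the joint action during training, while each actor only ever consumes its local observation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps an agent's *local* observation to action logits."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, local_obs):
        return self.net(local_obs)

class CentralizedCritic(nn.Module):
    """Centralized critic: scores the *global* state together with the joint action.
    Used only during training, never at execution time."""
    def __init__(self, state_dim, joint_action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + joint_action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, global_state, joint_action):
        return self.net(torch.cat([global_state, joint_action], dim=-1))

# Training uses the critic with global information ...
critic = CentralizedCritic(state_dim=20, joint_action_dim=3 * 4)
value = critic(torch.zeros(1, 20), torch.zeros(1, 12))
# ... while execution only ever calls the actors on local observations.
actors = [Actor(obs_dim=8, n_actions=4) for _ in range(3)]
logits = actors[0](torch.zeros(1, 8))
```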

[5] defines a delay-aware Markov Game, where, essentially, the agents interact with the environment not directly, but through an action buffer.
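A crude sketch of that idea, assuming a fixed delay of `delay` steps per agent (the formal delay-aware Markov Game in [5] is richer than this simple wrapper):

```python
from collections import deque

class DelayedActionWrapper:
    """Wraps a multi-agent environment so that each agent's action only takes
    effect `delay` steps after it is issued. Illustrative sketch only."""

    def __init__(self, env, n_agents, delay=2, noop_action=0):
        self.env = env
        # One FIFO action buffer per agent, pre-filled with no-op actions.
        self.buffers = [deque([noop_action] * delay) for _ in range(n_agents)]

    def step(self, joint_action):
        executed = []
        for buffer, action in zip(self.buffers, joint_action):
            buffer.append(action)              # newly issued action enters the buffer
            executed.append(buffer.popleft())  # the action issued `delay` steps ago runs now
        # The underlying environment only ever sees the delayed joint action.
        return self.env.step(executed)
```

Here `env` can be any environment exposing a `step(joint_action)` method, such as the toy `SimpleGridGame` sketched in Section 2.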

7. Benchmarking With StarCraft II

There are not many benchmarks specifically designed for cooperative multi-agent reinforcement learning, and [7] proposes the StarCraft Multi-Agent Challenge (SMAC) to fill this gap. The underlying considerations are, firstly, that many real-world problems requiring reinforcement learning are multi-agent in nature and, secondly, that when researchers benchmark and evaluate their proposals, they tend to use environments tuned to their own algorithms, leading to a lack of standardization.

[7] uses the video game “StarCraft II” as an environment and focuses on decentralized micromanagement challenges. The main idea behind their approach is not to use the entire game as an RL environment, but rather to isolate scenarios that involve partial observability and require agents to learn to coordinate their behaviour. As such, each in-game “unit”[4] is controlled by an independent agent whose observations are limited to its local field of view. Groups of such agents must cooperate to solve the proposed challenges, which can only be overcome when the agents learn rich cooperative behaviour such as focusing fire or forming up. SMAC evaluates how well the independent agents learn to coordinate on these complex tasks.
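For context, a minimal interaction loop with the SMAC environment looks roughly like the following, adapted from the usage example in the SMAC repository; function names may differ slightly between releases, and the random placeholder policy is ours.

```python
import numpy as np
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="8m")        # an 8-Marines-vs-8-Marines scenario
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated = False
while not terminated:
    obs = env.get_obs()                   # one local observation per agent
    actions = []
    for agent_id in range(n_agents):
        avail = env.get_avail_agent_actions(agent_id)
        # Placeholder policy: pick a random available action for each agent.
        actions.append(np.random.choice(np.nonzero(avail)[0]))
    reward, terminated, info = env.step(actions)   # shared team reward
env.close()
```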

8. Conclusion

The trans-disciplinary potential of research into multi-agent systems is incredible. Considering that the present survey of the literature was written in the most fertile “AI summer” period, there has never been a time more suitable for modelling real-world problems using distributed systems and individual, intelligent agents. The sub-sections above have briefly presented contemporary trends, problems, and opportunities within the field. We hope that the modelling of cooperative behaviour in multi-agent systems will bring technology and AI much deeper into the fields of sociology, political theory, and philosophy and help us better understand how to approach some of the deepest issues facing humanity.

[1] An agent behaving properly within the system, towards achieving its goal.

[2] In MARL, in strict mathematical terms, “cooperative” simply means that the agents share a mutual reward function, while “competitive” means that the agents’ reward functions are opposed, as in zero-sum games.

[3] Although beyond the scope of this presentation, in the social sciences, a “costly signal” is an action that would, simply put, be too costly to fake. The assumption being that it therefore reveals information.

[4] StarCraft is a game of the Real-Time Strategy (RTS) sub-genre. In it, players build bases and control soldiers or vehicles, generally referred to as “units”, to try to destroy opposing armies and bases.

Bibliography

[1] Y. Yang and J. Wang, An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective, University College London, Huawei R&D U.K., 2020.

[2] A. Dafoe, E. Hughes, Y. Bachrach, T. Collins, K. R. McKee, J. Z. Leibo, K. Larson and T. Graepel, Open Problems in Cooperative AI, DeepMind, 2020.

[3] J. Perolat, J. Leibo, V. Zambaldi, C. Beattie, K. Tuyls and T. Graepel, A Multi-Agent Reinforcement Learning Model of Common-Pool Resource Appropriation, 2017.

[4] Q. Yang and R. Parasuraman, A Game-Theoretic Utility Network for Cooperative Multi-Agent Decisions in Adversarial Environments, University of Georgia, 2020.

[5] B. Chen, M. Xu, Z. Liu, L. Li and D. Zhao, Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments, 2020.

[6] G. Papoudakis, F. Christianos, A. Rahman and S. V. Albrecht, Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning, University of Edinburgh, 2019.

[7] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, J. Foerster and S. Whiteson, The StarCraft Multi-Agent Challenge, Russian-Armenian University, University of Oxford, Facebook AI Research, 2019.
