Cooperation: A Systematic Review of How to Enable Agents to Circumvent the Prisoner's Dilemma

It is widely accepted that rational individuals cannot establish cooperation in a prisoner's dilemma. In everyday life, however, cooperation is observed frequently, for example during a fishing moratorium, and it also emerges in numerous simulation studies of the prisoner's dilemma. This paper reviews 31 simulation studies, published between January 2017 and January 2023, in which agents can be observed to improve cooperation in a prisoner's dilemma. The proposed methodologies were sorted into seven categories: Bounded Rationality, Memory, Adaptive Strategy, Mood Model, Intrinsic Reward, Network Dynamics, and Altruistic Attribute. Based on their impacts, the effectiveness of these seven approaches was classified into three categories: generating cooperation, maintaining cooperation, and spreading cooperation. This review is expected to be helpful for scholars conducting future research on multi-agent cooperation and irrational-agent modeling.


Background and Rationale
The Prisoner's Dilemma is a game that has been analyzed extensively in game theory (Tucker & Straffin, 1983; Felkins, 2001; Chong, et al., 2007) [4,9,35]. It presents a dilemma for two entirely rational agents, who must decide whether to cooperate with their partner for the common good or to betray their partner for a personal reward (defect). The classic prisoner's dilemma depicts a scenario in which two rational prisoners, A and B, are held in solitary confinement and each faces an either/or choice: remain silent (Cooperate) or betray the other (Defect). This leads to four different outcomes:
1. If A and B both remain silent, they each serve the lesser charge of 2 years in prison.
2. If A betrays B but B remains silent, A is set free while B serves 10 years in prison.
3. If A remains silent but B betrays A, A serves 10 years in prison and B is set free.
4. If A and B both betray the other, they share the sentence and each serve 5 years.
Since both Prisoner A and Prisoner B avoid the worst outcome of 10 years of incarceration by choosing to "Defect", while retaining a chance of being released outright, they both end up selecting "Defect" and receive 5 years of imprisonment. However, the overall outcome would have been more favorable had they cooperated: a total of 4 years in prison (2 years each) is shorter than the 5 years each receives by defecting.
The definition of the prisoner's dilemma (Hofstadter, 1983) [16] can be standardized, as shown in Table 1, by defining the gain of mutual cooperation as Reward (R), the gain of mutual betrayal as Punishment (P), the gain of betraying alone as Temptation (T), and the gain of being betrayed alone as the Sucker's payoff (S). When the four gains satisfy the inequalities T > R > P > S and 2R > T + S, the game is a prisoner's dilemma. Many studies related to the prisoner's dilemma have utilized agent-based simulations (ABS) (Gotts, et al., 2003) [13], using an ABS model to examine possible social problems in the prisoner's dilemma and to suggest possible solutions.
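As a quick illustration, the standardized condition can be checked in a few lines; the payoff values below are the classic illustrative ones (T=5, R=3, P=1, S=0), not taken from any reviewed study.

```python
def is_prisoners_dilemma(T, R, P, S):
    """True if the payoffs satisfy the standard PD inequalities:
    T > R > P > S and 2R > T + S (mutual cooperation beats
    alternating exploitation)."""
    return T > R > P > S and 2 * R > T + S

# Classic illustrative payoffs qualify; swapping T and R does not.
print(is_prisoners_dilemma(T=5, R=3, P=1, S=0))  # True
print(is_prisoners_dilemma(T=3, R=5, P=1, S=0))  # False (R > T)
```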
In these studies, researchers have proposed several ways to prevent agents from falling into a prisoner's dilemma, the most famous of which is the tit-for-tat (TFT) strategy (Rapoport, 1989) [28], which has agents mimic the choice their opponent made in the previous round. It is widely believed that only repeated prisoner's dilemma games can cause participants to shift from focusing on the one-shot temptation T to focusing on the long-run mutual reward 2R (Hofstadter, 1983) [16], which leads to cooperation. Repeated prisoner's dilemma games come in various forms; this review mainly considers the "game → elimination → reproduction → game" model. In this model, all participants play the game, the participants with the lowest scores are then eliminated, the remaining participants are proportionally reproduced back to the original number, and the game continues. In this way, the strategies and overall gains of the last remaining participants can be observed to determine whether cooperation is generated.
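A minimal sketch of this "game → elimination → reproduction → game" loop follows. The strategy representation (a function of the opponent's history) and the deterministic refill rule are simplifications of my own, not any reviewed study's exact protocol; payoffs are the illustrative T=5, R=3, P=1, S=0.

```python
# 'C' = cooperate, 'D' = defect; illustrative payoffs from the row player's view.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def round_robin_scores(pop, rounds=10):
    """Total score of each strategy after a repeated game with every other one."""
    scores = [0] * len(pop)
    for i in range(len(pop)):
        for j in range(i + 1, len(pop)):
            hist_i, hist_j = [], []
            for _ in range(rounds):
                # Each strategy sees only the opponent's past moves.
                a, b = pop[i](hist_j), pop[j](hist_i)
                hist_i.append(a)
                hist_j.append(b)
                scores[i] += PAYOFF[(a, b)]
                scores[j] += PAYOFF[(b, a)]
    return scores

def evolve(pop, generations=5, cull=2):
    """Each generation: play, eliminate the `cull` lowest scorers, then
    refill from the top scorers (a deterministic simplification of
    proportional reproduction)."""
    for _ in range(generations):
        scores = round_robin_scores(pop)
        ranked = sorted(range(len(pop)), key=lambda k: scores[k], reverse=True)
        survivors = [pop[k] for k in ranked[:len(pop) - cull]]
        pop = survivors + survivors[:cull]
    return pop

alld = lambda opp: 'D'  # unconditional defection
allc = lambda opp: 'C'  # unconditional cooperation
final = evolve([allc, allc, allc, alld, alld, alld])
print(sum(s is alld for s in final))  # AllC is eliminated quickly: 6
```

In this sketch AllD exploits AllC, outscores it every generation, and takes over the whole population, exactly the failure mode the reviewed methods try to prevent.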
Firstly, the two most basic policies, AllD and AllC, need to be defined.
AllD refers to a participant's unconditional choice of "defect", and AllC to a participant's unconditional choice of "cooperate".
It is easy to see that AllD can gain a lot by exploiting AllC. In the "game → elimination → reproduction → game" model, AllC will therefore be eliminated quickly. Since the TFT strategy suppresses the returns of AllD and protects the returns of AllC, AllC is retained rather than eliminated in studies where the TFT strategy is used.
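These pairwise interactions can be illustrated with a toy repeated game (payoffs T=5, R=3, P=1, S=0 are the standard illustrative values, not from any particular study):

```python
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def match(strat_a, strat_b, rounds=10):
    """Score a repeated game between two strategies."""
    ha, hb, sa, sb = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hb), strat_b(ha)  # each sees the opponent's history
        ha.append(a)
        hb.append(b)
        sa += PAYOFF[(a, b)]
        sb += PAYOFF[(b, a)]
    return sa, sb

alld = lambda opp: 'D'
allc = lambda opp: 'C'
tft = lambda opp: opp[-1] if opp else 'C'  # cooperate first, then copy

print(match(alld, allc))  # (50, 0): AllD fully exploits AllC
print(match(tft, alld))   # (9, 14): TFT concedes only the first round
print(match(tft, allc))   # (30, 30): mutual cooperation is protected
```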
However, TFT still has many drawbacks, such as the lack of a fault-tolerance mechanism (Kopelman, 2020) [19], so more methods or better strategies need to be added to achieve more cooperation. This review will summarize these methods.
An agent that incorporates learning can learn from its own choices and gains and continually change its strategy. However, it is impossible to escape the prisoner's dilemma by learning only from immediate gains. In recent years, a plethora of methods have been proposed that enable agents to learn from group gains.
This review also summarizes such studies.

Purpose of the Review
Research on cooperation between multiple agents has focused on how to encourage rational agents to cooperate [18,20,44]. Simulations of the prisoner's dilemma show that cooperation can be observed under certain approaches. However, the absence of a unified framework consolidating these methods has led to numerous studies repeating the same work.
The purpose of this paper is to summarize the methods used in recent years to get agents out of the prisoner's dilemma and to categorize them so that subsequent researchers can more easily begin their work. Furthermore, it is anticipated that this review will contribute to further research on the modeling of irrational behavior.
The objective of this paper is to reach a general conclusion on how agents can evade the prisoner's dilemma by answering the following research questions:
1. What methods are used in these studies to circumvent the prisoner's dilemma?
2. Which environments are these methods applicable to?
3. How effective are these methods?

Search Strategy
The literature for this review was searched for the period January 2017 to January 2023. The subject terms had to include one of variable-sum, non-constant-sum, or Prisoner's Dilemma; the abstract had to mention one of agents, Artificial Intelligence, or AI; and the abstract also had to mention one of Cooperation, Cooperate, or Win-Win.

Inclusion/Exclusion Criteria
There were five inclusion criteria for the corresponding studies in this review, see Table 3.
First, the study must be related to the prisoner's dilemma; research that does not meet the definition of the prisoner's dilemma is not included.
Second, the study must be agent-based research; primary work examining humans or other organisms was excluded.
Other studies, such as those analyzing the effects of the prisoner's dilemma or the social problems that may result from particular strategies, were also removed.
Finally, all review articles and articles written in languages other than English were excluded.

Quality Assessment
Based on the inclusion/exclusion criteria in Section 2.2, 31 studies were finally screened for this paper. The specific screening process is shown in Figure 1.

What methods are used in these studies to circumvent the prisoner's dilemma?
Bounded rationality
If rationality is what drives agents into a prisoner's dilemma, then limited rationality can help to alleviate it.
A low rationality level implies that it is possible to choose a random strategy instead of the strategy with the highest payoff (Xu, et al., 2018, May) [43] .
The Tit for Tat (TFT) strategy is a classic approach in which an agent imitates the opponent's decision from the previous round, after cooperating in the initial round. The Extended Expectation-of-Cooperation (EEoC) strategy, by contrast, involves cooperating multiple times after encountering cooperation, regardless of whether the partner changes.

Memory
Among the 31 studies, 6 [10,15,23,34,38,39] used the Memory method. With memory, agents can be less concerned with immediate interests. Agents in the population update their strategies not from all previous interactions but from a limited length of history records (Wang, et al., 2017) [39].
Alternatively, an agent can be allowed to remember the results of previous games between itself and its opponent in order to choose its move in the current game (Tao, et al., 2022, February) [34], or to decide whether to break the link with an agent that is less cooperative (Fernández-Domingos, et al., 2017) [10].
Memory here can also mean that an agent knows the choices made by other agents in the past few rounds (whether or not those games were played with itself) and decides its own strategy accordingly (Lotfi & Rodrigues, 2022; Wang, et al., 2016, August; Heller & Mohlin, 2018) [15,23,38].
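A minimal sketch of a memory-based player follows. The decision rule (defect only while a defection is still remembered) is a hypothetical illustration of bounded memory, not the rule of any single reviewed study.

```python
from collections import deque

class MemoryAgent:
    """Cooperate unless the opponent defected within the last
    `memory_length` remembered rounds (hypothetical rule)."""

    def __init__(self, memory_length=3):
        # Bounded history: the oldest rounds fall out automatically.
        self.memory = deque(maxlen=memory_length)

    def remember(self, opponent_move):
        self.memory.append(opponent_move)

    def choose(self):
        return 'D' if 'D' in self.memory else 'C'

agent = MemoryAgent(memory_length=3)
for move in ['C', 'D', 'C', 'C', 'C']:
    agent.remember(move)
print(agent.choose())  # the defection has aged out of memory: C
```

Varying `memory_length` in such a sketch is one way to probe the "optimal memory length" effects reported later in this review.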

Adaptive strategy
Among the 31 studies, 6 [22,30,31,36,42,45] used the Adaptive strategy method. In contrast to the human-set strategies of Bounded rationality, agents can learn and adapt their strategies through reinforcement learning (Xue, et al., 2017) [45], or adjust their strategy update rates according to the environment they are in (Shang, et al., 2021, July) [31]. For example, if an agent is surrounded by many cooperators, it will lower its strategy update rate to maintain a cooperative state.
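The environment-dependent update rate can be sketched as below. The linear rule and the base rate are my own illustrative assumptions (after the idea in Shang et al., not their exact formula): the more cooperating neighbors, the lower the probability of revising one's strategy.

```python
def strategy_update_rate(neighbor_moves, base_rate=0.5):
    """Probability of revising the strategy this round, decreasing
    linearly with the fraction of cooperating neighbors (hypothetical)."""
    if not neighbor_moves:
        return base_rate
    coop_fraction = neighbor_moves.count('C') / len(neighbor_moves)
    return base_rate * (1.0 - coop_fraction)

print(strategy_update_rate(['C', 'C', 'C', 'C']))  # 0.0: keep cooperating
print(strategy_update_rate(['D', 'D', 'C', 'D']))  # 0.375: likely to revise
```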
Mood model
Unlike Bounded rationality, under the Mood model the degree of rationality of an agent is determined by an emotion value.
An agent has an expectation of payoff; if the actual payoff is lower than that expectation, its mood value decreases, and a lower mood value means more rationality (Collenette, et al., 2017, September) [5] and more risk seeking (Zeng, et al., 2017; Zeng & Li, 2020) [47,48], and vice versa.
When the mood is high, the agent cooperates with new agents, and when the mood is very high, the agent always cooperates (Collenette, et al., 2019, July) [6].
Under social comparison, the expectation of payoff depends on the average payoff of the agent's neighbors or community [5,6,47,48].
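A minimal mood update consistent with this description can be sketched as follows; the step size, the [0, 1] range, and the linear rule are hypothetical parameters of mine, not taken from the cited studies.

```python
def update_mood(mood, payoff, expectation, step=0.1):
    """Mood falls when the payoff misses the expectation and rises when
    it exceeds it, clipped to [0, 1] (signs follow the review's text;
    step size is a hypothetical parameter)."""
    if payoff < expectation:
        mood -= step
    elif payoff > expectation:
        mood += step
    return max(0.0, min(1.0, mood))

def expectation_social(neighbor_payoffs):
    """Social comparison: expect the neighborhood's average payoff."""
    return sum(neighbor_payoffs) / len(neighbor_payoffs)

mood = 0.5
mood = update_mood(mood, payoff=1, expectation=expectation_social([3, 3, 1]))
print(round(mood, 2))  # exploited below the neighborhood average: 0.4
```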
Intrinsic reward
An intrinsic reward is an additional reward that is independent of the game but is involved in the judgment of game choices. While this may seem no different from directly modifying the game's payoff table, these rewards tend to be based on sociological research that confers some known psychological need for satisfaction on agents.
Using adherence as an intrinsic reward enables agents to consider the collective, thus promoting cooperation (Yuan, et al., 2022) [46] .
Assume that each agent has an internal standard, a behavior it considers to be right. If its behavior is consistent with the internal standard, it receives an additional reward; if not, the reward is deducted (Wu, et al., 2017) [40].
Using social payoff as an intrinsic reward enables agents to consider the collective (Fan, et al., 2022) [7] .
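The internal-standard idea amounts to reward shaping, which can be sketched as below; the symmetric bonus and its size are hypothetical, not Wu et al.'s exact values.

```python
def shaped_payoff(game_payoff, action, internal_standard, bonus=1.0):
    """Add a bonus when the agent's action matches the behavior it
    considers right, subtract it otherwise (bonus size hypothetical)."""
    return game_payoff + (bonus if action == internal_standard else -bonus)

# With illustrative payoffs (T=5, S=0) and a standard of 'C':
# being exploited is now worth 1.0, and defecting is worth less than raw T.
print(shaped_payoff(0, 'C', internal_standard='C'))  # 1.0
print(shaped_payoff(5, 'D', internal_standard='C'))  # 4.0
```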
Network dynamics
In the Network dynamics approach, agents are allowed to change their game partners in different ways, so this approach can also be regarded as "partner selection".
Agents can choose to break the link with an uncooperative partner (i.e., refuse to play) and create a new link (Takesue, 2018; Takesue, 2021) [32,33], or move to an empty node in the network to play with new neighbors (Ichinose, et al., 2018) [17].
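Link-breaking partner selection can be sketched in one function; the cooperation-rate threshold is a hypothetical parameter of mine, not taken from the cited studies.

```python
def prune_links(links, coop_rate, threshold=0.5):
    """Keep only neighbors whose observed cooperation rate meets the
    cutoff (threshold hypothetical); dropped links would then be
    rewired to new partners."""
    return {n for n in links if coop_rate[n] >= threshold}

links = {'a', 'b', 'c'}
coop_rate = {'a': 0.9, 'b': 0.2, 'c': 0.6}
print(sorted(prune_links(links, coop_rate)))  # ['a', 'c']
```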
There are also more specific partner-selection mechanisms. For instance, agents can send each other a signal with a certain strength; they play only if both sides receive each other's signal, and they do not play if either side's signal fails to reach the other. The weaker the signal, the harder it is to deliver. Allowing agents to adjust their signal strength based on their own payoffs (Li, et al., 2020) [21] is thus also one of the Network dynamics approaches.
A concept of being active or inactive can also be introduced, so that inactive agents stop playing with any neighbor. In each round, agents select the action of being active or inactive via an iterated Q-table (Guo, et al., 2022) [14].
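Choosing between the two activity states from Q-values can be sketched with a generic epsilon-greedy rule; this is a textbook illustration of the Q-table idea, not Guo et al.'s exact update or selection scheme.

```python
import random

def choose_activity(q_active, q_inactive, epsilon=0.1):
    """Epsilon-greedy selection between being active and inactive,
    based on the agent's learned Q-values (generic sketch)."""
    if random.random() < epsilon:
        return random.choice(['active', 'inactive'])  # explore
    return 'active' if q_active >= q_inactive else 'inactive'  # exploit

# With exploration disabled, the higher-valued state is chosen.
print(choose_activity(0.8, 0.3, epsilon=0.0))  # active
```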
Altruistic attribute
Several altruistic agents are placed in the network. When an altruistic agent cooperates, its neighbors, regardless of their strategies, gain additional benefits (Wu, et al., 2018) [41].
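The payoff modification can be sketched as follows; the bonus size is hypothetical, and the payoffs are the illustrative T=5, R=3, P=1, S=0.

```python
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def altruistic_payoffs(payoff, a_move, b_move, a_is_altruist, bonus=2):
    """When an altruistic agent (A) cooperates, its partner (B) gains an
    extra reward regardless of B's own move (bonus size hypothetical)."""
    pa, pb = payoff[(a_move, b_move)], payoff[(b_move, a_move)]
    if a_is_altruist and a_move == 'C':
        pb += bonus
    return pa, pb

print(altruistic_payoffs(PAYOFF, 'C', 'C', True))  # (3, 5): R plus bonus
print(altruistic_payoffs(PAYOFF, 'D', 'C', True))  # (5, 0): no bonus given
```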

Which environments are these methods applicable to?
For narrative convenience, all environments in this paper are divided into only 2 cases: IPD and Network.
IPD stands for Iterated Prisoner's Dilemma; although the term usually also covers iterated prisoner's dilemma games among multiple agents, in this paper it refers specifically to games between 2 agents.
Network refers to all prisoner's dilemma games between more than 2 agents, including various network forms such as square lattice networks and scale-free networks, and all game models with partner selection. The NIPD (in which an agent plays all of its neighbors at once and must choose to cooperate with all of them or with none) is also included.
Among the 31 studies, 26 were applied to Network and 5 to IPD. Details are as follows:

Bounded rationality
Among the 5 studies that used Bounded rationality, 4 [25] were applied to IPD.

Intrinsic reward
All 3 studies (Yuan, et al., 2022; Wu, et al., 2017; Fan, et al., 2022) [7,40,46] that used Intrinsic reward were applied to Network.

Altruistic attribute
The only study (Wu, et al., 2018) [41] that used Altruistic attribute was applied to Network.

How effective are these methods?

Bounded rationality
Bounded rationality only makes it possible that the cooperators are not all wiped out (Xu, et al., 2018, May) [43]. Agents will try to cooperate several times within a short period (Moriyama, et al., 2017, July) [25]. EEoC makes it possible to spread cooperation rather than generate it (Otsuka, et al., 2017, August) [26], and if the number of ProbD agents (a strategy of always choosing to defect) is not large, EEoC agents can spread and maintain mutual cooperation (Otsuka & Sugawara, 2018) [27].
TFT can promote the cooperative behavior of agents in a multi-agent system and improve the overall benefit of the system (Wang & Jiang, 2019) [37] .
Memory
Regarding memory length (the number of rounds that can be remembered), a longer memory length generally promotes cooperation more (Lotfi & Rodrigues, 2022) [23], though one study finds an optimal memory length for developing cooperation (Wang, et al., 2017) [39].

Adaptive strategy
By updating strategies adaptively, agents were able to cooperate with their opponents without losing competitiveness (Xue, et al., 2017) [45] and, in some cases, cooperation can even become the only stable state (Shang, et al., 2021, July) [31].
If imitation is limited to the strategies of superior agents, the Adaptive strategy can promote the formation and maintenance of cooperation, especially when there is a significant payoff difference between cooperators and defectors (Seredyński & Gąsior, 2019; Wang, et al., 2021) [30,36]. Multi-hop learning can enhance cooperation, and there is an optimal hop number (Liu, et al., 2021, August) [22].
With an extreme tendency to imitate superior agents, cooperators can dominate a finite population, and the level of cooperation increases with population size (Xu & Hui, 2019) [42].
Mood model
With historical comparison, a high level of cooperation can be achieved, while with social comparison a high level of cooperation can only be sustained in a portion of the population (Zeng, et al., 2017) [48].

Network dynamics
Network dynamics enhances cooperation in the prisoner's dilemma (Takesue, 2021; Li, et al., 2020) [21,33]; networks of medium density (a higher number of links between playing agents) can increase cooperation (Ichinose, et al., 2018) [17]; and cooperation can become the best strategy when network density increases (Takesue, 2018) [32].
For games that introduce the concept of signal strength (Li, et al., 2020) [21], Network dynamics can not only help cooperators escape the risk of extinction but also greatly increase the population size of cooperators.
For games that introduce active and inactive states (Guo, et al., 2022) [14], Network dynamics can make inactive agents form a belt that separates cooperative clusters from defective ones.

Summary
This paper summarizes 31 studies from January 2017 to January 2023 on how to make agents cooperate in a prisoner's dilemma and identifies seven methods: Bounded rationality; Memory; Adaptive strategy; Mood model; Intrinsic reward; Network dynamics; and Altruistic attribute.
The Bounded rationality approach makes agents engage in certain "prescribed" behaviors, often of a perceptual nature. Numerous strategies have been proposed based on this idea, including the TFT strategy. Such a perceptual strategy disregards the "benefit of Temptation": it may not provide the maximum return in the short term but can enhance overall benefits in the long run.
In general, not all agents use the same strategy; they are often placed in populations with certain proportions of ProbD and ProbC, and the proportion of mutual cooperation in the whole population is observed after several rounds of "game → elimination → reproduction → game". This approach is therefore more akin to "spreading" cooperation than to "generating" it.
The exception is allowing agents to make mistakes, which does generate cooperation. If agents are allowed to learn, they will stumble into cooperation through occasional mutual mistakes and consequently learn to cooperate for a short period, somewhat like an operant conditioning chamber (McLeod, 2015) [24]: against the long-term background of mutual betrayal, cooperation is more rewarding and becomes an incentive.
Unlike the fixed strategies of Bounded rationality, both Adaptive strategy and Memory allow agents to change their strategies through learning. They differ only in the agent's ability to observe the outcomes of past rounds of the game.
When an agent holds Memory, it can learn from past gains and learn to cooperate without being tied to immediate benefits, since repeated Reward gains exceed repeated Punishment gains. If agents can also observe other agents' past choices, they can learn richer strategies to gain more while avoiding exploitation; a simple example is the TFT strategy, which can be learned when the Memory length is 1 (Anastassacos, et al., 2020, April) [1]. Memory need not belong only to the agent itself; it can also belong to the whole society. In real life, criminal records are an example of social memory: an employer can judge whether to offer employment (cooperate) based on them.
Learning from immediate gains alone, without Memory, cannot produce cooperation (Axelrod, 1980) [2]. The best approach is therefore to learn the strategies of better-performing agents, but this requires that agents observe the total gains of other agents. Over time, the agents with the highest gains tend to be those able either to exploit others or to cooperate; and when agents can change their strategies to avoid being exploited, exploitation becomes unsustainable, so agents end up learning from those who cooperate in the long term.
However, this method needs to ensure that cooperators do not disappear prematurely. If agents could observe the highest global gainers and all turn to exploitation, cooperators would die out at an early stage and would no longer be available to learn from, which is one reason why all six studies using Adaptive strategies only learn from neighbors.
The Mood model is in fact a kind of bounded rationality, but it is classified as an independent approach because of its dynamic adjustment. It is commonly believed that rationality makes agents choose the optimal solution in a prisoner's dilemma (Campbell & Sowden, 1985) [3]: defection. The Mood model is set up so that mood becomes low when agents feel their gains decrease and high when they feel their gains increase. When mood is high, agents are less "calculating" and ignore the optimal solution in favor of cooperation. When they are exploited and their gains fall, they "calm down" and return to rational thinking to avoid further exploitation.
In a two-player IPD, when both parties are in a high mood state, both parties' gains increase, which in turn further raises the mood value. In fact, the TFT strategy can be seen as an extreme Mood model: if the opponent cooperates, then whichever choice the agent makes, its gain increases, leading to a rapid rise in mood value and thus to cooperation; similarly, on meeting a defection it immediately feels frustrated by the drop in gain. The disadvantage of the Mood model is thus obvious: as the number of participants increases, once someone defects while mood values are low, mutual cooperation quickly breaks down and the propagation of defection cannot be stopped. This is why Mood models need to be run on sparsely connected networks. Social comparison performs better than historical comparison for the same reason: defections propagating through the network reduce the average gain of society as a whole, so an agent comparing itself against its neighbors is less likely to perceive a drop in gains, and can thus maintain a good mood, than one comparing against its own past gains.
The final method related to the rationality level is Intrinsic reward, which lets agents take some psychological satisfaction as part of the reward for participating in the prisoner's dilemma. From a global perspective, these agents behave as if they have various "obsessions"; for example, agents can be set to be inclined to cooperate, so that whenever they choose to cooperate they receive an additional reward regardless of what the opponent chooses, which undoubtedly increases cooperation. At first glance this appears to be a "cheating" strategy, because it directly modifies the payoff table and undermines the definition of the prisoner's dilemma; in reality, all human-set strategies do the same, including the TFT strategy, which simply ignores the payoff table. Moreover, these intrinsic rewards are usually based on psychological findings or hypotheses, so they are considered a method in their own right.
Network dynamics is also a strategy, but a fundamentally different one: the decision is not whether to cooperate or defect but whether to play with another agent at all, namely partner selection. As with Memory, agents can observe other agents' past choices or changes in their own past gains and choose whether to continue playing with a particular agent.
If the model is "game → elimination → reproduction → game", ProbD ends up being eliminated because no partner is willing to play with it and it cannot gain any benefit; if the model involves learning, the agent learns that it can gain more only if others agree to play with it, and so learns to cooperate. Network dynamics presents itself as a means of "self-protection", but it is also a means of punishment. The "criminal record" example from the discussion of Memory is in fact closer to Network dynamics, because the employer does not choose to defect but refuses to play at all.
The last method is the Altruistic attribute, i.e., adding some special agents to the game: when playing with these special agents, if the special agent chooses to cooperate, the opponent receives an additional reward regardless of its own choice. Such agents are easy targets for exploitation, so they need to learn self-preservation strategies: once such a special agent stops cooperating, its opponent no longer receives the additional reward.
When the extra reward is large enough, the Reward gains will exceed the Sucker's gains, so other agents will choose to cooperate. Like Intrinsic rewards, this alters the payoff table, but unlike Intrinsic rewards, the Altruistic attribute alters the opponent's payoff rather than the agent's own. Even when the additional payoff is small, other agents may choose to cooperate in order to keep receiving it in the long run and to avoid the special agents turning to defection. Because this requires observing long-term outcomes, the method depends on a memory function.
In summary, reviewing the seven methods for generating cooperation among agents in the 31 studies, it is easy to see some interoperability among them; some are even variants of another (e.g., Intrinsic rewards are complex manifestations of Bounded rationality). However, this paper distinguishes them as different methods for ease of description, and also because they are set up from different human perspectives. The various strategies of Bounded rationality resemble the "must follow" rules we are taught as children; Memory is like the records we accumulate (criminal records, awards, etc.); Adaptive strategies are learning; Mood models are, as the name implies, emotional; Intrinsic rewards are psychological satisfaction; Network dynamics is the right to choose partners; and Altruistic attributes are like social welfare policies.
Some of them can generate cooperation (Memory, Adaptive strategy, Intrinsic reward), some can maintain it (Network dynamics, Altruistic attribute), and some can spread it (Bounded rationality, Mood model).
Returning to the three questions that this paper seeks to answer:
1. What methods are used in these studies to circumvent the prisoner's dilemma?
In total, seven methods get agents to produce cooperation from the prisoner's dilemma: Bounded rationality; Memory; Adaptive strategy; Mood model; Intrinsic reward; Network dynamics; and Altruistic attribute.
2. Which environments are these methods applicable to?
All seven methods have been used in multi-agent networks; Bounded rationality, Memory, and the Mood model have also been used in IPD with only 2 agents.
3. How effective are these methods?
The reviewed studies show that Memory, Adaptive strategy, and Intrinsic reward perform well in generating cooperation; Network dynamics and the Altruistic attribute are suited to maintaining the generated cooperation; and Bounded rationality and the Mood model can spread cooperation.

Future work
In future work, further research will be conducted to unify the seven methods summarized in this paper, in order to model the generation, maintenance, and propagation of cooperation and to simulate real-life human cooperation, such as the development of fishing moratoriums.
The strengths of these seven approaches will be integrated and explored, with a particular focus on extending the mood modeling component.This will enable agents to spontaneously generate emotions based on their environment and develop corresponding strategies, thereby creating a more realistic social environment.
Additionally, this review will be regularly updated to incorporate any new methods that emerge.

Table 1 .
Payoff table for the prisoner's dilemma

Table 2 .
Database, search statements and search