iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games

上傳人：我*** IP屬地：北京上傳時(shí)間：2020-06-20 格式：PDF 頁(yè)數(shù)：11 大?。?.47MB 積分：12 舉報(bào) 版權(quán)申訴

iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games_第2頁(yè)

iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games_第3頁(yè)

iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games_第4頁(yè)

iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games_第5頁(yè)

已閱讀5頁(yè)，還剩6頁(yè)未讀，繼續(xù)免費(fèi)閱讀

版權(quán)說(shuō)明：本文檔由用戶提供并上傳，收益歸屬內(nèi)容提供方，若內(nèi)容存在侵權(quán)，請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、Learning to Correlate in Multi-Player General-Sum Sequential Games Andrea Celli Politecnico di Milano andrea.cellipolimi.it Alberto Marchesi Politecnico di Milano alberto.marchesipolimi.it Tommaso Bianchi Politecnico di Milano tommaso4.bianchimail.polimi.it Nicola Gatti Politecnico di Milano nicola.

2、gattipolimi.it Abstract In the context of multi-player, general-sum games, there is a growing interest in solution concepts involving some form of communication among players, since they can lead to socially better outcomes with respect to Nash equilibria and may be reached through learning dynamics

3、 in a decentralized fashion. In this paper, we focus on coarse correlated equilibria (CCEs) in sequential games. First, we complete the picture on the complexity of fi nding social-welfare-maximizing CCEs by proving that the problem is not in Poly-APX, unless P = NP, in games with three or more play

4、ers (including chance). Then, we provide simple arguments showing that CFRworking with behavioral strategiesmay not converge to a CCE in multi-player, general-sum sequential games. In order to amend this issue, we devise two variants of CFR that provably converge to a CCE. The fi rst one (CFR-S) is

5、a simple stochastic adaptation of CFR which employs sampling to build a correlated strategy, whereas the second variant (called CFR-Jr) enhances CFR with a more involved reconstruction procedure to recover correlated strategies from behavioral ones. Experiments on a rich testbed of multi-player, gen

6、eral-sum sequential games show that both CFR-S and CFR-Jr are dramatically faster than the state-of-the-art algorithms to compute CCEs, with CFR-Jr being also a good heuristic to fi nd socially-optimal CCEs. 1Introduction A number of recent studies explore relaxations of the classical notion of equi

7、librium (i.e., the Nash equilibrium (NE) 30), allowing to model communication among the players 3,14,34. Communication naturally brings about the possibility of playing correlated strategies. These are customarily modeled through a trusted external mediator who privately recommends actions to the pl

8、ayers 1. In particular, a correlated strategy is a correlated equilibrium (CE) if each player has no incentive to deviate from the recommendation, assuming the other players would not deviate either. A popular variation of the CE is the coarse correlated equilibrium (CCE), which only prevents deviat

9、ions before knowing the recommendation 28. In sequential games, CEs and CCEs are well-suited for scenarios where the players have limited communication capabilities and can only communicate before the game starts, such as, e.g., military settings where fi eld units have no time or means of communica

10、ting during a battle, collusion in auctions where communication is illegal during bidding, and, in general, any setting with costly communication channels or blocking environments. Equal contribution. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. CCEs pr

11、esent a number of appealing properties. A CCE can be reached through simple (no-regret) learning dynamics in a decentralized fashion 17,22, and, in several classes of games (such as, e.g., normal-form and succinct games 31,26), it can be computed exactly in time polynomial in the size of the input.

12、Furthermore, an optimal (i.e., social-welfare-maximizing) CCE may provide arbitrarily larger welfare than an optimal CE, which, in turn, may provide arbitrarily better welfare than an optimal NE 12 . Although the problem of fi nding an optimal CCE is NP-hard for some game classes (such as, e.g., gra

13、phical, polymatrix, congestion, and anonymous games 3), Roughgarden34shows that the CCEs reached through regret-minimizing procedures have near-optimal social welfare when the(,) -smoothness condition holds. This happens, e.g., in some specifi c auctions, congestion games, and even in Bayesian setti

14、ngs, as showed by Hartline et al.23. Thus, decentralized computation via learning dynamics, computational effi ciency, and welfare optimality make the CCE one of the most interesting solution concepts for practical applications. However, the problem of computing CCEs has been addressed only for some

15、 specifi c games with particular structures 3,23. In this work, we study how to compute CCEs in the general class of games which are sequential, general-sum, and multi-player. This is a crucial advancement of CCE computation, as sequential games provide a model for strategic interactions which is ri

16、cher and more adherent to real-world situations than the normal form. In sequential games, it is known that, when there are two players without chance moves, an optimal CCE can be computed in polynomial time 12. Celli et al.12also provide an algorithm (with no polynomiality guarantees) to compute so

17、lutions in multi-player games, using a column-generation procedure with a MILP pricing oracle. As for computing approximate CCEs, in the normal-form setting, any Hannan consistent regret-minimizing procedure for simplex decision spaces may be employed to approach the set of CCEs 5,13the most common

18、of such techniques is regret matching(RM)4,22. However, approachingthesetofCCEsinsequentialgamesismoredemanding. One could represent the sequential game with its equivalent normal form and apply RM to it. However, this would result in a guarantee on the cumulative regret which would be exponential i

19、n the size of the game tree (see Section 2). Thus, reaching a good approximation of a CCE could require an exponential number of iterations. The problem of designing learning algorithms avoiding the construction of the normal form has been successfully addressed in sequential games for the two- play

20、er, zero-sum setting. This is done by decomposing the overall regret locally at the information sets of the game 15. The most widely adopted of such approaches are counterfactual regret minimization (CFR) 43 and CFR+ 39,38, which originated variants such as those introduced by Brown and Sandholm9and

21、 Brown et al.10. These techniques were the key for many recent remarkable results 6,7,8?. However, these algorithms work with players behavioral strategies rather than with correlated strategies, and, thus, they are not guaranteed to approach CCEs in general-sum games, even with two players. The onl

22、y known theoretical guarantee of CFR when applied to multi-player, general-sum games is that it excludes dominated actions 19. Some works also attempt to apply CFR to multi-player, zero-sum games, see, e.g., 32. Original contributions First, we complete the picture on the computational complexity of

23、 fi nding an optimal CCE in sequential games, showing that the problem is inapproximable (i.e., not in Poly- APX), unless P = NP, in games with three or more players (chance included). In the rest of the paper, we focus on how to compute approximate CCEs in multi-player, general-sum, sequential game

24、s using no-regret-learning procedures. We start pointing out simple examples where CFR-like algorithms available in the literature cannot be directly employed to our purpose, as they only provide players average behavioral strategies whose product is not guaranteed to converge to an approximate CCE.

25、 However, we show how CFR can be easily adapted to approach the set of CCEs in multi-player, general-sum sequential games by resorting to sampling procedures (we call the resulting, nave algorithm CFR-S). Then, we design an enhanced version of CFR (called CFR-Jr) which computes an average correlated

26、 strategy guaranteed to converge to an approximate CCE with a bound on the regret sub-linear in the size of the game tree. The key component of CFR-Jr is a polynomial-time algorithm which constructs, at each iteration, the players normal-form strategies by working on the game tree, avoiding to build

27、 the (exponential-sized) normal-form representation. We evaluate the scalability of CFR-S and CFR-Jr on a rich testbed of multi-player, general-sum sequential games. Both algorithms solve instances which are orders of magnitude larger than those solved by previous state-of-the-art CCE-fi nding techn

28、iques. Moreover, CFR-Jr proved to be a good heuristic to compute optimal CCEs, returning nearly-socially-optimal solutions in all the instances of our testbeds. Finally, we also test our algorithms against CFR in multi-player, general-sum games, showing that, in several 2 instances of our testbed, C

29、FR does not converge to a CCE and it returns solutions providing a social welfare considerably lower than that achieved with CFR-S and CFR-Jr. 2Preliminaries In this section, we introduce some basic concepts which are used in the rest of the paper (see Shoham and Leyton-Brown 36 and Cesa-Bianchi and

30、 Lugosi 13 for further details). 2.1Extensive-form games and relevant solution concepts We focus on extensive-form games (EFGs) with imperfect information and perfect recall. We denote the set of players asP c, wherecis the Nature (chance) player (representing exogenous stochasticity) selecting acti

31、ons with a fi xed known probability distribution.His the set of nodes of the game tree, and a nodeh H is identifi ed by the ordered sequence of actions from the root to the node.Z His the set of terminal nodes, which are the leaves of the game tree. For every h H Z, we letP(h)be the unique player wh

32、o acts athandA(h)be the set of actions she has available. We writeh ato denote the node reached whena A(h)is played ath. For each player i P,ui: Z Ris the payoff function. We denote bythe maximum range of payoffs in the game, i.e., = maxiP(maxzZui(z) minzZui(z). We represent imperfect information us

33、ing information sets (from here on, infosets). Any infosetI belongs to a unique playeri, and it groups nodes which are indistinguishable for that player, i.e., A(h) = A(h0)for any pair of nodesh,h0 I.Iidenotes the set of all playeris infosets, which form a partition ofh H | P(h) = i. We denote byA(I

34、)the set of actions available at infosetI. In perfect-recall games, the infosets are such that no player forgets information once acquired. Wedenotewithiabehavioralstrategyofplayeri , whichisavectordefi ningaprobabilitydistribution at each playeris infoset. Giveni, we leti,Ibe the (sub)vector repres

35、enting the probability distribution at I Ii, with i,I,adenoting the probability of choosing action a A(I). An EFG has an equivalent tabular (normal-form) representation. A normal-form plan for playeri is a vectori i= IIi A(I) which specifi es an action for each playeris infoset. Then, an EFG is desc

36、ribed through a |P|-dimensional matrix specifying a utility for each player at each joint normal-form plan = iP i. The expected payoff of playeri, when she playsi iand the opponents play normal-form plans ini i= j6=iP j, is denoted, with an overload of notation, byui(i,i). Finally, a normal-form str

37、ategyxiis a probability distribution overi. We denote byXithe set of the normal-form strategies of playeri. Moreover,Xdenotes the set of joint probability distributions defi ned over . We also introduce the following notation. We letibe a vector in which each componenti z is the probability of reach

38、ing the terminal nodez Z, given that playeriadopts the behavioral strategyi and the other players play so as to reachz. Similarly, given a normal-form plani i , we defi ne the vectori. Moreover, with an abuse of notation,i I andi I denote the probability of reaching infosetI Ii. Finally,Z(i) Zis the

39、 subset of terminal nodes which are (potentially) reachable if player i plays according to i i. The classical notion of CE by Aumann1models correlation via the introduction of an external mediator who, before the play, draws the joint normal-form plan according to a publicly knownx X, and privately

40、communicates each recommendation i to the corresponding player. After observing their recommended plan, each player decides whether to follow it or not. A CCE is a relaxation of the CE, defi ned by Moulin and Vial28, which enforces protection against deviations which are independent from the sampled

41、 joint normal-form plan. Defi nition 1.A CCE of an EFG is a probability distributionx Xsuch that, for everyi P, and 0 i i, it holds: X ii X ii x(i,i)(ui(i,i) ui(0 i,i) 0. CCEs differ from CEs in that a CCE only requires that following the suggested plan is a best response in expectation, before the

42、recommended plan is actually revealed. In both equilibrium concepts, the 3 entire probability distribution according to which recommendations are drawn is revealed before the game starts. After that, each player commits to playing a normal-form plan (see Appendix A for further details on the various

43、 notions of correlated equilibrium in EFGs). An NE 30 is a CCE which can be written as a product of players normal-form strategiesx i Xi. In conclusion, an-CCE is a relaxation of a CCE in which every player has an incentive to deviate less than or equal to(the same defi nition holds true for -CE and

44、 -NE). 2.2Regret and regret minimization In the online convex optimization framework 42, each playeriplays repeatedly against an unknown environment by making a series of decisionsx1 i,x2i,.,xti. In the basic setting, the decision space of playeriis the whole normal-form strategy spaceXi. At iterati

45、ont, after selectingxt i, playeri observes a utility ut i(xti ). The cumulative external regret of player i up to iteration T is defi ned as RT i = max xiXi T X t=1 ut i( xi) T X t=1 ut i(x t i). (1) A regret minimizer is a function providing the next playeris strategyxt+1 i on the basis of the past

46、 history of play and the observed utilities up to iterationt. A desirable property for regret minimizers is Hannan consistency 21, which requires thatlimsupT 1 TR T i 0, i.e., the cumulative regret grows at a sublinear rate in the number of iterations T. In an EFG, the regret can be defi ned at each

47、 infoset. AfterTiterations, the cumulative regret for not having selected actiona A(I)at infosetI Ii(denoted byRT I(a) is the cumulative difference in utility that playeriwould have experienced by selectingaatIinstead of following the behavioral strategyt i at each iterationtup toT. Then, the regret

48、 for playeriat infosetI Ii is defi ned as RT I = maxaA(I)RT I(a). Moreover, we let R T,+ I (a) = maxRT I(a),0. RM 22 is the most widely adopted regret-minimizing scheme when the decision space isXi(e.g., in normal-form games). In the context of EFGs, RM is usually applied locally at each infoset, wh

49、ere the player selects a distribution over available actions proportionally to their positive regret. Specifi cally, at iterationT +1playeriselects actionsa A(I)according to the following probability distribution: T+1 i,I,a = RT,+ I (a) P a0A(I)R T,+ I (a0), if P a0A(I)R T,+ I (a0) 0 1 |A(I)|, other

50、wise . Playing according to RM at each iteration guarantees, on iterationT,RT I |A(I)| T 13. CFR 43 is an anytime algorithm to compute-NEs in two-player, zero-sum EFGs. CFR minimizes the external regret RT i by employing RM locally at each infoset. In two-player, zero-sum games, if both players have

51、 cumulative regrets such that 1 TR T i , then their average behavioral strategies are a2- NE 41. CFR+ is a variation of classical CFR which exhibits better practical performances 39. However, it uses alternation (i.e., it alternates which player updates her regret on each iteration), which complicat

52、es the theoretical analysis to prove convergence 15, 11. 3Hardness of approximating optimal CCEs We address the following question: given an EFG, can we fi nd a social-welfare-maximizing CCE in polynomial time? As shown by Celli et al.12, the answer is yes in two-player EFGs without Nature. Here, we

53、 give a negative answer to the question in the remaining cases, i.e., two-player EFGs with Nature (Theorem 1) and EFGs with three or more players without Nature (Theorem 2). Specifi cally, we provide an even stronger negative result: there is no polynomial-time approximation algorithm which fi nds a

54、 CCE whose value approximates that of a social-welfare-maximizing CCE up to any polynomial factor in the input size, unless P = NP. 2 We prove our results by means of a reduction from SAT, a well known NP-complete problem 18, which reads as follows. 2Formally, an r-approximation algorithmAfor a maxi

55、mization problem is such that OPT APX r, where OPT is the value of an optimal solution to the problem instance and APXis the value of the solution returned byA. See 2 for additional details on approximation algorithms. 4 I hN I1 1 x 0 x x 1 3 I2 0 x 1 x x 1 y 0 y y 1 3 I3 0 x 1 x x 0 y 1 y y 1 3 aI

56、1,1 + aO IyIx I 1,1 + ,0 aO hN I1 aN,1 I2 aN,2 I3 aN,3 aI . aO,1 |C| 2(|C|1), |C| |C|1 aN,1 |C| 2 ,|C| aN,2 |C| 2(|C|1), |C| |C|1 aN,3 aO,2 . aO,3 aI IN player 1 player 2 ? Nature / player 3 Figure 1: Left: Example of game for the reduction of Theorem 1, whereV = x,y,z,C = 1,2,3,1= x,2= x y, and3= x

57、 y. Right: Example of game for the reduction of Theorem 2, with V and C as before. Defi nition 2(SAT). Given a fi nite setC of clauses defi ned over a fi nite setVof variables, is there a truth assignment to the variables which satisfi es all clauses? For clarity, Figure 1 shows concrete examples of

58、 the EFGs employed for the reductions of Theo- rems 1 and 2. Here, we only provide proof sketches, while we report full proofs in Appendix B. Theorem 1.Given a two-player EFG with Nature, the problem of computing a social-welfare- maximizing CCE is not in Poly-APX unless P = NP. 3 Proof sketch.An example of our reduction from SATis provided on the left of Figure 1. Its main idea is the following: player 2 selects a truth assignment to the variables, while player 1 chooses a literal for each clause in order to satisfy it. It

人人文庫(kù)> 全部分類> 應(yīng)用文書

溫馨提示

1. 本站所有資源如無(wú)特殊說(shuō)明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽，若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間，僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理，對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯，并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容，請(qǐng)與我們聯(lián)系，我們立即糾正。
7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games

文檔簡(jiǎn)介

溫馨提示

最新文檔

評(píng)論

iccv2019論文全集9465-learning-to-correlate-in-multi-player-general-sum-sequential-games

文檔簡(jiǎn)介

溫馨提示

最新文檔

評(píng)論

相關(guān)文檔