RESEARCH ARTICLE

Decentralized Opportunistic Spectrum Resources Access Model and Algorithm toward Cooperative Ad-Hoc Networks

Ming Liu, Yang Xu*, Abdul-Wahid Mohammed

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, P.R. China

* [email protected]

Abstract


Limited communication resources have gradually become a critical factor in the efficiency of decentralized large-scale multi-agent coordination as the system scales up and tasks become more complex. In current research, due to an agent's limited communication and observational capability, an agent in a decentralized setting can only choose a subset of channels to access, and cannot perceive or share global information. Each agent's cooperative decision is based on a partial observation of the system state, and as such, uncertainty in the communication network is unavoidable. In this situation, working out cooperative decisions under uncertainty with only a partial observation of the environment is a major challenge. In this paper, we propose a decentralized approach that allows agents to cooperatively search for and independently choose channels. The key to our design is to build an up-to-date observation of each agent's view so that a local decision model is achievable in large-scale team coordination. We simplify the Dec-POMDP model, and each agent can jointly work out its communication policy in order to improve its local decision utilities for the choice of communication resources. Finally, we discuss an implicit resource competition game, and show that there exists an approximate resource-access tradeoff balance between agents. Based on this discovery, the tradeoff between real-time decision-making and the efficiency of cooperation over these channels can be well improved.

Introduction

Communication resources always play a latent role in networked large-scale agent team coordination applications, such as multi-robot systems, mobile sensor systems, etc. As the system expands, communication resources exert a momentous impact on cooperative efficiency [1], and considerable attention from both industry and academia has been devoted to this research [2]. For instance, the maximum transfer rate of the IEEE 802.11b protocol is 11 Mbit/s, and with uncertain latency and packet loss, this may fail to meet the capacity requirement of large-scale robots carrying video equipment for surveillance in an open environment [3]. In our previous work [4], we found that, with the expansion of the team size, robots will



compete for the limited spectrum resources, a phenomenon also supported by other studies [5, 6]. On the other hand, unlike Cognitive Radio (CR) [7], Mobile Sensor [8] and other traditional wireless communication research, a multi-agent system usually consists of multiple inexpensive agents, without a strong central processing unit or resource pre-authorization, but with more incomplete channel observations and changing dynamics. In addition, there are no typical technical features such as a fixed base station or a central node to manage and distribute channels. The major communication mode for most decentralized multi-agent systems is the ad-hoc network [2]. However, the typical pre-authorization and consultative allocation approaches cannot cope with dynamic tasks and agent migration. In consequence, new concepts and strategies should be developed, and this is the main motivation of the work proposed here.

As the main technical part of our research, in this paper we model the decentralized multi-agent multi-channel access problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) problem. We use a continuous-time Markov model to simulate the usage of channels, while constant slotted opportunities are used to support agents' interaction. In addition, we use a sample-based Partially Observable Markov Decision Process (POMDP) to simplify the model. Finally, based on game theory, we model and analyze the implicit resource competition between agents, and prove the existence of an equilibrium in an ideal state.
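To make the channel-usage assumption concrete, the sketch below (ours, not part of the original model; the rates, slot length, and function names are illustrative placeholders) simulates a single channel as a two-state continuous-time Markov chain whose idle and busy holding times are exponential, sampled once per sensing slot:

```python
import random

def simulate_channel(rate_idle, rate_busy, slot, num_slots, seed=0):
    # Two-state continuous-time Markov chain: holding times are exponential,
    # and a sensing agent observes the state once per slot boundary.
    rng = random.Random(seed)
    t, state = 0.0, "idle"
    next_jump = rng.expovariate(rate_idle)   # time of the next state change
    samples = []
    for _ in range(num_slots):
        t += slot
        while next_jump <= t:                # advance the chain past this slot boundary
            state = "busy" if state == "idle" else "idle"
            rate = rate_busy if state == "busy" else rate_idle
            next_jump += rng.expovariate(rate)
        samples.append(state)                # what the agent would sense in this slot
    return samples

print(simulate_channel(rate_idle=0.3, rate_busy=0.7, slot=1.0, num_slots=10))
```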

State of the Art

Even though centralized channel resource allocation methods can provide some sort of optimal solution, they are less effective in situations where the central point fails. For instance, typical auction-based algorithms generally have low communication requirements [9], but the negotiation process can degrade in overall efficiency as communication deteriorates [10]. It has been shown in [11] that spatial channel opportunity allocation is equivalent to a graph coloring problem, whose objective is to obtain a color assignment that maximizes the utility. But obtaining the optimal coloring is generally known to be NP-hard. Opportunistic Spectrum Access (OSA) [12] and Opportunistic Spectrum Sharing (OSS) [13] are widely adopted in recent research, and several investigations have modeled OSA problems as a POMDP [14]. The basic OSA concept is that an agent can identify and access idle frequency bands to obtain a maximized reward. Many decentralized methods borrow from the POMDP design, but they vary in their reliance on coordination schemes and can only handle intermittent communication resource scheduling. Reinforcement learning (RL) [15] is a paradigm for solving POMDP problems, inspired by learning theory, and it has shown good performance in multi-robot decision applications [16, 17]. For most RL-based multi-agent systems, the rewards are achieved by long-term learning, i.e., the expected accumulated reward that the agent expects to receive in the future under the policy, which can be specified by updating a value function. However, for applications with fixed utility function designs, time constraints, and limited interaction and observation, RL is restricted. Game theory provides another approach to OSA. The stochastic game [18], as an extension of game theory, can improve the capability to solve OSA problems, and a deeper analysis of the relation between the game-based and graph-based methods is given in [19]. It is important to note that in many situations, the state of the system cannot be observed completely. Therefore, some research adopts the definition of a Partially Observable Stochastic Game (POSG), and a cooperative case of POSG, namely the Dec-POMDP [20]. Although some efforts have been made in building heuristic algorithms to solve this intrinsically NEXP-complete problem [21], it is still infeasible to obtain optimal results in limited time with only partial observation over channels. In addition, in a non-cooperative case, the Dec-POMDP is no longer suitable.


Many existing works assume that the observation information obtained from an agent's neighbors is highly correlated, and that exchanging it can improve the efficiency of multi-agent coordination. In this case, the exchange of local observations becomes important in coordination. From this view, we present a decentralized cooperative game model in which agents can iteratively adapt their strategies to reduce competition or conflict, and can meet each agent's minimum communication requirements in a timely manner. This presents a novel approach that addresses the gaps in the aforementioned works.

System Model and Problem Statement

In this section, we follow the basic idea of the continuous-time Markov model to define the basic model of the multi-channel access problem, and then describe the specific functional definition of each variable and the decision model.

Multi-Channel Access Model

We consider a multi-agent ad-hoc network created by the agents themselves in an open environment, with a set $R = \{r_1, r_2, \ldots, r_N\}$ of N distributed agents. Although multi-hop information sharing can eventually give each agent full knowledge of the global state, it consumes a lot of communication resources and deteriorates the system's performance. Therefore, each agent makes decisions based on its limited observations, and the entire system remains only partially observed. The network consists of a set of contiguous, orthogonal (non-interfering) and homogeneous channels (e.g., 3 such channels in IEEE 802.11b/g and 12 in IEEE 802.11a), denoted by CH = {c_1, c_2, ..., c_K}. The available channels are numbered from 1 to K, and we assume that N > K agents are seeking channel opportunities in these K channels. Agents can only access channels that are sensed as idle. As shown in Fig 1, a time slot consists of 3 parts: sensing, transmission and acknowledgment. Because of practical considerations, agent r_i can sense a set of channels and access a subset of the sensed channels. Limited by its hardware constraints, r_i can sense the channel set {C_1} ({C_1} ⊆ {CH}) and access a subset {C_2} ⊆ {C_1}. The decision problem is modeled as a Dec-POMDP tuple < S, A, T, Ω, O, R, p_0 >, where

• S = {s_0, s_1, ..., s_n} denotes the finite set of network states.

• A_i = {a_1, a_2, ..., a_m} denotes r_i's available action set. At each time step, all the agents in R take a joint action $\Lambda^t = \bigcup_{1\le i\le m}\{a_i^t\}$.

• T denotes the Markovian state transition function. P(s'|s, Λ^t) denotes the probability of moving to state s' when taking action Λ^t in state s.

• $\Omega^t = \bigcup_{1\le i\le k}\{\omega_i^t\}$ denotes the set of joint observations of all the agents, and $\omega_i^t$ is the observation of r_i at time t.

• O denotes the observation function, which specifies the probability of a joint observation O(Ω^t|s', Λ^t).

• R(s'|Λ^t, s) denotes the reward obtained from taking action Λ^t in state s.

• p_0 = {B_0, s_0} is the initial belief and state distribution.

Action a is determined by the policy π: b → a, which is the function that maps a belief state to the action that an agent should execute. $O_i^{t-1} = \times_{1\le\tau\le t-1}\{\omega^{\tau}\}$ denotes the known network states. Formally, most policies can be represented as decision trees. We use Q_i to denote the possible policy space for agent r_i, and Q_{-i} denotes the sets of policy trees for all agents except r_i. With a programming approach, we are required to incrementally generate the sets of useful policies for each agent. Thus, a joint policy P = ×_{i∈N}{π_i} is a vector of policy trees. A joint policy can then be evaluated with the following formulation:

$$V(s, P) = \sum_{\omega\in\Omega} P(\omega\mid s, P)\Big[\sum_{s'\in S} P(s'\mid s, P, \omega)\, V(s', P(\omega))\Big] \qquad (1)$$

where P(ω) is the joint policy of the subtree selected after observation ω. So we get the utility function as:

$$U(b_i, \pi_i) = \sum_{s\in S}\sum_{P_{-i}\in Q_{-i}} b_i(s, P_{-i})\, V[s, \{P_{-i}, \pi_i\}] \qquad (2)$$

Therefore, the essence of this framework is to find a set of n policies that maximize a total reward function over the finite horizon T under the initial belief state p_0, and the expected joint reward is given by $E\big(\sum_{t=0}^{T} R(s^t, \Lambda^t)\mid p_0\big)$.
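As an illustration of how a joint policy tree can be evaluated along the lines of Eq (1), the following is our own minimal sketch under simplifying assumptions, not the authors' implementation; the dictionary-based tables T, O, R and the tree encoding are placeholders, and an immediate-reward term is added so that the recursion accumulates the finite-horizon return:

```python
def eval_joint_policy(s, tree, T, O, R, states, observations):
    # tree = (joint_action, {observation: subtree}); a leaf has an empty subtree dict.
    # T[s][a][s2] and O[s2][a][o] are transition/observation probabilities, R[s][a] is the reward.
    a, children = tree
    value = R[s][a]                          # immediate joint reward for taking a in s
    if not children:
        return value
    for o in observations:
        for s2 in states:
            p = T[s][a][s2] * O[s2][a][o]    # P(s2, o | s, a), cf. Eq (1)
            if p > 0.0:
                value += p * eval_joint_policy(s2, children[o], T, O, R, states, observations)
    return value
```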

A Resource-aware Approach for Multi-agent Multi-channel Access

In this section, we demonstrate an agent's decision-making process based on its current observation and resource perception, and analyze the computational complexity under the assumption of no information sharing.


Resource Awareness Policy Generation

From the idealized view of Shannon's theory [22], the optimal available resources under an ideal state for r_i are:

$$Cap_i = B_n\sum_{j=1}^{N-1}\Big\{(1-p_m)\log_2\Big(1+\frac{g_i^2 p_j^o}{\sigma_j^o}\Big) + p_m\log_2\Big(1+\frac{g_i^2 p_j^o}{\sigma_i^p + g_i^2 p_i^p}\Big)\Big\} \qquad (3)$$

where B_n is the channel bandwidth and p_m is the channel state misperception probability. $\sigma_j^o$ and $\sigma_i^p$ respectively denote the noise variances from the other agents and from r_i on the affected channel ch_i. $p_j^o$ and $p_i^p$ respectively denote the communication power of the other agents and of r_i. $g_i$ is the channel sensing gain. However, in the presence of sensing errors, not only the sensing and access policy but also the operating characteristics of the channel sensor affect the performance of the network and the interference perceived by all the agents. The loss of resources caused by interference is:

$$\Delta Cap_i = B_n\sum_{j=1}^{N-1}\Big[\log_2\Big(1+\frac{g_i^2 p_j^o}{\sigma_j^o}\Big) - \log_2\Big(1+\frac{g_i^2 p_j^o}{\sigma_i^p + g_i^2 p_i^p}\Big)\Big]p_m \qquad (4)$$

As a result, agent r_i can obtain the idealized expected channel resources over {C_2} as:

$$ECap_i = \sum_{i=1}^{|C_2|}(Cap_i - \Delta Cap_i) \qquad (5)$$

The agent's channel access intervals are independent and follow the same negative exponential distribution G(t) = 1 − e^{−μ_i t}, where μ_i is the channel free probability. Thus, we can get the probability that agent r_i chooses and accesses channel c_j as:

$$p_{i,j}^p = p_{i,j}^s\, V(ECap_i, \pi_i)\log_2\Big(1+\frac{\nu_{i,j}}{0.2\ln BER_{i,n}}\Big) \qquad (6)$$

where $p_{i,j}^p$ and $p_{i,j}^s$ denote the probability that agent r_i selects channel c_j and the probability of channel c_j being sensed idle, respectively. $\nu_{i,j} = \frac{ECap_j\, p_j}{E(\Delta Cap_j)+N_0}$ is the Signal to Interference plus Noise Ratio (SINR) for agent r_i from the other agents in channel c_j. This problem cannot be solved in one stage, and as such, should be solved in an iterative manner. Therefore, based on the above analysis, we use Eqs (5) and (6) to obtain the policy tree and a target BER equal to $BER_{i,n} \approx \sigma_1\exp\big[\frac{-\sigma_2 z_{i,n}}{2^{b_{i,n}}-1}\big]$, where σ_1 and σ_2 are Lagrangian multipliers, b_{i,n} is the number of bits per symbol in channel c_n, and z_{i,n} is the Signal to Noise Ratio (SNR) for the receiver agent r_i in channel c_n. Consequently, we adopt the utility function design in [23]:

$$U(b, \pi) = \mu_1\sum_{c_i\in C_2} p_{i,j}^p\log(\mu_2\, ch_i k_i) - cost_i \qquad (7)$$

where the product ch_i k_i is the bandwidth (i.e. transmission rate), ch_i is the size of the access channel in Hz, k_i is the spectral efficiency in bits per symbol per Hz due to adaptive modulation, and μ_1 and μ_2 are constants that depend on the communication protocol and the agent's communication system performance, respectively. cost_i is the communication consumption, which relates to the agent's hardware system. The optimal policy is therefore $\pi^* = \arg\max_{\pi_i} U(b, \pi_i)$.
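A minimal numeric sketch of Eqs (3)-(7) follows; it is our own illustration, not the authors' code. The parameter names are placeholders, and the absolute value in the BER term is an assumption made only to keep the logarithm's argument positive:

```python
import math

def expected_capacity(Bn, pm, g, p_others, sigma_others, sigma_self, p_self):
    # Idealized capacity minus the interference loss for one channel, cf. Eqs (3)-(5).
    cap, loss = 0.0, 0.0
    for p_o, s_o in zip(p_others, sigma_others):
        clean = math.log2(1 + g * g * p_o / s_o)
        interfered = math.log2(1 + g * g * p_o / (sigma_self + g * g * p_self))
        cap += Bn * ((1 - pm) * clean + pm * interfered)   # Eq (3)
        loss += Bn * (clean - interfered) * pm             # Eq (4)
    return cap - loss                                      # one summand of Eq (5)

def access_probability(p_sensed_idle, policy_value, sinr, ber):
    # Probability of choosing and accessing a channel, cf. Eq (6).
    return p_sensed_idle * policy_value * math.log2(1 + sinr / (0.2 * abs(math.log(ber))))

def utility(access_probs, bandwidths, mu1=1.0, mu2=1.0, cost=0.0):
    # Utility of an access policy over the chosen channels, cf. Eq (7).
    return mu1 * sum(p * math.log(mu2 * bw) for p, bw in zip(access_probs, bandwidths)) - cost
```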


Dynamic Local Search

Based on the model described above, agent r_i cannot get a full view of the state of the system, since it can only use its own observation to update its actions. The goal of this problem model is to come up with a joint policy $P = \bigcup_{1\le i\le N}\{\pi_i^t\}$ that maximizes the expected reward of all the agents over a finite horizon. The belief space is a sufficient statistic [21], and can be independent of the decision time. We remark that r_i can only infer what action its neighbors may take, so interference or conflict is inevitable. At each time slot, we can compute the expected value of a policy as follows:

$$E(V_{\pi^t}(O^{t-1}, \omega^t)) = R(O^{t-1}, \langle\omega^t, \pi^t\rangle) + \sum_{S'\in S} P(O^{t-1}, \langle\omega^t, \pi^t\rangle, S')\sum_{\omega'\in\Omega} O(S', \langle\omega^t, \pi^t\rangle, \omega')\, V_{\pi^{t+1}}(S', \omega') \qquad (8)$$

Solutions to a finite-horizon POMDP can be represented as a decision tree, where nodes denote the actions and arcs denote the observations. Similarly, solving a finite-horizon Dec-POMDP with a known state space can be formulated as a search over a vector of horizon-T policy trees.

Algorithm 1: Resource-aware policy search for agent r_i.
Require: set g_0 = 0;

$$V(\pi^t) = \frac{1}{|C_2|}\sum_{j=1}^{|C_2|} R(ECap_i\mid a^t, \pi^t, \delta_i^t) \qquad (9)$$

where $X_\pi$ represents the conditional expectation given that policy π_i is employed, and B_0 is the initial belief, which can be the stationary distribution of the network state. $\delta_i^t$ is the knowledge, consisting of two parts: the channel observation ω^t and the known status $O_i^t$. The search strategy performs a summation over all possible network states from the agents' observations. Since each policy specifies different actions over possible histories of observations, the number of possible policies for agent r_i is $O\big(|A_i|^{\frac{|\Omega|^T-1}{|\Omega|-1}}\big)$. In consequence, the time complexity of finding the optimal policy by searching this space is $O\big(|A_i|^{\frac{|\Omega|^T-1}{|\Omega|-1}}\cdot|S|\cdot|\Omega_i|^T\big)$.
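The doubly exponential growth of the policy space can be checked in a few lines (our own illustration; the tiny action and observation counts are placeholders):

```python
def num_policy_trees(num_actions, num_obs, horizon):
    # A horizon-T policy tree has (|Omega|^T - 1)/(|Omega| - 1) nodes, each choosing one action.
    nodes = (num_obs ** horizon - 1) // (num_obs - 1)
    return num_actions ** nodes

for T in range(1, 5):
    print(T, num_policy_trees(num_actions=2, num_obs=2, horizon=T))
# Horizon 4 already yields 2**15 = 32768 trees per agent, before combining agents.
```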

A Decision Theoretical Approach for Multi-channel Access

In the previous sections, we proposed a random search solution without coordination. This method has very high computational complexity and time cost. But from a practical point of view, each agent can be aware of its neighbors. Therefore, with the neighbors' policy sharing, agent r_i can get an approximately full local observation. Consequently, we refer to the design in [21], and the multi-agent finite-horizon Dec-POMDP model can be decomposed into several single-agent POMDP decision problems.

Neighbor-Aware Policy Generation

In order to solve a single-agent POMDP, we introduce the neighbor policies $\bar{\pi}_{-i}$ as a new parameter to the knowledge δ_i, and the joint policy of the n neighbors is formulated as $\bar{P}_{-i} = \bigcup_{i\in n}\{\bar{\pi}_{-i}\}$. Therefore, we augment the state space to $I = \{S\times\bar{S}\}$, where the second set $\bar{S}$ contains the state variables of the other agents' beliefs. In consequence, we reduce the Dec-POMDP to a POMDP model given by the tuple $\langle I, A, T, \Omega, O, R, \{\delta_i\}\rangle$. All variable definitions remain unchanged, and to accomplish this, we factor the transition distribution into two terms: $T[(s', \bar{s}')\mid a, \bar{P}_{-i}(\bar{s}), (s, \bar{s})] = T[s'\mid a, \bar{P}_{-i}(\bar{s}), s]\cdot T(\bar{s}'\mid s', a, \bar{P}_{-i}(\bar{s}))$, and the upper bound of the POMDP value function can be reached through complete observation. In consequence, the belief update function can be written as:

$$b(s') = P(s'\mid \omega, a, b) = \frac{O(s', a, \omega)\sum_{s\in I} T(s, a, s')\, b(s)}{P(\omega\mid a, b)} \qquad (10)$$

The value function of a POMDP is defined over the space of beliefs, where a belief state b represents a probability distribution over states. The optimal value of policy π can then be approximated as:

$$V_\pi(b) = \max_{a\in A}\Big\{R_\delta(b, a) + \lambda\sum_{\omega\in\Omega} p(\omega\mid b, a)\, V^*(b, a, \omega)\Big\} \qquad (11)$$
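Eq (10) is the standard Bayesian belief filter; a minimal sketch (ours, with dictionary-based probability tables as placeholders) is:

```python
def belief_update(belief, action, obs, T, O, states):
    # b'(s2) is proportional to O(s2, a, o) * sum_s T(s, a, s2) * b(s), cf. Eq (10).
    new_belief = {}
    for s2 in states:
        prior = sum(T[s][action][s2] * belief[s] for s in states)
        new_belief[s2] = O[s2][action][obs] * prior
    norm = sum(new_belief.values())          # this is P(o | a, b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: v / norm for s, v in new_belief.items()}
```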


Heuristic Local Policy Search

Modeling our problem as a POMDP, the goal is to search for the optimal policy π* that maximizes the expected reward over a finite horizon-T policy distribution over states. Formally, a belief state b^{t+1} = P(s^{t+1}|δ^t, δ^{t-1}, ..., δ^0) is a probability distribution over states conditioned on the knowledge $\delta_i^t$. In order to avoid a heuristic with unbounded input (the knowledge can be arbitrary), a traditional approach is to learn a mapping from belief states to actions, that is, from the known knowledge $\delta_i^t$. But in discrete worlds, beliefs can only be represented by states with probabilities. We represent the regeneration process of belief states by sampling. A sample x is annotated with a numerical importance factor to account for the difference in the sampling distribution. Heuristic search is based on the decomposition of the evaluation function into a sequence of exact sub-evaluations. As aforementioned, we denote q^t as an arbitrary depth-t policy vector extracted from the policy vector Q^T, and {q^t, Q^{T-t}} constitutes a complete policy vector of depth T. This allows us to decompose the policy vector into any depth-t vector, and the value of the completion is:

$$V(Q^{T-t}\mid\{q^t, p_0\}) = V(p_0, q^t) + H^{T-t}(Q^{T-t}\mid q^t, p_0) \qquad (12)$$

where H(q) is the heuristic function, and the value of Q^{T-t} depends on the previous execution and the underlying state distribution at time t. In consequence, we can describe the heuristic function as:

$$H^{T-t}(Q^{T-t}) = \sum_{s\in S} P(s\mid p_0, q^t)\, H^{T-t}(s) \qquad (13)$$

As in Algorithm 2, we randomly extract a sample q^t from the possible policy space Q, and each node in the search tree is a belief state b_i. For each encountered state x_i, belief state b_i is updated to include the new state x'_i. In each sample search, the agent selects the policy b' with the greatest value. The sampling path terminates when it reaches a sufficient depth given the bounds of T_q, and goes back to the root so as to improve the upper and lower bound estimates. The search moves towards π* only with the acceptance probability P(b'), otherwise it remains at b'. At this point, the node b' becomes the root of the new search tree, and the remainder of the tree is pruned, as all other beliefs are now impossible. The search over new sample trees does not stop until there appears a policy that meets the resource requirements. Obviously, under a statistical hypothesis, the search process converges to the expected distribution at a rate of $\frac{1}{\sqrt{H}}$, where H denotes the sample size.

Algorithm 2: Sample extract-based search for agent r_i.
Require: randomly extract sample $\{q_i^t\}$ from Q; v(b_0) = 0;
Ensure: $\exists\pi_i^*\ \forall\pi_i: v(\pi_i^*)\ge v(\pi_i)$;
1: randomly extract sample $\{q_i^t\}$ from Q;
2: for each $q_i^t$ do
3:   qualify T_q;
4:   repeat
5:     for each state x_i from b_i do $T(x_i, a, x'_i)$;
6:       compute $b(x'_i)$, $x'_i$;
7:     if $b'\in q^t$ then
8:       continue to next b_i;
9:     else
10:      add b' to T_q;
11:      if U(b) ...

... the reward is (−20). As such, each agent has its own observation and network belief. In order to get a better reward, each agent removes all the joint beliefs that are not consistent with its entire observation. After policy sharing, there is only a single possible belief b(W) = 0.033, and the optimal joint action for this belief is < Switch, Switch >.
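In the same spirit as Algorithm 2 (whose full listing is truncated above), a simplified sample-based search can be sketched as follows; this is our own illustration, with step and value supplied by the caller (for example the belief update of Eq (10) and the value estimate of Eq (11)):

```python
import random

def sampled_policy_search(b0, actions, observations, step, value, depth, num_samples, seed=0):
    # Each sample is one random rollout through belief space; the best-valued
    # rollout determines the action to execute at the root belief b0.
    rng = random.Random(seed)
    best_action, best_value = None, float("-inf")
    for _ in range(num_samples):
        b, first_action, total = b0, None, 0.0
        for _ in range(depth):
            a = rng.choice(actions)
            o = rng.choice(observations)
            if first_action is None:
                first_action = a
            b = step(b, a, o)                # e.g. the belief update of Eq (10)
            total += value(b)                # e.g. the value estimate of Eq (11)
        if total > best_value:
            best_action, best_value = first_action, total
    return best_action, best_value
```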

Table 1. State-action transfer probability.

State \ Action    S,S     L,S     S,L     L,L
W                 0.09    0.21    0.21    0.49
R                 0.49    0.21    0.21    0.09

doi:10.1371/journal.pone.0145526.t001

Fig 2. Beliefs Update Processing. A 3-step policy tree captured from Table 1, each branch of which is conditioned on the outcome of previous actions. Each node is labeled with the action that should be taken if it is reached. doi:10.1371/journal.pone.0145526.g002


It should be noted that there exists hidden competition between agents for the finite resources (each agent wants to get more resources); that is, there should exist optimal joint policies that reach the Pareto optimum. But this is infeasible for the Dec-POMDP model because of the partial observation briefly described in the previous sections. In our design, the system can finally reach an approximate Pareto optimum after a finite search. This means that, in the finite belief space, there exists a pair of policies π* = (π_1*, π_2*) such that $\forall\pi'_1: V_1(\pi_1^*, \pi_2^*)\ge V_1(\pi'_1, \pi_2^*)$ and $\forall\pi'_2: V_2(\pi_1^*, \pi_2^*)\ge V_2(\pi_1^*, \pi'_2)$. That is, for each agent, playing π_i* gives an equal or higher expected resource than playing π'_i, so both policies are best responses to each other.

Implicit Competition Modeling and Equilibrium Analysis

As aforementioned, there exists hidden competition between agents for the finite channel resources, and techniques for eliminating dominated strategies in solving a POMDP are very closely related to techniques for eliminating dominated strategies in solving games in normal form [24]. From the game perspective, agents can get their locally optimal policies according to Best Response (BR) dynamic iteration. In a general game, each agent negotiates and chooses channels to maximize its payoff based on the channel situation observed in the last time slot, while the other (interfering) agents cannot change their channels simultaneously. However, BR does not guarantee convergence in all cases, and the stable state does not always yield the optimal overall reward. Hence, we study the characteristics of the multi-channel access game and its sub-optimality in the following.

Implicit Competition Game Model

According to the above, the access problem can be defined as a cooperative game $G = \langle R, S, D_i, R\rangle$, where the definitions of R and S are unchanged, $D_i = \times_{1\le i\le k}\{\pi_i\}$ is the finite set of policies available to agent r_i, and R denotes the reward. We use θ_i(π_i) to denote the probability distribution assigned over the policies available to agent r_i. Since agents select their policies simultaneously, agent r_i's belief about the other agents' likely policies can be denoted by θ_{-i}. If we define $V_{\pi_i}(s, \theta_{-i}) = \sum_{d_{-i}}\theta_{-i}(d_{-i})V_i(\pi_i, d_{-i})$, then $B_i(\theta_{-i}) = \{\pi_i\in D_i\mid V_i(\pi_i, \theta_{-i})\ge V_i(\pi'_i, \theta_{-i})\}$ denotes the best response function of agent r_i, which is the set of policies for agent r_i that maximize its value under some belief about the policies d_{-i} of the other agents. Any policy that is not a best response to some belief can be abandoned.

Algorithm 3: General framework of competition equilibrium.
Require: $\exists\ \theta_{-i} = \{b_1, ..., b_{i-1}, b_{i+1}, ...\}$;
Ensure: $\forall s\in S$ and $v(\pi_i^*)\ge v(\pi_i)$;
1: for each episode do
2:   initialize: get state S and D;
3:   repeat
4:     compute $V_{\pi_i, d_{-i}}(b_i) \leftarrow O(s_i, a, s'_i)$;
5:     if $V_{\pi_i, d_{-i}}(b_i) < R_{min}$ then
6:       prune π_i and get a new π'_i;
7:     else
8:       return π_i to D';
9:     end if
10:  until $U_i(b_i) = \max_{\pi_i\in D'}\sum b_i(s, d_{-i})V_{\pi_i}(s, \theta_{-i})$;
11: end for
12: return $\pi_i \rightarrow \pi^*$;
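A compact sketch of the best-response iteration underlying Algorithm 3 follows (ours; value(i, profile) is an assumed callback returning agent i's value for a joint policy profile):

```python
def best_response_iteration(agents, policies, value, max_rounds=50):
    # Start from an arbitrary profile and let each agent switch to its best response
    # until no agent wants to deviate (an approximate equilibrium); BR need not converge.
    profile = {i: policies[i][0] for i in agents}
    for _ in range(max_rounds):
        changed = False
        for i in agents:
            best = max(policies[i], key=lambda p: value(i, {**profile, i: p}))
            if best != profile[i]:
                profile[i], changed = best, True
        if not changed:
            return profile
    return profile
```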


As in Algorithm 3, in a cooperative game, the reward functions of the game correspond to the reward functions of the POMDP, and an agent's belief is a synthesized distribution over the possible policies of the other agents. For each agent r_i, a belief is defined as a distribution over S × D_{-i}, where the distribution is still denoted by b_i, and the utility of b_i is:

$$U_i(b_i) = \max_{\pi_i\in D'}\sum b_i(s, d_{-i})\, V_{\pi_i}(s, \theta_{-i}) \qquad (14)$$

Given the sets of policies and the reward function for a horizon-t game, the sets of policies and value functions for the horizon-t game are constructed by exhaustive backup. When a horizon-t POMDP is represented in normal form with implicit competition, the policy sets include all depth-t policy trees. Each policy profile is associated with a belief vector B, representing the expected t-step cumulative reward achieved for each potential start state by following an apposite joint policy, while the size of the policy set for each agent r_i is more than $|A_i|^{|\Omega|^t}$, which is doubly exponential in the horizon t. Because of the large sizes of the candidate policy sets, it is usually not feasible to work with them directly. The search algorithm (Algorithm 2) we present in this paper only partially alleviates this problem by performing iterative elimination of dominated policies at each stage in the construction of the normal form representation, rather than waiting until the construction completes. Considering an N-player implicit competition game, we can formulate the game subject to:

$$\begin{cases}\displaystyle\sum_{n=1}^{K}\omega\ln\Big(1+\frac{\sigma_3 V(b_{i,n})G_{i,i,n}}{d_i^2+\sum_{j=1}^{K}p_{j,n}g_{j,i,n}}\Big)-R_i^{min}\ge 0\\[2ex] \displaystyle\sum_{n=1}^{K}\omega\ln\Big(1+\frac{\sigma_3 V(b_{i,n})G_{i,i,n}}{d_i^2+\sum_{j=1, j\neq i}^{K}p_{j,n}g_{j,i,n}}\Big)-R_i^{exp}\ge 0\end{cases} \qquad (15)$$

In these constraint equations, r_i's desired reward is no less than $R_i^{min}$, and this guarantees a minimum level of resources achieved by each agent. $v(b_{i,n})$ is the value of the belief distribution of agent r_i's access in channel c_n, $d_i^2$ is the variance of the white Gaussian noise, and $\sigma_3 = \frac{\sigma_2}{\ln\sigma_1}$. $R_i^{exp}$ is the expected resource reward, $G_{i,j,n}$ is the channel gain between two agents in channel c_n, and all policies should meet $\sum_{i=1}^{K} v(b_{i,n})\le R_p(b, K)$. The existence and stability of the competition will be investigated in the following subsection.
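Whether a candidate access policy satisfies the two constraints of Eq (15) can be checked numerically; the sketch below is our own illustration, with per-channel parameters passed as plain tuples:

```python
import math

def channel_reward(v_b, gain, noise_var, interference, sigma3, omega=1.0):
    # One summand of Eq (15): omega * ln(1 + sigma3 * v(b) * G / (d^2 + interference)).
    return omega * math.log(1 + sigma3 * v_b * gain / (noise_var + interference))

def satisfies_eq15(per_channel, R_min, R_exp):
    # per_channel: list of (v_b, gain, noise_var, interf_all, interf_others, sigma3).
    total_min = sum(channel_reward(v, g, n, i_all, s3) for v, g, n, i_all, _, s3 in per_channel)
    total_exp = sum(channel_reward(v, g, n, i_oth, s3) for v, g, n, _, i_oth, s3 in per_channel)
    return total_min >= R_min and total_exp >= R_exp
```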

Evolutionary Equilibrium Analysis with Replicator Dynamics

In a multi-agent multi-channel access game, the stable state can be defined as follows: a joint policy P* is stable if and only if, for each agent and an arbitrary policy π in its policy space, $v_i(\pi^*)\ge v_i(\pi, \theta_{-i})$ is always satisfied. Consequently, the process of this game can be modeled with replicator dynamics, which can be derived for each agent separately. We consider a concise example with two newly accessing agents r_1 and r_2. These agents appear first in the network with some spare channel opportunities (i.e., c_1 to c_n). With this specification, we analyze the evolutionary equilibrium for both deterministic and stochastic models. For the hidden competition among agents, the evolutionary equilibrium can be obtained as the Replicator Dynamics solution [25], where χ_i denotes the proportion of the desired channel resources


that agents can get, and the replicator dynamics can be defined as follows:

$$\frac{\partial\chi_i^{b_i}(t)}{\partial t} = s\chi_i^{b_i}(t)\big[v_i^{b_i}-\bar{v}^{b_i}\big] = s\chi_i^{b_i}(t)\big[U(\pi, P_{-i})-\bar{U}(P_{-i})\big] \qquad (16)$$

where $\bar{v}(c_i)$ is the estimated average reward for the other agents in channel c_i, and the function U is defined in Eq (7). For the two-agent case, the evolutionary equilibrium is obtained as the solution of the following equation:

$$\mu_1\log\Big(\mu_2\frac{\chi_1^{b_1}U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}\Big)-v_{p_1}(b_1) = \mu_1\log\Big(\mu_2\frac{\chi_2^{b_2}U(b_2)}{(1-\chi_1^{b_1}U(b_1))+\chi_2^{b_2}U(b_2)}\Big)-v_{p_2}(b_2) \qquad (17)$$

where the terms on the two sides of the equation are the rewards that the newly accessing agents can get from their beliefs b_1 and b_2, respectively. Accordingly, the stability of the evolutionary equilibrium can be analyzed using the following Jacobian matrix:

$$\begin{bmatrix} J_{1,1} & J_{1,2}\\ J_{2,1} & J_{2,2}\end{bmatrix} = \begin{bmatrix}\dfrac{\partial s\chi_1^{b_1}[U(\pi, P_{-1})-\bar{U}(P_{-1})]}{\partial\chi_i^{b_1}} & \dfrac{\partial s\chi_1^{b_1}[U(\pi, P_{-1})-\bar{U}(P_{-1})]}{\partial\chi_i^{b_2}}\\[2ex] \dfrac{\partial s\chi_1^{b_2}[U(\pi, P_{-2})-\bar{U}(P_{-2})]}{\partial\chi_i^{b_1}} & \dfrac{\partial s\chi_1^{b_2}[U(\pi, P_{-2})-\bar{U}(P_{-2})]}{\partial\chi_i^{b_2}}\end{bmatrix} \qquad (18)$$

where

$$J_{1,1} = s\Big\{Z_2-v_{p_1}(b_1)-\chi_1^{b_1}(Z_2-v_{p_1}(b_1))-(1-\chi_1^{b_1})\Big[\mu_1 p_{1,2}^p\log\frac{\mu_2 ch_1 k_1}{Z_1}-v_{p_2}(b_2)\Big]\Big\} - s\chi_1^{b_1}\Big\{\frac{\mu_1 U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}+Z_2-v_{p_1}(b_1) - \frac{\mu_1\chi_1^{b_1}U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))} - \frac{(1-\chi_1^{b_1})\mu_1 U(b_1)}{Z_1} - \mu_1 p_{1,2}^p\log\frac{\mu_2 ch_2 k_2}{Z_1}+v_{p_2}(b_2)\Big\} \qquad (19)$$

$$J_{1,2} = s\chi_1^{b_1}\Big\{\frac{\mu_1 U(b_1)}{Z_1}-\frac{(1-\chi_1^{b_1})\mu_1 U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}+\frac{\mu_1\chi_1^{b_1}U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}\Big\} \qquad (20)$$

$$J_{2,1} = s\chi_1^{b_1}\Big\{\frac{\mu_1 U(b_1)}{Z_1}-\frac{(1-\chi_2^{b_2})\mu_1 U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}+\frac{\mu_1\chi_1^{b_1}U(b_1)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}\Big\} \qquad (21)$$

$$J_{2,2} = s\Big\{Z_2-v_{p_1}(b_1)-\chi_1^{b_1}(Z_2-v_{p_1}(b_1))-(1-\chi_1^{b_1})\Big[\mu_1 p_{1,2}^p\log\frac{\mu_2 ch_1 k_1}{Z_1}-v_{p_2}(b_2)\Big]\Big\} - s\chi_2^{b_2}\Big\{\frac{\mu_1 U(b_2)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))}+Z_2-v_{p_1}(b_1) - \frac{\mu_1\chi_2^{b_2}U(b_2)}{\chi_1^{b_1}U(b_1)+(1-\chi_2^{b_2}U(b_2))} - \frac{(1-\chi_2^{b_2})\mu_1 U(b_2)}{Z_1} - \mu_1 p_{1,2}^p\log\frac{\mu_2 ch_2 k_2}{Z_1}+v_{p_2}(b_2)\Big\} \qquad (22)$$

where $Z_1 = (1-\chi_1^{b_1})U(b_1)+(1-\chi_2^{b_2})U(b_2)$ and $Z_2 = \mu_1\log\frac{\mu_2 ch_1 k_1}{\chi_1^{b_1}U(b_1)+\chi_2^{b_2}U(b_2)}$. The two eigenvalues of J can be obtained from $\Delta(J) = \frac{J_{1,1}+J_{2,2}\pm\sqrt{4J_{1,2}J_{2,1}+(J_{1,1}-J_{2,2})^2}}{2}$, and the evolutionary equilibrium is stable if these two eigenvalues have negative real parts [23].
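The stability condition can be checked directly once the Jacobian entries have been evaluated at a candidate equilibrium; the numeric entries below are purely illustrative and are not derived from Eqs (19)-(22):

```python
import numpy as np

def equilibrium_is_stable(J):
    # The fixed point of the replicator dynamics is (locally) stable when both
    # eigenvalues of the 2x2 Jacobian of Eq (18) have negative real parts.
    eigenvalues = np.linalg.eigvals(np.asarray(J, dtype=float))
    return bool(np.all(eigenvalues.real < 0)), eigenvalues

stable, eigenvalues = equilibrium_is_stable([[-0.8, 0.3], [0.2, -0.5]])
print(stable, eigenvalues)   # True for this illustrative matrix
```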

Approximate Fair Maximization Policy Analysis

Among the different cooperative game solutions, it is important to note that the issue of fairness in this context, e.g., for newly accessing agents, is different from the case of resource occupation among the agents already in the network. In this section we analyze the approximate fairness of the game, and discuss the feasibility of the neighbor-aware channel access scenario proposed in the previous section. In this scenario, the approximate Pareto-optimal result can satisfy all agents' minimum requirements. Typically, if a channel is occupied, the other agents should be denied access to the frequency band. In the proposed competitive game model, a virtual-feasible resource access assignment set exists; hence, we can use a bounded set $F = \{\mho_{min}^1, \mho_{min}^2, \ldots, \mho_{min}^{|C_2|}\}^T$, which denotes the minimum resources required in the game. The vector $r = \{R_{exp}^1, R_{exp}^2, \ldots, R_{exp}^{|C_2|}\}^T$ represents the set of rewards for the accessing agents. The reward vector $r\in\mathbb{R}_+^{K+1}$ in the K channels forms the fairness problem (F, r). It has been shown that there exists a unique equilibrium, which can be calculated by Eq (17):

$$f(F, r) = \arg\max\prod_{i=1}^{K}\big(R_{exp}^i - v\chi_i(\mho_i)\big) \qquad (23)$$

Hence, we can use Eq (23) to confirm the selected solution for the stable point in the previous section. This is also where "egalitarian" solutions of the game come in, and one such method applies the equal gains principle, a Pareto optimum. For the 2-agent case in the previous section, the proportion χ_i in f which is weakly efficient and satisfies the equal gain condition $\chi_1(\mho_1)-r_1 = \chi_2(\mho_2)-r_2$ is called the "egalitarian" solution. As mentioned earlier, in our resource access method, the stable point acts as a marketplace where the primary and secondary systems can bargain. The fair solution for two agents over one channel access is at


the intersection of the egalitarian solutions:

$$\begin{cases} U_1 - \chi_1(\mho_1) = U_2 - R_{exp}^2\\ U_1 + U_2 = \arg\max_{\chi_i}(U'_1 + U'_2)\end{cases} \qquad (24)$$

Condition Eq (24) dictates that the operating point should be on the boundary of the minimum region. Therefore, the intersection gives the unique approximate fairness solution. For an N-agent game, the fairness problem Eq (23) should be solved by calculating the χ_i, i = 1, 2, ..., N. To verify the case, we note that the stable point in the proposed design is also on the boundary perpendicular to (24) at its intersection. The corresponding optimization satisfies Eqs (15) and (17), defined by $\max_{\chi_i}[U_1-\chi_1(\mho_1)]\cdot[R_{exp}^2-U_2]$, which is subject to $(r_i, B)$, i = 1, 2, ..., N. It is straightforward to confirm the solution of Eq (18), as it satisfies the description at the beginning of this section.
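A brute-force way to locate the two-agent egalitarian operating point is a grid search over resource splits; this sketch is our own, with U an assumed callback returning both agents' utilities for a split and the equal-gain tolerance chosen arbitrarily:

```python
def egalitarian_split(U, r_exp, resource=1.0, steps=1000, tol=1e-2):
    # Search for the split (chi1, chi2) that maximizes U1 + U2 while (approximately)
    # equalizing the agents' gains over their expected rewards, cf. Eqs (23)-(24).
    best, best_sum = None, float("-inf")
    for k in range(steps + 1):
        chi1 = resource * k / steps
        chi2 = resource - chi1
        u1, u2 = U(chi1, chi2)
        if abs((u1 - r_exp[0]) - (u2 - r_exp[1])) < tol and u1 + u2 > best_sum:
            best, best_sum = (chi1, chi2), u1 + u2
    return best, best_sum
```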

Experiments and Results

In this section, we designed several experiments to evaluate the methods proposed in the sections above. We employed the multi-agent platform in [4] to simulate the multi-agent ad-hoc network. The data unit length was fixed at 1,024 bytes. We evaluate the performance of the proposed scheme with wide band availability by simulation, and compare it with a price-based centralized channel allocation method (OPTIMAL) [26] and a RANDOM method, which lets an agent randomly access channels according to its current belief on each channel, to validate the efficiency. The simulation parameters are shown in Table 2. In order to facilitate the numerical statistics, we use one channel for global listening (especially for the OPTIMAL method of centralized resource allocation); the remaining 10 channels are available for agents to access. We conducted the simulations under various agent scales, and the simulation results are the average of 100 runs.

Resource Lost Rate

As in Fig 3, the results show that different channel access strategies have a direct impact on the available channel resources. The axes represent the number of agents and the resource loss rate. When agents adopt the RANDOM method, with the expansion of scale (20–200; with 0 agents there is no conflict), congestion and the resource loss rate continue to rise, approaching 90%. Meanwhile, in our algorithm, agents communicate with their neighbors to exchange decision policies and acquire a better joint behavior through continuous negotiation and iteration (the average maximum loss is 65.38%, the average sample variance is 8.12%); the smaller variance shows that our algorithm is more stable.

Table 2. Simulation parameters.

Simulation Parameters               Value
Number of agents                    200–1000
Maximum number of channels          11
Number of perception channels       5
Number of access channels           3
Maximum resources for a channel     100
Data frame size                     1

doi:10.1371/journal.pone.0145526.t002


Fig 3. Resources Loss Rate. With the increasing number of agents, the randomness of the RANDOM method increases interference between agents, which brings down network resource utilization. The OPTIMAL method can maintain efficient use of the resources, but its time consumption is much larger than that of the self-decision methods. The POMDP method maintains a relative balance between the above methods. doi:10.1371/journal.pone.0145526.g003

The OPTIMAL method can provide a better result, but the resource consumption of global consultation cannot be avoided (the average maximum loss is 21.92%, the average sample variance is 5.67%). Furthermore, due to agents' misperception and access errors, resource loss (conflicts) is inevitable, and it will increase sharply with the expansion of the agent population.

Resource Available Rate

Under the premise of partial observation, we set both RANDOM and our algorithm to start from an initial belief probability of 0.5, as shown in Fig 4. The difference is that our algorithm reaches an average available resource rate of 52.67% with an average sample variance of 3.32%. The resources obtained by RANDOM descend rapidly: with 200 agents, the available resources remain at only 33.48%, with a 12.16% average sample variance. When the number of agents and the network resources are relatively balanced, the available resource rate can approach


Fig 4. Resources Available Rate. Under the same experimental setup, with the increasing number of agents, the RANDOM method reduces the resources available to each agent. Because of neighbor awareness in the POMDP, the agents can be maintained in a state of relative balance (less variance than RANDOM). doi:10.1371/journal.pone.0145526.g004

the expected value. Then, as the number of agents increases, the available rate decreases. OPTIMAL keeps an average obtained resource rate of 68.23% with a very stable average sample variance of 2.37%. Obviously, agents can obtain more resources with our design than RANDOM provides. Especially, as the number of agents and the elapsed time increase (agents can exchange information with neighbors and accumulate known knowledge), the resource availability rate remains relatively stable until the network reaches saturation.

Available Resource in Different Interaction Frequencies

In this simulation, we test the average available channel resources for newly accessing agents under different interaction frequencies of the other agents in the network. We set 5 channels and 100 newly accessing agents per slot, uniformly distributed across these 5 channels; the maximum number of agents is 1000. The interaction frequency of the other agents was set to r = [0.2,


0.4, 0.6, 0.8]. The X-axis represents the number of agents. As in Fig 5, we can find two significant changes for the newly accessing agents: with an increasing number of network agents or a higher interaction frequency, the available resources decline. In addition, when the number of agents exceeds the maximum the network can support, the available resources of the entire network are sharply reduced. According to the experimental results, we can hypothesize that when the number of agents and the resources are relatively balanced, there should be a suitable interaction frequency that lets each agent obtain enough available resources to maximize its utility.

Resource Available in Different Assignments

In this simulation, we discuss the relationship between different team sizes when agents access channels under different assignments. We divide 100 agents into different team sizes and allow them to access 5 channels. Simulation results are shown in Fig 6. Due to newly accessing agents, the

Fig 5. Available Resources in Different Interaction Frequencies. In different interaction frequencies, the available resources shrink with the increasing number of agents. Similar to the allocation of limited resources in human society, the average gain decreases with the increasing number of people. Experimental results are consistent with the general understanding. doi:10.1371/journal.pone.0145526.g005


Fig 6. Resource Available in Different Assignments. In the 5 specified channels, the more uniformly the agents are distributed, the higher the probability of resources being available to them; otherwise the performance drops (more crowding and unresolved competition lower the average gain). doi:10.1371/journal.pone.0145526.g006

channel blocking rate increases and influences the original agents because of the partial observation. Obviously, we can see that the more uniform the distribution across the channels, the greater the available resources, as with assignments [10, 20, 20, 20, 30] and [20, 20, 20, 20, 20]. In the extreme access situation [100, 0, 0, 0, 0], all agents are in one channel; when communication demand escalates, the agents have almost no chance to obtain available resources. From the above analysis, we can conclude that agents gain more available resources when they are distributed more uniformly.

Resource Available Comparison

Fig 7 displays the contrast in channel resource awareness between our algorithm and RANDOM. In this simulation, we set 100 agents with free interaction frequency. It can be seen that, with 500 tests for the same channel (distribution within the circles), agents can obtain the actual state of the channel. The red trail denotes the search result of


Fig 7. Resource Perception of 500 Tests. To illustrate the variances in the 4 aforementioned simulation results, this figure gives the resource perception comparison in 500 tests between POMDP and RANDOM. doi:10.1371/journal.pone.0145526.g007

POMDP, and the black trail denotes the RANDOM method. It is obvious that the randomness and divergence of RANDOM far outweigh those of POMDP.

Conclusion

We assumed in this paper that the channel state transition probabilities can be entirely perceived, but in practice this may not be the case. The problem then becomes a decision model with unknown transition probabilities, but such a model is beyond the scope of this paper. In our design, we reduce the Dec-POMDP model to a simplified one by separating the problem into single-agent decision coordination, which may result in a low-complexity but potentially suboptimal design. In practical applications with complex system dynamics, relying on pure policy-space search to solve every problem becomes impractical; the approach needs to be adjusted to the actual situation and dynamics, and more factors need to be considered. In our future work, we will pursue the optimal joint design of the tradeoff between complexity and optimality, and will apply reinforcement learning theory on a real multi-robot platform.

Supporting Information

S1 Table. Experiment Data for Resource Lost Rate Comparison. (XLS)
S2 Table. Experiment Data for Resource Available Rate Comparison. (XLS)


S3 Table. Experiment Data for Available Resource in Different Interaction Frequencies Comparison. (XLS)
S4 Table. Experiment Data for Resource Available in Different Assignments Comparison. (XLS)
S5 Table. Experiment Data for Available Resource Perception. (XLS)
S1 File. Supplementary Methods and Datasets Introduction. (DOC)

Acknowledgments This research was sponsored by the NSFC 61370151 and 61202211, the National Science and Technology Major Project of China 2015ZX03003012, the Central University Basic Research Funds Foundation of China ZYGX2014J055, and the Science and Technology on Electronic Information Control Laboratory Project.

Author Contributions

Conceived and designed the experiments: YX ML AWM. Performed the experiments: ML YX AWM. Analyzed the data: ML YX AWM. Contributed reagents/materials/analysis tools: YX ML AWM. Wrote the paper: ML AWM YX. Responsible for the theoretical study and mathematical derivation: ML. PI (Principal Investigator) of all the funding projects and, as the first and third authors' PhD adviser, responsible for problem modeling and algorithm layout: YX. Presided over the experimental test-bed design and language revision: AWM.

References

1. Wang T, Dang Q, Pan P. A Multi-Robot System Based on A Hybrid Communication Approach. Studies in Media and Communication. 2013; 1(1):91–100. doi: 10.11114/smc.v1i1.124
2. Iqbal J, Yousaf MM, Awais MM, editors. A scalable approach of message interpretation by demonstrations for multi-robot communication. Multitopic Conference, 2009 INMIC 2009 IEEE 13th International; 2009: IEEE. doi: 10.1109/INMIC.2009.5383082
3. Conti M, Giordano S. Mobile ad hoc networking: milestones, challenges, and new research directions. Communications Magazine, IEEE. 2014; 52(1):85–96. doi: 10.1109/MCOM.2014.6710069
4. Zhang Y, Xu Y, Hu H. Cooperative Decision Algorithm for Time Critical Assignment without Explicit Communication. Intelligent Information Processing VII: Springer; 2014. p. 197–206.
5. Liu M, Xu Y, Wu S, Lan T. Design and Optimization of Hierarchical Routing Protocol for 6LoWPAN. International Journal of Distributed Sensor Networks. 2015; 2015. doi: 10.1155/2015/802387
6. Ab Wahab MN, Nefti-Meziani S, Atyabi A. A Comprehensive Review of Swarm Optimization Algorithms. PLoS ONE. 2015; 10(5): e0122827. doi: 10.1371/journal.pone.0122827 PMID: 25992655
7. Zhao Q, Sadler BM. A survey of dynamic spectrum access. Signal Processing Magazine, IEEE. 2007; 24(3):79–89. doi: 10.1109/MSP.2007.361604
8. Kulkarni RV, Venayagamoorthy GK. Particle swarm optimization in wireless-sensor networks: A brief survey. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on. 2011; 41(2):262–7. doi: 10.1109/TSMCC.2010.2054080
9. Yan Z, Jouandeau N, Cherif AA. A survey and analysis of multi-robot coordination. International Journal of Advanced Robotic Systems. 2013; 10:399. doi: 10.5772/57313
10. Capitan J, Spaan MT, Merino L, Ollero A. Decentralized multi-robot cooperation with auctioned POMDPs. The International Journal of Robotics Research. 2013; 32(6):650–71. doi: 10.1177/0278364913483345
11. Tan L, Feng Z, Li W, Jing Z, Gulliver TA. Graph coloring based spectrum allocation for femtocell downlink interference mitigation. Wireless Communications and Networking Conference (WCNC), 2011 IEEE; 2011: IEEE. doi: 10.1109/WCNC.2011.5779338
12. Xu Y, Wang J, Wu Q, Anpalagan A, Yao Y-D. Opportunistic spectrum access in cognitive radio networks: Global optimization using local interaction games. Selected Topics in Signal Processing, IEEE Journal of. 2012; 6(2):180–94. doi: 10.1109/JSTSP.2011.2176916
13. Tang S, Mark BL, editors. Performance analysis of a wireless network with opportunistic spectrum sharing. Global Telecommunications Conference, 2007 GLOBECOM'07 IEEE; 2007: IEEE. doi: 10.1109/glocom.2007.880
14. Liu H, Krishnamachari B, Zhao Q, editors. Cooperation and learning in multiuser opportunistic spectrum access. Communications Workshops, 2008 ICC Workshops'08 IEEE International Conference on; 2008: IEEE. doi: 10.1109/ICCW.2008.98
15. Busoniu L, Babuska R, De Schutter B. A comprehensive survey of multiagent reinforcement learning. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on. 2008; 38(2):156–72. doi: 10.1109/TSMCC.2007.913919
16. Fernandez-Gauna B, Graña M, Lopez-Guede JM, Etxeberria-Agiriano I, Ansoategui I. Reinforcement Learning endowed with safe veto policies to learn the control of Linked-Multicomponent Robotic Systems. Information Sciences. 2015; 317:25–47.
17. Fernandez-Gauna B, Etxeberria-Agiriano I, Graña M. Learning Multirobot Hose Transportation and Deployment by Distributed Round-Robin Q-Learning. PLoS ONE. 2015; 10(7): e0127129. doi: 10.1371/journal.pone.0127129 PMID: 26158587
18. Halldórsson MM, Halpern JY, Li LE, Mirrokni VS, editors. On spectrum sharing games. Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing; 2004: ACM. doi: 10.1145/1011767.1011783
19. Yichen W, Pinyi R, Zhou S. A POMDP based distributed adaptive opportunistic spectrum access strategy for cognitive ad hoc networks. IEICE Transactions on Communications. 2011; 94(6):1621–4.
20. Seuken S, Zilberstein S. Improved memory-bounded dynamic programming for decentralized POMDPs. arXiv preprint arXiv:1206.5295. 2012.
21. Feng M, Qu H, Yi Z. Highest Degree Likelihood Search Algorithm Using a State Transition Matrix for Complex Networks. Circuits and Systems I: Regular Papers, IEEE Transactions on. 2014; 61(10):2941–50. doi: 10.1109/TCSI.2014.2333677
22. Kish LB, Harmer GP, Abbott D. Information transfer rate of neurons: stochastic resonance of Shannon's information channel capacity. Fluctuation and Noise Letters. 2001; 1(01):L13–L9. doi: 10.1142/S0219477501000093
23. Niyato D, Hossain E, Han Z. Dynamics of multiple-seller and multiple-buyer spectrum trading in cognitive radio networks: A game-theoretic modeling approach. Mobile Computing, IEEE Transactions on. 2009; 8(8):1009–22. doi: 10.1109/TMC.2008.157
24. Hansen EA, Bernstein DS, Zilberstein S. Dynamic programming for partially observable stochastic games. AAAI. 2004; 4:709–715.
25. Roca CP, Cuesta JA, Sánchez A. Evolutionary game theory: Temporal and spatial effects beyond replicator dynamics. Physics of Life Reviews. 2009; 6(4):208–49.
26. Xue Y, Li B, Nahrstedt K. Optimal resource allocation in wireless ad hoc networks: A price-based approach. Mobile Computing, IEEE Transactions on. 2006; 5(4):347–64. doi: 10.1109/TMC.2006.1599404

