Graph-Based Skill Acquisition For Reinforcement Learning

MATHEUS R. F. MENDONÇA,
ARTUR ZIVIANI, and
ANDRÉ M. S. BARRETO,
National Laboratory for Scientific Computing (LNCC), Brazil

ACM Comput. Surv., Vol. 52, No. 1, Article 6, Publication date: January 2019.
DOI: https://doi.org/10.1145/3291045

In machine learning, Reinforcement Learning (RL) is an important tool for creating intelligent agents that learn solely through experience. One particular subarea within the RL domain that has received great attention is how to define macro-actions, which are temporal abstractions composed of a sequence of primitive actions. This subarea, loosely called skill acquisition, has been under development for several years and has led to better results in a diversity of RL problems. Among the many skill acquisition approaches, graph-based methods have received considerable attention. This survey presents an overview of graph-based skill acquisition methods for RL. We cover a diversity of these approaches and discuss how they evolved throughout the years. Finally, we also discuss the current challenges and open issues in the area of graph-based skill acquisition for RL.

CCS Concepts: • Computing methodologies → Reinforcement learning; Cluster analysis; • Mathematics of computing → Spectra of graphs; Graph algorithms;

Additional Key Words and Phrases: Skill acquisition, reinforcement learning, graph analytics, centrality, clustering

ACM Reference format:
Matheus R. F. Mendonça, Artur Ziviani, and André M. S. Barreto. 2019. Graph-Based Skill Acquisition For Reinforcement Learning. ACM Comput. Surv. 52, 1, Article 6 (January 2019), 26 pages. https://doi.org/10.1145/3291045

1 INTRODUCTION

Reinforcement Learning (RL) is a machine-learning approach in which an agent learns to accomplish a given task by simply interacting with the environment through a set of actions and receiving reward signals. This process allows the RL agent to learn how to interact with the environment such that it maximizes the expected discounted sum of rewards received. Different challenges arise in this context, such as delayed rewards and problems with a very large or an infinite number of possible states to analyze, just to name a few. The former challenge mentioned is related to scenarios where the agent receives reward signals several steps after a key action is performed, making it difficult for RL algorithms to associate such rewards with the corresponding action. The latter challenge refers to problems with too many or even an infinite number of states (e.g., continuous state spaces), making it infeasible to use exact representations.

It is believed that temporal abstraction may help tackle some of these issues [59, 60, 67]. Temporal abstractions encapsulate a sequence of actions into a single macro-action, allowing the RL agent to act at different time scales. A macro-action can also be interpreted as a high-level action that achieves a subgoal within an RL problem. In this survey, we focus on temporal abstraction approaches combined with RL and Deep Reinforcement Learning (DRL). More specifically, we focus on the automatic definition of macro-actions, also known as automatic temporal abstractions or automatic skill acquisition.

The particular focus on automatic skill acquisition is justified given its increasing importance in RL, especially considering some recent advances suggesting that it can indeed be beneficial in practice [1, 37, 41, 71]. There are different formalisms for defining macro-actions [18, 25, 57, 67] and several different methods that automatically define skills [1, 12, 37], although how to automatically define such skills is still considered an open problem.

For example, Konidaris and Barto [32] present a method for automatically defining skills in continuous RL domains based on the Options framework [67] by creating a chain of skills. The first skill created is used to reach the goal of the RL problem. It is constructed using a pseudo-reward function that rewards the agent for reaching the goal, and it can only be triggered from states within a region from which its execution reaches the goal. The next skill created takes as its goal the region of states from which the previously created skill can be triggered. This process, illustrated in Figure 1, is repeated until a skill is assigned to the state region that contains the starting state.

Fig. 1.
Fig. 1. Skill chaining example based on a figure from Konidaris and Barto [32]. The goal is represented by the large circle pointed to by the arrows. In the left figure, the agent has not learned any skills yet. In the middle figure, the agent has learned two different skills that are able to reach the goal. Finally, the right figure shows several skills chained together that allow the agent to reach the goal.

Another recent skill acquisition method, presented by Vezhnevets et al. [70], uses an action-plan matrix to represent macro-actions, whose rows correspond to the possible actions in the action set $A$ and whose columns span the macro-action duration $T$: column $i$ of the action-plan matrix $\mathbf {P}_t$ gives the probability of taking each action $a \in A$ at time step $t+i$. The action-plan matrix is updated through Attentive Planning [24], a technique originally created for image generation.

Some of the pioneering work related to automatic skill acquisition has adopted graph-based metrics and clustering algorithms over the state transition graph (responsible for representing the environment dynamics) of an RL problem in order to create high-level actions. These graph-based approaches achieved good performance in the past [12, 13, 42] and continue to outperform current state-of-the-art methods [37]. This survey covers graph-based skill acquisition methods for reinforcement learning. We present how these graph-based methods evolved over time, discussing different approaches for continuous or discrete state spaces.

Graph-based skill acquisition methods can be classified into three classes depending on the approach they adopt to identify important states or macro-actions in the transition graph (Figure 2): (1) centrality measures, (2) clustering algorithms, and (3) spectral analysis. The first approach uses different centrality measures over the state transition graph in order to identify important states (or state regions) and then create macro-actions that allow the agent to move to these states. Second, clustering approaches use several different clustering algorithms over the state transition graph in order to identify states that share a specific similarity. Macro-actions are then created in such a way that they allow the agent to transition between state clusters. The third approach uses spectral analysis in order to identify macro-actions with their corresponding intrapolicy. There are also some mixed approaches that use spectral clustering for finding macro-actions, showing that these three categories can actually intersect. These mixed approaches are discussed in more detail in Section 6.

Fig. 2.
Fig. 2. Graph-based skill acquisition methods covered in this survey.

To the best of our knowledge, there is no recent survey that thoroughly explores skill acquisition methods, in particular graph-based ones. The only previous survey we know of, by Chiu and Soo [10], is now outdated. In recent years, graph-based methods applied to different domains are also receiving ever-increasing attention, aligned with the rise and consolidation of the interdisciplinary field of network science [3, 7]. Indeed, the relation between RL and graph-based techniques derived from network science emerges as a timely and relevant topic in the current machine-learning landscape. In this context, this survey aims to be a starting point for the interested reader who desires to get familiarized with recent graph-based skill acquisition methods for reinforcement learning.

This survey is organized as follows. Section 2 introduces the basic concepts of Reinforcement Learning, followed by a description of the skill acquisition problem. This section also describes how the state transition graph is created in such skill acquisition methods, presenting two of the most common approaches for creating these graphs. Some basic concepts of graph analysis methods are then introduced in Section 3, where we describe some key centrality measures and clustering algorithms that are often used in graph-based skill acquisition methods throughout the literature. Section 4 presents the most common benchmarks used by graph-based skill acquisition methods. We then go to the core of this survey: (1) Section 5 describes how to create macro-actions using centrality measures over the state transition graph, (2) Section 6 presents approaches based on clustering algorithms, and (3) Section 7 details approaches based on spectral analysis and other graph-based approaches that substantially differ from the classes presented in Figure 2. Section 8 presents some of the challenges and open issues in the area of graph-based skill acquisition. Finally, in Section 9, we summarize the article along with some final considerations.

2 REINFORCEMENT LEARNING (RL)

As previously stated, RL is a framework where an agent learns how to interact with a given environment through experience. The learning process takes place over discrete steps. At each time step $t$, the agent (1) gets an observation $o_t$ from the environment and determines its state $s_t$, (2) interacts with the environment by performing an action $a_t$, and (3) receives a reward signal $R_t$. This process is repeated until a terminal state is reached, at which point a new episode begins. The discounted sum of rewards received by the agent during an episode, also called the return, is given by

\begin{equation} G_t = \sum _{i=0}^{\infty } \gamma ^i R_{t+i+1}, \end{equation}
where $\gamma \in [0,1)$ is the discount factor, used to reduce the value of delayed rewards. The objective is to find a policy $\pi$ that maps states into actions such that it maximizes the expected return $\mathrm{E}_\pi [ G_t | s_t ]$.

The RL process is often modeled using a Markov Decision Process (MDP), defined as $(\mathcal {S}, \mathcal {A}, p, r, \gamma)$, where $\mathcal {S}$ is the set of states; $\mathcal {A}$ is the set of actions the agent can perform; $p$ is the state transition distribution function $p(s^{\prime }|s,a)$ for $a \in \mathcal {A}$, $s^{\prime }\in \mathcal {S}$, and $s\in \mathcal {S}$; and $r$ is the reward function given by $r(s,a,s^{\prime })$ that corresponds to the reward received by the agent when performing action $a\in \mathcal {A}$ while in state $s\in \mathcal {S}$ and transitioning to state $s^{\prime }\in \mathcal {S}$.

As previously mentioned, the objective is to learn a policy $\pi$ that selects an action to perform such that the expected return is maximized. The expected future rewards for performing action $a$ while in state $s$ following a policy $\pi$ is given by the action-value function, defined as

\begin{equation} Q^\pi (s,a) = \mathrm{E}^\pi [r(s,a,S^{\prime }) + \gamma Q^\pi (S^{\prime },\pi (S^{\prime }))], \end{equation}
where the expectation is over the next state $S^{\prime }$, which depends on $\pi$ and on the dynamics of the MDP. Instead of calculating the action-value function $Q^\pi (s,a)$, several RL methods use a value function $V^\pi (s)$ that indicates the expected sum of future rewards for following policy $\pi$ starting from state $s$.

After learning the action-value function $Q^{\pi }(s,a)$, we create a new policy $\pi ^{\prime }$ that selects actions greedily according to $Q^{\pi }(s,a)$, that is, $\pi ^{\prime }(s) \in \operatorname{argmax}_a Q^{\pi }(s,a)$. In the exact case, the policy is updated until it converges to an optimal policy $\pi ^*$, such that $Q^{\pi ^*}(s,a) \ge Q^{\pi }(s,a)$ for every policy $\pi$ and every state-action pair $(s,a)$.

Q-Learning [73] is a very popular method to approximate the action-value function, which updates $Q^\pi (s,a)$ according to

\begin{equation} Q^\pi (s,a) = Q^\pi (s,a) + \alpha \big[r(s,a,s^{\prime }) + \gamma \max _{a^{\prime }\in \mathcal {A}} Q^\pi (s^{\prime },a^{\prime }) - Q^\pi (s,a)\big], \end{equation}
where $\alpha \in (0,1]$ is the learning rate.
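For concreteness, the sketch below shows one way this update could be implemented in tabular form. The Gym-like environment interface (`reset()`, `step()`, `actions`) and the hyperparameter values are illustrative assumptions, not part of the survey.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(float)  # maps (state, action) pairs to value estimates

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the current estimates
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```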

The interested reader is encouraged to refer to the book of Sutton and Barto [66] for a more in-depth explanation of the basic RL framework and classical RL methods. In the following subsections, we introduce the required background for understanding the graph-based skill acquisition methods that are surveyed in this article.

2.1 Deep Reinforcement Learning

The action-value function $Q(s,a)$ must map every state-action pair $(s, a)$ to a real value. For small problems, $Q(s,a)$ can be represented by a table with $|\mathcal {S}|$ rows and $|\mathcal {A}|$ columns. However, for complex problems, the number of states $|\mathcal {S}|$ is usually too large to be efficiently manipulated and analyzed using a lookup table. In this case, the exact representation of Q has to be replaced by an approximation. In recent years, Neural Networks (NNs) have become a popular choice of approximator in RL, in particular, the so-called Deep NNs.

Deep NNs are models with several layers and with different architectures. The most notable Deep Learning (DL) models are Convolutional Neural Networks (CNNs) [23, 34, 68] and Recurrent Neural Networks (RNNs) [29, 49, 63]. CNNs are typically used to learn image patterns and are therefore commonly adopted for image classification or image segmentation problems. RNNs, on the other hand, are capable of identifying temporal patterns in the training data, allowing the network to predict the next input after observing a sequence of inputs. DL methods such as CNNs and RNNs have recently replaced the simple NN models used to predict $Q(s,a)$, thus giving rise to Deep Reinforcement Learning methods [50, 51].

Deep Reinforcement Learning methods are capable of dealing with large state spaces as well as learning complex behaviors. For instance, these methods have already been used for learning how to play games from the Atari 2600 console at an expert level by only considering the current game image (pixel vector) and the reward signal provided by the game, a feat that could not be accomplished by standard RL methods. This achievement opened many perspectives for new applications based on machine learning. This is also the main reason for the ever-increasing interest we observe nowadays in Deep Learning in general, and in Deep Reinforcement Learning in particular.

2.2 Skill Acquisition

The action set of an RL problem may be composed of two types of actions: (1) primitive actions and (2) macro-actions. The former are atomic actions that take a single time step to complete, while macro-actions are composed of sequences of primitive actions of varying lengths. In other words, a macro-action represents a sequence of primitive actions chosen by a separate intrapolicy that aims at achieving a specific subgoal. The skill acquisition problem is therefore concerned with devising methods that automatically define macro-actions from primitive actions in RL problems.

There are several frameworks for constructing macro-actions, such as options [59, 67] and HEXQ [25], among others. We will focus on the options framework, since most of the graph-based skill acquisition approaches found during our survey use it to represent macro-actions.

Options are temporally extended actions composed of a sequence of primitive actions executed according to a closed policy. The agent can only execute an option when it is within a set of states called the initiation set, while it can only terminate an option in another set of states called the termination set. Each option is then treated as an action belonging to the action set and is selected by the policy-over-options $\pi$. An option $o$ is formally defined by the tuple $\langle I, \pi _o, \beta \rangle$ (a minimal data-structure sketch is given after the list below), where:

  • $I$ is the initiation set, which indicates the set of states from which the agent can initiate the execution of option $o$. The definition of this set depends on the skill acquisition method and will be explored further in this survey.
  • $\pi _o$ is the policy for option $o$, also called the intrapolicy. A common approach to define $\pi _o$ is to use a special reward function (usually called a pseudo-reward function) when executing an option, such as rewarding or punishing the agent when it reaches certain states. This reward function is only valid while the option is in execution.
  • $\beta$ is the termination condition, a probability function $\beta (s)$ that indicates the probability of option $o$ terminating in state $s$. A common approach is to use a termination probability of 1 for important states and 0 for states inside the scope of the given option. The termination condition depends on the problem being solved and on the skill acquisition method used, as discussed further in this survey.
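As a concrete illustration of the tuple above, the sketch below represents an option as a small data structure. The field names, the tabular policy representation, and the pseudo-reward values are our own illustrative assumptions, not part of the options framework itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set

State, Action = Any, Any  # problem-specific state and action types

@dataclass
class Option:
    initiation_set: Set[State]              # I: states from which the option may start
    policy: Dict[State, Action]             # pi_o: intra-option policy (tabular here)
    termination: Callable[[State], float]   # beta(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

def pseudo_reward(s_next: State, subgoal: State) -> float:
    """One common pseudo-reward choice: reward the subgoal, penalize every step."""
    return 1.0 if s_next == subgoal else -0.01
```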

The update of the action-value function of options differs from the update used for primitive actions. For example, while the update of primitive actions is done by the standard Q-Learning algorithm, the update for options can be performed using the Macro-Q-Learning method [47], given by the following rule:

\begin{equation} Q(s_t, o_t) = Q(s_t, o_t) + \alpha \big[r_o + \gamma ^k \max _{a^{\prime } \in O} Q(s_{t+k}, a^{\prime }) - Q(s_t, o_t)\big], \end{equation}
where $Q(s_t, o_t)$ is the action-value function for option $o_t$ executed in state $s_t$ at time step $t$, $k$ is the number of steps that option $o_t$ has executed so far, $O$ is the set of all possible actions that can be carried out (primitive actions and options alike), and $r_o$ is the cumulative discounted reward received while executing option $o_t$ during the past $k$ steps, given by $r_o = \sum _{i=0}^{k-1} \gamma ^i r_{t+i+1}$.
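A minimal sketch of this update rule follows. The flat dictionary representation of $Q$ and the argument names are assumptions made for illustration.

```python
def macro_q_update(Q, s_t, o_t, rewards, s_next, all_actions, alpha=0.1, gamma=0.99):
    """Macro-Q-Learning update after option o_t has run for k = len(rewards) steps.

    `rewards` holds the primitive rewards collected while the option executed, and
    `all_actions` contains primitive actions and options alike.
    """
    k = len(rewards)
    r_o = sum(gamma ** i * r for i, r in enumerate(rewards))  # discounted option return
    target = r_o + gamma ** k * max(Q[(s_next, a)] for a in all_actions)
    Q[(s_t, o_t)] += alpha * (target - Q[(s_t, o_t)])
```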

When a new option is created (by whichever method is used to discover it), it is necessary to learn its intrapolicy $\pi _o$ before effectively using it in the RL problem. Otherwise, the agent will not be able to properly execute the new option. A common approach is to learn the intrapolicy through the Experience Replay technique [35], where sequences of states and actions previously experienced by the agent within the given option's initiation set are used to learn $\pi _o$. Replaying previously observed transitions, coupled with a pseudo-reward function, allows the policy $\pi _{o}$ to be learned quickly, so that the newly formed option can be inserted into the agent's action set.

2.3 State Transition Graph (STG)

Graph-based skill acquisition methods work with a State Transition Graph (STG) that represents the RL problem at hand. Each node in an STG represents one state, and edges indicate transitions between states caused by executing an action. Each edge is thus also associated with the action executed and the reward received in the corresponding transition.

Given this scenario, there are two main approaches used in graph-based skill acquisition methods for obtaining the STG, which are described as follows:

  • Global STG: This approach explores all possible states and all state transitions that may occur in a given RL problem. With this, a full STG that maps all possible transitions between all states is built. This approach guarantees a reliable STG that fully represents the RL problem at hand. The key problem is that it is only viable for small and simple RL problems, since exploring all possible state transitions becomes infeasible for large or complex RL problems.
  • Local STG: Instead of exploring all possible state transitions, it is possible to approximate the STG by using past state-action trajectories visited during previous episodes. A state-action trajectory is a list of states visited by the agent in past experiences, complemented with the action $a_t$ the agent performed in state $s_t$ at time $t$ and the resulting next observed state $s_{t+1}$. For an RL problem with a deterministic interaction between the agent and the environment, this approach can determine an STG that is a good approximation of the environment dynamics. On the other hand, for stochastic environments, where the same action $a$ performed in state $s$ at two distinct time steps can result in different next states, this approach may not obtain good approximations. A minimal sketch of building a Local STG from stored trajectories is given after this list.
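The sketch below shows how a Local STG could be accumulated from stored trajectories using networkx. The trajectory format (tuples of state, action, reward, next state) is an assumption for illustration.

```python
import networkx as nx

def build_local_stg(trajectories):
    """Build (or extend) a Local STG from past trajectories.

    Each trajectory is assumed to be a list of (s, a, r, s_next) tuples; every edge
    stores the action taken, the reward observed, and a visit counter.
    """
    G = nx.DiGraph()
    for trajectory in trajectories:
        for s, a, r, s_next in trajectory:
            if G.has_edge(s, s_next):
                G[s][s_next]["count"] += 1
            else:
                G.add_edge(s, s_next, action=a, reward=r, count=1)
    return G
```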

3 GRAPH ANALYSIS METHODS

Since we will cover several graph-based methods for skill acquisition in this survey, it is important to lay out the basic foundation of the network analysis metrics commonly used in these approaches. Therefore, in this section, we briefly describe some important centrality measures and clustering algorithms that are widely used in the works covered here. How these methods are used in skill acquisition problems is described later in the survey.

We consider here that a graph $G$ is formally represented by $G = (V, E)$, where $V$ is the set of vertices (nodes) and $E$ is the set of edges in the graph. The number of vertices (nodes) in the graph is $n=|V|$. The graph's connectivity is represented by an adjacency matrix $\mathbf {A}$, where $a_{ij} \in \mathbf {A}$ is the weight of the edge that connects nodes $i$ and $j$. For unweighted graphs, $a_{ij} = 1$ if there exists an edge between nodes $i$ and $j$, and $a_{ij} = 0$ otherwise.

3.1 Centrality Measures

There are several ways to define the relative importance of a given node in a graph. This is done using centrality measures: a score given to each node that measures its importance in a given context, thus leading to a node ranking. There are many possible centrality measures, but we limit ourselves to describing only those more commonly used in the context of skill acquisition.

3.1.1 Degree Centrality. The most trivial centrality measure is based on the degree of each node. The degree $d_i$ of node $i$ is given by $d_i = \sum _{j \in V} a_{ij}$. The higher the degree of a node, the higher the degree centrality. The degree centrality thus allows the identification of highly connected nodes. The problem is that this information alone is sometimes insufficient, therefore requiring more meaningful centrality measures.

3.1.2 Closeness Centrality. The closeness centrality reflects the mean distance from a vertex to all other vertices. Therefore, a node with a high closeness centrality is on average relatively closer to all other nodes in the network. Formally, the closeness centrality of a node $i$ is thus defined as the inverse of the mean distance from that node to all other $n-1$ nodes in the network, i.e.,

\begin{equation} c_i = \frac{n - 1}{\sum _{j \ne i \in V} \delta (i,j)}, \end{equation}
where $\delta (i,j)$ is the distance connecting nodes $i$ and $j$.

3.1.3 Betweenness Centrality. In some scenarios, it is desirable to identify nodes through which many shortest paths pass. This is achieved by the betweenness centrality, which measures, for every pair of nodes, the ratio between the number of shortest paths that pass through a specific node $i$ and the total number of shortest paths between that pair (there can be several shortest paths of the same length that differ in the route taken). The betweenness centrality of node $i$ is formally given by

\begin{equation} b_i = \sum _{s \ne i \ne t \in V} \frac{\sigma _{st}(i)}{\sigma _{st}}, \end{equation}
where $\sigma _{st}(i)$ is the number of shortest paths between nodes $s$ and $t$ that pass through $i$ and $\sigma _{st}$ is the total number of shortest paths between $s$ and $t$.

3.1.4 Eigenvector Centrality. In the degree centrality, each node contributes equally to the centrality value of each of its neighbor nodes. In the eigenvector centrality, on the other hand, central nodes contribute more to the centrality of their neighbor nodes. The eigenvector centrality of a node is proportional to the sum of the eigenvector centralities of its neighbor nodes. Therefore, central nodes are usually connected to other central nodes or are connected to a relatively large number of low-centrality nodes. The most central nodes typically have both. The eigenvector centrality $x_i$ of node $i$ is formally defined as

\begin{equation} x_i = \frac{1}{\lambda } \sum _{j \in V} a_{ij} x_j, \end{equation}
where $\lambda$ is an eigenvalue of the adjacency matrix $\mathbf {A}$ associated with the eigenvector $x$. Therefore, the eigenvector centrality of each node in a graph can be calculated by solving the eigenvector problem
\begin{equation} \mathbf {A}x = \lambda x. \end{equation}
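All four measures discussed in this subsection are available in standard graph libraries. The sketch below computes them with networkx on a placeholder graph standing in for an STG.

```python
import networkx as nx

G = nx.karate_club_graph()  # placeholder graph standing in for an STG

degree      = nx.degree_centrality(G)       # degree d_i normalized by n - 1
closeness   = nx.closeness_centrality(G)    # (n - 1) / sum of shortest-path distances
betweenness = nx.betweenness_centrality(G)  # fraction of shortest paths through each node
eigenvector = nx.eigenvector_centrality(G)  # solves A x = lambda x iteratively

# Rank nodes by betweenness, e.g., to shortlist subgoal candidates (Section 5)
top_nodes = sorted(betweenness, key=betweenness.get, reverse=True)[:5]
```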

3.2 Clustering Algorithms

There are two main groups of clustering algorithms: data clustering and graph clustering. Data clustering is a process where a set of points in a given space is grouped according to their proximity; that is, it aims to define clusters that minimize the distance between points in the same cluster and maximize the distance between points in different clusters. Graph clustering is the process where nodes with similar connections are assigned to a single cluster. The objective is to define clusters that maximize the number of intracluster connections while minimizing the number of connections to nodes of other clusters. We present here two of the most commonly used clustering methods in graph-based approaches for skill acquisition.

3.2.1 K-Means. K-Means is a clustering algorithm that takes as input a set of $n$ points and groups them into $k$ clusters based on their distance, where $k$ is a parameter of the algorithm. It starts by selecting $k$ random points to represent the centroids of the clusters, given by $C_1, \ldots, C_k$. It then enters an iterative stage that (1) assigns each point $p_i$ to the cluster $O_j$ with the nearest centroid $C_j$ (usually according to the Euclidean distance) and (2) recalculates the centroids $C_1, \ldots, C_k$ using

\begin{equation} C_a = \frac{1}{|O_a|}\sum _{p_i \in O_a} p_i, \forall a \in \lbrace 1,\dots ,k\rbrace . \end{equation}

This process continues until the cluster assignments do not change between two consecutive iterations. The K-Means algorithm is synthesized in Algorithm 1.
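As a complement to Algorithm 1, a compact NumPy version of the same loop is sketched below; the random initialization and the convergence test are simplified.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """points: (n, d) array of n points in d dimensions; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # (1) assign each point to the cluster with the nearest centroid (Euclidean)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2) recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            points[labels == a].mean(axis=0) if np.any(labels == a) else centroids[a]
            for a in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # no change between iterations
            break
        centroids = new_centroids
    return centroids, labels
```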

3.2.2 Spectral Clustering. Spectral Clustering is a clustering method that uses as its main tool the eigenvectors of the Laplacian matrix of a graph. Before briefly describing the Spectral Clustering method, let us define the degree matrix and the Laplacian matrix:

  • Degree Matrix: the degree matrix, represented by $\mathbf {D}$, is a diagonal matrix where $\mathbf {D}_{ij} = 0$ for $i \ne j$ and $\mathbf {D}_{ii} = d_i$.
  • Laplacian Matrix: the Laplacian matrix is a special matrix given by $\mathbf {L} = \mathbf {D} - \mathbf {A}$ that has $n$ real and nonnegative eigenvalues (where $n = |V|$, i.e., the number of nodes in G). The number of eigenvalues of $\mathbf {L}$ that equals zero indicates the number of connected components in the corresponding graph $G = (V,E)$ [52].

The Spectral Clustering algorithm, presented in Algorithm 2, operates as follows: given a graph $G = (V,E)$ with $n$ nodes and a parameter $k$ that indicates the number of desired clusters, it computes the Laplacian matrix $\mathbf {L}$ and then determines the $k$ eigenvectors associated with the $k$ smallest eigenvalues of $\mathbf {L}$. It then constructs a matrix $\mathbf {U} \in R^{n\times k}$, where the $i$th column of $\mathbf {U}$ is given by the eigenvector $u_i$. It then considers a set of points $y_i \in R^k$ for $i = 1, \ldots, n$, where $y_i$ is given by the $i$th row of $\mathbf {U}$. The algorithm then applies a clustering method (K-Means, for example) to the set of points $y_i$, resulting in $k$ clusters $C_1, \ldots, C_k$. Finally, a new set of clusters $O_1, \ldots, O_k$ is created such that node $j$ of graph $G$ belongs to $O_i$ if $y_j \in C_i$, that is, $O_i = \lbrace j|y_j \in C_i\rbrace$.
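As a complement to Algorithm 2, a sketch of these steps for an undirected, unweighted graph follows; scikit-learn's KMeans stands in for the final clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """A: (n, n) symmetric adjacency matrix; returns one cluster label per node."""
    D = np.diag(A.sum(axis=1))            # degree matrix
    L = D - A                             # unnormalized Laplacian L = D - A
    _, eigvecs = np.linalg.eigh(L)        # eigh returns eigenvalues in ascending order
    U = eigvecs[:, :k]                    # rows y_i are k-dimensional node embeddings
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```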

4 BENCHMARKS AND COMPARISON APPROACHES

Before delving into the different graph-based skill acquisition methods, we first describe in this section the common benchmarks used by those methods. These benchmarks are usually used to compare the efficiency of a given method against a baseline agent that uses only primitive actions.

The first widely used test environment for skill acquisition methods is the Two-Room domain (Figure 3), defined as follows: The agent starts at a random location in a room inside a grid-based world and must reach a random destination in the adjacent room. These rooms are separated by a wall with a small door connecting them. Other extended variations of this domain soon followed, such as the Four-Room and Six-Room domains, used, for example, by Şimşek [11].

Fig. 3.
Fig. 3. Two-Room test environment. The RL agent starts at a random position inside the left room and its goal is to reach the G position inside the right room. Many skill acquisition approaches use this test environment to compare different approaches. Skills learned in this environment using different techniques tend to focus on reaching the doorway, showing that this position represents a bottleneck.

The taxi domain is another test environment adopted by many authors. It is another grid-world environment, in which the agent controls a taxi. At the start, the taxi is positioned randomly in the grid, while the passenger and destination are chosen randomly from a fixed set of possible locations. The objective is to move to the passenger's location, pick the passenger up, take the passenger to the drop-off point, and leave the passenger there. There are six possible actions for the agent: move north, south, east, west, pick up, and put down. The pick-up action puts the passenger inside the taxi and only works when the taxi is in the same location as the passenger; otherwise, this action has no effect. Similarly, the put-down action drops the passenger at the current location; it only works when the passenger is already inside the taxi and the taxi is at the destination, otherwise doing nothing. The agent receives a reward of +50 for correctly delivering a passenger, −1 for each action, and −10 for each unsuccessful pick-up or put-down.

The Arcade Learning Environment (ALE) [51] is one of the most recent benchmarks in the skill acquisition community, as well as in the Reinforcement Learning community in general. The ALE provides a collection of Atari games in which the agent can only access the current state of the game, given by the current frame, and the rewards. This makes ALE an ideal benchmark, since it provides an easy way to compare different methods against other approaches in the literature.

There are different ways to assess the quality of a given graph-based skill acquisition method, depending on which benchmark is used. In general, there are two main desirable attributes: (1) performance, which measures the total amount of reward the agent can accumulate, and (2) training speed. For the Two-Room environment and its variations, authors are more concerned with the training speed and the exploration capacity, given that these environments can already be fully mastered by simple RL methods. Therefore, the objective is to create a method capable of reaching this level of mastery in fewer steps. The same is also valid for the taxi domain. For the ALE, however, the performance is also very important: many Atari games are considered to be very difficult for a simple RL method to master. Therefore, creating skill acquisition methods capable of improving the total score on a specific game is considered a gain. The training speed is also very important for this benchmark, given that the training time required for many of these games is high.

5 SKILL ACQUISITION THROUGH CENTRALITY MEASURES

Many skill acquisition methods using centrality measures that we encountered during our survey adopt a similar overall algorithm for creating macro-actions. This generic algorithm, presented in Algorithm 3 and illustrated in Figure 4, starts by defining a set of options $O$ and initializing it as the empty set. The algorithm then enters an iterative process that is repeated after the execution of each new episode of the RL problem of interest. During each episode, the state-action trajectory is saved. This information is then used to build or update the Local STG at the start of each iteration. With the Local STG updated, the next step consists of finding important states according to a centrality measure, resulting in a list $s^{\prime }_1, \ldots, s^{\prime }_e$ of $e$ important states. The algorithm then proceeds to create an option for reaching each of the important states $s^{\prime }_i, i \in 1,\ldots,e$, which is done by (1) defining the initiation set $I$ based on the adopted centrality measure, (2) using a pseudo-reward function that gives a negative reward for each step and a positive reward for reaching the subgoal state $s^{\prime }_i$, and (3) using as the termination condition $\beta$ a probability of 1 for the subgoal state or when the agent leaves the initiation set and 0 for the states in the initiation set. The newly created option $o = \langle I, \pi _o, \beta \rangle$ is then trained using experience replay [35] in order to find a good intrapolicy $\pi _o$ that allows the agent to properly execute $o$.
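To make the structure of Algorithm 3 concrete, the sketch below identifies subgoals from a Local STG and assembles the ingredients of each option. Betweenness centrality stands in for whichever measure a particular method adopts, the choice of initiation set is only one possibility, and learning the intrapolicy by experience replay is omitted.

```python
import networkx as nx

def find_centrality_subgoals(G, threshold):
    """Return, for each subgoal of the Local STG G, the pieces of an option <I, pi_o, beta>."""
    centrality = nx.betweenness_centrality(G)
    subgoals = [s for s, c in centrality.items() if c > threshold]

    option_specs = []
    for g in subgoals:
        I = set(nx.ancestors(G, g))  # states with a path to g: one possible initiation set
        # terminate with probability 1 at the subgoal or outside I, and 0 inside I
        beta = lambda s, g=g, I=I: 1.0 if (s == g or s not in I) else 0.0
        option_specs.append((g, I, beta))
    return option_specs
```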

Fig. 4.
Fig. 4. Illustration of the macro-action discovery process using centrality measures. The process starts with (a) an initial STG of a given problem. It then goes to (b) the identification of the most important states based on a given centrality (in this figure, we considered the betweenness centrality, with the size of a node proportional to its betweenness centrality). Finally, (c) the states with the highest centrality (the top $x$ nodes with the highest centrality or all nodes with a centrality value above a predefined threshold) are identified as subgoals, thus resulting in macro-actions capable of reaching such states.

The main differences between distinct graph-based skill acquisition approaches based on centrality measures, as we show in the remainder of this section, are (1) which centrality measure is used, (2) how the STG is built, and (3) how the initiation set is defined. The remaining steps are usually very similar to each other. Algorithm 3 presents the case in which a Local STG is used, although a Global STG might also be used, as in some approaches presented in the remainder of this section. Another key concept in many centrality-based methods is that the important states $s^{\prime }_1, \ldots, s^{\prime }_e$ are defined as being those with a centrality measure higher than a predefined threshold $t$. Hence, the threshold $t$ indirectly controls how many subgoals are identified within the STG. As a consequence, choosing an appropriate value for $t$ depends on the particular problem being solved, as well as on which centrality measure is used. We now describe existing graph-based skill acquisition approaches using centrality measures, divided into three categories: (1) methods that use a Global STG, (2) methods that use a Local STG, and (3) those that do not use an STG.

5.1 Centrality Measures with a Global STG

One of the pioneering works related to skill acquisition is presented by Digney [19]. In his approach, the importance of a state is measured by the gradient of reinforcement signals received during state transitions and the number of visits. Therefore, each state is associated with two functions: (1) a function $m(s)$ that measures the gradient of reinforcement signals when transitioning to state $s$ and (2) a function $c(s)$ that indicates how often the state $s$ was visited during training. Whenever a state achieves a value for $m(s)$ or $c(s)$ higher than predefined thresholds, this state is considered to be a subgoal, which means that higher reinforcement signals are received when the agent reaches this state. The skills are then considered to be any sequence of actions that leads the agent from a given state to a subgoal. Therefore, Digney [19] introduced a concept analogous to node centrality as seen in complex network problems, even though this was not explicitly declared by the author. Digney [19] also presented the Two-Room environment (Figure 3), where the agent learns that the area that contains the door is associated with important states, since these states are visited often (resulting in high values of $c(s)$). Additionally, the destination states are also considered important states (and subgoals) given their high value of $m(s)$. The main disadvantages of this approach are the following: (1) using visitation frequencies to find important states may prove infeasible for problems with a large or continuous state space, and (2) the reinforcement gradient approach may not prove useful for problems with delayed rewards.

Şimşek and Barto [12], Şimşek [11], and Şimşek and Barto [14] present the first approach to use the betweenness centrality to identify important subgoals. Instead of calculating the exact betweenness centrality for each node, the authors only calculate the number of shortest paths $c_s(v)$ starting from a given node $s$ that pass through a node $v$. If $c_s(v) > t$, where $t$ is a threshold value, then $v$ is associated with a subgoal. A skill is created for each subgoal state $g$ by setting the initiation set as the states $s$ where $c_s(g) > t$. To reduce and filter redundant skills, the authors defined a support measure for each subgoal $g$, given by $p(g) = \sum _{s \in I(g)} c_s(g)$, where $I(g)$ is the initiation set of $g$. Then, all the subgoals previously identified are inserted into a new subgraph $G$. The only subgoals considered are those with the highest support $p(g)$ within each strongly connected component of $G$, resulting in the identification of fewer skills. The authors compared their approach with a basic Q-Learning agent using only primitive actions (called from here on the baseline agent) in the Four-Room and taxi domains. It was shown that, by using skills, the agent could reach the destination in the Four-Room domain in fewer steps than the baseline agent, while in the taxi domain, the skilled agent managed to achieve a higher score while executing the same number of steps.

Menache et al. [48] propose an approach inspired by the Max Flow/Min Cut problem, where the objective is to search for a path in a weighted graph that corresponds to the maximum flow between a source node $s$ and a sink node $t$. The skill acquisition problem can then be abstracted to a Max Flow problem by considering the STG. The initial state corresponds to the source node $s$, and the sink node $t$ is represented by the ending state (the definition of the source and sink nodes may differ depending on the application). The procedure, called Q-Cut, starts by updating the weight of each edge based on relative frequencies, obtained after each new episode. This allows the execution of the Max Flow algorithm, which in turn allows the detection of bottleneck states. These states are defined as subgoals. As a result, this approach allows the detection of important states without the need for several episodes, because if the agent crosses a bottleneck at least once, the Q-Cut algorithm will already be able to identify it. After identifying bottleneck nodes, the initiation set is defined as the source states of the identified cuts. Similarly to the previous approach, the agent based on the Q-Cut method was compared with a baseline agent in the Two-Room and Six-Room domains, where it was shown that the agent using skills managed to reach the goal in fewer steps than the baseline agent.

In this subsection, we presented some centrality-based skill acquisition methods that rely on a Global STG. The main difference between these methods is how the subgoal states are chosen: betweenness centrality, visitation frequencies, and bottleneck states in a minimum cut problem. Since these methods use a Global STG, their use is restricted to small problems, given that a Global STG does not scale well for problems with a large state space.

5.2 Centrality Measures with a Local STG

The Relative Novelty algorithm, proposed by Şimşek and Barto [13], defines subgoal states as those that connect many clusters. Each state is associated with a novelty value, which measures how often this state is visited, where values near 1 indicate that a state is rarely visited and values near 0 indicate the opposite. Another key concept is the relative novelty of a state, which measures the relation between the novelty of states visited prior to the current state and the novelty of states visited afterward. High values of relative novelty indicate that a state, once visited, leads the agent to undiscovered states. Therefore, subgoals are defined as states with relative novelty values above a predefined threshold. The initiation set of a given option (each one associated with a different subgoal) is defined as the $N$ previous states visited by the agent before reaching the subgoal ($N$ being a parameter of the algorithm). A drawback of this approach is that maintaining past histories as well as the novelty values for each visited state may prove infeasible for problems with a large, complex, or continuous state space. The results achieved by the skilled agent in the taxi, Two-Room, and Six-Room domains also showed better performance when compared with a baseline agent.

A new node centrality measure also used to identify bottleneck states is proposed by Rad et al. [61]. The idea is to create a centrality measure capable of identifying bottleneck nodes with greater precision than other centralities. The proposed centrality, called Connection Graph Stability (CGS) centrality, is given by

\begin{equation} S_u = \sum _{s\ne u \ne t \in V} \frac{\delta _{u}(s, t)}{\sigma _{st}}, \end{equation}
where $s$, $u$, and $t$ are nodes of the built STG, $V$ is the set of nodes of the graph, $\delta _{u}(s, t)$ is the length of the shortest path between $s$ and $t$ that passes through $u$, and $\sigma _{st}$ is the number of shortest paths linking nodes $s$ and $t$. The CGS centrality, therefore, measures the relation between the length of the shortest paths passing through a given node and the number of total shortest paths. High values of the CGS indicate that a node is used to connect several pairs of nodes, including node pairs that are distant from the measured node. The authors then define a second scoring method for each node that is based on the centrality measure of the node and the centrality of its neighbors, given by
\begin{equation} SX_u = X_u \cdot \left(\frac{X_u}{max_{n\in N(u)} X_n}\right)^2, \end{equation}
where $X_u$ is the centrality measure of node $u$ (any centrality can be used for this scoring function: betweenness, closeness, connection graph stability, and so on) and $N(u)$ is the set of neighbors of node $u$. The bottleneck states are then defined as the $k$ states with the highest $SX_u$ score, where $k$ is a parameter of the algorithm. A baseline agent was compared with agents using skills based on different scoring functions: betweenness, closeness, and the proposed CGS. The adopted benchmarks were the Two-Room and taxi domains, where the agent using skills based on the CGS centrality managed to reach the goal in the fewest steps, followed by the agents based on the betweenness and closeness centralities. As expected, the baseline agent required a greater number of steps to reach the goal.
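The sketch below gives one possible reading of the two formulas above for small, unweighted STGs: the CGS sum is restricted to node pairs whose shortest paths actually pass through $u$, and any precomputed centrality dictionary can be plugged into the $SX_u$ score. Both functions are illustrative and not taken from the original papers.

```python
import networkx as nx

def cgs_centrality(G):
    """Brute-force Connection Graph Stability: for each pair (s, t), add the
    shortest-path length divided by the number of shortest paths to every
    interior node that lies on one of those shortest paths."""
    S = {u: 0.0 for u in G}
    for s in G:
        for t in G:
            if s == t or not nx.has_path(G, s, t):
                continue
            paths = list(nx.all_shortest_paths(G, s, t))
            sigma_st = len(paths)
            length = len(paths[0]) - 1
            interior = {u for path in paths for u in path[1:-1]}  # u != s, t
            for u in interior:
                S[u] += length / sigma_st
    return S

def sx_score(G, X):
    """SX_u = X_u * (X_u / max_{n in N(u)} X_n)^2 for any base centrality X."""
    return {
        u: X[u] * (X[u] / max(X[n] for n in nbrs)) ** 2
        for u, nbrs in ((u, list(G.neighbors(u))) for u in G)
        if nbrs
    }
```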

Concept Bridge Centrality [53] extends the previous Connection Graph Stability centrality by considering both local and global information about a given node, allowing it to better filter redundant bottleneck nodes. The local information is given by the bridge coefficient, which measures the degree of difference between a given node and its neighbors, as follows:

\begin{equation} BR(u) = e^{\alpha | d_u - \frac{1}{d_u} \sum _{u_j \in N(u)} d_{u_j} |}, \end{equation}
where $\alpha > 0$ is a constant value. High values of the bridge coefficient indicate that a node has a considerably lower degree than its neighbors, suggesting that it must be a bottleneck node. The global information is measured by any centrality measure; Moradi et al. [53] used the Connection Graph Stability centrality in their experiments. As a result, this approach finds a reduced number of bottleneck states that are then used to create macro-actions through the options framework. The graph and skill construction used in that work is identical to the approach used in the authors' previous work [61]. Both centrality measures proposed by the authors, namely, the Connection Graph Stability and Concept Bridge centralities, were compared with similar approaches using the betweenness [12] and closeness centralities. Results show that these two new centralities are capable of better filtering the important nodes, thus achieving better performance.

5.3 Centrality Measures without an STG

Some centrality-based methods for skill acquisition use only the information contained in past trajectories in order to define the importance of a state. Hence, these methods do not require an STG. In what follows, we describe how each of these methods defines its centrality measure.

McGovern and Barto [46] propose an approach based on the idea of diverse density [43, 44]. Diverse density is a supervised learning technique used for finding important features in a state space by assigning a score to each feature based on how often it appears in positive instances and how rarely it appears in negative ones. Each episode of the RL problem is transformed into a sequence of states visited during this episode and is later assigned to one of two classes: (1) successful and (2) unsuccessful episodes. States with a high diverse density can therefore be interpreted as bottlenecks or subgoals, as they appear regularly in successful episodes and rarely in unsuccessful ones. The authors also suggest that this approach may be used for abstract state spaces, useful for problems with a large or continuous state space, although this is not tested in that work. The option's initiation set is defined as the states visited in the $n$ steps preceding the arrival at a subgoal at step $t$ during successful episodes. After each new episode, the diverse density procedure is re-executed and new subgoals are found. Only subgoals found several times are considered, in order to prune unwanted subgoals. The method was tested in the Two-Room experiment, as well as in the Four-Room experiment [59], where, as with previous approaches, the agent using the proposed skill acquisition method outperformed a baseline agent using only primitive actions.

Connect-based subgoal discovery [9] follows the idea of the previous work [46]. Instead of only using the visitation frequencies to determine important states, the authors propose an approach capable of identifying nonimportant states with high visitation frequencies. This is accomplished by analyzing the visitation frequency of neighboring states and considering only states whose visitation frequency is higher than their neighbors. This helps to disregard highly visited states that are not important. The authors used the Two-Room environment to show that their approach is capable of identifying only doorways as subgoals, effectively filtering nonimportant states.

Another approach that uses bottleneck states is presented by Goel and Huber [22], where the authors propose to identify bottlenecks by checking the number of predecessor and successor states of a given state. Whenever a state has many predecessors and few successor states, the given state is a bottleneck. The proposed method is tested in a variation of the Four-Room domain, where it only identifies doorways as subgoals. The skilled agent also learned to reach the goal with considerably fewer training steps when compared with a baseline agent using only primitive actions.

Even though the methods presented in this subsection do not use an STG, they all rely on statistical information about each node. However, in large or continuous problems, it is difficult to associate such information with each node, given that there are many or even infinitely many nodes. Since these approaches were only tested in simple problems, it is difficult to know how well they perform on more complex problems.

6 SKILL ACQUISITION THROUGH CLUSTERING ALGORITHMS

Skill acquisition approaches based on clustering algorithms also follow a similar general algorithm. Once again, we will present the general algorithm first, followed by the specific approaches found in the literature. We will consider the use of a Local STG given its wider usage, although a Global STG can also be used. The general algorithm, presented in Algorithm 4 and illustrated in Figure 5, operates as follows: After the execution of each new episode of the RL problem, the STG is built based on the observed trajectories (in case a Local STG is used). With this graph, states are separated into clusters through a clustering algorithm. A new option is then created for each pair of clusters $C_i$ and $C_j$ such that a state in $C_j$ is reachable from a state in $C_i$ (represented here by $C_i \Rightarrow C_j$). The initiation set $I$ is defined as the states that belong to cluster $C_i$. The adopted pseudo-reward function rewards the agent for reaching a state in the destination cluster $C_j$ and punishes it for each step taken or for entering a state that belongs to neither $I$ nor $C_j$ ($s \not\in C_j \bigcup I$). The termination probability function $\beta$ is set to 0 for states $s\in I$ and 1 for any state that does not belong to the initiation set ($s \in S - I$). A new option $o = \langle I, \pi _o, \beta \rangle$ is then built or updated (in case it already existed from previous episodes) and is then trained using experience replay.
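The sketch below enumerates the option specifications described above (Algorithm 4) from a Local STG and a node-to-cluster mapping `labels` (a dictionary from STG node to cluster id, such as one derived from the spectral clustering sketch in Section 3.2.2). The simplified pseudo-reward and the omission of the experience-replay training step are assumptions for illustration.

```python
def cluster_transition_options(G, labels):
    """For every ordered cluster pair (C_i, C_j) connected by at least one edge of the
    STG G, return the ingredients of an option that moves the agent from C_i to C_j."""
    clusters = {}
    for node, c in labels.items():
        clusters.setdefault(c, set()).add(node)

    specs = []
    for ci, nodes_i in clusters.items():
        for cj, nodes_j in clusters.items():
            if ci == cj:
                continue
            if any(G.has_edge(u, v) for u in nodes_i for v in nodes_j):
                I = nodes_i                                              # initiation set
                beta = lambda s, I=I: 0.0 if s in I else 1.0             # terminate outside I
                reward = lambda s_next, target=nodes_j: (                # simplified pseudo-reward
                    1.0 if s_next in target else -0.01)
                specs.append((ci, cj, I, beta, reward))
    return specs
```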

Fig. 5.
Fig. 5. Illustration of the macro-action discovery process using clustering algorithms. The process starts with the STG of a given problem (a). It then identifies the clusters (b) of the initial graph (a). Each cluster is associated with an abstract state (c), and the connections between abstract states are associated with macro-actions.

The main concept of clustering-based skill acquisition methods is to build an option capable of transitioning the agent from one cluster $C_i$ to another $C_j$ (these cluster transitions are represented by the edges in Figure 5(c)). Since these cluster transitions are controlled by an option, such transitions may take several primitive actions to occur. These primitive actions are chosen based on the intrapolicy $\pi _o$, learned through experience replay using the pseudo-reward function previously described. Given that each cluster represents a collection of similar states, an option that transitions the agent between clusters can be seen as an abstract behavior.

Unlike the centrality-based methods, no skill acquisition approach using clustering algorithms adopts a Global STG: all methods described here use a Local STG. Hence, the main differences between the clustering-based methods are the clustering algorithm used and how the importance of a state is defined. Several new clustering techniques were developed in this area to ensure that the STG is well separated, since the quality of the resulting clusters directly impacts the quality of the skill acquisition step. In the remainder of this section, we present the clustering methods proposed for graph-based skill acquisition using Local STGs.

One of the first approaches to actually use clustering algorithms to find skills in an RL problem is presented by Mannor et al. [42]. The adopted clustering function focuses on creating clusters with roughly the same size and with similar reinforcement signal transitions. This approach allows the agent to identify promising clusters by preferring options that lead the agent to clusters with high reinforcement signals. However, a key question arises: when is the best moment to execute the clustering algorithm? This is important since it is infeasible to perform this operation after each episode due to execution time constraints. The authors state that this issue is problem dependent and suggest performing the clustering algorithm when no new states are visited after a certain number of episodes (this number being a parameter). The authors compared an agent using their method with a baseline agent adopting only primitive actions in a 19-Room domain, showing that their approach manages to reach the destination in fewer steps.

Following the idea presented by Şimşek and Barto [13] of defining important states as those that connect different regions of the state space, Şimşek et al. [15] propose a new metric to determine the states that have such a characteristic. The authors propose the L-Cut algorithm, where the idea is to determine a cut of the state trajectory graph that separates different regions. An edge's weight is associated with the number of times a given transition was experienced. The graph cut metric used is the Normalized Cut (NCut) [64], which separates the STG into two distinct clusters and measures the relation between the probability of transitioning between clusters and the probability of transitioning within the same cluster. Therefore, the goal here is to find a minimum NCut, which is a cut that effectively identifies the edges that separate the two clusters. Since finding a minimal NCut is an NP-hard problem, the authors first use the spectral clustering method proposed in [64], which outputs an approximation of the minimum NCut and an approximate cluster separation, as we have already shown in Section 3.2.2. The L-Cut algorithm then proceeds to test the possible cuts suggested by the spectral clustering method by analyzing the second eigenvector and choosing the cut with the minimal NCut. This procedure is performed once for each newly obtained state trajectory. Options are designed to take the agent from a given state to a state output by the L-Cut algorithm. The initiation set is defined based on cuts that identified the given state as a subgoal and are represented by the states from the opposite cluster. An agent using the L-Cut algorithm was compared with an agent using the Relative Novelty (RN) method [13] and a baseline Q-Learning agent with no skills in the Two-Room and taxi domain, where the L-Cut method outperforms both other methods, although the gain over RN is considerably smaller.

Kheradmandian and Rahmati [31] propose an approach based on two clustering algorithms over the STG $G$ in order to find subgoal states. The first algorithm is a topology clustering algorithm that iteratively removes specific edges from $G$ in order to identify clusters, resulting in a new graph $G^{\prime }$ with nonconnected clusters. The second is a clustering algorithm based on the value function, performed over $G^{\prime }$. It starts by defining the distance between two states as the difference between their value functions. It then creates clusters such that the distances between states within the same cluster are minimized, while the distances between states of different clusters are maximized. An option is then built in one of two ways: (1) from action sequences that transition the agent between clusters or (2) from frequent action sequences. The former method creates one option for each connection between clusters, similar to the approach presented in Algorithm 4, where the shortest path for each transition is learned separately through Dynamic Programming [66]. The latter method uses statistical analysis to identify frequent action sequences used to reach a given subgoal; each sequence identified as frequent is assigned to an option whose initiation set is given by the starting state of the sequence, whose termination condition is set to 1 for the subgoal state targeted by the sequence, and whose subpolicy is determined by the sequence itself. A drawback of this approach is that the adopted statistical analysis method may not prove useful in large or continuous state spaces, given that the same sequence of actions may rarely occur a second time. The authors used the 19-Room and taxi domains to test their method, showing that it outperforms the basic Q-Learning algorithm.

Kazemitabar and Beigy [30] present a novel approach where subgoals are identified through Strongly Connected Components (SCCs). Other skill acquisition approaches have already used SCCs in some way [12, 25], but the method in [30] relies solely on the identification of SCCs. First, Q-Learning is used for a number of iterations in order to obtain a viable Local STG. Second, SCCs are identified using a linear algorithm proposed by the authors. Third, subgoal states are identified as states that connect two SCCs, similar to previous approaches that consider important states to be those that connect different regions of the state space [13, 46, 48]. The final step consists of creating options for reaching such subgoals, following a procedure similar to Algorithm 4. This approach is tested in the Two-Room domain, in which the skilled agent using the proposed approach outperforms the baseline agent.
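As a concrete reference for the SCC-based subgoal criterion (not the linear-time algorithm of [30]), the sketch below uses networkx to decompose a Local STG into SCCs and flags as subgoal candidates the states whose outgoing transitions cross between components; `transitions` is a hypothetical list of observed $(s, s^{\prime })$ pairs.

```python
import networkx as nx

def scc_subgoals(transitions):
    """Subgoal candidates: states with an outgoing edge into a different SCC."""
    g = nx.DiGraph()
    g.add_edges_from(transitions)          # observed (s, s') transition pairs
    scc_of = {}
    for i, component in enumerate(nx.strongly_connected_components(g)):
        for s in component:
            scc_of[s] = i
    # keep the source states of edges that cross between components
    return {u for u, v in g.edges() if scc_of[u] != scc_of[v]}
```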

A new clustering algorithm created to identify important nodes is presented by Entezari et al. [21]. The authors propose a new metric to measure the quality of a cluster based on the ratio between (1) the number of connections between border nodes and the other nodes of the given cluster and (2) the number of connections of these same border nodes with their neighbors in different clusters. The metric is given by

\begin{equation} R(X) = \frac{\sum _{i\in X, j \in b(X)} a_{ij}}{\sum _{i\in n(X), j \in b(X)} a_{ij}}, \end{equation}
where $b(X)$ is the set of border nodes of cluster $X$, $n(X)$ corresponds to the neighbors of border nodes that lie in a different cluster, and $a_{ij}$ indicates the element $(i,j)$ of the adjacency matrix. The objective is therefore to maximize $R(X)$. This is accomplished by greedily adding nodes that increase the value of $R(X)$; that is, node $i$ is included in cluster $X$ if $R(X + i) > R(X)$. The agent based on the proposed skill acquisition method is tested in the Four-Room domain, once again outperforming the baseline agent.
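The metric and the greedy growth rule can be sketched as follows. This is an illustration, not the authors' implementation: the seed node and the sweep order are hypothetical choices, and `A` is a (possibly weighted) adjacency matrix.

```python
import numpy as np

def r_metric(A, cluster, extra=None):
    """R(X) from the equation above: connections between cluster X and its border
    nodes, divided by connections between those border nodes and their
    neighbors outside X."""
    X = set(cluster) | ({extra} if extra is not None else set())
    n = A.shape[0]
    border = {i for i in X if any(A[i, j] and j not in X for j in range(n))}
    outside = {j for i in border for j in range(n) if A[i, j] and j not in X}
    num = sum(A[i, j] for i in X for j in border)
    den = sum(A[i, j] for i in outside for j in border)
    return num / den if den else float("inf")

def grow_cluster(A, seed):
    """Greedy rule from the text: add node i to X whenever R(X + i) > R(X)."""
    X = {seed}
    improved = True
    while improved:
        improved = False
        for i in range(A.shape[0]):
            if i not in X and r_metric(A, X, i) > r_metric(A, X):
                X.add(i)
                improved = True
    return X
```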

Davoodabadi and Beigy [16] propose a clustering algorithm that blends two different approaches: the Edge Betweenness [54] and Label Propagation [62] algorithms. The former iteratively removes the edge with the highest betweenness and analyzes the resulting clusters. At each iteration, it calculates a measure that indicates the quality of the resulting clustering, called the modularity measure, given by

\begin{equation} Q = \sum _{i} (e_{ii} - a_i^2), \end{equation}
where $e_{ij}$ is an element of a $k\times k$ matrix ($k$ being the number of existing clusters) corresponding to the fraction of edges connecting clusters $i$ and $j$, while $a_i = \sum _j e_{ij}$ is the fraction of edge ends attached to cluster $i$. The value of $Q$ varies between 0 and 1, where values near 1 indicate a strong community structure, while values near 0 indicate a random structure. The latter algorithm starts by giving a unique label to each node; at each iteration, every node receives the most common label among its neighbors. This is repeated until a stopping criterion is satisfied. The algorithm proposed in [16] performs the following steps: (1) run the label propagation algorithm and (2) for each pair of clusters $(i, j)$ resulting from the first step, calculate the modularity variation $\Delta Q(i,j)$ of joining clusters $i$ and $j$, and then join the pair associated with the highest $\Delta Q(i,j)$. Step (2) is repeated until the modularity stops rising. Subgoals are then identified as the border nodes of each cluster. This approach was compared with the Edge Betweenness and the Label Propagation algorithms for skill acquisition in RL problems, showing that the proposed algorithm performed better, since it was capable of achieving better clustering formations. The experimental results follow the same pattern as previous skill acquisition methods: an agent based on the proposed method outperforms the baseline agent in the taxi domain.
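As a concrete reference for the modularity computation and the merge step, the sketch below (an illustration under the fraction-based definition above, not the code of [16]) evaluates $Q$ for a given labeling of the STG nodes and finds the pair of clusters whose merge maximizes $\Delta Q$; `labels` is a hypothetical list assigning a cluster label to each node index.

```python
import numpy as np

def modularity(A, labels):
    """Q = sum_i (e_ii - a_i^2), with e expressed as edge fractions."""
    clusters = sorted(set(labels))
    idx = {c: k for k, c in enumerate(clusters)}
    e = np.zeros((len(clusters), len(clusters)))
    for u in range(A.shape[0]):
        for v in range(A.shape[1]):
            if A[u, v]:
                e[idx[labels[u]], idx[labels[v]]] += A[u, v]
    e /= e.sum()                       # normalize edge counts to fractions
    a = e.sum(axis=1)
    return float(np.sum(np.diag(e) - a ** 2))

def best_merge(A, labels):
    """Return the pair of clusters whose merge yields the largest delta Q."""
    clusters = sorted(set(labels))
    base = modularity(A, labels)
    best = (None, None, 0.0)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            merged = [clusters[i] if l == clusters[j] else l for l in labels]
            dq = modularity(A, merged) - base
            if dq > best[2]:
                best = (clusters[i], clusters[j], dq)
    return best
```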

Following an approach similar to the previous work [16], a new variation of the Label Propagation Algorithm (LPA) [62], proposed by Bacon and Precup [2], is used to separate similar states into clusters and then identify subgoal states as the border states of these clusters. The novelty lies in the use of a variation of the LPA that considers a weighted Local STG, where the weight of an edge is equal to the number of times a given transition was observed. Therefore, the label assigned to a given node is not the label occurring most often among its neighbors, but rather the label with the highest cumulative edge weight among the node's neighbors. After clustering the Local STG, the subgoals are defined as the border nodes, similar to previous approaches [12, 16]. The proposed skill acquisition method using LPA is compared with the basic Q-Learning method and with other skill acquisition methods using betweenness and closeness centralities in the Four-Room domain. The LPA method outperforms the Q-Learning method but is outperformed by the betweenness and closeness methods. Despite this, the authors show how to extend their LPA approach to continuous environments.
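The weighted label update can be written compactly as below; this is a minimal sketch assuming the Local STG is handled as an undirected graph whose edge weights count observed transitions (the dictionary `weights` and all names are illustrative).

```python
import random
from collections import defaultdict

def weighted_label_propagation(weights, max_iters=100, seed=0):
    """weights[(u, v)] = number of times the transition u -> v was observed.
    Each node adopts the label with the largest cumulative edge weight among
    its neighbors, as in the weighted LPA variant described above."""
    rng = random.Random(seed)
    nodes = sorted({u for u, _ in weights} | {v for _, v in weights})
    nbrs = defaultdict(dict)
    for (u, v), w in weights.items():            # treat the STG as undirected
        nbrs[u][v] = nbrs[u].get(v, 0) + w
        nbrs[v][u] = nbrs[v].get(u, 0) + w
    labels = {n: n for n in nodes}               # unique initial labels
    for _ in range(max_iters):
        changed = False
        order = nodes[:]
        rng.shuffle(order)
        for n in order:
            score = defaultdict(float)
            for m, w in nbrs[n].items():
                score[labels[m]] += w
            if score:
                best = max(score, key=score.get)
                if best != labels[n]:
                    labels[n] = best
                    changed = True
        if not changed:
            break
    return labels
```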

Taghizadeh and Beigy [69] propose two clustering approaches for skill acquisition: the first is based on spectral clustering and a variation of the $k$-means algorithm, called $k^{\prime }$-means [72], while the second uses Eigenvector Centrality (EVC) to separate clusters. The spectral-based subgoal discovery performs a spectral clustering over the Local STG in order to obtain a set of $k$ clusters, with $k$ being a parameter informed beforehand. Since it is difficult to know a priori the number of clusters in the STG of an RL problem, the authors propose to use the $k^{\prime }$-means algorithm in order to identify clusters in the eigenvector matrix $U$ (Section 3.2.2). The $k^{\prime }$-means algorithm receives as input the maximum number of clusters, given by $k$, and returns $k^{\prime }$ clusters, where $k^{\prime } < k$ and $k^{\prime }$ is an approximation of the correct number of clusters. Therefore, the number of clusters does not need to be informed beforehand. The main drawback of this approach, as pointed out by the authors, is its high computational cost. Hence, the authors propose a second approach, called EVC-based subgoal discovery, where the eigenvector centrality of each node is used to separate states into different clusters. To do this, the authors consider how the EVC varies over the graph and categorize states into three possible classes: (1) cluster center, composed of the states with the highest EVC among their neighbors; (2) cluster member, which encompasses states with a lower EVC than at least one of their neighbors and only neighbors from the same cluster; and (3) cluster border, which includes states with an EVC lower than at least one of their neighbors and with neighbors from different clusters. Subgoals are then identified as the cluster border states, and skills are formed in a similar way as in [13, 14]. Finally, both approaches adopt a skill pruning technique in order to avoid learning inefficient or redundant skills: whenever a newly created option has an initiation set whose value function is higher than that of its subgoal state, the skill is removed. The EVC-based approach is then compared with other graph-based skill acquisition approaches, namely, node betweenness [14], strongly connected components [30], edge betweenness, and reform label propagation [62]; the EVC-based approach outperforms these approaches in the Six-Room and taxi domains.
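The EVC-based classification can be illustrated as follows. The sketch computes the EVC by power iteration and assigns each state to the "cluster center" reached by hill-climbing on the EVC; this cluster-assignment rule is an illustrative assumption, not necessarily the exact procedure of [69], while the border/center/member distinction follows the description above.

```python
import numpy as np

def eigenvector_centrality(A, iters=200):
    """Power iteration on the adjacency matrix."""
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x

def evc_clusters(A):
    """Assign each state to a cluster center via EVC hill-climbing, then label
    border states (those with neighbors in a different cluster) as subgoal
    candidates."""
    evc = eigenvector_centrality(A)
    n = A.shape[0]
    nbrs = [np.flatnonzero(A[i]) for i in range(n)]

    def climb(i):
        while True:
            best = max(nbrs[i], key=lambda j: evc[j], default=i)
            if evc[best] <= evc[i]:
                return i                      # i has the highest EVC among its neighbors: a center
            i = best

    cluster = np.array([climb(i) for i in range(n)])
    border = [i for i in range(n)
              if any(cluster[j] != cluster[i] for j in nbrs[i])]
    return cluster, border                    # border states serve as subgoals
```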

The approach proposed by Mathew et al. [45] and later expanded by Krishnamurthy et al. [33] uses a spectral clustering algorithm called PCCA+ [74] in order to identify abstract states that represent sets of states with similar characteristics. The PCCA+ method receives the transition matrix of a given graph as input and returns a membership function $X_{i, j}$ that quantifies the membership of a given state $s_i$ to an abstract state $S_j$. Each connection between two abstract states $S_i$ and $S_j$ is used to create an option. The transition between abstract states is accomplished by visiting neighbor states following the positive gradient of the membership function toward the destination abstract state. The option ends when a state $s_y \in S_j$ is reached, $S_j$ being the destination abstract state. This method can then be coupled with any reinforcement learning algorithm in order to learn the optimal policy over the actions and options. Another novel idea presented in [33] is how to extend the PCCA+ spectral clustering method to define options for Atari games using the Arcade Learning Environment [6]. Since running the PCCA+ method for high-dimensional problems is very expensive, the authors adapted the Deep Learning network architecture used for predicting future frames in Atari games [55] in order to predict future frames using a high-level representation, reducing the problem dimension and thus making the execution of the PCCA+ algorithm more feasible. This method is then coupled with the Deep Q-Network (DQN) [51] RL model to learn when to use each macro-action identified by the proposed algorithm. PCCA+ is compared with the L-Cut algorithm [15] in the Two-Room and taxi domains, where it manages to identify fewer subgoals, resulting in a better overall performance. One of the main novelties of this work is that it is one of the first graph-based skill acquisition approaches to be tested in the ALE domain, where the PCCA+ method is shown to enhance the overall performance of the learning agent in the Seaquest Atari game when compared with the results obtained using DQN.

7 SKILL ACQUISITION THROUGH SPECTRAL ANALYSIS AND OTHER APPROACHES

In this section, we will describe another macro-action formalism that uses graph techniques. We will also present methods based on spectral analysis of the STG in order to automatically define skills.

The HEXQ framework, presented by Hengst [25, 26], is a different approach for skill acquisition. Its main characteristic is that it uses a hierarchy of state variables created by an expert, where the less frequently a state variable changes, the higher its position in the hierarchy. Each level of this hierarchy is associated with a sub-MDP that disregards the variable associated with its level. The idea is to learn an optimal policy for each sub-MDP, which is simpler than the original MDP. The states of a sub-MDP are transformed into abstract states that correspond to strongly connected components. Whenever a state transition inside a sub-MDP changes the value of the variable of the corresponding level, this transition is called an exit and it also serves as a subgoal. The macro-actions are therefore sequences of actions that lead the agent to an exit state. This allows the creation of several macro-actions, each one associated with a level in the hierarchy, facilitating the learning procedure. This approach, however, presents a drawback similar to the one discussed in [48]: the sub-MDPs consider all transitions beforehand, making the approach infeasible for large, complex problems. Furthermore, the framework requires a state hierarchy developed beforehand by an expert, which may also prove infeasible for large problems. The authors later improve the HEXQ framework by allowing the state hierarchy to be constructed automatically [27, 58]. To do so, the agent automatically identifies which state variables change more frequently by simply interacting with the environment. It then assigns the most frequently changed state variables to lower levels in the hierarchy, until all variables are accounted for. This framework is then tested in a simple grid-world environment.
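The automatic construction of the variable ordering can be illustrated as follows; this is a minimal sketch, assuming states are represented as tuples of variables and that counting how often each variable changes along a trajectory (most frequently changed at the lowest level) is all the example needs.

```python
from collections import Counter

def variable_change_frequency(trajectory):
    """trajectory: list of state tuples visited by the agent.
    Returns state-variable indices ordered from most to least frequently
    changed; in a HEXQ-style hierarchy the most frequently changed variables
    go to the lowest levels."""
    changes = Counter()
    for prev, curr in zip(trajectory, trajectory[1:]):
        for i, (a, b) in enumerate(zip(prev, curr)):
            if a != b:
                changes[i] += 1
    n_vars = len(trajectory[0]) if trajectory else 0
    return sorted(range(n_vars), key=lambda i: changes[i], reverse=True)
```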

A novel approach for approximating value functions is presented in [39], where value functions are approximated by a linear combination of proto-value functions. Proto-value functions are a set of functions that map each state to a real value, in a similar way to value functions (Section 2), and are represented by the eigenvectors $e_1, \dots , e_{|\mathcal {S}|}$ of the Laplacian matrix $L$ derived from the adjacency matrix $A$ (see Section 3.2.2 for the background), which is associated with the STG $G$ of an RL problem. It is then shown that the value function $V^\pi$ can be approximated according to

\begin{equation} V^\pi = \alpha _1e_1 + \cdots + \alpha _{|\mathcal {S}|} e_{|\mathcal {S}|}, \end{equation}
where $\alpha$ is a vector of size $|\mathcal {S}|$ that must be learned. This idea was then used by Osentoski and Mahadevan [56]. Using the MAXQ framework [18] to construct macro-actions, the authors propose a method for creating different state abstractions for each subtask found. The idea here is that not all state variables are useful for all macro-actions. In some cases, in order to learn the optimal policy for a macro-action, it is desirable to remove some nonimportant state variables, although these same state variables may be useful for other macro-actions. The approach works as follows: First, it is necessary to build an STG based on a state transition history of the agent for the past $n$ episodes. The authors then propose a graph reduction method capable of joining states (represented by nodes) with similar connections and with only one different state variable into a single abstract state that is represented without this state variable. This results in a graph with abstract states represented by different sets of state variables. Proto-value functions are then used over each graph $G_i$ representing a subtask $i$ in order to obtain the basis function $\phi _i(s,a)$. The basis function is given by the first $k$ eigenvectors of the Laplacian of graph $G_i$ associated with the smallest eigenvalues, where $k$ is the number of state variables associated with subtask $i$.
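A minimal sketch of value-function approximation with proto-value functions is given below, assuming the adjacency matrix $A$ of the STG is available together with sampled value estimates for the states; the combinatorial Laplacian and a least-squares fit of the coefficients $\alpha$ are used here purely for illustration.

```python
import numpy as np

def proto_value_basis(A, k):
    """k smoothest eigenvectors of the combinatorial Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]                 # columns e_1..e_k, smallest eigenvalues first

def fit_value_function(A, v_samples, k=10):
    """Least-squares fit of alpha so that Phi @ alpha approximates V^pi,
    where the columns of Phi are the proto-value functions."""
    Phi = proto_value_basis(A, k)
    alpha, *_ = np.linalg.lstsq(Phi, v_samples, rcond=None)
    return Phi @ alpha                    # approximated value function over all states
```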

Value function approximation through proto-value functions was also recently used by Machado et al. [37] in order to automatically identify options. Each proto-value function, represented by an eigenvector $e_i$ of $L$, is used to create a reward function for a given option, called an eigenpurpose by the authors. After training, each eigenpurpose gives rise to a policy, called an eigenbehavior, that maps states to actions. Instead of adding the learned options to the agent's action set and learning a policy over options, the authors take a different approach, where options are executed randomly (until completion) in order to better explore the state space, allowing the agent to reach otherwise inaccessible state regions. It is important to note that options are only defined after an initial exploratory step in which the agent builds the STG, allowing the construction of the Laplacian matrix $L$. The authors show through a series of experiments that this approach achieves good results and that the options found are not always centered on the idea of reaching bottleneck states, as done by many previous works [22, 46, 48, 61]. It is also shown that the options found by this method perform better than bottleneck options. The authors conclude their work by presenting how to use their method with function approximation for large state spaces: instead of using an adjacency matrix (from which the Laplacian matrix is derived), the authors use a matrix $\mathbf {T}$ of dimension $t \times d$, where $t$ is the number of observed transitions during the sampling process and $d$ is the number of state features. Each row of matrix $\mathbf {T}$ stores an observed transition by saving the difference between the feature vectors of the previous and current states. The eigenvectors associated with $\mathbf {T}$ are then used to create a set of eigenpurposes and eigenbehaviors. This approach is then coupled with a Deep Reinforcement Learning method (Section 2.1) and tested in the Arcade Learning Environment, showing promising results.
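The function-approximation variant can be illustrated as follows. The sketch assumes that the principal (right-singular) directions of the matrix $\mathbf {T}$ of feature differences play the role of eigenpurposes, and that the intrinsic reward for an eigenpurpose is the projection of the observed feature change onto it; the array names are hypothetical.

```python
import numpy as np

def eigenpurposes(phi_prev, phi_next, num_options):
    """phi_prev, phi_next: arrays of shape (t, d) holding the feature vectors
    of the states before and after each observed transition.
    Returns num_options direction vectors in feature space (eigenpurposes)."""
    T = phi_next - phi_prev                      # one row per observed transition
    _, _, Vt = np.linalg.svd(T, full_matrices=False)
    return Vt[:num_options]                      # principal directions of feature change

def intrinsic_reward(e, phi_s, phi_s_next):
    """Reward used to train the eigenbehavior associated with eigenpurpose e:
    positive when the transition moves the features along direction e."""
    return float(e @ (phi_s_next - phi_s))
```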

The previous work was later improved by using the Successor Representation (SR) [17]. In brief, the SR indicates the expected future state occupancy for a given state. The SR is given by a function that maps a state pair $(s, s^{\prime })$ to a real value representing the expected discounted future occupancy of $s^{\prime }$ when starting from $s$. The value function can then be decomposed as a product between the SR of $(s, s^{\prime })$ and the reward received during this transition. The SR framework was then extended by Barreto et al. [4] to problems with nonenumerated states that use function approximation, called the Successor Features (SF). Stachenfeld et al. [65] showed that there is a strong relationship between the eigenvectors of the matrix representation of the SR and the eigenvectors of the Laplacian associated with the STG. Using this relationship, Machado et al. [38] improved their Laplacian Framework [37] by using the SR (or the SF when using function approximation) as the feature representation of a state, instead of using raw pixel data. The authors augmented the SF representation by defining the SF over a lower-dimensional state representation learned by the neural network responsible for predicting future frames of a given Atari state [55], as previously used by Krishnamurthy et al. [33]. Finally, each row of the transition matrix $\mathbf {T}$ is now given by the SF representation of a given state. Options are then defined similarly to the Laplacian Framework [37].
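For enumerated states and a fixed policy with transition matrix $P$, the SR has a well-known closed form, shown in the sketch below together with the decomposition of the value function into SR times rewards; this is a didactic illustration, not the SF-based procedure of [38].

```python
import numpy as np

def successor_representation(P, gamma=0.95):
    """Closed-form SR for a fixed policy with state-transition matrix P:
    Psi[s, s'] = expected discounted number of future visits to s' from s."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

def value_from_sr(P, r, gamma=0.95):
    """V = Psi @ r: the value function decomposes into SR times state rewards."""
    return successor_representation(P, gamma) @ r
```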

8 CHALLENGES AND OPEN ISSUES

RL is a large and very active research area with several different approaches. Skill acquisition emerges as an important subarea within the RL domain, allowing RL agents to restructure their action set using high-level macro-actions (skills), potentially resulting in a better performance. One way to automatically discover skills in the RL setting is to use graph-based techniques over the STG in order to identify important subgoal states and transitions. This survey discussed several different graph-based skill acquisition methods, including centrality-based, clustering-based, and spectral analysis approaches. We showed how these graph-based methods evolved through time, adapting to more complex problems and current advances in the area, such as Deep Reinforcement Learning (Section 2.1).

Initially, many graph-based skill acquisition methods worked directly with the Global STG of an RL problem [12, 48], since these methods were only applied to relatively simple RL problems. As more complex problems appeared, researchers started focusing on graph-based methods capable of working with a Local STG, therefore requiring only the agent's transition history [13, 42, 61]. Many other methods started to use an abstract STG, derived from the Local STG, in order to work with smaller graphs [33, 56].

Currently, with the appearance of even more complex problems (Atari games, general-purpose AI [5], autonomous vehicles, etc.), RL methods have become more demanding [50, 51], requiring much more data in order to efficiently learn how to solve a given problem. This results in increasingly large Local STGs, to the point where it becomes infeasible to build a graph from the whole transition history. This is a major challenge that graph-based skill acquisition methods must properly address; otherwise, graph-based methods will fail to properly define skills in current complex problems. It is therefore necessary to search for new approaches for creating an abstract STG that effectively represents the environment dynamics without requiring the construction of a fully detailed Local STG. Such approaches could, for example, merge the state an RL agent is currently visiting with a previously visited state that has similar characteristics (using an approximate state representation, for example), thus building an abstract STG while the agent transitions between states.

Spectral analysis allows the identification of not only skills but also their respective intrapolicies, achieved through the approximation of the value function for a given reinforcement learning problem [37]. Other interesting spectral analysis methods are also available in the literature; although they focus on different domains, it may be possible to adapt them to the context of skill acquisition. For example, the approach presented in [75] to identify critical nodes for network robustness could be adapted to the skill acquisition problem. This method finds critical nodes in a complex network by using a local scoring function that measures how critical each node is to the network robustness. The score given to a node $v$ is associated with the spectral gap $\lambda _2$ [52] of the subnetwork formed by the nodes reachable within $h$ hops from node $v$. This score varies in the interval $[0, 1]$, where a score $S_v = 1$ indicates that node $v$ is the most locally critical node in its neighborhood. This also allows navigation from any node to a critical node by simply moving to the neighbor with the highest score. In a skill acquisition setting, critical nodes would represent subgoals and the navigation process would yield skills with their respective intrapolicies already defined.
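A possible adaptation could look like the sketch below, which computes $\lambda _2$ of each node's $h$-hop subnetwork and normalizes the score within the node's neighborhood. The normalization rule is an illustrative assumption rather than the exact scoring function of [75], and an undirected STG is assumed.

```python
import networkx as nx
import numpy as np

def local_spectral_score(G, h=2):
    """Spectral-gap-based criticality score: for each node, take lambda_2 of
    the Laplacian of its h-hop ego subnetwork, then normalize within each
    neighborhood so the locally most critical node scores 1."""
    raw = {}
    for v in G.nodes():
        sub = nx.ego_graph(G, v, radius=h)
        if len(sub) > 1:
            L = nx.laplacian_matrix(sub).toarray().astype(float)
            raw[v] = float(np.linalg.eigvalsh(L)[1])   # second-smallest eigenvalue
        else:
            raw[v] = 0.0
    score = {}
    for v in G.nodes():
        local_max = max(raw[u] for u in list(G.neighbors(v)) + [v])
        score[v] = raw[v] / local_max if local_max > 0 else 0.0
    return score   # moving to the neighbor with the highest score leads toward a subgoal
```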

Graph-based skill acquisition applied to complex RL problems results in potentially very large Local STGs, as we previously mentioned. To deal with these potentially very large graphs, one can adopt solutions for processing large-scale graphs that recently flourished in the literature [28, 36, 40]. This allows the graph-based skill acquisition methods to use a more complete STG, helping the learning agent to achieve a better performance. However, this approach would only be justifiable in cases where the agent's learning curve is heavily impacted by skill acquisition methods. Otherwise, the extra processing time might not be worth it. Another approach is to conceive other graph processing methods capable of reducing the Local STG, or even new approaches to directly create reduced STGs, as previously mentioned in the current section. The idea is to allow the agent to use smaller STGs that efficiently approximate the Global STG. These contrasting approaches are still underdeveloped, with plenty of room for improvement.

As a final remark, a common problem not only in the machine-learning area but also in many research areas is how to effectively compare a proposed method with previous alternatives. This becomes an especially complicated challenge when there is no established framework in the area capable of providing an efficient test bench for researchers. This problem is also present in the skill acquisition area, where (1) each researcher uses different problems to test their proposed methods or, in other cases, (2) uses different implementations of the same problem to compare their method with others. As a consequence of the former problem, authors have difficulty in effectively comparing their methods with other methods in the area, given that there is no established benchmark or baseline problem. With each method running on different problems, a direct comparison between concurrent methods becomes a challenge. The latter problem refers to differing implementations of the same problem, given that some papers in the area do not fully describe their experimental setup, making it difficult to replicate their results. However, this scenario is slowly changing, at least in the RL area, with the proposal of different benchmarks that implement several complex problems, such as the Arcade Learning Environment [6], the benchmark for continuous control [20], OpenAI Gym [8], and, more recently, the DeepMind Lab [5]. Therefore, it is important that future work related to skill acquisition adopts such benchmarks in order to allow a fair and easy comparison of any proposal against previous methods.

9 SUMMARY AND OUTLOOK

Skill acquisition allows the RL agent to better understand the environment it is immersed in, contributing to its learning process and impacting its performance. This survey covered several graph-based skill acquisition approaches for RL, detailing their evolution and adaptation to current complex RL problems, as well as their main differences and categorization. We can see through this survey that the first graph-based skill acquisition approaches focused on identifying important states through centrality measures (Section 5). These first approaches proved useful in simple grid-world problems, but as new and more complex applications emerged, the focus in the area shifted to clustering-based approaches (Section 6). Clustering-based methods proved more useful, but it was only with the introduction of methods based on spectral analysis that graph-based skill acquisition methods thrived in recent Deep Reinforcement Learning applications [33, 37].

Indeed, with RL problems becoming increasingly more complex, graph-based skill acquisition methods must adapt to either deal with very large graphs or create graph abstractions able to generate simpler yet representative STGs. As shown in recent work [33, 37], spectral analysis corresponds to an attractive approach for graph-based skill acquisition for RL. This stems from the ability of spectral analysis to uncover important information that helps RL methods by simply analyzing the underlying STG of an RL problem. If enough abstractions are made while preserving the representativeness of the STG, these methods can yield promising results even for complex problems.

REFERENCES


M. R. F. Mendonça, A. Ziviani and A. M. S. Barreto thank CNPq for its support (grants 141.761/2016-4, 308.729/2015-3 and 461.739/2014-3, respectively) and also acknowledge the INCT in Data Sciences – INCT-CiD (CNPq no. 465.560/2014-8). A. M. S. Barreto is currently with DeepMind, London, UK.

Authors’ addresses: M. R. F. Mendonça, A. Ziviani, and A. M. S. Barreto, National Laboratory for Scientific Computing (LNCC), Petrópolis, RJ, Brazil; emails: mrfm@lncc.br, ziviani@lncc.br, amsb@lncc.br.


Publication History: Received January 2018; revised August 2018; accepted October 2018