
The advantage of quantum computers over classical computers fuels the recent trend of developing machine learning algorithms on quantum computers, which can potentially lead to breakthroughs and new learning models in this area. The aim of our study is to explore deep quantum reinforcement learning (RL) on photonic quantum computers, which can process information stored in the quantum states of light. These quantum computers can naturally represent continuous variables, making them an ideal platform to create quantum versions of neural networks. Using quantum photonic circuits, we implement Q learning and actor-critic algorithms with multilayer quantum neural networks and test them in the grid world environment. Our experiments show that 1) these quantum algorithms can solve the RL problem and 2) compared to one layer, using three-layer quantum networks improves the learning of both algorithms in terms of rewards collected. In summary, our findings suggest that having more layers in deep quantum RL can enhance the learning outcome.

Today's society is rich in data, generated by all sorts of devices at an unprecedented rate. Knowing how to make good use of this big data has become an urgent and interesting topic. Machine learning (ML) is about teaching computers how to learn from data without being explicitly programmed, and as such the first step of ML is the representation of data in terms of features that computers can understand and process. In the early days of ML, hand-crafted feature extraction was the norm. Today, deep learning uses multilayer neural networks to automatically learn the features that best represent the given data. For this purpose, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and many other architectures were created. Typically, CNNs excel at extracting features from image data, and RNNs from sequential data. Deep learning has demonstrated its power in broad applications such as speech recognition, computer vision, and natural language processing. A deep neural network processes the input data through multiple processing layers. Each layer has a non-linear activation function, and the sequence of these layers learns different levels of abstraction of the input data. The nonlinear activation functions provide the expressivity of the whole neural network, which has been a challenge for researchers trying to design similar nonlinear functions on a discrete qubit-based quantum computer [

In general, ML can be classified into three categories: supervised learning, unsupervised learning, and reinforcement learning (RL) [

Inspired by how humans and animals learn from experience, RL is a method of solving sequential decision-making problems with an agent acting by trial and error in a known (with a model) or unknown (without a model) environment. A model of the environment in RL is an abstraction that an agent can use to predict how the environment will respond to its actions. The aim of RL is to discover optimal policies through interaction with the environment. Given a policy, the agent knows the best action to perform in a specific state. RL enjoys applications in a wide array of areas, including robotics, recommendation systems, advertising, and stock trading. One challenge of RL is that some actions do not receive an immediate reward; moreover, the reward the agent receives may depend not only on the current action but also on actions taken in the past. Another difficulty is balancing exploration against exploitation: the agent needs to exploit the current best actions to maximize rewards, but must also explore actions it has not yet tried in order to find better ones. A common heuristic for this trade-off is sketched below.
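To make the exploration-exploitation trade-off concrete, here is a minimal Python sketch of the standard ε-greedy rule; it is a generic illustration, not a method specific to this paper:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """Standard epsilon-greedy action selection over estimated Q values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action
```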

Deep neural networks can approximate a very broad class of linear and non-linear functions. Based on this property, deep RL uses multilayer neural networks to represent the value function, the policy, and the model in RL [

Quantum computers can make use of the counterintuitive properties of quantum states such as superposition, entanglement, and interference to process quantum information in ways that classical computers cannot. It is well-known that quantum computers can find factors of large integers and search an unstructured database much faster than classical computers. Instead of designing algorithms to process big data on classical computers, there is a new approach to deal with issues of big data using quantum computers. For example, the curse of dimensionality is a serious problem in ML for classical computers. Quantum computers, however, may alleviate this problem because a register of n qubits can hold a superposition over a 2^n-dimensional state space.

In recent years, researchers have investigated different ways of applying quantum computing to improve classical machine learning algorithms [

Deep learning is a powerful technique in ML and especially in RL. To further our understanding of applying quantum computing to the area of deep RL, we implement two popular RL algorithms [

Quantum ML has emerged as an exciting and engaging research area. Neural networks are among the most versatile ML techniques, and as such it has been a long-standing desire and challenge to create neural networks on quantum computers. With the recent advances in continuous-variable (CV) quantum models [

RL is a subfield of machine learning in which an agent learns how to act through its interaction with the environment. RL algorithms can be classified into two categories: model-based and model-free. Model-free algorithms can be further divided into policy-based (actor), value-based (critic), and a combination of both, actor-critic. In model-based RL, the agent learns or is given a model of the environment and uses it to plan its actions.

Q learning and actor-critic are both used in the famous AlphaGo program. One advantage of AC methods is that they converge faster than value-based approaches such as Q learning, although the training of AC is more delicate. In AC algorithms, the agent is divided into two components, an actor and a critic: the critic evaluates the actions taken by the actor, and the actor updates the policy gradient in the direction suggested by the critic. AC methods can be viewed as a generalized policy iteration, alternating between policy evaluation (obtaining a value for an action) and policy improvement (changing actions according to the value).

This study leverages deep quantum neural networks, with the state as the input, to estimate and output the state-action value function in Q learning, and the state value function and the policy in the AC methods.

The grid world environment is commonly used to evaluate the performance of RL algorithms. It delays the reward until the goal state is reached, which makes the learning harder. Our grid world is similar to the Frozen Lake environment from gym (https://gym.openai.com/envs/FrozenLake-v0/) but with a smaller size of 2 × 3, whereas the standard size is 4 × 4. The grid contains four possible kinds of cells: Start (S), Frozen (F), Hole (H), and Goal (G) (

The agent can move up, down, left, or right by one grid square. The agent attempts to walk from state S to state G while avoiding the hole H. The reward for reaching the goal is 1, and all other moves have a reward of 0. The visual description of this grid world is in

The standard Frozen Lake environment can be slippery or not. If it is slippery, a move succeeds in the intended direction only with probability 1/3 and slides to each of the two perpendicular directions with probability 1/3. One episode is defined as a trial by the agent to walk from state S to a terminal state of either H or G. The goal of the agent is to learn what action to take in a given state to maximize the cumulative reward, which is formulated as a policy that maps states to actions.
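To make the environment concrete, the following is a minimal Python sketch of such a 2 × 3 grid world. The cell layout and the transition code are assumptions for illustration (the paper's exact layout appears in its figure); the slippery rule follows the 1/3-probability description above:

```python
import numpy as np

# Assumed 2 x 3 layout (the paper's exact layout is given in its figure):
#   S F F
#   F H G
GRID = ["SFF", "FHG"]
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def step(state, action, slippery=False, rng=np.random.default_rng()):
    """One transition: returns (next_state, reward, done)."""
    if slippery:
        # Intended direction with prob 1/3, otherwise slide to one of
        # the two perpendicular directions (1/3 each).
        perp = {0: (2, 3), 1: (2, 3), 2: (0, 1), 3: (0, 1)}[action]
        action = int(rng.choice([action, *perp]))
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), len(GRID) - 1)     # clip to the grid
    c = min(max(state[1] + dc, 0), len(GRID[0]) - 1)
    cell = GRID[r][c]
    return (r, c), (1.0 if cell == "G" else 0.0), cell in "GH"
```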

A frequently used term in RL is the return, which is defined as follows:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \quad (1)$$

where $R_t$ is the reward at time $t$ and $\gamma$ is a discount rate ($0 \le \gamma \le 1$). The discount used in the return definition implies that current rewards are more significant than those in the future. Another useful concept in RL is the Q function, which can be formulated as below:

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] \quad (2)$$

where the expectation is taken over potentially random actions, transitions, and rewards under the policy $\pi$. The value $Q_\pi(s, a)$ represents the expected discounted return when an agent takes action $a$ in state $s$ and follows $\pi$ thereafter.
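As a small concrete illustration of Equation (1), the return of an episode can be computed by folding the reward sequence backwards; in this grid world, all rewards are 0 until the goal yields 1:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Reaching the goal on the fourth step: G = gamma**3 = 0.729
print(discounted_return([0, 0, 0, 1], gamma=0.9))
```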

Q learning [

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (3)$$

where the max is taken over all the possible actions in state $s_{t+1}$, $\gamma \in [0, 1)$ is the discount factor, and $\alpha \in (0, 1]$ is the learning rate. The sum of the first two terms in the square bracket in Equation (3) is called the target Q value, and the third term is the predicted Q value.
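In tabular form, Equation (3) becomes a one-line update. The sketch below uses sizes matching the 2 × 3 grid world; treating terminal states as having zero future value is a standard detail not spelled out in the text:

```python
import numpy as np

n_states, n_actions = 6, 4            # 2 x 3 grid, four moves
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(s, a, r, s_next, done):
    """One step of Equation (3): move Q(s, a) toward the target."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```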

In simple RL problems, a table is sufficient to represent the Q function; in more complicated problems, a neural network is used instead to represent the Q function as $Q(s, a; \theta)$, where $\theta$ is the collection of parameters of the network. Deep Q learning employs multilayer neural networks to approximate the Q function.

The network is trained to reduce the gap between the predicted Q value and the target Q value as a regression problem by optimizing $\theta$. Note that updating $\theta$ may change both values. In difficult RL problems, additional heuristic techniques are needed to stabilize training, such as experience replay, which breaks the correlations between consecutive samples, and a fixed target network, which stabilizes the regression targets [
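A framework-agnostic sketch of this regression objective follows. The callables q_net and q_target are placeholders for the online and frozen target networks (in this paper they would be quantum networks); each maps a batch of states to an array of per-action Q values, and done is a 0/1 array marking terminal transitions:

```python
import numpy as np

def td_loss(batch, q_net, q_target, gamma=0.9):
    """Mean squared gap between predicted and target Q values."""
    s, a, r, s_next, done = batch
    # Targets come from the frozen network so they stay fixed
    # while the online parameters theta are optimized.
    target = r + gamma * (1.0 - done) * q_target(s_next).max(axis=1)
    pred = q_net(s)[np.arange(len(a)), a]          # Q(s, a; theta)
    return np.mean((target - pred) ** 2)
```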

In RL, the behavior policy is the one the agent uses to generate samples, and the target policy is the one being learned. When an RL algorithm can use a behavior policy different from its target policy, it is called off-policy; otherwise it is on-policy. Q learning is off-policy. The AC methods can be made off-policy, which offers the benefit of learning one policy while using a different behavior policy.

RL algorithms usually fall into three categories: actor-only (policy-based), critic-only (value-based), and actor-critic methods. Actor-only methods usually parameterize the policy and then optimize it with a policy gradient. This approach can handle continuous states and actions, but the gradient can have high variance. On the other hand, critic-only methods typically estimate the state-action value function, from which an optimal action can be derived, although this step is difficult when the action space is continuous. When actions are discrete, one strategy to obtain or improve a policy is to use $\max_a Q(s, a)$. However, when actions are continuous, this technique requires a global maximization at every step. The alternative is to use the gradient of Q to update the policy instead of maximizing Q.

In AC, the agent is composed of an actor and a critic interacting with an environment, which provides a structure that explicitly represents the policy independently of the value function. The aim of the learning is to tune the parameters $\theta$ and $\phi$ in the agent in order to receive the maximum average reward. The work of the critic reduces the variance of the policy gradient in the actor, so the whole algorithm is more stable than a pure policy gradient method. AC can be implemented as one network with two outputs or as two separate networks (the critic can even be a lookup table), and training can be done in batch or online.

A schematic illustration of the dynamics in the AC algorithms between the actor, the critic, and the environment is shown in

We explain the actor-critic algorithms [

The expression $A(s_t, a_t) = \text{target} - V_\phi(s_t)$ in the AC definition above is usually called the advantage in policy gradient algorithms; it indicates how much the gradient should change relative to the baseline value $V_\phi(s_t)$.
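A minimal tabular sketch of one online actor-critic step built around this advantage is given below. The softmax policy parameterization and the learning rates are illustrative assumptions, not the paper's quantum networks:

```python
import numpy as np

n_states, n_actions = 6, 4
theta = np.zeros((n_states, n_actions))   # actor: policy parameters
V = np.zeros(n_states)                    # critic: state values
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.9

def policy(s):
    """Softmax distribution over actions in state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def ac_update(s, a, r, s_next, done):
    """One online AC step: critic TD update, then actor policy gradient."""
    target = r if done else r + gamma * V[s_next]
    advantage = target - V[s]             # A(s_t, a_t) = target - V_phi(s_t)
    V[s] += alpha_critic * advantage      # critic moves V toward the target
    grad_log_pi = -policy(s)              # d log pi(a|s) / d theta[s, :]
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * advantage * grad_log_pi
```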

For a problem with a small number of states and actions, tables can represent the Q function efficiently, but they do not scale well to larger state and action spaces. In this case, neural networks can be employed to approximate the Q function as $Q(s, a; \theta)$, parameterized by $\theta$ (

The critic learns a value function, which is then used by the actor to update its gradient direction in order to encourage the beneficial actions, consequently improving its policy. Actor-critic methods usually have better convergence properties than critic-only methods. There are two outputs of AC algorithms: one is the Q values for different actions from the actor and the other is the value of being in a state from the critic (

By default, AC learning is on-policy, as the critic evaluates whatever policy is currently being followed by the actor. But it can be made off-policy by sampling actions from a different policy and then incorporating that policy into the formula for updating the actor, which provides the advantage of better exploration and supports experience replay with data generated from another behavior policy [

The REINFORCE algorithm used in [

Our study uses layers of quantum neural networks composed of photonic circuits to implement Q learning and actor-critic. The photonic gates in our networks are shown in
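For illustration, one single-mode continuous-variable layer in the style of the CV quantum neural network proposals cited above can be written in Strawberry Fields as below. The gate parameters, the number of layers, and the mean-photon-number readout are assumptions of this sketch, not the paper's exact circuit:

```python
import strawberryfields as sf
from strawberryfields.ops import Dgate, Kgate, Rgate, Sgate

def cv_layer(q, params):
    """One single-mode CV layer: rotations and squeezing form the
    affine part, displacement acts as a bias, and the non-Gaussian
    Kerr gate plays the role of the non-linear activation."""
    r1, s, r2, d, k = params
    Rgate(r1) | q[0]
    Sgate(s) | q[0]
    Rgate(r2) | q[0]
    Dgate(d) | q[0]
    Kgate(k) | q[0]

prog = sf.Program(1)
with prog.context as q:
    for params in [(0.1, 0.2, 0.3, 0.4, 0.05)] * 3:  # three stacked layers
        cv_layer(q, params)

eng = sf.Engine("fock", backend_options={"cutoff_dim": 7})
state = eng.run(prog).state
print(state.mean_photon(0))   # expectation value used as the network output
```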

Our quantum Q learning and actor-critic algorithms are evaluated in the grid world environment explained in Section 3.1. The aim of the agent in this grid world is to learn how to navigate from the start state S to the goal state G (reward 1) without falling into the hole H (reward 0). Our purpose is to investigate the power of using quantum neural networks with many layers in RL, so we test our quantum networks with one layer and with three layers respectively.

Actor-only methods come with convergence guarantees (typically to a local optimum), which is not always true for critic-only methods. The use of a critic allows the AC algorithms to converge faster than actor-only methods. In problems where the action space is much smaller than the state space, the updating of the actor and the critic needs more fine-tuning, as learning the larger space takes more time. The critic computes the advantage, which measures how much better an action turned out relative to the baseline value expected in that state. Coordinating the work of the actor and the critic is not easy, as seen in

We run several numerical experiments using the Strawberry Fields software [

Machine learning teaches computer programs how to improve themselves with experience or data. Therefore, using good features to represent data is essential for ML. Fortunately, deep learning has the ability to automatically extract good features from complex and high-dimensional data, which provides the valuable first step to any real-world applications of RL that require learning from raw visual input.

Supported by deep learning, RL programs that use Q learning and actor-critic have demonstrated superior performance, which justifies studying these two algorithms on quantum computers. The advantage of having two components, actor and critic, in the AC methods also brings the difficulty of coordinating the two. Relatively speaking, the actor is more sensitive to changes than the critic, since changes to the critic affect the policy only indirectly.

Our work shows how to implement Q learning and actor-critic algorithms using deep quantum neural networks. The results of this report highlight that having more layers in the quantum networks does improve the learning of the agent in the grid world environment, which extends our previous work that uses a single-layer quantum network to solve RL problems [

The authors declare no conflicts of interest regarding the publication of this paper.
