
Artificial intelligence has permeated many aspects of our lives, but a critical bottleneck to further progress is the speed of the underlying computation. Quantum computers employ the peculiar and unique properties of quantum states, such as superposition, entanglement, and interference, to process information in ways that classical computers cannot. As a new paradigm of computation, quantum computers are capable of performing tasks intractable for classical processors, offering a potential leap forward for AI research. In this regard, quantum machine learning not only enhances classical machine learning approaches but, more importantly, provides an avenue to explore new machine learning models that have no classical counterparts. Qubit-based quantum computers cannot naturally represent the continuous variables commonly used in machine learning, since the measurement outputs of qubit-based circuits are generally discrete. We therefore adopt a continuous-variable (CV) quantum architecture based on a photonic quantum computing model for our study. In this work, we employ machine learning and optimization to create photonic quantum circuits that solve the contextual multi-armed bandit problem, a problem in the domain of reinforcement learning, demonstrating that quantum reinforcement learning algorithms can be learned by a quantum device.

In recent years, research on and applications of artificial intelligence have undergone a revolution and sparked an explosion of interest, fueled by astonishing performance built on more powerful computing, more efficient algorithms, and more data. Machine learning, a part of AI, is the area of computer science that studies how computer models can learn from data. Among the three major categories of machine learning, supervised, unsupervised, and reinforcement learning, reinforcement learning is the closest to what people tend to think of as artificial intelligence. When a learning agent is placed in an unknown environment, supervised learning would teach the agent the correct actions to take, whereas in a reinforcement learning setting the agent receives only the rewards for its actions, which are weaker signals than those in supervised learning. Supervised and unsupervised learning can be considered learning about the data, but reinforcement learning is learning how to behave, that is, how to take actions.

Advances in mathematics, materials science, and computer science have made quantum computing a reality today. Making use of the counterintuitive and distinctive properties of superposition, entanglement, and interference of quantum states, quantum computing is a new computing paradigm based on the laws of quantum mechanics. Quantum computers can process certain information more efficiently than classical computers and provide a platform both to enhance classical machine learning algorithms and to develop new quantum learning algorithms [4-15].

The qubit-based quantum computer can represent discrete variables naturally, but cannot represent continuous variables efficiently. The continuous-variable (CV) quantum computing architecture [...], in contrast, works directly with continuous degrees of freedom, making it a natural fit for the continuous variables used in machine learning.

Deep learning has impressed the world with its capabilities, demonstrated by numerous applications such as Google's AlphaGo. The mathematical structure of deep learning is a multi-layered neural network in which the output of one layer is used as the input to the next. Each layer consists of a number of neurons that apply a linear transformation to the input, followed by a nonlinear activation function. Mathematically, such neural networks can approximate any continuous function, including those commonly used in machine learning.

The quest for quantum neural networks has been a long journey. One of the challenges is designing the nonlinear activation function in each layer of the network while maintaining the unitary property of the operation. In the CV quantum architecture, this problem is solved seamlessly by using non-Gaussian gates to provide both the nonlinearity and the universality of computation. Quantum neural networks offer a quantum advantage: for some problems, a classical neural network would require an exponential number of resources to approximate a quantum network. To take full advantage of both quantum and classical computing, a hybrid quantum-classical technique to create quantum circuits with a variational approach has been proposed [...].

Quantum machine learning can also improve classical machine learning. One well-known classical technique is the kernel method, which maps lower-dimensional data into a higher-dimensional space, sometimes an infinite-dimensional one, but requires lengthy computation when the dimension is high. Recent work shows that quantum devices can perform this kind of calculation naturally and efficiently. In a continuous-variable photonic quantum system, a classical data point can be prepared as an input quantum state to a quantum circuit. This quantum state is a vector in the infinite-dimensional Fock space, so it already lives in an infinite-dimensional space without the help of the kernel trick [...].
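As an illustration, the following minimal sketch (assuming the Strawberry Fields library; the value of x is our own placeholder) encodes a classical scalar as a displacement of the vacuum, producing a coherent state that already lives in the infinite-dimensional Fock space. The simulator merely truncates that space at a chosen cutoff.

```python
# Minimal sketch (assumes the Strawberry Fields library): encode a classical
# scalar x as a displacement of the vacuum, producing a coherent state that
# lives in the infinite-dimensional Fock space.
import strawberryfields as sf
from strawberryfields.ops import Dgate

x = 0.7  # hypothetical classical data point

prog = sf.Program(1)
with prog.context as q:
    Dgate(x) | q[0]  # displacement encodes x into the qumode

# The simulator truncates Fock space at cutoff_dim; hardware would not.
eng = sf.Engine("fock", backend_options={"cutoff_dim": 10})
state = eng.run(prog).state
print(state.fock_prob([2]))  # probability of observing 2 photons
```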

The goal of quantum machine learning is to use quantum processors to develop novel quantum algorithms that can dramatically accelerate computational tasks in machine learning. The recent development of hybrid quantum-classical techniques fits well with the current state of quantum technologies.

The work in [...] introduced continuous-variable quantum neural networks built from photonic circuits.

Using a combination of Gaussian and non-Gaussian gates, these circuits provide the nonlinearity necessary to create quantum neural networks, the unitarity of quantum operations, and the universality of computation: they implement highly nonlinear transformations while keeping the overall operation completely unitary. Our work designs a photonic quantum circuit that solves the contextual multi-armed bandit problem [21-24] using machine learning and optimization techniques. This circuit is made of optical gates with free continuous parameters, optimized using the photonic quantum computer simulator Strawberry Fields [...].

Our study employs a reinforcement learning technique, the policy gradient, to train the quantum neural network, so we introduce the policy gradient first.

The aim of reinforcement learning is to train a learning agent to discover a good strategy for receiving the maximum cumulative reward through interaction with an unknown environment. In reinforcement learning, the strategy is usually termed a policy, which maps states to actions either deterministically or stochastically. There are two major approaches to learning a good policy: value-based and policy-based methods. The former learns state values $V(s)$ and state-action values $Q(s, a)$ and then uses these functions to derive a good policy $\pi(a|s)$. The latter learns a good policy $\pi(a|s)$ directly, and this is the method we use in this study. Although we could define a policy $\pi(s) = \arg\max_a Q(s, a)$ once $Q(s, a)$ is found, in general we may have little interest in knowing the exact values of $Q(s, a)$. Another reason to learn the policy directly is that when the action space is continuous or the environment is stochastic, computing $Q(s, a)$ becomes a complicated task.

To explain our work, we introduce the policy gradient algorithm only in the episodic setting. First we introduce a parameter $\theta$ to the policy function, $\pi(a|s) = \pi_\theta(a|s) = P[a \mid s, \theta]$, and then use the gradient of this policy to find a $\theta$ that produces the maximum cumulative reward. Running one episode, the whole trajectory of the agent is recorded as $h = \{s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T\}$. The policy objective function is defined as $J(\theta) = \int \pi_\theta(h) r(h) \, dh$, where $r(h)$ is a reward function. Using the common identity $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$, the policy gradient algorithm REINFORCE [...] estimates the gradient as $\nabla_\theta J(\theta) = \mathbb{E}_{h \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(h) \, r(h)]$ and updates $\theta$ by stochastic gradient ascent.
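As a concrete illustration of this update, here is a minimal classical sketch of REINFORCE on a four-armed bandit with a softmax policy. It uses the reward probabilities of Experiment one listed below; it is illustrative only and is not the quantum-circuit policy trained in this work.

```python
# Minimal NumPy sketch of REINFORCE on a four-armed bandit with a softmax
# policy; illustrative only, not the quantum-circuit policy used in this work.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(4)                        # one logit per arm
alpha = 0.1                                # learning rate
p_win = [0.426, 1.0, 0.5809, 0.5066]       # P(r = +1), Experiment one below

for step in range(500):
    pi = np.exp(theta) / np.exp(theta).sum()   # softmax policy pi(a)
    a = rng.choice(4, p=pi)                    # sample an action
    r = 1 if rng.random() < p_win[a] else -1   # bandit returns +1 or -1
    grad_log = -pi                             # gradient of log pi(a) wrt theta
    grad_log[a] += 1.0
    theta += alpha * r * grad_log              # REINFORCE update
print(np.argmax(theta))                        # expected to favor arm index 1
```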

In the original multi-armed bandit problem the state remains fixed, as there is only one bandit, while in the contextual multi-armed bandit problem the state changes, as there are several bandits. The updating rule of REINFORCE encourages actions that receive positive rewards while penalizing those that do not. In general, policy gradient methods work better than value-based methods such as Q-learning, since the policy gradient directly optimizes the reward, but training can be a challenge because the high variance of the rewards makes the algorithm unstable.

Unlike the more commonly known qubit-based models, continuous-variable quantum computing is a universal computing model that can process continuous variables. In a CV model, information is stored in the quantum states of bosonic modes, called qumodes. CV quantum circuits are unitary in the Hilbert-space picture, but they can have nonlinear effects in the phase-space picture when non-Gaussian gates are used, a fact that is critical for designing CV quantum neural networks.

Inside a CV quantum circuit, the quantum gates usually contain free parameters, which allows for a variational approach in which the parameters are optimized for a particular machine learning task.

To introduce the photonic gates, we denote the creation operator by $a^\dagger$ and the annihilation operator by $a$. The displacement gate is $D(\alpha) = \exp(\alpha a^\dagger - \alpha^* a)$, and the squeezing, rotation, and Kerr single-mode gates are defined as $S(r) = \exp[\frac{r}{2}(a^2 - a^{\dagger 2})]$, $R(\phi) = \exp(i\phi \hat{n})$, and $K(\kappa) = \exp(i\kappa \hat{n}^2)$, respectively, where $\hat{n} = a^\dagger a$ is the number operator. The two-mode beam-splitter is

$$BS(\theta, \phi) = \exp[\theta(e^{i\phi} a_1^\dagger a_2 - e^{-i\phi} a_1 a_2^\dagger)],$$

which creates entanglement between the two modes. A visual representation of the effects of some of these gates is shown in the accompanying figure.
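For reference, the same gates can be written in Strawberry Fields notation; the parameter values below are arbitrary placeholders, not those of our trained circuit.

```python
# The gates above in Strawberry Fields notation; the parameter values are
# arbitrary placeholders, not those of our trained circuit.
import strawberryfields as sf
from strawberryfields.ops import Dgate, Sgate, Rgate, Kgate, BSgate

prog = sf.Program(2)
with prog.context as q:
    Rgate(0.4) | q[0]                # rotation R(phi)
    Sgate(0.3) | q[0]                # squeezing S(r)
    Dgate(0.2) | q[0]                # displacement D(alpha)
    BSgate(0.7, 0.1) | (q[0], q[1])  # beam-splitter entangles the two modes
    Kgate(0.05) | q[0]               # non-Gaussian Kerr gate K(kappa)

eng = sf.Engine("fock", backend_options={"cutoff_dim": 6})
state = eng.run(prog).state
print(state.mean_photon(0))  # (mean, variance) of the photon number in mode 0
```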

In this work, we use four qumodes to construct a quantum neural network. The input to the circuit represents the action in the multi-armed bandit problem, and also the state in the contextual multi-armed bandit problem. The output of the circuit is the set of photon-number measurements, which represent the four weights on the four arms of the bandit. The goal of the policy gradient training is for the circuit to output weights that guide the agent to choose the right arm and gain the maximum reward. Our network has a total of 32 gates from a universal set for CV quantum computing, a comparatively compact design next to the multi-layered quantum neural networks in [...].
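The following hedged sketch shows one way the photon-number readout could be turned into normalized arm weights; the placeholder displacement encoding stands in for the actual 32-gate parameterized network, whose parameters are not reproduced here.

```python
# Hedged sketch: read out the mean photon number of each of the four qumodes
# and normalize into weights over the four arms. The placeholder displacement
# encoding below stands in for the actual 32-gate parameterized network.
import numpy as np
import strawberryfields as sf
from strawberryfields.ops import Dgate

def arm_weights(state_id):
    prog = sf.Program(4)
    with prog.context as q:
        for i in range(4):
            # placeholder encoding; the real circuit applies 32 parameterized
            # Gaussian and non-Gaussian gates here
            Dgate(0.1 * (state_id + i + 1)) | q[i]
    eng = sf.Engine("fock", backend_options={"cutoff_dim": 6})
    state = eng.run(prog).state
    n = np.array([state.mean_photon(i)[0] for i in range(4)])
    return n / n.sum()  # normalized weights over the four arms
```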

Due to the use of non-Gaussian gates, this circuit can produce nonlinear transformations while maintaining its unitary property as a whole. Non-Gaussian gates are also necessary ingredients of a universal quantum computing model. The Kerr gate is used because it is diagonal in the Fock basis, which leads to faster and more reliable numerical simulations compared with the cubic phase gate, another well-known non-Gaussian gate.

The multi-armed bandit problem can be described with the following analogy. Say there is one slot machine with multiple arms, each of which has an unknown but fixed probability of giving out a prize. We can try one arm at a time, and our aim is to find a strategy that maximizes our cumulative rewards. In this environment there is just one state, the single slot machine, and only the action can vary. Describing contextual bandits requires the concept of the state, which serves as an extra clue that the agent can use to take more informed actions. The contextual multi-armed bandit problem extends the previous one: there are N slot machines, each with multiple arms, so the state, that is, which slot machine the agent faces, can vary as well. Now the goal of the agent is to learn the best action not just for a single bandit but for any number of them. Contextual bandits can be used to optimize random allocation in clinical trials and to enhance the user experience of websites by choosing which content to display, ranking advertisements, and much more.

To balance the trade-off between exploitation and exploration, we employ the $\epsilon$-greedy algorithm: with a small probability $\epsilon$ the agent takes a random action, and otherwise it picks the best action according to the output of the quantum neural network. A significant amount of attention has been given to supervised and unsupervised learning research, but relatively less progress has been made in reinforcement learning [6, 7]. The main goal of our study is to demonstrate that quantum neural networks can be used to solve problems in reinforcement learning, adding a quantum solution to the rich collection of classical methods such as $\epsilon$-greedy, upper confidence bounds (UCB), and Thompson sampling [22-24].

A multi-armed bandit is a tuple $(A, R)$ where $A$ is a known set of actions, or arms, and $R(r|a) = P(r|a)$ is an unknown probability distribution over rewards. At each step $t$, the agent selects an action $a_t \in A$ and the environment generates a reward $r_t \sim R(\cdot|a_t)$. The goal of the agent is to find a good strategy for obtaining the maximum cumulative reward $\sum_{t=1}^{T} r_t$. The contextual multi-armed bandit problem can be defined similarly as a tuple $(S, A, R)$, where $S$ is a collection of states [21-24].
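A minimal classical sketch of this environment, using the contextual-bandit reward probabilities listed in the tables below, might look as follows (class and method names are our own):

```python
# Minimal sketch of the contextual bandit environment (S, A, R); class and
# method names are our own, and p_win holds the probabilities P(r = +1)
# from the contextual-bandit tables below.
import numpy as np

class ContextualBandit:
    def __init__(self, p_win, seed=0):
        self.p_win = np.asarray(p_win)     # p_win[s, a] = P(r = +1)
        self.rng = np.random.default_rng(seed)
        self.n_states, self.n_arms = self.p_win.shape

    def reset(self):
        self.state = int(self.rng.integers(self.n_states))  # pick a bandit
        return self.state

    def pull(self, arm):
        win = self.rng.random() < self.p_win[self.state, arm]
        return 1 if win else -1

p_win = [[0.418, 0.5039, 0.5278, 1.0],    # bandit one
         [0.4547, 1.0, 0.154, 0.3972],    # bandit two
         [1.0, 0.0708, 0.5804, 0.4601],   # bandit three
         [0.3846, 0.0, 1.0, 0.4987]]      # bandit four
env = ContextualBandit(p_win)
```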

In this work, we conducted two experiments. One trains a quantum neural network, the learning agent in this study, to solve the multi-armed bandit problem; the other solves the contextual multi-armed bandit problem, where the extra dimension is the presence of states. The training method is $\epsilon$-greedy: with probability $\epsilon$ an action is selected at random, while the rest of the time the quantum neural network selects the action. In this study we chose $\epsilon = 0.1$. Each bandit has four arms, with each arm having a different but fixed probability of producing a positive reward of 1 or a negative reward of −1. In the contextual multi-armed bandit problem there are four bandits, and consequently the problem has four states.
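The $\epsilon$-greedy rule itself is only a few lines; the sketch below (function name is ours) assumes the quantum neural network has already produced the four arm weights.

```python
# Sketch of the epsilon-greedy rule used in training (epsilon = 0.1): explore
# a random arm with probability epsilon, otherwise exploit the arm the
# quantum neural network currently ranks highest. Function name is ours.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(weights, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(weights)))  # explore
    return int(np.argmax(weights))              # exploit
```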

In this problem there is only one bandit, which in this study has four arms. In the first experiment, we select two arms of the bandit to have a higher probability of giving a positive reward than the other two. In the second experiment, we swap the two highest probabilities to see whether the quantum neural network can detect the change. Training the quantum neural network for 500 steps shows that it identifies the two arms with the top positive-reward probabilities in each case; the reward probabilities for the two experiments are listed in the first table below.

In this problem there are four bandits, each with four arms in this study. We select one arm of each bandit to have a higher probability of generating a positive reward than the other three; the reward probabilities for the four bandits are listed in the second table below.

| Arm | Experiment one: P(r = +1) | Experiment one: P(r = −1) | Experiment two: P(r = +1) | Experiment two: P(r = −1) |
|---|---|---|---|---|
| 1 | 0.426 | 0.574 | 0.426 | 0.574 |
| 2 | 1.0 | 0.0 | 0.5066 | 0.4934 |
| 3 | 0.5809 | 0.4191 | 0.5809 | 0.4191 |
| 4 | 0.5066 | 0.4934 | 1.0 | 0.0 |

| Arm | Bandit one: P(r = +1) | Bandit one: P(r = −1) | Bandit two: P(r = +1) | Bandit two: P(r = −1) |
|---|---|---|---|---|
| 1 | 0.418 | 0.582 | 0.4547 | 0.5453 |
| 2 | 0.5039 | 0.4961 | 1.0 | 0.0 |
| 3 | 0.5278 | 0.4722 | 0.154 | 0.846 |
| 4 | 1.0 | 0.0 | 0.3972 | 0.6028 |

| Arm | Bandit three: P(r = +1) | Bandit three: P(r = −1) | Bandit four: P(r = +1) | Bandit four: P(r = −1) |
|---|---|---|---|---|
| 1 | 1.0 | 0.0 | 0.3846 | 0.6154 |
| 2 | 0.0708 | 0.9292 | 0.0 | 1.0 |
| 3 | 0.5804 | 0.4196 | 1.0 | 0.0 |
| 4 | 0.4601 | 0.5399 | 0.4987 | 0.5013 |

Quantum computers make use of the properties of quantum physics to process information much faster than their classical counterparts for certain tasks. As a result, quantum technologies provide fertile ground for exploring new ideas and models of computation that could revolutionize how information is stored and processed. The real benefit of quantum computing is to efficiently solve certain problems that are extremely expensive for classical computers. Driven by new algorithms, increased computing power, and big data, deep learning built on multi-layer neural networks has demonstrated great power in many different areas, generating great interest in how to create quantum neural networks. Quantum variational algorithms have recently been proposed as a hybrid between classical and quantum computing, in which a classical computer varies certain free parameters that control the preparation of quantum states, and a quantum computer prepares and measures the states.

The quantum analogue of the classical bit is the qubit, which can represent discrete variables naturally but cannot represent continuous variables efficiently. In machine learning, however, continuous variables are common, so continuous-variable quantum systems are more suitable for designing quantum neural networks. In a classical neural network, the nonlinear activation function plays an essential role in approximating any continuous function. In quantum physics, however, the operations on quantum states are required to be linear and unitary, a restriction that makes creating quantum neural networks difficult. In a photonic quantum system, this nonlinearity is achieved by the non-Gaussian gates. It remains to be understood what advantages may arise from generating superposition, entanglement, and interference of quantum states during the operation of quantum neural networks.

In this report, we showcase the application of variational methods to create photonic quantum neural networks that can solve the contextual multi-armed bandit problem, where the agent is trained with a policy gradient to gain maximum cumulative rewards. Compared to other problems in reinforcement learning where the rewards are delayed, the rewards in the contextual multi-armed bandit problem are immediate. Our work also highlights that classical machine learning can help quantum computers learn in the domain of reinforcement learning, allowing quantum and classical learning to complement each other.

The authors declare no conflicts of interest regarding the publication of this paper.