
2019 IEEE International Conference on Industrial Internet (ICII)

Reinforcement Learning for Cyber-Physical Systems


Xing Liu, Hansong Xu, Weixian Liao, and Wei Yu
Dept. of Computer and Information Sciences, Towson University, MD, USA
Emails: {xliu10,hxu2}@students.towson.edu, {wliao,wyu}@towson.edu

Abstract—Cyber-Physical Systems (CPS), including smart industrial manufacturing, smart transportation, and smart grids, among others, are envisioned to convert traditionally isolated automated critical systems into modern interconnected intelligent systems via interconnected human, system, and physical assets, as well as providing significant economic and societal benefits. The characteristics of CPS include complexity, dynamic variability, and heterogeneity, arising from interactions between cyber and physical subsystems. These characteristics introduce critical challenges in addition to existing and vital safety and reliability requirements from traditional critical systems. To overcome these challenges, Artificial Intelligence (AI) and Machine Learning (ML) schemes, which have proven effective in numerous fields (robotics, automation, prediction, etc.), can be leveraged as solutions for CPS. In particular, reinforcement learning can make precise decisions automatically to maximize cumulative reward via systematic trial and error in an unknown environment. Yet, challenges still remain for integrating complex reinforcement learning systems with dynamic and diverse CPS domains. In this paper, we conduct a thorough investigation of existing research on reinforcement learning for CPS, and propose a framework for future research. In addition, we carry out two case studies on reinforcement learning in transportation CPS and industrial CPS to validate the effectiveness of reinforcement learning in targeted applications. Using realistic simulation platforms, we validate the effectiveness of reinforcement learning for decision making in routing for transportation CPS and production control for industrial CPS. Finally, we outline some future research challenges that remain.

Index Terms—Reinforcement learning, Cyber-Physical Systems, Internet of Things
I. INTRODUCTION

Cyber-Physical Systems (CPS) integrate cyber subsystems of computation, communication, and control into physical processes [1]. Such integration empowers intelligence, autonomy, reliability, efficiency, and so on, in a number of CPS domains, such as Industry 4.0 (smart manufacturing CPS) [2], ITS (smart transportation CPS) [3], [4], smart grid [5]–[7], and others. For example, industrial CPS integrates networking, computing, and control subsystems into industrial automation and manufacturing processes to keep industrial factories and plants safe, efficient, intelligent, and resilient [2]. In particular, the networking subsystem provides high data rate and low-latency communication between sensors, actuators, and controllers. The computing subsystem (e.g., cloud/edge computing) produces knowledge from high-volume data to assist decision making via data analytics supported by machine learning [8], [9]. Control subsystems, such as industrial control systems (ICS), produce accurate control commands for timely and accurate actuation.

As CPS intertwine numerous subsystems, the research challenges posed by the key characteristics of CPS (high complexity, dynamics, and heterogeneity) become difficult and costly to solve by using traditional optimization algorithms. This calls for the effective use of artificial intelligence (AI) and machine learning (ML) schemes [2]. Reinforcement learning, as one key machine learning paradigm in AI and ML, has been adopted to solve a number of issues, including robotic control, spectrum sharing in cognitive radio systems, and complex competitive games [10], [11], among others. Via systematic trial-and-error interactions with an unknown environment, reinforcement learning is capable of automatically computing optimal actions that gain the greatest reward at different system states. In particular, a reinforcement learning agent observes system states from an unknown environment and takes actions. The system state changes to a new state due to the action, which results in a corresponding reward that the reinforcement learning agent uses to update its actions. Reinforcement learning agents repeat this process to learn a policy (i.e., which action to take in each circumstance) that maximizes the total reward.
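To make this trial-and-error loop concrete, the following minimal tabular Q-learning sketch (in Python) shows how an agent updates its action-value estimates from observed rewards. It is a generic illustration rather than any of the surveyed systems; the environment interface (reset(), step(), and a finite action list) and the learning rate, discount factor, and exploration rate are assumed placeholders.

```python
import random

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Minimal tabular Q-learning: observe a state, act, receive a reward, update."""
    Q = {}  # Q[(state, action)] -> estimated long-term reward

    def best_action(state):
        return max(env.actions, key=lambda a: Q.get((state, a), 0.0))

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the learned policy, occasionally explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = best_action(state)
            next_state, reward, done = env.step(action)
            # Update toward the reward plus the discounted best future value.
            future = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in env.actions)
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = next_state
    return Q
```

The surveyed works differ mainly in how the states, actions, and rewards of this loop are defined for each CPS subsystem, and in whether the Q-values are stored in a table or approximated by a model.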
Reinforcement learning has been proven effective in solving vital research problems in complex, dynamic, and heterogeneous CPS. For example, as applied to industrial CPS, Jiang et al. [12] leveraged reinforcement learning to conduct ore grinding plant operational control via sensed pulp level and feed flow, and improved the production efficiency and quality in dynamic flotation industrial processes. In transportation CPS, Ferdowsi et al. [13] leveraged Q-learning to address security issues, in which adversaries conduct reinforcement learning based attacks to manipulate safe inter-vehicle spacing.

A number of surveys exist on the application of reinforcement learning in particular systems. For example, Glavic et al. [14] focused on reinforcement learning for electrical power system problems (e.g., decision and control) in power grids. Luong et al. [15] surveyed the applications of deep reinforcement learning to solve problems in communication and networking systems, including network access, rate control, caching and offloading, and so on. Additionally, there are other surveys and tutorials about reinforcement learning methodologies and algorithms from different perspectives, including reinforcement learning in robotics [10], multiagent reinforcement learning, safe reinforcement learning, and deep reinforcement learning [16], [17].

Despite these examples, there remains no systematic survey exploring the potential of adopting reinforcement learning for diverse problems in the context of CPS. For example, open questions remain about how to adopt reinforcement learning to solve problems in control subsystems (e.g., robot control), networking subsystems (e.g., spectrum sharing), and computing subsystems (e.g., computing resource scheduling). Moreover, the interactions between CPS subsystems, which are increasingly difficult to tackle by simply integrating solutions from individual subsystems, call for further investigation into reinforcement learning based solutions.

In this paper, we systematically investigate reinforcement learning for industrial CPS, transportation CPS, and others. We survey previous research works that map the problems of individual subsystems in CPS to reinforcement learning-solvable problems, such as the Markov decision process (MDP) problem, as shown in Fig. 1. In addition, we review existing research works on leveraging reinforcement learning to solve problems that are raised by the interactions of CPS subsystems via co-design, such as networking and control co-design, networking and computing co-design, and computing and control co-design. Note that the co-design approach considers parameters from both subsystems to capture tight interactions. Further, we carry out two case studies to validate the effectiveness of reinforcement learning in transportation CPS and industrial CPS, and outline some future research directions for reinforcement learning for CPS.

[Fig. 1 illustrates translating CPS problems into reinforcement learning solvable problems: an example problem is trajectory tracking (driving the system state trajectory toward the desired state trajectory), approached either model-free (black box) or model-based (deterministic model, stochastic model, or neural network).]
Fig. 1. Reinforcement Learning in CPS

The main contributions of this paper are two-fold. First, we propose a three-dimensional framework to investigate existing research works that cover CPS domains (manufacturing, transportation, and others), targets (individual subsystems or co-design), and reinforcement learning schemes (model-based or model-free). We map existing research works to each problem subspace of the three-dimensional framework to obtain a clear view of the research that has been conducted. Second, we propose another three-dimensional framework, targeting future research directions, which consists of testbed implementation (individual subsystems or co-design), objective (quality of service (QoS) or security), and reinforcement learning scheme (model-based or model-free) to progress research in leveraging reinforcement learning for CPS. Based on the framework, we carry out two case studies that leverage reinforcement learning in transportation CPS and industrial CPS, respectively. Via the experimental results, we validate the effectiveness of adopting reinforcement learning to improve decision making in CPS, such as routing and control. Finally, we discuss some remaining issues for future research.

The remainder of the paper is organized as follows: In Section II, we extensively review the state-of-the-art in reinforcement learning for individual subsystems in CPS. In Section III, we extensively review the state-of-the-art in reinforcement learning for co-design in CPS. Based on an extensive review of previous research works, we conduct two case studies and outline future research directions in Section IV. Finally, we conclude the paper in Section V.

II. STATE-OF-THE-ART REINFORCEMENT LEARNING IN CPS

In this section, we investigate the state-of-the-art in applying reinforcement learning to individual subsystems in CPS (e.g., control, networking, and computing subsystems) and the co-design of subsystems in CPS. We propose a three-dimensional framework, as shown in Fig. 2, which contains three orthogonal dimensions. Here, the X dimension indicates the emerging CPS domains (e.g., industry, transportation, and others), the Y dimension shows the targets (i.e., individual subsystems and co-design of subsystems), and the Z dimension demonstrates two types of reinforcement learning schemes (i.e., model-based and model-free). We denote research works that can be mapped to cubes in Fig. 2 by, for example, <X1, Y1 (control subsystem), Z2> for <industry, individual subsystems (control subsystem), model-based>.

[Fig. 2 depicts the framework: the X axis covers industrial CPS (X1), transportation CPS (X2), and others (X3); the Y axis covers individual subsystems (Y1: networking, control, and computing) and co-design (Y2); the Z axis covers model-free (Z1) and model-based (Z2) reinforcement learning.]
Fig. 2. A Framework for Reinforcement Learning in CPS
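To illustrate how the framework indexes the literature, the cube coordinates can be treated as keys of a simple lookup table. The sketch below is only an organizational illustration (the example entries repeat mappings reported in Table I of this paper); the dictionary itself is not part of any surveyed work.

```python
# Each cube is indexed by (CPS domain X, target Y, RL method Z).
framework = {
    ("X1: industrial", "Y1: control subsystem", "Z1: model-free"):  ["[24]", "[25]"],
    ("X1: industrial", "Y1: control subsystem", "Z2: model-based"): ["[18]", "[23]", "[26]"],
    ("X2: transportation", "Y1: control subsystem", "Z1: model-free"): ["[27]", "[28]"],
}

def works_in_cube(domain, target, method):
    """Return the surveyed works mapped to one cube of the framework."""
    return framework.get((domain, target, method), [])

print(works_in_cube("X1: industrial", "Y1: control subsystem", "Z2: model-based"))
```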
A. Reinforcement Learning in Control Subsystems

Controllers play a vital role in continuously controlling industrial systems (e.g., process automation) via set points to maintain stable operation and high production quality. This continuous control, which minimizes the deviations between the system status and the set points, can be formulated as a trajectory tracking problem. For instance, for flotation process control in mineral processing, the flotation process controller manipulates the process variables (feed flow, particle size, and others) continuously towards the set points prescribed by experienced experts [18]. In particular, Jiang et al. [18] proposed an interleaved learning algorithm for reinforcement learning, which combines policy iteration and value iteration to dynamically assign the set points accurately, automatically, and in a timely way to reduce the product quality fluctuations caused by manually prescribed set points under dynamic operating conditions. This work can be mapped to cube <X1, Y1 (control subsystem), Z2>.

Robot controllers are important for manufacturing robotics, which are expected to be more intelligent and adaptive to conduct complex production tasks, instead of traditionally simple and repetitive tasks, in a dynamic environment [19], [20]. Robot controllers measure the behaviors of robot assets and control them to minimize the error between the system trajectory and the targeted trajectory using a variety of mechanisms. For example, fuzzy control, neural network based control, and reinforcement learning based control, among others, have been shown to be effective under complex environments and demanding requirements [21]. These control mechanisms may require full, partial, or no prior knowledge about the robot system model, depending on the system characteristics and system model availability.

Polydoros et al. [20] surveyed model-based reinforcement learning controllers, which rely on accurate stochastic or deterministic transition models to conduct optimal control strategies effectively and quickly. The deterministic model contains a number of factors from physics to describe the system dynamics, which leads to high analytical complexity [22]. The stochastic model leverages Gaussian processes to build probabilistic transition functions, where the policy maps states to actions following a probability distribution [20].

Likewise, Kamthe et al. [23] proposed a model-based reinforcement learning algorithm for a probabilistic model predictive control (MPC) controller, which leverages Gaussian processes to learn a probabilistic transition function. With the learned probabilistic transition function (probabilistic model), the optimal control problem becomes a deterministic problem. Then, Pontryagin's maximum principle was leveraged to compute the gradients of the long-term cost (deviation) for open-loop optimal control with regard to control constraints. In addition, the model-based reinforcement learning algorithm was designed to reduce the number of interactions with the environment while computing an optimal control sequence (trajectory) to minimize the long-term cost (deviation). The probabilistic MPC controller determines the finite-horizon control trajectory, which allows the algorithm to repeatedly update the probabilistic model, making the controller asymptotically robust to model errors. This work can be mapped to cube <X1, Y1 (control subsystem), Z2>. Moreover, other research works applying reinforcement learning to control can be found in Table I.

TABLE I
STATE-OF-THE-ART IN REINFORCEMENT LEARNING FOR CONTROL
<X1, Y1 (control subsystem), Z1>: [24], [25]
<X1, Y1 (control subsystem), Z2>: [18], [23], [26]
<X2, Y1 (control subsystem), Z1>: [27], [28]
<X2, Y1 (control subsystem), Z2>: [29], [30]
<X3, Y1 (control subsystem), Z1>: [31], [32]
<X3, Y1 (control subsystem), Z2>: [33]
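The model-based pattern behind such controllers (collect transitions, fit a transition model, plan against the model) can be sketched as follows. This is not the Gaussian-process MPC of [23]: a least-squares linear model and random-shooting planning are used as simple stand-ins, and the state/action dimensions, horizon, and bounds are illustrative assumptions.

```python
import numpy as np

def fit_model(states, actions, next_states):
    """Fit x' ~ [x, u] @ W by least squares (a crude stand-in for the
    Gaussian-process transition model used in probabilistic MPC).
    states: (N, dx), actions: (N, 1), next_states: (N, dx)."""
    X = np.hstack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W  # shape (dx + 1, dx)

def mpc_action(W, x, setpoint, horizon=10, candidates=200, u_low=-1.0, u_high=1.0):
    """Random-shooting MPC: sample scalar-action sequences, roll them out through
    the learned model, and return the first action of the lowest-cost sequence."""
    best_u0, best_cost = 0.0, np.inf
    for _ in range(candidates):
        u_seq = np.random.uniform(u_low, u_high, size=horizon)
        xi, cost = np.array(x, dtype=float), 0.0
        for u in u_seq:
            xi = np.hstack([xi, [u]]) @ W                 # predicted next state
            cost += float(np.sum((xi - setpoint) ** 2))   # deviation from set point
        if cost < best_cost:
            best_u0, best_cost = float(u_seq[0]), cost
    return best_u0
```

In a full model-based loop, the controller would alternate between acting with mpc_action, logging the observed transitions, and refitting the model, which is what makes such methods data-efficient compared with model-free learning.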
Summary: Reinforcement learning is often integrated with classic controllers, such as MPC, to conduct optimal control in control subsystems. Building an accurate predictive model, such as a stochastic model, can improve the performance of reinforcement learning algorithms significantly. Due to the delayed reaction of control subsystems to reinforcement learning actions, we need to consider long-term cost functions to make reinforcement learning algorithms effective in a control subsystem. Nonetheless, considering long-term cost functions increases the computing complexity of reinforcement learning algorithms, making them inapplicable to some delay-sensitive control subsystems. To reduce the computing complexity, one viable strategy is to reduce unnecessary states and actions.

B. Reinforcement Learning for Networking Subsystems

Networking subsystems are another critical component, transmitting sensor data and control commands (from sensors to controllers and from controllers to actuators) in a closed-loop fashion. Reinforcement learning has been adopted to address some of the critical problems in networking subsystems, including access control, routing, and resource allocation, among others [15], [34]–[36].

For example, Al-Rawi et al. [34] reviewed mechanisms that address routing problems using Q-routing, multi-agent reinforcement learning, and partially observable MDPs, which have been proven effective for making routing decisions in dynamic and high-mobility networks. In particular, the routing problem is translated into an MDP problem, in which states are the destination nodes, actions are next-hop nodes, and the reward depends on routing objectives, such as end-to-end latency, number of hops, throughput, and others. The Q-learning algorithm assigns Q-values to all state-action pairs based on end-to-end latency. Then, every node updates its Q-values throughout the learning process until an optimal policy is learned. Note that the generalized MDP formulation can be used to model routing problems in various wireless networks, including wireless sensor networks, delay tolerant networks, and cognitive radio networks, among others. This work can be mapped to cube <X3, Y1 (networking subsystem), Z1>.
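A minimal per-node Q-routing update, in the spirit of the schemes reviewed in [34], can be sketched as follows; the hop-delay feedback and the piggybacked neighbor estimate are assumptions of the sketch rather than details of any specific protocol.

```python
import collections

class QRoutingNode:
    """Each node keeps Q[dest][neighbor]: estimated delivery delay to dest via that neighbor."""
    def __init__(self, neighbors, alpha=0.5):
        self.neighbors = list(neighbors)
        self.alpha = alpha
        self.Q = collections.defaultdict(lambda: {n: 0.0 for n in self.neighbors})

    def next_hop(self, dest):
        # Forward toward the neighbor with the smallest estimated delay.
        return min(self.neighbors, key=lambda n: self.Q[dest][n])

    def update(self, dest, neighbor, hop_delay, neighbor_estimate):
        # neighbor_estimate: the neighbor's own best remaining-delay estimate
        # for dest, assumed to be piggybacked on the acknowledgment.
        old = self.Q[dest][neighbor]
        target = hop_delay + neighbor_estimate
        self.Q[dest][neighbor] = old + self.alpha * (target - old)
```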

In addition, Li et al. [35] leveraged the transfer actor-critic algorithm (TACT) to improve network energy efficiency. The authors proposed a reinforcement learning framework to effectively adapt the network power control (base station on/off) to the traffic load variation, without full prior knowledge of the network traffic, to improve energy efficiency. In particular, the action set contains the base station on/off switching operations. The states are the traffic load, which varies due to the actions taken. The reward function is the energy cost of all base stations, where the objective is to minimize the overall energy cost. The authors adopted the actor-critic algorithm to compute a stochastic policy, which criticizes bad actions (high cost) and updates the policy until convergence. In addition, they applied transfer learning, adopting learned knowledge to a new network setting (new traffic load scenario), to improve the reinforcement learning performance (i.e., convergence time). This work can be mapped to cube <X3, Y1 (networking subsystem), Z1>.

For mobile wireless networking with high complexity, dynamics, and heterogeneity, deep reinforcement learning, a powerful and effective technique, has been designed to tackle a variety of research problems [15], [36], [37]. For example, Zhang et al. [36] reviewed deep reinforcement learning techniques, which have been applied to solve diverse mobile and wireless networking problems, including network optimization, traffic control, tracking, and intrusion detection, among others [37], [38]. For instance, considering multiple simultaneous QoS metrics (e.g., latency, packet loss ratio) for routing decision-making is a complicated problem, especially when traffic flows have different requirements. To this end, Pham et al. [39] proposed a deep reinforcement learning based routing algorithm to address complex QoS-aware routing in software-defined networks (SDN). In particular, they integrated convolutional layers in an actor network and a critic network for action function and cost function approximation. Also, a deep deterministic policy gradient (DDPG) agent was leveraged to explore the mutual impacts of traffic flows to obtain better routing configurations. Note that this work extends the approach of leveraging DDPG to conduct optimal routing that adapts to traffic intensity [40], and the experimental results demonstrate better performance for QoS-aware routing. This work can be mapped to cube <X3, Y1 (networking subsystem), Z2>.

In addition, resource allocation problems (spectrum, power, computing, and others), which highly impact the quality of experience (QoE) for users, can be solved by using deep reinforcement learning based techniques [41], [42]. Note that the quality of experience (QoE) has been developed to measure the user experience of a provided service from a user-centric perspective, in contrast to provider-centric methods (e.g., QoS) [43]. Other research works on state-of-the-art reinforcement learning for networking subsystems can be found in Table II.

TABLE II
STATE-OF-THE-ART IN REINFORCEMENT LEARNING FOR NETWORKING
<X1, Y1 (networking subsystem), Z1>: [44]
<X1, Y1 (networking subsystem), Z2>: [45], [46]
<X2, Y1 (networking subsystem), Z1>: [47]
<X2, Y1 (networking subsystem), Z2>: [41], [48]
<X3, Y1 (networking subsystem), Z1>: [34], [35]
<X3, Y1 (networking subsystem), Z2>: [49], [50]

Summary: Reinforcement learning often solves routing and network configuration problems, especially in heterogeneous and dynamic networking systems, such as mobile ad hoc networks (MANETs). Building an accurate model in heterogeneous and dynamic networking environments is extremely complex and impractical. One alternative is to combine deep neural networks and Q-learning to address routing problems in heterogeneous and dynamic networking environments.

C. Reinforcement Learning for Computing Subsystems

Moving storage and computing capabilities to the edge can effectively reduce the latency and cost for time-sensitive and resource-demanding services because the computing architecture is physically closer to end-users [8], [42], [51]. Resource allocation problems aiming at high resource utilization incur high computational complexity when the system is large and dynamic [51]. Reinforcement learning has been shown to be effective in solving these computing resource allocation problems in dynamic environments [51], [52].

For example, Ranadheera et al. [51] proposed a game-theoretic model (i.e., the Minority Game (MG)) for the distributed resource allocation problem and deep Q-learning for decision making in the context of mobile edge computing (MEC). In the MG, players are servers that decide whether or not to be active (accepting computing jobs or not) in consecutive rounds. The minority group of players, i.e., the smaller of the two groups selecting active or not active, wins the game. The payoff (i.e., a unit reward) is assigned to the players in the minority group, which keeps resources uncrowded. Q-learning was adopted to learn from the outcomes of the game so that an effective solution for the resource allocation problem could be obtained. This work can be mapped to cube <X3, Y1 (computing subsystem), Z1>.
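A toy rendering of this minority-game formulation is sketched below with a stateless (bandit-style) tabular Q update instead of the deep Q-learning used in [51]; the number of servers, rounds, and learning parameters are illustrative assumptions.

```python
import random

def minority_game(num_servers=11, rounds=2000, alpha=0.1, epsilon=0.1):
    """Each server learns whether to be 'active' (accept jobs) so that it ends up
    in the minority group, which keeps resources uncrowded and earns the reward."""
    Q = [[0.0, 0.0] for _ in range(num_servers)]   # Q[i][a], a in {0: idle, 1: active}
    for _ in range(rounds):
        choices = [random.randrange(2) if random.random() < epsilon
                   else max((0, 1), key=lambda a: Q[i][a])
                   for i in range(num_servers)]
        active = sum(choices)
        minority = 1 if active < num_servers - active else 0   # side with fewer players
        for i, a in enumerate(choices):
            reward = 1.0 if a == minority else 0.0             # minority side wins
            Q[i][a] += alpha * (reward - Q[i][a])              # stateless Q update
    return Q
```

With an odd number of servers there is never a tie, and the learned Q-values push the population toward a balanced split of active and idle servers.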
Likewise, He et al. [42] studied the cache resource allocation problem and proposed an algorithm to allocate computing resources accurately based on deep Q-learning for maximizing QoE. The QoE model was first established to measure user satisfaction in content-centric IoT, with consideration for both network-side (i.e., network cost) and user-side (i.e., satisfaction score) metrics, in contrast to traditional QoS models. A deep reinforcement learning based cache resource allocation algorithm was then designed. In their design, the system states are the conditions of the cache nodes and the transmission rates of content chunks (minimal operating units), the actions are the updates (remove or cache) on the content chunks in cache nodes, and the reward function is defined by the QoE of both the network and the users. Two neural networks were leveraged to accurately approximate the correlation between state-action pairs and Q-values. Finally, the simulation results demonstrated the effectiveness of deep reinforcement learning in solving the cache resource allocation problem with optimal QoE for content-centric IoT. This work can be mapped to cube <X3, Y1 (computing subsystem), Z2>. Other research works on reinforcement learning for computing can be found in Table III.

TABLE III
STATE-OF-THE-ART IN REINFORCEMENT LEARNING FOR COMPUTING
<X1, Y1 (computing subsystem), Z1>: [53]
<X1, Y1 (computing subsystem), Z2>: [54], [55]
<X2, Y1 (computing subsystem), Z1>: [56], [57]
<X2, Y1 (computing subsystem), Z2>: [58], [59]
<X3, Y1 (computing subsystem), Z1>: [51]
<X3, Y1 (computing subsystem), Z2>: [60], [61]

Summary: Leveraging deep Q-learning and multiagent reinforcement learning to address QoS and resource allocation problems in computing subsystems has been proven effective. Deep Q-learning uses DNNs to approximate Q-value functions, which are difficult to obtain in many scenarios, such as heterogeneous cloud/edge computing systems. In addition, multiagent reinforcement learning can implement some game theory based schemes in a distributed fashion.
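When the state space is too large for a table, the Q-function is approximated; the sketch below uses a linear approximator as a lightweight stand-in for the DNNs discussed above. The feature vector and the QoE-style reward are assumed to be supplied by the surrounding system and are not taken from [42].

```python
import numpy as np

class LinearQAgent:
    """Q(s, a) ~ w[a] . features(s): a linear stand-in for the DNN approximator
    used when the cache state space is too large for a table."""
    def __init__(self, num_features, num_actions, alpha=0.01, gamma=0.9, epsilon=0.1):
        self.w = np.zeros((num_actions, num_features))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def q_values(self, features):
        return self.w @ features

    def act(self, features):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.w.shape[0])      # explore: random cache update
        return int(np.argmax(self.q_values(features)))     # exploit: best predicted value

    def learn(self, features, action, reward, next_features, done):
        # The reward would mix network-side cost and user-side satisfaction (a QoE score).
        future = 0.0 if done else self.gamma * float(np.max(self.q_values(next_features)))
        td_error = (reward + future) - float(self.q_values(features)[action])
        self.w[action] += self.alpha * td_error * features  # semi-gradient update
```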

III. REINFORCEMENT LEARNING FOR CPS CO-DESIGN

The performance of complex CPS (e.g., resource efficiency, QoS, and others) depends not only on the individual subsystems (e.g., the networking subsystem, control subsystem, and computing subsystem), but also on the tight interactions between them, such as networking-control, networking-computing, and control-computing interactions, while achieving CPS objectives [60], [62]–[64]. While conducting system design for either networking subsystems, control subsystems, or computing subsystems, the co-design approach considers the non-negligible interactions between subsystems. For example, while conducting control subsystem design, networking-control co-design considers parameters from the networking subsystem, such as delay, which impact the design of the control subsystem.

A. Reinforcement Learning for Networking-Control Co-design

Networked control systems (NCS) leverage communication networks to transmit sensing and actuation data to control systems. The safe and resilient operation of NCS depends on the highly coupled control and networking systems (e.g., the networking subsystem introduces packet delay and loss into control loops) [65]–[68]. For example, Lu et al. [68] studied wireless network and control co-design for industrial CPS, in which wireless networks bring benefits, such as low cost and high flexibility, to industrial control systems. These wireless networks also raise challenges to control, such as latency and fragility. In their work, a sampling rate selection based optimization problem was addressed with consideration of constraints on delay and system stability for wireless network and control co-design.

In addition, Redder et al. [65] proposed a deep reinforcement learning based iterative resource allocation algorithm (DIRA), which assigns control actions (i.e., control signaling) and scheduling actions (i.e., channel resource scheduling) to conduct control and networking co-design for NCS. Their co-design approach integrates the networking resource scheduling into the controller design for the control system. They modeled the co-design as an MDP problem, in which the states are the control system states, the actions are the control actions and scheduling actions, the Q-value is approximated by a neural network, and the reward is the negative one-stage cost. In addition, considering the co-design problem in a large-scale system, the MDP problem was formalized via an extracted inherent iterative structure to reduce the action space, as well as the computing complexity of deep Q-learning for large-scale NCS. The numerical results demonstrated the effectiveness of the deep reinforcement learning based co-design scheme and the scalability of the proposed scheme. This work can be mapped to cube <X1, Y2 (networking-control), Z2>. In addition, other research works on reinforcement learning for networking-control co-design can be found in Table IV.

TABLE IV
REINFORCEMENT LEARNING FOR NETWORKING-CONTROL CO-DESIGN
<X1, Y2 (networking-control), Z1>: [69]
<X1, Y2 (networking-control), Z2>: [65]
<X2, Y2 (networking-control), Z1>: [70]
<X2, Y2 (networking-control), Z2>: [71]
<X3, Y2 (networking-control), Z1>: [72]
<X3, Y2 (networking-control), Z2>: [73], [74]
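The joint (control action, scheduling action) formulation can be illustrated with a tabular sketch over a small, discretized networked-control environment. This is only the MDP skeleton described above: the actual scheme in [65] approximates the Q-value with a neural network and exploits an iterative structure to shrink the action space, and the environment interface here is an assumed placeholder.

```python
import itertools, random

def codesign_q_learning(env, control_actions, schedule_actions,
                        episodes=300, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular sketch of the co-design MDP: one joint action picks both the control
    input and the channel schedule, and the reward is the negative one-stage cost."""
    joint_actions = list(itertools.product(control_actions, schedule_actions))
    Q = {}
    def q(s, a):
        return Q.get((s, a), 0.0)

    for _ in range(episodes):
        state = env.reset()                      # discretized control-system state
        done = False
        while not done:
            if random.random() < epsilon:
                action = random.choice(joint_actions)
            else:
                action = max(joint_actions, key=lambda a: q(state, a))
            u, schedule = action
            next_state, stage_cost, done = env.step(u, schedule)
            reward = -stage_cost                 # negative one-stage cost
            best_next = 0.0 if done else max(q(next_state, a) for a in joint_actions)
            Q[(state, action)] = q(state, action) + alpha * (reward + gamma * best_next - q(state, action))
            state = next_state
    return Q
```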
B. Reinforcement Learning for Networking-Computing Co-design

Considering the highly coupled networking and computing systems, computing resource allocation and communication resource scheduling jointly impact QoS/QoE in vehicle networks in the context of information-centric (i.e., content-centric) IoT [62]. For example, Tan et al. [62] leveraged deep Q-learning to tackle the communication and computing (cache placement) joint optimization problem, constrained by the high mobility and latency requirements in vehicle networks. To map the joint optimization problem to an MDP problem, the system states are determined by the edge server (i.e., road-side unit) availability, vehicle availability, and cache availability, while the action set contains the pair selection on edge servers, the number of packets, and the task offload selection (i.e., vehicle or edge server). The reward function is defined as the product of the transmission cost (networking) and the computation cost (computing). In addition, a DNN was adopted to approximate the Q-values for state-action pairs, and the complexity was reduced by using selected samples for training and periodic updating over the large action space. The experimental results showed the effectiveness of the proposed scheme under diverse vehicle network settings (mobility, data size, and others). This work can be mapped to cube <X2, Y2 (networking-computing), Z2>. In addition, other research works on reinforcement learning for networking-computing co-design can be found in Table V.

TABLE V
STATE-OF-THE-ART IN REINFORCEMENT LEARNING FOR NETWORKING-COMPUTING CO-DESIGN
<X1, Y2 (networking-computing), Z1>: [75]
<X1, Y2 (networking-computing), Z2>: [76]
<X2, Y2 (networking-computing), Z1>: [77], [78]
<X2, Y2 (networking-computing), Z2>: [62], [64], [79]
<X3, Y2 (networking-computing), Z1>: [80]
<X3, Y2 (networking-computing), Z2>: [81], [82]

C. Reinforcement Learning for Control-Computing Co-design

Computing systems play a vital role in enabling data analysis and have been adopted to process, store, and manage information, as well as to conduct valuable decision-making in industrial CPS, transportation CPS, and others [83], [84]. Controllers in industrial control systems produce control commands based on accurate and timely state estimations, which depend on computing resource availability [85], [86].

For example, Pant et al. [85] proposed a co-design framework to conduct the joint optimization of computing and control, which captures the interactions among the delay and error in state estimation, the control performance, and the energy cost of the controller. In particular, a stable control system does not always require accurate state estimation, as accurate estimation may burden the computational resources and cause estimation delay, further causing instability in the control subsystem. In some cases, a lower-quality state estimation can be sufficient to achieve the control objectives. The proposed co-design framework integrates the estimator design and the control design, in which the estimator presents different operating modes (error and delay) and the controller adapts to those modes to obtain stable control at low energy cost.

Likewise, Liu et al. [86] adopted deep Q-learning to solve the computing (e.g., computing task scheduling) and control co-design problem of automatic operation control (e.g., energy management) for the smart city. In this study, a DNN was leveraged in the cloud to approximate the value function over state-action pairs and to guide the deep reinforcement learning on edge servers to make better decisions. Two deep reinforcement learning methods were proposed: deep reinforcement learning on edge servers and cooperative deep reinforcement learning. For the deep reinforcement learning on edge servers, the states are the demands (requirements) of computing resources for all users, the actions are the selections conducted by edge servers to provide service, and the reward is assigned based on the minimal energy consumed. For the cooperative deep reinforcement learning method, they leveraged the DNN-trained deep reinforcement learning agent in the cloud to schedule edge servers, and deep reinforcement learning agents on edge servers to conduct dynamic control. This work can be mapped to cube <X3, Y2 (control-computing), Z2>. In addition, other research works on reinforcement learning for control-computing co-design can be found in Table VI. Note that "n/a" indicates the lack of research results in this area.

TABLE VI
STATE-OF-THE-ART IN REINFORCEMENT LEARNING FOR CONTROL-COMPUTING CO-DESIGN
<X1, Y2 (control-computing), Z1>: n/a
<X1, Y2 (control-computing), Z2>: [65]
<X2, Y2 (control-computing), Z1>: n/a
<X2, Y2 (control-computing), Z2>: n/a
<X3, Y2 (control-computing), Z1>: n/a
<X3, Y2 (control-computing), Z2>: [86]

Summary: Considering the numerous parameters from multiple subsystems, such as channel condition and data rate in networking subsystems, stability and response time in control subsystems, and computing power and compatibility in computing subsystems, building a stochastic model to conduct co-design is extremely complex and difficult to achieve. Recently, popular co-design schemes have leveraged deep neural networks to approximate a model and capture system interactions. This significantly increases the computing complexity, which makes reinforcement learning inapplicable to resource-constrained and delay-sensitive systems. To reduce latency, some schemes, such as the simplified deep neural network, extract features to reduce the number of inputs, which is a potentially effective solution.

IV. CASE STUDIES AND FUTURE RESEARCH

In this section, we provide two case studies that we have conducted on applying Q-learning in transportation CPS and industrial CPS. As preliminary studies, these fulfill the problem subspace of our framework highlighted in blue in Fig. 3. To further expand the effectiveness of reinforcement learning, we propose a three-dimensional framework in Fig. 3 based on the investigation conducted in Section II. We can leverage the framework as a guideline to solve other critical and complex problems in other CPS as well.

[Fig. 3 depicts the roadmap: the X axis covers testbed implementation for individual subsystems (X1) and co-design (X2), the Y axis covers the objectives of QoS and security, and the Z axis covers model-free (Z1) and model-based (Z2) reinforcement learning.]
Fig. 3. Roadmap of Reinforcement Learning for CPS

As shown in Fig. 3, the X dimension indicates the testbed implementation for the CPS targets (i.e., individual subsystems and co-design), the Y dimension shows the objectives of the reinforcement learning scheme (QoS or security), and the Z dimension demonstrates the types of reinforcement learning schemes adopted to achieve the design objectives.

A. Applying Reinforcement Learning in Transportation CPS

In this case study, we apply Q-learning to solve the routing problem for vehicular ad hoc networks in transportation CPS. We consider 200 vehicles, random data transmission rates, and vehicle velocities between 0-60 km/h in three scenarios: one-way road, two-way road, and cross-road. The one-way road scenario indicates that all vehicles are moving in one direction and in one lane. The two-way road scenario indicates that vehicles are in two lanes moving in two (opposite) directions. The cross-road scenario indicates a vehicle junction area. We use SUMO to model vehicle topology and behavior [87]. We also use MATLAB to simulate the network performance and compute the reinforcement learning based routing decision model.

We leverage the Q-learning algorithm to assign each state-action pair a Q-value. The system states are defined by three parameters (i.e., vehicle distance, velocity difference, and bandwidth). We divide these parameters into several intervals. For example, if the distance between two vehicles is 0-15 m, the velocity difference is 0-10 km/h, and the available bandwidth is 0-5 kHz, we consider this communication route to be in state 1. In this way, we can map all the vehicle conditions into a limited set of states. The actions are the routing decisions (modulation types, such as BPSK, QPSK, 16QAM, and 64QAM) for all vehicles within communication distance. Each action has a Q-value that indicates how good this action is. The reward function is defined by the number of hops to the destination and the delivery rate. At the early stages of the learning phase, the vehicles estimate the Q-value of each action through trial-and-error. Then, based on these Q-values, vehicles choose the appropriate actions. At the same time, vehicles continue to explore other actions with a small probability.
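A condensed sketch of this state discretization and the per-link Q update is shown below. The first interval of each parameter matches the example above; the remaining interval boundaries, the reward weights, and the simplified one-step (bandit-style) update are illustrative assumptions rather than the exact settings used in our simulation.

```python
import random

DIST_BINS = [15, 30, 60, 100]     # metres; the first bin (0-15 m) matches the example above
VEL_BINS  = [10, 20, 40, 60]      # km/h; remaining boundaries are illustrative assumptions
BW_BINS   = [5, 10, 20, 50]       # kHz
ACTIONS   = ["BPSK", "QPSK", "16QAM", "64QAM"]

def bin_index(value, bins):
    for i, b in enumerate(bins):
        if value <= b:
            return i
    return len(bins)

def link_state(distance_m, vel_diff_kmh, bandwidth_khz):
    """Map continuous link conditions to one of a limited set of discrete states."""
    return (bin_index(distance_m, DIST_BINS),
            bin_index(vel_diff_kmh, VEL_BINS),
            bin_index(bandwidth_khz, BW_BINS))

Q = {}   # Q[(state, action)]

def choose_action(state, epsilon=0.1):
    if random.random() < epsilon:                       # keep exploring with small probability
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

def update(state, action, hops, delivery_ratio, alpha=0.1, w_hops=0.1):
    """Reward favours a high delivery ratio and few hops (the weights are assumptions)."""
    reward = delivery_ratio - w_hops * hops
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward - old)
```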

We compare our reinforcement learning based routing algorithm (RLAODV) to other representative routing algorithms, namely ad hoc on-demand distance vector (AODV) and ad hoc on-demand distance vector with link quality (AODVL). Performance is compared based on the metrics of packet delivery ratio and end-to-end delay. In particular, AODV considers the distance between vehicles to conduct routing, while AODVL considers both vehicle distance and link quality.

As observed from Fig. 4, on the one-way road, the packet delivery ratio of our reinforcement learning based routing algorithm (RLAODV) outperforms AODV at all velocity difference values, obtains results close to AODVL at velocity differences ranging from 20-40 km/h, and outperforms AODVL at velocity differences ranging from 40-60 km/h (vehicle speed above 40 km/h). This is because the dynamics of the vehicle topology are limited at low velocity, where it is sufficient to consider only the vehicle distance and link quality (AODVL). For the two-way road, where the maximum velocity difference is increased, the packet delivery ratio of RLAODV outperforms AODVL at velocity differences ranging from 50-100 km/h (vehicle speed above 25 km/h). Finally, in the cross-road scenario, the packet delivery ratio of RLAODV outperforms AODVL at velocity differences ranging from 40-100 km/h (vehicle speed above 20 km/h). As we can see from the figure, the higher the complexity of the scenario, the better the performance of our reinforcement learning based routing algorithm. Also, as observed from Fig. 5, RLAODV outperforms AODV and AODVL in terms of end-to-end delay in both the two-way road and cross-road scenarios. In the one-way road scenario, RLAODV outperforms both AODV and AODVL at velocity differences ranging from 32-60 km/h. Again, note that this work can be mapped to cube <X1 (networking subsystem), Y1, Z1> in Fig. 3 for transportation CPS.

Fig. 4. A Comparison of RLAODV, AODV, and AODVL on Packet Delivery Rate under One-way Road, Two-way Road, and Cross-Road Scenarios

Fig. 5. A Comparison of RLAODV, AODV, and AODVL on End-to-End Delay under One-way Road, Two-way Road, and Cross-Road Scenarios

B. Applying Reinforcement Learning in Industrial CPS

As the second case study, we apply Q-learning to improve control system performance in industrial CPS. For our experiments, we used MATLAB/Simulink to implement the reinforcement learning controller for a continuous stirred-tank reactor (CSTR) as the physical plant in our simulation, as shown in Fig. 7. In the CSTR, a feeder supplies raw material to the reactor. A steam pipe, whose flow rate is controlled by the controller, heats the reactor to maintain a target reaction temperature.

Fig. 7. Implementation of the Reinforcement Learning Controller-Enabled Industrial CPS

For the reinforcement learning controller, we leverage Q-learning to assign a Q-value to each state-action pair of the control system. The system states are defined by the temperature of the physical plant and the trend of the temperature changes. For example, if the current temperature is 1°C and falling at a rate of 0.5°C/s, we consider the system to be in state 1. If the current temperature is 0.5°C and falling at a rate of 0.5°C/s, we consider the system to be in state 2, and so on. The actions are the control signals for the steam pipe. Instead of using control signals directly, we define our actions as the rate of increase of the control signal. By doing so, discrete control signals can be approximated as a sequential control signal. This method also prevents the controller from generating unrealistic control signals, such as rapidly changing discrete signals. The reward function is defined by the stability of the physical plant (the tendency of the temperature to move toward the set point). Combined with the MPC controller, we can predict the temperature of the physical plant over the next few time steps. Then, we can estimate the stability of the current system after taking the selected actions.
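The controller logic can be sketched as follows. The interval boundaries, the set of control-signal increments, and the use of a short list of predicted deviations in place of the Simulink plant and the MPC predictor are illustrative assumptions rather than the exact configuration of our implementation.

```python
import random

TEMP_BINS  = [-1.0, -0.5, 0.0, 0.5, 1.0]   # deviation from the set point (deg C); assumed bins
TREND_BINS = [-0.5, 0.0, 0.5]              # rate of change (deg C/s); assumed bins
INCREMENTS = [-0.2, -0.1, 0.0, 0.1, 0.2]   # actions: rate of change of the steam-valve signal

def bin_index(value, bins):
    return sum(value > b for b in bins)

def controller_state(temp_error, temp_trend):
    """Discretize (temperature deviation, trend) into one of a limited set of states."""
    return (bin_index(temp_error, TEMP_BINS), bin_index(temp_trend, TREND_BINS))

Q = {}   # Q[(state, increment)]

def select_increment(state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(INCREMENTS)
    return max(INCREMENTS, key=lambda a: Q.get((state, a), 0.0))

def update(state, increment, predicted_errors, next_state=None, alpha=0.1, gamma=0.9):
    """predicted_errors: deviations over the next few steps from an MPC-style predictor;
    the reward is higher when the plant stays close to (or moves toward) the set point."""
    reward = -sum(abs(e) for e in predicted_errors) / len(predicted_errors)
    old = Q.get((state, increment), 0.0)
    future = 0.0
    if next_state is not None:
        future = max(Q.get((next_state, a), 0.0) for a in INCREMENTS)
    Q[(state, increment)] = old + alpha * (reward + gamma * future - old)
```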
As we can observe from Fig. 6, our reinforcement learning controller outperforms typical control techniques, such as feedback control, feedforward control, and joint feedback and feedforward control. In particular, feedback control adjusts the input signal to meet the desired performance based on the output of the system, feedforward control conducts control based on the mathematical model of the process, and joint feedback and feedforward control takes advantage of both control methods. This work can be mapped to cube <X1 (control subsystem), Y1, Z1> in Fig. 3 for industrial CPS.

Fig. 6. Response of Different Control Systems (From Top to Bottom): Response of Feedback Control System, Response of Feedforward Control System, Response of Joint Feedback and Feedforward Control System, and Response of Reinforcement Learning Control System

C. Future Research

Based on the three-dimensional framework shown in Fig. 3, we now consider open research directions toward leveraging reinforcement learning in diverse CPS, including transportation CPS, industrial CPS, and others, which we intend to pursue. We shall first design reinforcement learning algorithms (e.g., Q-learning) to solve different problems, such as QoS/QoE optimization. To validate model effectiveness in a realistic environment, we shall implement the designed algorithms in corresponding testbeds, such as SUMO for vehicular networks, MATLAB/Simulink for control systems, etc. In addition, we shall implement co-design testbeds to capture the interactions of subsystems in a realistic environment, such as OMNET++ (communication network), SUMO (transportation system), and Veins (integration framework) for transportation CPS [87], and the wireless cyber-physical simulator (WCPS), including TOSSIM (communication network) and Simulink (control system), for industrial CPS [88].

Existing reinforcement learning algorithms enable only machines to conduct trial-and-error interactions with their environment to obtain an optimal policy, and do not consider human factors. Human knowledge has not been well used while training reinforcement learning models. In addition, reinforcement learning algorithms cannot automatically adjust themselves based on human behaviors, which leads to undesired performance. For example, in industrial CPS, humans play a vital role in operating machines. Without consideration for the dynamics of human behaviors, reinforcement learning cannot reach its full potential. Thus, we shall leverage human knowledge and expertise for reinforcement learning so that the reinforcement learning model can be comprehensive and flexible.

V. FINAL REMARKS

Applying AI and ML methods (e.g., reinforcement learning) to solve problems in complex, dynamic, and heterogeneous CPS is an emerging and active research field. In this paper, we proposed a three-dimensional framework to systematically investigate existing research works that adopt reinforcement learning techniques to solve problems in CPS, taking the perspectives of individual subsystems and the co-design of subsystems. Based on an extensive review of existing research, we proposed a framework for future research toward applying reinforcement learning in CPS. As preliminary efforts, we have carried out two case studies to validate the effectiveness of leveraging reinforcement learning in transportation and industrial CPS. We also outlined several promising future research directions.

ACKNOWLEDGMENT

The work was supported in part by the US National Science Foundation (NSF) under grant CNS 1350145. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agency.

REFERENCES

[1] J. Lin, W. Yu, N. Zhang, X. Yang, H. Zhang, and W. Zhao, "A survey on internet of things: Architecture, enabling technologies, security and privacy, and applications," IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1125–1142, Oct. 2017.
[2] H. Xu, W. Yu, D. Griffith, and N. Golmie, "A survey on industrial internet of things: A cyber-physical systems perspective," IEEE Access, vol. 6, pp. 78238–78259, 2018.
[3] J. Lin, W. Yu, X. Yang, Q. Yang, X. Fu, and W. Zhao, "A novel dynamic en-route decision real-time route guidance scheme in intelligent transportation systems," in Proceedings of the IEEE 35th International Conference on Distributed Computing Systems, June 2015, pp. 61–72.
[4] J. Lin, W. Yu, N. Zhang, X. Yang, and L. Ge, "Data integrity attacks against dynamic route guidance in transportation-based cyber-physical systems: Modeling, analysis, and defense," IEEE Transactions on Vehicular Technology, vol. 67, no. 9, pp. 8738–8753, Sep. 2018.
[5] X. Yu and Y. Xue, "Smart grids: A cyber–physical systems perspective," Proceedings of the IEEE, vol. 104, no. 5, pp. 1058–1070, 2016.

[6] Q. Yang, J. Yang, W. Yu, D. An, N. Zhang, and W. Zhao, "On false data-injection attacks against power system state estimation: Modeling and countermeasures," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 3, pp. 717–729, March 2014.
[7] J. Lin, W. Yu, and X. Yang, "Towards multistep electricity prices in smart grid electricity markets," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 1, pp. 286–302, Jan. 2016.
[8] W. Yu, F. Liang, X. He, W. G. Hatcher, C. Lu, J. Lin, and X. Yang, "A survey on the edge computing for the internet of things," IEEE Access, vol. 6, pp. 6900–6919, 2018.
[9] W. G. Hatcher and W. Yu, "A survey of deep learning: Platforms, applications and emerging research trends," IEEE Access, vol. 6, pp. 24411–24432, 2018.
[10] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[11] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[12] Y. Jiang, J. Fan, T. Chai, and F. L. Lewis, "Dual-rate operational optimal control for flotation industrial process with unknown operational model," IEEE Transactions on Industrial Electronics, vol. 66, no. 6, pp. 4587–4599, 2018.
[13] A. Ferdowsi, U. Challita, W. Saad, and N. B. Mandayam, "Robust deep reinforcement learning for security and safety in autonomous vehicle systems," in Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 307–312.
[14] M. Glavic, R. Fonteneau, and D. Ernst, "Reinforcement learning for electric power system decision and control: Past considerations and perspectives," IFAC-PapersOnLine, vol. 50, no. 1, pp. 6918–6927, 2017.
[15] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, "Applications of deep reinforcement learning in communications and networking: A survey," IEEE Communications Surveys & Tutorials, 2019.
[16] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
[17] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017.
[18] Y. Jiang, J. Fan, T. Chai, J. Li, and F. L. Lewis, "Data-driven flotation industrial process operational optimal control based on reinforcement learning," IEEE Transactions on Industrial Informatics, vol. 14, no. 5, pp. 1974–1989, 2017.
[19] A. Cherubini, R. Passama, A. Crosnier, A. Lasnier, and P. Fraisse, "Collaborative manufacturing with physical human–robot interaction," Robotics and Computer-Integrated Manufacturing, vol. 40, pp. 1–13, 2016.
[20] A. S. Polydoros and L. Nalpantidis, "Survey of model-based reinforcement learning: Applications on robotics," Journal of Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017.
[21] L. Jin, S. Li, J. Yu, and J. He, "Robot manipulator control using neural networks: A survey," Neurocomputing, vol. 285, pp. 23–34, 2018.
[22] I. Mordatch, N. Mishra, C. Eppner, and P. Abbeel, "Combining model-based policy search with online model learning for control of physical humanoids," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 242–248.
[23] S. Kamthe and M. P. Deisenroth, "Data-efficient reinforcement learning with probabilistic model predictive control," arXiv preprint arXiv:1706.06491, 2017.
[24] M.-B. Radac, R.-E. Precup, and R.-C. Roman, "Model-free control performance improvement using virtual reference feedback tuning and reinforcement q-learning," International Journal of Systems Science, vol. 48, no. 5, pp. 1071–1083, 2017.
[25] J. Zhang, H. Zhang, B. Wang, and T. Cai, "Nearly data-based optimal control for linear discrete model-free systems with delays via reinforcement learning," International Journal of Systems Science, vol. 47, no. 7, pp. 1563–1573, 2016.
[26] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, "Information theoretic mpc for model-based reinforcement learning," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1714–1721.
[27] P. Mannion, J. Duggan, and E. Howley, "An experimental review of reinforcement learning algorithms for adaptive traffic signal control," in Autonomic Road Transport Support Systems. Springer, 2016, pp. 47–66.
[28] H. Mirzaei, G. Sharon, S. Boyles, T. Givargis, and P. Stone, "Enhanced delta-tolling: Traffic optimization via policy gradient reinforcement learning," in Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 47–52.
[29] Y. Lin, X. Dai, L. Li, and F.-Y. Wang, "An efficient deep reinforcement learning model for urban traffic control," arXiv preprint arXiv:1808.01876, 2018.
[30] P. Wang and C.-Y. Chan, "Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge," in Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–6.
[31] F. Ruelens, B. J. Claessens, S. Quaiyum, B. De Schutter, R. Babuška, and R. Belmans, "Reinforcement learning applied to an electric water heater: from theory to practice," IEEE Transactions on Smart Grid, vol. 9, no. 4, pp. 3792–3800, 2016.
[32] R. Lu, S. H. Hong, and X. Zhang, "A dynamic pricing demand response algorithm for smart grid: reinforcement learning approach," Applied Energy, vol. 220, pp. 220–230, 2018.
[33] Y. Yang, J. Hao, Y. Zheng, X. Hao, and B. Fu, "Large-scale home energy management using entropy-based collective multiagent reinforcement learning framework," in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 2285–2287.
[34] H. A. Al-Rawi, M. A. Ng, and K.-L. A. Yau, "Application of reinforcement learning to routing in distributed wireless networks: a review," Artificial Intelligence Review, vol. 43, no. 3, pp. 381–416, 2015.
[35] R. Li, Z. Zhao, X. Chen, J. Palicot, and H. Zhang, "Tact: A transfer actor-critic learning framework for energy saving in cellular radio access networks," IEEE Transactions on Wireless Communications, vol. 13, no. 4, pp. 2000–2011, 2014.
[36] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," IEEE Communications Surveys & Tutorials, 2019.
[37] Z. M. Fadlullah, F. Tang, B. Mao, N. Kato, O. Akashi, T. Inoue, and K. Mizutani, "State-of-the-art deep learning: Evolving machine intelligence toward tomorrow's intelligent network traffic control systems," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2432–2455, 2017.
[38] K. Zheng, Z. Yang, K. Zhang, P. Chatzimisios, K. Yang, and W. Xiang, "Big data-driven optimization for mobile networks toward 5g," IEEE Network, vol. 30, no. 1, pp. 44–51, 2016.
[39] T. A. Q. Pham, Y. Hadjadj-Aoul, and A. Outtagarts, "Deep reinforcement learning based qos-aware routing in knowledge-defined networking," in Proceedings of the International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness. Springer, 2018, pp. 14–26.
[40] G. Stampa, M. Arias, D. Sanchez-Charles, V. Muntés-Mulero, and A. Cabellos, "A deep-reinforcement learning approach for software-defined networking routing optimization," arXiv preprint arXiv:1709.07080, 2017.
[41] H. Ye and G. Y. Li, "Deep reinforcement learning for resource allocation in v2v communications," in Proceedings of the IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–6.
[42] X. He, K. Wang, H. Huang, T. Miyazaki, Y. Wang, and S. Guo, "Green resource allocation based on deep reinforcement learning in content-centric iot," IEEE Transactions on Emerging Topics in Computing, 2018.
[43] E. Liotou, D. Tsolkas, N. Passas, and L. Merakos, "Quality of experience management in mobile cellular networks: key issues and design challenges," IEEE Communications Magazine, vol. 53, no. 7, pp. 145–153, 2015.
[44] S. S. Oyewobi, G. P. Hancke, A. M. Abu-Mahfouz, and A. J. Onumanyi, "An effective spectrum handoff based on reinforcement learning for target channel selection in the industrial internet of things," Sensors, vol. 19, no. 6, p. 1395, 2019.
[45] Q. Zhang, M. Lin, L. T. Yang, Z. Chen, and P. Li, "Energy-efficient scheduling for real-time systems based on deep q-learning model," IEEE Transactions on Sustainable Computing, 2017.
[46] B. Demirel, A. Ramaswamy, D. E. Quevedo, and H. Karl, "Deepcas: A deep reinforcement learning algorithm for control-aware scheduling," IEEE Control Systems Letters, vol. 2, no. 4, pp. 737–742, 2018.
