Recurrent Neural Networks for Prediction
Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)
1 Introduction
Artificial neural network (ANN) models have been extensively studied with the aim
of achieving human-like performance, especially in the field of pattern recognition.
These networks are composed of a number of nonlinear computational elements which
operate in parallel and are arranged in a manner reminiscent of biological neural inter-
connections. ANNs are known by many names such as connectionist models, parallel
distributed processing models and neuromorphic systems (Lippmann 1987). The
origin of connectionist ideas can be traced back to the Greek philosopher Aristotle
and his ideas of mental associations. He proposed some of the basic concepts, for
example that memory is composed of simple elements connected to each other via a
number of different mechanisms (Medler 1998).
While early work in ANNs used anthropomorphic arguments to introduce the meth-
ods and models used, today neural networks used in engineering are related to algo-
rithms and computation and do not question how brains might work (Hunt et al.
1992). For instance, recurrent neural networks have been attractive to physicists due
to their isomorphism to spin glass systems (Ermentrout 1998). The following proper-
ties of neural networks make them important in signal processing (Hunt et al. 1992):
they are nonlinear systems; they enable parallel distributed processing; they can be
implemented in VLSI technology; they provide learning, adaptation and data fusion
of both qualitative (symbolic data from artificial intelligence) and quantitative (from
engineering) data; they realise multivariable systems.
The area of neural networks is nowadays considered from two main perspectives.
The first perspective is cognitive science, which is an interdisciplinary study of the
mind. The second perspective is connectionism, which is a theory of information pro-
cessing (Medler 1998). The neural networks in this work are approached from an
engineering perspective, i.e. to make networks efficient in terms of topology, learning
algorithms, ability to approximate functions and capture dynamics of time-varying
systems. From the perspective of connection patterns, neural networks can be grouped
into two categories: feedforward networks, in which graphs have no loops, and recur-
rent networks, where loops occur because of feedback connections. Feedforward net-
works are static, that is, a given input can produce only one set of outputs, and hence
carry no memory. In contrast, recurrent network architectures enable the informa-
tion to be temporally memorised in the networks (Kung and Hwang 1998). Based
on training by example, with strong support of statistical and optimisation theories
(Cichocki and Unbehauen 1993; Zhang and Constantinides 1992), neural networks
are becoming one of the most powerful and appealing nonlinear signal processors for
a variety of signal processing applications. As such, neural networks expand signal
processing horizons (Chen 1997; Haykin 1996b), and can be considered as massively
interconnected nonlinear adaptive filters. Our emphasis will be on dynamics of recur-
rent architectures and algorithms for prediction.
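The distinction between static feedforward maps and recurrent structures with memory can be illustrated with a minimal sketch. The weights, the logistic nonlinearity and the names `feedforward_neuron` and `RecurrentNeuron` are assumptions chosen for illustration only; they are not taken from any particular architecture discussed in this book.

```python
import math


def logistic(v):
    """Logistic sigmoid, mapping the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def feedforward_neuron(x, w=1.5, b=-0.5):
    """Static map: the output depends on the current input only."""
    return logistic(w * x + b)


class RecurrentNeuron:
    """Single neuron whose previous output is fed back as an extra input."""

    def __init__(self, w_in=1.5, w_fb=2.0, b=-0.5):
        self.w_in, self.w_fb, self.b = w_in, w_fb, b
        self.y_prev = 0.0  # internal state, assumed zero initial condition

    def step(self, x):
        y = logistic(self.w_in * x + self.w_fb * self.y_prev + self.b)
        self.y_prev = y  # the fed-back output is the memory of the neuron
        return y


inputs = [1.0, 0.0, 1.0]  # the first and third inputs are identical
static_out = [feedforward_neuron(x) for x in inputs]
rnn = RecurrentNeuron()
recurrent_out = [rnn.step(x) for x in inputs]
```

In this sketch the feedforward neuron returns the same value for the first and third inputs, whereas the recurrent neuron does not, because its output also depends on what it has produced before.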
1.1 Some Important Dates in the History of Connectionism
In the early 1940s the pioneers of the field, McCulloch and Pitts, studied the potential
of the interconnection of a model of a neuron. They proposed a computational model
based on a simple neuron-like element (McCulloch and Pitts 1943). Others, like Hebb,
were concerned with the adaptation laws involved in neural systems. In 1949 Donald
Hebb devised a learning rule for adapting the connections within artificial neurons
(Hebb 1949). A period of early activity extends up to the 1960s with the work of
Rosenblatt (1962) and Widrow and Hoff (1960). In 1958, Rosenblatt coined the name
‘perceptron’. Based upon the perceptron (Rosenblatt 1958), he developed the theory
of statistical separability. The next major development is the new formulation of
learning rules by Widrow and Hoff in their Adaline (Widrow and Hoff 1960). In
1969, Minsky and Papert (1969) provided a rigorous analysis of the perceptron. The
work of Grossberg in 1976 was based on biological and psychological evidence. He
proposed several new architectures of nonlinear dynamical systems (Grossberg 1974)
and introduced adaptive resonance theory (ART), which is a real-time ANN that
performs supervised and unsupervised learning of categories, pattern classification and
prediction. In 1982 Hopfield pointed out that neural networks with certain symmetries
are analogous to spin glasses.
A seminal book on ANNs is by Rumelhart et al. (1986). Fukushima explored com-
petitive learning in his biologically inspired Cognitron and Neocognitron (Fukushima
1975; Widrow and Lehr 1990). In 1971 Werbos developed a backpropagation learn-
ing algorithm which he published in his doctoral thesis (Werbos 1974). Rumelhart
et al. rediscovered this technique in 1986 (Rumelhart et al. 1986). Kohonen (1982)
introduced self-organised maps for pattern recognition (Burr 1993).
1.2 The Structure of Neural Networks
In neural networks, computational models or nodes are connected through weights
that are adapted during use to improve performance. The main idea is to achieve
good performance via dense interconnection of simple computational elements. The
simplest node provides a linear combination of N weights $w_1, \ldots, w_N$ and N inputs
$x_1, \ldots, x_N$, and passes the result through a nonlinearity $\Phi$, as shown in Figure 1.1.
Models of neural networks are specified by the net topology, node characteristics
and training or learning rules. From the perspective of connection patterns, neural
networks can be grouped into two categories: feedforward networks, in which graphs
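have no loops, and recurrent networks, where loops occur because of feedback
connections.
[Figure 1.1 Connections within a node. Inputs $x_1, \ldots, x_N$ with weights $w_1, \ldots, w_N$, together with a bias input +1 with weight $w_0$, are summed ($\Sigma$) and passed through $\Phi$ to give $y = \Phi\bigl(\sum_i x_i w_i + w_0\bigr)$.]
Neural networks are specified by (Tsoi and Back 1997)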
• Node: typically a sigmoid function;
• Layer: a set of nodes at the same hierarchical level;
• Connection: constant weights or weights as a linear dynamical system, feedfor-
ward or recurrent;
• Architecture: an arrangement of interconnected neurons;
• Mode of operation: analogue or digital.
Massively interconnected neural nets provide a greater degree of robustness or fault
tolerance than sequential machines. By robustness we mean that small perturbations
in parameters will also result in small deviations of the values of the signals from their
nominal values.
In our work, hence, the term neuron will refer to an operator which performs the
mapping
$$\text{Neuron}: \mathbb{R}^{N+1} \to \mathbb{R} \qquad (1.1)$$
as shown in Figure 1.1. The equation
$$y = \Phi\Bigl(\sum_{i=1}^{N} w_i x_i + w_0\Bigr) \qquad (1.2)$$
represents a mathematical description of a neuron. The input vector is given by
$\mathbf{x} = [x_1, \ldots, x_N, 1]^T$, whereas $\mathbf{w} = [w_1, \ldots, w_N, w_0]^T$ is referred to as the
weight vector of a neuron. The weight $w_0$ is the weight which corresponds to the bias
input, which is typically set to unity. The function $\Phi : \mathbb{R} \to (0, 1)$ is monotone and
continuous, most commonly of a sigmoid shape. A set of interconnected neurons is a
neural network (NN). If there are N input elements to an NN and M output elements
of an NN, then an NN defines a continuous mapping
$$\text{NN}: \mathbb{R}^{N} \to \mathbb{R}^{M}. \qquad (1.3)$$
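The neuron of Equation (1.2) can be sketched in a few lines. This is a minimal illustration which assumes that the nonlinearity is the logistic function, one common choice with range (0, 1); the function name `neuron` and the numerical values are hypothetical.

```python
import math


def neuron(x, w, w0):
    """Evaluate y = Phi(sum_i w_i * x_i + w_0), with Phi assumed logistic."""
    if len(x) != len(w):
        raise ValueError("x and w must both have N elements")
    v = sum(wi * xi for wi, xi in zip(w, x)) + w0  # weighted sum plus bias weight
    return 1.0 / (1.0 + math.exp(-v))              # squashes v into (0, 1)


# N = 3 inputs; the bias input is fixed at unity, so only its weight w0 appears.
y = neuron(x=[0.5, -1.0, 2.0], w=[0.4, 0.3, 0.1], w0=-0.1)
```

With these values the weighted sum is approximately zero, so the output lies close to the midpoint 0.5 of the logistic function.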
1.3 Perspective
Before the 1920s, prediction was undertaken by simply extrapolating the time series
through a global fit procedure. The beginning of modern time series prediction was
in 1927 when Yule introduced the autoregressive model in order to predict the annual
number of sunspots. For the next half century the models considered were linear, typ-
ically driven by white noise. In the 1980s, the state-space representation and machine
learning, typically by neural networks, emerged as new potential models for prediction
of highly complex, nonlinear and nonstationary phenomena. This was the shift from
rule-based models to data-driven methods (Gershenfeld and Weigend 1993).
Time series prediction has traditionally been performed by the use of linear para-
metric autoregressive (AR), moving-average (MA) or autoregressive moving-average
(ARMA) models (Box and Jenkins 1976; Ljung and Soderstrom 1983; Makhoul 1975),
the parameters of which are estimated either in a block or a sequential manner with
the least mean square (LMS) or recursive least-squares (RLS) algorithms (Haykin
1994). An obvious problem is that these processors are linear and are not able to
cope with certain nonstationary signals, and signals whose mathematical model is
not linear. On the other hand, neural networks are powerful when applied to prob-
lems whose solutions require knowledge which is difficult to specify, but for which
there is an abundance of examples (Dillon and Manikopoulos 1991; Gent and Shep-
pard 1992; Townshend 1991). As time series prediction is conventionally performed
entirely by inference of future behaviour from examples of past behaviour, it is a suit-
able application for a neural network predictor. The neural network approach to time
series prediction is non-parametric in the sense that it does not need to know any
information regarding the process that generates the signal. For instance, the order
and parameters of an AR or ARMA process are not needed in order to carry out the
prediction. This task is carried out by a process of learning from examples presented
to the network and changing network weights in response to the output error.
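For comparison, the classical linear approach can be sketched as a one-step AR predictor whose coefficients are adapted sequentially by the LMS algorithm. This is a minimal sketch under stated assumptions: the AR(2) test signal, its coefficients, the step size `mu` and the function name `lms_ar_predictor` are all chosen here for illustration.

```python
import random


def lms_ar_predictor(signal, order=2, mu=0.5):
    """One-step linear prediction with LMS-adapted weights."""
    w = [0.0] * order
    errors = []
    for k in range(order, len(signal)):
        past = [signal[k - i] for i in range(1, order + 1)]  # y(k-1), ..., y(k-p)
        prediction = sum(wi * xi for wi, xi in zip(w, past))
        e = signal[k] - prediction                           # output error
        w = [wi + mu * e * xi for wi, xi in zip(w, past)]    # LMS weight update
        errors.append(e)
    return w, errors


# Synthetic stable AR(2) process driven by white Gaussian noise (assumed test signal).
random.seed(0)
a1, a2 = 1.2, -0.5
y = [0.0, 0.0]
for _ in range(5000):
    y.append(a1 * y[-1] + a2 * y[-2] + random.gauss(0.0, 0.1))

w, errors = lms_ar_predictor(y)
```

Because the signal here really is a linear AR(2) process, the adapted weights should drift towards `a1` and `a2`. Note that the predictor order still has to be chosen in advance, which is the kind of prior knowledge the text above says the neural network approach aims to avoid.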
Li (1992) has shown that the recurrent neural network (RNN) with a sufficiently
large number of neurons is a realisation of the nonlinear ARMA (NARMA) process.
RNNs performing NARMA prediction have traditionally been trained by the real-time
recurrent learning (RTRL) algorithm (Williams and Zipser 1989a), which trains the
RNN 'on the run'. However, for a complex physical process, RNNs encounter several
difficulties: the high degree of approximation involved in the RTRL algorithm for a
high-order MA part of the underlying NARMA process; a high computational
complexity of $O(N^4)$, with N being the number of neurons in the RNN; an insufficient
degree of nonlinearity; and relatively low robustness. These difficulties induced a
search for other, more suitable schemes for RNN-based predictors.
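For reference, the NARMA process mentioned above is commonly written in the following form. The symbols $h$ (a nonlinear function), $e(k)$ (the driving noise), and the orders $p$ and $q$ of the AR and MA parts are notation assumed here for illustration.

```latex
% NARMA(p, q): p is the order of the AR part, q the order of the MA part (assumed notation)
y(k) = h\bigl(y(k-1), \ldots, y(k-p),\; e(k-1), \ldots, e(k-q)\bigr) + e(k)

% The linear ARMA(p, q) model is recovered when h is a linear combination of its arguments:
y(k) = \sum_{i=1}^{p} a_i\, y(k-i) + \sum_{j=1}^{q} b_j\, e(k-j) + e(k)
```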
In addition, in time series prediction of nonlinear and nonstationary signals, there
is a need to learn long-time temporal dependencies. This is rather difficult with con-
ventional RNNs because of the problem of vanishing gradient (Bengio et al. 1994).
A solution to that problem might be NARMA models and nonlinear autoregressive
moving average models with exogenous inputs (NARMAX) (Siegelmann et al. 1997)
realised by recurrent neural networks. However, the quality of performance is highly
dependent on the order of the AR and MA parts in the NARMAX model.
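The vanishing gradient problem can be illustrated on a single recurrent neuron. The sketch below assumes the toy model y(k) = Φ(w y(k−1) + b) with a logistic Φ and tracks how strongly the current output still depends on the initial state; the model, its parameter values and the function name `state_sensitivity` are assumptions made for illustration.

```python
import math


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))


def state_sensitivity(w, b, y0, steps):
    """Return |dy(k)/dy(0)| for k = 1..steps, where y(k) = logistic(w*y(k-1) + b)."""
    y, grad, history = y0, 1.0, []
    for _ in range(steps):
        y = logistic(w * y + b)
        # Chain rule: dy(k)/dy(k-1) = w * Phi'(v), and Phi'(v) = y(k) * (1 - y(k)).
        grad *= w * y * (1.0 - y)
        history.append(abs(grad))
    return history


sens = state_sensitivity(w=2.0, b=-1.0, y0=0.3, steps=30)
```

Since the slope of the logistic function never exceeds 1/4, each factor in the product is at most 0.5 for `w = 2.0`, so the sensitivity decays at least geometrically with the number of time steps. This is why gradient information about events far in the past becomes too small to drive learning.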
The main reasons for using neural networks for prediction rather than classical time
series analysis are (Wu 1995)
• they are computationally at least as fast, if not faster, than most available
statistical techniques;
• they are self-monitoring (i.e. they learn how to make accurate predictions);
• they are as accurate if not more accurate than most of the available statistical
techniques;
• they provide iterative forecasts;
• they are able to cope with nonlinearity and nonstationarity of input processes;
• they offer both parametric and nonparametric prediction.
1.4 Neural Networks for Prediction: Perspective
Many signals are generated from an inherently nonlinear physical mechanism and have
statistically non-stationary properties, a classic example of which is speech. Linear
structure adaptive filters are suitable for the nonstationary characteristics of such
signals, but they do not account for nonlinearity and associated higher-order statistics
(Shynk 1989). Adaptive techniques which recognise the nonlinear nature of the signal
should therefore outperform traditional linear adaptive filtering techniques (Haykin
1996a; Kay 1993). The classic approach to time series prediction is to undertake an
analysis of the time series data, which includes modelling, identification of the model
and model parameter estimation phases (Makhoul 1975). The design may be iterated
by measuring the closeness of the model to the real data. This can be a long process,
often involving the derivation, implementation and refinement of a number of models
before one with appropriate characteristics is found.
In particular, the most difficult systems to predict are
• those with non-stationary dynamics, where the underlying behaviour varies with
time, a typical example of which is speech production;
• those which deal with physical data which are subject to noise and experimen-
tation error, such as biomedical signals;
• those which deal with short time series, providing few data points on which to
conduct the analysis, such as heart rate signals, chaotic signals and meteorolog-
ical signals.
In all these situations, traditional techniques are severely limited and alternative
techniques must be found (Bengio 1995; Haykin and Li 1995; Li and Haykin 1993;
Niranjan and Kadirkamanathan 1991).
On the other hand, neural networks are powerful when applied to problems whose
solutions require knowledge which is difficult to specify, but for which there is an
abundance of examples (Dillon and Manikopoulos 1991; Gent and Sheppard 1992;
Townshend 1991). From a system theoretic point of view, neural networks can be
considered as a conveniently parametrised class of nonlinear maps (Narendra 1996).
There has been a recent resurgence in the field of ANNs caused by new net topolo-
gies, VLSI computational algorithms and the introduction of massive parallelism into
neural networks. As such, they are both universal function approximators (Cybenko
1989; Hornik et al. 1989) and arbitrary pattern classifiers. From the Weierstrass The-
orem, it is known that polynomials, and many other approximation schemes, can
approximate arbitrarily well a continuous function. Kolmogorov’s theorem (a neg-
ative solution of Hilbert’s 13th problem (Lorentz 1976)) states that any continuous
function can be approximated using only linear summations and nonlinear but contin-
uously increasing functions of only one variable. This makes neural networks suitable
for universal approximation, and hence prediction. Although sometimes computation-
ally demanding (Williams and Zipser 1995), neural networks have found their place
in the area of nonlinear autoregressive moving average (NARMA) (Bailer-Jones et
al. 1998; Connor et al. 1992; Lin et al. 1996) prediction applications. Comprehensive
survey papers on the use and role of ANNs can be found in Widrow and Lehr (1990),
Lippmann (1987), Medler (1998), Ermentrout (1998), Hunt et al. (1992) and Billings
(1980).
Only recently have neural networks been considered for prediction. A recent
competition by the Santa Fe Institute for Studies in the Science of Complexity (1991–1993)
(Weigend and Gershenfeld 1994) showed that neural networks can outperform
conventional linear predictors in a number of applications (Waibel et al. 1989). In journals,
there has been an ever increasing interest in applying neural networks. A most
comprehensive issue on recurrent neural networks is the issue of the IEEE Transactions on
Neural Networks, vol. 5, no. 2, March 1994. In the signal processing community, there
has been a recent special issue ‘Neural Networks for Signal Processing’ of the IEEE
Transactions on Signal Processing, vol. 45, no. 11, November 1997, and also the issue
‘Intelligent Signal Processing’ of the Proceedings of the IEEE, vol. 86, no. 11, November
1998, both dedicated to the use of neural networks in signal processing applications.
Figure 1.2 shows the frequency of the appearance of articles on recurrent neural
networks in common citation index databases. Figure 1.2(a) shows the number of journal
and conference articles on recurrent neural networks in IEE/IEEE publications between
1988 and 1999. The data were gathered using the IEL Online service, and these
publications are mainly periodicals and conferences in electronics engineering. Figure 1.2(b)
shows the frequency of appearance for the BIDS/ATHENS database, between 1988 and
2000,¹ which also includes non-engineering publications. From Figure 1.2, there is a
clear growing trend in the frequency of appearance of articles on recurrent neural
networks. Therefore, we felt that there was a need for a research monograph that
would cover a part of the area with up to date ideas and results.
1.5 Structure of the Book
The book is divided into 12 chapters and 10 appendices. An introduction to connec-
tionism and the notion of neural networks for prediction is included in Chapter 1. The
fundamentals of adaptive signal processing and learning theory are detailed in Chap-
ter 2. An initial overview of network architectures for prediction is given in Chapter 3.
¹ At the time of writing, only the months up to September 2000 were covered.
[Figure 1.2 Appearance of articles on RNNs in major citation databases (axes: year against number of journal and conference papers). (a) Appearance of articles on recurrent neural networks in IEE/IEEE publications (via IEL) in period 1988–1999. (b) Appearance of articles on recurrent neural networks in BIDS database in period 1988–2000.]
Chapter 4 contains a detailed discussion of activation functions and new insights are
provided by the consideration of neural networks within the framework of modu-
lar groups from number theory. The material in Chapter 5 builds upon that within
Chapter 3 and provides more comprehensive coverage of recurrent neural network
architectures together with concepts from nonlinear system modelling. In Chapter 6,
neural networks are considered as nonlinear adaptive filters whereby the necessary
learning strategies for recurrent neural networks are developed. The stability issues
for certain recurrent neural network architectures are considered in Chapter 7 through
the exploitation of fixed point theory and bounds for global asymptotic stability are
derived. A posteriori adaptive learning algorithms are introduced in Chapter 8 and
the synergy with data-reusing algorithms is highlighted. In Chapter 9, a new class
of normalised algorithms for online training of recurrent neural networks is derived.
The convergence of online learning algorithms for neural networks is addressed in
Chapter 10. Experimental results for the prediction of nonlinear and nonstationary
signals with recurrent neural networks are presented in Chapter 11. In Chapter 12,
the exploitation of inherent relationships between parameters within recurrent neural
networks is described. Appendices A to J provide background to the main chapters
and cover key concepts from linear algebra, approximation theory, complex sigmoid
activation functions, a precedent learning algorithm for recurrent neural networks, ter-
minology in neural networks, a posteriori techniques in science and engineering, con-
traction mapping theory, linear relaxation and stability, stability of general nonlinear
systems and deseasonalising of time series. The book concludes with a comprehensive
bibliography.
1.6 Readership
This book is targeted at graduate students and research engineers active in the areas
of communications, neural networks, nonlinear control, signal processing and time
series analysis. It will also be useful for engineers and scientists working in diverse
application areas, such as artificial intelligence, biomedicine, earth sciences, finance
and physics.