
3.2 Extreme Learning Machine (ELM)

Unlike popular training methods for SLFNs, the basic concept of the ELM algorithm is that the input weights and hidden layer biases are randomly assigned, and the output weights of the SLFN can then be analytically determined by a simple inverse operation on the hidden layer output matrix. Clearly, an SLFN with Ñ hidden units can approximate N patterns with zero error when N = Ñ. This means there exist w, α, and b

such that

t_ji = o_ji = h_j · α_i, j = 1, …, N, i = 1, …, c. (3.3)

This equation can be written as

HA=T, (3.4)

where H is called the hidden layer output matrix of the SLFN and is defined as [37, 50]

\mathbf{H} =
\begin{bmatrix} \mathbf{h}_1^{T} \\ \vdots \\ \mathbf{h}_N^{T} \end{bmatrix}
=
\begin{bmatrix}
f(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{w}_{\tilde N} \cdot \mathbf{x}_1 + b_{\tilde N}) \\
\vdots & \ddots & \vdots \\
f(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & f(\mathbf{w}_{\tilde N} \cdot \mathbf{x}_N + b_{\tilde N})
\end{bmatrix}_{N \times \tilde N}
\qquad (3.5)

T = [t_1, t_2, …, t_N]^T (3.6)

and

A = [α_1, α_2, …, α_c]. (3.7)
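To make the notation of (3.5)-(3.7) concrete, the following NumPy sketch builds H and T for a small randomly generated data set (the function name, the tanh activation, and the toy dimensions are illustrative choices of this sketch, not part of the thesis):

```python
import numpy as np

def hidden_output_matrix(X, W, b, f=np.tanh):
    """Build the N x Ñ hidden layer output matrix H of equation (3.5).

    X : (N, p)  input patterns x_1, ..., x_N
    W : (Ñ, p)  randomly assigned input weights w_1, ..., w_Ñ
    b : (Ñ,)    randomly assigned hidden biases b_1, ..., b_Ñ
    f : activation function, applied element-wise
    """
    # Row j of H is h_j = [f(w_1·x_j + b_1), ..., f(w_Ñ·x_j + b_Ñ)]
    return f(X @ W.T + b)

# Toy example: N = 5 patterns, p = 3 inputs, c = 2 outputs, Ñ = 4 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
T = rng.normal(size=(5, 2))          # target matrix T of equation (3.6)
W = rng.uniform(-1, 1, size=(4, 3))  # random input weights
b = rng.uniform(-1, 1, size=4)       # random hidden biases
H = hidden_output_matrix(X, W, b)    # H has shape (5, 4); A in (3.7) is 4 x 2
```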

The following significant results were proven by Huang et al. [10]:

Lemma 3.1 ([10]) Given a standard SLFN with N hidden units and an activation function f in R which is infinitely differentiable in any interval, for N distinct patterns {(xj, tj) | xj∈Rp, tj∈Rc, j=1,…,N}, for any wm and bm randomly assigned from any intervals of Rp and R, respectively, according to any continuous probability distribution, with probability one the hidden layer output matrix H of the SLFN is invertible (full rank) and ||HA−T||=0.

Lemma 3.2 ([10]) Given any small positive value ε>0 and an activation function f in R which is infinitely differentiable in any interval, there exists Ñ ≤ N such that for N arbitrary distinct patterns {(xj, tj) | xj∈Rp, tj∈Rc, j=1,…,N}, for any wm and bm randomly assigned from any intervals of Rp and R, respectively, according to any continuous probability distribution, with probability one ||HA−T||<ε.

These two lemmas show that the input weights and hidden layer biases do not need to be tuned. With a random choice of them and an activation function f that is infinitely differentiable in any interval of R, the number of hidden units required for an SLFN to learn N arbitrary distinct patterns with arbitrarily small error is at most N. For fixed input weights and biases, equation (3.4) shows that training the SLFN is simply equivalent to estimating A from the linear system HA = T.

There are many methods for solving this problem, which were reviewed in chapter 2 (section 2.1). They depend on the optimality criterion and on the assumptions about the distribution of the error as well as the distribution of the output weights A. However, in the ELM algorithm, Huang et al. [10] proposed the minimum norm least-squares solution Â of the linear system (3.4). This means

||HÂ − T|| = min_A ||HA − T||, (3.8)

and a solution Â is said to be a minimum norm least-squares solution of a linear system if it has the smallest norm among all the least-squares solutions.

If the number of hidden units equals the number N of distinct training patterns, the hidden layer output matrix H is invertible according to Lemma 3.1; therefore the solution of (3.4) is simply A = H^{-1}T. However, in most applications the number of hidden units is much smaller than the number of distinct training patterns, so H is a non-square matrix and cannot be inverted directly. In order to find a minimum norm least-squares solution, we first review the following results:

Definition 3.1 (Moore-Penrose generalized inverse [51, 52]). A matrix G is the Moore-Penrose generalized inverse of matrix H if

HGH = H, GHG = G, (HG)^T = HG, (GH)^T = GH.

The Moore-Penrose (MP) generalized inverse of matrix H is denoted as H†.

Lemma 3.3 ([51, 52]). Assume there exists a matrix G such that GT is a minimum norm least-squares solution of the linear system HA = T. Then it is necessary and sufficient that G = H†, the MP generalized inverse of matrix H.

Thus, the smallest norm least-squares solution of (3.4) is given by

Â = H†T. (3.9)

Note that we already know the estimation of parameters in a linear system by the least-squares solution shown in chapter 2 (section 2.1c), in which the estimate is given by equation (2.17). From (2.17) and (3.9) we have H† = (H^T H)^{-1} H^T; this is an orthogonal projection method, which is also used in [53]. However, H^T H is not always nonsingular and may tend to be singular in some applications, so this method may not perform well in all cases. There are other methods for calculating the MP generalized inverse, such as orthogonalization methods, iterative methods, and singular value decomposition (SVD), of which SVD can be used to calculate the MP generalized inverse in all cases.
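As an illustration of the SVD route, a minimal sketch of computing H† from the decomposition H = UΣV^T is given below (the function name and tolerance are assumptions of this sketch; in practice numpy.linalg.pinv performs the same computation):

```python
import numpy as np

def pinv_svd(H, tol=1e-12):
    """Moore-Penrose inverse via SVD: H = U S V^T  =>  H^+ = V S^+ U^T.

    Works even when H^T H is singular, unlike the (H^T H)^{-1} H^T formula.
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s_inv = np.where(s > tol * s.max(), 1.0 / s, 0.0)  # invert only non-negligible singular values
    return Vt.T @ (s_inv[:, None] * U.T)

# numpy.linalg.pinv(H) computes the same quantity with a similar cutoff rule.
```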

We can summarize the extreme learning machine (ELM) algorithm as follows:

Algorithm ELM: Given a training set S = {(xj, tj) | j = 1, …, N}, an activation function f(x), and the number of hidden nodes Ñ:

- Randomly assign the input weights wm and biases bm, m = 1, 2, …, Ñ.

- Determine the output matrix H of the hidden layer by equation (3.5).

- Determine the output weight matrix A by equation (3.9).
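A compact NumPy sketch of these three steps, assuming a tanh activation and uniform random initialization on [−1, 1] (both illustrative choices of this sketch, not prescribed by the algorithm):

```python
import numpy as np

def elm_train(X, T, n_hidden, f=np.tanh, seed=0):
    """ELM training: random (w_m, b_m), then A = H^+ T (equation (3.9))."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(n_hidden, X.shape[1]))  # step 1: random input weights
    b = rng.uniform(-1, 1, size=n_hidden)                # step 1: random hidden biases
    H = f(X @ W.T + b)                                   # step 2: hidden layer output matrix (3.5)
    A = np.linalg.pinv(H) @ T                            # step 3: minimum norm least-squares solution (3.9)
    return W, b, A

def elm_predict(X, W, b, A, f=np.tanh):
    """Network output: o_j = h_j A for each input pattern x_j."""
    return f(X @ W.T + b) @ A
```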

We see that this algorithm can reduce the learning time by avoiding iterative tuning of the input weights and hidden-layer biases. In addition, it can also obtain good generalization performance in many applications. When the whole training set is not available, a development of ELM called the online sequential extreme learning machine (OS-ELM) was proposed by N.Y. Liang et al. [54]. It is an online sequential learning algorithm for SLFNs based on ELM that can learn data one-by-one or block-by-block. In OS-ELM, the input weights and hidden layer biases are also randomly chosen, and the output weights are updated as new data arrive.
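As a rough sketch of such an update in the recursive least-squares form on which OS-ELM is based (the function names and the chunk interface are assumptions of this sketch, not code from [54]):

```python
import numpy as np

def os_elm_init(H0, T0):
    """Initial batch: P_0 = (H_0^T H_0)^{-1}, A_0 = P_0 H_0^T T_0 (H_0 assumed to have full column rank)."""
    P = np.linalg.inv(H0.T @ H0)
    A = P @ H0.T @ T0
    return P, A

def os_elm_update(P, A, Hk, Tk):
    """Fold one new chunk (H_k, T_k) into the output weights without revisiting earlier data."""
    K = np.eye(Hk.shape[0]) + Hk @ P @ Hk.T
    P = P - P @ Hk.T @ np.linalg.solve(K, Hk @ P)
    A = A + P @ Hk.T @ (Tk - Hk @ A)
    return P, A
```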

Because of the random selection, the input weights and hidden layer biases in both the ELM and OS-ELM algorithms might be non-optimal, which tends to require more hidden units than conventional tuning-based algorithms in many applications. An approach called the evolutionary extreme learning machine (E-ELM) was proposed by Q.-Y. Zhu et al. to overcome this problem [55].
