A Textbook of Computer Based Numerical and Statistical Techniques, Part 5


The basic operations specified by IEEE arithmetic are first and foremost addition, subtraction, multiplication, and division. Square roots and remainders are also included. The default rounding for these operations is "to nearest even". This means that the floating-point result fl(a op b) of the exact operation (a op b) is the nearest floating-point number to (a op b), breaking ties by rounding to the floating-point number whose bottom bit is zero (the "even" one). It is also possible to round up, round down, or truncate (round towards zero). Rounding up and down are useful for interval arithmetic, which can provide guaranteed error bounds; unfortunately, most languages and/or compilers provide no access to the status flag that selects the rounding direction. When the result of a floating-point operation is not representable as a normalized floating-point number, an exception occurs.
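The tie-breaking rule can be tried out directly. The following is a minimal sketch, assuming Python's standard `decimal` module as a stand-in for a 4-digit decimal machine; the helper name `add_rounded` and the sample operands are illustrative choices, not taken from the text. The last two lines show the same rule acting on IEEE double-precision binary numbers.

```python
from decimal import Decimal, Context
from decimal import ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

def add_rounded(a, b, rounding, digits=4):
    """Round the exact sum a + b to `digits` significant digits (hypothetical helper)."""
    ctx = Context(prec=digits, rounding=rounding)
    return ctx.add(Decimal(a), Decimal(b))

# 1.2340 + 0.0005 = 1.2345 lies exactly halfway between 1.234 and 1.235.
tie = ("1.2340", "0.0005")
nearest_even = add_rounded(*tie, ROUND_HALF_EVEN)  # tie goes to the even last digit
rounded_up = add_rounded(*tie, ROUND_CEILING)      # round up (towards +infinity)
rounded_down = add_rounded(*tie, ROUND_FLOOR)      # round down (towards -infinity)
truncated = add_rounded(*tie, ROUND_DOWN)          # truncate (towards zero)
print(nearest_even, rounded_up, rounded_down, truncated)

# Doubles carry 53 significant bits, so 2**53 + 1 is a tie between 2**53 and
# 2**53 + 2; it is resolved towards the neighbour whose last bit is zero.
print(float(2**53) + 1.0 == float(2**53))
print(float(2**53) + 3.0 == float(2**53 + 4))
```

Only the rounding mode changes between the four calls, which mirrors the idea of a status flag selecting the rounding direction.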

1.8 FLOATING POINT ARITHMETIC AND THEIR COMPUTATION

The computer performs four basic arithmetic operations: addition, subtraction, multiplication and division. Decimal numbers are converted to machine numbers. A machine number is written with the digits of some base (only 0 and 1 in the binary case), and the base depends on the computer. If the base is two the number system is called the binary number system, if the base is eight it is called the octal number system, and if the base is sixteen it is called the hexadecimal number system. The decimal number system has base 10. In numerical computation, there are mainly two types of arithmetic operations:

(a) Integer arithmetic, which deals with integer operands and

(b) Real or floating-point arithmetic, which deals with operands that have a fractional part.

Most computers carry out scientific calculations in floating-point arithmetic to avoid the difficulty of keeping every number less than 1 in magnitude during computation. A floating-point number is characterized by three parameters: the base b, the number of digits n and the exponent range (m, M).

An n-digit floating-point number with base b has the form:

$$x = \pm (0.d_1 d_2 \dots d_n)_b \times b^e$$

where $d_1, d_2, d_3, \dots, d_n$ are integers satisfying $0 \le d_i < b$ and the exponent e is such that $m \le e < M$. Also, $(0.d_1 d_2 d_3 \dots d_n)_b$ is a b-fraction called the mantissa, and it lies between +1 and –1. The number 0 is written as:

$$+0.000\dots0 \times b^e$$

The floating-point number is said to be normalized if $d_1 \ne 0$, or else $d_1 = d_2 = \dots = d_n = 0$. If $d_1, d_n \ne 0$ the number is said to have n significant digits.
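For comparison, the three parameters of the binary floating-point format used by a Python interpreter can be read from the standard `sys.float_info` structure. This is only an illustration on the machine running the code; the values in the comment assume IEEE double precision, which is what most current platforms provide.

```python
import sys

info = sys.float_info
base = info.radix                          # the base b
digits = info.mant_dig                     # number of base-b digits n in the mantissa
exp_range = (info.min_exp, info.max_exp)   # exponent range of the format

print(base, digits, exp_range)   # IEEE double precision: 2 53 (-1021, 1024)
print(info.min, info.max)        # smallest and largest positive normalized numbers
```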

There are two commonly used ways to translate a given real number x into an n-digit floating-point number $f_p(x)$ in base b: rounding and chopping.

A floating-point number $x = \pm(0.d_1 d_2 \dots d_n)_b \times b^e$ is in n-digit mantissa standard form if it is normalized and its mantissa consists of exactly n digits. If a number x can be represented by $x = (0.d_1 d_2 d_3 \dots d_n d_{n+1} \dots)_b \times b^e$, then its floating-point number in chopping form is

$$f_p(x) = (0.d_1 d_2 d_3 \dots d_n)_b \times b^e$$

that is, the digits after $d_n$ are simply dropped. In rounding form, x is first written as

$$(0.d_1 d_2 d_3 \dots d_n (d_{n+1} + \tfrac{1}{2} b))_b \times b^e$$

and the first n digits of this number (after any carry) are used to write the floating-point number.
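The two translation rules can be sketched with Python's `decimal` module, here applied to the fraction 2/3. The helper name `fp`, the choice of base 10 and the 7-digit mantissa are assumptions made for illustration.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP

def fp(numerator, denominator, digits, mode="chop"):
    """n-digit decimal floating-point value of numerator/denominator (hypothetical helper).

    mode="chop" drops every digit after the n-th;
    mode="round" adds half a unit in the n-th place first (round half up).
    """
    rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
    ctx = Context(prec=digits, rounding=rounding)
    return ctx.divide(Decimal(numerator), Decimal(denominator))

chopped = fp(2, 3, 7, "chop")
rounded = fp(2, 3, 7, "round")
print(chopped, rounded)   # 0.6666666 0.6666667
```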


Example 1 The normalized form of $\frac{2}{3}$ after chopping is

$$f_p\left(\tfrac{2}{3}\right) = 0.6666666$$

In computers, each memory location, called a word, stores only a finite number of digits.

Suppose that each location stores 6 digits and also one or more signs. To represent real numbers, the computer can assume a fixed position for the decimal point, so that all numbers are stored after appropriate shifting with an assumed decimal point. In that case the largest number that can be stored is 9999.99 and the smallest is 0000.01, both in magnitude. To preserve the maximum number of significant digits in a real number and to increase the range of values it can take, a different representation is used, called the normalized floating-point mode.

Example 2 The number 58.72 × 10^5 is represented as 0.5872 × 10^7 or 0.5872e7.

Sol. Here the mantissa is 0.5872 and the exponent is 7. Shifting the mantissa to the left until its most significant digit is nonzero is called normalization.
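Normalization can be sketched in a few lines. The function below is a hypothetical helper, assuming base 10 and chopping of any digits beyond the mantissa length; it relies on `Decimal.as_tuple()` from the standard library to obtain the digits and the power of ten.

```python
from decimal import Decimal

def normalize(value, digits=4):
    """Write `value` as 0.d1d2...dn e<exponent> with d1 != 0 (extra digits chopped)."""
    sign, digit_tuple, exp = Decimal(value).as_tuple()
    all_digits = "".join(str(d) for d in digit_tuple)
    if set(all_digits) == {"0"}:                      # zero has an all-zero mantissa
        return "0." + "0" * digits + "e0"
    exponent = len(all_digits) + exp                  # position of the decimal point
    mantissa = (all_digits + "0" * digits)[:digits]   # pad with zeros, then chop
    return ("-" if sign else "") + "0." + mantissa + "e" + str(exponent)

print(normalize("58.72e5"))                 # 0.5872e7
print(normalize("0.0028e3"))                # 0.2800e1
print(normalize("0.00011101e7", digits=8))  # 0.11101000e4
```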

1.8.1 Arithmetic Operations on Floating Point Numbers

There are four basic arithmetic operations: addition, subtraction, multiplication and division. These operations are applied to floating-point numbers as follows:

Example 3 Add the following floating-point numbers 0.4546e3 and 0.5433e7.

Sol. The operands have unequal exponents. To add these floating-point numbers, keep the operand with the larger exponent and shift the other to the same exponent:

0.5433e7 + 0.0000e7 = 0.5433e7

(because 0.4546e3, written with exponent 7 and a four-digit mantissa, becomes 0.0000e7).

Sol. The operands have equal exponents, but on adding we get 1.1279e3; that is, the mantissa has 5 digits and is greater than 1, so it is shifted right one place. Hence the result is 0.1127e4.

Sol. In this example the mantissa is shifted right and the exponent is increased by 1, resulting in a value of 100 for the exponent (because the sum of the mantissas exceeds 1). This is called an overflow condition, because the exponent cannot store more than two digits.

Example 6 Find the sum of 0.123e3 and 0.456e2 and write the result in three-digit mantissa form.

Sol. Sum = 0.123e3 + 0.456e2 = 0.123e3 + 0.0456e3 = 0.1686e3, which gives

0.168e3 (result after chopping),

0.169e3 (result after rounding).

Examples 3 to 6 above show the addition of floating-point numbers in different situations.
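The addition procedure used in these examples can be sketched as follows. This is a minimal model under stated assumptions: mantissas are passed as integers (4546 stands for 0.4546), operands are positive and normalized, and the exponent is limited to 99. The name `fadd` is hypothetical. The sketch forms the exact sum and then chops or rounds once, which for positive operands gives the same digits as chopping the shifted operand first.

```python
def fadd(m1, e1, m2, e2, digits=4, mode="chop", max_exp=99):
    """Add 0.m1 x 10^e1 and 0.m2 x 10^e2; return (mantissa, exponent) of the sum."""
    if e1 < e2:                           # make operand 1 the one with the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    total = m1 * 10 ** (e1 - e2) + m2     # exact sum in units of 10^(e2 - digits)
    excess = len(str(total)) - digits     # digits that do not fit in the mantissa
    scale = 10 ** excess
    if mode == "round":
        total += scale // 2               # add half a unit of the last kept digit
    mantissa, exponent = total // scale, e2 + excess
    if mantissa == 10 ** digits:          # rounding carried into an extra digit
        mantissa, exponent = mantissa // 10, exponent + 1
    if exponent > max_exp:
        raise OverflowError(f"exponent {exponent} exceeds {max_exp}")
    return mantissa, exponent

print(fadd(4546, 3, 5433, 7))                        # Example 3: (5433, 7)
print(fadd(123, 3, 456, 2, digits=3))                # Example 6, chopping: (168, 3)
print(fadd(123, 3, 456, 2, digits=3, mode="round"))  # Example 6, rounding: (169, 3)
```

A sum whose exponent would reach 100, such as `fadd(6000, 99, 5000, 99)`, raises `OverflowError` in this sketch.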


Example 7 Subtract the floating-point number 0.36132346 × 10^7 from 0.36143447 × 10^7.

Sol. Subtracting 0.36132346 × 10^7 from 0.36143447 × 10^7 gives 0.00011101 × 10^7. On shifting the fractional part three places to the left we have 0.11101 × 10^4, which is a normalized floating-point number. The number 0.00011101 × 10^7 is also a floating-point number, but not in normalized form.

Example 8 Subtract the following floating-point numbers:

1 0.5424e–99 from 0.5452e–99

2 0.3862e–7 from 0.9682e–7

Sol. 1 On subtracting we get 0.0028e–99. This is a floating-point number, but not in normalized form. To convert it to normalized form, shift the mantissa two places to the left, which gives 0.2800e–101. The exponent is now below –99 and cannot be stored in two digits; this is called an underflow condition.

2 Similarly, after subtraction we get 0.5820e–7.

Examples 7 and 8 above show the subtraction of floating-point numbers, including the underflow condition. In general, to add or subtract two numbers in normalized floating-point notation, their exponents must be equal; if they are not, they are made equal by shifting the mantissa of one operand appropriately.
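Subtraction with normalization and an underflow check can be sketched in the same style. The names `fsub` and `UnderflowError`, the integer mantissa convention (5452 stands for 0.5452) and the lower exponent limit of –99 are assumptions made for this illustration.

```python
class UnderflowError(ArithmeticError):
    """Raised when the normalized exponent falls below the smallest allowed value."""

def fsub(m1, e1, m2, e2, digits=4, min_exp=-99):
    """Return (mantissa, exponent) of 0.m1 x 10^e1 minus 0.m2 x 10^e2."""
    # Align exponents: shift the operand with the smaller exponent to the right,
    # chopping the digits that fall off the end of the mantissa.
    if e1 >= e2:
        m2, exponent = m2 // 10 ** (e1 - e2), e1
    else:
        m1, exponent = m1 // 10 ** (e2 - e1), e2
    diff = m1 - m2
    if diff == 0:
        return 0, 0
    sign, diff = (-1 if diff < 0 else 1), abs(diff)
    while diff < 10 ** (digits - 1):      # normalize: shift left until d1 != 0
        diff *= 10
        exponent -= 1
    if exponent < min_exp:
        raise UnderflowError(f"exponent {exponent} is below {min_exp}")
    return sign * diff, exponent

print(fsub(36143447, 7, 36132346, 7, digits=8))  # Example 7: (11101000, 4)
print(fsub(9682, -7, 3862, -7))                  # Example 8, part 2: (5820, -7)
```

Calling `fsub(5452, -99, 5424, -99)` raises `UnderflowError` here, because normalizing 0.0028e–99 pushes the exponent to –101.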

Example 9 Multiply the following floating point numbers:

1 0.1111e74 and 0.2000e80

2 0.1234e–49 and 0.1111e–54

Sol. 1 On multiplying 0.1111e74 × 0.2000e80 we have 0.2222e153. This shows the overflow condition of normalized floating-point numbers.

2 Similarly, the second multiplication gives 0.1370e–104, which shows the underflow condition of floating-point numbers.

This example shows that two numbers are multiplied by multiplying their mantissas and adding their exponents. Similarly, division is carried out by dividing the mantissa of the numerator by that of the denominator and subtracting the denominator exponent from the numerator exponent. In both cases the resulting mantissa is then normalized and the exponent adjusted accordingly.
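The multiplication and division rules can be sketched as below. As before, this is an illustrative model rather than a definitive implementation: mantissas are integers (1111 stands for 0.1111), operands are normalized and nonzero, results are chopped, and the exponent range –99 to 99 is assumed. The names `fmul`, `fdiv`, `check_range` and `UnderflowError` are hypothetical.

```python
class UnderflowError(ArithmeticError):
    """Raised when the exponent falls below the smallest allowed value."""

def check_range(mantissa, exponent, min_exp=-99, max_exp=99):
    if exponent > max_exp:
        raise OverflowError(f"exponent {exponent} exceeds {max_exp}")
    if exponent < min_exp:
        raise UnderflowError(f"exponent {exponent} is below {min_exp}")
    return mantissa, exponent

def fmul(m1, e1, m2, e2, digits=4):
    """Multiply mantissas, add exponents, then normalize and chop."""
    product, exponent = m1 * m2, e1 + e2      # product has 2*digits or 2*digits - 1 digits
    if product < 10 ** (2 * digits - 1):      # leading digit is zero: shift left once
        product, exponent = product * 10, exponent - 1
    return check_range(product // 10 ** digits, exponent)

def fdiv(m1, e1, m2, e2, digits=4):
    """Divide mantissas, subtract exponents, then normalize and chop."""
    quotient, exponent = (m1 * 10 ** digits) // m2, e1 - e2
    if quotient >= 10 ** digits:              # mantissa quotient >= 1: shift right once
        quotient, exponent = quotient // 10, exponent + 1
    return check_range(quotient, exponent)

print(fmul(1111, 10, 1234, 15))   # (1370, 24), i.e. 0.1370e24
print(fdiv(1000, 5, 9999, 3))     # (1000, 2), i.e. 0.1000e2
```

With these definitions, `fmul(1111, 74, 2000, 80)` raises `OverflowError` and `fmul(1234, -49, 1111, -54)` raises `UnderflowError`, matching the two parts of Example 9.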

Example 10 Calculate the sum of given floating-point numbers:

1 0.4546e5 and 0.5433e7

2 0.4546e5 and 0.5433e5

Sol. 1 When the exponents are not equal, the operand with the larger exponent is kept and the other is shifted to match it. That is, 0.5433e7 + 0.0045e7 = 0.5478e7.

2 Here the mantissas are added directly because the exponents are equal. That is,

0.4546e5 + 0.5433e5 = 0.9979e5.

Example 11 Subtract the floating-point number 0.5424e3 from 0.5452e3.

Sol. Subtracting 0.5424e3 from 0.5452e3 gives 0.0028e3. In normalized floating-point representation, where the mantissa must be greater than or equal to 0.1, this is written as 0.2800e1.


Example 12 Calculate the value of $e^x$ when x = 0.5250e1 and e = 2.7183. The expression for $e^x$ is

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$$

Sol. We have $e^x = e^{0.5250e1} = e^5 \times e^{0.25}$.

$e^5$ = (0.2718e1) × (0.2718e1) × (0.2718e1) × (0.2718e1) × (0.2718e1) = 0.1484e3.

Also, $e^{0.25} = 1 + 0.25 + \frac{(0.25)^2}{2!} + \frac{(0.25)^3}{3!}$ = 1.25 + 0.03125 + 0.002604 = 0.1284e1.

Hence $e^{0.5250e1}$ = (0.1484e3) × (0.1284e1) = 0.1905e3.
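The splitting of the exponent into an integer part and a fractional part can be sketched in ordinary double-precision Python, without the 4-digit restriction. The function name `exp_split` and the choice of four series terms are assumptions for illustration; x is taken to be non-negative.

```python
import math

def exp_split(x, e=2.7183, terms=4):
    """Approximate e**x as e**n times a truncated series in the fractional part r."""
    n = int(x)                      # integer part of x (x >= 0 assumed)
    r = x - n                       # fractional part, 0 <= r < 1
    power = e ** n                  # repeated multiplication by e
    series = sum(r ** k / math.factorial(k) for k in range(terms))
    return power, series, power * series

power, series, value = exp_split(5.25)
print(power, series, value)
print(math.exp(5.25))               # library value, for comparison
```

Keeping the series argument below 1 is what lets so few terms suffice; applying the same four terms directly to x = 5.25 would be far less accurate.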

Example 13 Compute the middle value of the numbers a = 4.568 and b = 6.762 using four-digit arithmetic, and compare the result with that obtained from $c = a + \frac{b - a}{2}$.

Sol. Here a = 0.4568e1 and b = 0.6762e1. Let c be the middle value of a and b; therefore

c = (0.4568e1 + 0.6762e1) ÷ 0.2000e1 = 0.1133e2 ÷ 0.2000e1 = 0.5665e1

If we use the formula $c = a + \frac{b - a}{2}$, we get

c = 0.4568e1 + (0.6762e1 – 0.4568e1) ÷ 0.2000e1

or 0.4568e1 + 0.1097e1 = 0.5665e1, which is the same as the first result.

Example 14 Evaluate 1 – cos x at x = 0.1396 radian. Assume cos(0.1396) = 0.9903, and compare the result with the value obtained from $2\sin^2\frac{x}{2}$. Also assume sin(0.0698) = 0.6974e–1.

Sol. Since x = 0.1396,

1 – cos(0.1396) = 0.1000e1 – 0.9903e0 = 0.1000e1 – 0.0990e1 = 0.0010e1 = 0.1000e–1.

Also, $\sin\frac{x}{2}$ = sin(0.0698) = 0.6974e–1, so

$2\sin^2\frac{x}{2}$ = (0.2000e1) × (0.6974e–1) × (0.6974e–1) = 0.9727e–2.

The value obtained from the alternate formula is close to the true value 0.9728e–2.
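The same loss of significance appears in IEEE double precision once x is small enough. The sketch below uses x = 1e-9, a value chosen here for illustration and not taken from the example, and compares the two algebraically equivalent formulas.

```python
import math

x = 1e-9
direct = 1.0 - math.cos(x)              # cos(x) rounds to 1.0, so all digits cancel
stable = 2.0 * math.sin(x / 2.0) ** 2   # same quantity, computed without subtraction

print(direct)   # 0.0
print(stable)   # close to x**2 / 2 = 5e-19
```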

Example 15 Evaluate the following floating-point numbers:

1 0.5334e9 × 0.1132e–25

2 0.1111e10 × 0.1234e15

3 0.9998e–5 ÷ 0.1000e98

4 0.1111e51 × 0.4444e50

5 0.1000e5 ÷ 0.9999e3

6 0.5543e12 × 0.4111e–15

7 0.9998e1 ÷ 0.1000e–99

Sol. 1 0.5334e9 × 0.1132e–25 = 0.6038e–17.

2 0.1111e10 × 0.1234e15 = 0.1370e24.

3 0.9998e–5 ÷ 0.1000e98 = 0.9998e–102; this result shows the underflow condition of floating-point numbers.

4 0.1111e51 × 0.4444e50 = 0.4937e100; this result shows an overflow condition.

5 0.1000e5 ÷ 0.9999e3 = 0.1000e2.

6 0.5543e12 × 0.4111e–15 = 0.2278e–3.

7 0.9998e1 ÷ 0.1000e–99 = 0.9998e101; this shows an overflow condition of floating-point numbers.

Example 16 For x = 0.4845 and y = 0.4800, calculate the value of $\frac{x^2 - y^2}{x + y}$ using normalized floating-point arithmetic. Compare this with the value of (x – y).

Sol. Since x = 0.4845 and y = 0.4800,

x + y = 0.4845e0 + 0.4800e0 = 0.9645e0.

Again,

x^2 = (0.4845e0) × (0.4845e0) = 0.2347e0

y^2 = (0.4800e0) × (0.4800e0) = 0.2304e0

x^2 – y^2 = 0.2347e0 – 0.2304e0 = 0.0043e0

Therefore,

$\frac{x^2 - y^2}{x + y}$ = 0.0043e0 ÷ 0.9645e0 = 0.4458e–2

Also, x – y = 0.4845e0 – 0.4800e0 = 0.0045e0 = 0.4500e–2.
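This comparison can be reproduced with Python's `decimal` module. The sketch assumes a context with four significant digits and chopping (`ROUND_DOWN`) as a model of the 4-digit normalized arithmetic used above; the variable names are illustrative.

```python
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)   # 4-digit mantissa, chopping (assumed model)

x, y = Decimal("0.4845"), Decimal("0.4800")

x2 = ctx.multiply(x, x)                      # 0.2347
y2 = ctx.multiply(y, y)                      # 0.2304
via_squares = ctx.divide(ctx.subtract(x2, y2), ctx.add(x, y))
direct = ctx.subtract(x, y)

print(via_squares)   # 0.004458
print(direct)        # 0.0045
```

The subtraction x^2 – y^2 leaves only two significant digits, and that loss carries through to the quotient, while x – y is exact.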

Example 17 Find the solution of the following equation using floating-point arithmetic with a 4-digit mantissa: $x^2 - 1000x + 25 = 0$.

Sol. Given that $x^2 - 1000x + 25 = 0$,

$$x = \frac{1000 \pm \sqrt{10^6 - 10^2}}{2}$$

Now $10^6$ = 0.1000e7 and $10^2$ = 0.1000e3.

Therefore $10^6 - 10^2$ = 0.1000e7, so $\sqrt{10^6 - 10^2}$ = 0.1000e4.

Hence the roots are (0.1000e4 + 0.1000e4) ÷ 2 and (0.1000e4 – 0.1000e4) ÷ 2, which are 0.1000e4 and 0.0000e4 respectively. One of the roots becomes zero due to the limited precision allowed in the computation. We know that in the quadratic equation $ax^2 + bx + c = 0$ the product of the roots is $\frac{c}{a}$, so the smaller root may be obtained by dividing (c/a) by the larger root.

Therefore the first root is 0.1000e4 and the second root is

25 ÷ 0.1000e4 = 0.2500e2 ÷ 0.1000e4 = 0.2500e–1.

Example 18 The associative and distributive laws are not always valid in normalized floating-point representation. Give an example to prove this statement.

Sol. As a consequence of the normalized floating-point representation, the associative and distributive laws of arithmetic are not always valid. The examples given below prove this statement.

Let a = 0.5555e1, b = 0.4545e1, c = 0.4535e1. Then

(b – c) = 0.0010e1 = 0.1000e–1

a(b – c) = (0.5555e1) × (0.1000e–1) = 0.0555e0 = 0.5550e–1

ab = (0.5555e1) × (0.4545e1) = 0.2524e2

ac = (0.5555e1) × (0.4535e1) = 0.2519e2

Therefore ab – ac = 0.0005e2 = 0.5000e–1.

Thus, a(b – c) ≠ ab – ac.

This proves the non-distributivity of arithmetic.

Again, let a = 0.5665e1, b = 0.5556e–1, c = 0.5644e1.

Therefore a + b = 0.5665e1 + 0.5556e–1 = 0.5665e1 + 0.0055e1 = 0.5720e1

(a + b) – c = 0.5720e1 – 0.5644e1 = 0.0076e1 = 0.7600e–1

a – c = 0.5665e1 – 0.5644e1 = 0.0021e1 = 0.2100e–1

(a – c) + b = 0.2100e–1 + 0.5556e–1 = 0.7656e–1

Thus, (a + b) – c ≠ (a – c) + b.

This proves the non-associativity of arithmetic.
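Both failures can be checked mechanically. The sketch below again assumes Python's `decimal` module with four significant digits and chopping as the arithmetic model. Because `decimal` normalizes a result before chopping it, an intermediate value can keep one more digit than in the hand computation (here a(b – c) comes out as 0.05555), but the two inequalities are unaffected.

```python
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)   # 4-digit mantissa, chopping (assumed model)

# Distributive law: a(b - c) versus ab - ac
a, b, c = Decimal("5.555"), Decimal("4.545"), Decimal("4.535")
left = ctx.multiply(a, ctx.subtract(b, c))
right = ctx.subtract(ctx.multiply(a, b), ctx.multiply(a, c))
print(left, right, left == right)

# Associative law: (p + q) - r versus (p - r) + q
p, q, r = Decimal("5.665"), Decimal("0.05556"), Decimal("5.644")
first = ctx.subtract(ctx.add(p, q), r)
second = ctx.add(ctx.subtract(p, r), q)
print(first, second, first == second)
```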

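Example 19 Calculate the smaller root of the equation $x^2 - 400x + 1 = 0$ using 4-digit arithmetic.

Sol. The roots of the equation $ax^2 + bx + c = 0$ are

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$

Here $b^2 \gg |4ac|$ and the product of the roots is $\frac{c}{a}$. Therefore the smaller root, obtained by dividing c/a by the larger root, has magnitude

$$\frac{c/a}{\left(|b| + \sqrt{b^2 - 4ac}\right)/2a} \quad \text{or} \quad \frac{2c}{|b| + \sqrt{b^2 - 4ac}}$$

For the given equation a = 1 = 0.1000e1, |b| = 400 = 0.4000e3 and c = 1 = 0.1000e1.

Therefore $b^2 - 4ac$ = 0.1600e6 – 0.4000e1 = 0.1600e6

and $\sqrt{b^2 - 4ac}$ = 0.4000e3.

Hence the smaller root is

2 × (0.1000e1) ÷ (0.4000e3 + 0.4000e3) = 0.2000e1 ÷ 0.8000e3 = 0.25e–2 = 0.0025.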
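The difference between the textbook formula and the rearranged one can be sketched in 4-digit decimal arithmetic. The code assumes Python's `decimal` module with rounding to four significant digits, and it assumes b < 0 with a, c > 0 as in this example, so that both roots are positive; the function names are hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # 4-digit mantissa with rounding (assumed)
TWO, FOUR = Decimal(2), Decimal(4)

def discriminant(a, b, c):
    return ctx.subtract(ctx.multiply(b, b), ctx.multiply(ctx.multiply(FOUR, a), c))

def small_root_naive(a, b, c):
    """(-b - sqrt(b^2 - 4ac)) / 2a: subtracts two nearly equal numbers when b < 0."""
    root = ctx.sqrt(discriminant(a, b, c))
    return ctx.divide(ctx.subtract(ctx.minus(b), root), ctx.multiply(TWO, a))

def small_root_stable(a, b, c):
    """2c / (|b| + sqrt(b^2 - 4ac)): the same root for b < 0, without cancellation."""
    root = ctx.sqrt(discriminant(a, b, c))
    return ctx.divide(ctx.multiply(TWO, c), ctx.add(ctx.abs(b), root))

a, b, c = Decimal(1), Decimal(-400), Decimal(1)
naive = small_root_naive(a, b, c)
stable = small_root_stable(a, b, c)
print(naive, stable)   # the naive value is zero; the stable value is 0.0025
```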

PROBLEM SET 1.2

1 Round off the following numbers to four significant figures:

38.46235,

0.70029,

0.0022218,

2 Round off the following numbers to two decimal places:

48.21416,

2.385,

52.275,

81.255,

3 Obtain the range of values within which the exact value of $\frac{1.265\,(10.21 - 7.54)}{47}$ lies, if all the numerical quantities are rounded off. [Hint: on taking $e_a$ < 1%] [Ans 0.06186 < x < 0.08186]

4 Calculate the value of $\sqrt{102} - \sqrt{101}$ correct to four significant figures. [Ans 0.04963]

5 Represent 44.85 × 10^6 in normalized floating-point mode. [Ans 0.4485e8]

6 Explain Machine Epsilon in floating point arithmetic

7 Calculate the value of $x^2 + 2x - 2$ and $(2x - 2) + x^2$ where x = 0.7320e0, using normalized floating-point arithmetic, and prove that they are not the same. Compare with the value of

8 Find the value of $\sin x \approx x - \frac{x^3}{3!} + \frac{x^5}{5!}$ for x = 0.2000e0 using normalized floating-point arithmetic with a 4-digit mantissa. [Ans 0.1987e0 (taking $e_a$ = 0.005)]

9 The following numbers are given in a decimal computer with a four digit normalized mantissa:

(a) 0.4523e – 4, (b) 0.2115e – 3, (c) 0.2583e1

Perform the following operations, and indicate the error in the result, assuming symmetric rounding:

1 (a) + (b) + (c) 2 (a) – (b) – (c) 3 (a)/(c)

[Ans 1 0.2585e1 2 0.2581e1 3 1.7511e–8

4 0.3717e–8 5 –0.1663e–3 6 0.1823e3]


10 Give an example to show that most of the laws of arithmetic fail to hold for floating-point arithmetic.

11 Find the root of smaller magnitude of the equation x^2 + (0.4002e0)x + 0.8e–4 = 0. Work in floating-point arithmetic using a four decimal place mantissa. [Ans –0.2e–3]

12 Give the normalized floating-point representation for the following:

4 93

3

[Ans 1 0.3143e1 2 –0.2275e2 3 1e–2

4 0.9375e1 5 0.5 e0 6 –0.4688e–1]

13 Using 5-digit arithmetic with rounding, calculate the sum of two numbers x = 0.78596e – 2

14 Compute 403000 × 0.197 by 3-digit arithmetic with rounding [Ans 0.7939e5]

15 Evaluate $f(x) = \frac{1 - \cos x}{x}$ for x = 0.01, using five-digit decimal arithmetic. [Ans 0.1e–1]

16 Calculate the value of the polynomial $P_3(x) = 2.75x^3 - 2.95x^2 + 3.16x - 4.67$ for x = 1.07 using both chopping and rounding off to three digits, proceeding through the polynomial


CHAPTER 2

Algebraic and Transcendental Equation

2.1 INTRODUCTION

We have seen that an expression of the form

$$f(x) = a_0 x^n + a_1 x^{n-1} + \dots + a_{n-1} x + a_n$$

where the a's are constants ($a_0 \ne 0$) and n is a positive integer, is called a polynomial in x of degree n, and the equation f(x) = 0 is called an algebraic equation of degree n. If f(x) contains some other functions such as exponential, trigonometric or logarithmic functions, then f(x) = 0 is called a transcendental equation. For example,

$x^3 - 3x + 6 = 0$, $x^5 - 7x^4 + 3x^2 + 36x - 7 = 0$

are algebraic equations of third and fifth degree, whereas

$x^2 - 3\cos x + 1 = 0$, $xe^x - 2 = 0$, $x \log_{10} x = 1.2$

etc., are transcendental equations. In both cases, if the coefficients are pure numbers, they are called numerical equations.

In this chapter, we shall describe some numerical methods for the solution of f(x) = 0 where f(x) is algebraic or transcendental or both.

Methods for finding the roots of an equation can be classified into the following two types:

(1) Direct methods

(2) Iterative methods

2.2.1 Direct Methods

In some cases, roots can be found by using direct analytical methods. For example, for a quadratic equation $ax^2 + bx + c = 0$, the roots are obtained by

$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \quad \text{and} \quad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$

These are called closed form solutions. Similar formulae are also available for cubic and biquadratic polynomial equations, but we rarely remember them. For higher order polynomial equations and non-polynomial equations, it is difficult, and in many cases impossible, to get closed form solutions. Besides this, when numbers are substituted into the available closed form solutions, rounding errors reduce their accuracy.

2.2.2 Iterative Methods

These methods, also known as trial and error methods, are based on the idea of successive approximations, i.e., starting with one or more initial approximations to the value of the root, we obtain a sequence of approximations by repeating a fixed sequence of steps over and over again till we get the solution with reasonable accuracy. These methods generally give only one root at a time.

For the human problem solver, these methods are very cumbersome and time consuming, but on the other hand they are more natural for use on computers, for the following reasons:

(1) These methods can be concisely expressed as computational algorithms.

(2) It is possible to formulate algorithms which can handle a class of similar problems. For example, algorithms to solve polynomial equations of degree n may be written.

(3) Rounding errors are negligible as compared to methods based on closed form solutions.

Convergence of an iterative method is judged by the order at which the error between successive approximations to the root decreases.

An iterative method is said to be kth order convergent if k is the largest positive real number such that

$$\lim_{i \to \infty} \left| \frac{e_{i+1}}{e_i^{\,k}} \right| = A$$

where A is a non-zero finite number called the asymptotic error constant, which depends on the derivatives of f(x) at an approximate root x, and $e_i$ and $e_{i+1}$ are the errors in successive approximations.

In other words, the error in any step is proportional to the kth power of the error in the previous step. Physically, kth order convergence means that in each iteration, the number of significant digits in each approximation increases k times.
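The order k can be estimated numerically from three successive errors, since $e_{i+1} \approx A e_i^k$ implies $k \approx \log(e_{i+1}/e_i) / \log(e_i/e_{i-1})$. The sketch below applies this to Newton's iteration for $x^2 - 2 = 0$, which is an illustrative choice made here and not a method introduced in the text so far; the function name `estimate_order` is hypothetical.

```python
import math

def estimate_order(errors):
    """Estimate k from the last three errors, assuming e_{i+1} ~ A * e_i**k."""
    e0, e1, e2 = errors[-3:]
    return math.log(e2 / e1) / math.log(e1 / e0)

root = math.sqrt(2.0)            # reference value used to measure the errors
x = 1.5                          # initial approximation
errors = []
for _ in range(3):
    x = 0.5 * (x + 2.0 / x)      # Newton's iteration for x**2 - 2 = 0
    errors.append(abs(x - root))

k = estimate_order(errors)
print(errors)
print(k)                         # close to 2 for this iteration
```

The number of correct digits roughly doubles at each step of this iteration, which is what second-order convergence means in practice.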

This is one of the simplest iterative methods and is strongly based on the property of intervals. To find a root using this method, let the function f(x) be continuous between a and b. For definiteness, let f(a) be negative and f(b) be positive. Then there is a root of f(x) = 0 lying between a and b. Let the first approximation be $x_1 = \frac{1}{2}(a + b)$ (i.e., the average of the ends of the range).

Now if f(x1) = 0 then x1 is a root of f(x) = 0. Otherwise, the root will lie between a and x1 or between x1 and b, depending upon whether f(x1) is positive or negative.
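The interval-halving procedure described above (commonly called the bisection method) can be sketched as follows. The function name `bisect`, the tolerance and the iteration cap are assumptions for illustration; the test equation $xe^x - 2 = 0$ is one of the transcendental equations mentioned in the introduction.

```python
import math

def bisect(f, a, b, tol=1e-6, max_iter=200):
    """Halve [a, b] repeatedly, always keeping the half on which f changes sign."""
    fa, fb = f(a), f(b)
    if fa == 0:
        return a
    if fb == 0:
        return b
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        x = 0.5 * (a + b)                 # average of the ends of the range
        fx = f(x)
        if fx == 0 or 0.5 * (b - a) < tol:
            return x
        if fa * fx < 0:                   # sign change between a and x
            b = x
        else:                             # sign change between x and b
            a, fa = x, fx
    return 0.5 * (a + b)

f = lambda x: x * math.exp(x) - 2.0
root = bisect(f, 0.0, 1.0)
print(root, f(root))
```

Each pass halves the interval, so about 20 iterations reduce an interval of length 1 to the default tolerance.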
