The basic operations specified by IEEE arithmetic are first and foremost addition, subtraction, multiplication, and division. Square roots and remainders are also included. The default rounding for these operations is “to nearest even”. This means that the floating point result fl(a op b) of the exact operation (a op b) is the nearest floating point number to (a op b), breaking ties by rounding to the floating point number whose bottom bit is zero (the “even” one). It is also possible to round up, round down, or truncate (round towards zero). Rounding up and down are useful for interval arithmetic, which can provide guaranteed error bounds; unfortunately, most languages and/or compilers provide no access to the status flag which can select the rounding direction. When the result of a floating point operation is not representable as a normalized floating point number, an exception occurs.
1.8 FLOATING POINT ARITHMETIC AND THEIR COMPUTATION
The computer performs four basic arithmetic operations: addition, subtraction, multiplication and division. Decimal numbers are first converted to machine numbers. A machine number consists only of the digits of its base, and the base depends on the computer. If the base is two, the number system is called the binary number system (digits 0 and 1); if the base is eight, it is called the octal number system; and if the base is sixteen, it is called the hexadecimal number system. The decimal number system has base 10. In numerical computation, there are mainly two types of arithmetic operations:
(a) Integer arithmetic, which deals with integer operands and
(b) Real or floating-point arithmetic, which deals with operands that have a fractional part.
Most computers carry out scientific calculations in floating point arithmetic to avoid the difficulty of keeping every number less than 1 in magnitude during computation. A floating point number is characterized by three parameters: the base b, the number of digits n and the exponent range (m, M).
An n-digit floating-point number with base b has the form:
$$x = \pm (0.d_1 d_2 \ldots d_n)_b \, b^e$$
where $d_1, d_2, d_3, \ldots, d_n$ are integers satisfying $0 \le d_i < b$, and the exponent e is such that $m \le e < M$. Also $(0.d_1 d_2 d_3 \ldots d_n)_b$ is a b-fraction called the mantissa, and it lies between +1 and –1. The number 0 is written as:
$$+0.000 \ldots 0 \times b^e$$
The floating-point number is said to be normalized if $d_1 \ne 0$, or else $d_1 = d_2 = \cdots = d_n = 0$. If $d_1, d_n \ne 0$, the number is said to have n significant digits.
There are two commonly used ways to translate a given real number x into an n-digit floating-point number $fp(x)$ in base b: rounding and chopping.
A floating-point number $x = \pm(0.d_1 d_2 \ldots d_n)_b \, b^e$ is in n-digit mantissa standard form if it is normalized and its mantissa consists of exactly n digits. If a number x can be represented by
$$x = \pm(0.d_1 d_2 d_3 \ldots d_n d_{n+1} \ldots)_b \, b^e$$
then its floating-point form obtained by chopping is
$$fp(x) = \pm(0.d_1 d_2 d_3 \ldots d_n)_b \, b^e,$$
that is, the digits $d_{n+1}, d_{n+2}, \ldots$ are discarded. Its floating-point form obtained by rounding is found by first adding half a unit in the n-th digit of the mantissa, $\frac{1}{2} b^{-n}$, and then chopping the result to n digits. Either form may be used to write a floating-point number.
Example 1 Normalized form of $x = \frac{2}{3}$:
$fp(x) = fp\left(\frac{2}{3}\right) = 0.6666666$ (result after chopping).
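The difference between chopping and rounding can be checked with a short sketch. It assumes a seven-digit decimal mantissa (the length of the result quoted above) and uses Python's standard `decimal` module; the helper name `fp` is only illustrative.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP

def fp(numerator, denominator, digits, rounding):
    """Return numerator/denominator kept to `digits` significant decimal digits."""
    ctx = Context(prec=digits, rounding=rounding)
    return ctx.divide(Decimal(numerator), Decimal(denominator))

# Assumption: a 7-digit mantissa, matching the chopped value in Example 1.
chopped = fp(2, 3, 7, ROUND_DOWN)      # digits beyond the 7th are discarded
rounded = fp(2, 3, 7, ROUND_HALF_UP)   # nearest 7-digit value, ties rounded up
print(chopped, rounded)
```

The two results differ only in the last retained digit, which is the most that chopping and rounding can disagree by.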
In computers, each memory location, called a word, stores only a finite number of digits. Suppose a location stores 6 digits together with a sign. To represent a real number, the computer could assume a fixed position for the decimal point and store every number, after appropriate shifting, relative to that assumed point. In that case the largest number that can be stored is 9999.99 and the smallest is 0000.01, both in magnitude. To overcome this limitation, the representation is changed so as to preserve the maximum number of significant digits in a real number and to increase the range of values it can take. This type of representation is called the normalized floating-point mode.
Example 2 The number $58.72 \times 10^5$ is represented as $0.5872 \times 10^7$ or 0.5872e7.
Sol. Here the mantissa is 0.5872 and the exponent is 7. Shifting the mantissa to the left until its most significant digit is nonzero is called normalization.
1.8.1 Arithmetic Operations on Floating Point Numbers
There are four basic arithmetic operations: addition, subtraction, multiplication and division. They are applied to floating-point numbers as follows:
Example 3 Add the following floating-point numbers 0.4546e3 and 0.5433e7.
Sol. The operands have unequal exponents. To add them, keep the operand with the larger exponent and shift the mantissa of the other to match:
0.5433e7 + 0.0000e7 = 0.5433e7
(because 0.4546e3 becomes 0.0000e7 when it is written with exponent 7 and a four-digit mantissa).
Example 4 Add the following floating-point numbers 0.6434e3 and 0.4845e3.
Sol. The exponents are equal, but adding gives 1.1279e3; that is, the mantissa has 5 digits and is greater than 1, so it is shifted right one place. Hence we get the result 0.1127e4.
Example 5 Add the following floating-point numbers 0.6434e99 and 0.4845e99.
Sol. In this example, the mantissa is shifted right and the exponent is increased by 1 (because the sum of the mantissas exceeds 1), resulting in a value of 100 for the exponent. This is called an overflow condition, because the exponent cannot store more than two digits.
Example 6 Find the sum of 0.123e3 and 0.456e2 and write the result in three-digit mantissa form.
Sol. Sum = 0.123e3 + 0.456e2
= 0.123e3 + 0.0456e3 = 0.168e3 (result after chopping).
Sum = 0.123e3 + 0.456e2
= 0.123e3 + 0.0456e3 = 0.169e3 (result after rounding).
The above examples (3 to 6) show the addition of floating-point numbers in different cases.
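The align, add and renormalize steps used in Examples 3 and 4 can be sketched as follows. This is a minimal illustration and not a full machine model: it assumes a 4-digit decimal mantissa stored as a signed integer (so 0.5433e7 is the pair `(5433, 7)`), chops rather than rounds, and ignores exponent overflow. The names `N` and `fl_add` are hypothetical.

```python
N = 4  # assumed mantissa length in decimal digits

def fl_add(m1, e1, m2, e2, n=N):
    """Add 0.m1 x 10^e1 and 0.m2 x 10^e2; m1 and m2 are signed n-digit integers."""
    # Keep the operand with the larger exponent as (m1, e1).
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    shift = e1 - e2
    # Align: shift the smaller operand right, chopping the digits that fall off.
    sign2 = -1 if m2 < 0 else 1
    m2 = sign2 * (abs(m2) // 10 ** shift)
    total, exp = m1 + m2, e1
    if total == 0:
        return 0, 0
    sign = -1 if total < 0 else 1
    mag = abs(total)
    if mag >= 10 ** n:           # mantissa reached 1: shift right one place and chop
        mag //= 10
        exp += 1
    while mag < 10 ** (n - 1):   # leading zeros: shift left until normalized
        mag *= 10
        exp -= 1
    return sign * mag, exp

print(fl_add(4546, 3, 5433, 7))   # Example 3
print(fl_add(6434, 3, 4845, 3))   # Example 4
```

Passing a negative mantissa for the second operand gives subtraction, so the same sketch also covers cases where leading digits cancel and the result has to be shifted left.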
Example 7 Subtract the floating-point number $0.36132346 \times 10^7$ from $0.36143447 \times 10^7$.
Sol. Subtracting $0.36132346 \times 10^7$ from $0.36143447 \times 10^7$ gives $0.00011101 \times 10^7$. Shifting the fractional part three places to the left gives $0.11101 \times 10^4$, which is a normalized floating-point number. Note that $0.00011101 \times 10^7$ is also a floating-point number, but not in normalized form.
Example 8 Subtract the following floating-point numbers:
1 0.5424e–99 from 0.5452e–99
2 0.3862e–7 from 0.9682e–7
Sol. 1 On subtracting we get 0.0028e–99. This is a floating-point number, but not in normalized form. To normalize it, shift the mantissa two places to the left, which gives 0.2800e–101. This is called an underflow condition, because the exponent –101 cannot be stored in two digits.
2 Similarly, the second subtraction gives 0.5820e–7.
The above examples (7 and 8) show the subtraction of floating-point numbers, including the underflow condition. In general, to add or subtract two numbers in normalized floating-point notation, their exponents must be equal; if they are not, they are made equal by shifting the mantissa of one operand appropriately.
Example 9 Multiply the following floating point numbers:
1 0.1111e74 and 0.2000e80
2 0.1234e–49 and 0.1111e–54
Sol. 1 On multiplying 0.1111e74 × 0.2000e80 we have 0.2222e153. This shows the overflow condition of normalized floating-point numbers.
2 Similarly, the second multiplication gives 0.1370e–104, which shows the underflow condition of floating-point numbers.
This example shows that two numbers are multiplied by multiplying the mantissas and adding the exponents of their normalized floating-point representations. Similarly, division is carried out by dividing the mantissa of the numerator by that of the denominator and subtracting the denominator exponent from the numerator exponent. The quotient mantissa is then normalized as before, and the resulting exponent is adjusted accordingly.
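The multiplication rule just stated (multiply the mantissas, add the exponents, then normalize) can be sketched in the same integer-mantissa style. The sketch assumes a 4-digit mantissa, a two-digit exponent (so the range is –99 to 99) and chopping; `DIGITS`, `EXP_MAX` and `fl_mul` are illustrative names, not part of the text.

```python
DIGITS = 4     # assumed mantissa length
EXP_MAX = 99   # assumed largest exponent that fits in two digits

def fl_mul(m1, e1, m2, e2):
    """Multiply 0.m1 x 10^e1 by 0.m2 x 10^e2 (m1, m2 non-negative DIGITS-digit integers)."""
    prod = m1 * m2            # product mantissa, scaled by 10^(2*DIGITS)
    exp = e1 + e2             # exponents add
    if prod == 0:
        return 0, 0, "ok"
    while prod < 10 ** (2 * DIGITS - 1):   # leading zero: shift left, lower exponent
        prod *= 10
        exp -= 1
    mant = prod // 10 ** DIGITS            # chop to DIGITS digits
    if exp > EXP_MAX:
        status = "overflow"
    elif exp < -EXP_MAX:
        status = "underflow"
    else:
        status = "ok"
    return mant, exp, status

print(fl_mul(1111, 74, 2000, 80))    # Example 9, part 1
print(fl_mul(1234, -49, 1111, -54))  # Example 9, part 2
```

The status string only reports whether the exponent fits; what a real machine does on overflow or underflow is outside the scope of this sketch.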
Example 10 Calculate the sum of given floating-point numbers:
1 0.4546e5 and 0.5433e7
2 0.4546e5 and 0.5433e5
Sol. 1 When the exponents are not equal, the operand with the larger exponent is kept and the other is shifted to match. That is, 0.5433e7 + 0.0045e7 = 0.5478e7.
2 Here the mantissas are added directly because the exponents are equal. That is,
0.4546e5 + 0.5433e5 = 0.9979e5.
Example 11 Subtract the floating-point number 0.5424e3 from 0.5452e3.
Sol. Subtracting 0.5424e3 from 0.5452e3 gives 0.0028e3. In normalized floating-point representation this is written as 0.2800e1, because the mantissa of a normalized number must be greater than or equal to 0.1.
Example 12 Calculate the value of $e^x$ when x = 0.5250e1 and e = 2.7183. The expression for $e^x$ is
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
Sol. We have $e^x = e^{5.25} = e^5 \times e^{0.25}$.
$e^5$ = (0.2718e1) × (0.2718e1) × (0.2718e1) × (0.2718e1) × (0.2718e1) = 0.1484e3.
Also, we find $e^{0.25}$ from the series:
$$e^{0.25} = 1 + (0.25) + \frac{(0.25)^2}{2!} + \frac{(0.25)^3}{3!} = 1.25 + 0.03125 + 0.002604$$
which gives $e^{0.25}$ = 0.1284e1.
Hence $e^{5.25}$ = (0.1484e3) × (0.1284e1) = 0.1905e3.
Example 13 Compute the middle value of the numbers a = 4.568 and b = 6.762 using four-digit arithmetic, and compare the result with that obtained from $c = a + \frac{b - a}{2}$.
Sol. Since a = 0.4568e1, b = 0.6762e1 and c is the middle value of the numbers a and b,
$$c = \frac{a + b}{2} = \frac{0.4568\text{e}1 + 0.6762\text{e}1}{2} = \frac{0.1133\text{e}2}{2} = 0.5665\text{e}1$$
If we use the formula $c = a + \frac{b - a}{2}$, we get
$$c = 0.4568\text{e}1 + \frac{0.6762\text{e}1 - 0.4568\text{e}1}{0.2000\text{e}1}$$
or 0.4568e1 + 0.1097e1 = 0.5665e1, which agrees with the first result.
Example 14 Evaluate 1 – cos x at x = 0.1396 radian. Assume cos(0.1396) = 0.9903, and compare the result with that obtained from $2\sin^2\frac{x}{2}$. Also assume sin(0.0698) = 0.6974e–1.
Sol. Since x = 0.1396,
1 – cos(0.1396) = 0.1000e1 – 0.9903e0
= 0.1000e1 – 0.0990e1 = 0.0010e1 = 0.1000e–1
Now $\sin\frac{x}{2}$ = sin(0.0698) = 0.6974e–1, so
$2\sin^2\frac{x}{2}$ = (0.2000e1) × (0.6974e–1) × (0.6974e–1) = 0.9727e–2
The value obtained by the alternate formula is close to the true value 0.9728e–2.
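The same loss of significance appears in ordinary double-precision arithmetic once x is small enough. The sketch below compares the two formulas; the function names and the test value x = 1e-9 are assumptions made for illustration only.

```python
import math

def one_minus_cos_naive(x):
    """Direct formula: subtracts two nearly equal numbers when x is small."""
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    """Equivalent form 2*sin^2(x/2), which avoids the subtraction."""
    s = math.sin(x / 2.0)
    return 2.0 * s * s

x = 1e-9  # hypothetical small argument
print(one_minus_cos_naive(x), one_minus_cos_stable(x))
```

For moderate arguments such as x = 0.1396 the two functions agree to many digits in double precision; the difference only becomes visible when cos x is so close to 1 that the subtraction cancels most of the stored digits.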
Example 15 Evaluate the following floating-point numbers:
1 0.5334e9 × 0.1132e–25
2 0.1111e10 × 0.1234e15
3 0.9998e–5 ÷ 0.1000e98
4 0.1111e51 × 0.4444e50
5 0.1000e5 ÷ 0.9999e3
6 0.5543e12 × 0.4111e–15
7 0.9998e1 ÷ 0.1000e–99
Sol. 1 0.5334e9 × 0.1132e–25 = 0.6038e–17. The exponent –17 fits in two digits, so this result is within range.
2 0.1111e10 × 0.1234e15 = 0.1370e24
3 0.9998e–5 ÷ 0.1000e98 = 0.9998e–102; this result shows the underflow condition of floating-point numbers.
4 0.1111e51 × 0.4444e50 = 0.4937e100. Hence the result shows an overflow condition.
5 0.1000e5 ÷ 0.9999e3 = 0.1000e2
6 0.5543e12 × 0.4111e–15 = 0.2278e–3
7 0.9998e1 ÷ 0.1000e–99 = 0.9998e101; this shows an overflow condition of floating-point numbers.
Example 16 For x = 0.4845 and y = 0.4800, calculate the value of $\frac{x^2 - y^2}{x + y}$ using normalized floating-point arithmetic. Compare this with the value of (x – y).
Sol. Since x = 0.4845, y = 0.4800,
x + y = 0.4845e0 + 0.4800e0 = 0.9645e0.
Again,
$x^2$ = (0.4845e0) × (0.4845e0) = 0.2347e0
$y^2$ = (0.4800e0) × (0.4800e0) = 0.2304e0
$x^2 - y^2$ = 0.2347e0 – 0.2304e0 = 0.0043e0
Therefore,
$$\frac{x^2 - y^2}{x + y} = \frac{0.0043\text{e}0}{0.9645\text{e}0} = 0.4458\text{e}{-2}$$
Also, x – y = 0.4845e0 – 0.4800e0 = 0.4500e–2.
Example 17 Find the solution of the following equation using floating-point arithmetic with a 4-digit mantissa: $x^2 - 1000x + 25 = 0$.
Sol. Given that $x^2 - 1000x + 25 = 0$,
$$x = \frac{1000 \pm \sqrt{10^6 - 10^2}}{2}$$
Now $10^6$ = 0.1000e7 and $10^2$ = 0.1000e3.
Therefore $10^6 - 10^2$ = 0.1000e7 ⇒ $\sqrt{10^6 - 10^2}$ = 0.1000e4.
Hence the roots are
$$\frac{0.1000\text{e}4 + 0.1000\text{e}4}{2} \quad\text{and}\quad \frac{0.1000\text{e}4 - 0.1000\text{e}4}{2}$$
which are 0.1000e4 and 0.0000e4 respectively. One of the roots becomes zero due to the limited precision allowed in computation. We know that for the quadratic equation $ax^2 + bx + c = 0$ the product of the roots is $\frac{c}{a}$, so the smaller root may be obtained by dividing (c/a) by the larger root.
Therefore the first root is 0.1000e4 and the second root is
$$\frac{25}{0.1000\text{e}4} = \frac{0.2500\text{e}2}{0.1000\text{e}4} = 0.2500\text{e}{-1}$$
Example 18 Associative and distributive laws are not always valid in case of normalized
floating-point representation Give example to prove this statement.
Sol. As a consequence of the normalized floating-point representation, the associative and distributive laws of arithmetic are not always valid. The examples below demonstrate this.
Let a = 0.5555e1, b = 0.4545e1, c = 0.4535e1. Then
(b – c) = 0.0010e1 = 0.1000e–1
a(b – c) = (0.5555e1) × (0.1000e–1) = (0.0555e0) = 0.5550e–1
ab = (0.5555e1) × (0.4545e1) = 0.2524e2
ac = (0.5555e1) × (0.4535e1) = 0.2519e2
Therefore ab – ac = 0.0005e2 = 0.5000e–1
Thus, a(b – c) ≠ ab – ac.
This proves the non-distributivity of arithmetic.
Again, let a = 0.5665e1, b = 0.5556e–1, c = 0.5644e1.
Therefore a + b = 0.5665e1 + 0.5556e–1 = 0.5665e1 + 0.0055e1 = 0.5720e1
(a + b) – c = 0.5720e1 – 0.5644e1 = 0.0076e1 = 0.7600e–1
a – c = 0.5665e1 – 0.5644e1 = 0.0021e1 = 0.2100e–1
(a – c) + b = 0.2100e–1 + 0.5556e–1 = 0.7656e–1
Thus, (a + b) – c ≠ (a – c) + b.
This proves the non-associativity of arithmetic.
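Both failures can be reproduced with Python's standard `decimal` module set to four significant digits with chopping. This is only a sketch: `decimal` computes each operation exactly and then chops the normalized result, which is not identical to the hand procedure above, so intermediate values may differ slightly from the worked figures (here a(b – c) comes out as 0.05555 rather than 0.5550e–1). The variable names are illustrative.

```python
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)  # 4 significant digits, chopping

# Associativity: (a + b) - c versus (a - c) + b
a, b, c = Decimal("5.665"), Decimal("0.05556"), Decimal("5.644")
left = ctx.subtract(ctx.add(a, b), c)
right = ctx.add(ctx.subtract(a, c), b)
print(left, right)

# Distributivity: a(b - c) versus ab - ac
a2, b2, c2 = Decimal("5.555"), Decimal("4.545"), Decimal("4.535")
lhs = ctx.multiply(a2, ctx.subtract(b2, c2))
rhs = ctx.subtract(ctx.multiply(a2, b2), ctx.multiply(a2, c2))
print(lhs, rhs)
```

In both pairs the two mathematically equal expressions give different four-digit results, because digits are discarded at different points in the calculation.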
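Example 19 Calculate the smaller root of the equation $x^2 - 400x + 1 = 0$ using 4-digit arithmetic.
Sol. The roots of the equation $ax^2 + bx + c = 0$ are
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$
Here $b^2 \gg |4ac|$ and the product of the roots is $\frac{c}{a}$. Since b is negative, $x_1$ is the larger root, and therefore the smaller root is
$$\frac{c/a}{\dfrac{-b + \sqrt{b^2 - 4ac}}{2a}} \quad\text{or}\quad \frac{2c}{-b + \sqrt{b^2 - 4ac}}$$
According to the equation, a = 1 = 0.1000e1, b = –400 = –0.4000e3, c = 1 = 0.1000e1.
Therefore $b^2 - 4ac$ = 0.1600e6 – 0.4000e1 = 0.1600e6 and $\sqrt{b^2 - 4ac}$ = 0.4000e3.
Hence the smaller root is
$$\frac{2 \times (0.1000\text{e}1)}{0.4000\text{e}3 + 0.4000\text{e}3} = \frac{0.2000\text{e}1}{0.8000\text{e}3} = 0.25\text{e}{-2} = 0.0025$$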
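The rearranged formula used in Examples 17 and 19 also pays off in double precision when the coefficients are more extreme. The sketch below assumes b < 0 and real roots, and uses a hypothetical equation $x^2 - 10^8 x + 1 = 0$ chosen only to make the cancellation visible; the function names are illustrative.

```python
import math

def smaller_root_naive(a, b, c):
    """Smaller root from the standard formula; assumes b < 0 and real roots."""
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)

def smaller_root_stable(a, b, c):
    """Smaller root as 2c / (-b + sqrt(b^2 - 4ac)); assumes b < 0 and real roots."""
    return (2 * c) / (-b + math.sqrt(b * b - 4 * a * c))

# Hypothetical coefficients; the true smaller root is very close to 1e-8.
naive = smaller_root_naive(1.0, -1e8, 1.0)
stable = smaller_root_stable(1.0, -1e8, 1.0)
print(naive, stable)

# The equation of Example 19, for comparison with the hand result 0.0025.
print(smaller_root_stable(1.0, -400.0, 1.0))
```

The stable form adds two positive quantities in the denominator, so no leading digits cancel; the naive form subtracts two nearly equal numbers and loses most of its accuracy.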
PROBLEM SET 1.2
1 Round off the following numbers to four significant figures:
38.46235,
0.70029,
0.0022218,
2 Round off the following numbers to two decimal places:
48.21416,
2.385,
52.275,
81.255,
3 Obtain the range of values within which the exact value of $\frac{1.265(10.21 - 7.54)}{47}$ lies, if all the numerical quantities are rounded off. [Hint: on taking $e_a$ < 1%] [Ans. 0.06186 < x < 0.08186]
4 Calculate the value of $\sqrt{102} - \sqrt{101}$ correct to four significant figures. [Ans. 0.04963]
5 Represent $44.85 \times 10^6$ in normalized floating-point mode. [Ans. 0.4485e8]
6 Explain Machine Epsilon in floating point arithmetic
7 Calculate the value of $x^2 + 2x - 2$ and $(2x - 2) + x^2$ where x = 0.7320e0, using normalized
floating-point arithmetic, and prove that they are not the same. Compare with the value of
8 Find the value of sin 3 5
x x for x = 0.2000e0 using normalized floating point
arithmetic with 4-digit mantissa [Ans 0.1987e0 (taking ea = 0.005)]
9 The following numbers are given in a decimal computer with a four digit normalized mantissa:
(a) 0.4523e – 4, (b) 0.2115e – 3, (c) 0.2583e1
Perform the following operations, and indicate the error in the result, assuming symmetric rounding:
1 (a) + (b) + (c) 2 (a) – (b) – (c) 3 (a)/(c)
[Ans 1 0.2585e1 2 0.2581e1 3 1.7511e–8
4 0.3717e–8 5 –0.1663e–3 6 0.1823e3]
10 Give examples to show that most of the laws of arithmetic fail to hold for floating-point arithmetic.
11 Find the root of smaller magnitude of the equation $x^2$ + 0.4002e0 x + 0.8e–4 = 0. Work in
floating-point arithmetic using a four decimal place mantissa. [Ans. –0.2e–3]
12 Give the normalized floating-point representation for the following:
4 93
3
[Ans 1 0.3143e1 2 –0.2275e2 3 1e–2
4 0.9375e1 5 0.5 e0 6 –0.4688e–1]
13 Using 5-digit arithmetic with rounding, calculate the sum of two numbers x = 0.78596e – 2
14 Compute 403000 × 0.197 by 3-digit arithmetic with rounding [Ans 0.7939e5]
15 Evaluate f x( )=1 cos− x
x for x = 0.01, using five-digit decimal arithmetic [Ans 0.1 e–1]
16 Calculate the value of the polynomial P3(x) = 2.75x3 – 2.95x2 + 3.16x – 4.67 for x = 1.07
using both chopping and rounding off to three digits, proceeding through the polynomial
CHAPTER 2
Algebraic and Transcendental Equation
2.1 INTRODUCTION
We have seen that an expression of the form
$$f(x) = a_0 x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n$$
where the a's are constants ($a_0 \ne 0$) and n is a positive integer, is called a polynomial in x of degree n, and the equation f(x) = 0 is called an algebraic equation of degree n. If f(x) contains some other functions such as exponential, trigonometric or logarithmic functions, then f(x) = 0 is called a transcendental equation. For example,
$x^3 - 3x + 6 = 0$, $x^5 - 7x^4 + 3x^2 + 36x - 7 = 0$
are algebraic equations of third and fifth degree, whereas $x^2 - 3\cos x + 1 = 0$, $xe^x - 2 = 0$, $x \log_{10} x = 1.2$ etc., are transcendental equations. In both cases, if the coefficients are pure numbers, they are called numerical equations.
In this chapter, we shall describe some numerical methods for the solution of f(x) = 0 where
f(x) is algebraic or transcendental or both.
Methods for finding the roots of an equation can be classified into the following two types:
(1) Direct methods
(2) Iterative methods
2.2.1 Direct Methods
In some cases, roots can be found by using direct analytical methods. For example, for a quadratic equation $ax^2 + bx + c = 0$, the roots of the equation are obtained by
$$x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \quad\text{and}\quad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$
These are called closed form solutions. Similar formulae are also available for cubic and biquadratic polynomial equations, but we rarely remember them. For higher order polynomial equations and non-polynomial equations, it is difficult, and in many cases impossible, to get closed form solutions. Besides this, when numbers are substituted in available closed form solutions, rounding errors reduce their accuracy.
2.2.2 Iterative Methods
These methods, also known as trial and error methods, are based on the idea of successive
approximations, i.e., starting with one or more initial approximations to the value of the root, we
obtain the sequence of approximations by repeating a fixed sequence of steps over and over again till we get the solution with reasonable accuracy These methods generally give only one root at
a time
For the human problem solver, these methods are very cumbersome and time consuming, but on the other hand they are more natural for use on computers, for the following reasons:
(1) These methods can be concisely expressed as computational algorithms
(2) It is possible to formulate algorithms which can handle a class of similar problems. For
example, algorithms to solve polynomial equations of degree n may be written.
(3) Rounding errors are negligible as compared to methods based on closed form solutions
Convergence of an iterative method is judged by the order at which the error between successive approximations to the root decreases.
An iterative method is said to be kth order convergent if k is the largest positive real number such that
$$\lim_{i \to \infty} \frac{|e_{i+1}|}{|e_i|^k} = A$$
where A is a non-zero finite number called the asymptotic error constant, which depends on the derivatives of f(x) at an approximate root x, and $e_i$ and $e_{i+1}$ are the errors in successive approximations.
In other words, the error in any step is proportional to the kth power of the error in the previous step. Physically, kth order convergence means that in each iteration the number of significant digits in the approximation increases k times.
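Given a sequence of errors, the order k can be estimated numerically. If $e_{i+1} \approx A e_i^k$ holds for consecutive steps, then $e_{i+1}/e_i \approx (e_i/e_{i-1})^k$, so k is approximately the ratio of the logarithms of successive error ratios. The sketch below applies this to a made-up error sequence; the function name and the data are assumptions for illustration, not results from any particular method.

```python
import math

def estimate_order(errors):
    """Estimate k at each step from k ~ log(e[i+1]/e[i]) / log(e[i]/e[i-1])."""
    return [
        math.log(errors[i + 1] / errors[i]) / math.log(errors[i] / errors[i - 1])
        for i in range(1, len(errors) - 1)
    ]

# Hypothetical errors obeying e[i+1] = e[i]**2, i.e. A = 1 and k = 2.
errors = [1e-1, 1e-2, 1e-4, 1e-8]
print(estimate_order(errors))
```

For this sequence the number of correct digits doubles at each step (1, 2, 4, 8), which is the "increases k times" behaviour described above with k = 2.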
This is one of the simplest iterative methods and is strongly based on the property of intervals. To find a root using this method, let the function f(x) be continuous between a and b. For definiteness, let f(a) be negative and f(b) be positive. Then there is a root of f(x) = 0 lying between a and b. Let the first approximation be $x_1 = \frac{1}{2}(a + b)$ (i.e., the average of the ends of the range).
Now if $f(x_1) = 0$ then $x_1$ is a root of f(x) = 0. Otherwise, the root will lie between a and $x_1$
or between $x_1$ and b, depending upon whether $f(x_1)$ is positive or negative.
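Repeating this halving step gives the procedure commonly known as the bisection method. A minimal sketch follows, assuming f(a) < 0 < f(b) as in the description above; the function name `bisect`, the tolerance, the iteration cap and the starting interval [0, 1] are choices made here for illustration. The test equation $xe^x - 2 = 0$ is one of the transcendental equations listed in the introduction.

```python
import math

def bisect(f, a, b, tol=1e-10, max_iter=200):
    """Halve [a, b] until it is shorter than tol; assumes f(a) < 0 < f(b)."""
    if not (f(a) < 0 < f(b)):
        raise ValueError("expected f(a) < 0 and f(b) > 0")
    for _ in range(max_iter):
        mid = 0.5 * (a + b)          # average of the ends of the range
        fm = f(mid)
        if fm == 0 or (b - a) < tol:
            return mid
        if fm > 0:
            b = mid                  # root lies between a and mid
        else:
            a = mid                  # root lies between mid and b
    return 0.5 * (a + b)

def g(x):
    return x * math.exp(x) - 2.0     # g(0) = -2 < 0 and g(1) = e - 2 > 0

root = bisect(g, 0.0, 1.0)
print(root)
```

Each pass halves the interval containing the root, so the error bound shrinks by a factor of two per iteration regardless of the shape of f.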