FILE COMPRESSION

the message: this means that we need to save the tree along with the message in order to decode it. Fortunately, this does not present any real difficulty. It is actually necessary only to store the code array, because the radix search trie which results from inserting the entries from that array into an initially empty tree is the decoding tree.

Thus, the storage savings quoted above is not entirely accurate, because the message can't be decoded without the trie, and we must take into account the cost of storing the trie (i.e., the code array) along with the message. Huffman encoding is therefore only effective for long files where the savings in the message is enough to offset the cost, or in situations where the coding trie can be precomputed and used for a large number of messages. For example, a trie based on the frequencies of occurrence of letters in the English language could be used for text documents. For that matter, a trie based on the frequency of occurrence of characters in Pascal programs could be used for encoding programs (for example, ";" is likely to be near the top of such a trie). A Huffman encoding algorithm saves about 23% when run on the text for this chapter.

As before, for truly random files, even this clever encoding scheme won't work because each character will occur approximately the same number of times, which will lead to a fully balanced coding tree and an equal number of bits per letter in the code.

Exercises

1. Implement compression and expansion procedures for the run-length encoding method for a fixed alphabet described in the text, using Q as the escape character.

2. Could "QQ" occur somewhere in a file compressed using the method described in the text? Could "QQQ" occur?

3. Implement compression and expansion procedures for the binary file encoding method described in the text.

4. The letter "q" given in the text can be processed as a sequence of five-bit characters. Discuss the pros and cons of doing so in order to use a character-based run-length encoding method.

5. Draw a Huffman coding tree for the string "ABRACADABRA." How many bits does the encoded message require?

6. What is the Huffman code for a binary file? Give an example showing the maximum number of bits that could be used in a Huffman code for an N-character ternary (three-valued) file.

7. Suppose that the frequencies of occurrence of all the characters to be encoded are different. Is the Huffman encoding tree unique?

8. Huffman coding could be extended in a straightforward way to encode in two-bit characters (using 4-way trees). What would be the main advantage and the main disadvantage of doing so?

9. What would be the result of breaking up a Huffman-encoded string into five-bit characters and Huffman encoding that string?

10. Implement a procedure to decode a Huffman-encoded string, given the code and len arrays.

23. Cryptology

In the previous chapter we looked at methods for encoding strings of characters to save space. Of course, there is another very important reason to encode strings of characters: to keep them secret.

Cryptology, the study of systems for secret communications, consists of two competing fields of study: cryptography, the design of secret communications systems, and cryptanalysis, the study of ways to compromise secret communications systems. The main application of cryptology has been in military and diplomatic communications systems, but other significant applications are becoming apparent.
Two principal examples are computer file systems (where each user would prefer to keep his files private) and "electronic funds transfer" systems (where very large amounts of money are involved). A computer user wants to keep his computer files just as private as papers in his file cabinet, and a bank wants electronic funds transfer to be just as secure as funds transfer by armored car.

Except for military applications, we assume that cryptographers are "good guys" and cryptanalysts are "bad guys": our goal is to protect our computer files and our bank accounts from criminals. If this point of view seems somewhat unfriendly, it must be noted (without being over-philosophical) that by using cryptography one is assuming the existence of unfriendliness! Of course, even "good guys" must know something about cryptanalysis, since the very best way to be sure that a system is secure is to try to compromise it yourself. (Also, there are several documented instances of wars being brought to an end, and many lives saved, through successes in cryptanalysis.)

Cryptology has many close connections with computer science and algorithms, especially the arithmetic and string-processing algorithms that we have studied. Indeed, the art (science?) of cryptology has an intimate relationship with computers and computer science that is only beginning to be fully understood. Like algorithms, cryptosystems have been around far longer than computers. Secrecy system design and algorithm design have a common heritage, and the same people are attracted to both.

It is not entirely clear which branch of cryptology has been affected most by the availability of computers. The cryptographer now has available a much more powerful encryption machine than before, but this also gives him more room to make a mistake. The cryptanalyst has much more powerful tools for breaking codes, but the codes to be broken are more complicated than ever before. Cryptanalysis can place an incredible strain on computational resources; not only was it among the first application areas for computers, but it still remains a principal application area for modern supercomputers.

More recently, the widespread use of computers has led to the emergence of a variety of important new applications for cryptology, as mentioned above. New cryptographic methods have recently been developed appropriate for such applications, and these have led to the discovery of a fundamental relationship between cryptology and an important area of theoretical computer science that we'll examine briefly in Chapter 40.

In this chapter, we'll examine some of the basic characteristics of cryptographic algorithms because of the importance of cryptography in modern computer systems and because of close relationships with many of the algorithms we have studied. We'll refrain from delving into detailed implementations: cryptography is certainly a field that should be left to experts. While it's not difficult to "keep people honest" by encrypting things with a simple cryptographic algorithm, it is dangerous to rely upon a method implemented by a non-expert.

Rules of the Game

All the elements that go into providing a means for secure communications between two individuals together are called a cryptosystem. The canonical structure of a typical cryptosystem is diagrammed below:

[Diagram omitted: the sender S encrypts the plaintext "attack at dawn" with key K; the ciphertext travels over an insecure line, where it is available to the cryptanalyst A, to the receiver R, who decrypts it with the same key.]

The sender (S) wishes to send a message (called the plaintext) to the receiver (R).
To do so, he transforms the plaintext into a secret form suitable for transmission (called the ciphertext) using a cryptographic algorithm (the encryption method) and some key (K) parameters. To read the message, the receiver must have a matching cryptographic algorithm (the decryption method) and the same key parameters, which he can use to transform the ciphertext back into the plaintext, the message. It is usually assumed that the ciphertext is sent over insecure communications lines and is available to the cryptanalyst (A). It is also usually assumed that the encryption and decryption methods are known to the cryptanalyst: his aim is to recover the plaintext from the ciphertext without knowing the key parameters. Note that the whole system depends on some separate prior method of communication between the sender and receiver to agree on the key parameters.

As a rule, the more key parameters, the more secure the cryptosystem, but the more inconvenient it is to use. This situation is akin to that for more conventional security systems: a combination safe is more secure with more numbers on the combination lock, but it is harder to remember the combination. The parallel with conventional systems also serves as a reminder that any security system is only as secure as the trustworthiness of the people that have the key.

It is important to remember that economic questions play a central role in cryptosystems. There is an economic motivation to build simple encryption and decryption devices (since many may need to be provided and complicated devices cost more). Also, there is an economic motivation to reduce the amount of key information that must be distributed (since a very secure and expensive method of communications must be used). Balanced against the cost of implementing cryptographic algorithms and distributing key information is the amount of money the cryptanalyst would be willing to pay to break the system. For most applications, it is the cryptographer's aim to develop a low-cost system with the property that it would cost the cryptanalyst much more to read messages than he would be willing to pay. For a few applications, a "provably secure" cryptosystem may be required: one for which it can be ensured that the cryptanalyst can never read messages no matter what he is willing to spend. (The very high stakes in some applications of cryptology naturally imply that very large amounts of money are used for cryptanalysis.) In algorithm design, we try to keep track of costs to help us choose the best algorithms; in cryptology, costs play a central role in the design process.

Simple Methods

Among the simplest (and among the oldest) methods for encryption is the Caesar cipher: if a letter in the plaintext is the Nth letter in the alphabet, replace it by the (N + K)th letter in the alphabet, where K is some fixed integer (Caesar used K = 3). For example, the table below shows how a message is encrypted using this method with K = 1:

    Plaintext:  ATTACK AT DAWN
    Ciphertext: BUUBDLABUAEBXO

This method is weak because the cryptanalyst has only to guess the value of K: by trying each of the 26 choices, he can be sure that he will read the message.
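To make the arithmetic concrete, here is a minimal sketch in C of a Caesar cipher over the 27-character alphabet suggested by the example above (a blank followed by A through Z, so the blank is shifted too). The routine and the driver are illustrative only and are not taken from the text; decryption is just the same routine with a shift of 27 - K, which is one way to see how little the cryptanalyst has to try.

    #include <stdio.h>
    #include <string.h>

    /* Encrypt with a Caesar cipher over the 27-character alphabet
       assumed here: blank (index 0) followed by A..Z (indices 1..26).
       Decryption is the same routine called with shift 27 - K.        */
    void caesar(const char *in, char *out, int k)
    {
        const char *alpha = " ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        size_t i;
        for (i = 0; in[i] != '\0'; i++) {
            const char *p = strchr(alpha, in[i]);
            int idx = p ? (int)(p - alpha) : 0;   /* treat unknown characters as blanks */
            out[i] = alpha[(idx + k) % 27];
        }
        out[i] = '\0';
    }

    int main(void)
    {
        char c[64], m[64];
        caesar("ATTACK AT DAWN", c, 1);    /* encrypt with K = 1: BUUBDLABUAEBXO */
        caesar(c, m, 26);                  /* decrypt by shifting the other way  */
        printf("%s\n%s\n", c, m);
        return 0;
    }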
A far better method is to use a general table to define the substitution to be made: for each letter in the plaintext, the table tells which letter to put in the ciphertext. For example, if the table gives the correspondence

     ABCDEFGHIJKLMNOPQRSTUVWXYZ
    THE QUICKBROWNFXJMPDVRLAZYG

(note that the blank is included in the alphabet, as the first character of the upper row), then the message is encrypted as follows:

    Plaintext:  ATTACK AT DAWN
    Ciphertext: HVVH OTHVTQHAF

This is much more powerful than the simple Caesar cipher because the cryptanalyst would have to try many more (about 27! > 10^28) tables to be sure of reading the message. However, "simple substitution" ciphers like this are easy to break because of letter frequencies inherent in the language. For example, since E is the most frequent letter in English text, the cryptanalyst could get a good start on reading the message by looking for the most frequent letter in the ciphertext and assuming that it is to be replaced by E. While this might not be the right choice, he certainly is better off than if he had to try all 26 letters. He can do even better by looking at two-letter combinations ("digrams"): certain digrams (such as QJ) never occur in English text while others (such as ER) are very common. By examining frequencies of letters and combinations of letters, a cryptanalyst can very easily break a simple substitution cipher.

One way to make this type of attack more difficult is to use more than one table. A simple example of this is an extension of the Caesar cipher called the Vigenère cipher: a small repeated key is used to determine the value of K for each letter. At each step, the key letter index is added to the plaintext letter index to determine the ciphertext letter index. Our sample plaintext, with the key ABC, is encrypted as follows:

    Key:        ABCABCABCABCAB
    Plaintext:  ATTACK AT DAWN
    Ciphertext: BVWBENACWAFDXP

For example, the last letter of the ciphertext is P, the 16th letter of the alphabet, because the corresponding plaintext letter is N (the 14th letter) and the corresponding key letter is B (the 2nd letter).

The Vigenère cipher can obviously be made more complicated by using different general tables for each letter of the plaintext (rather than simple offsets). Also, it is obvious that the longer the key, the better. In fact, if the key is as long as the plaintext, we have the Vernam cipher, more commonly called the one-time pad. This is the only provably secure cryptosystem known, and it is reportedly used for the Washington-Moscow hotline and other vital applications. Since each key letter is used only once, the cryptanalyst can do no better than try every possible key letter for every message position, an obviously hopeless situation since this is as difficult as trying all possible messages. However, using each key letter only once obviously leads to a severe key distribution problem, and the one-time pad is only useful for relatively short messages which are to be sent infrequently.

If the message and key are encoded in binary, a more common scheme for position-by-position encryption is to use the "exclusive-or" function: to encrypt the plaintext, "exclusive-or" it (bit by bit) with the key. An attractive feature of this method is that decryption is the same operation as encryption: the ciphertext is the exclusive-or of the plaintext and the key, but doing another exclusive-or of the ciphertext and the key returns the plaintext. Notice that the exclusive-or of the ciphertext and the plaintext is the key. This seems surprising at first, but actually many cryptographic systems have the property that the cryptanalyst can discover the key if he knows the plaintext.
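The self-inverse property of the exclusive-or scheme is easy to see in code. The short C sketch below (with a made-up key string standing in for a one-time pad) encrypts by XORing message and key byte by byte; applying exactly the same operation a second time restores the plaintext, and XORing ciphertext with plaintext would yield the key, just as described above.

    #include <stdio.h>
    #include <stddef.h>

    /* Position-by-position "exclusive-or" encryption: XOR each byte of
       the text with the corresponding byte of the key.  Applying the
       same operation again with the same key restores the original.   */
    void xor_crypt(unsigned char *text, const unsigned char *key, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            text[i] ^= key[i];
    }

    int main(void)
    {
        unsigned char msg[] = "ATTACK AT DAWN";
        unsigned char key[] = "XMCKL QOZTBWMA";   /* hypothetical one-time key */
        size_t n = sizeof msg - 1;

        xor_crypt(msg, key, n);        /* msg now holds the ciphertext        */
        xor_crypt(msg, key, n);        /* a second XOR recovers the plaintext */
        printf("%s\n", (char *)msg);   /* prints ATTACK AT DAWN               */
        return 0;
    }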
Encryption/Decryption Machines

Many cryptographic applications (for example, voice systems for military communications) involve the transmission of large amounts of data, and this makes the one-time pad infeasible. What is needed is an approximation to the one-time pad in which a large amount of "pseudo-key" can be generated from a small amount of true key to be distributed.

The usual setup in such situations is as follows: an encryption machine is fed some cryptovariables (true key) by the sender, which it uses to generate a long stream of key bits (pseudo-key). The exclusive-or of these bits and the plaintext forms the ciphertext. The receiver, having a similar machine and the same cryptovariables, uses them to generate the same key stream to exclusive-or against the ciphertext and to retrieve the plaintext.

Key generation in this context is obviously very much like random number generation, and our random number generation methods are appropriate for key generation (the cryptovariables are the initial seeds of the random number generator). In fact, the linear feedback shift registers that we discussed in Chapter 3 were first developed for use in encryption/decryption machines such as the one described here. However, key generators have to be somewhat more complicated than random number generators, because there are easy ways to attack simple linear feedback shift registers. The problem is that it might be easy for the cryptanalyst to get some plaintext (for example, silence in a voice system), and therefore some key. If the cryptanalyst can get enough key that he has the entire contents of the shift register, then he can get all the key from that point on.

Cryptographers have several ways to avoid such problems. One way is to make the feedback function itself a cryptovariable. It is usually assumed that the cryptanalyst knows everything about the structure of the machine (maybe he stole one) except the cryptovariables, but if some of the cryptovariables are used to "configure" the machine, he may have difficulty finding their values. Another method commonly used to confuse the cryptanalyst is the product cipher, where two different machines are combined to produce a complicated key stream (or to drive each other). Another method is nonlinear substitution; here the translation between plaintext and ciphertext is done in large chunks, not bit-by-bit. The general problem with such complex methods is that they can be too complicated for even the cryptographer to understand and that there is always the possibility that things may degenerate badly for some choices of the cryptovariables.
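As a toy illustration of this setup (not a recommendation of any particular generator), the C sketch below uses a 16-bit linear feedback shift register as the key generator: the cryptovariable is the initial register contents, and the pseudo-key bits it produces are exclusive-ored with the message. The register width, tap positions, and seed are arbitrary choices for illustration; as noted above, such a simple generator is easy for a cryptanalyst to attack once he recovers a stretch of key.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy key generator built from a 16-bit linear feedback shift register.
       The "cryptovariable" (true key) is the initial register contents; the
       register is stepped to produce a long stream of pseudo-key bits.  The
       tap positions are only illustrative -- as the text notes, a real key
       generator must be considerably more complicated than this.           */
    static uint16_t lfsr;

    static int next_key_bit(void)
    {
        int out = lfsr & 1;                                   /* bit shifted out */
        int fb  = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1;
        lfsr = (uint16_t)((lfsr >> 1) | (fb << 15));
        return out;
    }

    static unsigned char next_key_byte(void)
    {
        unsigned char b = 0;
        for (int i = 0; i < 8; i++)
            b = (unsigned char)((b << 1) | next_key_bit());
        return b;
    }

    int main(void)
    {
        unsigned char msg[] = "ATTACK AT DAWN";
        int n = (int)sizeof msg - 1;

        lfsr = 0xACE1;                                          /* sender's cryptovariable       */
        for (int i = 0; i < n; i++) msg[i] ^= next_key_byte();  /* encrypt                       */

        lfsr = 0xACE1;                                          /* receiver loads the same seed  */
        for (int i = 0; i < n; i++) msg[i] ^= next_key_byte();  /* same key stream: decrypt      */

        printf("%s\n", (char *)msg);                            /* prints ATTACK AT DAWN         */
        return 0;
    }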
Public-Key Cryptosystems

In commercial applications such as electronic funds transfer and (real) computer mail, the key distribution problem is even more onerous than in the traditional applications of cryptography. The prospect of providing long keys (which must be changed often) to every citizen, while still maintaining both security and cost-effectiveness, certainly inhibits the development of such systems. Methods have recently been developed, however, which promise to eliminate the key distribution problem completely. Such systems, called public-key cryptosystems, are likely to come into widespread use in the near future. One of the most prominent of these systems is based on some of the arithmetic algorithms that we have been studying, so we will take a close look at how it works.

The idea in public-key cryptosystems is to use a "phone book" of encryption keys. Everyone's encryption key (denoted by P) is public knowledge: a person's key could be listed, for example, next to his number in the telephone book. Everyone also has a secret key used for decryption; this secret key (denoted by S) is not known to anyone else. To transmit a message M, the sender looks up the receiver's public key, uses it to encrypt the message, and then transmits the message. We'll denote the encrypted message (ciphertext) by C = P(M). The receiver uses his private decryption key to decrypt and read the message. For this system to work we must have at least the following properties:

(i) S(P(M)) = M for every message M.
(ii) All (S,P) pairs are distinct.
(iii) Deriving S from P is as hard as reading M.
(iv) Both S and P are easy to compute.

The first of these is a fundamental cryptographic property, the second two provide the security, and the fourth makes the system feasible for use.

This general scheme was outlined by W. Diffie and M. Hellman in 1976, but they had no method which satisfied all of these properties. Such a method was discovered soon afterwards by R. Rivest, A. Shamir, and L. Adleman. Their scheme, which has come to be known as the RSA public-key cryptosystem, is based on arithmetic algorithms performed on very large integers. The encryption key P is the integer pair (N,p) and the decryption key S is the integer pair (N,s), where s is kept secret. These numbers are intended to be very large (typically, N might be 200 digits and p and s might be 100 digits). The encryption and decryption methods are then simple: first the message is broken up into numbers less than N (for example, by taking lg N bits at a time from the binary string corresponding to the character encoding of the message). Then these numbers are independently raised to a power modulo N: to encrypt a (piece of a) message M, compute C = P(M) = M^p mod N, and to decrypt a ciphertext C, compute M = S(C) = C^s mod N. This computation can be quickly and easily performed by modifying the elementary exponentiation algorithm that we studied in Chapter 4 to take the remainder when divided by N after each multiplication. (No more than 2 log N such operations are required for each piece of the message, so the total number of operations (on 100-digit numbers!) required is linear in the number of bits in the message.) Property (iv) above is therefore satisfied, and property (ii) can be easily enforced. We still must make sure that the cryptovariables N, p, and s can be chosen so as to satisfy properties (i) and (iii).

To be convinced of these requires an exposition of number theory which is beyond the scope of this book, but we can outline the main ideas. First, it is necessary to generate three large (approximately 100-digit) "random" prime numbers: the largest will be s and we'll call the other two x and y. Then N is chosen to be the product of x and y, and p is chosen so that ps mod (x-1)(y-1) = 1. It is possible to prove that, with N, p, and s chosen in this way, we have M^(ps) mod N = M for all messages M.

More specifically, each large prime can be generated by generating a large random number, then testing successive numbers starting at that point until a prime is found. One simple method performs a calculation on a random number that, with probability 1/2, will "prove" that the number to be tested is not prime. (A number which is not prime will survive 20 applications of this test less than one time out of a million, 30 applications less than one time out of a billion.)
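Both the encryption/decryption step and primality tests of this kind typically rest on modular exponentiation: the elementary exponentiation algorithm with a reduction modulo N after every multiplication, as described above. The C sketch below uses ordinary machine integers and tiny illustrative values (x = 47, y = 59, so N = 2773, with s = 157 and p = 17 satisfying ps mod (x-1)(y-1) = 1); these numbers are chosen only to make the example checkable by hand, and a real implementation works on numbers of a hundred digits or more, which requires multiple-precision arithmetic.

    #include <stdio.h>
    #include <stdint.h>

    /* Raise m to the e-th power modulo n by repeated squaring, taking the
       remainder after every multiplication so that intermediate results
       never grow large.  About 2 lg e multiplications suffice.            */
    uint64_t mod_pow(uint64_t m, uint64_t e, uint64_t n)
    {
        uint64_t result = 1 % n;
        m %= n;
        while (e > 0) {
            if (e & 1)
                result = (result * m) % n;   /* multiply step */
            m = (m * m) % n;                 /* square step   */
            e >>= 1;
        }
        return result;
    }

    int main(void)
    {
        /* Tiny, purely illustrative numbers: x = 47, y = 59, N = x*y = 2773,
           s = 157, and p = 17, so that p*s mod (x-1)(y-1) = 2669 mod 2668 = 1. */
        uint64_t N = 2773, p = 17, s = 157, M = 31;

        uint64_t C = mod_pow(M, p, N);       /* encrypt: C = M^p mod N */
        uint64_t D = mod_pow(C, s, N);       /* decrypt: M = C^s mod N */
        printf("M=%llu  C=%llu  decrypted=%llu\n",
               (unsigned long long)M, (unsigned long long)C, (unsigned long long)D);
        return 0;
    }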
The last step is to compute p: it turns out that a variant of Euclid's algorithm (see Chapter 1) is just what is needed. Furthermore, s seems to be difficult to compute from knowledge of p (and N), though no one has been able to prove that to be the case. Apparently, finding s from p requires knowledge of x and y, and apparently it is necessary to factor N to calculate x and y. But factoring N is thought to be very difficult: the best factoring algorithms known would take millions of years to factor a 200-digit number, using current technology.

An attractive feature of the RSA system is that the complicated computations involving N, p, and s are performed only once for each user who subscribes to the system, while the much more frequent operations of encryption and decryption involve only breaking up the message and applying the simple exponentiation procedure. This computational simplicity, combined with all the convenience features provided by public-key cryptosystems, makes this system quite attractive for secure communications, especially on computer systems and networks.

The RSA method has its drawbacks: the exponentiation procedure is actually expensive by cryptographic standards, and, worse, there is the lingering possibility that it might be possible to read messages encrypted using the method. This is true with many cryptosystems: a cryptographic method must withstand serious cryptanalytic attacks before it can be used with confidence.

Several other methods have been suggested for implementing public-key cryptosystems. Some of the most interesting are linked to an important class of problems which are generally thought to be very hard (though this is not known for sure), which we'll discuss in Chapter 40. These cryptosystems have the interesting property that a successful attack could provide insight on how to solve some well-known difficult unsolved problems (as with factoring for the RSA method). This link between cryptology and fundamental topics in computer science research, along with the potential for widespread use of public-key cryptography, has made this an active area of current research.
