
Professional Perl Programming (Wrox, 2001), part 10


Chapter 25: Unicode

The text outside the quotes has an embedding level of 0, whereas the text within the quotes (shown in ALL CAPITALS and assumed to be in the Arabic script) has an embedding level of 1. Each level has a default direction called the embedding direction. This direction is L (left to right) if the level number is even and R (right to left) if the level number is odd.

Every paragraph has a default embedding level, and thus a default direction associated with it. This is also called the base direction of the paragraph. For example, a paragraph that begins like this one, in the Latin script, has a default embedding level of 0, and hence its base direction is left to right.

What the bidi Algorithm Does

The bidi algorithm uses all these formatting codes and embedding levels to analyze text and decide how it should be rendered. Briefly, it works as follows:

❑ It breaks up the text into paragraphs by locating the paragraph separators. This is necessary because all the directional formatting codes are only effective within a paragraph. Furthermore, this is where the base direction is set. The rest of the algorithm treats the text on a paragraph-by-paragraph basis.

❑ The directional character types and the explicit formatting codes are used to resolve all the levels of embedding in the text.

❑ The text is then broken up into lines, and the characters are re-ordered on a line-by-line basis for rendering on the screen.

Perl and bidi

Since Perl is a language frequently used for text processing, it is natural that Perl should have bidi capabilities. There is an implementation of the bidi algorithm on Linux that can be used from Perl. It requires a C library named FriBidi, a free implementation of the bidi algorithm written by Dov Grobgeld.
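Before turning to the FriBidi module itself, the even/odd rule for embedding levels and the first step of the algorithm (splitting text into paragraphs) can be sketched in a few lines of Perl. This is only an illustration, not part of FriBidi: the function names embedding_direction and split_paragraphs are hypothetical, and the sketch assumes that a blank line acts as the paragraph separator.

```perl
use strict;
use warnings;

# Return the embedding direction for a bidi embedding level:
# 'L' (left to right) for even levels, 'R' (right to left) for odd ones.
sub embedding_direction {
    my ($level) = @_;
    return $level % 2 == 0 ? 'L' : 'R';
}

# Split text into paragraphs.
# Assumption for this sketch: a blank line separates paragraphs.
sub split_paragraphs {
    my ($text) = @_;
    return grep { length } split /\n\s*\n/, $text;
}

for my $level (0 .. 3) {
    printf "level %d -> %s\n", $level, embedding_direction($level);
}
```

Each paragraph returned by split_paragraphs would then be given its own base direction before any re-ordering takes place.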
A Perl module has also been written by the same author, Dov Grobgeld. It acts as an interface to the C library and is available as FriBidi-0.03.tar.gz from http://imagic.weizmann.ac.il/~dov/freesw/FriBidi. The FriBidi module enables us to do the following:

❑ Convert an ISO 8859-8 string to a FriBidi Unicode string:

    iso8859_8_to_unicode($string);

❑ Perform a logical to visual transformation. In other words, run the string obtained above through the bidi algorithm:

    log2vis($UniString, $optionalBaseDirection);

This calculates the base direction if it is not passed as the second argument. It returns the re-ordered string in scalar context, and additionally returns the base direction as the second element of the list in list context.

❑ Convert the string obtained above to the ISO 8859-8 character set:

    unicode_to_iso8859_8($toDisplay);

This makes sure that it is in a 'ready-to-display' format, assuming the terminal can display ISO 8859-8 characters (such as xterm).

❑ Translate a string from capRTL to a FriBidi Unicode string and vice versa:

    caprtl_to_unicode($capRTLString);
    unicode_to_caprtl($fribidiString);

The capRTL format is one in which the CAPITAL LETTERS are mapped as having a strong right-to-left character property (RTL). This format is frequently used for illustrating bidi properties on displays with limited ability, such as ASCII-only displays.

The following is a small example to demonstrate FriBidi's capabilities. First, we create a small file named bidisample with the following text:

    THUS, SAID THE CAMEL TO THE MEN, " there is more than one way to do it."
    AND THE MEN REPLIED " now we see what you mean by bidi",
    RISING WITH CONTENTMENT WRIT ON THEIR FACES.
This is the code to render the above file in bidi fashion:

    #!/usr/bin/perl
    # bidirender.pl
    use warnings;
    use strict;
    use FriBidi;

    my ($uniStr, $visStr, $outStr);

    open(BIDISAMPLE, "bidisample") or die "Cannot open bidisample: $!";
    while (<BIDISAMPLE>) {
        chomp;                                  # remove line separator
        $uniStr = caprtl_to_unicode($_);        # convert line to FriBidi string
        $visStr = log2vis($uniStr);             # run it through the bidi algorithm
        $outStr = unicode_to_caprtl($visStr);   # convert it back to a format that
                                                # can be displayed on an ordinary
                                                # ASCII terminal
        print $outStr, "\n";
    }
    close(BIDISAMPLE);

    > perl bidirender.pl
    "theresmorethanonewaytodoit ",NEMEHTOTLEMACEHTDIASSUHT
    ,"now we see what you mean by bidi " DEILPER NEM EHT DNA
    .SECAF RIEHT NO TIRW TNEMTNETNOC HTIW GNISIR

Perl, I18n and Unicode

Now let us take a brief look at a solution to the problem of language barriers. A more extensive view of internationalization can be found in Chapter 26.

Unicode helps us out in this matter by providing a uniform way of representing all possible characters of all the living languages in this world. We are about to see how easy it is to enable people all over the world to understand what we are saying in their own language. This example may be tried out by anyone with a day or two of Perl experience. Although it is in no way complete, with no error checking and no pretense of handling any real-world complexity, it demonstrates the ease with which Perl handles Unicode.

Let us imagine the following scenario: an airport wants to have information kiosks at various locations outside the arrival lounge for foreign tourists. The information needs to be displayed in Arabic, Japanese, Russian, Greek, English, Spanish, Portuguese, and a whole host of other languages. The kiosks should enable the user to view information about the city, events, weather, flight schedules, and sight-seeing tours, and also to make and confirm reservations in affiliated hotels.
Our task here is to create a Perl program that is able to handle Unicode and therefore, to an extent, solve this problem. The first thing we need to do is create a template HTML file containing a few HTML tags, but with the text replaced by text 'markers': M1, M2, M3, and so on. We also need one string resource file for each language, in the following format (all the files should contain Unicode text encoded in UTF-8, with the marker and its string separated by a tab, which is what the script splits on):

    M1:charset    "string corresponding to charset"
    M2:title      "string corresponding to title"
    M3:heading    "string corresponding to heading"
    M4:text       "text string"

To put the task another way, we need to write a program that takes the language name as input and generates an HTML file named after that language (for example, english.html) by filling in the template file with the strings in the language requested. The output file should be UTF-8 encoded Unicode.

This involves a few things such as installing Unicode fonts, installing a Unicode editor, creating template HTML files, writing scripts, and so on. Let us look at these step by step.

Installing Unicode Fonts

For UNIX with the X Window System and Netscape Navigator, information regarding Unicode fonts for X11 in general can be found at http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html. The latest version of the UCS fonts package is available from http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz.

For Windows with IE 5.5, Unicode fonts can be selected during installation or can be downloaded from http://www.microsoft.com/typography/multilang/default.htm.

Another good place for links to fonts is http://www.ccss.de/slovo/unifonts.htm.

Installing a Unicode Editor

For UNIX with the X Window System, Yudit is a good choice of editor that supports UTF-8. It is available from http://www.yudit.org/.

For Windows 95 and 98, Sharmahd Computing's UniPad is a good editor, available from http://www.sharmahd.com/unipad/.

For Windows NT and 2000, Notepad is able to handle Unicode.
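The string resource format described earlier (one marker and its text per line) can be sketched as a small parser. This is a minimal illustration under stated assumptions, not the chapter's own code: the function name parse_resources is hypothetical, the sample strings are invented for the example, and the marker is assumed to be separated from its text by a single tab.

```perl
use strict;
use warnings;

# Parse string-resource lines of the form "marker<TAB>text" into a hash.
# Blank lines and lines without a tab are skipped.
sub parse_resources {
    my (@lines) = @_;
    my %value_of;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;
        my ($marker, $text) = split /\t/, $line, 2;
        next unless defined $text;
        $value_of{$marker} = $text;
    }
    return %value_of;
}

# Hypothetical sample strings, standing in for the contents of a .str file.
my %english = parse_resources(
    "M1:charset\tutf-8\n",
    "M2:title\tAirport information\n",
    "M3:heading\tWelcome\n",
);
print "$_ => $english{$_}\n" for sort keys %english;
```

Limiting split to two fields means that a tab inside the text itself does not truncate the string.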
Creating the HTML Template

Now we can go about creating the template HTML files and the string resource files. The following is simply an HTML template, called templateLeft.html, that we will use with our program:

    <html>
    <head>
    <meta http-equiv="Content-Type" Content="text/html" Charset=M1:charset>
    <title>M2:title</title>
    </head>
    <body>
    <h1>M3:heading</h1><hr>
    <p>M4:text</p>
    </body>
    </html>

Note that this is intuitively correct for languages written from left to right, but for languages written the other way we need a modified template. For languages such as Arabic, we can simply right-justify the displayed text by setting the ALIGN attribute to RIGHT. This should be done for all text in the body of the document, so that it lines up as expected for text that runs from right to left. This is the template, templateRight.html, that we will use with right-to-left languages:

    <html>
    <head>
    <meta http-equiv="Content-Type" Content="text/html" Charset=M1:charset>
    <title>M2:title</title>
    </head>
    <body>
    <h1 ALIGN=RIGHT>M3:heading</h1><hr>
    <p ALIGN=RIGHT>M4:text</p>
    </body>
    </html>

This next image is the sample string resource file for the English language using UniPad:

The following image is a screenshot of a sample string file for the Arabic language. Note the rendering of Arabic text from left to right:

Processing the Resource Files

The fourth stage in our solution to the problem is creating the Perl script.
This script will process the resource files it is given and generate the localized pages:

    #!/usr/bin/perl
    # Xlate.pl
    use warnings;
    use strict;

    my ($langname, $filename, $marker, $mark, $value, $wholefile,
        $thisval, $template, %valueof);

    print "Enter the language for the output html file: \n";
    $langname = lc <STDIN>;    # get language name and turn it into lowercase
    chomp $langname;
    $filename = $langname . ".str";    # generate filename from language name

    # read the markers and values into a hash
    open(LANGFILE, $filename);
    while (<LANGFILE>) {
        chomp($_);
        ($marker, $value) = split("\t", $_);
        $valueof{$marker} = $value;
    }
    close(LANGFILE);

    # use the correct template
    if ($langname =~ /arabic|hebrew/) {
        $template = 'templateRight.html';
    } else {
        $template = 'templateLeft.html';
    }

    open(TMPLT, $template);
    open(OUTFILE, ">$langname.html");
    $wholefile = join('', <TMPLT>);    # slurp entire file into a string
    close TMPLT;

    foreach $mark (keys %valueof) {
        $thisval = $valueof{$mark};             # get the value related to the marker
        $wholefile =~ s/\Q$mark\E/$thisval/g;   # do the replacement (marker taken literally)
    }

    print OUTFILE $wholefile;    # write out complete langname.html file
    print "output written to $langname.html \n";
    close OUTFILE;

This is the big surprise: the script looks too simple. In fact, no extra processing is required to handle Unicode.

Running the Script

Now we can execute the code in the usual way and provide the language required:

    > perl Xlate.pl
    Enter the language for the output html file:
    ENGLISH
    output written to english.html

    > perl Xlate.pl
    Enter the language for the output html file:
    Arabic
    output written to arabic.html

The Output Files

After running the script and having it produce the *.html files, we can open them in a browser and see what has been written.
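The two decisions the script makes, which template to use and how to substitute the markers, can also be written as small functions that are easy to test on their own. This is a sketch rather than a replacement for Xlate.pl: the names template_for and fill_template are hypothetical, the sample strings are invented, and the list of right-to-left languages simply mirrors the one in the script.

```perl
use strict;
use warnings;

# Pick the template for a language name (case-insensitive).
sub template_for {
    my ($language) = @_;
    return lc($language) =~ /^(?:arabic|hebrew)$/
        ? 'templateRight.html'
        : 'templateLeft.html';
}

# Replace every marker found in $template with its value.
sub fill_template {
    my ($template, $value_of) = @_;
    # Longest markers first, so a marker that is a prefix of another
    # cannot clobber it; \Q...\E stops marker text acting as a regex.
    for my $marker (sort { length($b) <=> length($a) } keys %$value_of) {
        my $value = $value_of->{$marker};
        $template =~ s/\Q$marker\E/$value/g;
    }
    return $template;
}

my $page = fill_template(
    "<title>M2:title</title>\n<h1>M3:heading</h1>\n",
    { 'M2:title' => 'Airport information', 'M3:heading' => 'Welcome' },
);
print $page;
```

When these functions are fed from real files, checking the result of each open (for example with `or die "Cannot open $filename: $!"`) avoids silently producing an empty page when a resource file is missing.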
The following is a screenshot of the english.html file generated by the script:

Next is the arabic.html file generated by the script. Note the direction of Arabic script as rendered by the browser:

There are more examples of the same phrase written in different languages at http://www.trigeminal.com/samples/provincial.html, thanks to Michael Kaplan. A few more such sites are http://www.columbia.edu/kermit/utf8.html (hosted by the Kermit Project), http://www.unicode.org/unicode/standard/WhatIsUnicode.html (hosted by the Unicode Consortium) and http://hcs.harvard.edu/~igp/glass.html (hosted by the IGP).

This simple method of replacing text markers is still widely used. However, localizing a large web site takes much more than just being able to handle Unicode strings. Things such as cultural preferences, date formats, and currency (which are covered in Chapter 26) need to be taken into consideration. This means we should probably turn to modules such as HTML::Template, HTML::Mason, HTML::Embperl or maybe something like XML::Parser in order to create an industrial strength multilingual site.

Work in Progress

There are still a few things about Unicode support in Perl that are under development. For instance, it is not possible right now to determine whether an arbitrary piece of data is UTF-8 encoded or not. We cannot force the encoding used when performing I/O to be anything other than UTF-8, and we will have a problem if the pattern in a match does not contain Unicode but the string to be matched at runtime does. Also, the use utf8 pragma is on its way out.

In order to follow the current state of Unicode support in Perl, one can join the perl-unicode mailing list by sending a blank message to majordomo@perl.org with a subject line saying subscribe perl-unicode.
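As a hedged aside on the first limitation above: on later Perl releases (5.8 and newer, which is an assumption about the reader's environment and not something this chapter covers), the built-in utf8::decode function reports whether a byte string is well-formed UTF-8. The sketch below wraps it in a hypothetical helper named looks_like_utf8. Well-formedness is only a heuristic, since some byte strings in other encodings also happen to be valid UTF-8.

```perl
use strict;
use warnings;

# Return 1 if the byte string is well-formed UTF-8, 0 otherwise.
# utf8::decode converts its argument in place, so work on a copy.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

# "caf" followed by the two-byte UTF-8 sequence for e-acute
print looks_like_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";

# "caf" followed by a lone Latin-1 e-acute byte, which is not valid UTF-8
print looks_like_utf8("caf\xE9") ? "valid\n" : "invalid\n";
```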
Summary

All said and done, we should not need to worry about Unicode support within Perl unless we really have to. All the code we write will work just as well as it did before Unicode support came on the scene, so we only need to use it when dealing with foreign scripts, for example. There is more on using Perl around the world in Chapter 26.

In this chapter we have looked at the details of using Perl with non-standard characters, such as symbols or foreign scripts. We began with the problems we face as Perl is used across the globe, and then we looked at what can be done to provide solutions. In summary, we have:

❑ Seen how people have tackled the issue of providing an international coding system.

❑ Looked at how Unicode can be used in regular expressions and tried our hand at writing our own character property.

❑ Demonstrated how Perl can be used to deal with texts in languages that are written from right to left, as opposed to languages such as English that are written from left to right.

❑ Provided a real world example of how we can deal with language barriers across the world.

Chapter 26: Locale and Internationalization

[...]

... writing is 0.10.35. It is directly available from http://www.gnu.org/software/gettext/gettext.html and is supported by several tools, and an EMACS mode. For more information regarding this subject, refer to a book such as Professional Linux Programming from Wrox Press, ISBN 1861003013.

❑ Locale::Maketext: a complete and purely Perl based solution. At the time of writing the latest version is 0.18. The documentation ...
... in fact the same, the only difference being that the latter has an acute accent, so they should go together. To help solve this problem, we can find a useful tutorial included with Perl itself, by typing:

    > perldoc perllocale

This should tell us most of what we need to know about locale, a method ...

... etc.

    > perl invest.pl Arnaldo firms_money.txt
    Holá Arnaldo, aquí está tu informe para 13 dic 2000 14:56:22 CET
    Andalia          4567987.00 ESP    52530160.34 USD $
    Ántico           1000000.00 ESP    11499630.00 USD $
    Cántaliping        46168.50 ESP      530920.67 USD $
    Cantamornings    6669876.00 ESP    76701106.15 USD $
    Chilindrina         2000.35 ESP       23003.28 USD $
    Cflab.org         123456.70 ESP     1419706.37 USD $
    Zinzun.com         33445.00 ESP      384605.13 USD $

    > perl ...
    Ántico           1000000.00 ESP     1000000.00 ESP Pts
    Cántaliping        46168.50 ESP       46168.50 ESP Pts
    Cantamornings    6669876.00 ESP     6669876.00 ESP Pts
    Chilindrina         2000.35 ESP        2000.35 ESP Pts
    Cflab.org         123456.70 ESP      123456.70 ESP Pts
    Zinzun.com         33445.00 ESP       33445.00 ESP Pts

    > perl invest.pl Orlando firms_money.txt
    Holá Orlando, aquí está tu informe para 13 dic 2000 14:57:59 CET
    Andalia          4567987.00 ESP       23995.27 USD $
    Ántico           1000000.00 ...
    ...                 2000.35 ESP          10.51 USD $
    Cflab.org         123456.70 ESP         648.51 USD $
    Zinzun.com         33445.00 ESP         175.68 USD $

    > perl invest.pl Joe firms_money.txt
    Hi Joe, here's your report for Wed 13 Dec 2000 02:59:49 PM CET
    Andalia          4567987.00 ESP       24019.30 USD $
    Cantamornings    6669876.00 ESP       35071.41 USD $
    Chilindrina         2000.35 ESP          10.52 USD ...
... reason, it seems to be the only one to use the 'traditional' Spanish ordering.

The problem with setting locales using this function (instead of the pragma) is that, despite what is said in the perllocale documentation, the Perl functions cmp and sort do not use it by default. For this reason we need ...

... alphabetical order. We call this file firms.txt:

    Chilindrina
    Ántico
    Cflab.org
    Cantamornings
    Cántaliping
    Andalia
    Zinzun.com

We can use a simple Perl command line that prints a sorted list on the screen (note that on Windows the special characters are not displayed properly):

    > perl -e 'print sort <>;' firms.txt
    Andalia
    Cantamornings
    Cflab.org
    Chilindrina
    Cántaliping
    Zinzun.com
    Ántico

This is not ideal though ...

... of writing the latest version is 0.18. The documentation (available by typing > perldoc Locale::Maketext, after installation) includes a synopsis. It is important to note that at present, Locale::Maketext is still in its early stages.

For the purposes of our site, we have decided upon gettext, since it has very good support in Perl. The first thing we have to do is to create a so-called Portable Object (PO) ...

... development kit, which can be obtained from http://www.kde.org. For the time being, we opt for the Perl way, and choose to use Locale::PO, an object oriented class for creating PO files. The following program is written using this module:

    #!/usr/bin/perl
    # pocreate.pl
    use warnings;
    use strict;
    use Locale::PO;

    my $i;
    my ...
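As an aside on the sorting problem with firms.txt shown above, accented names can also be kept next to their unaccented neighbours without relying on the system locale. The sketch below uses the Unicode::Collate module, which ships with later Perl releases; this is a different technique from the locale-based one the chapter describes, and the helper name collated_sort is hypothetical. The accented letters are written as escapes so that the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Firm names from firms.txt. \x{C1} is A-acute and \x{E1} is a-acute.
my @firms = (
    "Chilindrina", "\x{C1}ntico", "Cflab.org", "Cantamornings",
    "C\x{E1}ntaliping", "Andalia", "Zinzun.com",
);

# Sort using the default Unicode Collation Algorithm ordering, in which
# an accent only matters when the base letters are otherwise equal.
sub collated_sort {
    my (@names) = @_;
    return Unicode::Collate->new->sort(@names);
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for collated_sort(@firms);
```

With this ordering Ántico follows Andalia and Cántaliping precedes Cantamornings, instead of the accented names trailing after their unaccented neighbours as they do with a plain sort.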
    ... $hour hours $min minutes $sec seconds );

When issued with the following example commands, the program outputs something similar to:

    > perl zones.pl Cocos Indian
    You are in zone CCT
    Difference with respect to GMT is 8 hours
    And local time is 14 hours 34 minutes 38 seconds

    > perl zones.pl Tokyo Asia
    You are in zone ...

Posted: 12/08/2014, 23:23
