5.2.1 Character sets
223
Table 221.2: Relative frequency (most common to least common, with parentheses used to bracket extremely rare letters) of letter usage in various human languages (the English ranking is based on the British National Corpus). Based on Kelk. [729]

Language    Letters
English     etaoinsrhldcumfpgwybvkxjqz
French      esaitnrulodcmpộvqfbghjxốyờzõỗợựụỷùkởw
Norwegian   erntsilakodgmvfupbhứjyồổcwzx(q)
Swedish     eantrsildomkgvọfhupồửbcyjxwzộq
Icelandic   anriestulgmkfhvoỏỵớdjúbyổỳửpộ`ycxwzq
Hungarian   eatlnskomzrigỏộydbvhjofupửúcuớỳỹxw(q)
222
The representation of each member of the source and execution basic character sets shall fit in a byte.
Commentary
This is a requirement on the implementation. The definition of character already specifies that it fits in a byte. However, a character constant has type int, which could be thought to imply that the value representation of characters need not fit in a byte. This wording clarifies the situation. The representation of members of the basic execution character set is also required to be a nonnegative value.
[See also: 59 character, single-byte; 883 character constant, type; 478 basic character set, positive if stored in char object.]
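The two properties discussed in this commentary can be checked directly in code. The following is a minimal sketch; the function names are illustrative assumptions and do not come from the standard or from the text above.

```c
#include <limits.h>
#include <stddef.h>

/* Size, in bytes, of an integer character constant.
   In C such a constant has type int, so this equals sizeof(int). */
size_t char_constant_size(void)
{
    return sizeof 'W';
}

/* Nonzero if the value of 'W' can be stored unchanged in a plain char.
   'W' is a member of the basic execution character set, so the
   standard guarantees that it is nonnegative and fits. */
int basic_member_fits_in_char(void)
{
    return 'W' >= 0 && 'W' <= CHAR_MAX;
}
```

Although the constant occupies sizeof(int) bytes as an expression, its value is small enough to be stored in a single char object.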
C++
1.7p1
A byte is at least large enough to contain any member of the basic execution character set and . . .
This requirement reverses the dependency given in the C Standard, but the effect is the same.
Common Implementations
On hosts where characters have a width of 16 or 32 bits, that choice has usually been made because of addressability issues (pointers only being able to point at storage on 16- or 32-bit address boundaries). It is not usually necessary to increase the size of a byte because of representational issues to do with the character set.
In the EBCDIC character set, the value of 'a' is 129 (in Ascii it is 97). If the implementation-defined value of CHAR_BIT is 8, then this character, and some others, will not be representable in the type signed char (in most implementations the representation actually used is the negative value whose least significant eight bits are the same as those of the corresponding bits in the positive value, in the character set). In such implementations the type char will need to have the same representation as the type unsigned char.
The ICL 1900 series used a 6-bit byte. Implementing this requirement on such a host would not have been possible.
[See also: 307 CHAR_BIT macro.]
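The EBCDIC case can be illustrated with a range check against the limits in <limits.h>. This is a sketch only; the helper names are assumptions introduced for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* True if value can be stored in a signed char without being changed. */
bool fits_in_signed_char(int value)
{
    return value >= SCHAR_MIN && value <= SCHAR_MAX;
}

/* True if plain char has the same range as unsigned char on this
   implementation (as an EBCDIC implementation with CHAR_BIT == 8
   would need). */
bool plain_char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}
```

With CHAR_BIT equal to 8, fits_in_signed_char(97) holds (the Ascii value of 'a') while fits_in_signed_char(129) does not (the EBCDIC value of 'a').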
Coding Guidelines
A general principle of coding guidelines is to recommend against the use of representation information. In this case the standard is guaranteeing that a character will fit within a given amount of storage. Relying on this requirement might almost be regarded as essential in some cases.
[See also: 569.1 representation information, using.]
Example
void f(void)
{
    char C_1 = 'W';        /* Guaranteed to fit in a char. */
    char C_2 = '$';        /* Not guaranteed to fit in a char. */
    signed char C_3 = 'W'; /* Not guaranteed to fit in a signed char. */
}
June 24, 2009 v 1.2
223
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
Commentary
This is a requirement on the implementation. The Committee realized that a large number of existing programs depended on this statement being true. It is certainly true for the two major character sets used in the English-speaking world, Ascii and EBCDIC, and for all of the human language digit encodings specified in Unicode, see Table 797.1. The Committee thus saw fit to bless this usage.
Not only is it possible to perform relational comparisons on the digit characters (e.g., '0' < '1' is always true) but arithmetic operations can also be performed (e.g., '0' + 1 == '1'). A similar statement for the alphabetic characters cannot be made because it would not be true for at least one character set in common use (e.g., EBCDIC).
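The contiguity guarantee is what makes the familiar digit-to-number idiom portable. The sketch below shows one way it might be used; the function names are assumptions, and overflow is deliberately not handled.

```c
#include <stddef.h>

/* Returns 0-9 for a decimal digit character, or -1 for anything else.
   Relies only on '0'..'9' having contiguous, increasing values. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

/* Converts a leading run of decimal digits to a number, stopping at
   the first non-digit (including the terminating null character).
   Overflow is not checked in this sketch. */
unsigned long parse_decimal(const char *s)
{
    unsigned long result = 0;

    for (size_t i = 0; digit_value(s[i]) >= 0; i++)
        result = result * 10 + (unsigned long)digit_value(s[i]);
    return result;
}
```

The equivalent idiom for letters (e.g., c - 'a') is not portable, because the alphabetic characters are not contiguous in EBCDIC.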
C++
The above wording has been proposed as the response to C++ DR #173.
Other Languages
Most languages that have not recently had their specifications updated do not specify any representational
properties for the values of their execution character sets. Java specifies the use of the Unicode character set
(newer versions of the language specify newer versions of the Unicode Standard; all of which are the same
as Ascii for their first 128 values), so this statement also holds true. Ada specifies the subset of ISO 10646
known as the Basic Multilingual Plane (the original language standard specified ISO 646).
[See also: 28 ISO 10646.]
Coding Guidelines
This requirement on an implementation provides a guarantee of representation information that developers can make use of (e.g., in relational comparisons, see Table 866.3). The following are suggested wordings for deviations from the guideline recommendation dealing with making use of representation information.
[See also: 569.1 representation information, using.]
Dev 569.1
An integer character constant denoting a digit character may appear in the visible source as the operand of an additive operator.
Example
#include <stdio.h>

extern char c_glob = '4';

int main(void)
{
    if ('0' + 3 == '3')
        printf("Sentence 221 is TRUE\n");

    if (c_glob < '5')
        printf("Sentence 221 may be TRUE\n");
    if (c_glob < 53) /* '5' == 53 in ASCII */
        printf("Sentence 221 does not apply\n");
}
224
In source files, there shall be some way of indicating the end of each line of text;
Commentary
This is a requirement on the implementation.
The C library makes a distinction between text and binary files. However, there is no requirement that
source files exist in either of these forms. The worst-case scenario: In a host environment that did not have
a native method of delimiting lines, an implementation would have to provide/define its own convention
and supply tools for editing such files. Some integrated development environments do define their own
conventions for storing source files and other associated information.
C++
The C++ Standard does not specify this level of detail (although it does refer to end-of-line indicators, 2.1p1n1).
Common Implementations
Unicode Technical Report #13: “Unicode newline guidelines” discusses the issues associated with representing new-lines in files. The ISO 6429 standard also defines NEL (NExt Line, hexadecimal 0x85) as an end-of-line indicator. The Microsoft Windows convention is to indicate end-of-line with a carriage return/line feed pair, \r\n (a convention that goes back through CP/M to DEC RT-11); the Unix convention is to use a single line feed character, \n; the Macintosh convention is to use the carriage return character, \r.
Some mainframes implement a form of text files that mimic punched cards by having fixed-length lines.
Each line contains the same number of characters, often 80. The space after the last user-written character is
sometimes padded with spaces, other times it is padded with null characters.
225
this International Standard treats such an end-of-line indicator as if it were a single new-line character.
Commentary
The standard is not interested in the details of the byte representation of end-of-line on storage media. It makes use of the concept of end-of-line and uses the conceptual simplification of treating it as if it were a single character.
[See also: 116 translation phase 1.]
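The conceptual simplification can be sketched as a routine that maps the three byte conventions mentioned under Common Implementations (\r\n, \n, and \r) onto a single new-line character. This is an illustration of the idea only, not a description of how any particular translator implements translation phase 1; the function name is an assumption, and the sketch assumes a host where '\r' and '\n' are distinct single bytes.

```c
#include <stddef.h>

/* Rewrites buf[0..len) in place so that each CR/LF pair, lone CR,
   or lone LF becomes a single '\n'. Returns the new length, which
   is never greater than len. */
size_t normalize_newlines(char *buf, size_t len)
{
    size_t out = 0;

    for (size_t in = 0; in < len; in++) {
        if (buf[in] == '\r') {
            buf[out++] = '\n';
            if (in + 1 < len && buf[in + 1] == '\n')
                in++;               /* skip the LF of a CR/LF pair */
        } else {
            buf[out++] = buf[in];   /* ordinary character or lone LF */
        }
    }
    return out;
}
```

Fixed-length-record formats and NEL would need additional handling that is omitted here.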
C++
2.1p1n1
. . . (introducing new-line characters for end-of-line indicators) . . .
226
In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line.
Commentary
This is a requirement on the implementation.
These characters form part of the set of 96 execution character set members (counting the null character) defined by the standard, plus new line which is introduced in translation phase 1. However, these characters are not in the basic source character set, and are represented in it using escape sequences.
[See also: 221 basic execution character set; 116 translation phase 1; 866 escape sequence syntax.]
Other Languages
Few other languages include the concept of control characters, although many implementations provide
semantics for them in source code (they are usually mapped exactly from the source to the execution character
set). Java defines the same control characters as C and gives them their equivalent Ascii values. However, it
does not define any semantics for these characters.
Common Implementations
ECMA-48 Control Functions for Coded Character Sets, Fifth Edition (available free from their Web site, http://www.ecma-international.ch) was fast-tracked as the third edition of ISO/IEC 6429. This standard defines significantly more control functions than those specified in the C Standard.
227
If any other characters are encountered in a source file (except in an identifier, a character constant, a string
literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior
is undefined.
Commentary
The standard does not prohibit such characters from occurring in a source file outright. The Committee was aware of implementations that used such characters to extend the language. For instance, the use of the @ character in an object definition to specify its address in storage.
The list of exceptions is extensive. The only usage remaining, for such characters, is as a punctuator. Any other character has to be accepted as a preprocessing token. It may subsequently, for instance, be stringized. It is the attempt to convert this preprocessing token into a token where the undefined behavior occurs.
[See also: 1950 # operator; 137 preprocessing token, converted to token.]
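The stringizing case can be sketched as follows. The macro and function names are assumptions made for illustration, and the example assumes an implementation that accepts @ as a member of its source character set (it is not in the basic source character set).

```c
/* STRINGIZE turns the spelling of its argument into a string literal. */
#define STRINGIZE(x) #x

/* Here @ only ever exists as a preprocessing token: it is consumed by
   the # operator and is never converted to a token, so the undefined
   behavior described above does not arise. */
const char *at_sign_text(void)
{
    return STRINGIZE(@);
}
```

Using the same character outside a macro argument, for instance as an operator in an expression, is where a translator would typically issue a diagnostic.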
C90
Support for additional characters in identifiers is new in C99.
C++
2.1p1
Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character.
The C++ Standard specifies the behavior and a translator is required to handle source code containing such a character. A C translator is permitted to issue a diagnostic and fail to translate the source code.
Other Languages
Most languages regard the appearance of an unknown character in the source as some form of error. Like C,
most language implementations support additional characters in string literals and comments.
Common Implementations
Most implementations generate a diagnostic, either when the preprocessing token containing one of these characters is converted to a token, or as a result of the very likely subsequent syntax violation. Some implementations [728] define the @ character to be a token, its usual use being to provide the syntax for specifying the address at which an object is to be placed in storage. It is generally followed by an integer constant expression.
Coding Guidelines
An occurrence of a character outside of the basic source character set, in one of these contexts, is most likely to be a typing mistake and is very likely to be diagnosed by the translator. The other possibility is that such characters were intended to be used because use is being made of an extension. This issue is discussed elsewhere.
[See also: 95.1 extensions, cost/benefit.]
Example
static int glob @ 0x100; /* Put glob at location 0x100. */
228
A letter is an uppercase letter or a lowercase letter as defined above;
Commentary
This defines the term letter.
There is a third kind of case that characters can have, titlecase (a term sometimes applied to words where the first letter is in uppercase, or titlecase, and the other letters are in lowercase). In most instances titlecase is the same as uppercase, but there are a few characters where this is not true; for instance, the titlecase of the Unicode character U+01C9, lj, is U+01C8, Lj, and its uppercase is U+01C7, LJ.
C90
This definition is new in C99.
229
in this International Standard the term does not include other characters that are letters in other alphabets.
Commentary
All implementations are required to support the basic source character set to which this terminology applies.
Annex D lists those universal character names that can appear in identifiers. However, they are not referred
to as letters (although they may well be regarded as such in their native language).
The term letter assumes that the orthography (writing system) of a language has an alphabet. Some orthographies, for instance Japanese, don’t have an alphabet as such (let alone the concept of upper- and lowercase letters). Even when the orthography of a language does include characters that are considered to be matching upper- and lowercase letters by speakers of that language (e.g., æ and Æ, å and Å), the C Standard does not define these characters to be letters.
[See also: 792 orthography.]
C++
The definition used in the C++ Standard, 17.3.2.1.3 (the footnote applies to C90 only), implies this is also true in C++.
Coding Guidelines
The term letter has a common usage meaning in a number of different languages. Developers do not often
use this term in its C Standard sense. Perhaps the safest approach for coding guideline documents to take is
to avoid use of this term completely.
230
The universal character name construct provides a way to name other characters.
Commentary
In theory all characters on planet Earth and beyond. In practice, those defined in ISO 10646.
[See also: 28 ISO 10646.]
C90
Support for universal character names is new in C99.
Other Languages
Other language standards are slowly moving to support ISO 10646. Java supports a similar concept.
Common Implementations
Support for these characters is relatively new. It will take time before similarities between implementations
become apparent.
231
Forward references: universal character names (6.4.3), character constants (6.4.4.4), preprocessing directives (6.10), string literals (6.4.5), comments (6.4.9), string (7.1.1).
5.2.1.1 Trigraph sequences
232
Before any other processing takes place, each occurrence of one of the following sequences of three characters (called trigraph sequences; see footnote 12) is replaced with the corresponding single character.
Commentary
Trigraphs were an invention of the C committee. They are a method of supporting the input (into source files, not executing programs) and the printing of some C source characters in countries whose alphabets, and keyboards, do not include them in their national character set. Digraphs, discussed elsewhere, are another sequence of characters that are replaced by a corresponding single character.
The \? escape sequence was introduced to allow sequences of ?s to occur within string literals.
The wording was changed by the response to DR #309.
[See also: 916 digraphs; 895 string literal syntax.]
Other Languages
Until recently many computer languages did not attempt to be as worldly as C, requiring what might be called
an Ascii keyboard. Pascal specifies what it calls lexical alternatives for some lexical tokens. The character
sequences making up these lexical alternatives are only recognized in a context where they can form a single,
complete token.
Common Implementations
On the Apple Macintosh host, the notation '????' is used to denote the unknown file type. Translators in this environment often disable trigraphs by default to prevent unintended replacements from occurring.
233
??=  #    ??)  ]    ??!  |
??(  [    ??'  ^    ??>  }
??/  \    ??<  {    ??-  ~
Commentary
The above sequences were chosen to minimize the likelihood of breaking any existing, conforming, C source
code.
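The replacement rule can be sketched as a small routine that rewrites a string in place using the nine mappings defined by the standard. It is an illustration of the rule rather than a description of how any particular translator implements it, and both function names are assumptions.

```c
#include <stddef.h>

/* Returns the single character that replaces a trigraph whose third
   character is `third`, or 0 if no such trigraph exists. */
static char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Replaces every trigraph in the null-terminated string text, in
   place, scanning left to right. Returns the new length. */
size_t replace_trigraphs(char *text)
{
    size_t out = 0;
    size_t in = 0;

    while (text[in] != '\0') {
        char r = 0;

        /* Reading text[in + 2] is safe: the two preceding characters
           are question marks, so it is at worst the terminator. */
        if (text[in] == '?' && text[in + 1] == '?')
            r = trigraph_replacement(text[in + 2]);
        if (r != 0) {
            text[out++] = r;
            in += 3;
        } else {
            text[out++] = text[in++];   /* this ? is not changed */
        }
    }
    text[out] = '\0';
    return out;
}
```

Given the seven characters Eh???/n, the first ? is copied unchanged and the following three characters are replaced by a backslash, producing the five characters Eh?\n.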
Other Languages
Many languages use a small subset, or none, of these problematic source characters, reducing the potential severity of the problem. The Pascal standard specifies (. and .) as alternative lexical representations of [ and ] respectively.
Common Implementations
Recognizing trigraph sequences entails a check against every character read in by the translator. Performance profiling of translators has shown that a large percentage of time is spent in the lexer. A study by Waite [1469] found 41% of total translation time was spent in a handcrafted lexer (with little code optimization performed by the translator). An automatically produced lexer (the lex tool was used) consumed 3 to 5 times as much time.
One vendor, Borland, who used to take pride in, and was known for, the speed at which their translators operated, did not include trigraph processing in the main translator program. A stand-alone utility was provided to perform trigraph processing. Those few programs that used trigraphs needed to be processed by this utility, generating a temporary file that was processed by the main translator program. While using this pre-preprocessor was a large overhead for programs that used trigraphs, performance was not degraded for source code that did not contain them.
Usage
There are insufficient trigraphs in the visible form of the .c files to enable any meaningful analysis of the usage of different trigraphs to be made.
234
No other trigraph sequences exist.
Commentary
The set of characters for which trigraphs were created to provide an alternative spelling are known, and
unlikely to be extended.
Coding Guidelines
Although no other trigraph sequences exist, sequences of two adjacent question marks in string literals may lead to confusion. Developers may be unsure about whether they represent a trigraph or not. Using the escape sequence \? on at least one of these question marks can help clarify the intent.
Example
char *unknown_trigraph = "??++";
char *cannot_be_trigraph = "?\? ";
Usage
The visible form of the .c files contained 593 (.h 10) instances of two question marks (i.e., ??) in string literals that were not followed by a character that would have created a trigraph sequence.
235
Each ? that does not begin one of the trigraphs listed above is not changed.
Commentary
Two ?s followed by any other character than those listed above is not a trigraph.
Common Implementations
No implementation is known to define any other sequence of ?s to be replaced by other characters.
Coding Guidelines
No other trigraph sequences are defined by the standard, have been notified for future addition to the standard, or used in known implementations. Placing restrictions on other uses of other sequences of ?s provides no benefit.
236
EXAMPLE 1
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
becomes
#define arraycheck(a,b) a[b] || b[a]
Commentary
This example was added by the response to DR #310 and is intended to show a common trigraph usage.
237
EXAMPLE 2 The following source line
printf("Eh???/n");
becomes (after replacement of the trigraph sequence ??/)
printf("Eh?\n");
Commentary
This illustrates the sometimes surprising consequences of trigraph processing.
5.2.1.2 Multibyte characters
238
The source character set may contain multibyte characters, used to represent members of the extended character set.
Commentary
The mapping from physical source file multibyte characters to the source character set occurs in translation phase 1. Whether multibyte characters are mapped to UCNs, single characters (if possible), or remain as multibyte characters depends on the model used by the implementation.
[See also: 60 multibyte character; 116 translation phase 1; 115 UCN, models of.]
C++
The representations used for multibyte characters, in source code, invariably involve at least one character that is not in the basic source character set:
2.1p1
Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character.
The C++ Standard does not discuss the issue of a translator having to process multibyte characters during translation. However, implementations may choose to replace such characters with a corresponding universal-character-name.
Other Languages
Most programming languages do not contain the concept of multibyte characters.
Common Implementations
Support for multibyte characters in identifiers, using a shift state encoding, is sometimes seen as an extension. Support for multibyte characters in this context using UCNs is new in C99. The most common implementations have been created to support the various Japanese character sets.
[See also: 815 universal character name syntax.]
Coding Guidelines
The standard does not define how multibyte characters are to be represented. Any program that contains
them is dependent on a particular implementation to do the right thing. Converting programs that existed
before support for universal character names became available may not be economically viable.
Some coding guideline documents recommend against the use of characters that are not specified in the C
Standard. Simply prohibiting multibyte characters because they rely on implementation-defined behavior
ignores the cost/benefit issues applicable to the developers who need to read the source. These are complex
issues for which your author has insufficient experience with which to frame any applicable guideline
recommendations.
239
The execution character set may also contain multibyte characters, which need not have the same encoding
as for the source character set.
Commentary
Multibyte characters could be read from a file during program execution, or even created by assigning byte
values to contiguous array elements. These multibyte sequences could then be interpreted by various library
functions as representing certain (wide) characters.
The execution character set need not be fixed at translation time. A program’s locale can be changed at execution time (by a call to the setlocale function). Such a change of locale can alter how multibyte characters are interpreted by a library function.
C++
There is no explicit statement about such behavior being permitted in the C++ Standard. The C header <wchar.h> (specified in Amendment 1 to C90) is included by reference and so the support it defines for multibyte characters needs to be provided by C++ implementations.
Other Languages
Most languages do not include library functions for handling multibyte characters.
Coding Guidelines
Use of multibyte characters during program execution is an applications issue that is outside the scope of
these coding guidelines.
240
For both character sets, the following shall hold:
Commentary
This is a set of requirements that applies to an implementation. It is the minimum set of guaranteed
requirements that a program can rely on.
Coding Guidelines
The set of requirements listed in the following C-sentences is fairly general. Dealing with implementations
that do not meet the requirements listed in these sentences is outside the scope of these coding guidelines.
241
— The basic character set shall be present and each character shall be encoded as a single byte.
Commentary
This is a requirement on the implementation. It prevents an implementation from being purely multibyte-based. The members of the basic character set are guaranteed to always be available and fit in a byte.
[See also: 222 basic character set, fit in a byte.]
Common Implementations
An implementation that includes support for an extended character set might choose to define CHAR_BIT to be 16 (most of the commonly used characters in ISO 10646 are representable in 16 bits, each in UTF-16; at least those likely to be encountered outside of academic research and the traditional Chinese written in Hong Kong). Alternatively, an implementation may use an encoding where the members of the basic character set are representable in a byte, but some members of the extended character set require more than one byte for their encoding. One such representation is UTF-8.
[See also: 216 extended character set; 307 CHAR_BIT macro; 28 ISO 10646; 28 UTF-16; 28 UTF-8.]
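UTF-8 illustrates the second approach: code points below 0x80 occupy one byte, while all others occupy two to four bytes. The following sketch of an encoder assumes 8-bit bytes; the function name is an assumption introduced for illustration.

```c
#include <stddef.h>

/* Encodes code_point as UTF-8 into out, which must have room for
   four bytes. Returns the number of bytes written, or 0 if
   code_point is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long code_point, unsigned char out[4])
{
    if (code_point < 0x80) {            /* one byte */
        out[0] = (unsigned char)code_point;
        return 1;
    }
    if (code_point < 0x800) {           /* two bytes */
        out[0] = (unsigned char)(0xC0 | (code_point >> 6));
        out[1] = (unsigned char)(0x80 | (code_point & 0x3F));
        return 2;
    }
    if (code_point >= 0xD800 && code_point <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    if (code_point < 0x10000) {         /* three bytes */
        out[0] = (unsigned char)(0xE0 | (code_point >> 12));
        out[1] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code_point & 0x3F));
        return 3;
    }
    if (code_point <= 0x10FFFF) {       /* four bytes */
        out[0] = (unsigned char)(0xF0 | (code_point >> 18));
        out[1] = (unsigned char)(0x80 | ((code_point >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((code_point >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (code_point & 0x3F));
        return 4;
    }
    return 0;
}
```

Where the basic character set is encoded as Ascii, every member has a code point below 0x80 and so takes the one-byte branch, which is consistent with the single-byte requirement above.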
242
— The presence, meaning, and representation of any additional members is locale-specific.
Commentary
On program startup the execution locale is the "C" locale. During execution it can be set under program control. The standard is silent on what the translation time locale might be.
Common Implementations
The full Ascii character set is used by a large number of implementations.
Coding Guidelines
It often comes as a surprise to developers to learn what characters the C Standard does not require to be provided by an implementation. Source code readability could be affected if any of these additional members appear within comments and cannot be meaningfully displayed. Balancing the benefits of using additional members against the likelihood of not being able to display them is a management issue.
The use of any additional members during the execution of a program will be driven by the user requirements of the application. This issue is outside the scope of these coding guidelines.
243
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.
Commentary
State-dependent encodings are essentially finite state machines. When a state-dependent encoding, or any multibyte encoding, is being used the number of characters in a string literal need not be the same as the number of bytes encountered before the null character. There is no requirement that the sequence of shift states and characters representing an extended character be unique.
[See also: 215 extended characters.]
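The difference between byte count and character count can be made concrete with the restartable conversion functions in <wchar.h>. The sketch below counts characters according to the LC_CTYPE category of the current locale; the function name is an assumption, and the result depends on which locale the program has selected.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the multibyte characters in the null-terminated string s,
   as interpreted in the current locale. Returns (size_t)-1 if s
   contains an invalid or incomplete multibyte sequence. */
size_t count_multibyte_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);   /* number of bytes, not characters */
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* the initial shift state */
    while (remaining > 0) {
        size_t used = mbrlen(s, remaining, &state);

        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;
        if (used == 0)
            break;                      /* null character reached */
        s += used;
        remaining -= used;
        count++;
    }
    return count;
}
```

In the "C" locale, and for strings containing only members of the basic character set, the result equals strlen(s); in a locale that uses a multibyte encoding it can be smaller.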
There are situations where the visual appearance of two or more characters is considered to be a single character. For instance (using ISO 10646 as the example encoding), the two characters LATIN SMALL LETTER O (U+006F) followed by COMBINING CIRCUMFLEX ACCENT (U+0302) represent the grapheme cluster (the ISO 10646 term [334] for what might be considered a user character) ô, not the two characters o ^. Some languages use grapheme clusters that require more than one combining character, for instance ô̄ (ô with an additional macron). Unicode (not ISO 10646) defines a canonical accent ordering to handle sequences of these combining characters. The so-called combining characters are defined to combine with the character that comes immediately before them in the character stream. For backwards compatibility with other character encodings, and ease of conversion, the ISO 10646 Standard provides explicit codes for some accent characters; for instance, LATIN SMALL LETTER O WITH CIRCUMFLEX (U+00F4) also denotes ô.
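The relationship between code points and user characters can be sketched by counting a base character together with any combining characters that follow it as one unit. This is a deliberate simplification: only the Combining Diacritical Marks block (U+0300 to U+036F) is recognized, whereas a full implementation would follow the Unicode segmentation rules. The function names are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* True for code points in the Combining Diacritical Marks block.
   Combining characters in other blocks are ignored by this sketch. */
bool is_combining_mark(unsigned long code_point)
{
    return code_point >= 0x0300 && code_point <= 0x036F;
}

/* Counts user characters in code_points[0..len): each code point
   that is not a combining mark starts a new user character, and a
   combining mark attaches to the character immediately before it. */
size_t count_user_characters(const unsigned long *code_points, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if (i == 0 || !is_combining_mark(code_points[i]))
            count++;
    }
    return count;
}
```

Under this sketch the sequence U+006F U+0302 and the single code point U+00F4 both count as one user character, while U+006F followed by U+005E (the spacing circumflex) counts as two.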
A character that is capable of standing alone, the o above, is known as a base character. A character that modifies a base character, the ^ above, is known as a combining character (the visible forms of some combining characters are called diacritic characters). Most character encodings do not contain any combining characters, and those that do contain them rarely specify whether they should occur before or after the modified base character. Claims that a particular standard requires the combining character to occur before the base character it modifies may be based on a misunderstanding. For instance, ISO/IEC 6937 specifies a single-byte encoding for base characters and a double-byte encoding for some visual combinations of (diacritic + base) Latin letter. These double-byte encodings are precomposed in the sense that they represent a single character; there is no single-byte encoding for the diacritic character, and the representation of the second byte happens to be the same as that of the single-byte representation of the corresponding base character (e.g., 0xC14F represents LATIN CAPITAL LETTER O WITH GRAVE and 0xC16F represents LATIN SMALL LETTER O WITH GRAVE).
C90
The C90 Standard specified implementation-defined shift states rather than locale-specific shift states.
C++
The definition of multibyte character, 1.3.8, says nothing about encoding issues (other than that more than
one byte may be used). The definition of multibyte strings, 17.3.2.1.3.2, requires the multibyte characters to
begin and end in the initial shift state.
Common Implementations
Most methods for state-dependent encoding are based on ISO/IEC 2022:1994 (identical to the standard ECMA-35 “Character Code Structure and Extension Techniques”, freely available from their Web site, http://www.ecma.ch). This uses a different structure than that specified in ISO/IEC 10646–1. The encoding method defined by ISO 2022 supports both 7-bit and 8-bit codes. It divides these codes up into control characters (known as C0 and C1) and graphics characters (known as G0, G1, G2, and G3). In the initial shift state the C0 and G0 characters are in effect.
Table 243.1: Commonly seen ISO 2022 Control Characters. The alternative values for SS2 and SS3 are only available for 8-bit codes.

Name             Acronym  Code Value         Meaning
Escape           ESC      0x1b               Escape
Shift-In         SI       0x0f               Shift to the G0 set
Shift-Out        SO       0x0e               Shift to the G1 set
Locking-Shift 2  LS2      ESC 0x6e           Shift to the G2 set
Locking-Shift 3  LS3      ESC 0x6f           Shift to the G3 set
Single-Shift 2   SS2      ESC 0x4e, or 0x8e  Next character only is in G2
Single-Shift 3   SS3      ESC 0x4f, or 0x8f  Next character only is in G3
Some of the control codes and their values are listed in Table 243.1. The codes SI, SO, LS2, and LS3 are
known as locking shifts. They cause a change of state that lasts until the next control code is encountered. A
stream that uses locking shifts is said to use stateful encoding.
ISO 2022 specifies an encoding method: it does not specify what the values within the range used for graphic characters represent. This role is filled by other standards, such as ISO 8859. A C implementation that supports a state-dependent encoding chooses which character sets are available in each state that it supports (the C Standard only defines the character set for the initial shift state).
[See also: 24 ISO 8859.]
Table 243.2: An implementation where G1 is ISO 8859–1, and G2 is ISO 8859–7 (Greek).

Encoded values     0x61  0x62  0x63  0x0e  0xe6  0x1b 0x6e  0xe1  0xe2  0xe3  0x0f
Control character                    SO          LS2                          SI
Graphic character  a     b     c           æ                α     β     γ
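Treating the shift codes as transitions of a finite state machine, a decoder only has to remember which G set is currently in effect. The sketch below handles just the control codes listed in Table 243.1 and ignores every other escape sequence; it is a simplified illustration, not a conforming ISO 2022 decoder, and all names are assumptions.

```c
#include <stddef.h>

enum { CODE_SO = 0x0e, CODE_SI = 0x0f, CODE_ESC = 0x1b };

/* For each graphic byte in bytes[0..len), stores in sets[] the number
   (0-3) of the G set in effect for it. sets must have room for len
   entries. Returns the number of graphic bytes found. */
size_t classify_graphic_bytes(const unsigned char *bytes, size_t len,
                              int *sets)
{
    int locked = 0;     /* initial shift state: G0 in effect */
    int single = -1;    /* G set for the next character only, or -1 */
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = bytes[i];

        if (b == CODE_SI) {
            locked = 0;
        } else if (b == CODE_SO) {
            locked = 1;
        } else if (b == 0x8e) {         /* 8-bit form of SS2 */
            single = 2;
        } else if (b == 0x8f) {         /* 8-bit form of SS3 */
            single = 3;
        } else if (b == CODE_ESC) {
            if (i + 1 < len) {
                switch (bytes[++i]) {
                case 0x6e: locked = 2; break;   /* LS2 */
                case 0x6f: locked = 3; break;   /* LS3 */
                case 0x4e: single = 2; break;   /* SS2 */
                case 0x4f: single = 3; break;   /* SS3 */
                default:   break;   /* other sequences are ignored */
                }
            }
        } else {
            sets[count++] = (single >= 0) ? single : locked;
            single = -1;    /* a single shift covers one character */
        }
    }
    return count;
}
```

A locking shift changes the state for all following graphic bytes, whereas a single shift is consumed by the next graphic byte, after which the locked state applies again.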
Having to rely on implicit knowledge of what character set is intended to be used for G1, G2, and so on, is
not always satisfactory. A method of specifying the character sets in the sequence of bytes is needed. The