    while (current != null)
    {
        if (current.Value.AtImminentRiskOfDeath)
        {
            current = current.Next;
        }
        else
        {
            break;
        }
    }
    if (current == null)
    {
        waitingPatients.AddLast(newPatient);
    }
    else
    {
        waitingPatients.AddBefore(current, newPatient);
    }

This code adds the new patient after all those patients in the queue whose lives appear to be at immediate risk, but ahead of all other patients; the patient is presumably either quite unwell or a generous hospital benefactor. (Real triage is a little more complex, of course, but you still insert items into the list in the same way, no matter how you go about choosing the insertion point.)

Note the use of LinkedListNode<T>: this is how LinkedList<T> presents the queue’s contents. It allows us not only to see the item in the queue, but also to navigate back and forth through the queue with the Next and Previous properties.

Stacks

Whereas Queue<T> operates a FIFO order, Stack<T> operates a last in, first out (LIFO) order. Looking at this from a queuing perspective, it seems like the height of unfairness: latecomers get priority over those who arrived early. However, there are some situations in which this topsy-turvy ordering can make sense.

A performance characteristic of most computers is that they tend to be able to work faster with data they’ve processed recently than with data they’ve not touched lately. CPUs have caches that provide faster access to data than a computer’s main memory can support, and these caches typically operate a policy where recently used data is more likely to stay in the cache than data that has not been touched recently.

If you’re writing a server-side application, you may consider throughput to be more important than fairness: the total rate at which you process work may matter more than how long any individual work item takes to complete.
In this case, a LIFO order may make the most sense: work items that were only just put into a queue are much more likely to still live in the CPU’s cache than those that were queued up ages ago, and so you’ll get better throughput during high loads if you process newly arrived items first. Items that have sat in the queue for longer will just have to wait for a lull.

Like Queue<T>, Stack<T> offers a method to add an item, and one to remove it. It calls these Push and Pop, respectively. They are very similar to the queue’s Enqueue and Dequeue, except they both work off the same end of the list. (You could get the same effect using a LinkedList<T>, and always calling AddFirst and RemoveFirst.)

A stack could also be useful for managing navigation history. The Back button in a browser works in LIFO order: the first page it shows you is the last one you visited. (And if you want a Forward button, you could define a second stack: each time the user goes Back, Push the current page onto the Forward stack. Then if the user clicks Forward, Pop a page from the Forward stack, and Push the current page onto the Back stack.)

Summary

The .NET Framework class library provides various useful collection classes. We saw List<T> in an earlier chapter, which provides a simple resizable linear list of items. Dictionaries store entries by associating them with keys, providing fast key-based lookup. HashSet<T> and SortedSet<T> manage sets of unique items, with optional ordering. Queues, linked lists, and stacks each manage a queue of items, offering various strategies for how the order of addition relates to the order in which items come out of the queue.

CHAPTER 10
Strings

Chapter 10 is all about strings. A bit late, you might think: we’ve had about nine chapters of string-based action already! Well, yes, you’d be right.
That’s not terribly surprising, though: text is probably the single most important means an application has of communicating with its users. That is especially true as we haven’t introduced any graphical frameworks yet. I suppose we could have beeped the system speaker in Morse, although even that can be considered a text-based operation.

Even with a graphical UI framework where we have pictures and buttons and graphs and sounds, they almost always have textual labels, descriptions, comments, or tool tips. Users who have difficulty reading (perhaps because they have a low-vision condition) may have that text transformed into sound by accessibility tools, but the application is still processing text strings under the covers.

Even when we are dealing with integers or doubles internally within an algorithm, there comes a time when we need to represent them to humans, and preferably in a way that is meaningful to us. We usually do that (at least in part) by converting them into strings of one form or another.

Strings are surprisingly complex and sophisticated entities, so we’re going to take some time to explore their properties in this chapter.

First, we’ll look at what we’re really doing when we initialize a literal string. Then, we’ll see a couple of techniques which let us convert from other types to a string representation and how we can control the formatting of that conversion.

Next, we’ll look at various different techniques we can use to process a string. This will include composition, splitting, searching and replacing content, and what it means to compare strings of various kinds.

Finally, we will look at how .NET represents strings internally, how that differs from other representations in popular use in the world, and how we can convert between those representations by using an Encoding.

What Is a String?

A string is an ordered sequence of characters:

    We could consider this sentence to be a string.

We start with the first character, which is W.
Then we continue on in order from left to right:

    'W', 'e', ' ', 'c', 'o', 'u', 'l', 'd'

And so on. A string doesn’t have to be a whole sentence, of course, or even anything meaningful. Any ordered sequence of characters is a string. Notice that each character might be an uppercase letter, lowercase letter, space, punctuation mark, number (or, in fact, any other textual symbol). It doesn’t even have to be an English letter. It could be Arabic, for example:

    [Arabic example string]

Here we have the following characters:

    [the individual characters of the Arabic string]

If you look carefully, you’ll notice that the string is ordered the other way round: the first character is the rightmost one, and the last character is the leftmost one. This is because Arabic scripts read right to left and not left to right; but the string is still ordered, character by character.

A quick reminder: a font is a particular visual design for an entire set of characters. Historically, it was a box containing a set of moveable type in a specific design at a certain size, but we’ve come to blur the meanings of font family, typeface, and font in popular usage, and people tend to use these terms interchangeably now. I think it is interesting to note that only a few years ago, fonts were the sole purview of designers and printers; but they’ve now become commonplace, thanks to the ubiquity of the word processor. Just in case you have been on the moon since 1968, here are three examples taken from different fonts:

    [Figure: three examples taken from different fonts]

You’ll also notice that the “joined up” cursive form of the characters is visually quite different from their form when separated out individually. This is normal; the ultimate visual representation of the character in the string is entirely separate from the string itself.
We’re just so used to the characters of our own language that we don’t tend to think of them as abstract symbols, and we tend to put any visual differences down to the choice of font or other typographical niceties when we are interpreting them.

We could happily design a font where the character e looks like Q and the character f looks like A. All our text processing would continue as normal: searching and sorting would be just fine (words starting with f wouldn’t start appearing in the dictionary before words starting with e), because the data in the string is unchanged; but when we drew it on the screen, it would look more than a bit confusing.*

* In fact, I don’t think that this particular typeface would catch on.

The take-home point is that there are a bunch of layers between the .NET runtime’s representation of a string as data in memory, and its final visual appearance on a screen, in a file, or in another application (such as notepad.exe, for example). As we go through this chapter, we’ll unpick those layers as we come across them, and point out some of the common pitfalls.

Let’s get on and see how the .NET Framework presents a string to us.

The String and Char Types

It will come as no surprise that the .NET Framework provides us with two types that correspond with strings and characters: String and Char. In fact, as we’ve seen before, these are such important types that C# even provides us with keywords that correspond to the underlying types: string and char.

String needs to provide us with that “ordered sequence of characters” behavior. It does so by implementing IEnumerable<char>, as Example 10-1 illustrates.

Example 10-1. Iterating through the characters in a string

    string myString = "I've gone all vertical.";

    foreach (char theCharacter in myString)
    {
        Console.WriteLine(theCharacter);
    }

If you create a console application for this code, you’ll see output like this when it runs:

    I
    '
    v
    e
     
    g
    o
    n
    e
     
    a
    l
    l
     
    v
    e
    r
    t
    i
    c
    a
    l
    .

What exactly does that code do? First, it initializes a variable called myString which we will use to hold the reference to our string object (because String is a reference type). We then enumerate the string, yielding every Char in turn, and we output each Char to the console on its own separate line. Char is a value type, so we’re actually getting a copy of the character from the string itself.

The string object is created using a literal string, that is, a sequence of characters enclosed in double quotes:

    "I've gone all vertical."

We’re already quite familiar with initializing a string with a literal (we probably do it without a second thought), but let’s have a look at these literals in a little more detail.

Literal Strings and Chars

The simplest literal string is a set of characters enclosed in double quotes, shown in the first line of Example 10-2.

Example 10-2. A string literal

    string myString = "Literal string";
    Console.WriteLine(myString);

This produces the output:

    Literal string

You can also initialize a string from a char[], using the appropriate constructor. One way to obtain a char array is by using char literals. A char literal is a single character, wrapped in single quotes. Example 10-3 constructs a string this way.

Example 10-3. Initializing a string from char literals

    string myString = new string(new [] { 'H', 'e', 'l', 'l', 'o', ' ', '"', 'w', 'o', 'r', 'l', 'd', '"' });
    Console.WriteLine(myString);

If you compile and run this, you’ll see the following output:

    Hello "world"

Notice that we’ve got double-quote marks in our output. That was easy to achieve with this char[], because the delimiter for an individual character is the single quote; but how could we include double quotes in the string, without resorting to a literal char array? Equally, how could we specify the single-quote character as a literal char?
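
Before answering those questions, here is a minimal sketch of the char[] constructor at work. It is an illustration rather than part of the chapter’s numbered examples, and the class and method names (CharArrayStringSketch, FromChars) are made up for it. It shows that the string keeps its own copy of the characters: changing the array afterward leaves the string untouched.

```csharp
using System;

public static class CharArrayStringSketch
{
    // Hypothetical helper: builds a string with the string(char[]) constructor.
    public static string FromChars(char[] letters)
    {
        return new string(letters);
    }

    public static void Main()
    {
        char[] letters = { 'H', 'e', 'l', 'l', 'o' };
        string greeting = FromChars(letters);

        // The constructor copied the characters, so modifying the
        // array now has no effect on the string we already built.
        letters[0] = 'J';

        Console.WriteLine(greeting);             // Hello
        Console.WriteLine(new string(letters));  // Jello
        Console.WriteLine(greeting == "Hello");  // True
    }
}
```

The comparison on the last line works because == on strings compares their contents, so a string built from a char[] is equal to the matching literal.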

Escaping Special Characters

The way to deal with troublesome characters in string and char literals is to escape them with the backslash character. That means that you precede the quote with a \, and the compiler interprets the quote as part of the string, rather than the end of it. Like this:†

    "Literal \"string\""

† We’ll just show the string literal from here on, rather than repeating the boilerplate code each time. Just replace the string initializer with the example.

If you build and run with this change, you’ll see the output, with quotes in place:

    Literal "string"

There are several other special characters that you can escape in this way. You can find some common ones listed in Table 10-1.

Table 10-1. Common escaped characters for string literals

    Escaped character    Purpose
    \"                   Include a double quote in a string literal.
    \'                   Include a single quote in a char literal.
    \\                   Insert a backslash.
    \n                   New line.
    \r                   Carriage return.
    \t                   Tab.

There are also some rather uncommon ones, listed in Table 10-2. In general, you don’t need to worry about them, but they are quite interesting.

Table 10-2. Less common escape characters for string literals

    Escaped character    Purpose

    \0    The character represented by the char with value zero (not the character '0').

    \a    Alert or “Bell”. Back in the dim and distant past, terminals didn’t really have sound, so you couldn’t play a great big .wav file beautifully designed by Robert Fripp every time you wanted to alert the user to the fact that he had done something a bit wrong. Instead, you sent this character to the console, and it beeped at you, or even dinged a real bell (like the line-end on a manual typewriter). It still works today, and on some PCs there’s still a separate speaker just for making this old-school beep. Try it, but be prepared for unexpected retro-side effects like growing enormous sideburns and developing an obsession with disco.

    \b    Backspace. Yes, you can include backspaces in your string. Write:

              "Hello world\b\b\b\b\bdolly"

          to the console, and you’ll see:

              Hello dolly

          Not all rendering engines support this character, though. You can see the same string rendered in a WPF application in Figure 10-1. Notice how the backspace characters have been ignored. Remember: output mechanisms can interpret individual characters differently, even though they’re the same character, in the same string.

    \f    Form feed. Another special character from yesteryear. This used to push a whole page worth of paper through the printer. This is somewhat less than useful now, though. Even the console doesn’t do what you’d expect. If you write:

              "Hello\fworld"

          to the console, you’ll see something like:

              Hello♀world

          Yes, that is the symbol for “female” in the middle there. That’s because the original IBM PC defined a special character mapping so that it could use some of these characters to produce graphical symbols (like male, female, heart, club, diamond, and spade) that weren’t part of the regular character set. These mappings are sometimes called code pages, and the default code page for the console (at least for U.S. English systems) incorporates those original IBM definitions. We’ll talk more about code pages and encodings later.

    \v    Vertical tab. This one looks like a “male” symbol (♂) in the console’s IBM-emulating code page.

The first character in Table 10-2 is worth a little attention: character value 0, sometimes also referred to as the null character, although it’s not the same as a null reference; char is a value type, so it’s more like the char equivalent of the number 0. In a lot of programming systems, this character is used to mark the end of a string; C and C++ use this convention, as do many Windows APIs. However, in .NET, and therefore in C#, string objects contain the length as a separate field, and so you’re free to put null characters in your strings if you want.
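
As a small sketch of that last point (not one of the chapter’s numbered examples, and with a made-up class name, NullCharSketch), the following code embeds a null character in the middle of a string and shows that .NET still counts every character on either side of it.

```csharp
using System;

public static class NullCharSketch
{
    // Hypothetical helper: returns a string with a null character in the middle.
    public static string WithEmbeddedNull()
    {
        return "abc\0def";
    }

    public static void Main()
    {
        string text = WithEmbeddedNull();

        // The length covers all seven characters, including the null.
        Console.WriteLine(text.Length);         // 7

        // The null character has the numeric value 0...
        Console.WriteLine((int)text[3]);        // 0

        // ...which is not the same thing as the digit character '0'.
        Console.WriteLine((int)'0');            // 48

        // The null sits at index 3; the rest of the string follows it.
        Console.WriteLine(text.IndexOf('\0'));  // 3
    }
}
```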
However, you may need to be careful: if those strings end up being passed to Windows APIs, it’s possible that Windows will ignore everything after the first null.

There’s one more escape form that’s a little different from all the others, because you can use it to escape any character. This escape sequence begins with \u and is then followed by four hexadecimal digits, letting you specify the exact numeric value for a character. How can a textual character have a numeric value? Well, we’ll get into that in detail in the “Encoding Characters” section, but roughly speaking, each possible character can be identified by number. For example, the uppercase letter A has the number 65, B is 66, and so on. In hexadecimal, those are 41 and 42, respectively. So we can write this string:

    "\u0041\u0042\u0043"

which is equivalent to:

    "ABC"

Of course, if that’s the string you want, you’d normally just write that second form. The \u escape sequence is more useful when you need a particular character that’s not on your keyboard. For example, \u00A9 is the copyright symbol: ©.

Sometimes you’ll have a block of text that includes a lot of these special characters (like carriage returns, for instance) and you want to just paste it out of some other application straight into your code as a literal string without having to add lots of backslashes.

While it can be done, you might question the wisdom of large quantities of text in your C# source files. You might want to store the text in a separate resource file, and load it up on demand.

If you prefix the opening double-quote mark with the @ symbol, the compiler will then interpret every subsequent character (including any whitespace such as newlines and tabs) as part of the string, until it sees a matching double-quote mark to close the string. Example 10-4 exploits this to embed new lines and indentation in a string literal.

Figure 10-1. WPF ignoring control characters

Example 10-4. Avoiding backslashes with @-quoting

    string multiLineString = @"Lots of lines
        and tabs!";
    Console.WriteLine(multiLineString);

This code will produce the following output:

    Lots of lines
        and tabs!

Notice how it respects the whitespace between the double quotes.

The @ prefix can be especially useful for literal file paths. You don’t need to escape all those backslashes. So instead of writing "C:\\some\\path" you can write just @"C:\some\path".

Formatting Data for Output

So, we know how to initialize literal strings, which is terribly useful; but what about our other data? How do we display an Int32 or DateTime or whatever?

We’ve already met one way of converting any object to a string: the virtual ToString method, which Example 10-5 uses.

Example 10-5. Converting numbers to strings with ToString

    int myValue = 45;
    string myString = myValue.ToString();
    Console.WriteLine(myString);

This will produce the output you might expect:

    45

What if we try a decimal? Example 10-6 shows this.

Example 10-6. Calling ToString on a decimal

    decimal myValue = 45.65M;
    string myString = myValue.ToString();
    Console.WriteLine(myString);

Again, we get the expected output:

    45.65

OK, what if we have some decimals in something like an accounting ledger, and we want to format them all to line up properly, with a preceding dollar sign?

[...]

... with this notation. For example, 1.05 × 10³ represents the number 1050, and 1.05 × 10⁻³ represents the number 0.00105.

Developers use plain text editors, which don’t support formatting such as superscript, so there’s a convention for representing exponential numbers with plain, unformatted text. We can write those last two examples as 1.05E+003 and 1.05E-003, respectively. C# recognizes this convention for...
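
As a rough illustration of the plain-text exponent convention, the sketch below writes the two values from the text as C# floating-point literals and also parses one from a string. The class and method names (ExponentNotationSketch, Big, Small) are invented for this sketch, and the invariant culture is used so that the output does not depend on regional settings.

```csharp
using System;
using System.Globalization;

public static class ExponentNotationSketch
{
    // Hypothetical helpers returning the two values discussed in the text,
    // written as literals in exponent notation.
    public static double Big()   { return 1.05E+003; }
    public static double Small() { return 1.05E-003; }

    public static void Main()
    {
        CultureInfo invariant = CultureInfo.InvariantCulture;

        Console.WriteLine(Big().ToString(invariant));    // 1050
        Console.WriteLine(Small().ToString(invariant));  // 0.00105

        // The same notation is accepted when parsing text into a double.
        double parsed = double.Parse("1.05E+003", NumberStyles.Float, invariant);
        Console.WriteLine(parsed == Big());              // True
    }
}
```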
    ... amount = 1654539;
    string text = amount.ToString("D9");

We’re asking for nine digits in the output string, and it pads with leading zeros:

    001654539

If you don’t supply a qualifying number of decimal digits, as Example 10-10 shows, it just uses as many as necessary.

Example 10-10. Decimal format with unspecified precision

    int amount = -2895729;
    string text = amount.ToString("D");

This produces:

    -2895729

Hexadecimal...

... format

    double amount = 152.68385485;
    string text = amount.ToString("F4");

This produces:

    152.6839

The output will be padded with trailing zeros if necessary. Example 10-16 causes this by asking for four digits where only two are required.

Example 10-16. Fixed-point format causing trailing zeros

    double amount = 152.68;
    string text = amount.ToString("F4");

So, the output in this case is:

    152.6800

General

Sometimes...

... 10-55 shows one way to do this.

Example 10-55. Concatenating strings

    string fragment1 = "To be, ";
    string fragment2 = "or not to be.";
    string composedString = fragment1 + fragment2;
    Console.WriteLine(composedString);

Here, we’ve used the + operator to concatenate two strings. The C# compiler turns this into a call to the String class’s static method Concat, so Example 10-56 shows...

... the number down. Look at Example 10-21.

Example 10-21. Custom numeric formats

    double value = 12.3456;
    Console.WriteLine(value.ToString("00.######"));
    value = 1.23456;
    Console.WriteLine(value.ToString("00.000000"));
    Console.WriteLine(value.ToString("##.000000"));

We see the following output:

    12.3456
    01.234560
    1.234560

You don’t actually have to put all the # symbols you require before the decimal place; a...
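
To pull the format strings from these fragments together, here is a minimal sketch that runs the same values through the D, F, and custom formats. The class and its Format helper are hypothetical, introduced only for this illustration, and the helper passes CultureInfo.InvariantCulture explicitly, which is an assumption made so that the results do not vary with the machine’s regional settings.

```csharp
using System;
using System.Globalization;

public static class NumericFormatSketch
{
    // Hypothetical helper: formats any IFormattable value (int, double,
    // decimal, and so on) with the invariant culture.
    public static string Format(IFormattable value, string format)
    {
        return value.ToString(format, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Decimal (integer) format: pads with leading zeros to nine digits.
        Console.WriteLine(Format(1654539, "D9"));         // 001654539
        Console.WriteLine(Format(-2895729, "D"));         // -2895729

        // Fixed-point format: rounds, or pads with trailing zeros.
        Console.WriteLine(Format(152.68385485, "F4"));    // 152.6839
        Console.WriteLine(Format(152.68, "F4"));          // 152.6800

        // Custom formats: 0 always emits a digit, # only when one is needed.
        Console.WriteLine(Format(12.3456, "00.######"));  // 12.3456
        Console.WriteLine(Format(1.23456, "00.000000"));  // 01.234560
        Console.WriteLine(Format(1.23456, "##.000000"));  // 1.234560
    }
}
```

Funneling everything through IFormattable keeps the sketch short; in ordinary code you would more likely call ToString directly on the value, as the numbered examples do.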
    ... time = new DateTime(2001, 12, 24, 13, 14, 15, 16);
    Console.WriteLine(time.ToString("t"));
    Console.WriteLine(time.ToShortTimeString());
    Console.WriteLine(time.ToString("T"));
    Console.WriteLine(time.ToLongTimeString());

This will result in:

    13:14
    13:14
    13:14:15
    13:14:15

Or, as Example 10-35 shows, you can combine the two.

Example 10-35. Getting both the time and date

    DateTime...

...

    Val1: 32, Val2: 123.457, Val3: 01/11/1999 17:22:25

A specific format item can be referenced multiple times, and in any order in the format string. You can also apply the standard and custom formatting we discussed earlier to any of the individual format items. Example 10-45 shows that in action.

Example 10-45. Using format strings from String.Format

    int first = 32;
    double second = 123.457;
    DateTime third...

...

    ... = 254.23875839484;
    string text = amount.ToString("E");

This produces:

    2.542388E+002

Fixed point

Another format string that applies to all numeric types, the fixed-point format provides the ability to display a number with a specific number of digits after the decimal point. As usual, it rounds the result, rather than truncating. Example 10-15 asks for four digits after the decimal point.

Example 10-15...

... precision

Example 10-13. Exponential format

    double amount = 254.23875839484;
    string text = amount.ToString("E4");

And here’s the string it produces:

    2.5424E+002

If you don’t provide a precision specifier, as in Example 10-14, you get six digits to the right of the decimal point (padded with trailing zeros where needed).

We’ll see later how these defaults can be controlled...

... an overload of the standard ToString method.

Example 10-7. Currency format

    decimal dollarAmount = 123165.4539M;
    string text = dollarAmount.ToString("C");
    Console.WriteLine(text);

The capital C indicates that we want the decimal formatted as if it were a currency value; and here’s the output:

    $123,165.45

Notice how it has rounded to two decimal places (rounding down in this case), added a comma to group...

    ... amount.ToString("G4");
    Console.WriteLine(text);
    double amount2 = 0.00000000000015268;
    text = amount2.ToString("G4");
    Console.WriteLine(text);

This will produce the following output:

    152.7
    1.527E-13

Note...
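
The currency, exponential, and general formats in these fragments can be tried side by side with a sketch like the one below. The class and method names are hypothetical. One assumption deserves a loud label: the dollar output shown earlier depends on U.S. English regional settings, so rather than rely on the machine’s current culture, this sketch builds a NumberFormatInfo from the invariant culture and sets its currency symbol to "$".

```csharp
using System;
using System.Globalization;

public static class StandardFormatSketch
{
    // Assumption: emulate U.S.-style currency output by cloning the
    // invariant number format and swapping in a dollar sign.
    public static string Currency(decimal amount)
    {
        NumberFormatInfo dollars =
            (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        dollars.CurrencySymbol = "$";
        return amount.ToString("C", dollars);
    }

    // Hypothetical helper: applies a standard format with the invariant culture.
    public static string Invariant(double amount, string format)
    {
        return amount.ToString(format, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Currency: two decimal places, with a comma grouping the thousands.
        Console.WriteLine(Currency(123165.4539M));                   // $123,165.45

        // Exponential: six digits by default, or as many as requested.
        Console.WriteLine(Invariant(254.23875839484, "E"));          // 2.542388E+002
        Console.WriteLine(Invariant(254.23875839484, "E4"));         // 2.5424E+002

        // General: four significant digits, switching to exponent
        // notation when the number is very small.
        Console.WriteLine(Invariant(152.68, "G4"));                  // 152.7
        Console.WriteLine(Invariant(0.00000000000015268, "G4"));     // 1.527E-13
    }
}
```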