Pointers on C

Some Notes On the Character Classification Macros

Introduction

The character classification macros (defined in ctype.h) provide a portable mechanism for determining whether a character is alphabetic, numeric, punctuation, white space, etc. The mechanism is portable in that it does not require comparison against specific character constants, which would be valid on other machines only if they used the same character set. Consider this simple example:

if( ch >= 'A' && ch <= 'Z' ) ...

The intent is to determine whether the character contained in the variable ch is an upper-case alphabetic character. This test works fine on computers whose character set is ASCII or Extended ASCII ¹, but fails on computers that use the EBCDIC character set (which, admittedly, are far less common these days than they were many years ago). In contrast, the test:

if( isupper( ch ) ) ...

will work regardless of the character set used by a particular machine because the macro is implemented on each machine to work properly with its character set.

Implementation

These macros are often implemented with a table-lookup mechanism. A table is created that contains an entry for every character value, and each entry contains flags to indicate whether the corresponding character is numeric, alphabetic, upper case, lower case, etc. This table is automatically linked with the program when the macros are used. The macro merely uses its argument as a subscript to access an entry from this table and tests whether that entry contains the flag for the desired characteristic.
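
The details differ from one library to another, but a sketch along the following lines shows the idea. The table name _chartypes and the flag names are illustrative only (as in the expansion shown below), not taken from any particular implementation:

/* Sketch of what a ctype.h-style header and its table might look like.
   The names and flag values here are hypothetical. */
#define _CTYPE_UPPER  0x01      /* upper-case alphabetic */
#define _CTYPE_LOWER  0x02      /* lower-case alphabetic */
#define _CTYPE_DIGIT  0x04      /* decimal digit */
#define _CTYPE_SPACE  0x08      /* white space */

/* One entry for every character value; each entry holds the flags that
   describe the corresponding character.  The table itself is supplied by
   the library and linked in automatically when the macros are used. */
extern const unsigned char _chartypes[ 256 ];

#define isupper(ch)  ( _chartypes[ ch ] & _CTYPE_UPPER )
#define isdigit(ch)  ( _chartypes[ ch ] & _CTYPE_DIGIT )
#define isspace(ch)  ( _chartypes[ ch ] & _CTYPE_SPACE )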

The Problem

I will describe the problem in the context of computers with 8-bit bytes, as they are the most common these days. This table-lookup scheme generally worked fine in the days of ASCII when the values assigned to characters did not exceed 127. If the high-order bit of the byte was used at all, it was often used as a parity bit for the detection of errors when characters were transmitted over communications lines that were not as reliable as today's. Once the characters were received and checked for errors, the parity bit was replaced with a 0.

The problem began with the expansion of the character set to include numerous additional characters, for example currency symbols (such as ¢, £, ¥), punctuation (such as ¡ and ¿), special characters (such as ©, ±, and ½), and--most notably--characters that exist in many languages but not in English (such as à, á, æ, etc.). To represent all these additional characters without omitting any of the standard ASCII characters, the values from 128 through 255 are used and the resulting character set is commonly referred to as Extended ASCII ².

Now consider three traditional aspects of the C language:

  1. Characters are promoted to integers before being used for any type of arithmetic.
  2. The default character type, either signed or unsigned, is not specified by the Standard so that the implementer can choose whichever is most efficient for a particular machine.
  3. There is no range checking on array subscripts.

The combination of these three characteristics makes it risky to use the character classification macros with Extended ASCII characters. Consider the expression isupper(ch) where the variable ch contains the value 0xc2 (which is the character Â):

  1. The macro evaluates to an expression something like this: ( _chartypes[ ch ] & _CTYPE_UPPER )
  2. Because it is used in an expression, the value ch is first promoted to an integer.
  3. On a 32-bit machine whose default character type is signed, the result of this promotion will be 0xffffffc2, or -62, due to sign extension.
  4. The machine now accesses _chartypes[ -62 ] to see whether it contains the flag _CTYPE_UPPER.
  5. Because there is no range checking on subscripts, the access is attempted.
  6. If you are extremely lucky, the access is to a memory location outside of the bounds of the program, and the program stops with an error.
  7. However, it is much more likely that the access will succeed and produce an indeterminate result.

Note that the implementer will have constructed a classification table containing the proper flags in each of its 256 entries, one for each Extended ASCII character. The test nevertheless fails, because the memory location that is checked lies outside of this table.
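
The sign extension in step 3 is easy to observe directly. This short test program is only a sketch; the printed values assume a 32-bit int, a signed default character type, and a Latin-1 style Extended ASCII character set:

#include <stdio.h>

int main( void )
{
    char ch = '\xc2';            /* Â in a Latin-1 style Extended ASCII */
    int  promoted = ch;          /* the same value the integer promotion produces */

    printf( "promoted value: %d (0x%x)\n", promoted, (unsigned int)promoted );
    /* On a machine with signed chars and 32-bit ints this prints:
       promoted value: -62 (0xffffffc2) */
    return 0;
}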

The Solution

The solution is simple, though tedious. Every argument to a character classification macro should be cast to unsigned char. The cast forces a value such as 0xc2 back into the range 0 through 255, so the promotion that follows preserves that value rather than sign-extending it, and the table subscript stays within bounds. (Casting to unsigned int is not sufficient: converting a negative signed character to unsigned int yields a very large positive value, not a value between 0 and 255.) The drawback of casting is the tedium of doing it every time and the ease of forgetting to do it.
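
Written with the cast, the earlier test becomes:

if( isupper( (unsigned char)ch ) ) ...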

A better solution is to define your own set of macros to use instead of the character classification macros, as in this example:

#define safe_isupper(ch) isupper((unsigned char)(ch))
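
A short test program shows the wrapper in use. This is only a sketch; the character constant again assumes a Latin-1 style character set and a signed default character type:

#include <ctype.h>
#include <stdio.h>

#define safe_isupper(ch) isupper((unsigned char)(ch))

int main( void )
{
    char ch = '\xc2';    /* Â in Latin-1; negative when the default char type is signed */

    /* isupper( ch ) would subscript the table with -62 on this machine;
       safe_isupper( ch ) subscripts it with 194, which is always in bounds.
       Whether 0xc2 is actually classified as upper case depends on the
       current locale. */
    printf( "safe_isupper: %d\n", safe_isupper( ch ) != 0 );
    return 0;
}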

This solution is also not perfect, as it is easy to forget and use the standard macros by mistake. Undefining the standard macros does not prevent such accidents either: the Standard requires each of these facilities to exist as a real function as well, so after an #undef the name simply refers to the library function and an uncast call still compiles without complaint.

As is often the case in C, discipline (and a good memory) is needed to avoid the problem reliably.

Postscript

I would love for this discussion to find its way into Pointers on C, but that will never happen as the publisher has no interest in bringing out an updated edition.

Footnotes

¹ One could argue that this test would fail for Extended ASCII character sets in that it would not classify characters such as À, Ç, Ê and Í as "upper case alphabetic."

² "Extended ASCII" is actually a misnomer, for it implies that the ASCII definition itself was extended. In fact, this is not true (ref: http://en.wikipedia.org/wiki/Extended_ascii).

K. Reek 1/16/2011
