
String Module

Back in the late sixties and early seventies I invented a string handling module that used a four-pointer block to give access to the beginning and end of a string, which made it easy to provide many operations on the string body. That all went away with C. So for a while I used the crude, clumsy functions for manipulating C strings (i.e. a single pointer to the beginning of a block and a zero character to mark the end of the string). There was no memory management of the string body, and this implementation left many holes in the code.

Somewhere along the line C++ became popular and the Microsoft Foundation Classes (MFC) were invented. I came to this late and started writing code in C++. Time passed and I realized that Visual Studio's libraries have many ways to handle strings. Eventually I settled on the basic_string template in its Ansi form, string. The string class has many useful operations and may be subclassed to add even more operations.

There is another module named CString which has some useful features. But it is clearly inferior to string, particularly in its definition. There just doesn't seem to be a definition of CString like there is for the basic_string and string classes. Having said that, one must tolerate CString because much of the MFC library requires the use of a CString, either as an input or as an output.

Both string and CString manage the storage of the string body, which removes one worry from the engineer. This is such an important issue that almost any cost should be tolerated to avoid trying to manage the string's body (i.e. where the characters are kept).

During the transition to Unicode the basic_string template was easily used to form a tstring, which uses a Tchar in the body. This one small change transformed the entire module to Unicode with hardly any effort (even though it took some doing to figure it out the first time).
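A minimal sketch of that one small change, under stated assumptions: the Tchar typedef and the TXT macro below are hand-made stand-ins for what tchar.h provides on Windows (TXT is a hypothetical substitute for the real _T macro), so the sketch compiles without the Windows headers. The makeGreeting function is illustrative only and is not part of the module.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for <tchar.h>: pick the character type from the _UNICODE setting.
#ifdef _UNICODE
typedef wchar_t Tchar;
#define TXT(x) L##x
#else
typedef char Tchar;
#define TXT(x) x
#endif

// The one small change: a basic_string whose body is made of Tchar.
typedef std::basic_string<Tchar> tstring;

// Illustrative only: the same source compiles for either character set.
inline tstring makeGreeting(const tstring& name) {
  tstring greeting = TXT("Hello, ");
  greeting += name;
  return greeting;
}

inline void tstringExample() {
  tstring greeting = makeGreeting(TXT("softball"));
  assert(greeting.size() == 15);            // counts Tchar elements, not bytes
  assert(greeting.find(TXT("soft")) == 7);  // basic_string operations still work
}
```

Compiling the same file with and without _UNICODE defined switches the body between wide and narrow characters without touching the code that uses tstring.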

Tchar is an interesting type in itself. It is either an 8 bit (Ansi) character or a 16 bit character. This is determined by an element of the project's Properties, specifically Properties/General/Character Set (in VS2022 it is Properties/Advanced/Character Set). The choices are "Not Set" (i.e. Ansi), "Unicode" or "Multi-Byte". For Unicode the Tchar is a 16 bit character (wchar_t); for "Not Set" and Multi-Byte it is an 8 bit char. The C functions that operate on C-style strings have generic "t" names listed in "tchar.h", each of which maps to either the narrow or the wide version. Using the "t" version of the C-style string functions makes the code useful for both Ansi and Unicode strings without changing the code itself, just the Character Set in the properties. Furthermore, with the little definition of tstring as a basic_string of Tchar, the String class works for both character sets too.
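The mapping idea can be sketched in a few lines. This is only an imitation of the mechanism, assuming hand-written stand-ins: in the real tchar.h the generic name _tcslen resolves to strlen or wcslen, while tLength and TXT below are hypothetical names used so the sketch compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Stand-in for the generic-text mapping: one name, two possible targets.
#ifdef _UNICODE
typedef wchar_t Tchar;
#define TXT(x) L##x
inline std::size_t tLength(const Tchar* s) { return std::wcslen(s); }
#else
typedef char Tchar;
#define TXT(x) x
inline std::size_t tLength(const Tchar* s) { return std::strlen(s); }
#endif

// Application code uses only the generic names and never mentions
// char or wchar_t directly.
inline std::size_t nameLength() {
  const Tchar* name = TXT("softball");
  return tLength(name);
}

inline void mappingExample() {
  assert(nameLength() == 8);   // same answer for either character set
}
```

Only the block between #ifdef and #endif knows which character set is in force; everything after it is written once.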

String

More Comments About Strings (6/1/20)

Sometime in the last year I converted the String package (and many other packages) to use Unicode characters (16 bits per character). This had a serious ripple effect and it took some time to dampen the ripple. By now it seems to be done.

Let me start this discussion with some of my personal conventions for naming. Back when C was being invented the lower case character set was being introduced into languages because the computers could support it. Up to that time only upper case characters were supported by the hardware/software. So, of course, lower case was used everywhere. The basic idea of a "type" was invented and types were given names, for example the integer was an "int". Again hardware constructs limited the size of an int, and the C language was very flexible and molded itself to each set of hardware. Time passed, more things were invented, sizes changed but the C language moved with them.

Eventually, we got to "Object Oriented Languages" of which C++ is one. Now the programmer could define his own elements in the language. Along the way the Hungarian Notation was invented, where the first few characters of an object's name describe its type. I never appreciated the use of characters that way; in my naming I try to describe the object itself. For example, a name for a softball would not be szsoftball (where sz indicates that it is a zero terminated string) as in the Hungarian Notation; I would call it a "softball", period. The language would be sure to insist that I use the correct operations on the softball however it was defined.

However, there are entities in C++ that should be immediately recognizable, namely typedefs, classes, structs, enums and objects. The first four entities may define objects (or pseudo-objects). So I have chosen a simple scheme: capitalize the names of my typedefs, classes, structs and enums. Objects begin with lower case letters. In multiple word names, in both cases, the inner words are capitalized even if they are just abbreviations or single letters which stand for something longer.
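A small sketch of the scheme in use. Every name here is invented for illustration and none of it comes from the module; how to capitalize the enumerators themselves is also an assumption, since the convention above does not cover them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::string PlayerName;               // typedef: capitalized, inner word too

enum FieldPos {Pitcher, Catcher, FirstBase};  // enum: "Pos" abbreviates "Position"

struct RosterEntry {                          // struct: capitalized
  PlayerName playerName;                      // objects: lower case first letter,
  FieldPos   fieldPos;                        // inner words capitalized
};

class TeamRoster {                            // class: capitalized
  std::vector<RosterEntry> entries;
public:
  void add(const PlayerName& playerName, FieldPos fieldPos) {
    RosterEntry rosterEntry = {playerName, fieldPos};
    entries.push_back(rosterEntry);
  }
  std::size_t nEntries() const { return entries.size(); }
};

inline std::size_t namingExample() {
  TeamRoster homeTeamRoster;                  // the type and the object differ only
  homeTeamRoster.add("Casey", Pitcher);       // in case and in what they describe
  homeTeamRoster.add("Flynn", Catcher);
  return homeTeamRoster.nEntries();
}
```

A line such as "RosterEntry rosterEntry" shows the point: the capital letter alone tells the reader which word is the type and which is the object.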

Now why did I settle on this convention? Some time ago (probably 30 or 40 years) I read something about the psychology of reading. The notion that I took from that article was that mixing upper and lower case makes reading easier. The use of both upper and lower case signals something in the brain that passes useful information to the reader.

OK, enough on conventions. Here is my take on the Unicode character:

typedef       TCHAR   Tchar;     // char (Ansi) or wchar_t (Unicode), per the Character Set
typedef const TCHAR   TCchar;    // Constant version of Tchar
typedef const _TUCHAR TUCchar;   // Constant unsigned version (unsigned char or wchar_t)

Along came Unicode and the String class had to change. Now it is a subclass of tstring:

typedef basic_string<Tchar> tstring;

Whoops, I extended the "string" class name by prefixing a t to it, since tstring is an extension of the standard library definition of string.

What can one do with a String? It can be initialized with:

There are a few attributes of a string that can be retrieved:

One can manipulate the string:

The module includes two classes whose purpose is to translate between 16 bit characters and 8 bit characters. These classes are needed to translate between Unicode characters in the application and char (8 bit) characters used in other parts of the operating system (e.g. many applications still need 8 bit characters in files).
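A rough sketch of what such a pair of classes might look like, with several assumptions: ToNarrow and ToWide are hypothetical names (not the module's), the conversion uses the standard C functions wcstombs and mbstowcs, and the result depends on the current C locale. On Windows the usual route would be the WideCharToMultiByte and MultiByteToWideChar API functions instead. Note also that wchar_t is 16 bits on Windows but wider on many other systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical class: converts a wide string to 8 bit characters on construction.
class ToNarrow {
  std::string body;
public:
  explicit ToNarrow(const std::wstring& src) {
    std::vector<char> buf(src.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), src.c_str(), buf.size());
    // On failure wcstombs returns (size_t)-1 and the body stays empty.
    if (n != static_cast<std::size_t>(-1)) body.assign(buf.data(), n);
  }
  const std::string& operator()() const { return body; }
};

// Hypothetical class: converts 8 bit characters to a wide string on construction.
class ToWide {
  std::wstring body;
public:
  explicit ToWide(const std::string& src) {
    std::vector<wchar_t> buf(src.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), src.c_str(), buf.size());
    // On failure mbstowcs returns (size_t)-1 and the body stays empty.
    if (n != static_cast<std::size_t>(-1)) body.assign(buf.data(), n);
  }
  const std::wstring& operator()() const { return body; }
};

inline void translateExample() {
  std::string  fileText = ToNarrow(L"softball")();  // e.g. before writing a file
  std::wstring appText  = ToWide(fileText)();       // e.g. after reading it back
  assert(fileText == "softball");
  assert(appText == L"softball");
}
```

Doing the conversion in the constructor lets the translation sit inline at the boundary, for example ToNarrow(name)() in the argument list of a function that writes to a file.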

Finally, all functions available to a basic_string are available to a String without any special coding.
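A sketch of what that inheritance buys, under assumptions: the String below is not the module's actual class, just a hypothetical subclass with one invented operation (trim), and Tchar and TXT are hand-made stand-ins for what tchar.h supplies on Windows. One caution that applies to any such subclass: basic_string has no virtual destructor, so a String should not be deleted through a pointer to its base class.

```cpp
#include <cassert>
#include <string>

// Stand-in for <tchar.h>: pick the character type from the _UNICODE setting.
#ifdef _UNICODE
typedef wchar_t Tchar;
#define TXT(x) L##x
#else
typedef char Tchar;
#define TXT(x) x
#endif

typedef std::basic_string<Tchar> tstring;

// Hypothetical subclass: inherits every basic_string function and adds one more.
class String : public tstring {
public:
  String() {}
  String(const Tchar* s)   : tstring(s) {}
  String(const tstring& s) : tstring(s) {}

  // Invented operation: strip leading and trailing blanks in place.
  String& trim() {
    const size_type first = find_first_not_of(TXT(' '));
    if (first == npos) { clear(); return *this; }     // nothing but blanks
    const size_type last = find_last_not_of(TXT(' '));
    assign(substr(first, last - first + 1));
    return *this;
  }
};

inline void stringExample() {
  String s(TXT("  softball "));
  s.trim();                              // the added operation
  assert(s.size() == 8);                 // inherited from basic_string
  assert(s.find(TXT("ball")) == 4);      // inherited from basic_string
  assert(s == TXT("softball"));          // inherited comparison
}
```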