Unraveling Strings in Visual C++

This is an article I wrote sometime in the late 1990s about working with strings coming from COM, MFC, Win32, the C++ Standard Library, etc. It does not include anything about .NET since it was written before .NET came out.

Outline

Why so many strings?
When is each string appropriate?
Using various strings
Conversions between types
Conclusion
References
Sample code

Introduction

In the good old days, a string was a pointer to a null-terminated array of chars. Period. Now a string might be a char*, wchar_t*, LPCTSTR, BSTR, CString, basic_string, _bstr_t, CComBSTR, etc. Unfortunately, you cannot simply choose your favorite string representation and ignore the rest. Each representation has its own domain and it is frequently necessary to convert between types when crossing domain boundaries. Why are there so many kinds of strings? When is each one appropriate? How do you carry out common tasks with each? How do they relate to each other?

Why so many strings?

Strings differ in three important ways: character set, memory layout, and conventions for use. The most obvious and simplest of these is character set. To keep things focused, we will limit ourselves to ANSI and Unicode. ANSI strings, the kind everybody grew up on, are arrays of single-byte characters. By far most of the world's strings are ANSI strings. So why bother with Unicode?

Eight bits are plenty to represent all the characters of ordinary English text. But with the slightest thought to international software, it quickly becomes apparent that eight bits are woefully inadequate. Unicode, with 16 bits per character, has enough possibilities to cover all the world's major languages with enough characters left over to even throw in a few ancient languages for good measure.

Windows NT was built from the beginning to use Unicode strings exclusively internally, though you may write applications for NT that use either ANSI or Unicode. Windows CE only understands Unicode. OLE is built around Unicode strings. But Windows 3.x “doesn't know Unicode from a dress code, and never will [1].” The same is true of Windows 9x. The ANSI vs. Unicode strings are much like the English vs. metric measurement units: most everyone agrees the latter is the way to go, but the former has a tremendous installed base. In both situations, we will probably have to live with two standards and all the concomitant complications for a very long time.

C++ has two built-in character types: char and wchar_t. Most commonly a char is an ANSI character and a wchar_t is a Unicode character. This is not always the case, but to simplify things a bit, we will make this assumption. Wide character strings, i.e. strings of wchar_ts, are null-terminated arrays of characters, directly analogous to ordinary strings. The terminating null character in this case is a wchar_t null. Incidentally, by default the Visual C++ debugger does not display Unicode characters. There is a check box under Tools / Options / Debug labeled "Display unicode strings" which turns this on.

In order to be able to use the same source code for ANSI and Unicode builds, Windows introduced the TCHAR data type. TCHAR is simply a conditionally defined name that resolves to char in ANSI builds (i.e. _UNICODE is not defined) and to wchar_t in Unicode builds (_UNICODE is defined). There are various string types based on TCHAR, such as LPCTSTR (long pointer to a constant TCHAR string).

Microsoft also introduced a number of macros and typedefs with "OLE" in the name such as OLECHAR, LPOLESTR, etc. These are vestiges of an automatic ANSI / Unicode conversion scheme that Microsoft used prior to MFC 4.0 and has since abandoned. However, the names live on for legacy support and for Macintosh development. For example, if you look for help on CLSIDFromProgID you'll find that its first argument is an LPCOLESTR. For Win32 development, "OLE" corresponds to Unicode. For Win16 and for the Macintosh, the symbol OLE2ANSI is defined and "OLE" corresponds to ANSI. For example, in Win32 development, an OLECHAR is simply a wchar_t and an LPOLESTR is a wide character string.

Microsoft's character and string types may be summarized as follows. A character name has the form XCHAR and a string name has the form LPCXSTR, where C is optional and X is either T, OLE, W, or empty. The C indicates that a string type is constant, and the X has the following meanings:

T        Expands to wchar_t if _UNICODE is defined, else expands to char
OLE      Expands to char if OLE2ANSI is defined, else expands to wchar_t
W        wchar_t
(empty)  char

MFC introduced the CString class as a wrapper around the LPCTSTR data type. It provides methods for common tasks such as memory allocation and substring searches. A CString can be used in most circumstances where you would use an LPCTSTR.

The Standard C++ library provides a parameterized string class basic_string<T> where T is most often a char or wchar_t. The Standard library provides the typedefs string and wstring respectively for these common cases.
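The relationship between basic_string<T> and its two typedefs can be checked directly. The following is a minimal sketch, assuming a C++11 or later compiler (static_assert and <type_traits> postdate this article); CountChars is a hypothetical helper introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// string and wstring are typedefs for the two common instantiations.
static_assert(std::is_same<std::string, std::basic_string<char> >::value,
              "string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t> >::value,
              "wstring is basic_string<wchar_t>");

// Hypothetical helper: one template serves both character types.
template <typename T>
std::size_t CountChars(const std::basic_string<T>& s)
{
    return s.size();   // number of characters, not bytes
}
```

Calling CountChars with a std::string or a std::wstring instantiates the same template for char and wchar_t respectively.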

The real confusion in string types comes when we introduce BSTRs. A BSTR differs from a common string in that it always uses Unicode, regardless of compiler switches. However, it also has a different layout in memory. Furthermore, there are different conventions for using BSTRs than for using simple null-terminated strings, whether of the ANSI or Unicode variety, and these conventions are seldom codified.

A BSTR is a null-terminated Unicode string, but with a byte count (not character count!) prepended. An advantage of a byte-count prefix is that a BSTR can contain internal nulls, whereas an ordinary string may not. One unusual aspect of the BSTR is that the byte count is not in the 0th entry of the array the BSTR points to. Instead, the byte count is stored in the four bytes immediately preceding the memory the pointer ostensibly points to. (MFC's CString uses a similar trick so that passing a CString involves no more overhead than passing a pointer [2]. This causes no problems for developers, however, because the implementation is thoroughly encapsulated.)
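To make the layout concrete, here is a portable sketch of a length-prefixed string in the same spirit. It is not the real BSTR allocator: Char16, AllocPrefixed, PrefixedByteLen, and FreePrefixed are hypothetical names, the code uses malloc rather than the system allocator, and it assumes a four-byte byte count stored immediately before the first character, which is how the BSTR documentation describes the prefix.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a 16-bit OLECHAR; not the real Windows type.
typedef char16_t Char16;

// Allocates [4-byte byte count][cch characters][terminating null] and
// returns a pointer to the first character, i.e. just past the count.
Char16* AllocPrefixed(const Char16* src, std::uint32_t cch)
{
    const std::uint32_t cb = static_cast<std::uint32_t>(cch * sizeof(Char16));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + cb + sizeof(Char16)));
    if (block == nullptr)
        return nullptr;
    std::memcpy(block, &cb, sizeof cb);        // byte count, not character count
    Char16* text = reinterpret_cast<Char16*>(block + sizeof(std::uint32_t));
    std::memcpy(text, src, cb);                // internal nulls are copied too
    text[cch] = 0;                             // terminating null
    return text;
}

// Reads the byte count stored just before the characters.
// A null pointer is treated as a zero-length string.
std::uint32_t PrefixedByteLen(const Char16* s)
{
    std::uint32_t cb = 0;
    if (s != nullptr)
        std::memcpy(&cb,
                    reinterpret_cast<const unsigned char*>(s) - sizeof cb,
                    sizeof cb);
    return cb;
}

// Frees the whole block, starting at the hidden prefix.
void FreePrefixed(Char16* s)
{
    if (s != nullptr)
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
}
```

Because the length comes from the prefix rather than from scanning for a null, a five-character string with a null in the middle still reports ten bytes.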

OLE standardized on the BSTR partially because of OLE's desire to be language-independent. Many languages use counted arrays rather than a special symbol to mark the end of a string. The BSTR compromises by requiring both a count and a terminating character. (Note that in the context of string and character types, OLE refers only to character widths. In particular, an LPOLESTR is simply a wide character string and not a BSTR. Despite the name, an LPOLESTR is not OLE's favorite string!)

BSTRs are an unnatural imposition on C++. However, they are unavoidable because OLE is built around BSTRs and not native C++ strings. In order to make BSTR manipulation easier from C++, several wrapper classes have been created. One is ATL's CComBSTR class, which handles basic memory management and a few basic operations for strings.

There is another BSTR wrapper which one must use in order to take advantage of the native COM support in the Visual C++ compiler. When you use the #import directive, the compiler creates wrapper functions for the methods on the imported COM interfaces. BSTR arguments and return values are wrapped as _bstr_t. (However, BSTR* arguments are left alone, so the _bstr_t doesn't entirely eliminate the need to manipulate BSTRs.) The design goals of _bstr_t are different from those of CComBSTR. The former provides more convenience functions, and is implemented with reference counting to avoid unnecessary memory copying.

When is each string appropriate?

MFC class methods often take LPCTSTR arguments. The obvious choice of a class wrapper for strings in MFC development is CString, especially because a CString can be used in most situations where an LPCTSTR is specified. The advantage of the CString class is that it provides many useful methods for memory management and string manipulation. One disadvantage is that CString carries with it a little more overhead than a raw LPCTSTR. Also, even if CString is the only MFC class in a project, using it still requires linking to and redistributing the MFC DLLs.

The Standard C++ basic_string<> has the advantage of being portable to non-Windows platforms. Also, you may explicitly decide between char and wchar_t strings on an individual basis rather than deciding once and for all based on a compiler switch as with TCHAR strings. And you could use basic_string<TCHAR> to maintain the ANSI vs. Unicode flexibility of CString. Like CString, basic_string<> does define a large number of convenient string manipulation functions. A design goal of this string class was to make the class sufficiently convenient and efficient that it would seldom be necessary to use null terminated strings and the C library manipulation functions.

In OLE interfaces, there is no choice but to use BSTR or one of its wrapper classes. Ordinarily, a C++ developer would use a BSTR only as a delivery vehicle to a COM interface; string manipulation is more easily done via library methods and wrapper classes native to C++. Because a BSTR may contain any characters, even internal nulls, it is possible to wrap arbitrary data in a BSTR to pass to another function (for example, to avoid having to write custom marshalling code for a COM interface). 

ATL's CComBSTR is a light-weight wrapper class with adequate functionality for common tasks, and is a natural choice for ATL development. The _bstr_t class is more complicated, but cannot be avoided when using the #import directive and the wrapper functions it creates.

Using various strings

The L symbol before a character literal denotes that the character is a wide character, as in

wchar_t ch = L'a';

This designation is seldom necessary for plain English characters: the first 128 characters of Unicode are the same as ASCII. Had we left out the L in front of the first quote mark, the char 'a' would have been promoted to the wchar_t with the same value.

The L symbol is also used to distinguish wchar_t strings from ordinary strings, as in

wchar_t wsz[] = L"Unicode String";

Windows provides the macros _T() and _TEXT() which do nothing unless _UNICODE is defined, in which case they each expand to L. Hence _T("John") reverts to simply "John" in ANSI builds and expands to L"John" in Unicode builds. There is an analogous OLESTR macro that disappears if OLE2ANSI is defined and expands to L otherwise.
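The mechanism can be imitated in portable code. The sketch below is an illustration only: MY_UNICODE, MY_TCHAR, MY_T, MyTcsLen, and kName are hypothetical names standing in for _UNICODE, TCHAR, _T, and a TCHAR length routine, and the real headers are more elaborate than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// MY_UNICODE is a hypothetical switch standing in for _UNICODE.
// Define it on the compiler command line to get the wide build.
#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T(x) L##x                 // paste an L in front of the literal
inline std::size_t MyTcsLen(const MY_TCHAR* s) { return std::wcslen(s); }
#else
typedef char MY_TCHAR;
#define MY_T(x) x                    // leave the literal alone
inline std::size_t MyTcsLen(const MY_TCHAR* s) { return std::strlen(s); }
#endif

// The same source line compiles in both builds.
const MY_TCHAR* const kName = MY_T("John");
```

Only the typedef and the two macros change between builds; code written against MY_TCHAR and MY_T compiles unchanged either way.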

For most of the Standard C library string routines, you can change the initial "str" in the name to "wcs" to determine the name of the corresponding routine for wide character strings. For example, wcscpy is the wide character counterpart of the venerable strcpy. Also, you may change "str" to "_tcs" to come up with the name of a corresponding TCHAR routine.
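As a small illustration of the naming pattern, the two overloads below do the same work with the narrow and wide routines. CopyAndMeasure is a hypothetical helper; the caller is assumed to supply a destination buffer large enough for the source string and its terminating null.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Narrow version: strcpy and strlen.
std::size_t CopyAndMeasure(char* dest, const char* src)
{
    std::strcpy(dest, src);          // caller guarantees dest is big enough
    return std::strlen(dest);
}

// Wide version: same shape, with "str" replaced by "wcs".
std::size_t CopyAndMeasure(wchar_t* dest, const wchar_t* src)
{
    std::wcscpy(dest, src);          // caller guarantees dest is big enough
    return std::wcslen(dest);        // length in characters, not bytes
}
```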

Because a BSTR allocates memory before the location it nominally points to, a whole different API is necessary for working with BSTRs. For example, you may not initialize a BSTR by saying

BSTR b = L"A String";

This will correctly initialize b to a wide character array, but the byte count is uninitialized and so b is not a valid BSTR. The proper way to initialize b is 

BSTR b = ::SysAllocString(L"A String");

Before b goes out of scope, its memory needs to be released by calling ::SysFreeString. Note that because the memory for BSTRs is allocated via a system call rather than the C++ new operator, memory leaks due to failing to call ::SysFreeString will not show up in the Visual C++ debugger. (NuMega's BoundsChecker will catch these leaks, however.)

Two other handy functions for working with BSTRs are ::SysAllocStringLen and ::SysStringLen. The former allocates a string of a given length and the latter is analogous to the Standard C strlen.

The subtlest difficulty with using BSTRs is that they have conventions for their use that differ from those of other strings. For example, a NULL BSTR is treated as a valid, zero-length string unlike an ordinary string. The only place I have seen anyone attempt to codify these conventions is in Bruce McKinney's excellent article cited earlier. The reader is advised to study the section of his article entitled "The Eight Rules of BSTR." 

The CComBSTR wrapper is straightforward to use. It does not have a lot of methods, but the ones it has are simple and self-explanatory. The _bstr_t class is more complex. It has more convenience functions. It reference-counts memory to avoid unnecessary copying and throws exceptions. CComBSTR does no reference counting and does not throw exceptions.

Conversions between types

Developers frequently work in the intersection of two or more cultures. You may be writing an OLE application using Standard C++, MFC and ATL. But OLE, Standard C++, MFC, and ATL represent four different cultures, each with its own preferred string type or string wrapper class. Therefore an important part of working with strings is knowing how to convert between the various manifestations.

Because a BSTR is null-terminated and because its pointer points past the byte count, a BSTR "is a" (in an inheritance sort of sense) wide character string. You may pass a BSTR to a function expecting a wchar_t*. (Of course, if the BSTR being passed in contains any internal nulls, data after the first null will be lost in the interpretation as a wide character string.) However, this interchangeability with wide character strings is tricky. You cannot always look at a variable and tell whether a wchar_t* is merely a null-terminated wide character string or whether in fact it is a BSTR. The source code for _bstr_t is a good example. There is an operator _bstr_t::operator const wchar_t* which implies only that you may pass a _bstr_t to a function expecting a const wchar_t*. However, reading the implementation code, you discover that the const wchar_t* in question is actually a full-fledged BSTR. As McKinney points out, "a BSTR is a BSTR by convention" and not a built-in type that the compiler can check. 
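The loss of data after an internal null is easy to reproduce without COM. The sketch below uses std::wstring as a stand-in for a counted string, which is an analogy rather than a real BSTR; LengthAsCString and MakeCountedWithInternalNull are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Builds a counted string of five characters with a null in the middle.
std::wstring MakeCountedWithInternalNull()
{
    const wchar_t raw[] = { L'a', L'b', L'\0', L'c', L'd' };
    return std::wstring(raw, 5);     // length given explicitly, so the null is kept
}

// Length as seen by code that treats the buffer as a null-terminated string.
std::size_t LengthAsCString(const std::wstring& s)
{
    return std::wcslen(s.c_str());   // stops at the first null
}
```

The counted view reports five characters, while the null-terminated view of the same buffer reports two.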

The header file atlconv.h contains a whopping 28 conversion macros for converting between the various non-class string types covered in this article. These macros have the form X2Y. The source type X can be A, T, W, or OLE for ANSI, TCHAR, wchar_t or OLE respectively. The destination type Y can be any of these types or additionally BSTR. Except for BSTR, the destination types may optionally have a C in front of their type to indicate const. For example, A2CW takes an ANSI string and returns a constant wide character string. Of course, there are no macros for converting a type to itself. Note that there is no need for a BSTR source type because you may use a BSTR as a wide character string. Some of these macros require that you first call the macro USES_CONVERSION while others do not. Note that unlike most macros, USES_CONVERSION must be followed by a semicolon. Except when converting to a BSTR, these macros allocate memory on the stack; BSTRs are always allocated by a system call and must be freed using ::SysFreeString.
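For readers who want to see the kind of work such a conversion does, here is a portable stand-in built on the Standard C routines mbstowcs and wcstombs. It is not what atlconv.h does: NarrowToWide and WideToNarrow are hypothetical helpers, they allocate on the heap rather than the stack, and they convert according to the current C locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: narrow (multibyte) string to wide string.
std::wstring NarrowToWide(const char* src)
{
    // Each multibyte character yields at most one wide character,
    // so strlen(src) + 1 wide characters is always enough room.
    const std::size_t cap = std::strlen(src) + 1;
    std::vector<wchar_t> buf(cap);
    const std::size_t n = std::mbstowcs(&buf[0], src, cap);
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();       // invalid multibyte sequence
    return std::wstring(&buf[0], n);
}

// Hypothetical helper: wide string to narrow (multibyte) string.
std::string WideToNarrow(const wchar_t* src)
{
    // Each wide character needs at most MB_CUR_MAX bytes.
    const std::size_t cap = std::wcslen(src) * MB_CUR_MAX + 1;
    std::vector<char> buf(cap);
    const std::size_t n = std::wcstombs(&buf[0], src, cap);
    if (n == static_cast<std::size_t>(-1))
        return std::string();        // character not representable
    return std::string(&buf[0], n);
}
```

Returning a string object avoids the lifetime question that stack-allocated results raise, at the cost of a heap allocation per conversion.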

CString defines a constructor and an operator= that each take an LPCTSTR argument. In particular, you can pass an LPCTSTR into a function taking a CString. CString also provides an operator LPCTSTR and so you can also pass a CString to a function expecting an LPCTSTR. CString has a method AllocSysString that produces a BSTR from its contents. Finally, CString can take a LPCWSTR (a const wchar_t*) as an argument to either a constructor or to operator=.

The basic_string<T> class has constructor and operator= methods which take a const T* argument. However, you cannot pass a basic_string<T> directly to a function expecting a const T*, because basic_string<> exposes its character string via a member function called c_str() rather than via a type conversion operator. You must call c_str() explicitly.
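The asymmetry looks like this in practice. LegacyLength is a hypothetical stand-in for any C-style function taking a const char*, and CallLegacy is a hypothetical wrapper around it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a C-style API that expects a null-terminated string.
std::size_t LegacyLength(const char* psz)
{
    return std::strlen(psz);
}

std::size_t CallLegacy(const std::string& s)
{
    // LegacyLength(s) would not compile: std::string has no implicit
    // conversion to const char*. The explicit c_str() call is required.
    return LegacyLength(s.c_str());
}
```

Going the other way needs no ceremony: passing a string literal to CallLegacy constructs a temporary std::string through the const char* constructor.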

CComBSTR has both a constructor and an operator= which take a BSTR argument, as well as a type conversion operator for BSTR. Thus CComBSTR has roughly the same relationship with BSTR as CString has with LPCTSTR.

The class _bstr_t has constructor and operator= overloads that take either ANSI or wide character strings. Also, it supports type conversion operators to both kinds of strings. As noted earlier, the type conversion operator for wide character strings actually returns a BSTR. Therefore you can pass or receive a _bstr_t as an ANSI string or a BSTR.

Conclusion

Developers these days have to contend with at least two character sets — ANSI and Unicode — and at least two memory representations — null terminated and count prepended. This alone makes multiple string types inevitable. Macros and wrapper classes simplify the situation in some circumstances, but they also add their own complexity. 

The Visual C++ developer stands in the intersection of a number of programming idioms — traditional C, Standard C++, MFC, COM, ATL — each with its own favorite string representation. You cannot avoid working with numerous string representations and converting from one to another. It is important to understand how each works and the implicit conventions for working with each type.

References

1. Bruce McKinney, Strings the OLE Way, available on MSDN.
2. Jim Beveridge, CString: Part of the plumbing behind MFC and a model for efficient design, Visual C++ Developers Journal, Volume 1 Number 4.

Sample code

#include <afxpriv.h> // for USES_CONVERSION
#include <atlbase.h> // for CComBSTR
#include <comdef.h>  // for _bstr_t
#include <string>    // for std::basic_string

CString cs;
BSTR bstr;
WCHAR wsz[81];
CComBSTR cbstr;
char sz[81];
TCHAR tsz[81];
std::basic_string<char> bs;
_bstr_t _bstr;

USES_CONVERSION;
	
// Convert CString to various types
cs = "String1";
bstr = cs.AllocSysString();     // BSTR	
_tcscpy(tsz, (LPCTSTR)cs);      // LPCTSTR
strcpy(sz, T2A(tsz));           // ANSI string
wcscpy(wsz, bstr);              // wide string
cbstr = bstr;                   // CComBSTR via 
bs = sz;                        // STL string
_bstr = (LPCTSTR) cs;           // _bstr_t via either 
                                //     operator=(const char*) or
                                //     operator=(const wchar_t*) 
                                //     if _UNICODE is defined.
::SysFreeString(bstr);

// Convert BSTR to various types
bstr = ::SysAllocString(L"String2");
cs = bstr;                      // CString via its LPCWSTR ctor
wcscpy(wsz, bstr);              // Unicode
cbstr = bstr;                   // CComBSTR via operator=(LPOLESTR)
strcpy(sz, W2A(bstr));          // ANSI string
bs = sz;                        // STL string operator=(const T*)
_tcscpy(tsz, W2T(bstr));        // LPTSTR 
_bstr = bstr;                   // _bstr_t via operator=(const wchar_t*)
::SysFreeString(bstr);

 
