This is an article I wrote sometime in the late 1990’s about working with strings coming from COM, MFC, Win32, the C++ Standard Library, etc. It does not include anything about .NET since it was written before .NET came out.
In the good old days, a string was a pointer to a null-terminated array of chars. Period. Now a string might be a CComBSTR, etc. Unfortunately, you cannot simply choose your favorite string representation and ignore the rest. Each representation has its own domain and it is frequently necessary to convert between types when crossing domain boundaries. Why are there so many kinds of strings? When is each one appropriate? How do you carry out common tasks with each? How do they relate to each other?
Strings differ in three important ways: character set, memory layout, and conventions for use. The most obvious and simplest of these is character set. To keep things focused, we will limit ourselves to ANSI and Unicode. ANSI strings, the kind everybody grew up on, are arrays of single-byte characters. By far most of the world’s strings are ANSI strings. So why bother with Unicode?
Eight bits are plenty to represent all the characters of ordinary English text. But with the slightest thought to international software, it quickly becomes apparent that eight bits are woefully inadequate. Unicode, with 16 bits per character, has enough possibilities to cover all the world’s major languages with enough characters left over to even throw in a few ancient languages for good measure.
Windows NT was built from the beginning to use Unicode strings exclusively internally, though you may write applications for NT that use either ANSI or Unicode. Windows CE only understands Unicode. OLE is built around Unicode strings. But Windows 3.x “doesn’t know Unicode from a dress code, and never will.” The same is true of Windows 9x. The ANSI vs. Unicode situation is much like that of English vs. metric measurement units: most everyone agrees the latter is the way to go, but the former has a tremendous installed base. In both situations, we will probably have to live with two standards and all the concomitant complications for a very long time.
C++ has two built-in character types: char and wchar_t. Most commonly a char is an ANSI character and a wchar_t is a Unicode character. This is not always the case, but to simplify things a bit, we will make this assumption. Wide character strings, i.e. strings of wchar_ts, are null-terminated arrays of characters, directly analogous to ordinary strings. The terminating null character in this case is a wchar_t null. Incidentally, by default the Visual C++ debugger does not display Unicode characters. There is a check box under Tools / Options / Debug labeled “Display unicode strings” which turns this on.
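As a small illustration of the parallel between ordinary and wide strings, the sketch below measures both kinds with the Standard C library. The helper names are made up for this example, and nothing here assumes a particular width for wchar_t, which is 16 bits under Visual C++ but wider on some other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length in characters of an ordinary string, not counting the '\0' terminator.
std::size_t narrow_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}

// Length in characters of a wide string, not counting the L'\0' terminator.
std::size_t wide_length(const wchar_t* s) {
    assert(s != nullptr);
    return std::wcslen(s);
}

// Storage in bytes for a wide string, including its wchar_t terminator.
std::size_t wide_storage_bytes(const wchar_t* s) {
    return (wide_length(s) + 1) * sizeof(wchar_t);
}
```

Both narrow_length("abc") and wide_length(L"abc") report three characters. Only the storage differs, since every wide character, including the terminator, occupies sizeof(wchar_t) bytes.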
In order to be able to use the same source code for ANSI and Unicode builds, Windows introduced the TCHAR data type. TCHAR is simply a macro that expands to char in ANSI builds (i.e. _UNICODE is not defined) and wchar_t in Unicode builds (_UNICODE is defined). There are various string types based on the TCHAR macro, such as LPCTSTR (long pointer to a constant TCHAR string).
Microsoft also introduced a number of macros and typedefs with “OLE” in the name such as LPOLESTR, etc. These are vestiges of an automatic ANSI / Unicode conversion scheme that Microsoft used prior to MFC 4.0 and has since abandoned. However, the names live on for legacy support and for Macintosh development. For example, if you look for help on CLSIDFromProgID you’ll find that its first argument is an LPCOLESTR. For Win32 development, “OLE” corresponds to Unicode. For Win16 and for the Macintosh, the symbol OLE2ANSI is defined and “OLE” corresponds to ANSI. For example, in Win32 development, an OLECHAR is simply a wchar_t and an LPOLESTR is a wide character string.
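Microsoft’s character and string types may be summarized as follows. A character name has the form XCHAR and a string name has the form LPXSTR or LPCXSTR, where X is either T, OLE, W, or empty. The C indicates that a string type is constant, and the X has the following meanings: T denotes TCHAR, OLE denotes OLECHAR, W denotes a wide character (wchar_t), and an empty X denotes an ordinary char.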
MFC introduced the CString class as a wrapper around the LPCTSTR data type which provides methods for common tasks such as memory allocation and substring searches. A CString can be used in most circumstances where you would use an LPCTSTR.
The Standard C++ library provides a parameterized string class basic_string<T>, where T is most often a char or a wchar_t. The Standard library provides the typedefs string and wstring respectively for these common cases.
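To make the parameterization concrete, here is a minimal sketch of one function template that works for any basic_string<T>, and therefore for both string and wstring. The function name count_char is invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// The two common typedefs are nothing more than instantiations of basic_string.
static_assert(std::is_same<std::string, std::basic_string<char> >::value,
              "string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t> >::value,
              "wstring is basic_string<wchar_t>");

// Counts occurrences of ch in text, for any character type T.
template <typename T>
std::size_t count_char(const std::basic_string<T>& text, T ch) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == ch) ++n;
    }
    return n;
}
```

The same source serves both character types: count_char(std::string("banana"), 'a') and count_char(std::wstring(L"banana"), L'a') each instantiate the template once.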
The real confusion in string types comes when we introduce the BSTR. A BSTR differs from a common string in that it always uses Unicode, regardless of compiler switches. However, it also has a different layout in memory. Furthermore, the conventions for using BSTRs differ from those for simple null-terminated strings, whether of the ANSI or Unicode variety, and these conventions are seldom codified.
A BSTR is a null-terminated Unicode string, but with a byte count (not character count!) prepended. An advantage of a byte-count prefix is that a BSTR can contain internal nulls, whereas an ordinary string may not. One unusual aspect of the BSTR is that the byte count is not in the 0th entry of the array the BSTR points to. Instead, the byte count is stored in the four bytes preceding the memory the pointer ostensibly points to. (MFC’s CString uses a similar trick so that passing a CString involves no more overhead than passing a pointer. This causes no problems for developers, however, because the implementation is thoroughly encapsulated.)
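The layout is easier to see in code. The following is a portable mock-up, not the Win32 API: DemoBstr and the demo_ functions are hypothetical names, the prefix is assumed to be a 32-bit integer, and char16_t stands in for a 16-bit Unicode character. It only illustrates the idea of a byte count stored just before the address the pointer holds.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef char16_t* DemoBstr;  // hypothetical stand-in for a BSTR

// Allocates [byte count][len characters][terminator] and returns a pointer
// to the first character, i.e. just past the byte count.
DemoBstr demo_alloc(const char16_t* src, std::size_t len) {
    std::uint32_t bytes = static_cast<std::uint32_t>(len * sizeof(char16_t));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof bytes + bytes + sizeof(char16_t)));
    if (!block) return nullptr;
    std::memcpy(block, &bytes, sizeof bytes);            // prefix: byte count
    char16_t* data = reinterpret_cast<char16_t*>(block + sizeof bytes);
    if (bytes != 0) std::memcpy(data, src, bytes);       // may contain nulls
    data[len] = u'\0';                                   // terminator as well
    return data;
}

// Reads the byte count stored in front of the characters.
// A null pointer is treated as a valid, empty string.
std::size_t demo_byte_len(DemoBstr s) {
    if (!s) return 0;
    std::uint32_t bytes;
    std::memcpy(&bytes, reinterpret_cast<unsigned char*>(s) - sizeof bytes,
                sizeof bytes);
    return bytes;
}

// Length in characters rather than bytes.
std::size_t demo_len(DemoBstr s) { return demo_byte_len(s) / sizeof(char16_t); }

// Frees the whole block, which starts before the pointer handed out.
void demo_free(DemoBstr s) {
    if (s) std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
}
```

Because the length comes from the prefix rather than from scanning for a terminator, a string built from the five characters a, b, null, c, d keeps all five. The sketch also shows why such a pointer cannot be released with an ordinary free or delete: the allocation begins before the address the caller sees.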
OLE standardized on the BSTR partially because of OLE’s desire to be language-independent. Many languages use counted arrays rather than a special symbol to mark the end of a string. The BSTR compromises by requiring both a count and a terminating character. (Note that in the context of string and character types, OLE refers only to character widths. In particular, an LPOLESTR is simply a wide character string and not a BSTR. Despite the name, an LPOLESTR is not OLE’s favorite string!)
BSTRs are an unnatural imposition on C++. However, they are unavoidable because OLE is built around BSTRs and not native C++ strings. In order to make BSTR manipulation easier from C++, several wrapper classes have been created. One is ATL’s CComBSTR class, which handles basic memory management and a few basic operations for strings.
There is another BSTR wrapper which one must use in order to take advantage of the native COM support in the Visual C++ compiler. When you use the #import directive, the compiler creates wrapper functions for the methods on the imported COM interfaces. BSTR arguments and return values are wrapped as _bstr_t objects. (BSTR* arguments are left alone, so the _bstr_t doesn’t entirely eliminate the need to manipulate BSTRs.) The design goals of _bstr_t are different from those of CComBSTR. The former provides more convenience functions, and is implemented with reference counting to avoid unnecessary memory copying.
MFC class methods often take LPCTSTR arguments. The choice of a class wrapper for strings in MFC development is obviously CString, especially because a CString can be used in most situations where an LPCTSTR is specified. The advantage of the CString class is that it provides many useful methods for memory management and string manipulation. One disadvantage is that CString carries with it a little bit more overhead than a raw LPCTSTR. Also, if CString is the only MFC class in a project, it still requires linking to and redistributing the MFC DLLs.
The Standard C++ basic_string<> has the advantage of being portable to non-Windows platforms. Also, you may explicitly decide between char and wchar_t strings on an individual basis rather than deciding once and for all based on a compiler switch as with TCHAR strings. And you could use basic_string<TCHAR> to keep the ANSI vs. Unicode build flexibility. basic_string<> defines a large number of convenient string manipulation functions. A design goal of this string class was to make the class sufficiently convenient and efficient that it would seldom be necessary to use null-terminated strings and the C library manipulation functions.
In OLE interfaces, there is no choice but to use a BSTR or one of its wrapper classes. Ordinarily, a C++ developer would use a BSTR only as a delivery vehicle to a COM interface; string manipulation is more easily done via library methods and wrapper classes native to C++. Because a BSTR may contain any characters, even internal nulls, it is possible to wrap arbitrary data in a BSTR to pass to another function (for example, to avoid having to write custom marshalling code for a COM interface).
CComBSTR is a light-weight wrapper class with adequate functionality for common tasks, and is a natural choice for ATL development. The _bstr_t class is more complicated, but cannot be avoided when using the #import directive and the wrapper functions it creates.
The L symbol before a character literal denotes that the character is a wide character, as in
wchar_t ch = L'a';
This designation is seldom necessary for plain English text: the first 128 characters of Unicode are the same as ASCII (and the first 256 match Latin-1). Had we left out the L in front of the first quote mark, the char 'a' would have been promoted to the wchar_t with the same value.
The L symbol is also used to distinguish wchar_t strings from ordinary strings, as in
wchar_t wsz[] = L"Unicode String";
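The effect of the L prefix can be checked at compile time. The sketch below is only an illustration with invented function names; the claim that a promoted 'a' equals L'a' holds on mainstream compilers whose execution character sets agree on ASCII, which is an assumption of this example.

```cpp
#include <cstddef>
#include <type_traits>

// The prefix changes the type of the literal, not just its spelling.
static_assert(std::is_same<decltype('a'), char>::value, "plain literal is char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "L literal is wchar_t");

// A plain char assigned to a wchar_t is promoted to the same value.
bool promoted_char_matches_wide() {
    wchar_t from_narrow = 'a';
    wchar_t from_wide = L'a';
    return from_narrow == from_wide;
}

// Number of characters in a wide array initialized from an L"..." literal,
// not counting the terminating wide null.
std::size_t wide_literal_chars() {
    const wchar_t wsz[] = L"Unicode String";
    return sizeof(wsz) / sizeof(wsz[0]) - 1;
}
```

Note that the array form wchar_t wsz[] is what lets sizeof see the whole literal; a pointer would only report the size of the pointer.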
Windows provides the macros _T() and _TEXT() which do nothing unless _UNICODE is defined, in which case they each prepend an L to their argument. For example, _T("John") reverts to simply "John" in ANSI builds and expands to L"John" in Unicode builds. There is an analogous OLESTR macro that disappears if OLE2ANSI is defined and prepends an L otherwise.
For most of the Standard C library string routines, you can change the initial “str” in the name to “wcs” to determine the name of the corresponding routine for wide character strings. For example, wcscpy is the wide character counterpart of the venerable strcpy. Also, you may change “str” to “_tcs” to come up with the name of a corresponding TCHAR routine, such as _tcscpy.
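The generic-text scheme can be imitated in a few portable lines. This is a sketch of the idea only, not the contents of the Windows headers: DEMO_UNICODE, DemoTChar, DEMO_T and demo_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, _T and a _tcs routine.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// One switch selects the character type, the literal prefix,
// and the library routine all at once.
#ifdef DEMO_UNICODE
typedef wchar_t DemoTChar;
#define DEMO_T(x) L##x   // paste an L in front of the literal
inline std::size_t demo_tcslen(const DemoTChar* s) { return std::wcslen(s); }
#else
typedef char DemoTChar;
#define DEMO_T(x) x      // leave the literal alone
inline std::size_t demo_tcslen(const DemoTChar* s) { return std::strlen(s); }
#endif

// Written once; compiles to a narrow or a wide string depending on the switch.
const DemoTChar* demo_name() { return DEMO_T("John"); }
```

Code written against the three neutral names compiles unchanged in either mode, which is the whole point of the scheme; the cost is that every literal and every library call has to go through the macros.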
Because a BSTR allocates memory before the location it nominally points to, a whole different API is necessary for working with BSTRs. For example, you may not initialize a BSTR by saying
BSTR b = L"A String";
This will correctly initialize b to a wide character array, but the byte count is uninitialized and so b is not a valid BSTR. The proper way to initialize b is
BSTR b = ::SysAllocString(L"A String");
When b goes out of scope, its memory needs to be released by calling ::SysFreeString. Note that because the memory for BSTRs is allocated via a system call rather than the C++ new operator, memory leaks due to failing to call ::SysFreeString will not show up in the Visual C++ debugger. (NuMega’s BoundsChecker will catch these leaks, however.)
Two other handy functions for working with BSTRs are ::SysAllocStringLen and ::SysStringLen. The former allocates a string to a given length and the latter is analogous to the Standard C strlen function.
The subtlest difficulty with using BSTRs is that they have conventions for their use that differ from those of other strings. For example, a null BSTR is treated as a valid, zero-length string, unlike an ordinary string. The only place I have seen anyone attempt to codify these conventions is in Bruce McKinney’s excellent article cited earlier. The reader is advised to study the section of his article entitled “The Eight Rules of BSTR.”
The CComBSTR wrapper is straightforward to use. It does not have a lot of methods, but the ones it has are simple and self-explanatory. The _bstr_t class is more complex. It has more convenience functions, it reference-counts memory to avoid unnecessary copying, and it throws exceptions. CComBSTR does no reference counting and does not throw exceptions.
Developers frequently work in the intersection of two or more cultures. You may be writing an OLE application using Standard C++, MFC and ATL. But OLE, Standard C++, MFC, and ATL represent four different cultures, each with its own preferred string type or string wrapper class. Therefore an important part of working with strings is knowing how to convert between the various manifestations.
Because a BSTR is null-terminated and because its pointer points past the byte count, a BSTR “is a” (in an inheritance sort of sense) wide character string. You may pass a BSTR to a function expecting a wchar_t*. (Of course, if the BSTR being passed in contains any internal nulls, data after the first null will be lost in the interpretation as a wide character string.) However, this interchangeability with wide character strings is tricky. You cannot always look at a variable and tell whether a wchar_t* is merely a null-terminated wide character string or whether in fact it is a BSTR. The source code for _bstr_t is a good example. There is an operator _bstr_t::operator const wchar_t* which implies only that you may pass a _bstr_t to a function expecting a const wchar_t*. However, reading the implementation code, you discover that the const wchar_t* in question is actually a full-fledged BSTR. As McKinney points out, “a BSTR is a BSTR by convention” and not a built-in type that the compiler can check.
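The loss of data after an internal null is easy to reproduce without any COM code. In this sketch std::wstring stands in for a counted string such as a BSTR, purely because it also records its length explicitly; the function names are invented for the example.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Builds a five-character counted string with a null in the middle:
// a, b, null, c, d.
std::wstring make_counted() {
    return std::wstring(L"ab\0cd", 5);
}

// The counted view: the length is whatever was recorded at construction.
std::size_t counted_length(const std::wstring& s) {
    return s.size();
}

// The null-terminated view: the same buffer read as a plain wide string
// stops at the first null.
std::size_t terminated_length(const std::wstring& s) {
    return std::wcslen(s.c_str());
}
```

The two views of the same buffer disagree about its length, and the compiler cannot warn about it, since both are just pointers to wide characters.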
The header file atlconv.h contains a whopping 28 conversion macros for converting between the various non-class string types covered in this article. These macros have the form X2Y. The source type X can be A, T, W, or OLE for ANSI, TCHAR, wchar_t or OLE respectively. The destination type Y can be any of these types or additionally BSTR. Except for BSTR, the destination types may optionally have a C in front of their type to indicate const. For example, A2CW takes an ANSI string and returns a constant wide character string. Of course, there are no macros for converting a type to itself. Note that there is no need for a BSTR source type because you may use a BSTR as a wide character string. Some of these macros require that you first call the macro USES_CONVERSION while others do not. Note that unlike most macros, USES_CONVERSION must be followed by a semicolon. Except when converting to a BSTR, these macros allocate memory on the stack; BSTRs are always allocated by a system call and must be freed using ::SysFreeString.
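For readers without ATL at hand, the following portable sketch shows what a narrow-to-wide and wide-to-narrow conversion amounts to in the simplest case. It is not how the atlconv.h macros are implemented: the real macros perform a code-page-aware conversion, whereas these hypothetical helpers, widen_ascii and narrow_ascii, only handle 7-bit ASCII and return std::string objects instead of stack buffers.

```cpp
#include <cstddef>
#include <string>

// Rough analogue of an A2W-style conversion, valid for ASCII text only.
std::wstring widen_ascii(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (std::size_t i = 0; i < narrow.size(); ++i) {
        // Go through unsigned char so that no value is sign-extended.
        wide.push_back(static_cast<wchar_t>(
            static_cast<unsigned char>(narrow[i])));
    }
    return wide;
}

// Rough analogue of a W2A-style conversion. Characters outside ASCII have
// no single-byte equivalent in this sketch and are replaced with '?'.
std::string narrow_ascii(const std::wstring& wide) {
    std::string narrow;
    narrow.reserve(wide.size());
    for (std::size_t i = 0; i < wide.size(); ++i) {
        unsigned long code = static_cast<unsigned long>(wide[i]);
        narrow.push_back(code < 128 ? static_cast<char>(code) : '?');
    }
    return narrow;
}
```

Widening is lossless while narrowing may not be, which is one reason it pays to keep conversions at the boundary between domains rather than scattering them through the code.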
CString defines a constructor and an operator= that each take an LPCTSTR argument. In particular, you can pass an LPCTSTR into a function taking a CString. CString also provides an operator LPCTSTR and so you can also pass a CString to a function expecting an LPCTSTR. CString has a method AllocSysString that produces a BSTR from its contents. Finally, CString can take an LPCWSTR (const wchar_t*) as an argument to either a constructor or to operator=.
The basic_string<T> class has constructor and operator= methods which take a const T* argument. However, you cannot pass a basic_string<T> to a function expecting a const T* because basic_string<> exposes its character string via a method called c_str() rather than via a type conversion operator.
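A short sketch of both directions follows. The function legacy_length is a made-up stand-in for any older API that accepts only a null-terminated string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for an older API that only understands null-terminated strings.
std::size_t legacy_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}

// Going from basic_string to const char* requires an explicit c_str() call;
// writing legacy_length(s) would not compile.
std::size_t length_of(const std::string& s) {
    return legacy_length(s.c_str());
}

// Going the other way is implicit: operator= accepts a const char*.
std::string from_c_string(const char* s) {
    std::string result;
    result = s;
    return result;
}
```

The asymmetry is deliberate: the conversion that might leave a dangling pointer behind is the one that has to be spelled out.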
CComBSTR has both a constructor and an operator= which take a BSTR argument, as well as a type conversion operator for BSTR. Thus CComBSTR has roughly the same relationship with BSTR that CString has with LPCTSTR.
_bstr_t has constructor and operator= overloads that take either ANSI or wide character strings. Also, it supports type conversion operators to both kinds of strings. As noted earlier, the type conversion operator for wide character strings actually returns a BSTR. Therefore you can pass or receive a _bstr_t as an ANSI string or as a wide character string that is in fact a BSTR.
Developers these days have to contend with at least two character sets — ANSI and Unicode — and at least two memory representations — null terminated and count prepended. This alone makes multiple string types inevitable. Macros and wrapper classes simplify the situation in some circumstances, but they also add their own complexity.
The Visual C++ developer stands in the intersection of a number of programming idioms — traditional C, Standard C++, MFC, COM, ATL — each with its own favorite string representation. You cannot avoid working with numerous string representations and converting from one to another. It is important to understand how each works and the implicit conventions for working with each type.
1. Bruce McKinney, Strings the OLE Way, available on MSDN.
2. Jim Beveridge, CString: Part of the plumbing behind MFC and a model for efficient design, Visual C++ Developers Journal, Volume 1 Number 4.
    #include <afxpriv.h>   // for USES_CONVERSION
    #include <atlbase.h>   // for CComBSTR
    #include <comdef.h>    // for _bstr_t
    #include <string>      // for basic_string
    using namespace std;

    const int MAXLEN = 64; // buffer size assumed for this example

    void ConvertStrings()  // wrapper function added so the snippet compiles
    {
        CString cs;
        BSTR bstr;
        WCHAR wsz[MAXLEN];
        CComBSTR cbstr;
        char sz[MAXLEN];
        TCHAR tsz[MAXLEN];
        basic_string<char> bs;
        _bstr_t _bstr;
        USES_CONVERSION;

        // Convert CString to various types
        cs = "String1";
        bstr = cs.AllocSysString();    // BSTR
        _tcscpy(tsz, (LPCTSTR)cs);     // LPCTSTR
        strcpy(sz, T2A(tsz));          // ANSI string
        wcscpy(wsz, bstr);             // wide string
        cbstr = bstr;                  // CComBSTR via operator=
        bs = sz;                       // STL string
        _bstr = (LPCTSTR) cs;          // _bstr_t via either
                                       // operator=(const char*) or
                                       // operator=(const wchar_t*)
                                       // if _UNICODE is defined.
        ::SysFreeString(bstr);

        // Convert BSTR to various types
        bstr = ::SysAllocString(L"String2");
        cs = bstr;                     // CString via its LPCWSTR ctor
        wcscpy(wsz, bstr);             // Unicode
        cbstr = bstr;                  // CComBSTR via operator=(LPOLESTR)
        strcpy(sz, W2A(bstr));         // ANSI string
        bs = sz;                       // STL string operator=(const T*)
        _tcscpy(tsz, W2T(bstr));       // LPTSTR
        _bstr = bstr;                  // _bstr_t via operator=(const wchar_t*)
        ::SysFreeString(bstr);
    }