Unraveling Strings in Visual C++
This is an article I wrote sometime in the late 1990s about working with strings coming from COM, MFC, Win32, the C++ Standard Library, etc. It does not include anything about .NET since it was written before .NET came out.
Outline
Why so many strings?
When is each string appropriate?
Using various strings
Conversions between types
Conclusion
References
Sample code
Introduction
In the good old days, a string was a pointer to a null-terminated array of
chars. Period. Now a string might be a char*, wchar_t*,
LPCTSTR, BSTR, CString,
basic_string, _bstr_t,
CComBSTR, etc. Unfortunately, you cannot simply choose your
favorite string representation and ignore the rest. Each representation has its
own domain and it is frequently necessary to convert between types when crossing
domain boundaries. Why are there so many kinds of strings? When is each one
appropriate? How do you carry out common tasks with each? How do they relate to
each other?
Why so many strings?
Strings differ in three important ways: character set, memory layout, and conventions for use. The most obvious and simplest of these is character set. To keep things focused, we will limit ourselves to ANSI and Unicode. ANSI strings, the kind everybody grew up on, are arrays of single-byte characters. By far most of the world's strings are ANSI strings. So why bother with Unicode?
Eight bits are plenty to represent all the characters of ordinary English text. But with the slightest thought to international software, it quickly becomes apparent that eight bits are woefully inadequate. Unicode, with 16 bits per character, has enough possibilities to cover all the world's major languages with enough characters left over to even throw in a few ancient languages for good measure.
Windows NT was built from the beginning to use Unicode strings exclusively internally, though you may write applications for NT that use either ANSI or Unicode. Windows CE only understands Unicode. OLE is built around Unicode strings. But Windows 3.x “doesn't know Unicode from a dress code, and never will [1].” The same is true of Windows 9x. The ANSI vs. Unicode strings are much like the English vs. metric measurement units: most everyone agrees the latter is the way to go, but the former has a tremendous installed base. In both situations, we will probably have to live with two standards and all the concomitant complications for a very long time.
C++ has two built in character types: char and wchar_t.
Most commonly a char is an ANSI character and a
wchar_t is a Unicode character. This is not always the case, but to
simplify things a bit, we will make this assumption. Wide character strings,
i.e. strings of wchar_ts, are null-terminated arrays of
characters, directly analogous with ordinary strings. The terminating null
character in this case is a
wchar_t null. Incidentally, the default settings for the Visual C++
debugger are to not display Unicode characters. There is a check box under Tools
/ Options / Debug labeled "Display unicode strings" which turns this on.
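To make the parallel between the two kinds of null-terminated string concrete, here is a minimal sketch using only the Standard C library. The function names are invented for illustration. The byte count depends on sizeof(wchar_t), which is 2 with Visual C++ but 4 with many other compilers.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length in characters of a narrow string; the terminating '\0' is not counted.
std::size_t narrow_length(const char* s) { return std::strlen(s); }

// Length in characters of a wide string; the terminating L'\0' is not counted.
std::size_t wide_length(const wchar_t* s) { return std::wcslen(s); }

// Storage in bytes for a wide string, including its wchar_t terminator.
std::size_t wide_storage_bytes(const wchar_t* s) {
    return (std::wcslen(s) + 1) * sizeof(wchar_t);
}
```

Both length functions report 4 for "John" and L"John". Only the storage per character differs.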
In order to be able to use the same source code for ANSI and Unicode builds,
Windows introduced the
TCHAR data type. TCHAR is simply a macro that expands
to char in ANSI builds (i.e.
_UNICODE is not defined) and wchar_t in Unicode builds
(_UNICODE is defined). There are various string types based on the
TCHAR macro, such as LPCTSTR (long pointer to a
constant
TCHAR string).
Microsoft also introduced a number of macros and typedefs with "OLE"
in the name such as
OLECHAR, LPOLESTR, etc. These are vestiges of an
automatic ANSI / Unicode conversion scheme that Microsoft used prior to MFC 4.0
and has since abandoned. However, the names live on for legacy support and for
Macintosh development. For example, if you look for help on
CLSIDFromProgID you'll find that its first argument is an
LPCOLESTR. For Win32 development, "OLE" corresponds to Unicode. For
Win16 and for the Macintosh, the symbol
OLE2ANSI is defined and "OLE" corresponds to ANSI. For
example, in Win32 development, an
OLECHAR is simply a wchar_t and an
LPOLESTR is a wide character string.
Microsoft's character and string types may be summarized as follows. A
character name has the form
XCHAR and a string name has the form LPCXSTR where
C is optional and X is either
T, OLE, W, or empty. The
C indicates a string type is constant, and the X has
the following meanings:
X | Meaning
T | Expands to wchar_t if _UNICODE is defined, else expands to char
OLE | Expands to char if OLE2ANSI is defined, else expands to wchar_t
W | wchar_t
(empty) | char
MFC introduced the CString class as a wrapper around the
LPCTSTR data type which provides methods for common tasks such as
memory allocation and substring searches. A
CString can be used in most circumstances where you would use an
LPCTSTR.
The Standard C++ library provides a parameterized string class
basic_string<T> where
T is most often a char or
wchar_t. The Standard library provides the typedefs
string and wstring respectively for these common
cases.
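As a small illustration of what "parameterized" buys you, the sketch below writes one routine against basic_string<CharT> and uses it with both typedefs. The function count_char is a made-up example, not part of the Standard library.

```cpp
#include <cstddef>
#include <string>

// Counts how often `wanted` occurs in `text`. The same source serves
// std::string (CharT = char) and std::wstring (CharT = wchar_t).
template <typename CharT>
std::size_t count_char(const std::basic_string<CharT>& text, CharT wanted) {
    std::size_t hits = 0;
    for (typename std::basic_string<CharT>::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == wanted) {
            ++hits;
        }
    }
    return hits;
}
```

Calling count_char(std::string("banana"), 'a') instantiates the char version, while count_char(std::wstring(L"banana"), L'n') instantiates the wchar_t version.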
The real confusion in string types comes when we introduce BSTRs.
A BSTR differs from a common string in that it always uses Unicode,
regardless of compiler switches. However, it also has a different layout in
memory. Furthermore, there are different conventions for using BSTRs
than for using simple null-terminated strings, whether of the ANSI or Unicode
variety, and these conventions are seldom codified.
A BSTR is a null-terminated Unicode string, but with a byte
count (not character count!) prepended. An advantage of a byte-count prefix is
that a BSTR can contain internal nulls, whereas an ordinary string
may not. One unusual aspect of the BSTR is that the byte count is
not in the 0th entry of the array the BSTR points to. Instead, the
byte count is stored in the four bytes preceding the memory the pointer
ostensibly points to. (MFC's
CString uses a similar trick so that passing a CString
involves no more overhead than passing a pointer [2].
This causes no problems for developers, however, because the implementation is
thoroughly encapsulated.)
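The layout is easier to see in code. What follows is only a portable model of the memory arrangement, assuming the four-byte length prefix that Microsoft documents for BSTR. The names model_char, model_alloc, model_byte_length and model_free are hypothetical, and char16_t merely stands in for the 16-bit character. Real BSTRs must always be created and destroyed with the system's own allocation functions, never with code like this.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical 16-bit character type standing in for the BSTR character.
typedef char16_t model_char;

// Allocates [4-byte byte count][characters][terminator] and returns a
// pointer to the first character, i.e. just past the prefix.
model_char* model_alloc(const model_char* src, std::uint32_t char_count) {
    const std::uint32_t byte_count =
        static_cast<std::uint32_t>(char_count * sizeof(model_char));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + byte_count + sizeof(model_char)));
    if (!block) {
        return nullptr;
    }
    std::memcpy(block, &byte_count, sizeof byte_count);  // the hidden prefix
    model_char* text =
        reinterpret_cast<model_char*>(block + sizeof(std::uint32_t));
    std::memcpy(text, src, byte_count);  // payload, internal nulls included
    text[char_count] = 0;                // terminator after the payload
    return text;
}

// Reads the byte count stored immediately before the characters.
std::uint32_t model_byte_length(const model_char* text) {
    if (!text) {
        return 0;  // this model treats a null pointer as an empty string
    }
    std::uint32_t byte_count = 0;
    std::memcpy(&byte_count,
                reinterpret_cast<const unsigned char*>(text) - sizeof(std::uint32_t),
                sizeof byte_count);
    return byte_count;
}

// Frees the whole block, which starts before the pointer handed out.
void model_free(model_char* text) {
    if (text) {
        std::free(reinterpret_cast<unsigned char*>(text) - sizeof(std::uint32_t));
    }
}
```

Because the length lives in the prefix, a five-character payload such as u"ab\0cd" keeps all five characters even though a null sits in the middle.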
OLE standardized on the BSTR partially because of OLE's desire
to be language-independent. Many languages use counted arrays rather than
using a special symbol to mark the end of a string. The BSTR
compromises by requiring both a count and a terminating character. (Note that in
the context of string and character types, OLE refers only to character widths.
In particular, an
LPOLESTR is simply a wide character string and not a BSTR.
Despite the name, an
LPOLESTR is not OLE's favorite string!)
BSTRs are an unnatural imposition on C++. However, they are
unavoidable because OLE is built around BSTRs and not native C++
strings. In order to make BSTR manipulation easier from C++,
several wrapper classes have been created. One is ATL's CComBSTR
class, which handles basic memory management and a few basic operations for
strings.
There is another BSTR wrapper which one must use in order to
take advantage of the native COM support in the Visual C++ compiler. When you
use the
#import directive, the compiler creates wrapper functions for the methods
on the imported COM interfaces. BSTR arguments and return values
are wrapped as
_bstr_t. (However, BSTR* arguments are left alone so
the
_bstr_t doesn't entirely eliminate the need to manipulate
BSTRs.) The design goals of _bstr_t are different from those
of
CComBSTR. The former provides more convenience functions, and is
implemented with reference counting to avoid unnecessary memory copying.
When is each string appropriate?
MFC class methods often take LPCTSTR arguments. The obvious choice of a
class wrapper for strings in MFC development is
CString, especially because a CString can be used in
most situations where an
LPCTSTR is specified. The advantage of the CString
class is that it provides many useful methods for memory management and string
manipulation. One disadvantage is that
CString carries with it a little bit more overhead than a raw
LPCTSTR. Also, if CString is the only MFC class in a
project, it still requires linking to and redistributing the MFC DLLs.
The Standard C++ basic_string<> has the advantage of being
portable to non-Windows platforms. Also, you may explicitly decide between
char and
wchar_t strings on an individual basis rather than deciding once
and for all based on a compiler switch as with
TCHAR strings. And you could use basic_string<TCHAR>
to maintain the ANSI vs. Unicode flexibility of
CString. Like CString, basic_string<>
does define a large number of convenient string manipulation functions. A design
goal of this string class was to make the class sufficiently convenient and
efficient that it would seldom be necessary to use null terminated strings and
the C library manipulation functions.
In OLE interfaces, there is no choice but to use BSTR or one of
its wrapper classes. Ordinarily, a C++ developer would use a BSTR
only as a delivery vehicle to a COM interface; string manipulation is more
easily done via library methods and wrapper classes native to C++. Because a
BSTR may contain any characters, even internal nulls, it is
possible to wrap arbitrary data in a BSTR to pass to another
function (for example, to avoid having to write custom marshalling code for a
COM interface).
ATL's CComBSTR is a light-weight wrapper class with adequate
functionality for common tasks, and is a natural choice for ATL development. The
_bstr_t class is more complicated, but cannot be avoided when using
the
#import directive and the wrapper functions it creates.
Using various strings
The L symbol before a character literal denotes that the
character is a wide character, as in
wchar_t ch = L'a';
This designation is seldom necessary for ordinary English characters: the first
128 characters of Unicode are the same as ASCII, and the first 256 match
ISO 8859-1 (Latin-1). Had we left out the
L in front of the first quote mark, the char 'a' would
have been promoted to the
wchar_t with the same value.
The L symbol is also used to distinguish wchar_t
strings from ordinary strings, as in
const wchar_t* wsz = L"Unicode String";
Windows provides the macros _T() and _TEXT() which
do nothing unless
_UNICODE is defined, in which case they each expand to
L. Hence _T("John") reverts to simply
"John" in ANSI builds and expands to L"John" in
Unicode builds. There is an analogous
OLESTR macro that disappears if OLE2ANSI is defined
and expands to
L otherwise.
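The mechanism behind these macros can be sketched in a few lines. This is a simplified imitation, not the contents of the real Windows headers, and the names tchar_sketch, T_SKETCH and tstring_sketch are invented so as not to collide with the real ones.

```cpp
#include <string>

#ifdef _UNICODE
typedef wchar_t tchar_sketch;            // Unicode build: wide characters
#define T_SKETCH(literal) L##literal     // paste an L in front of the literal
#else
typedef char tchar_sketch;               // ANSI build: narrow characters
#define T_SKETCH(literal) literal        // leave the literal untouched
#endif

// A string class that follows the build setting, in the spirit of
// basic_string<TCHAR>.
typedef std::basic_string<tchar_sketch> tstring_sketch;

// The same source line yields "John" or L"John" depending on _UNICODE.
tstring_sketch default_name() { return T_SKETCH("John"); }
```

Compiling the same file with and without _UNICODE defined switches every character and literal together, which is the whole point of the scheme.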
For most of the Standard C library string routines, you can change the
initial "str" in the name to "wcs" to determine the
name of the corresponding routine for wide character strings. For example,
wcscpy is the wide character counterpart of the venerable
strcpy. Also, you may change "str" to "_tcs"
to come up with the name of a corresponding
TCHAR routine.
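The naming rule can be seen side by side in the following sketch, which writes the same small helper twice: once with strchr and once with wcschr. The helper after_colon is a made-up example. The TCHAR spelling, _tcschr, is Windows-specific and is not shown.

```cpp
#include <cstring>
#include <cwchar>

// Narrow version: returns the text after the first ':' or the whole
// string if there is no ':'.
const char* after_colon(const char* text) {
    const char* colon = std::strchr(text, ':');
    return colon ? colon + 1 : text;
}

// Wide version: identical logic, with "str" replaced by "wcs" and the
// character literal given an L prefix.
const wchar_t* after_colon(const wchar_t* text) {
    const wchar_t* colon = std::wcschr(text, L':');
    return colon ? colon + 1 : text;
}
```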
Because a BSTR allocates memory before the location it nominally
points to, a whole different API is necessary for working with BSTRs.
For example, you may not initialize a BSTR by saying
BSTR b = L"A String";
This will correctly initialize b to a wide character array, but
the byte count is uninitialized and so
b is not a valid BSTR. The proper way to initialize
b is
BSTR b = ::SysAllocString(L"A String");
Before b goes out of scope, its memory needs to be released by
calling
::SysFreeString. Note that because the memory for BSTRs
is allocated via a system call rather than the C++ new operator, memory leaks
due to failing to call ::SysFreeString will not show up in the
Visual C++ debugger. (NuMega's BoundsChecker will catch these leaks, however.)
Two other handy functions for working with BSTRs are
::SysAllocStringLen and
::SysStringLen. The former allocates a string to a given length and
the latter is analogous to the Standard C
strlen.
The subtlest difficulty with using BSTRs is that they have conventions for
their use that differ from those of other strings. For example, a NULL
BSTR is treated as a valid, zero-length string unlike an ordinary
string. The only place I have seen anyone attempt to codify these conventions is
in Bruce McKinney's excellent article cited earlier. The reader is advised to
study the section of his article entitled "The Eight Rules of BSTR."
The CComBSTR wrapper is straightforward to use. It does not have
a lot of methods, but the ones it has are simple and self-explanatory. The
_bstr_t class is more complex. It has more convenience functions.
It reference-counts memory to avoid unnecessary copying and throws exceptions.
CComBSTR does no reference counting and does not throw exceptions.
Conversions between types
Developers frequently work in the intersection of two or more cultures. You may be writing an OLE application using Standard C++, MFC and ATL. But OLE, Standard C++, MFC, and ATL represent four different cultures, each with its own preferred string type or string wrapper class. Therefore an important part of working with strings is knowing how to convert between the various manifestations.
Because a BSTR is null-terminated and because its pointer points
past the byte count, a BSTR "is a" (in an inheritance sort of
sense) wide character string. You may pass a BSTR to a function
expecting a
wchar_t*. (Of course, if the BSTR being passed in
contains any internal nulls, data after the first null will be lost in the
interpretation as a wide character string.) However, this interchangeability
with wide character strings is tricky. You cannot always look at a variable and
tell whether a
wchar_t* is merely a null-terminated wide character string or
whether in fact it is a BSTR. The source code for
_bstr_t is a good example. There is an operator
_bstr_t::operator const wchar_t* which implies only that you may pass a
_bstr_t to a function expecting a const wchar_t*.
However, reading the implementation code, you discover that the
const wchar_t* in question is actually a full-fledged BSTR.
As McKinney points out, "a BSTR is a BSTR by
convention" and not a built-in type that the compiler can check.
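The data-loss caveat can be demonstrated without any COM code. The sketch below uses std::wstring as a stand-in for a counted string, on the assumption that it behaves like a BSTR in the one respect that matters here: it knows its own length and so may hold internal nulls. The function names are invented for illustration.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// A counted string holding five characters, one of them a null.
std::wstring with_internal_null() { return std::wstring(L"ab\0cd", 5); }

// Counted view: the length the owner of the string knows about.
std::size_t counted_length(const std::wstring& s) { return s.size(); }

// Null-terminated view: what a function taking const wchar_t* would see.
std::size_t terminated_length(const std::wstring& s) {
    return std::wcslen(s.c_str());
}
```

The counted view reports five characters. The null-terminated view stops at the first null and reports two, so everything after it is invisible to the callee.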
The header file atlconv.h contains a whopping 28 conversion
macros for converting between the various non-class string types covered in this
article. These macros have the form
X2Y. The source type X can be
A, T, W, or OLE for ANSI,
TCHAR, wchar_t or OLE respectively. The destination
type
Y can be any of these types or additionally BSTR.
Except for BSTR, the destination types may optionally have a
C in front of their type to indicate const. For example,
A2CW takes an ANSI string and returns a constant wide character
string. Of course, there are no macros for converting a type to itself. Note
that there is no need for a BSTR source type because you may use a
BSTR as a wide character string. Some of these macros require that
you first call the macro
USES_CONVERSION while others do not. Note that unlike most macros,
USES_CONVERSION must be followed by a semicolon. Except when
converting to a BSTR, these macros allocate memory on the stack;
BSTRs are always allocated by a system call and must be freed using
::SysFreeString.
CString defines a constructor and an operator= that
each take an
LPCTSTR argument. In particular, you can pass an LPCTSTR
into a function taking a
CString. CString also provides an operator
LPCTSTR and so you can also pass a CString to a
function expecting an
LPCTSTR. CString has a method
AllocSysString that produces a BSTR from its contents.
Finally,
CString can take an LPCWSTR (a
const wchar_t*) as an argument to either a constructor or to
operator=.
The basic_string<T> class has constructor and
operator= methods which take a const T* argument.
However, you cannot pass a
basic_string<T> directly to a function expecting a const T*
because
basic_string<> exposes its character string via a member function
called
c_str() rather than via a type conversion operator.
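A short sketch of both directions, using std::string (that is, basic_string<char>). The function legacy_length is a hypothetical stand-in for any C-style API that accepts only a null-terminated string.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C-style function that only understands const char*.
std::size_t legacy_length(const char* text) { return std::strlen(text); }

// From basic_string to const T*: the call to c_str() must be explicit.
std::size_t measure(const std::string& s) {
    // legacy_length(s) would not compile; there is no conversion operator.
    return legacy_length(s.c_str());
}

// From const T* to basic_string: assignment works directly.
std::string from_c(const char* raw) {
    std::string s;
    s = raw;  // operator=(const char*)
    return s;
}
```

Note the asymmetry: measure("John") compiles because the const char* converts implicitly through the constructor, but nothing converts implicitly in the other direction.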
CComBSTR has both a constructor and an operator=
which take a BSTR argument, as well as a type conversion operator
for BSTR. Thus
CComBSTR has roughly the same relationship with BSTR
as
CString has with LPCTSTR.
The class _bstr_t has constructor and operator=
overloads that take either ANSI or wide character strings. Also, it supports
type conversion operators to both kinds of strings. As noted earlier, the type
conversion operator for wide character strings actually returns a BSTR.
Therefore you can pass or receive a
_bstr_t as an ANSI string or a BSTR.
Conclusion
Developers these days have to contend with at least two character sets — ANSI and Unicode — and at least two memory representations — null terminated and count prepended. This alone makes multiple string types inevitable. Macros and wrapper classes simplify the situation in some circumstances, but they also add their own complexity.
The Visual C++ developer stands in the intersection of a number of programming idioms — traditional C, Standard C++, MFC, COM, ATL — each with its own favorite string representation. You cannot avoid working with numerous string representations and converting from one to another. It is important to understand how each works and the implicit conventions for working with each type.
References
1. Bruce McKinney,
Strings the OLE Way, available on MSDN.
2. Jim Beveridge, CString: Part of the plumbing behind MFC and a model for
efficient design, Visual C++ Developers
Journal, Volume 1 Number 4.
Sample code
#include <afxpriv.h> // for USES_CONVERSION
#include <atlbase.h> // for CComBSTR
#include <comdef.h> // for _bstr_t
#include <string> // for std::basic_string
CString cs;
BSTR bstr;
WCHAR wsz[81];
CComBSTR cbstr;
char sz[81];
TCHAR tsz[81];
std::basic_string<char> bs;
_bstr_t _bstr;
USES_CONVERSION;
// Convert CString to various types
cs = "String1";
bstr = cs.AllocSysString(); // BSTR
_tcscpy(tsz, (LPCTSTR)cs); // LPCTSTR
strcpy(sz, T2A(tsz)); // ANSI string
wcscpy(wsz, bstr); // wide string
cbstr = bstr; // CComBSTR via operator=(LPOLESTR)
bs = sz; // STL string
_bstr = (LPCTSTR) cs; // _bstr_t via either
// operator=(const char*) or
// operator=(const wchar_t*)
// if _UNICODE is defined.
::SysFreeString(bstr);
// Convert BSTR to various types
bstr = ::SysAllocString(L"String2");
cs = bstr; // CString via its LPCWSTR ctor
wcscpy(wsz, bstr); // Unicode
cbstr = bstr; // CComBSTR via operator=(LPOLESTR)
strcpy(sz, W2A(bstr)); // ANSI string
bs = sz; // STL string operator=(const T*)
_tcscpy(tsz, W2T(bstr)); // LPTSTR
_bstr = bstr; // _bstr_t via operator=(const wchar_t*)
::SysFreeString(bstr);