Unraveling Strings in Visual C++
This is an article I wrote sometime in the late 1990's about working with strings coming from COM, MFC, Win32, the C++ Standard Library, etc. It does not include anything about .NET since it was written before .NET came out.
Outline
Why so many strings?
When is each string appropriate?
Using various strings
Conversions between types
Conclusion
References
Sample code
Introduction
In the good old days, a string was a pointer to a null-terminated array of chars. Period. Now a string might be a char*, wchar_t*, LPCTSTR, BSTR, CString, basic_string, _bstr_t, CComBSTR, etc. Unfortunately, you cannot simply choose your favorite string representation and ignore the rest. Each representation has its own domain and it is frequently necessary to convert between types when crossing domain boundaries. Why are there so many kinds of strings? When is each one appropriate? How do you carry out common tasks with each? How do they relate to each other?
Why so many strings?
Strings differ in three important ways: character set, memory layout, and conventions for use. The most obvious and simplest of these is character set. To keep things focused, we will limit ourselves to ANSI and Unicode. ANSI strings, the kind everybody grew up on, are arrays of single-byte characters. By far most of the world's strings are ANSI strings. So why bother with Unicode?
Eight bits are plenty to represent all the characters of ordinary English text. But with the slightest thought to international software, it quickly becomes apparent that eight bits are woefully inadequate. Unicode, which Windows represents with 16 bits per character, has enough possibilities to cover all the world's major languages with enough characters left over to even throw in a few ancient languages for good measure.
Windows NT was built from the beginning to use Unicode strings exclusively internally, though you may write applications for NT that use either ANSI or Unicode. Windows CE only understands Unicode. OLE is built around Unicode strings. But Windows 3.x “doesn't know Unicode from a dress code, and never will [1].” The same is true of Windows 9x. ANSI versus Unicode is much like English versus metric units of measurement: most everyone agrees the latter is the way to go, but the former has a tremendous installed base. In both situations, we will probably have to live with two standards and all the concomitant complications for a very long time.
C++ has two built-in character types: char and wchar_t. Most commonly a char is an ANSI character and a wchar_t is a Unicode character. This is not always the case, but to simplify things a bit, we will make this assumption. Wide character strings, i.e. strings of wchar_t characters, are null-terminated arrays of characters, directly analogous with ordinary strings. The terminating null character in this case is a wchar_t null. Incidentally, the default settings for the Visual C++ debugger are to not display Unicode characters. There is a check box under Tools / Options / Debug labeled "Display unicode strings" which turns this on.
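The analogy between ordinary and wide strings can be seen in a few lines of standard C++. This is a minimal sketch, not Windows-specific code; the function names narrow_length, wide_length, and wide_storage_bytes are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length in characters of an ordinary string (terminated by a char null).
std::size_t narrow_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}

// Length in characters of a wide string (terminated by a wchar_t null).
std::size_t wide_length(const wchar_t* s) {
    assert(s != nullptr);
    return std::wcslen(s);
}

// Bytes of storage a wide string occupies, including its terminating null.
// The result depends on sizeof(wchar_t), which varies between platforms.
std::size_t wide_storage_bytes(const wchar_t* s) {
    return (wide_length(s) + 1) * sizeof(wchar_t);
}
```

Both length functions report 3 for "abc" and L"abc"; only the storage size differs, because each wide character occupies sizeof(wchar_t) bytes rather than one.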
In order to be able to use the same source code for ANSI and Unicode builds, Windows introduced the TCHAR data type. TCHAR is simply a macro that expands to char in ANSI builds (i.e. _UNICODE is not defined) and wchar_t in Unicode builds (_UNICODE is defined). There are various string types based on the TCHAR macro, such as LPCTSTR (long pointer to a constant TCHAR string).
Microsoft also introduced a number of macros and typedefs with "OLE" in the name such as OLECHAR, LPOLESTR, etc. These are vestiges of an automatic ANSI / Unicode conversion scheme that Microsoft used prior to MFC 4.0 and has since abandoned. However, the names live on for legacy support and for Macintosh development. For example, if you look for help on CLSIDFromProgID you'll find that its first argument is an LPCOLESTR. For Win32 development, "OLE" corresponds to Unicode. For Win16 and for the Macintosh, the symbol OLE2ANSI is defined and "OLE" corresponds to ANSI. For example, in Win32 development, an OLECHAR is simply a wchar_t and an LPOLESTR is a wide character string.
Microsoft's character and string types may be summarized as follows. A character name has the form XCHAR and a string name has the form LPCXSTR where C is optional and X is either T, OLE, W, or empty. The C indicates a string type is constant, and the X has the following meanings:
X | Meaning
T | Expands to wchar_t if _UNICODE is defined, else expands to char
OLE | Expands to char if OLE2ANSI is defined, else expands to wchar_t
W | wchar_t
(empty) | char
MFC introduced the CString class as a wrapper around the LPCTSTR data type which provides methods for common tasks such as memory allocation and substring searches. A CString can be used in most circumstances where you would use an LPCTSTR.
The Standard C++ library provides a parameterized string class basic_string<T> where T is most often a char or wchar_t. The Standard library provides the typedefs string and wstring respectively for these common cases.
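Because string and wstring are instances of the same template, a function written against basic_string<T> serves both. The following is a small sketch of that idea; the function name repeat is hypothetical.

```cpp
#include <cassert>
#include <string>

// Concatenates `times` copies of s. Works for any basic_string<T>,
// so the same source serves std::string and std::wstring.
template <typename T>
std::basic_string<T> repeat(const std::basic_string<T>& s, int times) {
    assert(times >= 0);
    std::basic_string<T> out;
    for (int i = 0; i < times; ++i) {
        out += s;
    }
    return out;
}
```

Calling repeat(std::string("ab"), 3) produces "ababab", and repeat(std::wstring(L"xy"), 2) produces L"xyxy"; the compiler deduces T from the argument in each case.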
The real confusion in string types comes when we introduce BSTRs. A BSTR differs from a common string in that it always uses Unicode, regardless of compiler switches. It also has a different layout in memory. Furthermore, there are different conventions for using BSTRs than for using simple null-terminated strings, whether of the ANSI or Unicode variety, and these conventions are seldom codified.
A BSTR is a null-terminated Unicode string, but with a byte count (not character count!) prepended. An advantage of a byte-count prefix is that a BSTR can contain internal nulls, whereas an ordinary string may not. One unusual aspect of the BSTR is that the byte count is not in the 0th entry of the array the BSTR points to. Instead, the byte count is stored in the four bytes preceding the memory the pointer ostensibly points to. (MFC's CString uses a similar trick so that passing a CString involves no more overhead than passing a pointer [2]. This causes no problems for developers, however, because the implementation is thoroughly encapsulated.)
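To make the layout concrete, here is a portable toy model of a length-prefixed string. It is only a sketch of the idea and is not the real BSTR implementation or the Win32 API: the names ToyChar, ToyBstr, toy_alloc, toy_byte_len, and toy_free are all hypothetical, char16_t stands in for a 16-bit wide character, and the four-byte prefix is assumed to match the width of the real BSTR byte count.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef char16_t ToyChar;  // stand-in for a 16-bit wide character
typedef ToyChar* ToyBstr;  // points at the characters, not at the prefix

// Allocates [4-byte byte count][len characters][terminating null] and
// returns a pointer to the first character, just past the prefix.
ToyBstr toy_alloc(const ToyChar* src, std::uint32_t len) {
    assert(src != nullptr);
    const std::uint32_t bytes = len * sizeof(ToyChar);
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + bytes + sizeof(ToyChar)));
    if (block == nullptr) return nullptr;
    std::memcpy(block, &bytes, sizeof bytes);  // write the byte-count prefix
    ToyBstr s = reinterpret_cast<ToyBstr>(block + sizeof(std::uint32_t));
    std::memcpy(s, src, bytes);                // copy characters, nulls included
    s[len] = 0;                                // terminating null
    return s;
}

// Reads the byte count stored immediately before the characters.
std::uint32_t toy_byte_len(ToyBstr s) {
    if (s == nullptr) return 0;  // a null pointer counts as an empty string
    std::uint32_t bytes = 0;
    std::memcpy(&bytes,
                reinterpret_cast<unsigned char*>(s) - sizeof bytes,
                sizeof bytes);
    return bytes;
}

// Frees the whole block, starting at the hidden prefix.
void toy_free(ToyBstr s) {
    if (s != nullptr) {
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
    }
}
```

Allocating the five characters of u"ab\0cd" yields a byte count of 10, and the characters after the embedded null remain reachable through the pointer. This also shows why such a string must be freed by a matching routine: the pointer handed to the caller is not the start of the allocated block.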
OLE standardized on the BSTR partially because of OLE's desire to be language-independent. Many languages use counted arrays rather than a special symbol to mark the end of a string. The BSTR compromises by requiring both a count and a terminating character. (Note that in the context of string and character types, OLE refers only to character widths. In particular, an LPOLESTR is simply a wide character string and not a BSTR. Despite the name, an LPOLESTR is not OLE's favorite string!)
BSTRs are an unnatural imposition on C++. However, they are unavoidable because OLE is built around BSTRs and not native C++ strings. In order to make BSTR manipulation easier from C++, several wrapper classes have been created. One is ATL's CComBSTR class, which handles basic memory management and a few basic operations for strings.
There is another BSTR wrapper which one must use in order to take advantage of the native COM support in the Visual C++ compiler. When you use the #import directive, the compiler creates wrapper functions for the methods on the imported COM interfaces. BSTR arguments and return values are wrapped as _bstr_t. (However, BSTR* arguments are left alone so the _bstr_t doesn't entirely eliminate the need to manipulate BSTRs.) The design goals of _bstr_t are different from those of CComBSTR. The former provides more convenience functions, and is implemented with reference counting to avoid unnecessary memory copying.
When is each string appropriate?
MFC class methods often take LPCTSTR arguments. The choice of a class wrapper for strings in MFC development is obviously CString, especially because a CString can be used in most situations where an LPCTSTR is specified. The advantage of the CString class is that it provides many useful methods for memory management and string manipulation. One disadvantage is that CString carries with it a little bit more overhead than a raw LPCTSTR. Also, even if CString is the only MFC class in a project, it still requires linking to and redistributing the MFC DLLs.
The Standard C++ basic_string<> has the advantage of being portable to non-Windows platforms. Also, you may explicitly decide between char and wchar_t strings on an individual basis rather than deciding once and for all based on a compiler switch as with TCHAR strings. And you could use basic_string<TCHAR> to maintain the ANSI vs. Unicode flexibility of CString. Like CString, basic_string<> defines a large number of convenient string manipulation functions. A design goal of this string class was to make the class sufficiently convenient and efficient that it would seldom be necessary to use null-terminated strings and the C library manipulation functions.
In OLE interfaces, there is no choice but to use BSTR or one of its wrapper classes. Ordinarily, a C++ developer would use a BSTR only as a delivery vehicle to a COM interface; string manipulation is more easily done via library methods and wrapper classes native to C++. Because a BSTR may contain any characters, even internal nulls, it is possible to wrap arbitrary data in a BSTR to pass to another function (for example, to avoid having to write custom marshalling code for a COM interface).
ATL's CComBSTR is a light-weight wrapper class with adequate functionality for common tasks, and is a natural choice for ATL development. The _bstr_t class is more complicated, but cannot be avoided when using the #import directive and the wrapper functions it creates.
Using various strings
The L symbol before a character literal denotes that the character is a wide character, as in
wchar_t ch = L'a';
This designation is seldom necessary for ordinary English text: the first 128 characters of Unicode are the same as ASCII. Had we left out the L in front of the first quote mark, the char 'a' would have been converted to the wchar_t with the same value.
The L symbol is also used to distinguish wchar_t strings from ordinary strings, as in
wchar_t wsz[] = L"Unicode String";
Windows provides the macros _T() and _TEXT() which do nothing unless _UNICODE is defined, in which case they each expand to L. Hence _T("John") reverts to simply "John" in ANSI builds and expands to L"John" in Unicode builds. There is an analogous OLESTR macro that disappears if OLE2ANSI is defined and expands to L otherwise.
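The mechanism behind TCHAR and _T() is ordinary conditional compilation plus token pasting. The sketch below imitates it portably so it can be compiled anywhere; MY_TCHAR, MY_T, MY_LPCTSTR, my_tcslen, and greeting_length are hypothetical stand-ins, not the definitions from the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for TCHAR, _T() and _tcslen. One switch selects
// the character type, the literal prefix, and the library routine together.
#ifdef _UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T(x) L##x  // pastes an L onto the literal: "John" becomes L"John"
inline std::size_t my_tcslen(const MY_TCHAR* s) { return std::wcslen(s); }
#else
typedef char MY_TCHAR;
#define MY_T(x) x     // leaves the literal untouched
inline std::size_t my_tcslen(const MY_TCHAR* s) { return std::strlen(s); }
#endif

typedef const MY_TCHAR* MY_LPCTSTR;  // analogous in spirit to LPCTSTR

// This function compiles unchanged in both kinds of build.
std::size_t greeting_length() {
    MY_LPCTSTR greeting = MY_T("John");
    assert(greeting != nullptr);
    return my_tcslen(greeting);
}
```

Whether or not _UNICODE is defined, greeting_length() returns 4; only the underlying character type and library call change.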
For most of the Standard C library string routines, you can change the initial "str" in the name to "wcs" to determine the name of the corresponding routine for wide character strings. For example, wcscpy is the wide character counterpart of the venerable strcpy. Also, you may change "str" to "_tcs" to come up with the name of a corresponding TCHAR routine.
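The naming rule is easiest to see side by side. In this sketch the two overloads of a hypothetical function full_name have identical logic; the wide version differs only in its character type, the L prefix on the literal, and "wcs" in place of "str".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Narrow version: writes "first last" into out using the str* routines.
// cap is the capacity of out in characters, including the terminating null.
void full_name(char* out, std::size_t cap, const char* first, const char* last) {
    assert(std::strlen(first) + std::strlen(last) + 2 <= cap);
    std::strcpy(out, first);
    std::strcat(out, " ");
    std::strcat(out, last);
}

// Wide version: the same steps with the wcs* counterparts.
void full_name(wchar_t* out, std::size_t cap, const wchar_t* first, const wchar_t* last) {
    assert(std::wcslen(first) + std::wcslen(last) + 2 <= cap);
    std::wcscpy(out, first);
    std::wcscat(out, L" ");
    std::wcscat(out, last);
}
```

The results can be compared with strcmp and wcscmp respectively, which follow the same naming pattern.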
Because a BSTR allocates memory before the location it nominally points to, a whole different API is necessary for working with BSTRs. For example, you may not initialize a BSTR by saying
BSTR b = L"A String";
This will correctly initialize b to a wide character array, but the byte count is uninitialized and so b is not a valid BSTR. The proper way to initialize b is
BSTR b = ::SysAllocString(L"A String");
Before b goes out of scope, its memory needs to be released by calling ::SysFreeString. Note that because the memory for BSTRs is allocated via a system call rather than the C++ new operator, memory leaks due to failing to call ::SysFreeString will not show up in the Visual C++ debugger. (NuMega's BoundsChecker will catch these leaks, however.) Two other handy functions for working with BSTRs are ::SysAllocStringLen and ::SysStringLen. The former allocates a string of a given length and the latter is analogous to the Standard C strlen.
The subtlest difficulty with using BSTRs is that they have conventions for their use that differ from those of other strings. For example, a NULL BSTR is treated as a valid, zero-length string unlike an ordinary string. The only place I have seen anyone attempt to codify these conventions is in Bruce McKinney's excellent article cited earlier. The reader is advised to study the section of his article entitled "The Eight Rules of BSTR."
The CComBSTR wrapper is straightforward to use. It does not have a lot of methods, but the ones it has are simple and self-explanatory. The _bstr_t class is more complex. It has more convenience functions. It reference-counts memory to avoid unnecessary copying and throws exceptions. CComBSTR does no reference counting and does not throw exceptions.
Conversions between types
Developers frequently work in the intersection of two or more cultures. You may be writing an OLE application using Standard C++, MFC and ATL. But OLE, Standard C++, MFC, and ATL represent four different cultures, each with its own preferred string type or string wrapper class. Therefore an important part of working with strings is knowing how to convert between the various manifestations.
Because a BSTR is null-terminated and because its pointer points past the byte count, a BSTR "is a" (in an inheritance sort of sense) wide character string. You may pass a BSTR to a function expecting a wchar_t*. (Of course, if the BSTR being passed in contains any internal nulls, data after the first null will be lost in the interpretation as a wide character string.) However, this interchangeability with wide character strings is tricky. You cannot always look at a variable and tell whether a wchar_t* is merely a null-terminated wide character string or whether in fact it is a BSTR. The source code for _bstr_t is a good example. There is an operator _bstr_t::operator const wchar_t* which implies only that you may pass a _bstr_t to a function expecting a const wchar_t*. However, reading the implementation code, you discover that the const wchar_t* in question is actually a full-fledged BSTR. As McKinney points out, "a BSTR is a BSTR by convention" and not a built-in type that the compiler can check.
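The loss of data after an internal null can be demonstrated without any COM code. In the sketch below a std::wstring plays the role of the counted string, which is an analogy only since std::wstring is not a BSTR, and the function name characters_lost is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Number of characters that disappear when a counted wide string is read
// through its pointer as an ordinary null-terminated wide string.
std::size_t characters_lost(const std::wstring& counted) {
    // wcslen stops at the first null, embedded or terminating.
    const std::size_t visible = std::wcslen(counted.c_str());
    assert(visible <= counted.size());
    return counted.size() - visible;
}
```

For a five-character string built as std::wstring(L"ab\0cd", 5), a function that receives only the pointer sees two characters, so three are lost. A string with no internal nulls loses nothing.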
The header file atlconv.h contains a whopping 28 conversion macros for converting between the various non-class string types covered in this article. These macros have the form X2Y. The source type X can be A, T, W, or OLE for ANSI, TCHAR, wchar_t or OLE respectively. The destination type Y can be any of these types or additionally BSTR. Except for BSTR, the destination types may optionally have a C in front of their type to indicate const. For example, A2CW takes an ANSI string and returns a constant wide character string. Of course, there are no macros for converting a type to itself. Note that there is no need for a BSTR source type because you may use a BSTR as a wide character string. Some of these macros require that you first call the macro USES_CONVERSION while others do not. Note that unlike most macros, USES_CONVERSION must be followed by a semicolon. Except when converting to a BSTR, these macros allocate memory on the stack; BSTRs are always allocated by a system call and must be freed using ::SysFreeString.
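For readers without ATL at hand, the same kind of ANSI / wide conversion can be sketched with the Standard C library routines mbstowcs and wcstombs. This is a swapped-in technique, not how the atlconv.h macros are implemented: it allocates on the heap rather than the stack, it depends on the current C locale, and the function names narrow_to_wide and wide_to_narrow are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Narrow to wide, in the spirit of A2W. Returns an empty string if the
// input is not valid in the current C locale.
std::wstring narrow_to_wide(const std::string& s) {
    // A wide string never has more characters than the narrow one has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    assert(n < buf.size());
    return std::wstring(buf.data(), n);
}

// Wide to narrow, in the spirit of W2A. Returns an empty string if a
// character cannot be represented in the current C locale.
std::string wide_to_narrow(const std::wstring& w) {
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buf(w.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), w.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    assert(n < buf.size());
    return std::string(buf.data(), n);
}
```

With plain ASCII text the round trip is lossless in any locale. Returning a std::string or std::wstring by value avoids the lifetime question that stack allocation raises, at the cost of a heap allocation per call.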
CString defines a constructor and an operator= that each take an LPCTSTR argument. In particular, you can pass an LPCTSTR into a function taking a CString. CString also provides an operator LPCTSTR and so you can also pass a CString to a function expecting an LPCTSTR. CString has a method AllocSysString that produces a BSTR from its contents. Finally, CString can take an LPCWSTR (a const wchar_t*) as an argument to either a constructor or to operator=.
The basic_string<T> class has constructor and operator= methods which take a const T* argument. However, you cannot pass a basic_string<T> directly to a function expecting a const T*, because basic_string<> exposes its character string through a member function called c_str() rather than through a type conversion operator.
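The asymmetry between the two directions is shown in the short sketch below. The function names legacy_length, length_via_c_str, and length_via_string are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// A C-style function that expects a null-terminated string.
std::size_t legacy_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}

// There is no implicit conversion from std::string to const char*, so
// legacy_length(s) would not compile; c_str() makes the step explicit.
std::size_t length_via_c_str(const std::string& s) {
    return legacy_length(s.c_str());
}

// The other direction is implicit: callers may pass a const char* here
// and the converting constructor builds a temporary std::string.
std::size_t length_via_string(const std::string& s) {
    return s.size();
}
```

A call such as length_via_string("String1") compiles as written, while passing a std::string to legacy_length requires the explicit c_str() call.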
CComBSTR has both a constructor and an operator= which take a BSTR argument, as well as a type conversion operator for BSTR. Thus CComBSTR has roughly the same relationship with BSTR as CString has with LPCTSTR.
The class _bstr_t has constructor and operator= overloads that take either ANSI or wide character strings. Also, it supports type conversion operators to both kinds of strings. As noted earlier, the type conversion operator for wide character strings actually returns a BSTR. Therefore you can pass or receive a _bstr_t as an ANSI string or a BSTR.
Conclusion
Developers these days have to contend with at least two character sets — ANSI and Unicode — and at least two memory representations — null terminated and count prepended. This alone makes multiple string types inevitable. Macros and wrapper classes simplify the situation in some circumstances, but they also add their own complexity.
The Visual C++ developer stands in the intersection of a number of programming idioms — traditional C, Standard C++, MFC, COM, ATL — each with its own favorite string representation. You cannot avoid working with numerous string representations and converting from one to another. It is important to understand how each works and the implicit conventions for working with each type.
References
1. Bruce McKinney, Strings the OLE Way, available on MSDN.
2. Jim Beveridge, CString: Part of the plumbing behind MFC and a model for efficient design, Visual C++ Developers Journal, Volume 1 Number 4.
Sample code
#include <afxpriv.h>  // for USES_CONVERSION
#include <atlbase.h>  // for CComBSTR
#include <comdef.h>   // for _bstr_t
#include <string>     // for basic_string

CString cs;
BSTR bstr;
WCHAR wsz[81];
CComBSTR cbstr;
char sz[81];
TCHAR tsz[81];
std::basic_string<char> bs;
_bstr_t _bstr;
USES_CONVERSION;

// Convert CString to various types
cs = "String1";
bstr = cs.AllocSysString();    // BSTR
_tcscpy(tsz, (LPCTSTR)cs);     // LPCTSTR
strcpy(sz, T2A(tsz));          // ANSI string
wcscpy(wsz, bstr);             // wide string
cbstr = bstr;                  // CComBSTR
bs = sz;                       // STL string
_bstr = (LPCTSTR) cs;          // _bstr_t via either operator=(const char*) or
                               // operator=(const wchar_t*) if _UNICODE is defined.
::SysFreeString(bstr);

// Convert BSTR to various types
bstr = ::SysAllocString(L"String2");
cs = bstr;                     // CString via its LPCWSTR ctor
wcscpy(wsz, bstr);             // Unicode
cbstr = bstr;                  // CComBSTR via operator=(LPOLESTR)
strcpy(sz, W2A(bstr));         // ANSI string
bs = sz;                       // STL string operator=(const T*)
_tcscpy(tsz, W2T(bstr));       // LPTSTR
_bstr = bstr;                  // _bstr_t via operator=(const wchar_t*)
::SysFreeString(bstr);