This is an article I wrote sometime in the late 1990’s about working with strings coming from COM, MFC, Win32, the C++ Standard Library, etc. It does not include anything about .NET since it was written before .NET came out.
Outline
Why so many strings?
When is each string appropriate?
Using various strings
Conversions between types
Conclusion
References
Sample code
Introduction
In the good old days, a string was a pointer to a null-terminated array of chars. Period. Now a string might be a char*, wchar_t*, LPCTSTR, BSTR, CString, basic_string, _bstr_t, CComBSTR, etc. Unfortunately, you cannot simply choose your favorite string representation and ignore the rest. Each representation has its own domain and it is frequently necessary to convert between types when crossing domain boundaries. Why are there so many kinds of strings? When is each one appropriate? How do you carry out common tasks with each? How do they relate to each other?
Why so many strings?
Strings differ in three important ways: character set, memory layout, and conventions for use. The most obvious and simplest of these is character set. To keep things focused, we will limit ourselves to ANSI and Unicode. ANSI strings, the kind everybody grew up on, are arrays of single-byte characters. By far most of the world’s strings are ANSI strings. So why bother with Unicode?
Eight bits are plenty to represent all the characters of ordinary English text. But with the slightest thought to international software, it quickly becomes apparent that eight bits are woefully inadequate. Unicode, with 16 bits per character, has enough possibilities to cover all the world’s major languages with enough characters left over to even throw in a few ancient languages for good measure.
Windows NT was built from the beginning to use Unicode strings exclusively internally, though you may write applications for NT that use either ANSI or Unicode. Windows CE only understands Unicode. OLE is built around Unicode strings. But Windows 3.x “doesn’t know Unicode from a dress code, and never will [1].” The same is true of Windows 9x. The ANSI vs. Unicode divide is much like that between English and metric measurement units: almost everyone agrees the latter is the way to go, but the former has a tremendous installed base. In both situations, we will probably have to live with two standards and all the concomitant complications for a very long time.
C++ has two built-in character types: char and wchar_t. Most commonly a char is an ANSI character and a wchar_t is a Unicode character. This is not always the case, but to simplify things a bit, we will make this assumption. Wide character strings, i.e. strings of wchar_t characters, are null-terminated arrays of characters, directly analogous to ordinary strings. The terminating null character in this case is a wchar_t null. Incidentally, by default the Visual C++ debugger does not display Unicode strings. There is a check box under Tools / Options / Debug labeled “Display unicode strings” which turns this on.
In order to be able to use the same source code for ANSI and Unicode builds, Windows introduced the TCHAR data type. TCHAR is simply a macro that expands to char in ANSI builds (i.e. _UNICODE is not defined) and wchar_t in Unicode builds (_UNICODE is defined). There are various string types based on the TCHAR macro, such as LPCTSTR (long pointer to a constant TCHAR string).
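The mechanism can be imitated in a few lines of portable C++. The sketch below is not the Windows header: DEMO_UNICODE, demo_tchar, DEMO_T, and demo_tcslen are hypothetical stand-ins, invented here so the example compiles on any platform. DEMO_T plays the role of a macro that adds the L prefix to string literals in wide builds only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for _UNICODE; define it to get a wide build.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;                 // wide build
#define DEMO_T(s) L##s                      // prefix literals with L
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::wcslen(s); }
#else
typedef char demo_tchar;                    // narrow build
#define DEMO_T(s) s                         // leave literals alone
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::strlen(s); }
#endif

// The same source line compiles unchanged in both builds.
inline std::size_t demo_name_length() {
    const demo_tchar* name = DEMO_T("John");
    assert(name != nullptr);                // sanity check on the literal
    return demo_tcslen(name);
}
```

Only the definitions behind the names change between builds; the calling code does not, which is the whole point of the scheme.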
Microsoft also introduced a number of macros and typedefs with “OLE” in the name such as OLECHAR, LPOLESTR, etc. These are vestiges of an automatic ANSI / Unicode conversion scheme that Microsoft used prior to MFC 4.0 and has since abandoned. However, the names live on for legacy support and for Macintosh development. For example, if you look for help on CLSIDFromProgID you’ll find that its first argument is an LPCOLESTR. For Win32 development, “OLE” corresponds to Unicode. For Win16 and for the Macintosh, the symbol OLE2ANSI is defined and “OLE” corresponds to ANSI. For example, in Win32 development, an OLECHAR is simply a wchar_t and an LPOLESTR is a wide character string.
Microsoft’s character and string types may be summarized as follows. A character name has the form XCHAR and a string name has the form LPCXSTR where C is optional and X is either T, OLE, W, or empty. The C indicates a string type is constant, and the X has the following meanings:
T | Expands to wchar_t if _UNICODE is defined, else expands to char |
OLE | Expands to char if OLE2ANSI is defined, else expands to wchar_t |
W | wchar_t |
(empty) | char |
MFC introduced the CString class as a wrapper around the LPCTSTR data type. It provides methods for common tasks such as memory allocation and substring searches. A CString can be used in most circumstances where you would use an LPCTSTR.
The Standard C++ library provides a parameterized string class basic_string<T> where T is most often char or wchar_t. The Standard library provides the typedefs string and wstring respectively for these common cases.
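A short sketch makes the relationship concrete. The static_assert lines check that string and wstring really are the two common instantiations; count_char is a hypothetical helper, written here only to show that one template can serve both character types.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// string and wstring are the library's names for the two common instantiations.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "wstring is basic_string<wchar_t>");

// Hypothetical helper: counts occurrences of one character.
// It works for any character type T, so it accepts string and wstring alike.
template <typename T>
std::size_t count_char(const std::basic_string<T>& text, T wanted) {
    std::size_t n = 0;
    for (T c : text) {
        if (c == wanted) ++n;
    }
    return n;
}
```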
The real confusion in string types comes when we introduce BSTRs. A BSTR differs from a common string in that it always uses Unicode, regardless of compiler switches. However, it also has a different layout in memory. Furthermore, there are different conventions for using BSTRs than for using simple null-terminated strings, whether of the ANSI or Unicode variety, and these conventions are seldom codified.
A BSTR is a null-terminated Unicode string, but with a byte count (not character count!) prepended. An advantage of a byte-count prefix is that a BSTR can contain internal nulls, whereas an ordinary string may not. One unusual aspect of the BSTR is that the byte count is not in the 0th entry of the array the BSTR points to. Instead, the byte count is stored in the four bytes preceding the memory the pointer ostensibly points to. (MFC’s CString uses a similar trick so that passing a CString involves no more overhead than passing a pointer [2]. This causes no problems for developers, however, because the implementation is thoroughly encapsulated.)
OLE standardized on the BSTR partially because of OLE’s desire to be language-independent. Many languages use counted arrays rather than a special symbol to mark the end of a string. The BSTR compromises by requiring both a count and a terminating character. (Note that in the context of string and character types, OLE refers only to character widths. In particular, an LPOLESTR is simply a wide character string and not a BSTR. Despite the name, an LPOLESTR is not OLE’s favorite string!)
BSTRs are an unnatural imposition on C++. However, they are unavoidable because OLE is built around BSTRs and not native C++ strings. In order to make BSTR manipulation easier from C++, several wrapper classes have been created. One is ATL’s CComBSTR class, which handles basic memory management and a few basic operations for strings.
There is another BSTR wrapper which one must use in order to take advantage of the native COM support in the Visual C++ compiler. When you use the #import directive, the compiler creates wrapper functions for the methods on the imported COM interfaces. BSTR arguments and return values are wrapped as _bstr_t. (However, BSTR* arguments are left alone so the _bstr_t doesn’t entirely eliminate the need to manipulate BSTRs.) The design goals of _bstr_t are different from those of CComBSTR. The former provides more convenience functions, and is implemented with reference counting to avoid unnecessary memory copying.
When is each string appropriate?
MFC class methods often take LPCTSTR arguments. The obvious choice of a class wrapper for strings in MFC development is CString, especially because a CString can be used in most situations where an LPCTSTR is specified. The advantage of the CString class is that it provides many useful methods for memory management and string manipulation. One disadvantage is that CString carries with it a little bit more overhead than a raw LPCTSTR. Also, even if CString is the only MFC class in a project, it still requires linking to and redistributing the MFC DLLs.
The Standard C++ basic_string<> has the advantage of being portable to non-Windows platforms. Also, you may explicitly decide between char and wchar_t strings on an individual basis rather than deciding once and for all based on a compiler switch as with TCHAR strings. And you could use basic_string<TCHAR> to maintain the ANSI vs. Unicode flexibility of CString. Like CString, basic_string<> defines a large number of convenient string manipulation functions. A design goal of this string class was to make the class sufficiently convenient and efficient that it would seldom be necessary to use null-terminated strings and the C library manipulation functions.
In OLE interfaces, there is no choice but to use BSTR or one of its wrapper classes. Ordinarily, a C++ developer would use a BSTR only as a delivery vehicle to a COM interface; string manipulation is more easily done via library methods and wrapper classes native to C++. Because a BSTR may contain any characters, even internal nulls, it is possible to wrap arbitrary data in a BSTR to pass to another function (for example, to avoid having to write custom marshalling code for a COM interface).
ATL’s CComBSTR is a light-weight wrapper class with adequate functionality for common tasks, and is a natural choice for ATL development. The _bstr_t class is more complicated, but cannot be avoided when using the #import directive and the wrapper functions it creates.
Using various strings
The L symbol before a character literal denotes that the character is a wide character, as in
wchar_t ch = L'a';
This designation is seldom necessary for plain English characters: the first 128 characters of Unicode are the same as ASCII. Had we left out the L in front of the first quote mark, the char 'a' would have been promoted to the wchar_t with the same value.
The L symbol is also used to distinguish wchar_t strings from ordinary strings, as in
wchar_t wsz[] = L"Unicode String";
Windows provides the macros _T() and _TEXT() which do nothing unless _UNICODE is defined, in which case they each expand to L. Hence _T("John") reverts to simply "John" in ANSI builds and expands to L"John" in Unicode builds. There is an analogous OLESTR macro that disappears if OLE2ANSI is defined and expands to L otherwise.
For most of the Standard C library string routines, you can change the initial “str” in the name to “wcs” to determine the name of the corresponding routine for wide character strings. For example, wcscpy is the wide character counterpart of the venerable strcpy. Also, you may change “str” to “_tcs” to come up with the name of a corresponding TCHAR routine.
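The naming rule is easiest to appreciate side by side. In this small sketch, copy_narrow and copy_wide are hypothetical helper names. The library calls inside them are the standard strcpy / strlen pair and their wcs counterparts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Narrow version: strcpy and strlen.
inline std::size_t copy_narrow(char* dest, std::size_t capacity, const char* src) {
    assert(std::strlen(src) < capacity);    // room for the terminating null
    std::strcpy(dest, src);
    return std::strlen(dest);
}

// Wide version: the same calls with "str" replaced by "wcs".
inline std::size_t copy_wide(wchar_t* dest, std::size_t capacity, const wchar_t* src) {
    assert(std::wcslen(src) < capacity);    // capacity is in characters, not bytes
    std::wcscpy(dest, src);
    return std::wcslen(dest);
}
```

Note that lengths and capacities in the wide routines count characters, not bytes.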
Because a BSTR allocates memory before the location it nominally points to, a whole different API is necessary for working with BSTRs. For example, you may not initialize a BSTR by saying
BSTR b = L"A String";
This will correctly initialize b to a wide character array, but the byte count is uninitialized and so b is not a valid BSTR. The proper way to initialize b is
BSTR b = ::SysAllocString(L"A String");
Before b goes out of scope, its memory needs to be released by calling ::SysFreeString. Note that because the memory for BSTRs is allocated via a system call rather than the C++ new operator, memory leaks due to failing to call ::SysFreeString will not show up in the Visual C++ debugger. (NuMega’s BoundsChecker will catch these leaks, however.)
Two other handy functions for working with BSTRs are ::SysAllocStringLen and ::SysStringLen. The former allocates a string to a given length and the latter is analogous to the Standard C strlen.
The subtlest difficulty with using BSTRs is that they have conventions for their use that differ from those of other strings. For example, a NULL BSTR is treated as a valid, zero-length string unlike an ordinary string. The only place I have seen anyone attempt to codify these conventions is in Bruce McKinney’s excellent article cited earlier. The reader is advised to study the section of his article entitled “The Eight Rules of BSTR.”
The CComBSTR wrapper is straightforward to use. It does not have a lot of methods, but the ones it has are simple and self-explanatory. The _bstr_t class is more complex. It has more convenience functions. It reference-counts memory to avoid unnecessary copying and throws exceptions. CComBSTR does no reference counting and does not throw exceptions.
Conversions between types
Developers frequently work in the intersection of two or more cultures. You may be writing an OLE application using Standard C++, MFC and ATL. But OLE, Standard C++, MFC, and ATL represent four different cultures, each with its own preferred string type or string wrapper class. Therefore an important part of working with strings is knowing how to convert between the various manifestations.
Because a BSTR is null-terminated and because its pointer points past the byte count, a BSTR “is a” (in an inheritance sort of sense) wide character string. You may pass a BSTR to a function expecting a wchar_t*. (Of course, if the BSTR being passed in contains any internal nulls, data after the first null will be lost in the interpretation as a wide character string.) However, this interchangeability with wide character strings is tricky. You cannot always look at a variable and tell whether a wchar_t* is merely a null-terminated wide character string or whether in fact it is a BSTR. The source code for _bstr_t is a good example. There is an operator _bstr_t::operator const wchar_t* which implies only that you may pass a _bstr_t to a function expecting a const wchar_t*. However, reading the implementation code, you discover that the const wchar_t* in question is actually a full-fledged BSTR. As McKinney points out, “a BSTR is a BSTR by convention” and not a built-in type that the compiler can check.
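The loss of data after an internal null can be demonstrated without any COM machinery. In the sketch below, std::wstring stands in for a counted string, since it also records its length separately from the characters. This is an analogy only, not a BSTR, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// A function that only understands null-terminated wide strings.
inline std::size_t terminated_length(const wchar_t* s) {
    assert(s != nullptr);
    return std::wcslen(s);                  // stops at the first null
}

// Build a counted string of five characters with a null in the middle.
inline std::wstring make_counted() {
    return std::wstring(L"ab\0cd", 5);      // explicit count keeps all five
}
```

The counted view reports five characters, while the null-terminated view of the very same buffer sees only the first two.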
The header file atlconv.h contains a whopping 28 conversion macros for converting between the various non-class string types covered in this article. These macros have the form X2Y. The source type X can be A, T, W, or OLE for ANSI, TCHAR, wchar_t or OLE respectively. The destination type Y can be any of these types or additionally BSTR. Except for BSTR, the destination types may optionally have a C in front of their type to indicate const. For example, A2CW takes an ANSI string and returns a constant wide character string. Of course, there are no macros for converting a type to itself. Note that there is no need for a BSTR source type because you may use a BSTR as a wide character string. Some of these macros require that you first call the macro USES_CONVERSION while others do not. Note that unlike most macros, USES_CONVERSION must be followed by a semicolon. Except when converting to a BSTR, these macros allocate memory on the stack; BSTRs are always allocated by a system call and must be freed using ::SysFreeString.
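To show what such a conversion amounts to, here is a deliberately simplified, portable sketch. It is not how the ATL macros are implemented: widen_ascii and narrow_ascii are hypothetical helpers that handle 7-bit ASCII only, return standard library strings rather than stack buffers, and ignore code pages entirely.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widen a 7-bit ASCII string one character at a time.
inline std::wstring widen_ascii(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        assert(static_cast<unsigned char>(c) < 128);    // ASCII only
        wide.push_back(static_cast<wchar_t>(c));
    }
    return wide;
}

// Hypothetical helper: the reverse direction, again ASCII only.
inline std::string narrow_ascii(const std::wstring& wide) {
    std::string narrow;
    narrow.reserve(wide.size());
    for (wchar_t c : wide) {
        assert(static_cast<unsigned long>(c) < 128);    // ASCII only
        narrow.push_back(static_cast<char>(c));
    }
    return narrow;
}
```

Anything beyond plain English text needs a real code-page-aware conversion, which is exactly the work the macros hide.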
CString defines a constructor and an operator= that each take an LPCTSTR argument. In particular, you can pass an LPCTSTR into a function taking a CString. CString also provides an operator LPCTSTR and so you can also pass a CString to a function expecting an LPCTSTR. CString has a method AllocSysString that produces a BSTR from its contents. Finally, CString can take an LPCWSTR (a const wchar_t*) as an argument to either a constructor or to operator=.
The basic_string<T> class has constructor and operator= methods which take a const T* argument. However, you cannot pass a basic_string<T> directly to a function expecting a const T* because basic_string<> exposes its character string via a member function called c_str() rather than via a type conversion operator.
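The asymmetry looks like this in practice. In the sketch, legacy_length is a hypothetical stand-in for any C-style function that takes a null-terminated pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// A C-style function that expects a null-terminated const char*.
inline std::size_t legacy_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}

inline std::size_t length_via_c_str(const std::string& text) {
    // legacy_length(text) would not compile: there is no implicit conversion.
    return legacy_length(text.c_str());
}

// The other direction is implicit: a const char* converts to std::string.
inline std::string from_c_string(const char* s) {
    std::string text = s;                   // constructor taking const T*
    return text;
}
```

Requiring the explicit c_str() call keeps a pointer into the string's internal buffer from being handed out by accident.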
CComBSTR has both a constructor and an operator= which take a BSTR argument, as well as a type conversion operator for BSTR. Thus CComBSTR has roughly the same relationship with BSTR as CString has with LPCTSTR.
The class _bstr_t has constructor and operator= overloads that take either ANSI or wide character strings. Also, it supports type conversion operators to both kinds of strings. As noted earlier, the type conversion operator for wide character strings actually returns a BSTR. Therefore you can pass or receive a _bstr_t as an ANSI string or a BSTR.
Conclusion
Developers these days have to contend with at least two character sets — ANSI and Unicode — and at least two memory representations — null terminated and count prepended. This alone makes multiple string types inevitable. Macros and wrapper classes simplify the situation in some circumstances, but they also add their own complexity.
The Visual C++ developer stands in the intersection of a number of programming idioms — traditional C, Standard C++, MFC, COM, ATL — each with its own favorite string representation. You cannot avoid working with numerous string representations and converting from one to another. It is important to understand how each works and the implicit conventions for working with each type.
References
1. Bruce McKinney, Strings the OLE Way, available on MSDN.
2. Jim Beveridge, CString: Part of the plumbing behind MFC and a model for efficient design, Visual C++ Developers Journal, Volume 1 Number 4.
Sample code
#include <afxpriv.h> // for USES_CONVERSION
#include <comdef.h>  // for _bstr_t

CString cs;
BSTR bstr;
WCHAR wsz[81];
CComBSTR cbstr;
char sz[81];
TCHAR tsz[81];
basic_string<char> bs;
_bstr_t _bstr;
USES_CONVERSION;

// Convert CString to various types
cs = "String1";
bstr = cs.AllocSysString();   // BSTR
_tcscpy(tsz, (LPCTSTR)cs);    // LPCTSTR
strcpy(sz, T2A(tsz));         // ANSI string
wcscpy(wsz, bstr);            // wide string
cbstr = bstr;                 // CComBSTR via
bs = sz;                      // STL string
_bstr = (LPCTSTR) cs;         // _bstr_t via either
                              // operator=(const char*) or
                              // operator=(const wchar_t*)
                              // if _UNICODE is defined.
::SysFreeString(bstr);

// Convert BSTR to various types
bstr = ::SysAllocString(L"String2");
cs = bstr;                    // CString via its LPCWSTR ctor
wcscpy(wsz, bstr);            // Unicode
cbstr = bstr;                 // CComBSTR via operator=(LPOLESTR)
strcpy(sz, W2A(bstr));        // ANSI string
bs = sz;                      // STL string operator=(const T*)
_tcscpy(tsz, W2T(bstr));      // LPTSTR
_bstr = bstr;                 // _bstr_t via operator=(const wchar_t*)
::SysFreeString(bstr);