What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR
Original article: http://www.codeproject.com/Articles/76252/What-are-TCHAR-WCHAR-LPSTR-LPWSTR-LPCTSTR-etc
Many C++ Windows programmers get confused over what bizarre identifiers like TCHAR and LPCTSTR are.
In this article, I will do my best to clear the fog.
In general, a character can be represented in 1 byte or 2 bytes. Let's say a 1-byte character is an ANSI character; all English characters can be represented in this encoding. And let's say a 2-byte character is Unicode, which can represent all the languages in the world.
The Visual C++ compiler supports char and wchar_t as native data types for ANSI and Unicode characters, respectively. There is a more precise definition of Unicode, but for the purposes of this article, assume it is the two-byte character that Windows uses for multi-language support.
What if you want your C/C++ code to be independent of the character encoding/mode used?
Suggestion: Use generic data types and names to represent characters and strings.
For example, instead of replacing:
    char cResponse;      // 'Y' or 'N'
    char sUsername[64];  // str* functions
with
    wchar_t cResponse;      // 'Y' or 'N'
    wchar_t sUsername[64];  // wcs* functions
in order to support multiple languages (i.e., Unicode) in your application, you can simply write the code in a more generic manner:
    #include <TCHAR.H>    // Implicit or explicit include
    TCHAR cResponse;      // 'Y' or 'N'
    TCHAR sUsername[64];  // _tcs* functions
The Character Set option on the General page of the project settings determines which character set is used for compilation (General -> Character Set).
This way, when your project is compiled as Unicode, TCHAR translates to wchar_t. If it is compiled as ANSI/MBCS, TCHAR translates to char. You are free to use char and wchar_t directly, and the project settings will not affect any direct use of these keywords.
TCHAR is defined as:
    #ifdef _UNICODE
    typedef wchar_t TCHAR;
    #else
    typedef char TCHAR;
    #endif
The macro _UNICODE is defined when you set Character Set to "Use Unicode Character Set", and therefore TCHAR would mean wchar_t.
When Character Set is set to "Use Multi-Byte Character Set", TCHAR would mean char.
Likewise, to support multiple character sets (and possibly multiple languages) from a single code base, use the generic-text functions (macros). Instead of using strcpy, strlen, strcat (including the secure versions suffixed with _s) or wcscpy, wcslen, wcscat (including the secure versions), you should use the _tcscpy, _tcslen, _tcscat functions.
As you know, strlen is prototyped as:
    size_t strlen(const char*);
And wcslen is prototyped as:
    size_t wcslen(const wchar_t*);
You should instead use _tcslen, which is logically prototyped as:
    size_t _tcslen(const TCHAR*);
WC is for Wide Character. Therefore, wcs stands for wide-character string.
In the same way, _tcs means _T Character String. And you know _T may be char or wchar_t, logically.
But, in reality, _tcslen (and the other _tcs functions) are actually not functions, but macros. They are defined simply as:
    #ifdef _UNICODE
    #define _tcslen wcslen
    #else
    #define _tcslen strlen
    #endif
You should refer to TCHAR.H to look up more macro definitions like this.
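The mapping described above can be sketched in portable C++. This is only an illustration: MY_UNICODE, my_tchar, MY_T and my_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, _T and _tcslen, not the actual contents of TCHAR.H.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for _UNICODE / TCHAR / _T / _tcslen (assumed names).
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s
#define my_tcslen std::wcslen
#else
typedef char my_tchar;
#define MY_T(s) s
#define my_tcslen std::strlen
#endif

// Counts the characters (not bytes) in a generic-text string.
std::size_t name_length(const my_tchar* name)
{
    assert(name != nullptr);
    return my_tcslen(name);  // the preprocessor picks strlen or wcslen
}
```

Calling name_length(MY_T("Saturn")) gives the same character count whether or not MY_UNICODE is defined; only the underlying function changes.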
You might ask why they are defined as macros, and not implemented as functions instead. The reason is simple: a library or DLL may export only a single function with a given name and prototype (ignore the overloading concept of C++). For instance, when you export a function as:
    void _TPrintChar(char);
how is the client supposed to call it as:
    void _TPrintChar(wchar_t);
_TPrintChar cannot be magically converted into a function taking a 2-byte character. There have to be two separate functions:
    void PrintCharA(char);    // A = ANSI
    void PrintCharW(wchar_t); // W = Wide character
And a simple macro, as defined below, would hide the difference:
    #ifdef _UNICODE
    #define _TPrintChar PrintCharW
    #else
    #define _TPrintChar PrintCharA
    #endif
The client would simply call it as:
    TCHAR cChar;
    _TPrintChar(cChar);
Note that both TCHAR and _TPrintChar would map to either Unicode or ANSI, and therefore cChar and the argument to the function would be either char or wchar_t.
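As a rough, portable sketch of this A/W pattern: the names CharSizeA, CharSizeW, CharSize, my_tchar and MY_UNICODE below are hypothetical and exist only for this illustration. Each function reports the size of the character it received, so you can see which variant the macro selected.

```cpp
#include <cassert>
#include <cstddef>

// Two separately named functions, as a library would have to export them.
std::size_t CharSizeA(char c)    { return sizeof c; }  // A = ANSI
std::size_t CharSizeW(wchar_t c) { return sizeof c; }  // W = Wide character

// One generic name for the caller (MY_UNICODE is an assumed switch).
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define CharSize CharSizeW
#else
typedef char my_tchar;
#define CharSize CharSizeA
#endif

// Client code is written once against the generic names.
std::size_t size_of_response()
{
    my_tchar cResponse = 'Y';
    std::size_t n = CharSize(cResponse);
    assert(n == sizeof(my_tchar));
    return n;
}
```

The client never names the A or W variant itself; rebuilding with the switch defined would reroute every CharSize call.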
Macros avoid these complications and allow us to use either the ANSI or the Unicode function for characters and strings. Most Windows functions that take a string or a character are implemented this way, and for the programmer's convenience, only one function (a macro!) needs to be used. SetWindowText is one example:
    // WinUser.H
    #ifdef UNICODE
    #define SetWindowText SetWindowTextW
    #else
    #define SetWindowText SetWindowTextA
    #endif // !UNICODE
There are very few functions that do not have macros and are available only with the W or A suffix. One example is ReadDirectoryChangesW, which doesn't have an ANSI equivalent.
You all know that we use double quotation marks to represent strings. A string represented in this manner is an ANSI string, with each character taking 1 byte. Example:
    "This is ANSI String. Each letter takes 1 byte."
The string given above is not Unicode, and would not be suitable for multi-language support. To represent a Unicode string, you need to use the prefix L. An example:
    L"This is Unicode string. Each letter would take 2 bytes, including spaces."
Note the L at the beginning of the string, which makes it a Unicode string. All characters (I repeat, all characters) would take two bytes, including all English letters, spaces, digits, and the null character. Therefore, the size of a Unicode string is always a multiple of 2 bytes. A Unicode string of 7 characters would need 14 bytes, and so on. A Unicode string taking 15 bytes, for example, would not be valid in any context.
In general, the size of a string is a multiple of sizeof(TCHAR) bytes!
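A small sketch of that size rule in standard C++, with bytes_needed as a hypothetical helper name. One caveat when trying it outside Windows: wchar_t is 2 bytes with Visual C++, but 4 bytes on many other platforms, so the code multiplies by sizeof(wchar_t) rather than hard-coding 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Bytes needed to store a narrow string, including the terminating null.
std::size_t bytes_needed(const char* s)
{
    assert(s != nullptr);
    return (std::strlen(s) + 1) * sizeof(char);
}

// Bytes needed to store a wide string, including the terminating null.
std::size_t bytes_needed(const wchar_t* s)
{
    assert(s != nullptr);
    return (std::wcslen(s) + 1) * sizeof(wchar_t);
}
```

For "Saturn" the narrow overload gives 7, and the wide overload gives 7 * sizeof(wchar_t), which is 14 under Visual C++.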
When you need to express a hard-coded string, you can use:
    "ANSI String";                                // ANSI
    L"Unicode String";                            // Unicode
    _T("Either string, depending on compilation"); // ANSI or Unicode
    // or use TEXT macro, if you need more readability
The non-prefixed string is an ANSI string, the L-prefixed string is Unicode, and a string specified in _T or TEXT would be either, depending on compilation. Again, _T and TEXT are nothing but macros, and are defined as:
    // SIMPLIFIED
    #ifdef _UNICODE
    #define _T(c) L##c
    #define TEXT(c) L##c
    #else
    #define _T(c) c
    #define TEXT(c) c
    #endif
The ## symbol is the token pasting operator, which would turn _T("Unicode") into L"Unicode" (where the string passed is the argument to the macro) if _UNICODE is defined. If _UNICODE is not defined, _T("Unicode") would simply mean "Unicode".
The token pasting operator existed even in the C language, and is not specific to VC++ or to character encoding.
Note that these macros can be used for strings as well as characters. _T('R') would turn into L'R' or simply 'R'; the former is a Unicode character, the latter is an ANSI character.
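To see token pasting on its own, here is a minimal sketch. WIDEN and PASTE are hypothetical macros made up for this example (WIDEN behaves like the Unicode branch of the simplified _T above); the point is only that ## glues two tokens together before compilation.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Pastes the L prefix onto a literal token, like the Unicode branch of _T.
#define WIDEN(x) L##x

// The same operator can build an identifier out of two tokens.
#define PASTE(a, b) a##b

// Expands to: int planetCount() { return 8; }
int PASTE(planet, Count)() { return 8; }

std::size_t widened_length()
{
    const wchar_t* name = WIDEN("Saturn");  // becomes L"Saturn"
    const wchar_t initial = WIDEN('S');     // becomes L'S'
    assert(name[0] == initial);
    return std::wcslen(name);
}
```

Because the pasting happens on the token spelled in the source, the macro argument has to be a literal, not a variable name.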
No, you cannot use these macros to convert variables (string or character) into Unicode/non-Unicode text. The following is not valid:
    char c = 'C';
    char str[16] = "CodeProject";
    _T(c);
    _T(str);
The last two lines compile successfully in an ANSI (Multi-Byte) build, since _T(x) would simply be x, and therefore _T(c) and _T(str) would come out to be c and str, respectively. But when you build with the Unicode character set, the code fails to compile:
    error C2065: 'Lc' : undeclared identifier
    error C2065: 'Lstr' : undeclared identifier
I would not like to insult your intelligence by describing why and what those errors are.
There exists a set of conversion routines to convert MBCS to Unicode and vice versa, which I will explain soon.
It is important to note that almost all functions that take a string (or character), primarily in the Windows API, have a generalized prototype in MSDN and elsewhere. The function SetWindowTextA/W, for instance, is documented as:
    BOOL SetWindowText(HWND, const TCHAR*);
But, as you know, SetWindowText is just a macro, and depending on your build settings, it would mean either of the following:
    BOOL SetWindowTextA(HWND, const char*);
    BOOL SetWindowTextW(HWND, const wchar_t*);
Therefore, don't be puzzled if the following call fails to get the address of this function!
    HMODULE hDLLHandle;
    FARPROC pFuncPtr;
    hDLLHandle = LoadLibrary(L"user32.dll");
    pFuncPtr = GetProcAddress(hDLLHandle, "SetWindowText");
    // pFuncPtr will be null, since no function named SetWindowText exists!
From User32.DLL, the two functions SetWindowTextA and SetWindowTextW are exported, not a function with the generalized name.
Interestingly, the .NET Framework is smart enough to locate a function in a DLL by its generalized name:
    [DllImport("user32.dll")]
    extern public static int SetWindowText(IntPtr hWnd, string lpString);
No rocket science, just a bunch of ifs and elses around GetProcAddress!
All of the functions that have ANSI and Unicode versions have their actual implementation only in the Unicode version. That means that when you call SetWindowTextA from your code, passing an ANSI string, it converts the ANSI string to Unicode text and then calls SetWindowTextW. The actual work (setting the window text/title/caption) is performed by the Unicode version only!
Take another example, which retrieves the window text using GetWindowText. You call GetWindowTextA, passing an ANSI buffer as the target buffer. GetWindowTextA would first call GetWindowTextW, probably allocating a Unicode string (a wchar_t array) for it. Then it would convert that Unicode text, for you, into an ANSI string.
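The wrap-and-forward idea can be sketched as follows. CountLettersA and CountLettersW are hypothetical functions invented for this example, and the conversion uses the standard std::mbstowcs (assuming plain ASCII input in the default "C" locale) rather than whatever Windows does internally; real Windows code would typically use MultiByteToWideChar for this step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <vector>

// Hypothetical "real" implementation: only the wide version does the work.
std::size_t CountLettersW(const wchar_t* text)
{
    assert(text != nullptr);
    return std::wcslen(text);
}

// Hypothetical narrow wrapper: widen the input, then forward to the W version.
std::size_t CountLettersA(const char* text)
{
    assert(text != nullptr);
    // One wchar_t per input byte, plus the null, is always enough room.
    std::vector<wchar_t> wide(std::strlen(text) + 1);
    const std::size_t converted =
        std::mbstowcs(wide.data(), text, wide.size());
    if (converted == static_cast<std::size_t>(-1))
        return 0;            // invalid multibyte sequence in the input
    wide.back() = L'\0';     // guarantee termination
    return CountLettersW(wide.data());
}
```

Every call through the A wrapper pays for an extra allocation and a conversion, which is the overhead the article is pointing at.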
This ANSI-to-Unicode (and vice versa) conversion is not limited to GUI functions; it applies to the entire set of Windows API functions that take strings and have two variants. A few examples:
- CreateProcess
- GetUserName
- OpenDesktop
- DeleteFile
- etc.
It is therefore highly recommended to call the Unicode version directly. In turn, this means you should always target Unicode builds, and not ANSI builds just because you have been accustomed to using ANSI strings for years. Yes, you may still save and retrieve ANSI strings, for example in a file, or send them as chat messages in your messenger application. Conversion routines exist for such needs.
Note: There exists another typedef, WCHAR, which is equivalent to wchar_t.
The TCHAR type is for a single character. You can certainly declare an array of TCHAR.
What if you would like to express a character pointer, or a const character pointer? Which one of the following?
    // ANSI characters
    foo_ansi(char*);
    foo_ansi(const char*);
    /*const*/ char* pString;
    // Unicode/wide-string
    foo_uni(WCHAR*);
    foo_uni(const WCHAR*);
    /*const*/ WCHAR* pString;
    // Independent
    foo_char(TCHAR*);
    foo_char(const TCHAR*);
    /*const*/ TCHAR* pString;
After reading about TCHAR, you would definitely select the last one as your choice. There are better alternatives available to represent strings. For that, you just need to include Windows.h. Note: If your project implicitly or explicitly includes Windows.h, you need not include TCHAR.H.
First, revisit the old string functions for better understanding. You know strlen:
    size_t strlen(const char*);
Which may be represented as:
    size_t strlen(LPCSTR);
Where the symbol LPCSTR is typedef'ed as:
    // Simplified
    typedef const char* LPCSTR;
The meaning goes like:
- LP - Long Pointer
- C - Constant
- STR - String
Essentially, LPCSTR would mean (Long) Pointer to a Constant String.
Let's represent strcpy using the new style of type names:
    LPSTR strcpy(LPSTR szTarget, LPCSTR szSource);
The type of szTarget is LPSTR, without the C in the type name. It is defined as:
    typedef char* LPSTR;
Note that szSource is LPCSTR, since the strcpy function will not modify the source buffer, hence the const attribute. The return type is a non-constant string: LPSTR.
Alright, these str-functions are for ANSI string manipulation. But we want routines for 2-byte Unicode strings. For that, the equivalent wide-character functions are provided. For example, to calculate the length of a wide-character (Unicode) string, you would use wcslen:
    size_t nLength;
    nLength = wcslen(L"Unicode");
The prototype of wcslen is:
    size_t wcslen(const wchar_t* szString); // Or WCHAR*
And that can be represented as:
    size_t wcslen(LPCWSTR szString);
Where the symbol LPCWSTR is defined as:
    typedef const WCHAR* LPCWSTR; // const wchar_t*
Which can be broken down as:
- LP - Pointer
- C - Constant
- WSTR - Wide character String
Similarly, the strcpy equivalent for Unicode strings is wcscpy:
    wchar_t* wcscpy(wchar_t* szTarget, const wchar_t* szSource);
Which can be represented as:
    LPWSTR wcscpy(LPWSTR szTarget, LPCWSTR szSource);
Where the target is a non-constant wide string (LPWSTR), and the source is a constant wide string.
There exists a set of equivalent wcs-functions for the str-functions. The str-functions are used for plain ANSI strings, and the wcs-functions are used for Unicode strings.
I have already advised using the native Unicode functions instead of ANSI-only or TCHAR-synthesized functions. The reason was simple: your application should be Unicode only, and you should not even care about code portability for ANSI builds. But for the sake of completeness, I am mentioning these generic mappings.
To calculate the length of a string, you may use the _tcslen function (a macro). In general, it is prototyped as:
    size_t _tcslen(const TCHAR* szString);
Or, as:
    size_t _tcslen(LPCTSTR szString);
Where the type name LPCTSTR can be broken down as:
- LP - Pointer
- C - Constant
- T - TCHAR
- STR - String
Depending on the project settings, LPCTSTR would be mapped to either LPCSTR (ANSI) or LPCWSTR (Unicode).
Note: strlen, wcslen and _tcslen return the number of characters in the string, not the number of bytes.
The generalized string-copy routine _tcscpy is defined as:
    TCHAR* _tcscpy(TCHAR* pTarget, const TCHAR* pSource);
Or, in more generalized form, as:
    LPTSTR _tcscpy(LPTSTR pTarget, LPCTSTR pSource);
You can deduce the meaning of LPTSTR!
Usage Examples
First, some broken code:
    int main()
    {
        TCHAR name[] = "Saturn";
        int nLen; // Or size_t
        nLen = strlen(name);
    }
On an ANSI build, this code compiles successfully, since TCHAR would be char, and hence name would be an array of char. Calling strlen on the name variable would also work flawlessly.
Alright. Let's compile the same code with UNICODE/_UNICODE defined (i.e., "Use Unicode Character Set" in the project settings). Now the compiler reports a set of errors:
- error C2440: 'initializing' : cannot convert from 'const char [7]' to 'TCHAR []'
- error C2664: 'strlen' : cannot convert parameter 1 from 'TCHAR []' to 'const char *'
And programmers start making mistakes by correcting the first error this way:
    TCHAR name[] = (TCHAR*)"Saturn";
Which will not pacify the compiler, since no conversion is possible from TCHAR* to TCHAR[7].
The same error also occurs when a native ANSI string is passed to a Unicode function:
    nLen = wcslen("Saturn");
    // ERROR: cannot convert parameter 1 from 'const char [7]' to 'const wchar_t *'
Unfortunately (or fortunately), this error can be incorrectly corrected by a simple C-style typecast:
    nLen = wcslen((const wchar_t*)"Saturn");
And you'd think you've attained one more experience level in pointers! You are wrong: the code would give an incorrect result, and in most cases would simply cause an Access Violation. Typecasting this way is like passing a float variable where a structure of 80 bytes is expected (logically).
The string "Saturn" is a sequence of 7 bytes:
    'S'(83) | 'a'(97) | 't'(116) | 'u'(117) | 'r'(114) | 'n'(110) | '\0'(0)
But when you pass the same set of bytes to wcslen, it treats each pair of bytes as a single character. Therefore, the first two bytes [97, 83] would be treated as one character having the value 24915 (97<<8 | 83), which is the Unicode character U+6153, a CJK ideograph.
And the next character is represented by [117, 116], and so on.
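The arithmetic can be checked with a short sketch. unit_at is a hypothetical helper that combines two adjacent bytes explicitly in little-endian order (first byte low, second byte high), which is how the 2-byte characters are read on the x86/x64 machines Windows runs on; doing the shift by hand keeps the result the same on any host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads the index-th 16-bit unit from a byte buffer, little-endian.
// The caller must ensure bytes[2*index + 1] is inside the buffer.
std::uint16_t unit_at(const char* bytes, std::size_t index)
{
    assert(bytes != nullptr);
    const unsigned char lo = static_cast<unsigned char>(bytes[2 * index]);
    const unsigned char hi = static_cast<unsigned char>(bytes[2 * index + 1]);
    return static_cast<std::uint16_t>(lo | (hi << 8));
}
```

With the narrow literal "Saturn", unit 0 is 83 | (97 << 8) = 24915 and unit 1 is 116 | (117 << 8) = 30068, neither of which is a letter of the original word.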
For sure, you didn't pass that set of Chinese characters, but improper typecasting has done it! It is therefore essential to know that typecasting will not work. So, for the first line of initialization, you must write:
    TCHAR name[] = _T("Saturn");
Which would translate to 7 bytes or 14 bytes, depending on compilation. The call to wcslen should be:
    wcslen(L"Saturn");
In the sample program given above, I used strlen, which causes an error when building in Unicode. The non-working solution is a C-style typecast:
    nLen = strlen((const char*)name);
On a Unicode build, name would occupy 14 bytes (7 Unicode characters, including the null). Since the string "Saturn" contains only English letters, which can be represented in original ASCII, the Unicode letter 'S' would be represented as [83, 0]. The other ASCII characters would likewise be represented with a zero next to them. Note that 'S' is now represented as the 2-byte value 83. The end of the string would be represented by two bytes having the value 0.
So, when you pass such a string to strlen, the first character (i.e., the first byte) would be correct ('S' in the case of "Saturn"). But the second byte would indicate the end of the string. Therefore, strlen would return the incorrect value 1 as the length of the string.
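This is easy to reproduce without a Windows build. In the sketch below, kSaturnUtf16 is a hypothetical array that spells out by hand the 14 bytes a Unicode build would store for L"Saturn" (2 bytes per character, low byte first), so the result does not depend on the host's wchar_t size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bytes of the 2-byte-per-character string "Saturn", written out manually.
static const char kSaturnUtf16[] = {
    'S', 0, 'a', 0, 't', 0, 'u', 0, 'r', 0, 'n', 0, 0, 0
};

// What strlen sees when handed those bytes through a cast.
std::size_t narrow_length_of_wide_bytes()
{
    assert(sizeof kSaturnUtf16 == 14);
    return std::strlen(kSaturnUtf16);  // stops at the zero byte after 'S'
}
```

The function returns 1, matching the explanation above: strlen takes the high byte of the first character as the terminator.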
As you know, a Unicode string may contain non-English characters, in which case the result of strlen would be even less predictable.
In short, typecasting will not work. You either need to represent strings in the correct form in the first place, or use the ANSI-to-Unicode (and vice versa) conversion routines.
(There is more to add from this location, stay tuned!)
Now, I hope you understand the following signatures:
    BOOL SetCurrentDirectory(LPCTSTR lpPathName);
    DWORD GetCurrentDirectory(DWORD nBufferLength, LPTSTR lpBuffer);
Continuing. You must have seen some functions/methods asking you to pass a number of characters, or returning a number of characters. With functions like GetCurrentDirectory, you need to pass the number of characters, and not the number of bytes. For example:
    TCHAR sCurrentDir[255];
    // Pass 255 and not 255*2
    GetCurrentDirectory(255, sCurrentDir);
On the other hand, if you need to allocate room for a number of characters, you must allocate the proper number of bytes. In C++, you can simply use new:
    LPTSTR pBuffer; // TCHAR*
    pBuffer = new TCHAR[128]; // Allocates 128 or 256 BYTES, depending on compilation.
But if you use memory allocation functions like malloc, LocalAlloc, GlobalAlloc, etc., you must specify the number of bytes!
    pBuffer = (TCHAR*) malloc(128 * sizeof(TCHAR));
Typecasting the return value is required, as you know. The expression in malloc's argument ensures that it allocates the desired number of bytes, making room for the desired number of characters.
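One way to keep the character-count to byte-count conversion in a single place is a small template. This is a sketch only; bytes_for and alloc_chars are hypothetical helper names, and the caller is still responsible for releasing the memory with free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Number of bytes occupied by `count` characters of type CharT.
template <typename CharT>
std::size_t bytes_for(std::size_t count)
{
    return count * sizeof(CharT);
}

// Allocates room for `count` characters; returns nullptr on failure.
template <typename CharT>
CharT* alloc_chars(std::size_t count)
{
    assert(count > 0);
    return static_cast<CharT*>(std::malloc(bytes_for<CharT>(count)));
}
```

For example, alloc_chars<wchar_t>(128) requests 128 * sizeof(wchar_t) bytes, while alloc_chars<char>(128) requests exactly 128.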