Unicode  
 

Unicode is a multi-language character set designed to encompass virtually all of the characters used with computers today. On Windows, Unicode characters are represented by 16-bit values (UTF-16), and the character set differs from others in two important ways. First, unlike the traditional single-byte (ANSI) character sets, Unicode can represent significantly more characters across a variety of languages. Second, unlike multi-byte character sets (where some characters may be one byte in length, while others may be two, three or four bytes), most Unicode characters on Windows are a fixed width of one 16-bit value, which can make them easier to work with.

The SocketTools libraries support both the ANSI and Unicode character sets by providing two versions of each function that either expects a string as an argument (including those functions which pass structures that contain strings) or returns the address of a string. The functions which use multi-byte strings have a suffix of "A" (ANSI), while the functions which use Unicode strings have a suffix of "W" (wide). No suffix is used with functions which expect binary (non-textual) data or only use numeric parameters and return numeric values.

For example, consider the InetGetLocalName function mentioned in the previous section. If you looked at the list of exported functions in the library, you would see two functions exported, InetGetLocalNameA and InetGetLocalNameW. In C and C++, which function is actually called depends on how the application is being built. That is, if your application is built to use Unicode (in other words, the UNICODE macro is defined and you are linking with Unicode versions of the standard libraries), then the InetGetLocalNameW function will be used instead of InetGetLocalNameA. In other languages, you may have to explicitly declare which version of the function you wish to use. In Visual Basic, for example, the Alias keyword must be used with the function declaration to specify the correct name.

Automatic Encoding

When building a project that is configured to use the Unicode character set, SocketTools will automatically convert strings to UTF-8 encoded text before transmitting that data over the network. This conversion only occurs with string types; it is not performed on byte arrays or other data that is not represented as a null-terminated string.

Converting strings to UTF-8 ensures textual data is sent and received in a uniform way that is not affected by the local system's localization and language settings. Virtually all modern servers on the Internet expect text to be exchanged using UTF-8, and because the ASCII characters are a subset of UTF-8, they pass through the conversion unchanged.

Earlier versions of SocketTools always performed Unicode string conversions using the default system code page, rather than using UTF-8 encoding. This change will not typically affect most applications; however, if you are using Unicode strings, it is important to keep in mind that this conversion to UTF-8 can change how data is exchanged over the network. If you want to prevent this automatic UTF-8 encoding, you can perform the preferred conversion in your code (for example, using the WideCharToMultiByte function) and then explicitly call the ANSI version of the SocketTools function, rather than the Unicode version.

The Encoding and Compression library includes helper functions that can simplify the process of performing UTF-8 encoding and decoding. The IsUnicodeText function will analyze a string buffer to determine if it contains valid Unicode text. The UnicodeDecodeText and UnicodeEncodeText functions can be used to perform conversions between UTF-8 encoded text, multi-byte and Unicode strings.

Strings and Byte Arrays

Some SocketTools functions require you to use byte arrays instead of strings, regardless of the character set your project uses. This can create problems when reading and writing Unicode string data. For example, consider the InetRead and InetWrite functions which are used to read and write data on a socket. Because character strings and byte arrays are essentially identical when using the ANSI character set, a C/C++ programmer may try to write code such as this:

LPTSTR lpszData = _T("This is a test, this is only a test");
INT cchData = lstrlen(lpszData);
INT nResult;

nResult = InetWrite(hSocket, lpszData, cchData);

This would work as expected until you change your project to use the Unicode character set. The problem is that the Unicode string is no longer an array of 8-bit bytes, but is now an array of 16-bit integers. The Unicode string must be converted to a byte array before passing it to the InetWrite function. One way to do this is to use the WideCharToMultiByte function:

LPTSTR lpszData = _T("This is a test, this is only a test");
INT cchData = lstrlen(lpszData);
LPBYTE lpBuffer;
INT nResult;

#ifdef UNICODE
lpBuffer = (LPBYTE)_alloca((cchData + 1) * 4);

if (lpBuffer == NULL)
{
    // Unable to allocate memory
    return;
}

cchData = WideCharToMultiByte(CP_UTF8, 0,
                              (LPCWSTR)lpszData,
                              cchData,
                              (LPSTR)lpBuffer,
                              ((cchData + 1) * 4),
                              NULL, NULL);

if (cchData <= 0)
{
    // Unable to convert the Unicode string
    return;
}
#else
lpBuffer = (LPBYTE)lpszData;
#endif

nResult = InetWrite(hSocket, lpBuffer, cchData);

Note that the characters being converted may also present a problem for the developer. In this example, the string converts easily because it contains only characters from the basic ASCII character set. In most cases, it is recommended that you use CP_UTF8 to convert the text to UTF-8. When converting a string that contains international characters, such as accented vowels, a conversion using the system code page may produce unprintable characters. For additional information, consult your programming language's technical reference for issues regarding localization and the use of Unicode.