Skip to content

UB28⚓︎

Author: Yikai Cui.

Definition⚓︎

A universal character name in an identifier does not designate a character whose encoding falls into one of the specified ranges (6.4.2.1).

当一个标识符中的通用字符名不在指定的范围内时。

Description⚓︎

A universal character name is something like \uXXXX or \UXXXXXXXX. It can be used to represent a character that is not in the basic source character set.

一个通用字符名是类似于\uXXXX或者\UXXXXXXXX的形式。它可用于表示不在基本源字符集中的字符。

The standard specifies that a universal character name shall not designate a code point where the hexadecimal value is:

  • less than 00A0 other than 0024($), 0040(@), or 0060(`)
  • in the range D800 through DFFF inclusive
  • greater than 10FFFF

标准指出,通用字符名不应指定以下十六进制值的范围:

  • 小于00A0,除了0024($), 0040(@), 或者0060(`)
  • 在D800到DFFF之间
  • 大于10FFFF

The code points less than 00A0 can be represented in the basic source character set, so they are not allowed to be represented by universal character names. Here is an example where the use of universal character names in basic source character set can be confusing: suppose a lexer meets an identifier with universal character name \u0022 (where '\u0022' is the value, or code point, of "), it may treat the universal character name as '"', which may cause compilation errors.

小于00A0的码点可以在基本源字符集中表示,因此不允许用通用字符名表示。下面是一个通用字符名在基本源字符集中使用时可能会引起混淆的例子:假设词法分析遇到一个带有通用字符名的标识符\u0022(其中'\u0022'"),它可能会将通用字符名视为",这可能会导致编译错误。

The code points in the range D800 through DFFF are reserved for surrogate pairs in UTF-16, so they are not allowed to be represented by universal character names.

D800到DFFF之间的范围用于UTF-16中的代理对,是无效的范围,因此不允许用通用字符名表示。

The code points greater than 10FFFF are not valid Unicode code points (unused unicode plains), so they are not allowed to be represented by universal character names.

大于10FFFF的范围是无效的Unicode码点(未使用的Unicode平面),因此不允许用通用字符名表示。

Code⚓︎

UB28.c
#include <stdio.h>

int main() {
    int a\u0024 = 1; // not an undefined behavior! (1)
    int b\u0022 = 2; // undefined behavior! (2)
    return a$;
}
  1. Not an undefined behavior. Here \u0024 falls into the range of the universal character names, therefore is not an undefined behavior. The variable can even be used by a$ (see the return statement)
  2. Undefined behavior! Here \u0022 does not fall into any the range of the universal character names, therefore is an undefined behavior.

View source

Note⚓︎

Nearly all (Standard conforming) implementations raise errors on invalid universal character names during compilation.

几乎所有(符合标准的)编译器在编译期间对无效的通用字符名都会报错。

Advice⚓︎

Using universal character names in identifiers is not recommended, not only because the source file may not be successfully compiled in earlier compilers conforming C99 or earlier, but also because using characters that are not in the basic character set might reduce the readablity of the source code.

不推荐在标识符中使用通用字符名。不仅因为源文件可能无法在符合C99或更早版本的编译器中成功编译,还因为因为使用不在基本字符集中的字符可能会降低源代码的可读性。

The advice is simple: just use English. If you must use them, make sure they are valid.

建议很简单:只使用英文编写代码。如果必须使用通用字符名,请确保它们是有效的。