Unicode character properties

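Since PHP 4.4.0 and 5.1.0, three additional escape sequences for matching generic character types are available when UTF-8 mode is selected. They are:

\p{xx}: a character with the xx property
\P{xx}: a character without the xx property
\X: an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by placing a circumflex (^) immediately after the opening brace. For example, \p{^Lu} is the same as \P{Lu}.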
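
The negation rule can be checked with a short PHP sketch. The helper name `matchesProperty` is hypothetical and exists only for this illustration; the example assumes the source file is saved as UTF-8 and that PCRE was built with Unicode property support.

```php
<?php
// Hypothetical helper for illustration: true if $char matches $pattern.
function matchesProperty(string $pattern, string $char): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $char) === 1;
}

// The "u" modifier selects UTF-8 mode, which these escapes rely on here.
var_dump(matchesProperty('/^\p{Lu}$/u', 'É'));   // bool(true): É is an upper case letter
var_dump(matchesProperty('/^\P{Lu}$/u', 'É'));   // bool(false)
var_dump(matchesProperty('/^\p{^Lu}$/u', 'É'));  // bool(false): same as \P{Lu}
var_dump(matchesProperty('/^\p{^Lu}$/u', 'é'));  // bool(true): é is lower case
```

Single-quoted PHP strings are used so that the backslashes reach PCRE unchanged.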

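If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case the curly brackets in the escape sequence are optional, so the following two forms have the same effect: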

\p{L}
\pL
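
As a minimal sketch of that equivalence, the following PHP snippet runs both spellings over the same sample string (the string itself is an arbitrary choice for illustration) and compares the matches.

```php
<?php
// Sample text mixing Latin, Greek and Cyrillic letters with a digit and punctuation.
$text = 'Aλ я 9!';

// Braced and brace-less spellings of the one-letter "L" (Letter) property.
preg_match_all('/\p{L}/u', $text, $braced);
preg_match_all('/\pL/u', $text, $bare);

var_dump($braced[0] === $bare[0]);  // bool(true): both forms match the same characters
var_dump(count($braced[0]));        // int(3): A, λ and я
```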

**Supported Unicode properties**
| Property | Matches | Notes |
| --- | --- | --- |
| C | Other | |
| Cc | Control | |
| Cf | Format | |
| Cn | Unassigned | |
| Co | Private use | |
| Cs | Surrogate | |
| L | Letter | Includes the following properties: Ll, Lm, Lo, Lt and Lu. |
| Ll | Lower case letter | |
| Lm | Modifier letter | |
| Lo | Other letter | |
| Lt | Title case letter | |
| Lu | Upper case letter | |
| M | Mark | |
| Mc | Spacing mark | |
| Me | Enclosing mark | |
| Mn | Non-spacing mark | |
| N | Number | |
| Nd | Decimal number | |
| Nl | Letter number | |
| No | Other number | |
| P | Punctuation | |
| Pc | Connector punctuation | |
| Pd | Dash punctuation | |
| Pe | Close punctuation | |
| Pf | Final punctuation | |
| Pi | Initial punctuation | |
| Po | Other punctuation | |
| Ps | Open punctuation | |
| S | Symbol | |
| Sc | Currency symbol | |
| Sk | Modifier symbol | |
| Sm | Mathematical symbol | |
| So | Other symbol | |
| Z | Separator | |
| Zl | Line separator | |
| Zp | Paragraph separator | |
| Zs | Space separator | |
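
To see a few of these categories in action, here is a small PHP sketch that counts the characters of a sample string by property. The helper `countProperty` and the sample text are assumptions made for this illustration, not part of PHP or PCRE.

```php
<?php
// Hypothetical helper: count the characters in $text that have $property.
function countProperty(string $property, string $text): int
{
    // preg_match_all() returns the number of full pattern matches.
    return (int) preg_match_all('/\p{' . $property . '}/u', $text);
}

$text = 'Total: 42 €, ok?';

var_dump(countProperty('L', $text));   // int(7): T o t a l o k
var_dump(countProperty('Nd', $text));  // int(2): 4 and 2
var_dump(countProperty('Sc', $text));  // int(1): the euro sign
var_dump(countProperty('P', $text));   // int(3): colon, comma, question mark
var_dump(countProperty('Zs', $text));  // int(3): the three spaces
```

The property name is concatenated straight into the pattern, so this helper should only ever be called with trusted, fixed names such as those in the table.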

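Extended properties such as "Greek" or "InMusicalSymbols" are not supported by the PCRE version described here. Later PCRE releases added script names such as Greek, while block properties such as InMusicalSymbols remain unsupported.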

Specifying case-insensitive matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that together form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*)

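That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.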
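
The following PHP sketch compares \X with the expanded pattern on a base letter followed by a combining accent. It assumes PHP 7.0 or later for the `\u{...}` string escape. Newer PCRE releases define \X through Unicode extended grapheme cluster rules, so the two patterns agree on this simple input but are not guaranteed to agree on every string.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "x".
$text = "e\u{0301}x";

preg_match_all('/\X/u', $text, $clusters);          // extended Unicode sequences
preg_match_all('/(?>\PM\pM*)/u', $text, $manual);   // non-mark plus trailing marks
preg_match_all('/./u', $text, $codePoints);         // individual code points

var_dump(count($clusters[0]));         // int(2): the accented "e" and "x"
var_dump($clusters[0] === $manual[0]); // bool(true) for this input
var_dump(count($codePoints[0]));       // int(3): e, U+0301, x
```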

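Matching characters by Unicode property is not fast, because PCRE has to search a data structure that contains entries for more than fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.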