Python unicodedata module

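Introduction

The unicodedata module provides access to the Unicode Character Database (UCD), which defines character properties for all Unicode characters.

UCD is short for Unicode Character Database.

The UCD consists of a set of plain-text and HTML files that describe the properties of Unicode characters and the relationships among them.

Most of the text files in the UCD hold Unicode data meant to be parsed by programs. The HTML files explain how the database is organized and the format and meaning of the data.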

Functions

>>> import unicodedata
>>> dir(unicodedata)
['UCD', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'bidirectional', 'category', 'combining', 'decimal', 'decomposition', 'digit', 'east_asian_width', 'lookup', 'mirrored', 'name', 'normalize', 'numeric', 'ucd_3_2_0', 'ucnhash_CAPI', 'unidata_version']

unicodedata.lookup(name)

lookup(name, /)
Look up character by name.

If a character with the given name is found, return the
corresponding character. If not found, KeyError is raised.

>>> unicodedata.lookup('LEFT CURLY BRACKET')
'{'
>>> unicodedata.lookup('LEFT CURLY')
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    unicodedata.lookup('LEFT CURLY')
KeyError: "undefined character name 'LEFT CURLY'"
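Because an unknown name raises KeyError, callers often wrap the call. The following is a minimal sketch; the helper name `lookup_or_none` is made up for illustration and is not part of the module.

```python
import unicodedata

def lookup_or_none(name):
    """Return the character with the given Unicode name, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown name with KeyError.
        return None

brace = lookup_or_none('LEFT CURLY BRACKET')
missing = lookup_or_none('LEFT CURLY')
print(repr(brace), missing)  # '{' None
```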

unicodedata.name(chr, default=None)

name(chr, default=None, /)
Returns the name assigned to the character chr as a string.

If no name is defined, default is returned, or, if not given,
ValueError is raised.

>>> unicodedata.name('}')
'RIGHT CURLY BRACKET'

>>> unicodedata.name('}}')
Traceback (most recent call last):
  File "<pyshell#37>", line 1, in <module>
    unicodedata.name('}}')
TypeError: name() argument 1 must be a unicode character, not str
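The `default` argument avoids the ValueError for characters that have no name, such as control characters. A small sketch follows; the `describe` helper and the `'<unnamed>'` placeholder are illustrative choices, not part of the module.

```python
import unicodedata

def describe(text):
    """Pair each character with its Unicode name, or '<unnamed>' if it has none."""
    return [(ch, unicodedata.name(ch, '<unnamed>')) for ch in text]

pairs = describe('a}\n')
for ch, label in pairs:
    print(repr(ch), label)
# 'a' LATIN SMALL LETTER A
# '}' RIGHT CURLY BRACKET
# '\n' <unnamed>
```

Note that name() still expects exactly one character per call, so the helper iterates over the string instead of passing it whole.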

unicodedata.decimal(chr, default=None)

decimal(chr, default=None, /)
Converts a Unicode character into its equivalent decimal value.

Returns the decimal value assigned to the character chr as integer.
If no such value is defined, default is returned, or, if not given,
ValueError is raised.

>>> unicodedata.decimal('8')
8
>>> type(unicodedata.decimal('8'))
<class 'int'>
>>> unicodedata.decimal('a')
Traceback (most recent call last):
  File "<pyshell#46>", line 1, in <module>
    unicodedata.decimal('a')
ValueError: not a decimal
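decimal() is not limited to ASCII digits; it also works for decimal digits of other scripts. The sketch below builds an integer from such digits. The `parse_decimal` helper and the use of -1 as the "no value" default are assumptions made for this example.

```python
import unicodedata

def parse_decimal(text):
    """Convert a string of Unicode decimal digits to an int."""
    value = 0
    for ch in text:
        d = unicodedata.decimal(ch, -1)  # -1 marks "no decimal value"
        if d < 0:
            raise ValueError(f'not a decimal digit: {ch!r}')
        value = value * 10 + d
    return value

ascii_value = parse_decimal('42')
arabic_value = parse_decimal('\u0664\u0662')  # ARABIC-INDIC DIGIT FOUR, TWO
print(ascii_value, arabic_value)  # 42 42
```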

unicodedata.digit(chr, default=None)

digit(chr, default=None, /)
Converts a Unicode character into its equivalent digit value.

Returns the digit value assigned to the character chr as integer.
If no such value is defined, default is returned, or, if not given,
ValueError is raised.

>>> unicodedata.digit('8')
8
>>> unicodedata.digit('a')
Traceback (most recent call last):
  File "<pyshell#50>", line 1, in <module>
    unicodedata.digit('a')
ValueError: not a digit
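For '8' the two functions agree, so the difference between digit() and decimal() is easy to miss. One character where they differ is SUPERSCRIPT TWO, which has a digit value but no decimal value. A short sketch:

```python
import unicodedata

superscript_two = '\u00b2'  # SUPERSCRIPT TWO

as_digit = unicodedata.digit(superscript_two)
# Passing a default suppresses the ValueError that decimal() would raise here.
as_decimal = unicodedata.decimal(superscript_two, None)
print(as_digit, as_decimal)  # 2 None
```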

unicodedata.numeric(chr, default=None)

numeric(chr, default=None, /)
Converts a Unicode character into its equivalent numeric value.

Returns the numeric value assigned to the character chr as float.
If no such value is defined, default is returned, or, if not given,
ValueError is raised.

>>> unicodedata.numeric('5')
5.0
>>> unicodedata.numeric('五')
5.0
>>> unicodedata.numeric('无')
Traceback (most recent call last):
  File "<pyshell#52>", line 1, in <module>
    unicodedata.numeric('无')
ValueError: not a numeric character
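numeric() is the broadest of the three value functions: it also covers fractions and numerals that are neither digits nor decimals, which is why it returns a float. A brief sketch with two such characters:

```python
import unicodedata

half = unicodedata.numeric('\u00bd')          # VULGAR FRACTION ONE HALF
twelve = unicodedata.numeric('\u216b')        # ROMAN NUMERAL TWELVE
no_digit = unicodedata.digit('\u00bd', None)  # the fraction has no digit value
print(half, twelve, no_digit)  # 0.5 12.0 None
```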

unicodedata.category(chr)

category(chr, /)
Returns the general category assigned to the character chr as string.
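The result is a two-letter code such as 'Lu' (uppercase letter) or 'Nd' (decimal number). The sketch below prints a few codes and uses the first letter to filter out punctuation; the `strip_punctuation` helper is a hypothetical name used only here.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose general category starts with 'P'."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

samples = {ch: unicodedata.category(ch) for ch in 'Aa1 }+'}
print(samples)
# {'A': 'Lu', 'a': 'Ll', '1': 'Nd', ' ': 'Zs', '}': 'Pe', '+': 'Sm'}

cleaned = strip_punctuation('Hello, world!')
print(cleaned)  # Hello world
```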

unicodedata.bidirectional(chr)

bidirectional(chr, /)
Returns the bidirectional class assigned to the character chr as string.
If no such value is defined, an empty string is returned.
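A few sample values may make the class codes more concrete. This is only an illustrative sketch; the `has_rtl` helper is an assumption for the example and is far simpler than the full Unicode bidirectional algorithm.

```python
import unicodedata

latin = unicodedata.bidirectional('A')        # left-to-right letter
hebrew = unicodedata.bidirectional('\u05d0')  # HEBREW LETTER ALEF
arabic = unicodedata.bidirectional('\u0627')  # ARABIC LETTER ALEF
number = unicodedata.bidirectional('1')       # European number
space = unicodedata.bidirectional(' ')        # whitespace
print(latin, hebrew, arabic, number, space)   # L R AL EN WS

def has_rtl(text):
    """Return True if any character has a right-to-left class ('R' or 'AL')."""
    return any(unicodedata.bidirectional(ch) in ('R', 'AL') for ch in text)

print(has_rtl('abc'), has_rtl('abc \u05d0'))  # False True
```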

unicodedata.combining(chr)

combining(chr, /)
Returns the canonical combining class assigned to the character chr as integer.
Returns 0 if no combining class is defined.
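A nonzero combining class identifies marks such as accents that attach to a preceding base character. The sketch below shows two class values and a simple filter; `strip_marks` is a hypothetical helper that only removes characters with a nonzero class.

```python
import unicodedata

acute = unicodedata.combining('\u0301')      # COMBINING ACUTE ACCENT
dot_below = unicodedata.combining('\u0323')  # COMBINING DOT BELOW
plain = unicodedata.combining('a')
print(acute, dot_below, plain)  # 230 220 0

def strip_marks(text):
    """Keep only characters whose canonical combining class is 0."""
    return ''.join(ch for ch in text if unicodedata.combining(ch) == 0)

stripped = strip_marks('e\u0301')
print(stripped)  # e
```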

unicodedata.decomposition(chr)

decomposition(chr, /)
Returns the character decomposition mapping assigned to the character chr as string.
An empty string is returned in case no such mapping is defined.
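The returned string is a space-separated list of hexadecimal code points, optionally preceded by a tag such as `<fraction>` for compatibility mappings. A minimal sketch of parsing it follows; the `decomposition_chars` helper is hypothetical and exists only for this example.

```python
import unicodedata

def decomposition_chars(ch):
    """Return (tag, characters) parsed from unicodedata.decomposition(ch)."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)  # e.g. '<fraction>' or '<compat>'
    return tag, ''.join(chr(int(field, 16)) for field in fields)

raw = unicodedata.decomposition('\u00e9')     # LATIN SMALL LETTER E WITH ACUTE
print(raw)                                    # 0065 0301
e_acute = decomposition_chars('\u00e9')       # (None, 'e' + COMBINING ACUTE ACCENT)
one_half = decomposition_chars('\u00bd')      # ('<fraction>', '1' + FRACTION SLASH + '2')
no_mapping = decomposition_chars('a')         # (None, '')
```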

unicodedata.east_asian_width(chr)

east_asian_width(chr, /)
Returns the east asian width assigned to the character chr as string.
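A common use of this property is estimating how many terminal columns a string occupies. The sketch below is a rough approximation under stated assumptions: wide ('W') and fullwidth ('F') characters count as two columns and everything else as one, ignoring combining marks and ambiguous-width characters. The `display_width` name is made up for the example.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: 'W' and 'F' characters count as 2, others as 1."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in text)

narrow = unicodedata.east_asian_width('A')       # 'Na'
wide = unicodedata.east_asian_width('\u4e2d')    # 'W', a CJK ideograph
full = unicodedata.east_asian_width('\uff21')    # 'F', FULLWIDTH LATIN CAPITAL LETTER A
print(narrow, wide, full)                        # Na W F

print(display_width('abc'))                      # 3
print(display_width('abc\u4e2d\u6587'))          # 7
```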

unicodedata.mirrored(chr)

mirrored(chr, /)
Returns the mirrored property assigned to the character chr as integer.
Returns 1 if the character has been identified as a "mirrored"
character in bidirectional text, 0 otherwise.
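Brackets and comparison signs are typical mirrored characters, while letters and digits are not. A quick check:

```python
import unicodedata

flags = {ch: unicodedata.mirrored(ch) for ch in '({<a1'}
print(flags)  # {'(': 1, '{': 1, '<': 1, 'a': 0, '1': 0}
```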

unicodedata.normalize(form, unistr)

normalize(form, unistr, /)
    Return the normal form 'form' for the Unicode string unistr.
    Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
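Normalization matters because the same visible text can be encoded in more than one way. The sketch below compares a precomposed character with its decomposed equivalent, then applies a compatibility form to a ligature.

```python
import unicodedata

composed = '\u00e9'     # LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True

# Compatibility forms also replace characters such as ligatures.
ligature = unicodedata.normalize('NFKC', '\ufb01')  # LATIN SMALL LIGATURE FI
print(ligature)  # fi
```

Normalizing both sides to the same form before comparing strings is the usual way to treat such variants as equal.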

unicodedata.unidata_version

unicodedata.ucnhash_CAPI

unicodedata.ucd_3_2_0

DATA
ucd_3_2_0 = <unicodedata.UCD object>
ucnhash_CAPI = <capsule object "unicodedata.ucnhash_CAPI">
unidata_version = '11.0.0'
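The value of unidata_version depends on the Python build, so '11.0.0' above is just what this interpreter reported. The ucd_3_2_0 object offers the same functions as the module, backed by Unicode 3.2.0 data. A short sketch; the variable names are illustrative.

```python
import unicodedata

# Parse the version string into a tuple so it can be compared numerically.
current = tuple(int(part) for part in unicodedata.unidata_version.split('.'))
print(current)  # depends on the Python version in use

legacy = unicodedata.ucd_3_2_0
print(legacy.unidata_version)  # 3.2.0

# The legacy object exposes the same methods as the module itself.
legacy_name = legacy.name('}')
print(legacy_name)  # RIGHT CURLY BRACKET
```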

Reference

https://docs.python.org/zh-cn/3/library/unicodedata.html#module-unicodedata

https://blog.csdn.net/xc_zhou/article/details/82079753
