What is the encoding of Chinese characters on Wiki

Posted 2020-05-30 03:15

I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84" (see here). That's three bytes; however, none of the encodings described on this page uses three bytes to represent Chinese characters. UTF-8 for instance uses 2 bytes.

I'm basically trying to match these three bytes to an actual character. Any suggestions on what encoding it could be?

Tags: utf-8 character-encoding cjk url-encoding

3 answers

Animai°情兽

Reply #2 · 2020-05-30 03:58


>>> c = b'\xe7\x9a\x84'.decode('utf-8')  # Python 3: read the three raw bytes as UTF-8
>>> c
'的'
>>> hex(ord(c))  # the code point is U+7684
'0x7684'

Although the Unicode code point U+7684 fits in 16 bits, UTF-8 encodes it as 3 bytes.
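To connect this back to the URL in the question: each %XX escape stands for one raw byte, and those bytes are then read as UTF-8. A minimal sketch, assuming Python 3 and its standard urllib.parse module (the variable names are only for illustration):

```python
from urllib.parse import quote, unquote

# Each %XX escape is one byte; unquote() decodes the resulting bytes as UTF-8.
escaped = "%E7%9A%84"
char = unquote(escaped, encoding="utf-8")
print(char)

# Round trip: the character is UTF-8 encoded, then each byte is percent-escaped.
print(quote(char))

# The UTF-8 form of this character is three bytes long.
print(len(char.encode("utf-8")))
```

Going the other way with quote() gives back the same three escapes that appear in the Wikipedia URL.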


啃猪蹄的小仙女

Reply #3 · 2020-05-30 04:04

The header of a Wikipedia page includes this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

So the page is UTF-8.


够拽才男人

Reply #4 · 2020-05-30 04:04

The example you give is the percent-encoded form of an IRI.

IRIs use the UTF-8 encoding. UTF-8 is an encoding of Unicode, and in Unicode each character has a code point. For the most common Chinese characters (the CJK Unified Ideographs block) that code point lies between 0x4E00 and 0x9FFF, which fits in 2 bytes.

But UTF-8 doesn't encode characters by just storing their code point (UTF-32 does that). Instead, it uses a variable-length scheme in which every code point from 0x0800 to 0xFFFF, including that whole block, takes 3 bytes. Ideographs beyond 0xFFFF take 4 bytes.
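To see where the three bytes come from, here is a rough sketch of the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) applied by hand to U+7684. The function name utf8_encode_3byte is made up for this illustration, and it only covers code points from 0x0800 to 0xFFFF; in practice you would just call str.encode("utf-8").

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the 3-byte range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    # Layout: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 = 16 payload bits)
    b1 = 0b11100000 | (code_point >> 12)            # top 4 bits
    b2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # middle 6 bits
    b3 = 0b10000000 | (code_point & 0b111111)         # low 6 bits
    return bytes([b1, b2, b3])


result = utf8_encode_3byte(0x7684)
print(result.hex())

# Cross-check against Python's built-in encoder.
print(result == "的".encode("utf-8"))
```

The 16 bits of the code point are spread over three bytes because each byte also carries marker bits (1110 on the first, 10 on the others), which is why a "2-byte" code point ends up as 3 bytes in the URL.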
