What is Unicode?
Unicode is a universal computing standard for consistent encoding, representation, and handling of text in most of the world's writing systems. In the domain industry, Unicode enables Internationalized Domain Names (IDNs) containing non-Latin characters such as Chinese, Arabic, Cyrillic, and other scripts. Unicode assigns a unique code point to every character across all languages, ensuring consistent representation across different systems.
Unicode in Domain Names
IDN Support
Unicode enables domains like:
- 例え.jp (Japanese)
- مثال.مصر (Arabic)
- пример.рф (Russian Cyrillic)
- 例子.中国 (Chinese)
Punycode Conversion
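DNS labels are restricted to ASCII letters, digits, and hyphens, so each Unicode label is converted to an ASCII-compatible form: its Punycode encoding prefixed with xn--: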
Unicode: münchen.de
Punycode: xn--mnchen-3ya.de
Unicode: 北京.中国
Punycode: xn--1lq90i.xn--fiqs8s
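The conversion above can be reproduced with Python's standard library. This is a minimal sketch: the helper names `to_ascii` and `to_unicode` are illustrative, and the built-in `idna` codec follows the older IDNA 2003 rules, so registries or applications that require IDNA 2008 may need a different library.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its ASCII (Punycode) form, label by label."""
    # The built-in "idna" codec implements IDNA 2003 (RFC 3490).
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("münchen.de"))           # xn--mnchen-3ya.de
print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
print(to_ascii("北京.中国"))
```

Each label is converted independently, which is why an ASCII-only label such as `de` passes through unchanged.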
Unicode Code Points
Structure
Format: U+ followed by four to six hexadecimal digits (U+0000 through U+10FFFF)
Examples:
A = U+0041 (Latin capital A)
а = U+0430 (Cyrillic small a)
中 = U+4E2D (Chinese character)
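As a small illustration, the code points listed above can be looked up with Python's built-in `ord` and the `unicodedata` module. The helper name `code_point` is hypothetical and exists only for this sketch.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "\u0430", "中"]:  # "\u0430" is the Cyrillic small a
    print(ch, code_point(ch), unicodedata.name(ch))
```

Printing the official character name alongside the code point makes it easier to tell apart characters that render identically.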
Character Blocks
| Block | Range | Script |
|---|---|---|
| Basic Latin | U+0000-007F | English/ASCII |
| Cyrillic | U+0400-04FF | Russian, etc. |
| Arabic | U+0600-06FF | Arabic |
| CJK Unified Ideographs | U+4E00-9FFF | Chinese/Japanese/Korean |
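The table can be turned into a simple range lookup. The sketch below hard-codes only the four blocks listed here, so it is not a complete block database; `BLOCKS` and `block_of` are names made up for the example.

```python
from typing import Optional

# (first code point, last code point, block name); only the rows from the table.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for a character, or None if it is not in BLOCKS."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(block_of("A"))   # Basic Latin
print(block_of("中"))  # CJK Unified Ideographs
print(block_of("é"))   # None (Latin-1 Supplement is not in this short table)
```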
Security Concerns
Homoglyph Attacks
Homoglyphs are visually similar characters from different scripts, which attackers can substitute to spoof a familiar domain:
Latin 'a' (U+0061) vs Cyrillic 'а' (U+0430)
Latin 'o' (U+006F) vs Cyrillic 'о' (U+043E)
Attack: аpple.com (Cyrillic 'а') looks like apple.com
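Although the two domains look the same on screen, they are different strings. The following sketch makes the difference visible by listing every non-ASCII character in a domain; `non_ascii_chars` is a hypothetical helper, and the spoofed name is written with an escape sequence so the Cyrillic letter is explicit.

```python
import unicodedata

genuine = "apple.com"
spoof = "\u0430pple.com"  # first letter is Cyrillic small a, not Latin a


def non_ascii_chars(domain: str) -> list:
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(domain)
        if not ch.isascii()
    ]


print(spoof == genuine)        # False
print(non_ascii_chars(spoof))  # [(0, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```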
Browser Protections
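Browsers may display the Punycode (xn--) form instead of the Unicode form when a domain label mixes scripts suspiciously, making the substitution visible to the user.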
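The idea can be sketched with a deliberately simplified heuristic. This is not any browser's actual algorithm: it guesses a character's script from the first word of its Unicode name, and it would also flag legitimate combinations that real browsers allow, such as Japanese labels mixing kanji and hiragana. The function names `scripts_in` and `display_form` are assumptions for this example.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the scripts in a label from the first word of each letter's name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def display_form(domain: str) -> str:
    """Show Punycode if any label mixes scripts, otherwise show the domain as-is."""
    if any(len(scripts_in(label)) > 1 for label in domain.split(".")):
        return domain.encode("idna").decode("ascii")
    return domain


print(display_form("apple.com"))        # apple.com
print(display_form("\u0430pple.com"))   # an xn-- form, because Cyrillic and Latin are mixed
print(display_form("пример.рф"))        # shown in Unicode: each label uses a single script
```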
Unicode Normalization
The same character can be represented by different code point sequences:
é = U+00E9 (precomposed)
é = U+0065 + U+0301 (decomposed: e + combining accent)
Normalization forms: NFC (canonical composition), NFD (canonical decomposition), NFKC and NFKD (compatibility composition and decomposition)
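A short example using Python's `unicodedata.normalize` shows how the two representations of é compare before and after normalization. The variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The raw strings differ even though they render identically.
print(precomposed == decomposed)                                # False

# NFC composes, NFD decomposes; after normalizing, the strings match.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# The compatibility forms also fold characters such as ligatures.
print(unicodedata.normalize("NFKC", "\ufb01"))                  # fi
```

Normalizing both sides to the same form before comparing or storing names avoids treating visually identical strings as different.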
Unicode is fundamental to global internet accessibility, enabling users worldwide to register and access domain names in their native scripts and languages.