Unicode

Protocols & Standards
A computing standard for consistent encoding and handling of text in most world writing systems.

What is Unicode?

Unicode is a universal computing standard for the consistent encoding, representation, and handling of text in most of the world's writing systems. In the domain name industry, Unicode makes Internationalized Domain Names (IDNs) possible: names that contain non-Latin characters from scripts such as Chinese, Arabic, and Cyrillic. Unicode assigns a unique code point to each character it encodes, so the same text is represented consistently across different systems.

Unicode in Domain Names

IDN Support

Unicode enables domain names written in native scripts, such as münchen.de (German) and 北京.中国 (Chinese).

Punycode Conversion

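The DNS itself handles only ASCII, so each Unicode label in a domain is converted to an ASCII-compatible form: the Punycode encoding of the label, prefixed with xn--.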

Unicode: münchen.de

Punycode: xn--mnchen-3ya.de

Unicode: 北京.中国

Punycode: xn--1lq90i.xn--fiqs8s
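The conversion above can be sketched in Python with the standard library's built-in idna codec. This is a minimal illustration, not a registry-grade implementation: the codec follows the older IDNA 2003 rules, so software that applies IDNA 2008 may treat some characters differently. The helper names to_ascii and to_unicode are made up for this example.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its ASCII (xn--) form, label by label."""
    # The built-in "idna" codec (IDNA 2003) splits on dots and encodes each label.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    for name in ("münchen.de", "北京.中国"):
        ascii_form = to_ascii(name)
        print(name, "->", ascii_form, "->", to_unicode(ascii_form))
```

Labels that are already plain ASCII, such as de, pass through unchanged; only labels containing non-ASCII characters receive the xn-- prefix.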

Unicode Code Points

Structure

Format: U+ followed by four to six hexadecimal digits (for example, U+XXXX)

Examples:

A = U+0041 (Latin A)

а = U+0430 (Cyrillic a)

中 = U+4E2D (Chinese character)
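As a small sketch, the code points listed above can be reproduced in Python with the built-in ord function and the standard unicodedata module. The helper name code_point is hypothetical and exists only for this example.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # At least four hex digits, upper case; code points above U+FFFF use more.
    return f"U+{ord(ch):04X}"


# Latin capital A, Cyrillic small a, and the CJK ideograph from the examples.
for ch in ("A", "\u0430", "\u4e2d"):
    print(ch, code_point(ch), unicodedata.name(ch))
```

Printing the official character name next to the code point makes it obvious when two visually similar characters are in fact different.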

Character Blocks

Block          Range          Script
Basic Latin    U+0000-007F    English/ASCII
Cyrillic       U+0400-04FF    Russian, etc.
Arabic         U+0600-06FF    Arabic
CJK            U+4E00-9FFF    Chinese/Japanese/Korean
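A block lookup over the ranges in this table could look like the following Python sketch. It covers only the four blocks listed here, not the full Unicode block list, and the names BLOCKS and block_of are assumptions made for the example.

```python
from typing import Optional

# (block name, first code point, last code point), taken from the table above.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Arabic", 0x0600, 0x06FF),
    ("CJK", 0x4E00, 0x9FFF),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block containing ch, or None if it is outside the table."""
    cp = ord(ch)
    for name, first, last in BLOCKS:
        if first <= cp <= last:
            return name
    return None


for ch in ("A", "\u0430", "\u0627", "\u4e2d"):
    print(f"U+{ord(ch):04X}", block_of(ch))
```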

Security Concerns

Homoglyph Attacks

Similar-looking characters from different scripts:

Latin 'a' (U+0061) vs Cyrillic 'а' (U+0430)

Latin 'o' (U+006F) vs Cyrillic 'о' (U+043E)

Attack: аpple.com (Cyrillic 'а') looks like apple.com
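To illustrate why the two names are different to a computer even though they look alike on screen, this short Python sketch compares them position by position. The function name diff_code_points is hypothetical.

```python
def diff_code_points(a: str, b: str) -> list:
    """List the positions where two equal-length strings differ, with code points."""
    return [
        (i, f"U+{ord(x):04X}", f"U+{ord(y):04X}")
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


genuine = "apple.com"        # all Latin letters
spoofed = "\u0430pple.com"   # first letter is Cyrillic 'а' (U+0430)

print(genuine == spoofed)
print(diff_code_points(genuine, spoofed))
```

The comparison is false and the only difference is at index 0, where U+0061 has been replaced by U+0430.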

Browser Protections

Browsers may display the Punycode form (xn--...) in the address bar instead of the Unicode form when a domain mixes scripts in a suspicious way.
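A rough idea of a mixed-script check can be sketched in Python. This is a heuristic under stated assumptions, not how any particular browser works: the standard library does not expose the Unicode Script property, so the sketch takes the first word of each character's official name (LATIN, CYRILLIC, CJK, and so on) as a stand-in for its script. The function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the set of scripts used in one domain label."""
    scripts = set()
    for ch in label:
        if ch.isascii() and not ch.isalpha():
            continue  # digits and hyphens are treated as script-neutral
        name = unicodedata.name(ch, "")
        if name:
            # Heuristic: the first word of the name, e.g. "LATIN" or "CYRILLIC".
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(domain: str) -> bool:
    """True if any single label combines characters from more than one script."""
    return any(len(scripts_in_label(label)) > 1 for label in domain.split("."))


for domain in ("apple.com", "\u0430pple.com", "münchen.de"):
    print(domain, is_mixed_script(domain))
```

Checking each label separately means a name such as 北京.com is not flagged just because its labels use different scripts. A name written entirely in one non-Latin script is also not flagged, so a check like this catches only the mixed-script case described above.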

Unicode Normalization

The same character can be represented in different ways:

é = U+00E9 (precomposed)

é = U+0065 + U+0301 (decomposed: e + combining accent)

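Normalization forms: NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), NFKD (compatibility decomposition)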
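The effect of normalization can be seen with Python's standard unicodedata.normalize function. This is a minimal sketch; the variable names are chosen for the example.

```python
import unicodedata

precomposed = "\u00e9"    # é as a single code point
decomposed = "e\u0301"    # e followed by a combining acute accent

# The two strings render identically but are not equal until normalized.
print(precomposed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)    # composes to U+00E9
nfd = unicodedata.normalize("NFD", precomposed)   # decomposes to U+0065 U+0301
print(nfc == precomposed, nfd == decomposed)

# The compatibility forms also fold characters such as the "fi" ligature (U+FB01).
print(unicodedata.normalize("NFKC", "\ufb01"))
```

Normalizing both sides to the same form before comparing or storing names avoids treating visually identical strings as different.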

Unicode is fundamental to global internet accessibility, enabling users worldwide to register and access domain names in their native scripts and languages.
