🗣️ Language & Locale

Various specifications define how to represent languages and locales in software. This document aims to provide a high-level overview of these specifications and how to use them.

TL;DR:

Find a library that wraps CLDR data and use it to format dates, numbers, and currencies.
BCP 47 is the only standard for language tags, but you still have to decide which subtags to use.
When in doubt, language-region (for example, `en-US`) is a safe default.

ISO 639

Examples

Language	ISO 639-1	ISO 639-2	ISO 639-3
English	`en`	`eng`	`eng`
Chinese	`zh`	`zho`	`zho` + 16 other codes

RFC 5646

RFC 5646 loosely defines the structure of language tags as follows:

{language}[-{script}][-{region}][-{variant}][-{extension}][-{privateuse}]

Only `language` is required; every other component is optional, and `variant` and `extension` may each appear more than once.
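The structure above can be sketched as a small parser. This is a simplified illustration, not a full RFC 5646 implementation: it ignores extended language subtags, grandfathered tags, and private-use-only tags, and it checks only the shape of each subtag, not whether it is registered. The names `LanguageTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]
    extensions: Tuple[str, ...]
    private_use: Optional[str]


# Simplified shape check (assumption): no extlang, grandfathered,
# or private-use-only tags, and no lookup in the IANA registry.
_TAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<private_use>x(?:-[a-z0-9]{1,8})+))?$
    """,
    re.IGNORECASE | re.VERBOSE,
)

# One extension = a singleton (any letter or digit except "x")
# followed by one or more subtags of 2 to 8 characters.
_EXTENSION = re.compile(r"[0-9a-wy-z](?:-[a-z0-9]{2,8})+", re.IGNORECASE)


def parse_tag(tag: str) -> LanguageTag:
    """Split a language tag into its components, keeping the input casing."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a well-formed (simplified) language tag: {tag!r}")
    return LanguageTag(
        language=match["language"],
        script=match["script"],
        region=match["region"],
        variants=tuple(v for v in match["variants"].split("-") if v),
        extensions=tuple(_EXTENSION.findall(match["extensions"])),
        private_use=match["private_use"],
    )
```

For example, `parse_tag("zh-Hant-TW")` yields language `zh`, script `Hant`, and region `TW`, while `parse_tag("sl-nedis")` yields language `sl` with the single variant `nedis`.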

Per the specification, the entire tag is case-insensitive, but most implementations follow these conventions:

Component	Description	Case
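`language`	Shortest ISO 639 code: the two-letter ISO 639-1 code, otherwise a three-letter ISO 639-2 or ISO 639-3 code	lowercase
`script`	ISO 15924 four-letter code for the script the language is written in	Titlecase
`region`	ISO 3166-1 alpha-2 country code or UN M.49 three-digit area code	UPPERCASE
`variant`	Variant subtags, must be registered in the IANA Language Subtag Registry	lowercase
`extension`	Extension subtags, introduced by a single-character singleton such as `u`	lowercase
`privateuse`	Private use subtags, introduced by `x`	lowercase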
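The casing conventions in the table can be applied by position and length alone, without a registry lookup. The sketch below follows the rule of thumb in RFC 5646, section 2.1.1: everything is lowercase except two-letter subtags (UPPERCASE) and four-letter subtags (Titlecase) that are not the first subtag and do not follow a singleton. The function name `format_tag` is hypothetical, and lowercasing everything after the first singleton is a simplifying assumption.

```python
def format_tag(tag: str) -> str:
    """Apply the conventional BCP 47 casing to a language tag."""
    formatted = []
    after_singleton = False  # set once an extension or private use singleton is seen
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and not after_singleton and len(subtag) == 4:
            # Four characters, not first: script code such as "Hant".
            formatted.append(subtag.capitalize())
        elif index > 0 and not after_singleton and len(subtag) == 2:
            # Two characters, not first: region code such as "TW".
            formatted.append(subtag.upper())
        else:
            formatted.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(formatted)
```

With this sketch, `format_tag("ZH-hant-tw")` returns `zh-Hant-TW`. Because tags are case-insensitive, this only changes presentation, not meaning.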

Examples

Tag	Description
`en`	English
`en-US`	English in the United States
`en-GB`	English in the United Kingdom
`zh-Hans`	Chinese in the simplified script
`zh-Hant`	Chinese in the traditional script
`zh-TW`	Chinese in Taiwan
`zh-Hant-TW`	Chinese in Taiwan in the traditional script
`zh-TW-tailo`	Chinese in Taiwan, written in the Taiwanese Romanization System (Tâi-lô, 臺灣台語羅馬字拼音方案)
`sl-nedis`	Natisone or Nadiza dialect of Slovenian

IETF BCP 47

BCP 47 combines two RFCs: RFC 5646, which defines the syntax of language tags, and RFC 4647, which defines the filtering and lookup algorithms for matching them. It is the most widely used reference for language tags.
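To illustrate the lookup algorithm from RFC 4647 (section 3.4), here is a minimal sketch: each requested range is progressively truncated from the right until it matches an available tag. The function name `lookup` and its parameters are hypothetical, and the sketch skips the wildcard range `*` instead of giving it special treatment.

```python
from typing import Iterable, Optional


def lookup(
    requested: Iterable[str],
    available: Iterable[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the best available tag for a priority list of language ranges."""
    # Match case-insensitively, but return the tag as the caller spelled it.
    by_lowercase = {tag.lower(): tag for tag in available}
    for language_range in requested:
        if language_range == "*":
            continue  # wildcard ranges are skipped in this sketch
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lowercase:
                return by_lowercase[candidate]
            subtags.pop()  # truncate the last subtag
            # A trailing single-character subtag ("x", "u", ...) is removed
            # together with the subtag it introduced.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default
```

For example, `lookup(["zh-Hant-TW"], ["en", "zh-Hant", "zh"])` tries `zh-hant-tw`, then `zh-hant`, and returns `zh-Hant`.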

The most common use of language tags is to set the language of your application. Once you set the language, you will likely use that to format dates, numbers, and currencies. That's where CLDR comes in.

CLDR

The Unicode Common Locale Data Repository (CLDR) publishes new versions twice a year. It is the most comprehensive and up-to-date source of locale data and is used by almost all major software.

You can view the data in the CLDR Survey Tool, or access source data from the CLDR GitHub repository.

For most practical use cases, you will use a library that wraps CLDR data rather than reading the data directly.

ICU Locale

The ICU project runs locale identifiers through a series of "canonicalization" steps, and its canonical form separates subtags with underscores (`_`) rather than hyphens (`-`). For example, `en-US` becomes `en_US`.

Some other libraries do the same, so your mileage may vary. Whenever possible, store and exchange tags in the RFC 5646 format (with hyphens) to avoid interoperability issues.
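When a library hands back underscore-separated identifiers, a small conversion at the boundary keeps the rest of the application on the RFC format. The sketch below is an assumption-laden illustration, not ICU's own conversion: the function names are hypothetical, and identifiers carrying ICU keywords (such as `de_DE@collation=phonebook`) or a POSIX-style encoding suffix are rejected rather than translated.

```python
def to_bcp47(locale_id: str) -> str:
    """Convert a simple underscore-separated locale ID to hyphenated form."""
    if "@" in locale_id or "." in locale_id:
        # Keywords and encodings need real mapping rules; do not guess here.
        raise ValueError(f"unsupported locale identifier: {locale_id!r}")
    return locale_id.replace("_", "-")


def to_underscore(tag: str) -> str:
    """Convert a hyphenated language tag to underscore-separated form."""
    return tag.replace("-", "_")
```

For example, `to_bcp47("en_US")` returns `en-US`, and `to_underscore("zh-Hant-TW")` returns `zh_Hant_TW`. For anything beyond simple identifiers, prefer the conversion functions of the library you are using.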
