đŸ—Ŗī¸ Language & Locale

Various specifications define how to represent languages and locales in software. This document aims to provide a high-level overview of these specifications and how to use them.

TL;DR:

  1. Find a library that wraps CLDR data and use it to format dates, numbers, and currencies.
  2. There is only one standard for language tags: BCP 47, but you have to decide which subtags to use.
  3. When in doubt, language-region (for example, en-US) is usually a safe bet.

ISO 639

ISO 639 identifies the language only; it carries no region information.

  • ISO 639-1: two-letter codes
  • ISO 639-2: three-letter codes
  • ISO 639-3: three-letter codes for individual languages; a macrolanguage such as Chinese covers several individual-language codes

Examples

| Language | ISO 639-1 | ISO 639-2 | ISO 639-3            |
|----------|-----------|-----------|----------------------|
| English  | en        | eng       | eng                  |
| Chinese  | zh        | zho       | zho + 16 other codes |

The most commonly used is ISO 639-1.

RFC 5646

RFC 5646 loosely defines the structure of language tags as follows:

{language}[-{script}][-{region}][-{variant}][-{extension}][-{privateuse}]

Besides language, all other components are optional.

Per specification, the entire tag is case-insensitive, but most implementations use the following conventions:

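| Component  | Description                                                              | Case      |
|------------|--------------------------------------------------------------------------|-----------|
| language   | ISO 639-1 code, or ISO 639-2 code if no ISO 639-1 code exists            | lowercase |
| script     | ISO 15924 code for the script of the language                            | Titlecase |
| region     | ISO 3166-1 alpha-2 code (or UN M.49 numeric code)                        | UPPERCASE |
| variant    | Variant subtags, must be registered in the IANA Language Subtag Registry | lowercase |
| extension  | Extension subtags                                                        | lowercase |
| privateuse | Private use subtags                                                      | lowercase |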
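
The casing conventions above can be applied mechanically. The following is a minimal sketch, not a validator: `normalize_case` is a hypothetical helper name, and it assumes the input is already a well-formed tag. It uses a position-based rule: everything is lowercase, except that a four-letter subtag becomes Titlecase and a two-letter subtag becomes UPPERCASE when it is neither the first subtag nor located after a singleton (the single-character subtag that introduces an extension or private-use sequence).

```python
def normalize_case(tag: str) -> str:
    """Apply conventional casing to a well-formed language tag (no validation)."""
    result = []
    after_singleton = False  # True once an extension/private-use sequence starts
    for index, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        if len(lowered) == 1:
            after_singleton = True
        if index == 0 or after_singleton:
            # Language subtag, singletons, and everything after a singleton.
            result.append(lowered)
        elif len(lowered) == 4 and lowered.isalpha():
            # Four letters in a later position: treated as a script subtag.
            result.append(lowered.title())
        elif len(lowered) == 2 and lowered.isalpha():
            # Two letters in a later position: treated as a region subtag.
            result.append(lowered.upper())
        else:
            # Variants and numeric region codes stay lowercase.
            result.append(lowered)
    return "-".join(result)


print(normalize_case("ZH-hant-tw"))          # zh-Hant-TW
print(normalize_case("en-us-U-CA-Gregory"))  # en-US-u-ca-gregory
```

Because the rule depends only on subtag length and position, it does not check that a subtag is actually registered; a real application would typically rely on a library for that.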

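Examples

| Tag         | Description                                                             |
|-------------|-------------------------------------------------------------------------|
| en          | English                                                                 |
| en-US       | English in the United States                                            |
| en-GB       | English in the United Kingdom                                           |
| zh-Hans     | Chinese in the simplified script                                        |
| zh-Hant     | Chinese in the traditional script                                       |
| zh-TW       | Chinese in Taiwan                                                       |
| zh-Hant-TW  | Chinese in Taiwan in the traditional script                             |
| zh-TW-tailo | Taiwanese Hokkien Romanization System (Tâi-lô; 臺灣台語羅馬字拼音方案)   |
| sl-nedis    | Natisone or Nadiza dialect of Slovenian                                 |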
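
To see how these tags decompose into the components listed earlier, here is a rough sketch of a parser. It is deliberately simplified: `parse_tag` and the pattern are illustrative names, and the regular expression ignores extended language subtags, grandfathered tags, and registry validation, so it should not be treated as a complete RFC 5646 implementation.

```python
import re
from typing import Optional

# Simplified pattern: language, then optional script, region, variants,
# extensions, and private use, in that order.
_TAG = re.compile(
    r"""^
    (?P<language>[a-z]{2,3})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
    $""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_tag(tag: str) -> Optional[dict]:
    """Split a language tag into components, or return None if it does not match."""
    match = _TAG.match(tag)
    if match is None:
        return None
    return {
        "language": match["language"],
        "script": match["script"],
        "region": match["region"],
        "variants": [v for v in match["variants"].split("-") if v],
        "extensions": match["extensions"].lstrip("-") or None,
        "privateuse": match["privateuse"],
    }


for example in ["en", "zh-Hant-TW", "zh-TW-tailo", "sl-nedis"]:
    print(example, parse_tag(example))
```

Note that `sl-nedis` parses as a language plus a variant, not a script or region, because script subtags are exactly four letters and region subtags are exactly two letters or three digits.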

IETF BCP 47

BCP 47 combines RFC 5646, which defines the syntax of language tags, with RFC 4647, which defines the matching and lookup algorithms for them. It is the most widely used reference for language tags.
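
As an illustration of the lookup idea in RFC 4647, the sketch below progressively truncates each requested tag from the end until it finds a supported tag. The function name `lookup` and its parameters are assumptions made for this example, and it only handles the basic cases (case-insensitive comparison, truncation, dropping a trailing singleton together with the subtag that followed it, and skipping a bare `*` range).

```python
from typing import Iterable, Optional


def lookup(priority_list: Iterable[str],
           available: Iterable[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the best supported tag for the user's ranges, or the default."""
    # Compare case-insensitively but return the tag as the application spelled it.
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # a bare wildcard carries no information for lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A trailing singleton (such as "x" or "u") is removed along with
            # the subtag that followed it.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


supported = ["en", "zh-Hant", "zh"]
print(lookup(["zh-Hant-TW"], supported))         # zh-Hant
print(lookup(["fr-CA", "en-GB"], supported))     # en
print(lookup(["ja"], supported, default="en"))   # en
```

A request for zh-Hant-TW falls back to zh-Hant before zh, which is why choosing consistent subtags for your supported locales matters.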

CLDR

The most common use of language tags is to set the language of your application. Once you set the language, you will likely use that to format dates, numbers, and currencies. That's where CLDR comes in.

CLDR (the Unicode Common Locale Data Repository) publishes new versions twice a year. It is the most comprehensive and up-to-date source of locale data, and almost all major software uses it.

You can view the data in the CLDR Survey Tool, or access source data from the CLDR GitHub repository.

For most actual use cases, you will use a library that wraps CLDR data, such as:

ICU Locale

The ICU project applies a series of "canonicalization" steps to convert a locale identifier into a canonical form. One visible effect is that hyphens (-) are replaced with underscores (_), e.g. en-US becomes en_US.

Some other libraries may do the same, so your mileage may vary. Whenever possible, use the hyphenated RFC 5646 format to avoid interoperability issues.
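
If you receive underscore-separated identifiers and need the hyphenated form, a small guard at the boundary can help. The helper below is a hypothetical sketch, not part of ICU: it only swaps separators for simple identifiers and refuses ICU-style identifiers that carry keywords after an `@` sign, since those need a proper conversion by a locale library.

```python
def icu_to_hyphenated(locale_id: str) -> str:
    """Convert a simple underscore-separated locale ID (en_US) to en-US."""
    if "@" in locale_id:
        # Keywords such as "@collation=phonebook" cannot be handled by a
        # separator swap; leave those to a locale library.
        raise ValueError(f"cannot convert keyword-bearing identifier: {locale_id!r}")
    return locale_id.replace("_", "-")


print(icu_to_hyphenated("en_US"))       # en-US
print(icu_to_hyphenated("zh_Hant_TW"))  # zh-Hant-TW
```

Raising on keyword-bearing identifiers is a deliberate choice here: failing loudly is usually preferable to silently producing a tag that looks valid but drops information.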