Skip to content

elixir-unicode/unicode_idna

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Unicode IDNA

Pure-Elixir implementation of UTS #46 Unicode IDNA Compatibility Processing, with RFC 3492 Punycode encoding/decoding, the RFC 5893 bidi rule, and CONTEXTJ joiner rules from RFC 5892.

The library converts domain names between their Unicode and ASCII (Punycode xn-- prefixed) representations and validates that each label conforms to IDNA 2008 as relaxed by UTS #46. It passes the full IdnaTestV2.txt conformance suite — 6,389 rows × 3 operations = 19,167 assertions — for Unicode 17.0.

Installation

def deps do
  [
    {:unicode_idna, "~> 0.1"}
  ]
end

Usage

Unicode.IDNA.to_ascii/2 and Unicode.IDNA.to_unicode/2 accept either a full domain name as a String.t or a list of labels. The return value mirrors the input shape: a string in returns a string out (labels are rejoined with .); a list in returns the list of processed labels.

# String in / string out — full domain
iex> Unicode.IDNA.to_ascii("bücher.de")
{:ok, "xn--bcher-kva.de"}

iex> Unicode.IDNA.to_unicode("xn--bcher-kva.de")
{:ok, "bücher.de"}

# Alternate IDNA label separators are recognised
iex> Unicode.IDNA.to_ascii("中文。中国")
{:ok, "xn--fiq228c.xn--fiqs8s"}

# List in / list out — already-split labels
iex> Unicode.IDNA.to_ascii(["bücher", "de"])
{:ok, ["xn--bcher-kva", "de"]}

iex> Unicode.IDNA.to_unicode(["xn--bcher-kva", "de"])
{:ok, ["bücher", "de"]}

# Errors
iex> Unicode.IDNA.to_ascii("not_valid")
{:error, :disallowed}

iex> Unicode.IDNA.to_ascii("not_valid", use_std3_ascii_rules: false)
{:ok, "not_valid"}

Public API

  • Unicode.IDNA.to_ascii/2 — UTS #46 ToASCII. Accepts a t:String.t/0 or a list of label strings; returns the same shape.

  • Unicode.IDNA.to_unicode/2 — UTS #46 ToUnicode, with the same dual-shape semantics.

  • Unicode.IDNA.valid_label?/2 — predicate for a single label, equivalent to match?({:ok, _}, to_ascii([label], options)).

  • Unicode.IDNA.Punycode.encode/1 and Unicode.IDNA.Punycode.decode/1 — RFC 3492 primitives.

  • Unicode.IDNA.Bidi.validate/1 and validate_in_bidi_domain/1 — RFC 5893 bidi rule.

  • Unicode.IDNA.Context.validate/1 — RFC 5892 Appendix A CONTEXTJ rules for ZWJ / ZWNJ.

Options

Option Default Meaning
:transitional false UTS #46 transitional vs. non-transitional processing. The default false matches modern browsers (Chrome, Firefox, Safari).
:check_hyphens true Reject leading/trailing hyphens and -- at positions 3–4 (the latter is suppressed for labels that came from a Punycode-decoded ACE form).
:check_bidi true If any label contains a right-to-left character, every label in the domain must satisfy the RFC 5893 bidi rule.
:check_joiners true Labels containing ZWJ (U+200D) or ZWNJ (U+200C) must satisfy the CONTEXTJ rules of RFC 5892 Appendix A.
:use_std3_ascii_rules true Restrict ASCII characters in a label to letters, digits and hyphen. Set false to allow _ and other STD3-disallowed ASCII (e.g. for Twitter-style permissive subdomain rules).
:verify_dns_length true Reject empty labels, labels longer than 63 octets, and full domains longer than 253 octets per RFC 1035.

Refreshing Unicode data

mix unicode_idna.download

This refreshes data/idna_mapping_table.txt and data/idna_test_v2.txt (the conformance vectors) from unicode.org. The bundled files are committed to source control; the task exists to make version bumps reproducible.

About

Unicode IDNA implementation conformant with UTS #46

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Languages