{utfnormalize:text}

Description

Normalises a UTF-8 string to Unicode Normalization Form C (NFC). A base letter followed by combining diacritical marks is composed into the single precomposed character where one exists, so canonically equivalent strings that look the same but differ in their bytes become identical. For example, the letter a followed by a combining acute accent (U+0301) becomes the single character a-acute (U+00E1). Plain ASCII text and text that is already in NFC pass through unchanged.

Use it to canonicalise text before comparing, deduplicating, sorting, or storing it, so that visually equal strings match. The form is always C; there is no parameter to select a different one. Internally it calls PHP's Normalizer::normalize on the input.

Parameters

text (required; default: empty string)

The UTF-8 string to normalise. Returned in Normalization Form C, with base-letter-plus-combining-mark sequences composed into their single precomposed characters. ASCII text and text already in NFC are returned unchanged; an empty argument returns an empty string.
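
The parameter's behaviour can be tried outside the template engine. The following is a minimal Python sketch, not the engine's own code: utf_normalize is a hypothetical stand-in name, and the sketch assumes that Python's unicodedata.normalize with form NFC gives the same result as the PHP Normalizer::normalize call mentioned in the description.

```python
import unicodedata


def utf_normalize(text: str = "") -> str:
    """Hypothetical stand-in for {utfnormalize:text}: always applies NFC."""
    return unicodedata.normalize("NFC", text)


# 'a' followed by U+0301 COMBINING ACUTE ACCENT (two code points).
decomposed = "a\u0301"
composed = utf_normalize(decomposed)

print(len(decomposed), len(composed))                  # 2 1
print(composed == "\u00e1")                            # True: precomposed a-acute
print(utf_normalize("Hello World") == "Hello World")   # True: ASCII is unchanged
print(utf_normalize() == "")                           # True: empty in, empty out
```

The input is written with escape sequences so that the decomposed form survives copying and pasting.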

Examples

test[{utfnormalize:Hello World}]

Expected[Hello World]

Actual[Hello World]

Plain ASCII text contains no combining marks, so normalisation returns it byte-for-byte. The square brackets are literal text added to make the boundaries of the output visible.

test[{utfnormalize:}]

Expected[]

Actual[]

With no argument the result is the empty string. The brackets show there is nothing between them.

test[{utfnormalize:user@example.org}]

Expected[user@example.org]

Actual[user@example.org]

A typical ASCII value such as an e-mail address is already in Normalization Form C and comes back unchanged. Normalising both values before comparison ensures that strings which differ only in composed versus decomposed form still match.

test[{utfnormalize:{utfnormalize:Praha 1}}]

Expected[Praha 1]

Actual[Praha 1]

Normalisation is idempotent: a string that is already in NFC is unaffected by a second pass, so nesting the command is safe.
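
The idempotence claim can be checked with a short Python sketch. The helper names nfc and is_stable_after_one_pass are hypothetical, and Python's unicodedata module is assumed to behave like the NFC normalisation that utfnormalize performs.

```python
import unicodedata


def nfc(text: str) -> str:
    """Hypothetical helper: apply NFC, as utfnormalize is described to do."""
    return unicodedata.normalize("NFC", text)


def is_stable_after_one_pass(text: str) -> bool:
    """True when a second NFC pass leaves the first pass's output untouched."""
    once = nfc(text)
    return nfc(once) == once


# ASCII text, two decomposed inputs, and the empty string.
samples = ["Praha 1", "a\u0301", "Plzen\u030c", ""]
print(all(is_stable_after_one_pass(s) for s in samples))  # True
```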

virtual[{utfnormalize:Mnichov}]

Expected(the composed value; e.g. a base letter a plus a combining acute accent becomes the single character a-acute. An already-canonical word like Mnichov is returned unchanged.)

Actual[Mnichov]

Composition is the whole point: when the input is a base letter followed by a separate combining accent (for example a followed by a combining acute), utfnormalize returns the single precomposed letter (a-acute), so the result matches the same word typed normally. This case is illustrative only: a stored example is composed in transit before it is saved, so it cannot carry decomposed input, and the displayed run shows an already-canonical literal returned unchanged. In a live template the difference is visible whenever the text arrives decomposed, for example from an external import or from an operating system that produces decomposed text.
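
Because a stored example cannot hold decomposed input, the effect is easier to see by listing code points. The Python sketch below is only an illustration: the place name Plzeň and the helper code_points are chosen for the example and are not part of the template engine.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List each character as U+XXXX so invisible differences become visible."""
    return [f"U+{ord(ch):04X}" for ch in text]


# 'n' followed by U+030C COMBINING CARON, as decomposed input might arrive.
decomposed = "Plzen\u030c"
composed = unicodedata.normalize("NFC", decomposed)

print(code_points(decomposed))  # ends with 'U+006E', 'U+030C'
print(code_points(composed))    # ends with 'U+0148' (n with caron)
print(len(decomposed.encode("utf-8")), len(composed.encode("utf-8")))  # 7 6
```

Both strings are displayed identically, yet they differ in code point count and in UTF-8 byte length until they are normalised.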

test {ifeq:{utfnormalize:Mnichov}:{utfnormalize:Mnichov}:match:differ}

Expected match

Actual match

The canonical use: normalise both sides before comparing, so two values that look identical but differ only in byte form (a precomposed letter versus a base letter plus a separate combining mark) are treated as equal. Wrap user-entered or imported text in utfnormalize on both sides of any comparison to avoid false mismatches. The realistic case needs a value carrying combining marks, which a stored example cannot hold, so this is illustrative; with both sides already canonical it reports match.
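
The same pattern, normalising both sides and then comparing, can be sketched in Python with a value that does carry a combining mark. The function same_text and the sample values are hypothetical; the sketch assumes only the standard unicodedata module.

```python
import unicodedata


def same_text(left: str, right: str) -> bool:
    """Compare two strings after normalising both sides to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


typed = "Plze\u0148"        # precomposed n with caron
imported = "Plzen\u030c"    # 'n' plus U+030C COMBINING CARON

print(typed == imported)            # False: the raw strings differ
print(same_text(typed, imported))   # True: equal once both are in NFC
print(same_text("\ufb01", "fi"))    # False: NFC keeps the fi ligature as it is
```

The last line marks the limit of form C: it unifies canonically equivalent sequences only, so compatibility characters such as ligatures still compare as different.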
