(Sec:collator)=
Collation
=========



> This tutorial is based on the [paper on *stringi*](https://dx.doi.org/10.18637/jss.v103.i02) that was published the *Journal of Statistical Software*; see {cite}`stringi`. To learn more about R, check out Marek's recent open-access (free!) textbook [*Deep R Programming*](https://deepr.gagolewski.com/) {cite}`deepr`.


Historically, the code-pointwise comparison had been used in most string
comparison activities, especially when strings in ASCII (i.e., English)
were involved. However, nowadays, this does not necessarily constitute
the most suitable approach to processing natural-language texts.
In particular, a code-pointwise matching neither takes accented and
conjoined letters nor ignorable punctuation and case into account.

The [*ICU* Collation Service](https://unicode-org.github.io/icu/userguide/collation/) provides the basis for string
comparison activities such as string sorting and searching, or
determining if two strings are equivalent. This time, though, due to its
conformance to the [Unicode Collation Algorithm](https://www.unicode.org/reports/tr10/), we
may expect that the generated results will meet the requirements of the
culturally correct natural language processing in any locale.


Locales
-------

String collation is amongst many locale-sensitive operations available
in *stringi*. However, before proceeding any further, we should first
discuss how
we can parameterise the *ICU* services to deliver the results that
reflect the expectations of a specific user community, such as the
speakers of different languages and their various regional variants.

### Specifying Locales

A locale specifier[^footlocale] is of the form `"Language"`,
`"Language_Country"`, or `"Language_Country_Variant"`, where:

* `Language` is, most frequently, a two- or three-letter code that conforms to
    the ISO-639-1 or ISO-630-2 standard, respectively;
    e.g., `"en"` or `"eng"` for English, `"es"` or `"spa"`
    for Spanish, `"zh"` or `"zho"` for Chinese, and `"mas"` for Masai
    (which lacks the corresponding two-letter code);
    however, more specific language identifiers may also be
    available, e.g., `"zh_Hans"` for Simplified-
    and `"zh_Hant"` for Traditional-Chinese
    or `"sr_Cyrl"` for Cyrillic- and `"sr_Latn"` for Latin-Serbian;

* `Country` is a two-letter code following the ISO-3166 standard
    that enables different language conventions within the same language;
    e.g., the US-English (`"en_US"`) and Australian-English (`"en_AU"`)
    not only observe some differences in spelling and vocabulary but also
    in the units of measurement;

* `Variant` is an identifier indicating a preference towards
    some convention within the same country; e.g., `"de_DE_PREEURO"`
    formats currency values using the pre-2002 Deutsche Mark (DEM).

Moreover, following the "`@`" symbol, semicolon-separated "`key=value`"
pairs can be appended to the locale specifier, in order to customise
some locale-sensitive services even further (see below for an example
using "`@collation=phonebook`" and
{ref}`Sec:datetime` for "`@calendar=hebrew`", amongst others).


[^footlocale]: Locale specifiers in *ICU* are platform-independent. This is not
    the case for their base-R counterparts, see `help("locales")`, e.g.,
    we have `"Polish_Poland"` on Windows vs `"pl_PL"` on Linux.


### Listing Locales

To list the available locale identifiers, we call `stri_locale_list()`.


```r
length(stri_locale_list())
## [1] 784
```

As the number of supported locales is very high, here we'll display
only 5 randomly chosen ones:


```r
sample(stri_locale_list(), 5)
## [1] "nl_CW"      "pt_CH"      "ff_Latn_SL" "en_PH"      "en_HK"
```

### Querying for Locale-Specific Services

The availability of locale-specific services can only be determined
during the request for a particular resource
(for more details, see the [*ICU* User Guide on Locales](https://unicode-org.github.io/icu/userguide/locale/)), which may depend on
the *ICU* library version actually in use as well as the way the *ICU*
Data Library (*icudt*) has been packaged. Therefore, for maximum
portability, it is best to rely on the *ICU* library bundle
shipped with *stringi*. This is the case on Windows and macOS, whose
users typically download the pre-compiled versions of the package from
CRAN. However, on various flavours of GNU/Linux and other Unix-based
systems, the system *ICU* is used more eagerly[^footsystemicu]. To force building
*ICU* from sources, we may call:


```r
install.packages("stringi", configure.args="--disable-pkg-config")
```

Overall, if a requested service is unavailable in a given locale, the
best possible match is returned.


[^footsystemicu]: See, e.g., software packages `libicu-dev` on Debian/Ubuntu or
    `libicu-devel` on RHL/Fedora/OpenSUSE. For more details regarding
    the configure/build process of *stringi*, refer to the `INSTALL`
    file.

### Default Locale

Each locale-sensitive operation in *stringi* selects the current default
locale if no locale has been explicitly requested, i.e., when a
function's `locale` argument is left alone in its "`NULL`" state. The
default locale is initially set to match the system locale on the
current platform, and may be changed with `stri_locale_set()`, e.g., in
the sporadic case of improper automatic locale detection.

As we have stated in the introduction, in this paper we use:


```r
stri_locale_get()
## [1] "en_AU"
```

i.e., the Australian-English locale (which formats dates like "29
September 2021" and uses metric units of measurement).

(Sec:Equivalence)=
Testing String Equivalence
--------------------------

In Unicode, some characters may have multiple representations. For
instance, "LATIN SMALL LETTER A WITH OGONEK" ("ą") can be stored as a
single code point U+0105 or as a sequence that is comprised of the
letter "LATIN SMALL LETTER A", U+0061, and the "COMBINING OGONEK",
U+0328 (when rendered properly, they should appear as if they were
identical glyphs). This is an example of canonical equivalence of
strings.

Testing for the Unicode equivalence between strings can be performed by
calling `%s==%` and, more generally, `stri_cmp_equiv()`, or their
negated versions, `%s!=%` and `stri_cmp_nequiv()`.

In the example below, we have: a followed by ogonek (two code points) vs
a with ogonek (single code point).


```r
"a\u0328" %s==% "ą"             # a, ogonek == a with ogonek
## [1] TRUE
```

There are also functions for removing and indicating duplicated elements
in a character vector:


```r
x <- c("Gągolewski", "Gagolewski", "Ga\u0328golewski")
stri_unique(x)
## [1] "Gągolewski" "Gagolewski"
stri_duplicated(x)  # from_last=FALSE
## [1] FALSE FALSE  TRUE
```

Moreover, `stri_duplicated_any()` returns the index of the first
non-unique element.

Linear Ordering of Strings
--------------------------

Operators such as `%s<%`, `%<=%`, etc., and the corresponding
functions `stri_cmp_lt()` ("less than"), `stri_cmp_le()` ("less than or
equal"), etc., implement locale-sensitive linear orderings of strings.
Moreover, `stri_sort()` returns the lexicographically-sorted version of
a given input vector, `stri_order()` yields the corresponding (stable)
ordering permutation, and `stri_rank()` ranks strings within a vector.

For instance, here's a comparison in the current default locale
(Australian-English):


```r
"chaotic" %s<% "hard"  # c < h
## [1] TRUE
```

Similar comparison in Polish:


```r
stri_cmp_lt("chłodny", "hardy", locale="pl_PL")  # c < h
## [1] TRUE
```

And now for something completely different -- the Slovak language:


```r
stri_cmp_lt("chladný", "hladný", locale="sk_SK") # ch > h
## [1] FALSE
```

This is an example of the locale-aware comparison that is
context-sensitive and which goes beyond the simple code-pointwise
comparison. In the example above, a contraction occurred: in Slovak, two
code points "`ch`" are treated as a single entity and are sorted after
"`h`":

Compare the ordering of Polish and Slovak words:


```r
stri_sort(c("chłodny", "hardy", "cichy", "cenny"), locale="pl_PL")
## [1] "cenny"   "chłodny" "cichy"   "hardy"
stri_sort(c("cudný", "chladný", "hladný", "čudný"), locale="sk_SK")
## [1] "cudný"   "čudný"   "hladný"  "chladný"
```

An opposite situation is called an expansion:


```r
german_k_words <- c("können", "kondensieren", "kochen", "korrelieren")
stri_sort(german_k_words, locale="de_DE")
## [1] "kochen"       "kondensieren" "können"       "korrelieren"
stri_sort(german_k_words, locale="de_DE@collation=phonebook")
## [1] "kochen"       "können"       "kondensieren" "korrelieren"
```

In the latter example, where we used the German phone-book order, `"ö"`
is treated as `"oe"`.

(Sec:collator_opts)=
Collator Options
----------------

The table below lists the options that can be passed to
`stri_opts_collator()` via the dot-dot-dot parameter, "`...`", in all
the functions that rely on the *ICU* Collator.
Below we play with some of them.


| Option                | Purpose
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `locale`              | a string specifying the locale to use; `NULL` (default) or `""` for the current default locale as indicated by `stri_locale_get()`                                                                                                                                                                                                                                       |
| `strength`            | an integer in *{1,2,3,4}* defining collation strength; 1 for the most permissive collation rules, 4 for the strictest ones; defaults to 3                                                                                                                                                                                                                              |
| `uppercase_first`     | logical; `NA` (default) orders upper and lower case letters in accordance to their tertiary weights, `TRUE` forces upper case letters to sort before lower case letters, `FALSE` does the opposite                                                                                                                                                                       |
| `numeric`             | logical; if `TRUE`, a collation key for the numeric value of substrings of digits is generated; this is a way to make `"100"` ordered after `"2"`; however, negative numbers are not ordered correctly; defaults to `FALSE`                                                                                                                                                                                                   |
| `case_level`          | logical; if `TRUE`, an extra case level (positioned before the third level) is generated; defaults to `FALSE`                                                                                                                                                                                                                                                            |
| `normalisation`       | logical; if `TRUE`, then an incremental check is performed to see whether input data are in the FCD ("fast C or D") form; if data are not in the FCD form, the incremental NFD normalisation is performed, see {ref}`Sec:normalisation`; defaults to `FALSE`                                                                                                             |
| `alternate_shifted`   | logical; if `FALSE` (default), all code points with non-ignorable primary weights are handled in the same way; `TRUE` causes the code points with primary weights that are less than or equal to the variable top value to be ignored on the primary level and moved to the quaternary level; this can be used to, e.g., ignore punctuation, see the examples provided   |
| `french`              | logical; `TRUE` results in secondary weights being considered backwards, i.e., ordering according to the last accent difference -- nowadays only used in Canadian-French; defaults to `FALSE`                                                                                                                                                                            |




### Collation Strength

The [Unicode Collation Algorithm](https://www.unicode.org/reports/tr10/)
can go beyond simple
canonical equivalence: it can treat some other (depending on the
context) differences as negligible too.

The `strength` option controls the Collator's "attention to detail". For
instance, it can be used to make the ligature "ff" (U+FB00) compare
equal to the two-letter sequence "ff":


```r
stri_cmp_equiv("\ufb00", "ff", strength=2)
## [1] TRUE
```

which is not the case at the default strength level (3).

Generally, four (nested) levels of inter-string differences can be
distinguished:

1.  A primary difference -- the strongest one -- occurs where there is a
    mismatch between base characters (e.g., `"a"` vs `"b"`).

2.  Some character accents can be considered a secondary difference in
    many languages. However, in other ones, an accented letter is
    considered a unique letter.

3.  Distinguishing between upper- and lower case typically happens on
    the tertiary level (see, however, the `case_level` option).

4.  If `alternate_shifted` is `TRUE`, differences in punctuation can be
    determined at the quaternary level. This is also meaningful in the
    processing of Hiragana text.


### Ignoring Case

Note which strings are deemed equivalent when
considering different collation strengths:


```r
x <- c("gro\u00df", "gross", "GROSS", "Gro\u00df", "Gross", "GRO\u1e9e")
stri_unique(x, strength=1)                  # ß == ss, case insensitive
## [1] "groß"
stri_unique(x, strength=2)                  # ß != ss, case insensitive
## [1] "groß"  "gross"
```

Hence, strength equal to 1 takes only primary differences into account.
Strength of 2 will also be sensitive to secondary differences
(distinguishes between "ß" and "ss" above), but will ignore tertiary
differences (case).

Also, introducing an extra case level yields a case sensitive comparison
that ignores secondary differences:


```r
stri_unique(x, strength=1, case_level=TRUE)  # ß == ss, case sensitive
## [1] "groß"  "GROSS" "Groß"
```



### Ignoring Some Punctuation

Here are some effects of changing the `alternate_shifted` option, which
allows for ignoring some punctuation marks:


```r
x <- c("code point", "code-point", "codepoint", "CODE POINT", "CodePoint")
stri_unique(x, alternate_shifted=TRUE)  # strength=3
## [1] "code point" "CODE POINT" "CodePoint"
```

Here, when `strength = 3` is used (the default), punctuation
differences are ignored, but the letter case is deemed significant.


```r
stri_unique(x, alternate_shifted=TRUE, strength=2)
## [1] "code point"
```

In this case, all strings are considered equivalent.

Ignoring case but not punctuation yields:


```r
stri_unique(x, strength=2)
## [1] "code point" "code-point" "codepoint"
```

### Backward Secondary Sorting

The French Canadian Sorting Standard CAN/CSA Z243.4.1 (historically, this
had been the default for all French locales) requires the word ordering
with respect to the last accent difference. Such a behaviour can be
applied either by setting the French-Canadian locale, or by passing the
`french=TRUE` option to the Collator.


```r
stri_sort(c("cote", "côte", "coté", "côté"), locale="fr_FR")
## [1] "cote" "coté" "côte" "côté"
stri_sort(c("cote", "côte", "coté", "côté"), locale="fr_CA") # french=TRUE
## [1] "cote" "côte" "coté" "côté"
```

### Sorting Numerals

By default, just like in base R and most other programming languages, a
lexicographic ordering is used: the corresponding code points are being
compared one by one, from left to right, and once a difference is
detected, the result is returned immediately.


```r
x <- c("a1", "a2", "a11", "a1", "a99", "a10", "a100", "a2", "a9", "a2")
stri_sort(x)
##  [1] "a1"   "a1"   "a10"  "a100" "a11"  "a2"   "a2"   "a2"   "a9"   "a99"
```

For example, `"a99"` is ordered after `"a100"`, because `"a" == "a"`
(first characters are equal) but then `"9" > "1"` (second characters are
already different).

Let us, however, note the effect of setting the `numeric` option on the
sorting of strings that involves numbers:


```r
stri_sort(x, numeric=TRUE)
##  [1] "a1"   "a1"   "a2"   "a2"   "a2"   "a9"   "a10"  "a11"  "a99"  "a100"
```

Here is an example of sorting a data frame with respect to two criteria:


```r
X <- data.frame(a=x, b=runif(length(x)))
X[order(-stri_rank(X$a, numeric=TRUE), X$b), ]
##       a        b
## 7  a100 0.528105
## 5   a99 0.940467
## 3   a11 0.408977
## 6   a10 0.045556
## 9    a9 0.551435
## 10   a2 0.456615
## 2    a2 0.788305
## 8    a2 0.892419
## 1    a1 0.287578
## 4    a1 0.883017
```

The object is now ordered by the first column decreasingly (using a
"numeric" order) and ties are resolved based on increasing values in the
second column.



### A Note on Compatibility Equivalence

In {ref}`Sec:normalisation` we describe different ways to normalise
canonically equivalent code point sequences so that they are represented
by the same code points, which can account for some negligible
differences (as in the "a with ogonek" example above).

Apart from ignoring punctuation and case, the
[Unicode Standard Annex \#15](https://www.unicode.org/reports/tr15/)
also discusses the so-called compatibility
equivalence of strings. This is a looser form of similarity; it is
observed where there is the same abstract content, yet displayed by
means of different glyphs, for instance "¼" (U+00BC) vs "`1/4`" or
"ℝ" vs "R". In the latter case, whether these should be
treated as equal, depends on the context (e.g., this can be the set of
real numbers vs one's favourite programming language). Compatibility
decompositions (NFKC, NFKD) mentioned in the section on
{ref}`Sec:normalisation` or other types of transliteration can be
used to normalise strings so that such differences are not accounted
for.

Also, for "fuzzy" matching of strings,
the [*stringdist*](https://CRAN.R-project.org/package=stringdist) package
might be utile.



Searching for Fixed Strings Revisited
-------------------------------------

The *ICU* Collator can also be utilised where there is a need to locate
the occurrences of simple textual patterns. The counterparts of the
string search functions described in the section on
{ref}`Sec:fixed` have
their names ending with `*_coll()`. Albeit slower than the `*_fixed()`
functions, they are more appropriate in natural language
processing activities.

For instance:


```r
stri_detect_coll("Er ist so groß.", "GROSS", strength=1, locale="de_AT")
## [1] TRUE
stri_detect_coll("On je chladný", "chladny", strength=1, locale="sk_SK")
## [1] TRUE
```
