> Documentation for Natural Language Processing (NLP) functions

# Natural Language Processing (NLP) functions

export const CloudNotSupportedBadge = () => {
  return <div className="cloudNotSupportedBadge">
            <div className="cloudNotSupportedIcon">
            <svg width="16" height="16" viewBox="0 0 16 16" fill="none" xmlns="http://www.w3.org/2000/svg">
                <path strokeWidth="1.5" d="M6.33366 12.6666L12.3739 12.6667C13.6593 12.6667 14.7073 11.6187 14.7073 10.3334C14.7073 9.04804 13.6593 8.00003 12.3739 8.00003C12.3739 8.00003 12.3337 7.66659 12.0003 7.33325M10.667 5.33322C8.00033 2.33325 4.45395 4.78537 4.14195 6.68203C2.55728 6.7627 1.29395 8.06203 1.29395 9.6667C1.29395 11.3234 2.66699 12.6666 4.00033 12.6666" stroke="currentColor" strokeLinecap="round" strokeLinejoin="round" />
                <path strokeWidth="1.5" d="M2.66699 14L12.0003 4.66663" stroke="currentColor" strokeLinecap="round" strokeLinejoin="round" />
            </svg>

        </div>
            Not supported in ClickHouse Cloud
        </div>;
};

export const ExperimentalBadge = () => {
  return <div className="experimentalBadge">
            <div className="experimentalIcon">
            <svg width="16" height="16" viewBox="0 0 16 16" fill="none" xmlns="http://www.w3.org/2000/svg">
                <path strokeWidth="1.25" d="M5.5 2H10.5" stroke="currentColor" strokeLinecap="round" strokeLinejoin="round" />
                <path strokeWidth="1.25" d="M9.50015 2V6.19625L13.4283 12.7425C13.4738 12.8183 13.4985 12.9049 13.4996 12.9934C13.5008 13.0818 13.4785 13.169 13.435 13.246C13.3914 13.323 13.3283 13.3871 13.2519 13.4317C13.1755 13.4764 13.0886 13.4999 13.0002 13.5H3.00015C2.91164 13.5 2.8247 13.4766 2.74822 13.432C2.67174 13.3874 2.60847 13.3233 2.56487 13.2463C2.52126 13.1693 2.49889 13.082 2.50004 12.9935C2.50119 12.905 2.52582 12.8184 2.5714 12.7425L6.50015 6.19625V2" stroke="currentColor" strokeLinecap="round" strokeLinejoin="round" />
                <path strokeWidth="1.25" d="M4.47656 9.56754C5.30344 9.41254 6.47656 9.47942 7.99969 10.25C10.0153 11.2707 11.4216 11.0569 12.2184 10.7282" stroke="currentColor" strokeLinecap="round" strokeLinejoin="round" />
            </svg>
        </div>
            Experimental feature. <u><a href="/docs/beta-and-experimental-features#experimental-features">Learn more.</a></u>
        </div>;
};

{/*AUTOGENERATED_START*/}

<h2 id="detectCharset">
  detectCharset
</h2>

Introduced in: v22.2.0

Detects the character set of a non-UTF8-encoded input string.

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
detectCharset(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/reference/data-types/string)

**Returned value**

Returns a string containing the code of the detected character set. [`String`](/reference/data-types/string)

**Examples**

**Basic usage**

```sql title=Query theme={null}
SELECT detectCharset('Ich bleibe für ein paar Tage.')
```

```response title=Response theme={null}
WINDOWS-1252
```
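
**Enabling the setting for a single query**

As a minimal sketch, the setting named in the warning above can be attached to an individual query with a `SETTINGS` clause rather than set for the whole session. The input string is reused from the basic example; the `charset` alias is only illustrative.

```sql title=Query theme={null}
-- Enable the experimental NLP functions for this query only.
SELECT detectCharset('Ich bleibe für ein paar Tage.') AS charset
SETTINGS allow_experimental_nlp_functions = 1
```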

<h2 id="detectLanguage">
  detectLanguage
</h2>

Introduced in: v22.2.0

Detects the language of the UTF8-encoded input string.
The function uses the [CLD2 library](https://github.com/CLD2Owners/cld2) for detection and returns the 2-letter ISO language code.

The longer the input, the more precise the language detection will be.

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
detectLanguage(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/reference/data-types/string)

**Returned value**

Returns the 2-letter ISO code of the detected language. Other possible results: `un` (unknown: no language could be detected) and `other` (the detected language has no 2-letter code). [`String`](/reference/data-types/string)

**Examples**

**Mixed language text**

```sql title=Query theme={null}
SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there\'s a will, there\'s a way.')
```

```response title=Response theme={null}
fr
```
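
**Flagging undetected input**

The returned value can be `un` or `other` instead of a language code, so downstream logic usually has to treat those two values separately. The following is a sketch only: the sample strings, the inline subquery, and the `needs_review` alias are assumptions made for illustration, and no output is shown because detection results depend on the input.

```sql title=Query theme={null}
SELECT
    text,
    detectLanguage(text) AS lang,
    -- 1 when the function could not return a 2-letter ISO code
    lang IN ('un', 'other') AS needs_review
FROM
(
    -- Illustrative inputs; replace with a real table or column.
    SELECT arrayJoin([
        'Je pense que je ne parviendrai jamais à parler français comme un natif.',
        '12345'
    ]) AS text
)
SETTINGS allow_experimental_nlp_functions = 1
```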

<h2 id="detectLanguageMixed">
  detectLanguageMixed
</h2>

Introduced in: v22.2.0

Similar to the [`detectLanguage`](#detectLanguage) function, but `detectLanguageMixed` returns a `Map` in which each 2-letter language code is mapped to the proportion of the text detected as that language.

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
detectLanguageMixed(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/reference/data-types/string)

**Returned value**

Returns a map whose keys are 2-letter ISO codes and whose values are the proportion of the text detected as that language. [`Map(String, Float32)`](/reference/data-types/map)

**Examples**

**Mixed languages**

```sql title=Query theme={null}
SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.')
```

```response title=Response theme={null}
{'ja':0.62,'fr':0.36}
```
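
**Reading individual languages from the map**

Because the result is an ordinary `Map(String, Float32)`, the standard map and array functions apply to it. The sketch below reuses the input from the previous example, looks up two keys, and orders the detected codes by their value. The aliases are illustrative, and looking up a key that is not in the map is assumed to return the default value `0`.

```sql title=Query theme={null}
WITH detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.') AS langs
SELECT
    langs['ja'] AS ja_share,
    -- 'de' is not expected in the map, so this falls back to the default value
    langs['de'] AS de_share,
    -- Language codes ordered from the largest value to the smallest
    arrayReverseSort((code, share) -> share, mapKeys(langs), mapValues(langs)) AS codes_by_share
SETTINGS allow_experimental_nlp_functions = 1
```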

<h2 id="detectLanguageUnknown">
  detectLanguageUnknown
</h2>

Introduced in: v22.2.0

Similar to the [`detectLanguage`](#detectLanguage) function, except that `detectLanguageUnknown` works with non-UTF8-encoded strings.
Prefer this version when your character set is UTF-16 or UTF-32.

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
detectLanguageUnknown(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/reference/data-types/string)

**Returned value**

Returns the 2-letter ISO code of the detected language. Other possible results: `un` (unknown: no language could be detected) and `other` (the detected language has no 2-letter code). [`String`](/reference/data-types/string)

**Examples**

**Basic usage**

```sql title=Query theme={null}
SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.')
```

```response title=Response theme={null}
de
```
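
**Charset and language together**

When the encoding of the input is not known in advance, it can be useful to inspect both the detected character set and the detected language for the same value. This is a sketch under that assumption: it combines `detectLanguageUnknown` with [`detectCharset`](#detectCharset) on one string, and the subquery and aliases are only illustrative.

```sql title=Query theme={null}
SELECT
    detectCharset(s) AS charset,
    detectLanguageUnknown(s) AS lang
FROM
(
    -- Illustrative input; in practice this would be a column of unknown encoding.
    SELECT 'Ich bleibe für ein paar Tage.' AS s
)
SETTINGS allow_experimental_nlp_functions = 1
```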

<h2 id="detectTonality">
  detectTonality
</h2>

Introduced in: v22.2.0

Determines the sentiment of the provided text data.

<Info>
  **Limitation**

In its current form, this function relies on an embedded emotional dictionary and only works for the Russian language.
</Info>

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
detectTonality(s)
```

**Arguments**

* `s` — The text to be analyzed. [`String`](/reference/data-types/string)

**Returned value**

Returns the average sentiment value of the words in the text. [`Float32`](/reference/data-types/float)

**Examples**

**Russian sentiment analysis**

```sql title=Query theme={null}
SELECT
    detectTonality('Шарик - хороший пёс'),
    detectTonality('Шарик - пёс'),
    detectTonality('Шарик - плохой пёс')
```

```response title=Response theme={null}
0.44445, 0, -0.3
```
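
**Turning the score into a label**

Since the function returns an average sentiment value per text, one possible follow-up step is to bucket that score into a coarse label. The sketch below does this with `multiIf`. The zero threshold and the label names are assumptions chosen for illustration, not values defined by the function.

```sql title=Query theme={null}
SELECT
    phrase,
    detectTonality(phrase) AS score,
    -- Hypothetical cut-off at 0; adjust the thresholds to the data.
    multiIf(score > 0, 'positive', score < 0, 'negative', 'neutral') AS label
FROM
(
    SELECT arrayJoin([
        'Шарик - хороший пёс',
        'Шарик - пёс',
        'Шарик - плохой пёс'
    ]) AS phrase
)
SETTINGS allow_experimental_nlp_functions = 1
```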

<h2 id="lemmatize">
  lemmatize
</h2>

Introduced in: v21.9.0

Performs lemmatization on a given word.
This function needs dictionaries to operate, which can be obtained from [GitHub](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models).
For more details on loading a dictionary from a local file, see the page ["Defining Dictionaries"](/reference/statements/create/dictionary/sources/local-file).

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
lemmatize(lang, word)
```

**Arguments**

* `lang` — Language whose rules will be applied. [`String`](/reference/data-types/string)
* `word` — Lowercase word that needs to be lemmatized. [`String`](/reference/data-types/string)

**Returned value**

Returns the lemmatized form of the word. [`String`](/reference/data-types/string)

**Examples**

**English lemmatization**

```sql title=Query theme={null}
SELECT lemmatize('en', 'wolves')
```

```response title=Response theme={null}
wolf
```
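
**Lowercasing the input first**

The `word` argument is documented as a lowercase word, so mixed-case input is typically normalized before the call. The sketch below uses `lower`, which handles ASCII letters; for non-ASCII text, `lowerUTF8` may be the better choice. It assumes an `en` dictionary has already been configured on the server.

```sql title=Query theme={null}
-- Normalize case before lemmatizing; requires a configured 'en' dictionary.
SELECT lemmatize('en', lower('Wolves')) AS lemma
SETTINGS allow_experimental_nlp_functions = 1
```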

<h2 id="stem">
  stem
</h2>

Introduced in: v21.9.0

Performs stemming on a word or an array of words using the Snowball algorithms.
Each input string must be a single lowercase word; strings containing whitespace cause an exception.
Passing uppercase characters produces undefined results.
Returns `String` for scalar inputs (including `FixedString`) and `Array(String)` for array inputs.
`Nullable` and `LowCardinality` variants of `String` and `FixedString` are supported.

**Syntax**

```sql theme={null}
stem(word, language)
```

**Arguments**

* `word` — A single lowercase word, or an array of such words, to stem. Uppercase characters produce undefined results. Array elements may also be `Nullable`, as in `Array(Nullable(String))` or `Array(Nullable(FixedString))`. [`String`](/reference/data-types/string) or [`FixedString`](/reference/data-types/fixedstring) or [`Array(String)`](/reference/data-types/array) or [`Array(FixedString)`](/reference/data-types/array)
* `language` — Language whose stemming rules will be applied. Use the two-letter ISO 639-1 code (for example `'en'`, `'de'`, `'fr'`); see the [list of ISO 639 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). [`String`](/reference/data-types/string)

**Returned value**

The stemmed form of the word (String), or an array of stemmed words (Array(String)). [`String`](/reference/data-types/string) or [`Array(String)`](/reference/data-types/array)

**Examples**

**Stemming a single word**

```sql title=Query theme={null}
SELECT stem('blessing', 'en') AS res
```

```response title=Response theme={null}
bless
```

**Stemming an array of words**

```sql title=Query theme={null}
SELECT stem(['blessing', 'disguise'], 'en') AS res
```

```response title=Response theme={null}
['bless','disguis']
```

**Stemming a FixedString**

```sql title=Query theme={null}
SELECT stem(toFixedString('blessing', 10), 'en') AS res
```

```response title=Response theme={null}
bless
```

**Stemming a Nullable word**

```sql title=Query theme={null}
SELECT stem(toNullable('blessing'), 'en') AS res
```

```response title=Response theme={null}
bless
```
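
**Stemming the words of a phrase**

Because each input must be a single lowercase word without whitespace, a phrase has to be lowercased and split into words before it is passed to `stem`. The sketch below does this with `lower` and `splitByWhitespace` and passes the resulting array in one call. The phrase is illustrative and contains no punctuation; text with punctuation would need additional cleanup first.

```sql title=Query theme={null}
SELECT
    -- lower() removes uppercase characters; splitByWhitespace() yields one word per element
    stem(splitByWhitespace(lower('Blessing in Disguise')), 'en') AS stems
```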

<h2 id="synonyms">
  synonyms
</h2>

Introduced in: v21.9.0

Finds synonyms of a given word.

There are two types of synonym extensions:

* `plain`
* `wordnet`

With the `plain` extension type, you need to provide a path to a simple text file in which each line is one synonym set.
Words on a line must be separated by space or tab characters.

With the `wordnet` extension type you need to provide a path to a directory with the WordNet thesaurus in it.
The thesaurus must contain a WordNet sense index.

<Warning>
  This function is experimental and may change in unpredictable backwards-incompatible ways in future releases.
  Set `allow_experimental_nlp_functions = 1` to enable it.
</Warning>

**Syntax**

```sql theme={null}
synonyms(ext_name, word)
```

**Arguments**

* `ext_name` — Name of the extension in which the search is performed. [`String`](/reference/data-types/string)
* `word` — Word to search for in the extension. [`String`](/reference/data-types/string)

**Returned value**

Returns an array of synonyms for the given word. [`Array(String)`](/reference/data-types/array)

**Examples**

**Find synonyms**

```sql title=Query theme={null}
SELECT synonyms('list', 'important')
```

```response title=Response theme={null}
['important','big','critical','crucial']
```
