function SearchTextProcessor::analyze

Same name in other branches
  1. 10 core/modules/search/src/SearchTextProcessor.php \Drupal\search\SearchTextProcessor::analyze()
  2. 11.x core/modules/search/src/SearchTextProcessor.php \Drupal\search\SearchTextProcessor::analyze()

Overrides SearchTextProcessorInterface::analyze

1 call to SearchTextProcessor::analyze()
SearchTextProcessor::process in core/modules/search/src/SearchTextProcessor.php
Processes text into words for indexing.

File

core/modules/search/src/SearchTextProcessor.php, line 64

Class

SearchTextProcessor
Processes search text for indexing.

Namespace

Drupal\search

Code

public function analyze(string $text, ?string $langcode = NULL) : string {
    // Decode entities to UTF-8.
    $text = Html::decodeEntities($text);
    // Lowercase.
    $text = mb_strtolower($text);
    // Remove diacritics.
    $text = $this->transliteration
        ->removeDiacritics($text);
    // Call an external processor for word handling.
    $this->invokePreprocess($text, $langcode);
    // Simple CJK handling.
    if ($this->configFactory
        ->get('search.settings')
        ->get('index.overlap_cjk')) {
        $text = preg_replace_callback('/[' . self::PREG_CLASS_CJK . ']+/u', [
            $this,
            'expandCjk',
        ], $text);
    }
    // To improve searching for numerical data such as dates, IP addresses
    // or version numbers, we consider a group of numerical characters
    // separated only by punctuation characters to be one piece.
    // This also means that searching for e.g. '20/03/1984' returns
    // results with '20-03-1984' in them.
    // Readable regexp: ([number]+)[punctuation]+(?=[number])
    $text = preg_replace('/([' . self::PREG_CLASS_NUMBERS . ']+)[' . self::PREG_CLASS_PUNCTUATION . ']+(?=[' . self::PREG_CLASS_NUMBERS . '])/u', '\\1', $text);
    // Multiple dot and dash groups are word boundaries and replaced with space.
    // No need to use the unicode modifier here because ASCII characters (0-127)
    // can't match any byte of a multi-byte UTF-8 character: those bytes all
    // have the leftmost bit set to 1.
    $text = preg_replace('/[.-]{2,}/', ' ', $text);
    // The dot, underscore and dash are simply removed. This allows meaningful
    // search behavior with acronyms and URLs. See unicode note directly above.
    $text = preg_replace('/[._-]+/', '', $text);
    // With the exception of the rules above, we consider all punctuation,
    // marks, spacers, etc, to be a word boundary.
    $text = preg_replace('/[' . Unicode::PREG_CLASS_WORD_BOUNDARY . ']+/u', ' ', $text);
    // Truncate each word to 50 characters.
    $words = explode(' ', $text);
    array_walk($words, [
        $this,
        'truncate',
    ]);
    $text = implode(' ', $words);
    return $text;
}
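
The punctuation and truncation steps above can be tried outside Drupal with a rough standalone sketch. It is not the core implementation. The function names `sketch_analyze_punctuation()` and `sketch_truncate_words()` are hypothetical, and the Unicode properties `\p{N}`, `\p{P}` and `[^\p{L}\p{N}]` are assumed stand-ins for `PREG_CLASS_NUMBERS`, `PREG_CLASS_PUNCTUATION` and `Unicode::PREG_CLASS_WORD_BOUNDARY`, which are longer hand-written character classes. Entity decoding, diacritic removal, the preprocess hook and CJK handling are left out, and the final `trim()` is added only to make the results easier to read.

```php
<?php

declare(strict_types=1);

/**
 * Approximates the number and punctuation steps of analyze().
 *
 * Assumption: \p{N}, \p{P} and [^\p{L}\p{N}] only approximate the
 * character classes that the real method uses.
 */
function sketch_analyze_punctuation(string $text): string {
  $text = mb_strtolower($text);
  // Join groups of digits that are separated only by punctuation.
  $text = preg_replace('/(\p{N}+)\p{P}+(?=\p{N})/u', '$1', $text);
  // Runs of two or more dots or dashes act as a word boundary.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);
  // Remaining dots, underscores and dashes are removed.
  $text = preg_replace('/[._-]+/', '', $text);
  // Anything else that is not a letter or a number becomes one space.
  $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text);
  // Not part of the original method; added for readable output.
  return trim($text);
}

/**
 * Cuts every space-separated word down to at most $max characters.
 *
 * Assumption: the real truncate() callback may do more than this.
 */
function sketch_truncate_words(string $text, int $max = 50): string {
  $words = explode(' ', $text);
  foreach ($words as $i => $word) {
    if (mb_strlen($word) > $max) {
      $words[$i] = mb_substr($word, 0, $max);
    }
  }
  return implode(' ', $words);
}

// Both date spellings collapse to the same token.
var_dump(sketch_analyze_punctuation('20/03/1984') === sketch_analyze_punctuation('20-03-1984'));
echo sketch_analyze_punctuation('U.S.A. and www.example.com'), "\n";
echo sketch_analyze_punctuation('foo--bar, snake_case!'), "\n";
```

Because each step rewrites the output of the previous one, the order matters: the digit-joining rule has to run before dots and dashes are stripped, and the multi-character `[.-]{2,}` rule has to run before the single-character removal.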
