function SearchTextProcessor::analyze
Same name in other branches
- 10 core/modules/search/src/SearchTextProcessor.php \Drupal\search\SearchTextProcessor::analyze()
- 11.x core/modules/search/src/SearchTextProcessor.php \Drupal\search\SearchTextProcessor::analyze()
Overrides SearchTextProcessorInterface::analyze
1 call to SearchTextProcessor::analyze()
- SearchTextProcessor::process in core/modules/search/src/SearchTextProcessor.php - Processes text into words for indexing.
File
- core/modules/search/src/SearchTextProcessor.php, line 64
Class
- SearchTextProcessor
- Processes search text for indexing.
Namespace
Drupal\search
Code
public function analyze(string $text, ?string $langcode = NULL) : string {
  // Decode entities to UTF-8.
  $text = Html::decodeEntities($text);
  // Lowercase.
  $text = mb_strtolower($text);
  // Remove diacritics.
  $text = $this->transliteration
    ->removeDiacritics($text);
  // Call an external processor for word handling.
  $this->invokePreprocess($text, $langcode);
  // Simple CJK handling.
  if ($this->configFactory
    ->get('search.settings')
    ->get('index.overlap_cjk')) {
    $text = preg_replace_callback('/[' . self::PREG_CLASS_CJK . ']+/u', [
      $this,
      'expandCjk',
    ], $text);
  }
  // To improve searching for numerical data such as dates, IP addresses
  // or version numbers, we consider a group of numerical characters
  // separated only by punctuation characters to be one piece.
  // This also means that searching for e.g. '20/03/1984' also returns
  // results with '20-03-1984' in them.
  // Readable regexp: ([number]+)[punctuation]+(?=[number])
  $text = preg_replace('/([' . self::PREG_CLASS_NUMBERS . ']+)[' . self::PREG_CLASS_PUNCTUATION . ']+(?=[' . self::PREG_CLASS_NUMBERS . '])/u', '\\1', $text);
  // Multiple dot and dash groups are word boundaries and replaced with space.
  // No need to use the unicode modifier here because 0-127 ASCII characters
  // can't match higher UTF-8 characters as the leftmost bit of those are 1.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);
  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. See unicode note directly above.
  $text = preg_replace('/[._-]+/', '', $text);
  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . Unicode::PREG_CLASS_WORD_BOUNDARY . ']+/u', ' ', $text);
  // Truncate everything to 50 characters.
  $words = explode(' ', $text);
  array_walk($words, [
    $this,
    'truncate',
  ]);
  $text = implode(' ', $words);
  return $text;
}
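Example
The numeric and dot/dash rules in the code above can be tried in isolation. The following is a minimal standalone sketch, not part of Drupal: the function name sketch_number_and_dot_rules() is hypothetical, and the Unicode properties \p{Nd} and \p{P} are assumed stand-ins for PREG_CLASS_NUMBERS and PREG_CLASS_PUNCTUATION, whose actual definitions are not shown on this page and may match a different set of characters.

```php
<?php

/**
 * Hypothetical helper mirroring three preg_replace() steps of analyze().
 *
 * Assumption: \p{Nd} and \p{P} only approximate the class constants
 * PREG_CLASS_NUMBERS and PREG_CLASS_PUNCTUATION used by the real method.
 */
function sketch_number_and_dot_rules(string $text): string {
  // Drop punctuation that sits between two runs of digits, so that
  // '20/03/1984' and '20-03-1984' collapse to the same token.
  $text = preg_replace('/(\p{Nd}+)\p{P}+(?=\p{Nd})/u', '\\1', $text);
  // Runs of two or more dots or dashes act as a word boundary.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);
  // Remaining single dots, underscores and dashes are removed outright.
  $text = preg_replace('/[._-]+/', '', $text);
  return $text;
}

echo sketch_number_and_dot_rules('20/03/1984'), "\n";
echo sketch_number_and_dot_rules('20-03-1984'), "\n";
echo sketch_number_and_dot_rules('U.S.A.'), "\n";
echo sketch_number_and_dot_rules('wait...what'), "\n";
```

Under these assumptions both date spellings reduce to 20031984, the acronym becomes USA, and the ellipsis is turned into a single space. The sketch skips entity decoding, lowercasing, diacritic removal, preprocessing hooks, CJK expansion, the general word-boundary pass and truncation, all of which the real method also performs.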