elasticsearch: how to index terms which are stopwo

2020-03-26 10:53发布

I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.

I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.

I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.

So, is it possible to save "The The" in the index but not disabling the stop words at all?

1条回答
Viruses.
2楼-- · 2020-03-26 11:26

You can use the synonym filter to convert The The into a single token eg thethe which won't be removed by the stopwords filter.

First, configure the analyzer:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "syn" : {
               "synonyms" : [
                  "the the => thethe"
               ],
               "type" : "synonym"
            }
         },
         "analyzer" : {
            "syn" : {
               "filter" : [
                  "lowercase",
                  "syn",
                  "stop"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then test it with the string "The The The Who".

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn' 

{
   "tokens" : [
      {
         "end_offset" : 7,
         "position" : 1,
         "start_offset" : 0,
         "type" : "SYNONYM",
         "token" : "thethe"
      },
      {
         "end_offset" : 15,
         "position" : 3,
         "start_offset" : 12,
         "type" : "<ALPHANUM>",
         "token" : "who"
      }
   ]
}

"The The" has been tokenized as "the the", and "The Who" as "who" because the preceding "the" was removed by the stopwords filter.

To stop or not to stop

Which brings us back to whether we should include stopwords or not? You said:

I know I can ignore the stop words list completely 
but this is not what I want since the results searching 
for other bands like "the who" would explode.

What do you mean by that? Explode how? Index size? Performance?

Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.

Indexing stopwords won't have a huge impact on index size. For instance, to index the word the means adding a single term to the index. You already have thousands of terms - indexing the stopwords as well won't make much difference to size or to performance.

Actually, the bigger problem is that the is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer Madrid over the other terms. This can be mitigated by using a shingle filter, which would result in these tokens:

['the the','the concert','concert madrid']

While the may be common, the the isn't and so will rank higher.

You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.

We can use a multi-field to analyze the text field in two different ways:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "text" : {
               "fields" : {
                  "shingle" : {
                     "type" : "string",
                     "analyzer" : "shingle"
                  },
                  "text" : {
                     "type" : "string",
                     "analyzer" : "no_stop"
                  }
               },
               "type" : "multi_field"
            }
         }
      }
   },
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_stop" : {
               "stopwords" : "",
               "type" : "standard"
            },
            "shingle" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "shingle"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then use a multi_match query to query both versions of the field, giving the shingled version more "boost"/relevance. In this example the text.shingle^2 means that we want to boost that field by 2:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "multi_match" : {
         "fields" : [
            "text",
            "text.shingle^2"
         ],
         "query" : "the the concert madrid"
      }
   }
}
'
查看更多
登录 后发表回答