Smart search index with "-" character in content

Johnny Nguyen asked on March 12, 2018 13:04

Hi,

I'm having a problem with a smart search index when the content contains a hyphen ("-"). I have the string "PPG1149-6" in a document, and I need to find that document by entering 1149-6. However, the Search Index API returns an empty result.

I used the Luke tool to check what was added to the "_content" field, and found that the string "1149-6" wasn't indexed because of the hyphen.

I'm using a Pages index with the Subset analyzer. Is there any way to customize the analyzer so that it stops treating the hyphen as a special character and indexes it as a normal character?

smart search smartsearch api

Correct Answer

Trevor Fayas answered on March 13, 2018 14:28

This is what you're looking for. It references a sample file to give a little more context, and it lets you specifically tell the analyzer to index certain characters that are usually break characters.

https://docs.kentico.com/k11/configuring-kentico/setting-up-search-on-your-website/using-locally-stored-search-indexes/creating-local-search-indexes/creating-custom-smart-search-analyzers


Recent Answers

Peter Mogilnitski answered on March 12, 2018 15:17 (last edited on March 12, 2018 17:10)

I think you need to escape the hyphen in your search condition, i.e. it should be something like +_content: (PPG1149\-6). The "-" (prohibit) operator excludes documents that contain the term after the "-" symbol. Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ / To escape one of these characters, put a \ before it.
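To make the escaping rule concrete, here is a minimal sketch in Python. It is an illustration only, not Kentico or Lucene API: the name `escape_lucene_term` is made up for this example, and the two-character operators && and || are handled by escaping each "&" and "|" on its own.

```python
# Characters that belong to the Lucene query syntax listed above.
# "&&" and "||" are covered by escaping every "&" and "|" individually.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene_term(term: str) -> str:
    """Prefix every query-syntax character in `term` with a backslash.

    Hypothetical helper for illustration; not part of any library.
    """
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in term)


if __name__ == "__main__":
    # Prints: +_content:(PPG1149\-6)
    print("+_content:(" + escape_lucene_term("PPG1149-6") + ")")
```

Keep in mind that escaping only changes how the query is parsed. It does not change which tokens the analyzer wrote into the index.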


Johnny Nguyen answered on March 14, 2018 04:38

Hi @Peter Mogilnitski,

Thanks for your suggestion, but I used the Luke tool and found that the hyphen wasn't indexed. "PPG1149-6" isn't considered a single word because of the hyphen, so the analyzer treats it as 2 words. I think I need to create my own analyzer to handle this case.

For now, I'm using a workaround: I break the keyword "PPG1149-6" into 2 words, ["PPG1149", "6"], then ask the search engine for all documents that contain both of them in any order (+_content:PPG1149 +_content:6). After that, I use TreeProvider to query the DB for the expected result.
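The workaround can be sketched roughly as follows. This is Python for illustration only; `build_all_terms_query` and `filter_exact` are hypothetical names, and `filter_exact` is just a stand-in for the second pass that the post does with TreeProvider against the database.

```python
import re


def build_all_terms_query(keyword, field="_content"):
    """Split `keyword` on non-alphanumeric characters and require every part."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", keyword) if p]
    return " ".join("+{0}:{1}".format(field, p) for p in parts)


def filter_exact(candidates, keyword):
    """Second pass: keep only the texts that literally contain `keyword`."""
    needle = keyword.lower()
    return [text for text in candidates if needle in text.lower()]


if __name__ == "__main__":
    # Prints: +_content:PPG1149 +_content:6
    print(build_all_terms_query("PPG1149-6"))

    # The index query can return false positives, so filter them out afterwards.
    hits = ["Part PPG1149-6 in stock", "PPG1149 ships in packs of 6"]
    print(filter_exact(hits, "PPG1149-6"))  # ['Part PPG1149-6 in stock']
```

The first step is deliberately loose (both parts anywhere in the content), which is why the exact-match pass afterwards is needed.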


Johnny Nguyen answered on March 14, 2018 04:39

Thanks @Trevor Fayas, you are correct.


Peter Mogilnitski answered on March 14, 2018 19:05 (last edited on March 14, 2018 19:07)

I disagree. I just did a simple test; here is my config. You don't need to do anything special. Look at my screenshots. Test your string in your index's search preview: if it works there, then the issue is with your Lucene query, not with the analyzer. No need for anything custom.


Trevor Fayas answered on March 14, 2018 19:41

@Peter, the search he is using to pull the document is "1149-6", which isn't indexed. I did the same test as you, and yes, searching by "PPG1149-6" works, because it is parsed as a search for PPG, which is part of the indexed content.

In a test of a page that contains "PPG1149-6", this is the content saved using a Simple index:

test test test search test 
test test ppg

As you can see, it cut off the numbers. You can view this in the Luke Lucene index toolbox by clicking "Reconstruct & Edit" on the document in the Documents tab.

However, I will say this does raise one valid point: you can use the "Whitespace" analyzer, which splits only on whitespace. This resulted in the full PPG1149-6 being indexed for the document, but because it wasn't tokenized into "PPG" and "1149-6", it still can't be found by searching for just "1149-6". Subset gets closer, but still won't contain 1149-6, just 1149.

I think that to fully accomplish this, he will still need a custom analyzer, and along with it he may have to add logic that finds the code (PPG1149-6) and splits it into PPG, PPG1149-6, and 1149-6 so that he can search for any of them.
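To show why neither analyzer produces a "1149-6" term, here is a small Python simulation. It assumes the Simple analyzer behaves like Lucene's SimpleAnalyzer (letter-only tokens, lowercased) and that Whitespace splits on whitespace only. It does not model the Subset analyzer, and the function names are made up for this example.

```python
import re


def simple_tokens(text):
    """Letter-only, lowercased tokens: digits and '-' act as separators."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def whitespace_tokens(text):
    """Split on whitespace only; digits and punctuation stay inside the token."""
    return text.split()


if __name__ == "__main__":
    content = "test PPG1149-6"
    print(simple_tokens(content))      # ['test', 'ppg']
    print(whitespace_tokens(content))  # ['test', 'PPG1149-6']

    # A term lookup needs a matching token, and neither list has one.
    print("1149-6" in simple_tokens(content))      # False
    print("1149-6" in whitespace_tokens(content))  # False
```

In both cases "1149-6" is never a token on its own, which matches what Luke shows for the index.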
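The splitting logic might look something like the sketch below. It is Python for illustration only; a real Kentico custom analyzer would be written in C# following the linked documentation. `CODE_PATTERN` and `expand_codes` are hypothetical, and the pattern assumes codes always look like letters, digits, a hyphen, then more digits.

```python
import re

# Hypothetical pattern for product codes such as "PPG1149-6":
# a letter prefix, then digits, a hyphen, and more digits.
CODE_PATTERN = re.compile(r"\b([A-Za-z]+)(\d+-\d+)\b")


def expand_codes(text):
    """Return the extra terms to index for every product code in `text`."""
    terms = []
    for match in CODE_PATTERN.finditer(text):
        prefix, number = match.group(1), match.group(2)
        # Index the prefix, the full code, and the numeric part separately.
        terms.extend([prefix, match.group(0), number])
    return terms


if __name__ == "__main__":
    print(expand_codes("Part PPG1149-6 in stock"))
    # ['PPG', 'PPG1149-6', '1149-6']
```

The extra terms only help if they reach the index intact, so they would have to go through an analyzer that does not split on the hyphen again.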
