Exclude certain elements from Smart Search index.

This article describes how to exclude certain elements from a given page within the Smart Search index.
For the Kentico 8 API and customization, please see the Defining page indexes documentation.

Suppose you want your smart search index to include some of a page's content but exclude other content on the same page. To achieve this, you need to use the Document crawler index type and implement custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever the document search crawler processes the HTML output of a page.

To assign a handler method to the OnHtmlToPlainText event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). The OnHtmlToPlainText event provides the following string parameters to the handler:
plainText - contains the page output already stripped of all tags and converted to plain text.
originalHTML - allows you to access the raw page code without any modifications.
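To make the handler contract concrete, here is a minimal sketch of a method with the shape described above: two strings in, one string out. The delegate declared here is a hypothetical stand-in so that the snippet compiles without the Kentico assemblies, and the class and method names are assumptions for illustration only.

```csharp
using System;

public static class HandlerContractSketch
{
    // Hypothetical stand-in mirroring the handler shape used in this article:
    // (plain text, original HTML) in, text to be indexed out.
    public delegate string HtmlToPlainTextHandler(string plainText, string originalHtml);

    // Pass-through handler: returns the already-stripped text unchanged.
    // A real handler would typically inspect originalHtml and build its own result.
    public static string PassThrough(string plainText, string originalHtml)
    {
        return plainText;
    }
}
```

Assigning `HandlerContractSketch.PassThrough` to a variable of the stand-in delegate type and calling it with `("Hello world", "<p>Hello <b>world</b></p>")` returns `"Hello world"`. In the article's setup, a method of this shape is what gets attached to SearchHelper.OnHtmlToPlainText.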

The following example search index is set up as a Document crawler index type and, for simplicity, searches only the Home page. Go to Site Manager -> Administration -> Smart search -> New index and create a new index called DoNotSearchExample that uses the Document crawler index type and searches only the Home page.



Please remember to assign this index to your current site (Sites tab), and add your Culture to the index (Cultures tab).

The example code below won’t index any content contained within the do-not-search div tag.  You can create a new class in your App_Code folder and add this code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Text.RegularExpressions;

using CMS.SettingsProvider;
using CMS.SiteProvider;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchHelper.OnHtmlToPlainText += new SearchHelper.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
            string result = originalHtml;

            // Define the div tag to exclude from search results.
            // Note: only the first matching block is removed, up to the first closing </div>,
            // so the excluded block must not contain nested div tags.
            const string openTag = "<div class=\"do-not-search\">";
            const string closeTag = "</div>";

            int startIndex = originalHtml.IndexOf(openTag);
            if (startIndex >= 0)
            {
                int closeIndex = originalHtml.IndexOf(closeTag, startIndex);

                // Remove the block only if a closing tag was actually found
                if (closeIndex >= 0)
                {
                    result = originalHtml.Remove(startIndex, closeIndex + closeTag.Length - startIndex);
                }
            }

            // Remove new lines
            result = result.Replace("\n", " ");

            // Remove tab spaces
            result = result.Replace("\t", " ");

            // Remove head tag
            result = Regex.Replace(result, "<head.*?</head>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

            // Remove style tag
            result = Regex.Replace(result, "<style.*?</style>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

            // Remove any JavaScript
            result = Regex.Replace(result, "<script.*?</script>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

            // Remove tags
            result = Regex.Replace(result, "<[^>]*>", " ");

            // Decode HTML entities
            result = HttpUtility.HtmlDecode(result);

            // Replace white spaces
            result = Regex.Replace(result, "\\s+", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

            return result;
        }
    }
}
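If a page may contain more than one do-not-search block, the removal step could be factored into a helper that loops over every occurrence. The following is a minimal sketch under stated assumptions, not part of the Kentico API: the class and method names are hypothetical, the marker must match the markup exactly (same attribute order and quoting), and nested div tags inside an excluded block are still not handled.

```csharp
using System;

public static class SearchExclusionSketch
{
    // Marker assumed to match the article's markup exactly.
    private const string OpenTag = "<div class=\"do-not-search\">";
    private const string CloseTag = "</div>";

    // Removes every do-not-search block from the given HTML.
    // Limitation: each block ends at the first </div> after its marker,
    // so nested divs inside an excluded block are not supported.
    public static string RemoveExcludedBlocks(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        string result = html;
        int start = result.IndexOf(OpenTag, StringComparison.OrdinalIgnoreCase);

        while (start >= 0)
        {
            int close = result.IndexOf(CloseTag, start + OpenTag.Length, StringComparison.OrdinalIgnoreCase);
            if (close < 0)
            {
                // Unclosed block: leave the remaining markup untouched.
                break;
            }

            // Cut from the opening marker through the end of the closing tag.
            result = result.Remove(start, close + CloseTag.Length - start);

            // Continue searching from the position where the block was removed.
            start = result.IndexOf(OpenTag, start, StringComparison.OrdinalIgnoreCase);
        }

        return result;
    }
}
```

Inside the event handler, the single IndexOf/Remove step could then be replaced with a call such as `string result = SearchExclusionSketch.RemoveExcludedBlocks(originalHtml);`, leaving the rest of the clean-up unchanged. For markup with nested divs, an HTML parser would be a more robust choice than string matching.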

With the search index created and the new class containing the example code in place, you can perform a test. Add a new Smart search dialog with results web part to your Home page and assign the DoNotSearchExample index to this search web part.


Add a new Editable text web part to your Home page and add SomeTestText to the content section. 


After saving this page, go to Site Manager -> Administration -> Smart search -> edit the DoNotSearchExample index -> General, and rebuild the index. Once the index has been rebuilt, search for SomeTestText in your search web part to confirm that a result is returned and the search index is working correctly.


If no result is returned, please check your search index settings and verify that the code in your new class is correct. If a result is returned, your index and code are set up correctly and you can test excluding specific content.
Go back to the Editable text web part where you entered SomeTestText, switch CKEditor to Source mode, and wrap SomeTestText in the do-not-search div tag:
<div class="do-not-search">SomeTestText</div>

Save the page, then go to Site Manager -> Administration -> Smart search -> edit the DoNotSearchExample index -> General, and rebuild the index. Once the index has been rebuilt, search for SomeTestText in your search web part; no results should be returned.
-eh-


Applies to: Kentico CMS 7.x