Exclude certain elements from Smart Search index.
This article describes how to exclude certain elements from a given page within the Smart Search index.
For the Kentico 8 API and customization, please see the Defining page indexes documentation.
Let’s say you want to include only some of the content of a given page in your smart search index and exclude other content on the same page. To achieve this, you need to use the Document crawler index type and implement custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever the document search crawler processes the HTML output of a page.
To assign a method as the handler for the OnHtmlToPlainText event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). The OnHtmlToPlainText event provides the following string parameters to the handler:
• plainText - contains the page output already stripped of all tags and converted to plain text.
• originalHtml - contains the raw page code without any modifications.
The following example uses a search index named DoNotSearchExample, set up with the Document crawler index type and, for simplicity, limited to the Home page. To create it, go to Site Manager -> Administration -> Smart search -> New index, create a new index called DoNotSearchExample, select the Document crawler index type, and configure it to search only the Home page.
Please remember to assign this index to your current site (Sites tab), and add your Culture to the index (Cultures tab).
The example code below keeps any content inside the do-not-search div tag out of the index. Create a new class in your App_Code folder and add this code:
using System;
using System.Text.RegularExpressions;
using System.Web;
using CMS.SettingsProvider;
using CMS.SiteProvider;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchHelper.OnHtmlToPlainText += new SearchHelper.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        /// <summary>
        /// Converts the crawled HTML to plain text, leaving out the do-not-search div.
        /// </summary>
        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
            const string openTag = "<div class=\"do-not-search\">";
            const string closeTag = "</div>";
            string result = originalHtml;

            // Remove the first do-not-search div, up to the first closing </div> that follows it
            int startIndex = originalHtml.IndexOf(openTag);
            if (startIndex >= 0)
            {
                int closeIndex = originalHtml.IndexOf(closeTag, startIndex);
                // Only remove the block when a closing tag was actually found
                if (closeIndex >= 0)
                {
                    result = originalHtml.Remove(startIndex, closeIndex + closeTag.Length - startIndex);
                }
            }

            // Remove new lines
            result = result.Replace("\n", " ");
            // Remove tab spaces
            result = result.Replace("\t", " ");
            // Remove head tag
            result = Regex.Replace(result, "<head.*?</head>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove style tag
            result = Regex.Replace(result, "<style.*?</style>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove any JavaScript
            result = Regex.Replace(result, "<script.*?</script>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove remaining tags
            result = Regex.Replace(result, "<[^>]*>", " ");
            // Decode HTML entities
            result = HttpUtility.HtmlDecode(result);
            // Collapse runs of white space into a single space
            result = Regex.Replace(result, "\\s+", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

            return result;
        }
    }
}
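The handler above removes only the first do-not-search div and stops at the first closing </div> it finds, so a div nested inside the excluded block, or a second excluded block on the same page, would not be handled. The following is a minimal sketch of one way to cover those cases. It is not part of the Kentico API: the ExcludedMarkup class and its Strip method are hypothetical names chosen for illustration, and the sketch assumes the class attribute is written with double quotes and that the div tags on the page are balanced.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not a Kentico class): removes every <div> whose class
// attribute contains the do-not-search token, including divs nested inside it.
public static class ExcludedMarkup
{
    // Opening <div> whose double-quoted class attribute contains "do-not-search" as a whole token.
    private static readonly Regex ExcludedOpen = new Regex(
        "<div\\b[^>]*\\bclass\\s*=\\s*\"[^\"]*(?<![\\w-])do-not-search(?![\\w-])[^\"]*\"[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Any opening or closing <div> tag; group 1 is "/" for a closing tag.
    private static readonly Regex AnyDiv = new Regex(
        "<(/?)div\\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        Match open = ExcludedOpen.Match(html);
        while (open.Success)
        {
            int depth = 1;   // we are inside the excluded div
            int end = -1;    // index just past its matching </div>

            // Walk the following div tags, tracking nesting depth
            Match tag = AnyDiv.Match(html, open.Index + open.Length);
            while (tag.Success)
            {
                depth += tag.Groups[1].Length == 0 ? 1 : -1;
                if (depth == 0)
                {
                    end = tag.Index + tag.Length;
                    break;
                }
                tag = tag.NextMatch();
            }

            if (end < 0)
            {
                break; // unbalanced markup: leave the rest untouched
            }

            html = html.Remove(open.Index, end - open.Index);
            // Continue searching from the position where the block was removed
            open = ExcludedOpen.Match(html, open.Index);
        }

        return html;
    }
}
```

Under these assumptions, the IndexOf-based removal in the handler could be replaced with string result = ExcludedMarkup.Strip(originalHtml); and the rest of the handler left unchanged. For pages with irregular markup, a real HTML parser would be a more reliable choice than regular expressions.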
You have created the search index and a new class containing the example code, so let’s perform a test. Add a new Smart search dialog with results web part to your Home page and assign the DoNotSearchExample index to this search web part.
Add a new Editable text web part to your Home page and add SomeTestText to the content section.
After saving this page, go to Site Manager -> Administration -> Smart search -> edit the DoNotSearchExample index -> General, and rebuild the index. Once the index has been rebuilt, search for SomeTestText in your search web part to confirm that a result is returned and the search index is working correctly.
If no result is returned, please check your search index settings and verify that the code in your new class is correct. If a result is returned, your index and code are set up correctly and you can test excluding specific content.
Go back to the Editable text web part where you entered SomeTestText, switch the CKEditor to Source mode, and wrap SomeTestText in the do-not-search div tag:
<div class="do-not-search">SomeTestText</div>
Save the page, then go to Site Manager -> Administration -> Smart search -> edit the DoNotSearchExample index -> General, and rebuild the index. Once the index has been rebuilt, search for SomeTestText in your search web part. No results should be returned.
-eh-
Applies to: Kentico CMS 7.x