Portal Engine: questions on portal engine and web parts.
Version 7.x > Portal Engine
Member
ian.s@mmtdigital.co.uk - 2/4/2014 3:24:59 AM
   
Excluding certain pieces of *page content* from Smart Search indexing.
We have some widgets that we allow our CMS editors to create and add to various page templates. We also have a Smart Search in place that indexes all of our pages.

There are instances where we add a widget to a page, but we do not want the content of that widget to be indexed. Is there any flag that we can set around that content that will cause it to be ignored by the indexer?

Kentico Legend
Brenden Kehren - 2/4/2014 1:47:26 PM
   
RE:Excluding certain pieces of *page content* from Smart Search indexing.
You'd have to exclude the whole document or node; you can't exclude a specific part of a page.

Member
kentico_edwardh - 2/4/2014 6:37:46 PM
   
RE:Excluding certain pieces of *page content* from Smart Search indexing.
Hello,

There is no direct way to tell the indexer to index only part of a page. However, you can use a document crawler index and define custom processing. The document crawler index converts the HTML output of documents to plain text, stripping all HTML, scripts and formatting tags. You could create your own processing rules to ignore certain tags and so keep a specific portion of the page out of the index.

You need to implement your custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever a document search crawler processes the HTML output of a page. To assign a method as the handler for the OnHtmlToPlainText event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). The OnHtmlToPlainText event provides the following string parameters to the handler:
• plainText - contains the page output already stripped of all tags and converted to plain text.
• originalHTML - allows you to access the raw page code without any modifications.

The example code below excludes any content contained within a <div class="do-not-search"> tag from indexing when you use a document crawler index type. Please add this code to a new class within the App_Code folder.

using System;
using System.Text.RegularExpressions;
using System.Web;

using CMS.SettingsProvider;
using CMS.SiteProvider;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchHelper.OnHtmlToPlainText += new SearchHelper.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        /// <summary>
        /// Converts the raw page HTML to plain text, skipping do-not-search blocks.
        /// </summary>
        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
            const string openTag = "<div class=\"do-not-search\">";
            const string closeTag = "</div>";

            string result = originalHtml;

            // Remove every do-not-search block. A block ends at the first </div>
            // after its opening tag, so excluded blocks must not contain nested divs.
            int startIndex = result.IndexOf(openTag, StringComparison.Ordinal);
            while (startIndex >= 0)
            {
                int closeIndex = result.IndexOf(closeTag, startIndex, StringComparison.Ordinal);
                if (closeIndex < 0)
                {
                    // No closing tag found; leave the rest of the HTML untouched
                    break;
                }

                result = result.Remove(startIndex, closeIndex + closeTag.Length - startIndex);
                startIndex = result.IndexOf(openTag, startIndex, StringComparison.Ordinal);
            }

            // Remove new lines
            result = result.Replace("\n", " ");
            // Remove tab spaces
            result = result.Replace("\t", " ");

            // Remove head tag
            result = Regex.Replace(result, "<head.*?</head>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove style tag
            result = Regex.Replace(result, "<style.*?</style>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove any JavaScript
            result = Regex.Replace(result, "<script.*?</script>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove tags
            result = Regex.Replace(result, "<[^>]*>", " ");
            // Decode HTML entities
            result = HttpUtility.HtmlDecode(result);
            // Collapse runs of white space into a single space
            result = Regex.Replace(result, "\\s+", " ", RegexOptions.Singleline | RegexOptions.Compiled);

            return result;
        }
    }
}
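
Note that the handler above ends an excluded block at the first </div> it finds after the opening tag, so a widget that renders its own nested divs would only be partly removed. If that applies to your widgets, a depth-counting variant along the following lines might help. This is only a rough sketch with no Kentico dependency: ExcludedBlockStripper, RemoveExcludedBlocks and FindMatchingClose are hypothetical names, not part of the Kentico API, and the sketch assumes the opening tag is written exactly as <div class="do-not-search">. It also treats any tag starting with "<div" as a div, which is a simplification.

```csharp
using System;

public static class ExcludedBlockStripper
{
    // Assumed marker: the same opening tag the handler above looks for.
    private const string OpenMarker = "<div class=\"do-not-search\">";

    // Removes every do-not-search block, including divs nested inside it.
    public static string RemoveExcludedBlocks(string html)
    {
        int start = html.IndexOf(OpenMarker, StringComparison.OrdinalIgnoreCase);
        while (start >= 0)
        {
            int end = FindMatchingClose(html, start + OpenMarker.Length);
            if (end < 0)
            {
                break; // Unbalanced markup: leave the remainder untouched
            }

            html = html.Remove(start, end - start);
            start = html.IndexOf(OpenMarker, start, StringComparison.OrdinalIgnoreCase);
        }

        return html;
    }

    // Returns the index just past the </div> that closes the block whose
    // content starts at 'from', or -1 if no matching close exists.
    private static int FindMatchingClose(string html, int from)
    {
        int depth = 1;
        int pos = from;
        while (depth > 0)
        {
            int nextOpen = html.IndexOf("<div", pos, StringComparison.OrdinalIgnoreCase);
            int nextClose = html.IndexOf("</div>", pos, StringComparison.OrdinalIgnoreCase);
            if (nextClose < 0)
            {
                return -1;
            }

            if (nextOpen >= 0 && nextOpen < nextClose)
            {
                depth++; // A nested div opens before the next close
                pos = nextOpen + "<div".Length;
            }
            else
            {
                depth--; // The next close ends the innermost open div
                pos = nextClose + "</div>".Length;
            }
        }

        return pos;
    }

    public static void Main()
    {
        string html = "<p>Keep</p><div class=\"do-not-search\"><div>Skip</div> me</div><p>Also keep</p>";
        // prints "<p>Keep</p><p>Also keep</p>"
        Console.WriteLine(RemoveExcludedBlocks(html));
    }
}
```

In the handler you would then call RemoveExcludedBlocks(originalHtml) first and run the remaining tag-stripping steps on its return value.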


Best Regards,
Edward Hillard

Member
ian.s@mmtdigital.co.uk - 2/7/2014 5:14:00 AM
   
RE:Excluding certain pieces of *page content* from Smart Search indexing.
Hi Edward

This is brilliant feedback. Thank you very much. Very much appreciated.