Forums (Obsolete)

Member

ian.s@mmtdigital.co.uk - 2/4/2014 3:24:59 AM

Excluding certain pieces of *page content* from Smart Search indexing.

We have some widgets that we allow our CMS editors to create and add to various page templates. We also have a Smart Search in place that indexes all of our pages.

There are instances where we add a widget to a page, but we do not want the content of that widget to be indexed. Is there any flag that we can set around that content that will cause it to be ignored by the indexer?


Kentico Legend

Brenden Kehren - 2/4/2014 1:47:26 PM

RE:Excluding certain pieces of page content from Smart Search indexing.

You'd have to exclude the whole document or node; you can't exclude a specific part of a page.

Member

kentico_edwardh - 2/4/2014 6:37:46 PM

RE:Excluding certain pieces of *page content* from Smart Search indexing.

Hello,

There is no direct way to tell the indexer to index only part of a page. However, you can use a document crawler index and define custom processing. A document crawler index converts the HTML output of documents to plain text, stripping all HTML, script and formatting tags. You could add your own processing rules to ignore certain tags, so that a specific portion of the page is not indexed.

You need to implement your custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever a document search crawler processes the HTML output of a page. To assign a method as the handler for the event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). The OnHtmlToPlainText event provides the following string parameters to the handler:
• plainText - contains the page output already stripped of all tags and converted to plain text.
• originalHtml - contains the raw page HTML without any modifications.

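Below is example code that keeps content inside a do-not-search div out of the index when you use a document crawler index. The handler looks for the exact opening tag <div class="do-not-search"> and removes everything up to the first closing </div>, so as written it handles one such block per page, and that block must not contain nested div elements. Add this code to a new class within the App_Code folder.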

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;

using CMS.SettingsProvider;
using CMS.SiteProvider;
using System.Text.RegularExpressions;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchHelper.OnHtmlToPlainText += new SearchHelper.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
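            string result = originalHtml;

            // Find the first block marked with the exact opening tag <div class="do-not-search">
            const string openTag = "<div class=\"do-not-search\">";
            const string closeTag = "</div>";
            var startIndex = originalHtml.IndexOf(openTag, StringComparison.Ordinal);
            if (startIndex >= 0)
            {
                // Look for the first closing tag after the marker; nested divs are not handled
                var closeIndex = originalHtml.IndexOf(closeTag, startIndex, StringComparison.Ordinal);
                if (closeIndex >= 0)
                {
                    // Remove the marked block, including both tags
                    var endIndex = closeIndex + closeTag.Length;
                    result = originalHtml.Remove(startIndex, endIndex - startIndex);
                }
            }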

            // Remove new lines
            result = result.Replace("\n", " ");
            // Remove tab spaces
            result = result.Replace("\t", " ");

            // Remove head tag
            result = Regex.Replace(result, "<head.*?</head>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove style tag
            result = Regex.Replace(result, "<style.*?</style>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove any JavaScript
            result = Regex.Replace(result, "<script.*?</script>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove tags
            result = Regex.Replace(result, "<[^>]*>", " ");
            // Decode HTML entities
            result = HttpUtility.HtmlDecode(result);
            // Replace white spaces
            result = Regex.Replace(result, "\\s+", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

            return result;
        }
    }
}
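
The handler above removes only the first do-not-search block and stops at the first closing </div>. If a page can contain several such widgets, or widgets with nested divs, something along the following lines might work as a starting point. This is only a sketch: DoNotSearchStripper and Strip are hypothetical names, not part of the Kentico API, and the regex-based tag matching is an approximation rather than a real HTML parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the Kentico API): removes every
// <div class="... do-not-search ..."> block, including nested divs.
public static class DoNotSearchStripper
{
    // Matches any opening or closing div tag; group 1 is "/" for closing tags.
    private static readonly Regex DivTag = new Regex(
        @"<(/?)div\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Matches an opening div tag whose class attribute contains the
    // do-not-search token (alone or next to other class names).
    private static readonly Regex Marker = new Regex(
        @"^<div\b[^>]*\sclass\s*=\s*[""'][^""']*(?<![\w-])do-not-search(?![\w-])",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        var output = new StringBuilder(html.Length);
        int copyFrom = 0;    // start of the next span of HTML to keep
        int blockStart = 0;  // start of the excluded block currently open
        int depth = 0;       // div nesting depth inside an excluded block (0 = outside)

        foreach (Match tag in DivTag.Matches(html))
        {
            bool isClosing = tag.Groups[1].Length > 0;

            if (depth == 0)
            {
                // Outside an excluded block: only a marker div starts one
                if (!isClosing && Marker.IsMatch(tag.Value))
                {
                    blockStart = tag.Index;
                    depth = 1;
                }
            }
            else if (!isClosing)
            {
                // Nested div inside the excluded block
                depth++;
            }
            else
            {
                depth--;
                if (depth == 0)
                {
                    // Block closed: keep the text before it, then a single
                    // space so the neighbouring words do not run together
                    output.Append(html, copyFrom, blockStart - copyFrom);
                    output.Append(' ');
                    copyFrom = tag.Index + tag.Length;
                }
            }
        }

        // Keep whatever follows the last removed block. A marker div that is
        // never closed is left in place, as in the original handler.
        output.Append(html, copyFrom, html.Length - copyFrom);
        return output.ToString();
    }
}
```

Inside the handler, a call such as string result = DoNotSearchStripper.Strip(originalHtml); could then replace the IndexOf-based removal, with the remaining tag stripping steps left as they are.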

Best Regards,
Edward Hillard


Member

ian.s@mmtdigital.co.uk - 2/7/2014 5:14:00 AM

RE:Excluding certain pieces of page content from Smart Search indexing.

Hi Edward,

This is brilliant feedback. Thank you very much. Very much appreciated.