Hello,
There is no direct way to index only a given percentage of a page. However, you can use a document crawler index and define custom processing. The document crawler index converts the HTML output of documents to plain text, stripping all HTML, script, and formatting tags. You can create your own processing rules to ignore certain tags and so keep a specific portion of the page out of the index.
Implement your custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever a document search crawler processes the HTML output of a page. To assign a method as the handler for the OnHtmlToPlainText event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). The OnHtmlToPlainText event provides the following string parameters to the handler:
• plainText - contains the page output already stripped of all tags and converted to plain text.
• originalHTML - allows you to access the raw page code without any modifications.
The example code below keeps any content inside a <div class="do-not-search"> element out of the index when you use a document crawler index. Add this code to a new class in the App_Code folder.
using System;
using System.Text.RegularExpressions;
using System.Web;
using CMS.SettingsProvider;
using CMS.SiteProvider;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchHelper.OnHtmlToPlainText += new SearchHelper.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        /// <summary>
        /// Converts the raw page HTML to plain text, skipping do-not-search blocks.
        /// </summary>
        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
            // The opening tag must match exactly (no extra classes or attributes)
            const string openTag = "<div class=\"do-not-search\">";
            const string closeTag = "</div>";

            string result = originalHtml;

            // Remove every do-not-search block. Each block is assumed to end at the
            // first </div> after its opening tag, so nested divs are not supported.
            int startIndex = result.IndexOf(openTag, StringComparison.Ordinal);
            while (startIndex >= 0)
            {
                int closeIndex = result.IndexOf(closeTag, startIndex, StringComparison.Ordinal);
                if (closeIndex < 0)
                {
                    // No closing tag found: leave the remaining markup unchanged
                    break;
                }

                result = result.Remove(startIndex, closeIndex + closeTag.Length - startIndex);
                startIndex = result.IndexOf(openTag, startIndex, StringComparison.Ordinal);
            }

            // Remove new lines
            result = result.Replace("\n", " ");
            // Remove tab spaces
            result = result.Replace("\t", " ");
            // Remove head tag
            result = Regex.Replace(result, "<head.*?</head>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove style tag
            result = Regex.Replace(result, "<style.*?</style>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove any JavaScript
            result = Regex.Replace(result, "<script.*?</script>", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);
            // Remove tags
            result = Regex.Replace(result, "<[^>]*>", " ");
            // Decode HTML entities
            result = HttpUtility.HtmlDecode(result);
            // Collapse runs of white space into a single space
            result = Regex.Replace(result, "\\s+", " ", RegexOptions.Singleline | RegexOptions.Compiled);

            return result;
        }
    }
}
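Note that the handler above ends the excluded region at the first </div> it finds, so a div nested inside the do-not-search block would cut the region short. If your markup nests divs there, something along the lines of the following sketch may help. It counts opening and closing div tags to find the matching end. DoNotSearchStripper and RemoveExcludedBlocks are hypothetical names used for illustration, not part of the Kentico API, and the sketch still assumes the opening tag is written exactly as <div class="do-not-search">.

```csharp
using System;

// Hypothetical helper (not part of the Kentico API): removes every
// <div class="do-not-search"> block, including any divs nested inside it.
public static class DoNotSearchStripper
{
    public static string RemoveExcludedBlocks(string html)
    {
        const string marker = "<div class=\"do-not-search\">";
        const string open = "<div";
        const string close = "</div>";

        int start = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        while (start >= 0)
        {
            int depth = 1;                       // we are inside the marker div
            int pos = start + marker.Length;

            while (depth > 0)
            {
                int nextOpen = html.IndexOf(open, pos, StringComparison.OrdinalIgnoreCase);
                int nextClose = html.IndexOf(close, pos, StringComparison.OrdinalIgnoreCase);

                if (nextClose < 0)
                {
                    // Unbalanced markup: stop and leave the rest untouched
                    return html;
                }

                if (nextOpen >= 0 && nextOpen < nextClose)
                {
                    depth++;                     // a nested div opens first
                    pos = nextOpen + open.Length;
                }
                else
                {
                    depth--;                     // a div closes first
                    pos = nextClose + close.Length;
                }
            }

            // pos now points just past the matching </div>
            html = html.Remove(start, pos - start);
            start = html.IndexOf(marker, start, StringComparison.OrdinalIgnoreCase);
        }

        return html;
    }
}
```

In the handler you would then call DoNotSearchStripper.RemoveExcludedBlocks(originalHtml) before the tag-stripping steps. For example, the input <p>Keep</p><div class="do-not-search">Skip <div>nested</div> too</div><p>Also keep</p> comes back as <p>Keep</p><p>Also keep</p>. This is plain string scanning rather than real HTML parsing, so treat it as a starting point only.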
Best Regards,
Edward Hillard