Defining document index content

Two types of search indexes are available for the documents in a website's content tree: Documents and Documents crawler.

Documents type indexes include information from general document fields such as metadata, the text content of certain web parts placed on page (menu item) documents, and the selected fields of individual document types (described in the Settings for particular object types topic). Data from other documents or objects displayed by web parts is not indexed. For example, the content of news documents displayed by a Repeater web part placed on an indexed document is not added to the index.

Documents crawler type indexes directly parse the HTML output generated by documents, which means that all text located on or associated with a document is searchable. This allows document content to be searched more accurately than with a Documents type index. However, building and updating a Documents crawler index may require more time and resources, particularly for large indexes and complex documents. The crawler accesses documents under a specific user account (the administrator account by default). You can set the details of the crawler account by adding keys to your project's web.config file. All keys related to smart search indexes are listed in the Smart search settings section of Appendix B - Web.config parameters.

The process used to define which documents on the site should be indexed is the same for both document index types. Specify allowed or excluded content on the Index tab of the index's editing interface by clicking Add allowed content or Add excluded content respectively.


Adding allowed content

Allowed content defines which of the website's documents are included in the index. Specify documents using a combination of the following options:

•Path - path expression identifying the documents that should be indexed.

•Document types - allows you to limit which document types are included in the index.

The following properties define types of additional content that you can include in Documents search indexes. They are not available for Documents crawler indexes:

•Include ad-hoc forums - if checked, the index also includes the content of ad-hoc forums placed on the specified documents (if there are any).

•Include blog comments - if checked, the index also includes blog comments posted for the blog post documents.

•Include message boards - if checked, the index also includes message boards placed on the specified documents.

•Include categories - if checked, the index stores the display names of Categories assigned to the specified documents. This means that the search results also include documents that belong to categories whose name matches the search expression.


Examples:

Allowed content settings	Result
•Path: /% •Document types: empty	Indexes all documents on the site.
•Path: /Partners •Document types: empty	Only indexes the /Partners page, without the child pages placed under it.
•Path: empty •Document types: CMS.News	Indexes all documents of the CMS.News document type on the entire site. In this case, an empty path field value is equivalent to the /% expression.
•Path: /Products/% •Document types: CMS.Smartphone;CMS.Laptop	Indexes all documents of the CMS.Smartphone and CMS.Laptop document types found under the /Products section.

Adding excluded content

Excluded content allows you to remove documents or entire website sections from the allowed content. For example, if you allow /% and exclude /Special-pages/% at the same time, the index will include all documents on the site except for the ones found under the /Special-pages node.

You can specify the following options:

•Path - path expression identifying the documents that should be excluded.

•Document types - allows you to limit which document types are excluded from the index.


Examples:

Excluded content settings	Result
•Path: /Partners •Document types: empty	Excludes the /Partners page from the index. Child pages are not excluded.
•Path: empty •Document types: CMS.News	Excludes all documents of the CMS.News document type from the index. In this case, an empty path field value is equivalent to the /% expression.
•Path: /Products/% •Document types: CMS.Smartphone;CMS.Laptop	Excludes all documents of the CMS.Smartphone and CMS.Laptop document types found under the /Products section from the index.
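The way allowed and excluded content combine can be modeled with a short sketch. This is only an illustration of the rules described in the tables above, not the system's actual implementation: the ContentRule and IndexContentSketch classes are hypothetical, and the sketch assumes that % behaves like the SQL LIKE wildcard (any sequence of characters) and that matching ignores letter case.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical model of a single allowed or excluded content item (not a Kentico class)
public class ContentRule
{
    public string Path;           // e.g. "/Products/%"; empty is treated as "/%"
    public string DocumentTypes;  // e.g. "CMS.Smartphone;CMS.Laptop"; empty means all types

    public bool Matches(string aliasPath, string documentType)
    {
        string path = string.IsNullOrEmpty(Path) ? "/%" : Path;

        // Assumption: "%" matches any sequence of characters, as in SQL LIKE
        string pattern = "^" + Regex.Escape(path).Replace("%", ".*") + "$";
        bool pathMatches = Regex.IsMatch(aliasPath, pattern, RegexOptions.IgnoreCase);

        // An empty document type list does not restrict the rule
        bool typeMatches = string.IsNullOrEmpty(DocumentTypes) ||
            DocumentTypes.Split(';').Contains(documentType, StringComparer.OrdinalIgnoreCase);

        return pathMatches && typeMatches;
    }
}

public static class IndexContentSketch
{
    // A document is indexed if at least one allowed rule matches it and no excluded rule does
    public static bool IsIndexed(string aliasPath, string documentType,
                                 ContentRule[] allowed, ContentRule[] excluded)
    {
        return allowed.Any(rule => rule.Matches(aliasPath, documentType))
            && !excluded.Any(rule => rule.Matches(aliasPath, documentType));
    }
}
```

With one allowed rule whose Path is /% and one excluded rule whose Path is /Special-pages/%, the sketch reports /Home as indexed and /Special-pages/Secret as not indexed. A rule with the Path /Partners matches only the /Partners page itself, not its child pages.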


Excluding individual documents from all indexes

You can also exclude specific documents from all smart search indexing:

1. Go to CMS Desk and select the given document in the content tree.

2. In Edit mode, open the Properties -> Navigation tab.

3. Enable the Exclude from search property.

4. Click Save.

Customizing how document crawler indexes process page content (API)

By default, the system converts the HTML output of documents to plain text (stripped of all HTML tags, JavaScript and whitespace formatting) before saving it to document crawler indexes. If you wish to index the content of any tags or exclude parts of the page output, you can customize how the crawlers process the HTML.

You need to implement your custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever a document search crawler processes the HTML output of a page.

To assign a method as the handler for the OnHtmlToPlainText event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). You can define the content of the class as shown below:

[C#]

using CMS.SettingsProvider;
using CMS.SiteProvider;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchHelper.OnHtmlToPlainText += new SearchHelper.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
            // Add your custom HTML processing actions and return the result as a string.
            // Returning plainText unchanged is only a placeholder that keeps the default output.
            return plainText;
        }
    }
}

The OnHtmlToPlainText event provides the following string parameters to the handler:

•plainText - contains the page output already stripped of all tags and converted to plain text.

•originalHtml - allows you to access the raw page code without any modifications.
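As an illustration of what a handler body might do with the two parameters, the following sketch appends the alt text of images to the plain text, so that image descriptions become searchable. It uses only standard .NET classes. The CrawlerContentSketch class and the AppendImageAltText method are hypothetical names, and the regular expression is a simplification that only recognizes double-quoted alt values and does not decode HTML entities.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class (not part of the Kentico API)
public static class CrawlerContentSketch
{
    // Captures the value of double-quoted alt attributes inside <img> tags
    private static readonly Regex ImgAlt = new Regex(
        "<img\\b[^>]*?\\balt\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Same parameters and return type as the event handler:
    // takes the stripped text and the raw HTML, returns the text to process further
    public static string AppendImageAltText(string plainText, string originalHtml)
    {
        var result = new StringBuilder(plainText ?? string.Empty);

        foreach (Match match in ImgAlt.Matches(originalHtml ?? string.Empty))
        {
            string alt = match.Groups[1].Value.Trim();
            if (alt.Length > 0)
            {
                // Separates each alt text from the preceding content with a space
                result.Append(' ').Append(alt);
            }
        }

        return result.ToString();
    }
}
```

A handler could then return CrawlerContentSketch.AppendImageAltText(plainText, originalHtml) instead of the unmodified plainText value. For pages with complex markup, a dedicated HTML parser is likely to be more reliable than a regular expression.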

Help URL: http://devnet.kentico.com/docs/7_0/devguide/index.html?smart_search_definign_index_content.htm