There are two types of search indexes available for the documents in a website's content tree.
Documents type indexes include information from general document fields such as metadata, the text content of certain web parts placed on page (menu item) documents, and the selected fields of individual document types (described in the Settings for particular object types topic). Data from other documents or objects displayed by web parts is not indexed. For example, the content of news documents displayed by a Repeater web part placed on an indexed document is not added to the index.
Documents crawler type indexes directly parse the HTML output generated by documents, which means that all text located on or associated with a document is searchable. This allows document content to be searched more accurately than with a Documents type index. However, building and updating a Documents crawler index may require more time and resources, particularly in the case of large indexes and complex documents.
The crawler accesses documents under a specific user account (the administrator account by default). You can set the details of the crawler account by adding keys to your project's web.config file. All keys related to smart search indexes are listed in the Smart search settings section of Appendix B - Web.config parameters.
The process used to define which documents on the site should be indexed is the same for both document index types. Specify allowed or excluded content on the Index tab of the index's editing interface by clicking Add allowed content or Add excluded content respectively.
Allowed content defines which of the website's documents are included in the index. Specify documents using a combination of the following options:
•Path - path expression identifying the documents that should be indexed.
•Document types - allows you to limit which document types are included in the index.
The following properties define types of additional content that you can include in Documents search indexes. They are not available for Documents crawler indexes:
•Include ad-hoc forums - if checked, the index also includes the content of ad-hoc forums placed on the specified documents (if there are any).
•Include blog comments - if checked, the index also includes blog comments posted for the blog post documents.
•Include message boards - if checked, the index also includes message boards placed on the specified documents.
•Include categories - if enabled, the index stores the display names of Categories assigned to the specified documents. This means that the search results also include documents that belong to categories whose name matches the search expression.
Examples:
Allowed content settings | Result
•Path: /% •Document types: empty | Indexes all documents on the site.
•Path: /Partners •Document types: empty | Indexes only the /Partners page, without the child pages placed under it.
•Path: empty •Document types: CMS.News | Indexes all documents of the CMS.News document type on the entire site. In this case, an empty path field value is equivalent to the /% expression.
•Path: /Products/% •Document types: CMS.Smartphone;CMS.Laptop | Indexes all documents of the CMS.Smartphone and CMS.Laptop document types found under the /Products section.
Excluded content allows you to remove documents or entire website sections from the allowed content. For example, if you allow /% and exclude /Special-pages/% at the same time, the index includes all documents on the site except for those found under the /Special-pages node.
You can specify the following options:
•Path - path expression identifying the documents that should be excluded.
•Document types - allows you to limit which document types are excluded from the index.
Examples:
Excluded content settings | Result
•Path: /Partners •Document types: empty | Excludes the /Partners page from the index. Child pages are not excluded.
•Path: empty •Document types: CMS.News | Excludes all documents of the CMS.News document type from the index. In this case, an empty path field value is equivalent to the /% expression.
•Path: /Products/% •Document types: CMS.Smartphone;CMS.Laptop | Excludes all documents of the CMS.Smartphone and CMS.Laptop document types found under the /Products section from the index.
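The way allowed and excluded content combine can be illustrated with a small standalone model. This is only a sketch of the rules described above, not the system's implementation: the ContentRule and IndexScopeSketch types are hypothetical, and the code assumes that % stands for any sequence of characters (as in SQL LIKE) and that paths and document type names are compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative model only; none of these types belong to the CMS API.
public sealed class ContentRule
{
    public readonly string Path;           // e.g. "/Products/%"; empty is treated as "/%"
    public readonly string DocumentTypes;  // e.g. "CMS.Smartphone;CMS.Laptop"; empty means any type

    public ContentRule(string path, string documentTypes)
    {
        Path = path;
        DocumentTypes = documentTypes;
    }

    public bool Matches(string aliasPath, string documentType)
    {
        // Assumption: "%" matches any sequence of characters, as in SQL LIKE.
        string expression = string.IsNullOrEmpty(Path) ? "/%" : Path;
        string pattern = "^" + Regex.Escape(expression).Replace("%", ".*") + "$";
        bool pathOk = Regex.IsMatch(aliasPath, pattern, RegexOptions.IgnoreCase);

        // An empty type list places no restriction on the document type.
        bool typeOk = string.IsNullOrEmpty(DocumentTypes)
            || DocumentTypes.Split(';').Any(t =>
                string.Equals(t.Trim(), documentType, StringComparison.OrdinalIgnoreCase));

        return pathOk && typeOk;
    }
}

public static class IndexScopeSketch
{
    // A document is indexed when at least one allowed rule matches it
    // and no excluded rule matches it.
    public static bool IsIndexed(
        IEnumerable<ContentRule> allowed,
        IEnumerable<ContentRule> excluded,
        string aliasPath,
        string documentType)
    {
        return allowed.Any(r => r.Matches(aliasPath, documentType))
            && !excluded.Any(r => r.Matches(aliasPath, documentType));
    }

    public static void Main()
    {
        var allowed = new List<ContentRule> { new ContentRule("/%", "") };
        var excluded = new List<ContentRule> { new ContentRule("/Special-pages/%", "") };

        Console.WriteLine(IsIndexed(allowed, excluded, "/Partners/Gold", "CMS.MenuItem"));        // True
        Console.WriteLine(IsIndexed(allowed, excluded, "/Special-pages/Hidden", "CMS.MenuItem")); // False
    }
}
```

In this model, a rule with the path /Partners matches only the /Partners page itself, while /Partners/% matches only the pages under it, which mirrors the examples in the tables above.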
Excluding individual documents from all indexes
You can also exclude specific documents from all smart search indexing:
1. Go to CMS Desk and select the given document in the content tree.
2. In Edit mode, open the Properties -> Navigation tab.
3. Enable the Exclude from search property.
4. Click Save.
By default, the system converts the HTML output of documents to plain text (stripped of all HTML tags, JavaScript and whitespace formatting) before saving it to Documents crawler indexes. If you wish to index the content of any tags or exclude parts of the page output, you can customize how the crawlers process the HTML.
You need to implement your custom functionality in a handler for the OnHtmlToPlainText event of the CMS.SiteProvider.SearchHelper class. This event occurs whenever a document search crawler processes the HTML output of a page.
To assign a method as the handler for the OnHtmlToPlainText event, add a new class to the ~/App_Code folder of your web project (or ~/Old_App_Code on web application installations). You can define the content of the class as shown below:
[C#]
using CMS.SettingsProvider;
The OnHtmlToPlainText event provides the following string parameters to the handler:
•plainText - contains the page output already stripped of all tags and converted to plain text.
•originalHTML - allows you to access the raw page code without any modifications.
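As an illustration of what such a handler might do with these two parameters, the following standalone sketch excludes marked parts of the page from the indexed text. It assumes that the handler returns the text to be stored in the index as a string. The CrawlerHtmlSketch class, the ToPlainText method name and the noindex comment markers are hypothetical and not part of the system. Registering the method with the event is intentionally left out.

```csharp
using System;
using System.Text.RegularExpressions;

// Standalone sketch; the class, method name and markers are illustrative assumptions.
public static class CrawlerHtmlSketch
{
    // Hypothetical markers that an editor could place around content to be skipped.
    private static readonly Regex ExcludedBlock = new Regex(
        @"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Takes the same two strings that the event provides to its handler.
    public static string ToPlainText(string plainText, string originalHTML)
    {
        // Keep the default conversion when there is nothing to exclude.
        if (string.IsNullOrEmpty(originalHTML) || !ExcludedBlock.IsMatch(originalHTML))
        {
            return plainText;
        }

        // Drop the marked blocks, then scripts and styles, then all remaining tags.
        string html = ExcludedBlock.Replace(originalHTML, " ");
        html = ScriptOrStyle.Replace(html, " ");
        html = AnyTag.Replace(html, " ");

        // Collapse whitespace; HTML entities are left undecoded in this sketch.
        return Whitespace.Replace(html, " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>Keep</p><!-- noindex --><p>Skip</p><!-- /noindex --><p>Also</p>";
        Console.WriteLine(ToPlainText("Keep Skip Also", html)); // Keep Also
    }
}
```

The sketch falls back to plainText whenever no marked block is present, so pages without the hypothetical markers are indexed exactly as they would be by default.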