API Questions on Kentico API.
Version 6.x > API > Documents crawler smart search index content field
Certified Developer v7
CMeeg - 2/20/2012 4:53:37 AM
   
Documents crawler smart search index content field
The current website we're developing has some aggregate pages that include some simple repeaters that pull through their content from child documents. The child documents are used only to hold data that the aggregate pages display and are not "pages" in the normal sense.

We'd like these aggregate pages to appear in search results and so have added these pages to a documents crawler smart search index. This is all working fine and search results are being returned as expected - except the content field of these aggregate pages is always blank in the search results listing.

I understand that the content that is pulled through and displayed in search results is determined by the content field you have selected for a particular document type on the Search fields tab, but aggregate pages don't have any "content" of their own, as it's all just pulled through from child documents.

So, my question is this - if the documents are being crawled and their HTML content added to the index, how can I get at this indexed HTML content when displaying search results?

Kentico Support
kentico_jurajo - 2/20/2012 6:02:59 AM
   
RE:Documents crawler smart search index content field
Hi,

What are your index settings?

Have you created the documents crawler index?

Also, what is the content of the index?

Best regards,
Juraj Ondrus

Certified Developer v7
CMeeg - 2/20/2012 12:40:18 PM
   
RE:Documents crawler smart search index content field
Hi Juraj,

I have created the documents crawler index. The index is using the Standard analyzer and Default stop words setting. It has been added to the site and culture set to en-GB (the default and only culture available on the site).

The content of the index is a single allowed content item, which is set to index a bunch of document types including CMS.MenuItem as well as various custom doc types. The custom doc types in question are called RWS.News, RWS.NewsYear (for the aggregate pages I mentioned in the OP). The path is /%

The index is definitely working, as I'm getting results for all document types covered by the index. CMS.MenuItem results return content in the search results from the DocumentContent field as you would expect, but because the RWS.News and RWS.NewsYear pages do not have DocumentContent (i.e. their content is pulled through by a repeater), the search results content for these documents is just an empty string.

Should the content data returned by smart search return the "indexed" content rather than still trying to get content from the DocumentContent field? Or, if this is by design, is there a way that I can fetch this indexed content so that I do not get blank content fields in my search results?

Hope this all makes sense - let me know if there is any more info that would be of use.

Kentico Support
kentico_jurajo - 2/21/2012 12:57:54 AM
   
RE:Documents crawler smart search index content field
Hi,

What is the transformation you are using to display the results? The default transformation has this code:

<%#SearchHighlight(CMS.GlobalHelper.HTMLHelper.HTMLEncode(TextHelper.LimitLength(HttpUtility.HtmlDecode(
CMS.GlobalHelper.HTMLHelper.StripTags(CMS.ExtendedControls.ControlsHelper.RemoveDynamicControls(
GetSearchedContent(DataHelper.GetNotEmpty(Eval("Content"), ""))), false, " ")), 280, "...")),
"<span style=\"font-weight:bold;\">", "</span>")%>

As you can see, the Content field is used by default, so you may also need to change the transformation. You can use the GetSearchValue("fieldName") function to get the value of the appropriate field.
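
For readers unfamiliar with what that expression does, the steps it chains (strip tags, decode entities, limit the length to 280 characters, encode for output, highlight the searched words) can be approximated in plain C#. This is only a rough sketch under stated assumptions, not Kentico's implementation: the class and method names are hypothetical, the tag stripping is a naive regex, and the bold span used for highlighting is an assumed choice.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Kentico API.
public static class ExcerptSketch
{
    public static string BuildExcerpt(string content, string[] terms, int maxLength = 280)
    {
        if (string.IsNullOrEmpty(content)) return string.Empty;

        // Replace tags with a space, decode entities, collapse whitespace.
        string text = Regex.Replace(content, "<[^>]+>", " ");
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // Limit the length, appending "..." only when the text was cut.
        if (text.Length > maxLength)
            text = text.Substring(0, maxLength) + "...";

        // Encode for safe output before adding any markup of our own.
        string encoded = WebUtility.HtmlEncode(text);

        // Build one alternation pattern so all terms are highlighted in a
        // single pass and inserted markup is never matched again.
        string[] patterns = (terms ?? new string[0])
            .Where(t => !string.IsNullOrWhiteSpace(t))
            .Select(t => Regex.Escape(WebUtility.HtmlEncode(t)))
            .ToArray();
        if (patterns.Length == 0) return encoded;

        return Regex.Replace(
            encoded,
            string.Join("|", patterns),
            m => "<span style=\"font-weight:bold;\">" + m.Value + "</span>",
            RegexOptions.IgnoreCase);
    }
}
```

Note that the matching here is purely textual, so a short term such as "amp" would also match inside an encoded entity like `&amp;`. It also shows why an empty Content field produces an empty result: there is nothing for the later steps to work on.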

Best regards,
Juraj Ondrus

Certified Developer v7
CMeeg - 2/21/2012 4:25:56 AM
   
RE:Documents crawler smart search index content field
Hi Juraj,

That content field is the one that I'm using and it's always empty. As far as I'm aware the only fields available are: id, type, score, position, title, content, created and image. Is there another search field that I can use to get the indexed content with GetSearchValue?

I may not be explaining my problem very well, so here are some steps to recreate the issue, which should show more clearly what I'm trying to get at:

I've got a Kentico installation that uses the out-of-box ecommerce site template. I've added a new Smart Search index that uses the Documents crawler, Standard analyzer and Default stop words. I've added it to the site and added the en-US culture. I've then added some allowed content for Path /% and doc types CMS.MenuItem. I've rebuilt the index and it's added 25 items.

I then go to the Search preview tab (still within the Smart search area) and search for "Creative Zen" with search mode set to Any Word. There are 3 results returned and only one displays content. I believe the transformation used on that page uses the same code that you have posted in your previous message.

The result that displays content is for the home page and the reason it displays content is because it is the only one of the 3 that has "DocumentContent", but the problem with that is that the content in DocumentContent has no relation to the search you have just performed - it does not contain the words "Creative Zen". It is displayed in the search results though because there has been a match on the "indexed content".

It is this discrepancy between the "indexed content" and the "DocumentContent" that I want to get around. It would be nice if the search results returned had their Content search field set to the "indexed content" so that the results were more meaningful when displayed. It would also mean that all 3 of those results returned would show content in the search results.

Hopefully that will make a little more sense now! So my question remains, when displaying search results how can I display the indexed content so that the search results are more relevant?

Kentico Support
kentico_jurajo - 2/21/2012 8:21:13 AM
   
RE:Documents crawler smart search index content field
Hi,

The indexed content is not stored anywhere right now - but I have added it as a requirement to be considered for future versions. It is not available for performance reasons - storing it could have a negative impact.

Right now, the solution is to change the transformation as I mentioned, but in a more customized way - using a custom function, you get the HTML of that page, parse it, and display whatever you need in the results.
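
A minimal sketch of the parsing half of such a custom function might look like the following. It assumes the page HTML has already been downloaded by some other means and simply returns the text surrounding the first matched search word. The class name, method name and `radius` parameter are all hypothetical, and nothing here is a Kentico API.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Kentico API.
public static class CrawledSnippetSketch
{
    // "html" is page markup that the caller has already fetched.
    public static string SnippetAroundMatch(string html, string[] terms, int radius = 140)
    {
        // Drop script/style blocks, then remaining tags; decode entities
        // and collapse whitespace to get plain text.
        string text = Regex.Replace(html ?? string.Empty,
            @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        text = Regex.Replace(text, "<[^>]+>", " ");
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // Find the earliest occurrence of any search term.
        int hit = -1;
        foreach (string term in terms ?? new string[0])
        {
            if (string.IsNullOrWhiteSpace(term)) continue;
            int i = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
            if (i >= 0 && (hit < 0 || i < hit)) hit = i;
        }
        if (hit < 0) hit = 0; // no match: fall back to the start of the page

        // Cut a window of up to "radius" characters either side of the match.
        int start = Math.Max(0, hit - radius);
        int end = Math.Min(text.Length, hit + radius);
        string snippet = text.Substring(start, end - start);

        return (start > 0 ? "..." : "") + snippet + (end < text.Length ? "..." : "");
    }
}
```

The returned snippet is plain, decoded text, so it would still need to be HTML-encoded (and optionally highlighted) before being written into the results. Fetching each page at display time is likely to be slow, so caching the result per document would be worth considering.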

Best regards,
Juraj Ondrus

Certified Developer v7
CMeeg - 2/23/2012 6:03:08 AM
   
RE:Documents crawler smart search index content field
Thanks Juraj - I think it might be nice to have the option to get the indexed content from the documents crawler. Hopefully that will make it into a future version of Kentico.

For now, though, I ended up creating a custom search index that does much the same as the out-of-box documents crawler, except that instead of storing DocumentContent in the content field of the document in the Lucene index, it stores the indexed content as returned by the scraper. I can then fetch that indexed content in my search results and highlight matched words etc. as normal.

I also added a way to include and exclude certain elements from the HTML content that is returned from the scraper. If you are considering changing the requirements for the documents crawler in the future can I suggest adding this feature too? So, for example, you can specify that you only want to index content that is inside #content and exclude content in #sidebar,#footer.
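
To illustrate the include/exclude idea for anyone wanting to try something similar, here is a rough sketch rather than the code I actually used. It assumes the scraped markup is well-formed XHTML so that it can be loaded with XDocument; real-world HTML would need a proper HTML parser instead. The class and method names are hypothetical, and a selector such as #content is passed as the plain id "content".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; assumes well-formed XHTML input.
public static class ContentFilterSketch
{
    public static string ExtractText(string xhtml, string includeId, params string[] excludeIds)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // Find the element to index, e.g. id="content".
        XElement root = doc.Descendants()
            .FirstOrDefault(e => (string)e.Attribute("id") == includeId);
        if (root == null) return string.Empty;

        // Remove excluded elements inside it, e.g. id="sidebar".
        // ToList() materializes the matches before the tree is modified.
        root.Descendants()
            .Where(e => excludeIds.Contains((string)e.Attribute("id")))
            .ToList()
            .ForEach(e => e.Remove());

        // Join the remaining text nodes with spaces so that adjacent
        // elements do not run together, then collapse whitespace.
        string text = string.Join(" ",
            root.DescendantNodes().OfType<XText>().Select(t => t.Value));
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, calling `ContentFilterSketch.ExtractText(pageMarkup, "content", "sidebar", "footer")` would keep only the text inside the element with id "content", minus anything in a nested "sidebar" or "footer" element. Anything outside the included element is dropped anyway, so excludes only matter for nested elements.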

If anyone else happens to have similar requirements to mine, the Custom search index documentation was a good starting point for doing this, as was having a good rummage through the SearchHelper class found in CMS.SiteProvider - this exposed methods for scraping HTML as well as stripping HTML to provide the raw content, so I didn't have to write my own - thanks Kentico!

Performance-wise this should hopefully be no less performant than the built-in documents crawler although I expect the size of my index will be slightly larger depending on the content of the pages. The site I'm doing this for is pretty small though so at the moment I have noticed no problems from taking this approach.

Thanks again for your help Juraj.

Cheers,
Chris