Thanks Juraj - I think it might be nice to have the option to get the indexed content from the documents crawler. Hopefully that will make it into a future version of Kentico.
For now, though, I ended up creating a custom search index that does much the same as the out-of-the-box documents crawler. The difference is that, instead of storing DocumentContent in the content field of the document in the Lucene index, it stores the indexed content as returned by the scraper. I can then fetch that indexed content in my search results and highlight matched words etc. as normal.
I also added a way to include and exclude certain elements from the HTML content that is returned from the scraper. If you are considering changing the requirements for the documents crawler in the future, can I suggest adding this feature too? For example, you could specify that you only want to index content that is inside #content, and exclude content in #sidebar,#footer.
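To illustrate the include/exclude idea, here is a minimal sketch. It is written in Python with only the standard library, not the actual Kentico/.NET code, and the names IdFilter and extract_text are hypothetical. It assumes reasonably well-formed HTML (every non-void tag is closed) and matches elements by id only, not by full CSS selectors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdFilter(HTMLParser):
    """Collects text inside included ids, skipping anything inside excluded ids."""

    def __init__(self, include_ids, exclude_ids):
        super().__init__()
        self.include_ids = set(include_ids)
        self.exclude_ids = set(exclude_ids)
        self.stack = []   # one (included, excluded) pair per open element
        self.chunks = []  # text fragments that passed the filter

    def _state(self):
        return self.stack[-1] if self.stack else (False, False)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Children inherit the flags of their parent element.
        included, excluded = self._state()
        element_id = dict(attrs).get("id")
        if element_id in self.include_ids:
            included = True
        if element_id in self.exclude_ids or tag in ("script", "style"):
            excluded = True
        self.stack.append((included, excluded))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        included, excluded = self._state()
        text = data.strip()
        if included and not excluded and text:
            self.chunks.append(text)


def extract_text(html, include_ids, exclude_ids=()):
    """Return the plain text to index, joined with single spaces."""
    parser = IdFilter(include_ids, exclude_ids)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this sketch, calling extract_text(page, ["content"], ["sidebar", "footer"]) keeps only text under #content and drops anything nested in #sidebar or #footer. Exclusion wins over inclusion, so a sidebar placed inside the content area is still left out of the index.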
If anyone else happens to have similar requirements to mine, the custom search index documentation was a good starting point for doing this, as was having a good rummage through the SearchHelper class found in CMS.SiteProvider. It exposes methods for scraping HTML as well as stripping HTML to provide the raw content, so I didn't have to write my own - thanks Kentico!
Performance-wise, this should hopefully be no slower than the built-in documents crawler, although I expect my index will be slightly larger, depending on the content of the pages. The site I'm doing this for is pretty small, though, so at the moment I have noticed no problems from taking this approach.
Thanks again for your help Juraj.
Cheers,
Chris