Smart Search - custom index to search PDFs in media library

Shmar Hill asked on September 1, 2015 19:41

Hi,

Has anyone here created a smart search index for searching the content of PDFs stored in the Media Library?

I created a very basic smart search custom index that searches for content in PDFs.

It works on some PDFs, but when I tested it on more PDFs it crashed. It does not "like" some of the characters in the PDFs.

Any ideas? :)

Recent Answers


Roman Hutnyk answered on September 1, 2015 20:21

We had a similar problem on our project recently, and we implemented a custom text extractor for PDFs, which seems to be working fine so far.

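using System;
using System.IO;
using CMS.Base;
using CMS.DataEngine;
using CMS.Search;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Custom smart search text extractor that reads PDF text with PDFBox.
public class CustomSearchTextExtractor : ISearchTextExtractor
{
    public CMS.Base.XmlData ExtractContent(CMS.Core.BinaryData data, ExtractionContext context)
    {
        string result = String.Empty;
        // Path.GetTempFileName() creates its own empty .tmp file and returns a full path,
        // so build a unique .pdf path directly to avoid leaving a stray file behind.
        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".pdf");
        PDDocument doc = null;
        try
        {
            // PDFBox loads from a file path here, so write the binary data to disk first.
            File.WriteAllBytes(tempPath, data.Data);
            doc = PDDocument.load(tempPath);
            PDFTextStripper stripper = new PDFTextStripper();
            result = stripper.getText(doc);
        }
        finally
        {
            // Always release the document and remove the temporary file.
            if (doc != null)
            {
                doc.close();
            }

            if (File.Exists(tempPath))
            {
                File.Delete(tempPath);
            }
        }

        // Return the extracted text as the searchable content field.
        var content = new XmlData();
        content.SetValue(SearchFieldsConstants.CONTENT, result);
        return content;
    }
}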

evan pan answered on February 19, 2016 05:56

Hi,

Thanks for sharing this code. But I wonder whether I need a third-party PDF text extraction toolkit to help me extract text from PDF files. If so, it would be better if it offered a free trial package for users to evaluate. I will try it later and send you feedback.

Best regards,

Pan


kalpak shambharkar answered on March 1, 2016 14:32

We have a requirement to search the content of files uploaded to the media library. I am trying to use smart search for this but haven't had any success. I also found some articles that suggest creating a custom search index, but how do we create and apply such an index? If you have an alternative solution for searching content in the media library, please suggest it.


kalpak shambharkar answered on March 7, 2016 16:38

This code only shows how to extract the text from .pdf files. I am a bit confused about how to call this code from a search index, or where to apply it in our Kentico application. I would really appreciate it if you could help me out with this. Thanks in advance.

