Custom search index: do I need to write my own PDF, DOCX, XLSX, etc. extractors?

Mark Collins asked on July 10, 2017 21:43

I am building a custom search index for documents in SharePoint Online. Most of the documents are PDFs. Will Kentico 10 be able to extract their content natively, or do I need to supply my own content extractors? It appears there are built-in extractors for attachments, but I am not sure how to use any of them in my custom index code.

Code:

namespace Daikin.Kentico
{
    public class SPOFileIndex : ICustomSearchIndex
    {
        public void Rebuild(SearchIndexInfo srchInfo)
        {
            // Checks whether the index info object is defined
            if (srchInfo != null)
            {
                // Gets an index writer object for the current index
                IIndexWriter iw = srchInfo.Provider.GetWriter(true);

                // Checks whether the writer is defined
                if (iw != null)
                {
                    try
                    {
                        // Gets an info object of the index settings
                        SearchIndexSettingsInfo sisi = srchInfo.IndexSettings.Items[SearchHelper.CUSTOM_INDEX_DATA];

                        // The SharePoint list name is hard-coded for now instead of being read from the Index data field
                        // string path = Convert.ToString(sisi.GetValue("CustomData"));
                        string listName = "Documents";

                        // Gets the SharePoint connection
                        SharePointConnectionInfo connection = SharePointConnectionInfoProvider.GetSharePointConnectionInfo("XXXX_SharePoint", SiteContext.CurrentSiteID);

                        // Converts the SharePointConnectionInfo into a connection data object
                        SharePointConnectionData connectionData = connection.ToSharePointConnectionData();

                        // Gets the file service implementation
                        ISharePointFileService fileService = SharePointServices.GetService<ISharePointFileService>(connectionData);

                        // Gets the list service implementation
                        ISharePointListService listService = SharePointServices.GetService<ISharePointListService>(connectionData);

                        DataSet results = listService.GetListItems(listName);

                        // Counts indexed files so the index can be kept small during development
                        int i = 0;

                        // Loops through all files
                        foreach (DataRow dr in results.Tables[0].Rows)
                        {
                            // Gets the current file info
                            string spFile = dr["FileRef"].ToString();
                            string spServer = "https://xxxxxx.sharepoint.com";
                            ISharePointFile file = fileService.GetFile(spFile);
                            byte[] fileBytes = file.GetContentBytes();

                            // Checks that the file is not empty
                            if (fileBytes.Length > 0)
                            {
                                // NOTE: byte[].ToString() returns the type name ("System.Byte[]"), not the file text.
                                // This is the step where a content extractor is needed.
                                string text = fileBytes.ToString();

                                // Converts the text to lower case
                                text = text.ToLowerCSafe();

                                // Removes diacritics
                                text = TextHelper.RemoveDiacritics(text);

                                // Creates a new Lucene.Net search document for the current file
                                SearchDocumentParameters documentParameters = new SearchDocumentParameters()
                                {
                                    Index = srchInfo,
                                    Type = SearchHelper.CUSTOM_SEARCH_INDEX,
                                    Id = Guid.NewGuid().ToString(),
                                    Created = Convert.ToDateTime(dr["Created"])
                                };
                                ISearchDocument doc = SearchHelper.CreateDocument(documentParameters);

                                // Adds a content field. This field is processed when the search looks for matching results.
                                doc.AddGeneralField(SearchFieldsConstants.CONTENT, text, SearchHelper.StoreContentField, true);

                                // Adds a title field. The value of this field is used for the search result title.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_TITLE, file.Name, true, false);

                                // Adds a content field. The value of this field is used for the search result excerpt.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_CONTENT, TextHelper.LimitLength(text, 200), true, false);

                                // Adds a date field. The value of this field is used for the date in the search results.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_DATE, file.TimeCreated, true, false);

                                // Adds a url field. The value of this field is used for link urls in the search results.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_URL, spServer + spFile, true, false);

                                doc.AddGeneralField("SecurityRole", dr["SecurityRole"].ToString(), true, true);

                                // Adds an image field. The value of this field is used for the images in the search results.
                                // Commented out, since the image file does not exist by default
                                // doc.AddGeneralField(SearchFieldsConstants.CUSTOM_IMAGEURL, "textfile.jpg", true, false);

                                // Adds the document to the index
                                iw.AddDocument(doc);

                                // Stops after five files while developing
                                i = i + 1;
                                if (i == 5)
                                {
                                    break;
                                }
                            }
                        }

                        // Flushes the index buffer
                        iw.Flush();

                        // Optimizes the index
                        iw.Optimize();
                    }

                    // Logs any potential exceptions
                    catch (Exception ex)
                    {
                        EventLogProvider.LogException("CustomTextFileIndex", "Rebuild", ex);
                    }

                    // Always close the index writer
                    finally
                    {
                        iw.Close();
                    }
                }
            }
        }
    }
}
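
A note on why the code above indexes no real content: calling ToString() on a byte array returns the type name rather than the bytes decoded as text. The sketch below uses only the .NET base library (no Kentico types) to show this. The class and method names are made up for illustration, and the UTF-8 decoding only makes sense for plain-text files, not for binary formats such as PDF or DOCX, which still need a text extractor.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of Kentico or the original code.
public static class ByteArrayToStringDemo
{
    // Mirrors the question's "fileBytes.ToString()" call.
    public static string NaiveText(byte[] fileBytes)
    {
        // Arrays do not override Object.ToString(), so this yields the type name.
        return fileBytes.ToString();
    }

    // Decodes the bytes as UTF-8. Only meaningful for plain-text files.
    public static string DecodePlainText(byte[] fileBytes)
    {
        return Encoding.UTF8.GetString(fileBytes);
    }

    public static void Main()
    {
        byte[] fileBytes = Encoding.UTF8.GetBytes("Installation manual");

        Console.WriteLine(NaiveText(fileBytes));       // prints "System.Byte[]"
        Console.WriteLine(DecodePlainText(fileBytes)); // prints "Installation manual"
    }
}
```

So every document in the index ends up with the same content value, regardless of what the file contains.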

Correct Answer

Mark Collins answered on August 1, 2017 21:29

The properties/fields get added, but their content does not get indexed. I was able to make the part number and other items searchable by appending them directly to the content field, for example:

var content = $"{productType} {subproductType} {model} {documentType} {PartNumber} {name}";

doc.AddGeneralField(SearchFieldsConstants.CONTENT, textxml + content, SearchHelper.StoreContentField, true);
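
One detail worth checking in the concatenation above: textxml and content are joined without a separator, so the last word of the extracted text can fuse with the first metadata value. A minimal sketch of a helper that joins the parts with spaces and skips empty values follows. It is plain .NET; the class and method names are hypothetical and not part of the Kentico API, and the extracted text is assumed to already be available as a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Kentico or the original code.
public static class ContentBuilder
{
    // Joins the extracted file text and the metadata values into one
    // space-separated string, ignoring null, empty, or whitespace-only parts.
    public static string BuildSearchableContent(string extractedText, IEnumerable<string> metadataValues)
    {
        var parts = new List<string>();

        if (!string.IsNullOrWhiteSpace(extractedText))
        {
            parts.Add(extractedText.Trim());
        }

        parts.AddRange(metadataValues
            .Where(v => !string.IsNullOrWhiteSpace(v))
            .Select(v => v.Trim()));

        return string.Join(" ", parts);
    }

    public static void Main()
    {
        string content = BuildSearchableContent(
            "pump manual",
            new[] { "Chiller", "", "AB-123" });

        Console.WriteLine(content); // prints "pump manual Chiller AB-123"
    }
}
```

The resulting string could then be passed as the value of the content field in place of textxml + content.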


Recent Answers


Mike Wills answered on July 10, 2017 22:57

Hi Mark,

I don't think you'll have to create your own extractor, but I do think you'll need to read each file's binary data and then call the SearchTextExtractorManager.ExtractData method. I haven't tried it myself, so I'd love to hear how it goes.

Mike


Mustafa Muhammad Noman answered on July 26, 2017 17:08

Hi Mark, is your solution working? I'm also looking to index the file content of a SharePoint Online document library in Kentico. Could you share your code? It would be a great help.

Thanks


Mark Collins answered on July 27, 2017 16:06

I did get it working. However, now I am trying to figure out how to add additional content to the index. I assumed that when you add a general field:

doc.AddGeneralField("PartNumber", dr["PartNumber"].ToString(), true, true);

the field would be indexed, but it doesn't appear to be. I am going to check with our consultant.

Here is my current code:

using System;
using CMS.Search;
using CMS.DataEngine;
using CMS.IO;
using CMS.Helpers;
using CMS.EventLog;
using CMS.Base;
using CMS;
using CMS.SharePoint;
using CMS.SiteProvider;
using System.Data;
using System.Net;
using CMS.DocumentEngine;
using CMS.DocumentEngine.Web.UI;

[assembly: RegisterCustomClass("SPOFileIndex", typeof(Daikin.Kentico.SPOFileIndex))]
namespace Daikin.Kentico
{
    public class SPOFileIndex : ICustomSearchIndex
    {
        public void Rebuild(SearchIndexInfo srchInfo)
        {
            // Checks whether the index info object is defined
            if (srchInfo != null)
            {
                // Gets an index writer object for the current index
                IIndexWriter iw = srchInfo.Provider.GetWriter(true);

                // Checks whether the writer is defined
                if (iw != null)
                {
                    try
                    {
                        // Gets an info object of the index settings
                        SearchIndexSettingsInfo sisi = srchInfo.IndexSettings.Items[SearchHelper.CUSTOM_INDEX_DATA];

                        // Gets the search path from the Index data field
                        string listName = "Documents";

                        // Gets the SharePoint connection
                        SharePointConnectionInfo connection = SharePointConnectionInfoProvider.GetSharePointConnectionInfo("Daacloud_SharePoint", SiteContext.CurrentSiteID);

                        // Converts the SharePointConnectionInfo into a connection data object
                        SharePointConnectionData connectionData = connection.ToSharePointConnectionData();

                        // Gets the file service implementation
                        ISharePointFileService fileService = SharePointServices.GetService<ISharePointFileService>(connectionData);

                        // Gets the list service implementation
                        ISharePointListService listService = SharePointServices.GetService<ISharePointListService>(connectionData);
                        DataSet results = listService.GetListItems(listName);

                        // Keeps track of the number of files so I can control how many are in the index during development
                        int i = 0;
                        // Loops through all files
                        foreach (DataRow dr in results.Tables[0].Rows)
                        {
                            // Gets the current file info
                            string spFile =  dr["FileRef"].ToString();
                            ISharePointFile file = fileService.GetFile(spFile);                    
                            byte[] fileBytes = file.GetContentBytes();

                            // Checks that the file is not empty
                            if (fileBytes.Length > 0)
                            {                              
                                // Converts the text to lower case
                                // text = text.ToLowerCSafe();

                                // Removes diacritics
                                //  text = TextHelper.RemoveDiacritics(text);
                                string extension = file.Extension.ToString();
                                XmlData textxml = SearchTextExtractorManager.ExtractData(extension, fileBytes, null);
                                // Creates a new Lucene.Net search document for the current file
                                SearchDocumentParameters documentParameters = new SearchDocumentParameters()
                                {
                                    Index = srchInfo,
                                    Type = SearchHelper.CUSTOM_SEARCH_INDEX,
                                    Id = Guid.NewGuid().ToString(),
                                    Created = Convert.ToDateTime(dr["Created"])
                                };
                                ISearchDocument doc = SearchHelper.CreateDocument(documentParameters);

                                // Adds a content field. This field is processed when the search looks for matching results.
                                doc.AddGeneralField(SearchFieldsConstants.CONTENT, textxml, SearchHelper.StoreContentField, true);

                                // Adds a title field. The value of this field is used for the search result title.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_TITLE, file.Title, true, false);

                                // Adds a content field. The value of this field is used for the search result excerpt.
                               // string textLimit = XmlHelper.GetAttributeValue("_content")
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_CONTENT, dr["Description0"].ToString(), true, false);

                                // Adds a date field. The value of this field is used for the date in the search results.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_DATE, file.TimeCreated, true, false);

                                // Adds a url field. The value of this field is used for link urls in the search results.
                                doc.AddGeneralField(SearchFieldsConstants.CUSTOM_URL, TransformationHelper.HelperObject.GetSharePointFileUrl("Daacloud_SharePoint", spFile), true, false);

                                doc.AddGeneralField("SecurityRole", dr["SecurityRole"].ToString(), true, true);
                                doc.AddGeneralField("Archive Status", dr["ArchiveStatus"].ToString(), true, true);
                                doc.AddGeneralField("Product Type", dr["Product_x0020_Type"].ToString(), true, true);
                                doc.AddGeneralField("SubProduct Type", dr["SubProduct_x0020_Type"].ToString(), true, true);
                                doc.AddGeneralField("Model", dr["Model"].ToString(), true, true);
                                doc.AddGeneralField("Document Type", dr["Document_x0020_Type"].ToString(), true, true);
                                doc.AddGeneralField("PartNumber", dr["PartNumber"].ToString(), true, true);
                                doc.AddGeneralField("Title", dr["Title"].ToString(), true, true);
                                doc.AddGeneralField("Description", dr["Description0"].ToString(), true, true);
                                doc.AddGeneralField("SubType", dr["SubType"].ToString(), true, true);
                                doc.AddGeneralField("Name", dr["FileLeafRef"].ToString(), true, true);
                                //doc.AddGeneralField("IndexNumber", i, true, false);

                                // Adds an image field. The value of this field is used for the images in the search results.
                                // Commented out, since the image file does not exist by default
                                // doc.AddGeneralField(SearchFieldsConstants.CUSTOM_IMAGEURL, "textfile.jpg", true, false);

                                // Adds the document to the index
                                iw.AddDocument(doc);
                                i = i + 1;
                                // Change this condition (for example to i == 5) if you want to limit the index
                                if (i < 0)
                                {
                                    break;
                                }
                            }
                        }

                        // Flushes the index buffer
                        iw.Flush();

                        // Optimizes the index
                        iw.Optimize();
                    }

                    // Logs any potential exceptions
                    catch (Exception ex)
                    {
                        EventLogProvider.LogException("CustomTextFileIndex", "Rebuild", ex);
                    }

                    // Always close the index writer
                    finally
                    {
                        iw.Close();
                    }
                }
            }
        }
    }
}

Mike Wills answered on July 28, 2017 02:13

Hey Mark,

Maybe this would help. When I add fields to be indexed as properties I use the SearchHelper.AddGeneralField method like this:

SearchHelper.AddGeneralField(doc, "partnumber", dr["PartNumber"].ToString(), true, false, false);

I vaguely remember having to switch to using this helper method to solve the same issue.

Mike


Mark Collins answered on July 28, 2017 22:47

Thanks Mike. Unfortunately, that didn't work either. I am searching on part number and file name and not getting any results.


Mike Wills answered on July 29, 2017 00:23

Have you already tried opening the index in Luke to see if these properties are added to the index?


Mustafa Muhammad Noman answered on September 14, 2017 15:28

Hi Mark,

Were you able to make it work? Please share with us.

Thanks
