Tech by LLM
Laura L Martin

Blog by LLM
Speed Up Searching Inside.

A Problem Solving Experience

Skill Set Involved: Understanding how the repository software works, so that a solution can leverage its assets and work around its limitations.
Problem: Searching for content inside large files of different formats is very slow.
Solution: Implement a new storage paradigm to speed up search.

I was a member of a team responsible for building thousands of custom procedures that documented how to maintain power plant equipment. The client would send the team large packages of equipment manufacturer information in PDF format. The team gathered information from those PDFs and used it as the basis for building custom procedures in Word, enhanced with custom drawings, designs, and other material the client required.

Each PDF was a very large package of content, housing dozens of equipment manuals and running to thousands of pages. The client sent us hundreds of these PDF packages every month.

Initially the PDFs and the custom procedures (Word files) were stored in the same folder hierarchy on a local server, organized by plant. Searching the folder hierarchy for details inside the files was a very slow process, since the Windows search engine had to search tens of thousands of pages housed in both PDF and Word files.

I suggested a redesign of the repository, and the proposal was approved. I implemented a storage paradigm of two major hierarchies: all the PDF files were stored in one folder hierarchy, and all the custom procedures in Word were stored in another. The files were still organized secondarily by plant, so each of the two hierarchies needed its own set of plant subfolders. Replicating the plant subfolder structure in each hierarchy was worth the initial setup effort, since the Word files were now isolated from the much larger and denser PDFs.
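For readers who want to try a similar split on a shared drive, here is a minimal Python sketch of the idea. It is not the procedure I actually used; the hierarchy names ("Manufacturer PDFs", "Custom Procedures") and the assumption that the mixed tree is laid out as one folder per plant are hypothetical. It plans the moves first so they can be reviewed before anything is touched.

```python
from __future__ import annotations

import shutil
from pathlib import Path

# Hypothetical names for the two major hierarchies, keyed by file extension.
HIERARCHY_BY_EXTENSION = {
    ".pdf": "Manufacturer PDFs",
    ".doc": "Custom Procedures",
    ".docx": "Custom Procedures",
}


def plan_moves(mixed_root: Path, new_root: Path) -> list[tuple[Path, Path]]:
    """Return (source, destination) pairs without touching any files."""
    moves = []
    for source in sorted(mixed_root.rglob("*")):
        if not source.is_file():
            continue
        hierarchy = HIERARCHY_BY_EXTENSION.get(source.suffix.lower())
        if hierarchy is None:
            continue  # other formats stay where they are
        # The relative path begins with the plant folder, so the plant
        # structure is replicated under each new hierarchy.
        relative = source.relative_to(mixed_root)
        moves.append((source, new_root / hierarchy / relative))
    return moves


def apply_moves(moves: list[tuple[Path, Path]]) -> None:
    """Create the destination folders as needed and move each file."""
    for source, destination in moves:
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(destination))
```

Calling `plan_moves` with the old mixed root and a new root returns the list of planned moves; passing that list to `apply_moves` carries them out. Reviewing the plan first is a cheap safeguard when the repository holds client deliverables.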

Results: This sped up the search process dramatically, especially when searching for information in the custom work (the Word docs), because Windows search could be targeted to the smaller hierarchy of Word files. Previously, when all the PDF and Word files were stored together in the same plant folder, the search function had to "open" and search inside thousands of PDFs as well as the Word docs.
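I did not record measurements at the time, but the size of the search scope is easy to check. The following Python sketch, with a hypothetical function name, counts the files under a folder and sums their sizes, so the two hierarchies can be compared. It does not search inside files; it only shows how much material a content search would have to read.

```python
from __future__ import annotations

from pathlib import Path


def search_scope(root: Path) -> tuple[int, int]:
    """Count the files under root and sum their sizes in bytes."""
    file_count = 0
    total_bytes = 0
    for path in root.rglob("*"):
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
    return file_count, total_bytes
```

Running `search_scope` on the Word hierarchy and then on the PDF hierarchy gives a rough sense of how much work a search avoids when it is pointed at the Word hierarchy alone.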

Takeaway: Weigh the cost of a little setup work for a new storage paradigm against the long-term benefit of resolving a problem with the current one.

More tips: When designing your content repository, take into account:

Types of content
  - static reference material vs custom editable material
  - images vs text
  - structured content vs unstructured content
File formats to be used
  - PDF
  - Word
  - other
How will you need or want to search for information?
  - filenames only
  - search inside files
  - metadata tags
  - other
Type of repository to be used
  - cloud vs on-prem
  - ECM products such as Confluence, Box, SharePoint vs Windows-based shared drives on a server
The repository software's assets and limitations
  - proprietary formats only or does it play nice with other products?
  - how does it handle structured content vs unstructured content - awkwardly or adequately?
  - easy to cross-link/cross-reference related content?
  - automatically generates hierarchy structures, or must this be created manually?
  - and much more
What sort of organizational structure - using the available repository software - will achieve your most important goals, and relieve the most pain points?
