Tech by LLM
Laura L Martin
Speed Up Searching Inside.
A Problem Solving Experience
Skill Set Involved: Understanding the capabilities and constraints of the repository software, in order to leverage its assets and mitigate its limitations when working out solutions.
Problem: Searching for content inside large files of different formats is very slow.
Solution: Implement a new storage paradigm to speed up search.
I was a member of a team responsible for building custom procedures, numbering in the thousands, which documented how to maintain power plant equipment. The client would send the team large packages of equipment manufacturer information in PDF format. The team gathered information from those PDFs, and used that material as the basis for building custom procedures in Word, enhanced with custom drawings, designs and other material, which the client required.
Each PDF was a very large package of content, housing dozens of equipment manuals and running to thousands of pages. The client sent us hundreds of these PDF packages every month.
Initially the PDFs and the custom procedures (Word files) were stored in the same folder hierarchy on a local server, organized by plant. Searching the folder hierarchy for details inside the files was a very slow process, since the Windows search engine had to search tens of thousands of pages housed in both PDF and Word files.
I suggested a redesign of the repository and the proposal was approved. I implemented a storage paradigm of two major hierarchies: all the PDF files were stored in one folder hierarchy, and all the custom procedures in Word were stored in another. The files were still organized secondarily by plant, so each of the two hierarchies needed its own set of plant subfolders. Replicating the plant subfolder structure in each hierarchy was worth the initial setup effort, however, since the Word files were now isolated from the much larger and denser PDFs.
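As a rough illustration of that layout change, here is a minimal Python sketch. It is not the tooling used on the project. The top-level folder names (Reference_PDFs, Procedures_Word) and the function names are hypothetical, and the sketch only plans the moves as old/new path pairs rather than touching any files.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# Hypothetical names for the two major hierarchies (assumptions, not from the project).
HIERARCHY_BY_SUFFIX = {
    ".pdf": "Reference_PDFs",
    ".doc": "Procedures_Word",
    ".docx": "Procedures_Word",
}


def new_location(old_path: str) -> Optional[str]:
    """Map 'PlantA/Pump/file.pdf' to 'Reference_PDFs/PlantA/Pump/file.pdf'.

    The plant subfolder structure is kept intact under the new top-level
    folder. Returns None for file types that are not part of the split.
    """
    path = PurePosixPath(old_path)
    top = HIERARCHY_BY_SUFFIX.get(path.suffix.lower())
    if top is None:
        return None
    return str(PurePosixPath(top) / path)


def plan_moves(old_paths: List[str]) -> List[Tuple[str, str]]:
    """Return (old, new) path pairs without touching the file system."""
    moves = []
    for old in old_paths:
        new = new_location(old)
        if new is not None:
            moves.append((old, new))
    return moves
```

Planning the moves first, before anything is relocated, makes it possible to review the proposed structure with the team and catch surprises such as unexpected file types.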
Results: This sped up the search process dramatically, especially when searching for information in the custom work (the Word docs). Windows search could now be targeted at the smaller hierarchy of Word files. Previously, when all the PDF and Word files were stored together in the same plant folder, the search function had to "open" and search inside thousands of PDFs as well as the Word docs.
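The effect of narrowing the search root can be sketched in a few lines of Python. This is only an illustration of the idea, not how Windows search works internally. The paths and folder names below are made up for the example, and files_in_scope is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import List


def files_in_scope(all_paths: List[str], search_root: str) -> List[str]:
    """Return the files a content search rooted at search_root would have to open."""
    root = PurePosixPath(search_root)
    return [p for p in all_paths if root in PurePosixPath(p).parents]


# Hypothetical sample: one plant, two PDF packages, one Word procedure.
old_layout = [
    "PlantA/pump_manuals.pdf",
    "PlantA/valve_manuals.pdf",
    "PlantA/pump_procedure.docx",
]
new_layout = [
    "Reference_PDFs/PlantA/pump_manuals.pdf",
    "Reference_PDFs/PlantA/valve_manuals.pdf",
    "Procedures_Word/PlantA/pump_procedure.docx",
]

# Old layout: a search under the plant folder opens every file, PDFs included.
before = files_in_scope(old_layout, "PlantA")
# New layout: a search for custom work only opens the Word files.
after = files_in_scope(new_layout, "Procedures_Word/PlantA")
print(len(before), len(after))
```

In the real repository the difference was far larger than in this toy sample, because each PDF package ran to thousands of pages.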
Takeaway: When designing your content repository, take into account:
the types of content (e.g., reference material vs custom material)
the file formats to be used (e.g., PDF vs Word or other)
how you will need to search for information
the type of repository to be used (e.g., local server vs cloud)
the repository software's assets and limitations (e.g., cloud services like Confluence, Box, SharePoint, or on-prem solutions like Windows-based shared drives on a local server)
what sort of organizational structure - using the available repository software - will achieve your most important goals, and relieve the most pain points.
Weigh the pros and cons of a little setup work to implement a new storage paradigm against the long-term benefit of resolving a problem encountered with the previous paradigm.
All content copyright Laura L. Martin. All rights reserved.
This content may not be copied or reproduced without prior written permission and credit to Laura L. Martin.