Wordiscover is a Microsoft Word addin built in C# and the installer is available for download from http://wordiscover.cint.io. I am open-sourcing it because I think it provides some valuable insight into (A) how to create a Word addin, (B) how to efficiently search Word documents, and (C) an easy insight into async in C#.
Since this is a Winform app, the main logic extends from FindPane.cs
. Each section that lives on the Pane has its own .cs
file - i.e., the Results area lives in FindPane.Results.cs
. The main search logic lives in Finder.cs
and FinderHelpers.cs
.
The idea behind the search algorithm is to search a Word.Range
with a regular expression. Here's the idea... Each paragraph has a Range
that has a beginning and ending position (int
), and a Text
property which contains the paragraph text.
- For a regular regex search, we simply search that
Text
for a match and return any matches - For a proximity regex search, we need to get the position of
currentParagraph.Range.Start
and extend our search tocurrentParagraph+Proximity.Range.End
. Then we create aWord.Range
with those positions, grab theText
and search via regex again.
There were some interesting challenges I faced in making this addin as efficient as possible. The biggest challenge was figuring out a way to even perform such expensive searches efficiently and deliver results asynchronously.
I first tried to proximity search with regular expressions themselves - this is really awful in terms of performance and also leads to runaway regexes.
The solution I came up with was to just grab paragraph start/end positions, create a Word.Range
for them and search the range text. This led to my next problem.
My previous faillure led to this design, which follows this is idea:
- For each paragraph, grab the ending position of the proximity-number's paragraph and store in a
List
. - When search begins, iterate (
i
) through each paragraph. Set newWord.Range
start tocurrentParagraph.Start
and end toList[i+ProxNumber]
.
I started out with a regular Parallel for-loop to iterate the paragraphs, just accessing each paragraph by index: paragraphs[i]
. For some reason, and I have no idea why, the performance of this is terrible, terrible, terrible.
The solution? To use an Iterator
. Counterintuitive as it is, using the paragraphs iterator and Parallel.ForEach
helped to speed up the routine that grabs the ending position of every paragraph. If you understand why this is faster than simply indexing in via a basic for-loop, please send me the explanation! :)
Alas, this method wasn't foolproof. The problem was when to perform it. On document load would be nice, but what if the user changes the document and by doing so, makes all of the ending positions in the List
irrelevant? So, we perform this every time the user clicks 'Proximity Search.'
The other unfortunate part is that I couldn't get this routine to perform fast enough when operating asynchronously, so it is a sync method right now. (See Finder.Proximity.cs
)