@@ -216,7 +216,32 @@ <h1>Handling Metadata<a class="headerlink" href="#handling-metadata" title="Perm
< p > Neofuzz makes it easy to do fuzzy search in text corpora.
Sometimes, however, it is useful to access metadata about the entries retrieved in a fuzzy search.</ p >
< p > The most sensible way to handle this is to store your metadata in a table whose rows are in the same order as the entries of the corpus.</ p >
+ < div class ="highlight-python notranslate "> < div class ="highlight "> < pre > < span > </ span > < span class ="kn "> import</ span > < span class ="nn "> pandas</ span > < span class ="k "> as</ span > < span class ="nn "> pd</ span >
+
+ < span class ="n "> corpus</ span > < span class ="p "> :</ span > < span class ="nb "> list</ span > < span class ="p "> [</ span > < span class ="nb "> str</ span > < span class ="p "> ]</ span > < span class ="o "> =</ span > < span class ="p "> [</ span > < span class ="o "> ...</ span > < span class ="p "> ]</ span >
+ < span class ="n "> metadata</ span > < span class ="o "> =</ span > < span class ="n "> pd</ span > < span class ="o "> .</ span > < span class ="n "> DataFrame</ span > < span class ="p "> (</ span > < span class ="o "> ...</ span > < span class ="p "> )</ span >
+
+ < span class ="c1 "> # The tenth element in both corresponds to the same entry</ span >
+ < span class ="n "> tenth_text</ span > < span class ="o "> =</ span > < span class ="n "> corpus</ span > < span class ="p "> [</ span > < span class ="mi "> 9</ span > < span class ="p "> ]</ span >
+ < span class ="n "> tenth_metadata_entry</ span > < span class ="o "> =</ span > < span class ="n "> metadata</ span > < span class ="o "> .</ span > < span class ="n "> iloc</ span > < span class ="p "> [</ span > < span class ="mi "> 9</ span > < span class ="p "> ]</ span >
+ </ pre > </ div >
+ </ div >
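< p > One way to keep the two structures aligned is to derive both from a single list of records. The following is a minimal sketch under that assumption; the record fields (text, author, year) are hypothetical and only serve as an illustration.</ p >

```python
import pandas as pd

# Hypothetical records; the field names are illustrative only.
records = [
    {"text": "The quick brown fox", "author": "A. Writer", "year": 2019},
    {"text": "Jumps over the lazy dog", "author": "B. Author", "year": 2021},
    {"text": "Fuzzy search in large corpora", "author": "C. Scribe", "year": 2023},
]

# Derive both objects from the same list so that their order cannot diverge.
corpus = [record["text"] for record in records]
metadata = pd.DataFrame(records).drop(columns="text")

# Cheap sanity check before indexing: one metadata row per corpus entry.
assert len(corpus) == len(metadata)

# Position i in the corpus corresponds to row i of the table.
second_text = corpus[1]
second_metadata_entry = metadata.iloc[1]
```

< p > If the table is later sorted or filtered, the positional correspondence is lost, so any such operation would have to be applied to the corpus as well.</ p >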
< p > Then you can use the query() method to retrieve indices and distances instead of passages:</ p >
+ < div class ="highlight-python notranslate "> < div class ="highlight "> < pre > < span > </ span > < span class ="kn "> from</ span > < span class ="nn "> neofuzz</ span > < span class ="kn "> import</ span > < span class ="n "> Process</ span >
+
+ < span class ="n "> process</ span > < span class ="o "> =</ span > < span class ="n "> Process</ span > < span class ="p "> (</ span > < span class ="o "> ...</ span > < span class ="p "> )</ span >
+ < span class ="n "> process</ span > < span class ="o "> .</ span > < span class ="n "> index</ span > < span class ="p "> (</ span > < span class ="n "> corpus</ span > < span class ="p "> )</ span >
+
+ < span class ="c1 "> # Both results will be arrays shaped (len(search_terms), limit)</ span >
+ < span class ="n "> indices</ span > < span class ="p "> ,</ span > < span class ="n "> distances</ span > < span class ="o "> =</ span > < span class ="n "> process</ span > < span class ="o "> .</ span > < span class ="n "> query</ span > < span class ="p "> (</ span > < span class ="n "> search_terms</ span > < span class ="o "> =</ span > < span class ="p "> [</ span > < span class ="s2 "> "Search term 1"</ span > < span class ="p "> ,</ span > < span class ="s2 "> "Search term 2"</ span > < span class ="p "> ],</ span > < span class ="n "> limit</ span > < span class ="o "> =</ span > < span class ="mi "> 5</ span > < span class ="p "> )</ span >
+
+ < span class ="n "> results_for_term1</ span > < span class ="o "> =</ span > < span class ="p "> [</ span > < span class ="n "> corpus</ span > < span class ="p "> [</ span > < span class ="n "> idx</ span > < span class ="p "> ]</ span > < span class ="k "> for</ span > < span class ="n "> idx</ span > < span class ="ow "> in</ span > < span class ="n "> indices</ span > < span class ="p "> [</ span > < span class ="mi "> 0</ span > < span class ="p "> ]]</ span >
+ < span class ="n "> metadata_for_term1</ span > < span class ="o "> =</ span > < span class ="n "> metadata</ span > < span class ="o "> .</ span > < span class ="n "> iloc</ span > < span class ="p "> [</ span > < span class ="n "> indices</ span > < span class ="p "> [</ span > < span class ="mi "> 0</ span > < span class ="p "> ]]</ span >
+
+ < span class ="n "> results_for_term2</ span > < span class ="o "> =</ span > < span class ="p "> [</ span > < span class ="n "> corpus</ span > < span class ="p "> [</ span > < span class ="n "> idx</ span > < span class ="p "> ]</ span > < span class ="k "> for</ span > < span class ="n "> idx</ span > < span class ="ow "> in</ span > < span class ="n "> indices</ span > < span class ="p "> [</ span > < span class ="mi "> 1</ span > < span class ="p "> ]]</ span >
+ < span class ="n "> metadata_for_term2</ span > < span class ="o "> =</ span > < span class ="n "> metadata</ span > < span class ="o "> .</ span > < span class ="n "> iloc</ span > < span class ="p "> [</ span > < span class ="n "> indices</ span > < span class ="p "> [</ span > < span class ="mi "> 1</ span > < span class ="p "> ]]</ span >
+ </ pre > </ div >
+ </ div >
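< p > Once indices and distances are available, it can be convenient to gather the matched passages, their metadata and their distances into one table per search term. The sketch below does not call Neofuzz: the two arrays are made-up stand-ins with the shape (len(search_terms), limit) described above, and results_table is a hypothetical helper, not part of the library.</ p >

```python
import numpy as np
import pandas as pd

corpus = ["apple pie", "apple tart", "banana bread", "cherry cake"]
metadata = pd.DataFrame(
    {
        "category": ["pie", "tart", "bread", "cake"],
        "price": [4.5, 5.0, 3.0, 6.5],
    }
)

# Stand-ins for the output of a query with two search terms and limit=2.
# The values are invented for illustration.
indices = np.array([[0, 1], [2, 3]])
distances = np.array([[0.10, 0.35], [0.05, 0.60]])


def results_table(term_position: int) -> pd.DataFrame:
    """Combine passages, metadata and distances for one search term."""
    rows = indices[term_position]
    # Select metadata rows by position; copy so new columns can be added safely.
    table = metadata.iloc[rows].copy()
    table["text"] = [corpus[i] for i in rows]
    table["distance"] = distances[term_position]
    # Renumber the rows so that the index reflects the rank of each hit.
    return table.reset_index(drop=True)


results_for_term1 = results_table(0)
results_for_term2 = results_table(1)
```

< p > Keeping the distance next to each metadata row makes it straightforward to filter out weak matches afterwards, for example by comparing the distance column against a threshold of your choosing.</ p >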
</ section >
</ article >