making documentation of run modes more detailed

dbickson · dbickson · commit 96d02d03ae2b · 2022-05-12T15:41:17.000+03:00
diff --git a/RUN.md b/RUN.md
@@ -89,7 +89,7 @@ def load_binary_feature(filename):
 ```
 
 Faiss index files
-When using faiss an additional intermediate results file is created: faiss.index.
+- When using faiss an additional intermediate results file is created: `faiss.index`.
 
 ## Error handling
 When bad images are encountered, namely corrupted images that cannot be read, an additional csv output file called features.dat.bad is generated, and the bad image filenames are stored there. A printout also states the number of good and bad images encountered. The good image filenames are stored in features.dat.csv, so the bad images are excluded from the total image listing. The function fastdup.load_binary_feature() reads the features corresponding to the good images and returns a list of all the good images and a numpy array of their corresponding features.
@@ -119,12 +119,20 @@ Example command line:
 ```
 
 
-## Resuming a stored run
-There are 3 supported running modes:
-run_mode=0 (the default) does the feature extraction and NN embedding to provide similarities. It uses the input_dir command line argument for finding the directory to run on (or a list of files to run on). The features are extracted and saved into feature_out_file (the default features out file is features.dat in the same folder for storing the numpy features and features.dat.csv for storing the image file names corresponding to the numpy features).
-For larger dataset it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
-run_mode=1 computes the extracted features and stores them, does not compute the NN embedding. For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel. Use the min_offset and max_offset flags to allocate a subset of the images for each computing node. Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.
-run_mode=2, reads a stored feature file and computes the NN embedding to provide similarities. The input_dir param is ignored, and the features_out_file is used to point to the numpy feature file. (Give a full path and filename).
+## Advanced topics: resuming a stored run
+There are several supported running modes:
+- `run_mode=0` (the default) performs feature extraction and NN embedding to compute all-pairs similarities.
+It uses the `input_dir` command line argument to find the directory (or list of files) to run on. 
+The features are extracted and saved into the `work_dir` path (by default, `features.dat` stores the numpy features and
+`features.dat.csv`, in the same folder, stores the image file names corresponding to those features).
+For larger datasets it may be wise to split the run into two, so that intermediate results are stored in case you encounter an error. 
+- `run_mode=1` extracts the features and stores them, but does not compute the NN embedding. For large datasets, 
+the features can be extracted in parallel on several computing nodes. Use the `min_offset` and `max_offset` flags to allocate a subset of the images to each computing node. Offsets run from 0 to `n-1`, where `n` is the number of images in the `input_dir` folder.
+- `run_mode=2` reads a stored feature file and computes the NN embedding to provide similarities. The `input_dir` parameter is ignored, and `work_dir` is used to point to the numpy feature file (give a full path and filename).
+- `run_mode=3` reads the NN model stored in `faiss.index` from the `work_dir` and computes all-pairs similarity on all images given by the `input_dir` parameter. This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.
+- `run_mode=4` reads the NN model stored in `faiss.index` from the `work_dir` and computes all-pairs similarity on pre-extracted feature vectors computed by `run_mode=1`.  
+
+
 
 ## Visualizing the outputs
 Once fastdup has run, there are two easy ways to look at the results. When running from a Jupyter notebook, the code produces a table gallery. Otherwise, when running from a Python shell, an HTML report is generated.
@@ -152,8 +160,6 @@ fastdup.generate_duplicates_gallery(‘/path/to/similarity.csv’, save_path=’
 
 Note: the report should be generated on the same machine since we assume that the input folder for reading the images exists under the same location.
 
-Notes
-This is an experimental version tested up to 13M images
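
Note on the `run_mode=1` text in the hunk above: the per-node split via `min_offset` and `max_offset` can be sketched in plain Python. This is only an illustration, not fastdup code. The helper names `split_offsets` and `node_flags` are hypothetical, and treating each pair as a half-open range `[min_offset, max_offset)` is an assumption that should be checked against fastdup's actual behaviour before use.

```python
def split_offsets(n_images, n_nodes):
    """Split image indices 0..n_images-1 into contiguous per-node ranges.

    Returns a list of (min_offset, max_offset) pairs. Each pair is treated
    as a half-open range [min_offset, max_offset); that convention is an
    assumption of this sketch, not something stated in RUN.md.
    """
    if n_images < 0 or n_nodes <= 0:
        raise ValueError("n_images must be >= 0 and n_nodes must be > 0")
    base, extra = divmod(n_images, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional image each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def node_flags(n_images, n_nodes):
    """Build one dict of flags per node for a feature-extraction-only pass.

    The keys mirror the flag names used in RUN.md (run_mode, min_offset,
    max_offset); how they are passed to fastdup is left out on purpose.
    """
    return [
        {"run_mode": 1, "min_offset": lo, "max_offset": hi}
        for lo, hi in split_offsets(n_images, n_nodes)
    ]


if __name__ == "__main__":
    for node, flags in enumerate(node_flags(10, 3)):
        print(node, flags)
```

If there are more nodes than images, the trailing nodes receive empty ranges, so a caller would probably want to skip those.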