Commit 96d02d0

author
dbickson
committed
making documentation of run modes more detailed
1 parent 66267d1 commit 96d02d0

1 file changed

+15
-9
lines changed

RUN.md

Lines changed: 15 additions & 9 deletions
@@ -89,7 +89,7 @@ def load_binary_feature(filename):
 ```
 
 Faiss index files
-When using faiss an additional intermediate results file is created: faiss.index.
+- When using faiss an additional intermediate results file is created: `faiss.index`.
 
 ## Error handling
 When bad images are encountered, namely corrupted images that cannot be read, an additional csv output file called features.dat.bad is generated, and the bad image filenames are stored there. In addition, a printout states the number of good and bad images encountered. The good image filenames are stored in features.dat.csv; that is, the bad images are excluded from the total image listing. The function fastdup.load_binary_feature() reads the features corresponding to the good images and returns a list of all the good images and a numpy array of their corresponding features.
@@ -119,12 +119,20 @@ Example command line:
 ```
 
 
-## Resuming a stored run
-There are 3 supported running modes:
-run_mode=0 (the default) does the feature extraction and NN embedding to provide similarities. It uses the input_dir command line argument for finding the directory to run on (or a list of files to run on). The features are extracted and saved into feature_out_file (the default features out file is features.dat in the same folder for storing the numpy features and features.dat.csv for storing the image file names corresponding to the numpy features).
-For larger dataset it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
-run_mode=1 computes the extracted features and stores them, does not compute the NN embedding. For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel. Use the min_offset and max_offset flags to allocate a subset of the images for each computing node. Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.
-run_mode=2, reads a stored feature file and computes the NN embedding to provide similarities. The input_dir param is ignored, and the features_out_file is used to point to the numpy feature file. (Give a full path and filename).
+## Advanced topics: resuming a stored run
+There are several supported running modes:
+- `run_mode=0` (the default) does the feature extraction and NN embedding to compute all-pairs similarities.
+It uses the `input_dir` command line argument for finding the directory to run on (or a list of files to run on).
+The features are extracted and saved into the `work_dir` path (the default features out file name is `features.dat`
+in the same folder for storing the numpy features and `features.dat.csv` for storing the image file names corresponding to the numpy features).
+For larger datasets it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.
+- `run_mode=1` computes the extracted features and stores them, but does not compute the NN embedding. For large datasets,
+it is possible to run on a few computing nodes to extract the features in parallel. Use the `min_offset` and `max_offset` flags to allocate a subset of the images to each computing node. Offsets run from 0 to `n-1`, where `n` is the number of images in the `input_dir` folder.
+- `run_mode=2` reads a stored feature file and computes the NN embedding to provide similarities. The `input_dir` param is ignored, and the `work_dir` is used to point to the numpy feature file. (Give a full path and filename).
+- `run_mode=3` reads the NN model stored by `faiss.index` from the `work_dir` and computes all-pairs similarity on all images given by the `input_dir` parameter. This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.
+- `run_mode=4` reads the NN model stored by `faiss.index` from the `work_dir` and computes all-pairs similarity on pre-extracted feature vectors computed by `run_mode=1`.
+
+
 
 ## Visualizing the outputs
 Once fastdup runs you can look at the results in an easy way using two options. When running from a jupyter notebook the code will produce a table gallery. Otherwise, when running from a python shell, an html report will be generated.
@@ -152,8 +160,6 @@ fastdup.generate_duplicates_gallery(‘/path/to/similarity.csv’, save_path=’
 
 Note: the report should be generated on the same machine since we assume that the input folder for reading the images exists under the same location.
 
-Notes
-This is an experimental version tested up to 13M images
 
 
 
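The `run_mode=1` entry in the diff above says that `min_offset` and `max_offset` allocate a subset of the images to each computing node, with offsets running from 0 to `n-1`. A minimal sketch of how such per-node ranges might be computed is shown below. The helper `split_offsets` is hypothetical and not part of fastdup, and the sketch assumes `max_offset` is exclusive (half-open ranges); the documentation does not state this, so subtract 1 from each upper bound if it turns out to be inclusive.

```python
def split_offsets(n_images, n_nodes):
    """Split image indices 0..n_images-1 into n_nodes contiguous ranges.

    Hypothetical helper, not part of fastdup. Returns a list of
    (min_offset, max_offset) pairs, ASSUMING max_offset is exclusive.
    """
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")

    # Every node gets `base` images; the first `extra` nodes get one more.
    base, extra = divmod(n_images, n_nodes)

    ranges = []
    start = 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Each pair would be handed to one node together with run_mode=1.
    for node, (lo, hi) in enumerate(split_offsets(10, 3)):
        print(f"node {node}: min_offset={lo} max_offset={hi}")
```

The ranges are contiguous and non-overlapping, so every image is assigned to exactly one node. When there are more nodes than images, the surplus nodes receive empty ranges.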