
Is there a way to use this framework for identifying image manipulation, without the treatment and control setting? #2

Closed
codepujan opened this issue Mar 27, 2023 · 4 comments
Comments

@codepujan

Great work with the overall framework and setting up the benchmark.
The control and treatment setting confused me a lot.
Let's say I have a bunch of source images and a pool of images that are manipulated versions of those sources, with the corresponding ground truth for each source image.
For every source image, I want to benchmark the different algorithms on identifying the correct set of manipulated images. Is there a proper way to go about this?
Thanks

@chiffa
Collaborator

chiffa commented Mar 28, 2023

Yes - set up the database, run the perturbation generator, and then replace the generated folder with the images from your manipulated dataset. At that point you can follow the steps outlined in the README to get the statistics and the figures.

Hope that helps!

@Cyrilvallez
Owner

Cyrilvallez commented Mar 28, 2023

Sure, in your case you would first need to make sure that the images in your pool of manipulated versions of the source images are named with the following convention: source-name_attackID.extension. That is, if your source images are named ("img1.extension", "img2.extension", ...), use your ground truths to rename the manipulated pool of images to ("img1_manipulation1.extension", "img1_manipulation2.extension", "img2_manipulation1.extension", "img3_manipulation14.extension", ...). Note that the names themselves should never contain underscores ("_"), as the underscore is used to separate the original name from the manipulation name.
It does not matter what you call the manipulations, or even whether the numbering is consistent; the important part is that the name of the original image appears before the underscore (e.g. "img1_nxozhe.extension" will still work, since the name of the original is before the underscore).
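
For example, here is a minimal sketch of that renaming step, assuming you already have a ground-truth mapping from each file in the manipulated pool to its source image. The folder paths and the ground_truth dictionary below are hypothetical placeholders, not part of the framework:

from pathlib import Path
import shutil

pool = Path('path/to/pool/of/manipulated/images')
renamed = Path('path/to/renamed/pool')
renamed.mkdir(parents=True, exist_ok=True)

# Hypothetical ground truth: maps each manipulated filename to the stem of its source image
ground_truth = {
    'poolimg0001.png': 'img1',
    'poolimg0002.png': 'img1',
    'poolimg0003.png': 'img2',
}

counters = {}
for filename, source in ground_truth.items():
    counters[source] = counters.get(source, 0) + 1
    extension = Path(filename).suffix
    # Follow the convention source-name_attackID.extension (no extra underscores)
    new_name = f"{source}_manipulation{counters[source]}{extension}"
    shutil.copy(pool / filename, renamed / new_name)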

Then you would do:

import numpy as np
import hashing

path_database = 'path/to/source/images/'
path_dataset = 'path/to/pool/of/manipulated/images/'

dataset = hashing.create_dataset(path_dataset, existing_attacks=True)

algos = [
        # declare all algos you want
        hashing.ClassicalAlgorithm('Phash', hash_size=8),
        hashing.FeatureAlgorithm('ORB', n_features=30),
        hashing.NeuralAlgorithm('SimCLR v1 ResNet50 2x', device='cuda', distance='Jensen-Shannon')
        ]

thresholds = [
       # declare all thresholds you want
       np.linspace(0, 0.4, 20),
       np.linspace(0, 0.3, 20),
       np.linspace(0.3, 0.8, 20),
       ]

# Create the database for each algorithm and record the time it took
databases, time_database = hashing.create_databases(algos, path_database)

general_output, image_wise_output, running_time = hashing.hashing(
    algos, thresholds, databases, dataset, artificial_attacks=False)

The general_output will give you the overall number of images in your pool of manipulated images that triggered a match in the database, while image_wise_output will give you a more detailed breakdown, with the number of correct/incorrect detections for each image in the database (i.e. for each source image).

However, this version will only provide you with true positives/false negatives (as I think this is what you are interested in). If you are also interested in true negatives/false positives, you will still need to split your images into experimental and control groups, as we did in our benchmarks.

@codepujan
Author

Thank you for the quick response @Cyrilvallez. The setup works with the snippet above. I am also interested in identifying true negatives and false positives. In that case, I am still finding it difficult to understand the experimental/control group split used in the benchmark (I even read the paper).
Revisiting the setup: I have source images (s1, ..., s10), a bunch of images that are manipulations of each source, and a set of noise images that are not related to any of the source images at all. How would I go about dividing the experimental and control groups in this setting?
Thank you for the help.

@Cyrilvallez
Owner

Cyrilvallez commented Mar 29, 2023

Well, then the experimental group would be the manipulations of the source images (images that are supposed to be detected), and the control group would be the noise images (images that are not supposed to be detected). However, if the two groups do not contain the exact same number of images, be aware that the computed statistics (accuracy, precision, etc.) are NOT normalized against the number of images in each group (we always used the same number of images in each group). Thus they can be misleading in the case of a (large) imbalance between the two groups; see the sketch after the code below for one way to recompute rate-based metrics from raw counts.

Using this setup, you can simply follow all steps of the README using the experimental and control groups defined above as positive_dataset and negative_dataset respectively:

positive_dataset = hashing.create_dataset('path/to/experimental', existing_attacks=True)
negative_dataset = hashing.create_dataset('path/to/control', existing_attacks=True)
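
If your two groups do end up imbalanced, here is a minimal sketch, not part of the framework, of how rate-based metrics could be recomputed from raw detection counts so that the group sizes do not distort them. The four counts are hypothetical and would come from your own tally of the outputs:

# Hypothetical helper, independent of the hashing framework
def balanced_metrics(tp, fn, tn, fp):
    """Rate-based metrics that do not depend on the size of each group."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the experimental group
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # specificity on the control group
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {'recall': tpr, 'specificity': tnr, 'precision': precision,
            'balanced accuracy': (tpr + tnr) / 2}

# Example with an imbalanced split: 900 manipulated images vs 100 noise images
print(balanced_metrics(tp=850, fn=50, tn=90, fp=10))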

@Cyrilvallez Cyrilvallez added the good first issue Good for newcomers label Apr 1, 2023
@Cyrilvallez Cyrilvallez pinned this issue Apr 1, 2023