Skip to content

Latest commit

 

History

History
241 lines (181 loc) · 9.12 KB

README.md

File metadata and controls

241 lines (181 loc) · 9.12 KB

ISSUES (ICCVW 2023)

Mapping Memes to Words for Multimodal Hateful Meme Classification

arXiv GitHub Stars

This is the official repository of the paper "MappIng MemeS to WordS for MUltimodal Hateful MEme ClaSsification" (ISSUES).

Overview

Abstract

Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets.

Overview of the proposed approach. We disentangle CLIP common embedding space via linear projections. We employ textual inversion to make the textual representation multimodal. We fuse the textual and visual features with a Combiner architecture. $E_{L}$ represents the CLIP embedding lookup layer. $\phi$ indicates the SEARLE textual inversion network.

Getting Started

We recommend using the Anaconda package manager to avoid dependency/reproducibility problems. For Linux systems, you can find a conda installation guide here.

Installation

  1. Clone the repository
git clone https://github.com/miccunifi/ISSUES.git
  1. Install Python dependencies

Navigate to the root folder of the repository and use the command:

conda config --add channels conda-forge
conda create -n issues -y python=3.9.16
conda activate issues
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
conda install --file requirements.txt
pip install git+https://github.com/openai/CLIP.git
  1. Log in to your WandB account
wandb login

Datasets

We do not hold rights to the original HMC and HarMeme datasets. To download the full original datasets use the following links:

  • HMC [link] - Contains 12.140 memes
  • HarMeme [link] - Contains 3.544 memes

Data Preparation

Download the files in the release and place the resources folder in the root folder:

project_base_path
└─── resources
  ...
└─── src
  | combiner.py
  | datasets.py
  | engine.py
  ...

...

Ensure the HMC and HarMeme datasets match the following structure:

project_base_path
└─── resources
  └─── datasets
    └─── harmeme
      └─── clip_embds
          | test_no-proj_output.pt
          | train_no-proj_output.pt
          | val_no-proj_output.pt

      └─── img
          | covid_memes_2.png
          | covid_memes_3.png
          | covid_memes_4.png
          ....

      └─── labels
          | info.csv

    └─── hmc
      └─── clip_embds
          | dev_seen_no-proj_output.pt
          | dev_unseen_no-proj_output.pt
          | test_seen_no-proj_output.pt
          | test_unseen_no-proj_output.pt
          | train_no-proj_output.pt

      └─── img
          | 01235.png
          | 01236.png
          | 01243.png
          ....
        
      └─── labels
          | info.csv
  ...
  
└─── src
  | combiner.py
  | datasets.py
  | engine.py
  ...

...

Usage

Pre-trained models

We provide the pre-trained models in the release. Ensure that the weights match the following structure:

project_base_path
└─── resources
  └─── datasets
      ...
  └─── pretrained_models
      | hmc_text-inv-comb_best.ckpt
      | harmeme_text-inv-comb_best.ckpt
      
  └─── pretrained_weights
      | hmc
      | harmeme
      | phi
    
  
└─── src
  | combiner.py
  | datasets.py
  | engine.py
  ...

...

Training and Testing

We provide scripts for training and testing our approach on the HMC and HarMeme datasets.

project_base_path
└─── resources
  ...
  
└─── src
  ...


run_harmeme_text-inv-comb.sh
run_hmc_text-inv-comb.sh


...

To use a script, navigate to the root folder and use the following commands:

chmod +x <filename>.sh
./<filename>.sh

where:

  • <filename> = run_harmeme_text-inv-comb is related to the HarMeme dataset
  • <filename> = run_hmc_text-inv-comb is related to the HMC dataset

For training the model from scratch and then evaluating its performance, disable the --reproduce flag of the script.

For testing the pre-trained models and reproducing our results, enable the --reproduce flag of the script.

Arguments

In the following, we describe each argument of the scripts.

Experiments

  • dataset - dataset name: [hmc or harmeme]
  • num_mapping_layers - number of projection layers to map CLIP features in a task-oriented latent space
  • num_pre_output_layers - number of MLP hidden layers for performing the final classification
  • max_epochs - maximum number of epochs
  • lr - learning rate
  • batch_size - batch size
  • fast_process - flag to indicate whether to use pre-computed CLIP features as the input of the model instead of computing them during the training process
  • name - name of the model
  • pretrained_model - name of the checkpoint of the pretrained model in the 'pretrained_models' folder
  • reproduce - flag to indicate whether to perform the training process followed by the evaluation phase (False) or directly evaluate a pre-trained model on the test data (True)

General

  • map_dim - output dimension of the projected feature vectors
  • fusion - fusion method between the textual and visual modalities (when applicable): [concat or align]
  • pretrained_proj_weights - flag to indicate whether to use pre-trained projection weights (when applicable)
  • freeze_proj_layers - flag to indicate whether to freeze the pre-trained weights

Combiner Architecture

  • comb_proj - flag to indicate whether to project the input features of the Combiner
  • comb_fusion - fusion method to use to combine the input features of the Combiner
  • convex_tensor - flag to indicate whether to compute a tensor or a scalar as the output of the convex combination

Textual Inversion Architecture

  • text_inv_proj - flag to indicate whether to use CLIP textual encoder projection
  • phi_inv_proj - flag to indicate whether to project the output of phi network
  • post_inv_proj - flag to indicate whether to project the CLIP textual encoder output features
  • enh_text - flag to indicate whether to use a prompt with only the pseudo-word or concatenate the meme text
  • phi_freeze - flag to indicate whether to freeze the pre-trained phi network

Authors

Acknowledgements

Our code is based on SEARLE GitHub Stars and Hate-CLIPperGitHub Stars.

This work was partially supported by the European Commission under European Horizon 2020 Programme, grant number 101004545 - ReInHerit.