
A certain degree of mismatch between the Key and the original DataComp-1B #11

Open · Coobiw opened this issue Jul 1, 2024 · 3 comments
Coobiw commented Jul 1, 2024

Thanks for your great work! I've tried to match Recap-DataComp-1B with DataComp-1B by key (see #7). However, when generating 50M samples, I found that ~5M of them mismatch: I used the Hugging Face annotations to build a key-to-recaption mapping, and ~5M lookups raised KeyError. Could you give me some advice? Sincerely looking forward to your reply!
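
For reference, this is roughly what my matching looks like (a sketch with illustrative shard paths; I assume the annotations expose key and re_caption columns, as used below):

import glob
import json
import tarfile

from datasets import load_dataset

# Build key -> re_caption from the hf annotations (streaming keeps the
# iteration lazy; the dict itself is still very large at this scale).
anno = load_dataset("xxx/Recap-DataComp-1B/hf_anno", split="train", streaming=True)
key2recaption = {row["key"]: row["re_caption"] for row in anno}

# Look up the key of every locally downloaded sample and count the misses.
misses = 0
for shard in glob.glob("datacomp_shards/*.tar"):  # illustrative path
    with tarfile.open(shard) as tar:
        for member in tar:
            if not member.name.endswith(".json"):
                continue
            meta = json.loads(tar.extractfile(member).read())
            try:
                recaption = key2recaption[meta["key"]]
            except KeyError:
                misses += 1  # ~5M of these over 50M samples
print(f"{misses} keys missing from the annotations")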


Coobiw commented Jul 1, 2024

Additionally, I also found mismatches between the image and its caption, like the following:
[two screenshots showing an image paired with an unrelated caption]

@xhl-video

> Thanks for your great work! I've tried to match Recap-DataComp-1B with DataComp-1B by key (see #7). However, when generating 50M samples, I found that ~5M of them mismatch: I used the Hugging Face annotations to build a key-to-recaption mapping, and ~5M lookups raised KeyError. Could you give me some advice? Sincerely looking forward to your reply!

Hi, thanks for your interest! Have you tried matching by sha256 instead? I am not sure, but the mismatch may be because our copy of DataComp-1B contains fewer samples than the original (some URLs were invalid at download time).
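
Something along these lines could work (a rough sketch: it assumes your img2dataset download was run with compute_hash="sha256", so each sample's json side-file carries a sha256 field, and that sha256_to_recaption is a dict from the annotations' sha256 column to re_caption):

import json
import tarfile

# Same skeleton as key-based matching, but keyed on the image hash.
# sha256_to_recaption: dict built from the annotations (sha256 -> re_caption).
misses = 0
with tarfile.open("datacomp_shards/00000.tar") as tar:  # illustrative shard path
    for member in tar:
        if not member.name.endswith(".json"):
            continue
        meta = json.loads(tar.extractfile(member).read())
        if sha256_to_recaption.get(meta["sha256"]) is None:
            misses += 1
print(f"{misses} samples in this shard without a sha256 match")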


Coobiw commented Jul 4, 2024

Hi, I used the following code to build the sha256-to-recaption mapping.

import os
import pickle
from tqdm import tqdm
import datasets

os.environ["HF_DATASETS_OFFLINE"] = "1"
datasets.config.HF_DATASETS_OFFLINE = True

from datasets import load_dataset

# Load the dataset
data = load_dataset("xxx/Recap-DataComp-1B/hf_anno")
print("Finish Loading...")

train_data = data['train']

# Initialize the mapping
sha256torecaption = {}
part = 1
save_threshold = 400_000_000  # estimated number of entries per part

# Populate the mapping, splitting it into parts
for item in tqdm(train_data):
    meta_key, re_caption = item['sha256'], item['re_caption']
    sha256torecaption[meta_key] = re_caption

    # Save and reset the current part once it reaches the threshold
    if len(sha256torecaption) >= save_threshold:
        with open(f'sha256torecaption_part{part}.pkl', 'wb') as f:
            pickle.dump(sha256torecaption, f)
        print(f"Part {part} saved as sha256torecaption_part{part}.pkl")
        sha256torecaption = {}  # reset the dict
        part += 1

# Save the remaining entries
if sha256torecaption:
    with open(f'sha256torecaption_part{part}.pkl', 'wb') as f:
        pickle.dump(sha256torecaption, f)
    print(f"Part {part} saved as sha256torecaption_part{part}.pkl")

print("All parts saved.")

After this, I get three parts of the sha256-to-recaption mapping. I checked their lengths as follows:

>>> a = pickle.load(open("sha256torecaption_part1.pkl","rb"))
>>> len(a)
400000000
>>> a = pickle.load(open("sha256torecaption_part2.pkl","rb"))
>>> len(a)
400000000
>>> a = pickle.load(open("sha256torecaption_part3.pkl","rb"))
>>> len(a)
256418986

The sum of the lengths is ~1.06B (1,056,418,986). However, Recap-DataComp-1B has 1.23B items, and since a dict collapses duplicate keys, this ~0.17B gap already suggests that some sha256 values repeat. So I want to ask: is sha256 unique for each item?
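
To check this, one can union the key sets of the three parts and compare against the 1.23B total (a sketch; the set needs a lot of RAM at this scale, so on a smaller machine the keys would have to be bucketed first):

import pickle

unique_keys = set()
total = 0
for part in range(1, 4):
    with open(f'sha256torecaption_part{part}.pkl', 'rb') as f:
        mapping = pickle.load(f)
    total += len(mapping)
    unique_keys.update(mapping.keys())
    del mapping  # free this part before loading the next one
# If distinct sha256 < 1.23B, some items share a sha256.
print(f"sum of part sizes: {total:,}, distinct sha256: {len(unique_keys):,}")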

Thanks for your reply!
