
A certain degree of mismatch between the Key and the original DataComp-1B #11

Open · Coobiw opened this issue Jul 1, 2024 · 3 comments
Coobiw commented Jul 1, 2024

Thanks for your great work! I've tried to match Recap-DataComp-1B with DataComp-1B by key (see #7). However, when generating 50M samples, I found that ~5M of them mismatch: I used the Hugging Face annotations to build a key-to-recaption mapping, and ~5M lookups raised KeyError. Could you give me some advice? Sincerely looking forward to your reply!
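
For reference, this is roughly what my matching looks like (a sketch with illustrative shard paths; I assume the annotations expose key and re_caption columns, as used below):

import glob
import json
import tarfile

from datasets import load_dataset

# Build key -> re_caption from the hf annotations (streaming keeps the
# iteration lazy; the dict itself is still very large at this scale).
anno = load_dataset("xxx/Recap-DataComp-1B/hf_anno", split="train", streaming=True)
key2recaption = {row["key"]: row["re_caption"] for row in anno}

# Look up the key of every locally downloaded sample and count the misses.
misses = 0
for shard in glob.glob("datacomp_shards/*.tar"):  # illustrative path
    with tarfile.open(shard) as tar:
        for member in tar:
            if not member.name.endswith(".json"):
                continue
            meta = json.loads(tar.extractfile(member).read())
            try:
                recaption = key2recaption[meta["key"]]
            except KeyError:
                misses += 1  # ~5M of these over 50M samples
print(f"{misses} keys missing from the annotations")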


Coobiw commented Jul 1, 2024

Additionally, I also found mismatches between the image and its caption, like the following:
[two screenshots showing an image paired with an unrelated caption]

@xhl-video

> Thanks for your great work! I've tried to match Recap-DataComp-1B with DataComp-1B by key (see #7). However, when generating 50M samples, I found that ~5M of them mismatch: I used the Hugging Face annotations to build a key-to-recaption mapping, and ~5M lookups raised KeyError. Could you give me some advice? Sincerely looking forward to your reply!

Hi, thanks for your interest! Have you tried matching by sha256 instead? I am not sure, but the mismatch may be because our copy of DataComp-1B contains fewer samples than the original (some URLs were invalid at download time).
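
Something along these lines could work (a rough sketch: it assumes your img2dataset download was run with compute_hash="sha256", so each sample's json side-file carries a sha256 field, and that sha256_to_recaption is a dict from the annotations' sha256 column to re_caption):

import json
import tarfile

# Same skeleton as key-based matching, but keyed on the image hash.
# sha256_to_recaption: dict built from the annotations (sha256 -> re_caption).
misses = 0
with tarfile.open("datacomp_shards/00000.tar") as tar:  # illustrative shard path
    for member in tar:
        if not member.name.endswith(".json"):
            continue
        meta = json.loads(tar.extractfile(member).read())
        if sha256_to_recaption.get(meta["sha256"]) is None:
            misses += 1
print(f"{misses} samples in this shard without a sha256 match")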


Coobiw commented Jul 4, 2024

Hi, I used the following code to build the sha256-to-recaption mapping.

import os
import pickle
from tqdm import tqdm
import datasets

os.environ["HF_DATASETS_OFFLINE"] = "1"
datasets.config.HF_DATASETS_OFFLINE = True

from datasets import load_dataset

# Load the dataset
data = load_dataset("xxx/Recap-DataComp-1B/hf_anno")
print("Finish Loading...")

train_data = data['train']

# Initialize the mapping
sha256torecaption = {}
part = 1
save_threshold = 400_000_000  # estimated number of entries per part

# Populate the mapping, splitting it into parts
for item in tqdm(train_data):
    meta_key, re_caption = item['sha256'], item['re_caption']
    sha256torecaption[meta_key] = re_caption

    # Save and reset the current part once it reaches the threshold
    if len(sha256torecaption) >= save_threshold:
        with open(f'sha256torecaption_part{part}.pkl', 'wb') as f:
            pickle.dump(sha256torecaption, f)
        print(f"Part {part} saved as sha256torecaption_part{part}.pkl")
        sha256torecaption = {}  # reset the dict
        part += 1

# Save the remaining entries
if sha256torecaption:
    with open(f'sha256torecaption_part{part}.pkl', 'wb') as f:
        pickle.dump(sha256torecaption, f)
    print(f"Part {part} saved as sha256torecaption_part{part}.pkl")

print("All parts saved.")

After this, I get three parts of the sha256-to-recaption mapping. I checked their lengths as follows:

>>> a = pickle.load(open("sha256torecaption_part1.pkl","rb"))
>>> len(a)
400000000
>>> a = pickle.load(open("sha256torecaption_part2.pkl","rb"))
>>> len(a)
400000000
>>> a = pickle.load(open("sha256torecaption_part3.pkl","rb"))
>>> len(a)
256418986

The sum of the lengths is ~1.06B (1,056,418,986). However, Recap-DataComp-1B has 1.23B items, and since a dict collapses duplicate keys, this ~0.17B gap already suggests that some sha256 values repeat. So I want to ask: is sha256 unique for each item?
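
To check this, one can union the key sets of the three parts and compare against the 1.23B total (a sketch; the set needs a lot of RAM at this scale, so on a smaller machine the keys would have to be bucketed first):

import pickle

unique_keys = set()
total = 0
for part in range(1, 4):
    with open(f'sha256torecaption_part{part}.pkl', 'rb') as f:
        mapping = pickle.load(f)
    total += len(mapping)
    unique_keys.update(mapping.keys())
    del mapping  # free this part before loading the next one
# If distinct sha256 < 1.23B, some items share a sha256.
print(f"sum of part sizes: {total:,}, distinct sha256: {len(unique_keys):,}")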

Thanks for your reply!
