dedup abcd before delex #7
base: main
scripts/process_abcd.py
Outdated
CWD = os.getcwd()

FILE = os.path.join(CWD, "abcd/data/abcd_v1.1.json")
SAVE_DIR = os.path.join(CWD, "data/abcd")
FILE = os.path.join(CWD, "../abcd/data/abcd_v1.1.json")
If you change the relative path like this, it will only work for you.
Instead, let's get the absolute path of the file, like:
import os
os.path.dirname(os.path.abspath(__file__))
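A minimal sketch of how that suggestion might look, assuming the script sits in a scripts/ directory one level below the repository root (an assumption based on the paths in the diff). The helper name `repo_path` is hypothetical and not part of the PR.

```python
import os


def repo_path(script_file, *parts):
    """Join `parts` onto the directory one level above the script's directory.

    Assumes the script lives in <repo>/scripts/, so the parent of its
    directory is the repository root (an assumption, not confirmed by the PR).
    """
    script_dir = os.path.dirname(os.path.abspath(script_file))
    repo_root = os.path.dirname(script_dir)
    return os.path.join(repo_root, *parts)


# Inside scripts/process_abcd.py one would pass __file__ as `script_file`.
# A fixed example path is used here so the sketch runs anywhere.
example_script = os.path.join(os.sep, "repo", "scripts", "process_abcd.py")
FILE = repo_path(example_script, "abcd", "data", "abcd_v1.1.json")
SAVE_DIR = repo_path(example_script, "data", "abcd")
```

Because the paths are anchored to the script's own location, they resolve the same way regardless of the directory the script is launched from.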
scripts/process_abcd.py
Outdated
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from utils import convert_abcd_line, MAP, EOS, decide_delex_level
from policy_functions import delex_line

unique_sents = set()
for global variables, use capitals: "UNIQUE_SENTS = set()"
@@ -332,6 +356,13 @@ def parse_args():
        default=None,
        help="sample n",
    )
    parser.add_argument(
"store_true"
parser.add_argument(
    "--use_slow_tokenizer",
    action="store_true",
    help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
)
scripts/process_abcd.py
Outdated
for i, sent in enumerate(sents):
    if sent in unique_sents:
        sents[i] = '<SENT_MASK>'
use "<MASK>"
save the files under |
@@ -4,15 +4,18 @@
import argparse
from tqdm import tqdm
import numpy as np
from nltk.tokenize import sent_tokenize

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
this might be a bit dangerous because it changes the sys path. instead of append, we can just call os.path.join(os.abs_something, "abcd/data/abcd_v1.1.json")

I think if that line's not there, some of the imports won't work, since utils.py and policy_functions.py are in the directory above.
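One possible middle ground, sketched under the assumption that the imports really do need the parent directory on the path: add it once and only if it is missing, so repeated runs or imports don't keep growing `sys.path`. The helper name `add_parent_dir_to_sys_path` is hypothetical and not taken from the PR.

```python
import os
import sys


def add_parent_dir_to_sys_path(script_file):
    """Put the directory above the script's directory on sys.path, once."""
    parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(script_file)))
    if parent_dir not in sys.path:
        # Appending (rather than inserting at 0) keeps installed packages
        # ahead of local modules with the same name.
        sys.path.append(parent_dir)
    return parent_dir


# In scripts/process_abcd.py this would be called with __file__ before
# `from utils import ...`; a fixed example path is used here instead.
example_script = os.path.join(os.sep, "repo", "scripts", "process_abcd.py")
parent = add_parent_dir_to_sys_path(example_script)
```

This only addresses the imports; data file locations can still be built with os.path.join from the script's absolute path, as suggested above.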
@@ -332,6 +356,11 @@ def parse_args():
        default=None,
        help="sample n",
    )
    parser.add_argument(
        "--deduplicate",
        action = "store_true",
it probably wouldn't change anything, but i'd add default=False for readability
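A small sketch of the point being made: with argparse, `action="store_true"` already defaults to False when the flag is absent, so `default=False` changes nothing functionally and only makes the intent explicit. The help text below is a placeholder, not the PR's wording.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--deduplicate",
    action="store_true",
    default=False,  # same as the implicit default; spelled out for readability
    help="placeholder help text (hypothetical, not from the PR)",
)

args_off = parser.parse_args([])                 # flag not passed
args_on = parser.parse_args(["--deduplicate"])   # flag passed
```

Here `args_off.deduplicate` is False and `args_on.deduplicate` is True, with or without the explicit default.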
scripts/dedup_wiki.py
Outdated
@@ -0,0 +1,64 @@
from nltk.tokenize import sent_tokenize

with open("../data/wikitext-2-raw/train.txt") as inp:
same here, be careful when using a relative path, because it depends on which dir you are running the file from
scripts/dedup_wiki.py
Outdated
wiki_dedup = wiki(txt)

with open('./train.txt', 'w') as inp:
same here, relative path
if sent in unique_sents:
    sents[i] = '<MASK>'
else:
    unique_sents.add(sent)
is the order preserved for a set if you add to it like this?
If you are doing this, i think it's not necessary to use a set; you can just use a list?

I just use the set to check whether the sentence has already been seen, so the set's order isn't really affecting anything.
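To illustrate the reply, here is a minimal sketch of the masking logic, assuming the set is used only for membership checks. The function name `mask_repeated_sentences` and the sample sentences are made up for illustration. Output order comes from the input list, not from the set, and a set gives average constant-time lookups where a list would scan every element.

```python
def mask_repeated_sentences(sents, seen, mask="<MASK>"):
    """Return a copy of `sents` with already-seen sentences replaced by `mask`.

    `seen` is a set shared across calls; it is only queried and added to,
    so its internal ordering never influences the result.
    """
    masked = []
    for sent in sents:
        if sent in seen:
            masked.append(mask)
        else:
            seen.add(sent)
            masked.append(sent)
    return masked


seen = set()
first = mask_repeated_sentences(["Hi.", "How can I help?", "Hi."], seen)
second = mask_repeated_sentences(["How can I help?", "Bye."], seen)
```

The first call keeps "Hi." and "How can I help?" and masks the repeated "Hi."; the second call masks "How can I help?" because it was seen in the first.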
if sent in UNIQUE_SENTS:
    sents[i] = '<MASK>'
else:
    UNIQUE_SENTS.add(sent)
same here, is the order preserved for a set?
No description provided.