dedup abcd before delex #7

Open
wants to merge 3 commits into
base: main

@wyshi (Owner) commented Apr 24, 2022

No description provided.

CWD = os.getcwd()

FILE = os.path.join(CWD, "abcd/data/abcd_v1.1.json")
SAVE_DIR = os.path.join(CWD, "data/abcd")
FILE = os.path.join(CWD, "../abcd/data/abcd_v1.1.json")
Owner Author

if you change the relative path like this, it will only work from your own working directory.
instead of doing this, let's build the path from the script's absolute location, like

import os
os.path.dirname(os.path.abspath(__file__))
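
A minimal sketch of the suggestion above, assuming the script sits one directory below the repo root and the data lives under `abcd/data/` at that root. The helper name `repo_path`, its `levels_up` parameter, and the example script location are hypothetical and not part of the PR:

```python
import os


def repo_path(script_file, *parts, levels_up=1):
    """Join `parts` onto a directory `levels_up` above the script's own directory.

    The result does not depend on the current working directory,
    only on where the script file itself lives.
    """
    base = os.path.dirname(os.path.abspath(script_file))
    for _ in range(levels_up):
        base = os.path.dirname(base)
    return os.path.join(base, *parts)


# In the real script you would pass __file__; a fixed, made-up path is used
# here only so the example is reproducible.
FILE = repo_path("/repo/scripts/preprocess.py", "abcd", "data", "abcd_v1.1.json")
SAVE_DIR = repo_path("/repo/scripts/preprocess.py", "data", "abcd")
```

Because the base directory comes from the script file rather than `os.getcwd()`, the same paths resolve no matter which directory the script is launched from.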


sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from utils import convert_abcd_line, MAP, EOS, decide_delex_level
from policy_functions import delex_line

unique_sents = set()
Owner Author

for global variables, use capitals: "UNIQUE_SENTS = set()"

@@ -332,6 +356,13 @@ def parse_args():
default=None,
help="sample n",
)
parser.add_argument(
Owner Author

use action="store_true" here, like

parser.add_argument(
    "--use_slow_tokenizer",
    action="store_true",
    help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
)


for i, sent in enumerate(sents):
    if sent in unique_sents:
        sents[i] = '<SENT_MASK>'
Owner Author

@wyshi Apr 24, 2022

use "<MASK>"

Owner Author

wyshi commented Apr 24, 2022

save the files under data/abcd_dedup/dedup_abcd_my_delex-* rather than under scripts/

@@ -4,15 +4,18 @@
import argparse
from tqdm import tqdm
import numpy as np
from nltk.tokenize import sent_tokenize

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
Owner Author

this might be a bit dangerous because it changes sys.path. instead of appending, we can just call os.path.join(os.abs_something, "abcd/data/abcd_v1.1.json")

Collaborator

I think if that line's not there then some of the imports won't work, since utils.py and policy_functions.py are in the directory above.
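
Both points can hold at once: the imports from the parent directory need that directory on `sys.path` (unless the project is installed as a package), while the data file path can be built without touching `sys.path` at all. One hedged way to limit the side effect is to add the directory only when it is missing. The helper name `ensure_on_sys_path` and the example directory are hypothetical:

```python
import os
import sys


def ensure_on_sys_path(directory):
    """Append `directory` to sys.path only if it is not already there."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.append(directory)
    return directory


# In the real script this would be the parent of the script's directory:
# os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
parent_dir = ensure_on_sys_path("/tmp/hypothetical_repo_root")
# Calling it again does not add a second entry.
ensure_on_sys_path("/tmp/hypothetical_repo_root")
```

This keeps the imports of `utils` and `policy_functions` working while avoiding duplicate entries if the module is loaded more than once.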

@@ -332,6 +356,11 @@ def parse_args():
default=None,
help="sample n",
)
parser.add_argument(
"--deduplicate",
action = "store_true",
Owner Author

it probably wouldn't change anything, but i'd add default=False, for readability
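
For reference, a small sketch showing that `action="store_true"` already defaults to `False`, so the explicit `default=False` only documents the intent. The help text and the `build_parser` wrapper are assumptions made for illustration:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--deduplicate",
        action="store_true",
        default=False,  # redundant with store_true, kept for readability
        help="If passed, mask sentences that have already been seen.",  # hypothetical help text
    )
    return parser


args_off = build_parser().parse_args([])
args_on = build_parser().parse_args(["--deduplicate"])
```

Here `args_off.deduplicate` is `False` and `args_on.deduplicate` is `True`.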

@@ -0,0 +1,64 @@
from nltk.tokenize import sent_tokenize

with open("../data/wikitext-2-raw/train.txt") as inp:
Owner Author

same here, careful when using relative path, because it depends on which dir you are running the file from


wiki_dedup = wiki(txt)

with open('./train.txt', 'w') as inp:
Owner Author

same here, relative path

if sent in unique_sents:
    sents[i] = '<MASK>'
else:
    unique_sents.add(sent)
Owner Author

is the order preserved for a set if you add to it like this?

If you are doing it this way, i don't think a set is necessary; you could just use a list?

Collaborator

I just use the set to check if the sentence has already been seen so the set order isn't really affecting anything.
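
To illustrate the point, here is a self-contained sketch of the pattern, not the PR's exact code. The function name `mask_repeated_sentences` and the sample sentences are made up. The sentence order comes from the input list; the set is used only for membership checks, which are constant time on average, whereas a list would need a linear scan:

```python
def mask_repeated_sentences(sents, seen, mask="<MASK>"):
    """Return a copy of `sents` where already-seen sentences are replaced by `mask`.

    `seen` is a set shared across calls; its internal ordering is never read,
    so the output order is exactly the order of `sents`.
    """
    out = []
    for sent in sents:
        if sent in seen:
            out.append(mask)
        else:
            seen.add(sent)
            out.append(sent)
    return out


seen = set()
first = mask_repeated_sentences(["hello there.", "how can i help?", "hello there."], seen)
second = mask_repeated_sentences(["how can i help?", "one moment."], seen)
# first  == ["hello there.", "how can i help?", "<MASK>"]
# second == ["<MASK>", "one moment."]
```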

if sent in UNIQUE_SENTS:
    sents[i] = '<MASK>'
else:
    UNIQUE_SENTS.add(sent)
Owner Author

same here, is the order preserved for a set?

2 participants