dedup abcd before delex #7

wyshi · 2022-04-24T22:29:24Z

No description provided.

wyshi · 2022-04-24T22:32:55Z

scripts/process_abcd.py

 CWD = os.getcwd()

-FILE = os.path.join(CWD, "abcd/data/abcd_v1.1.json")
-SAVE_DIR = os.path.join(CWD, "data/abcd")
+FILE = os.path.join(CWD, "../abcd/data/abcd_v1.1.json")


If you change the relative path like this, it will only work from the directory you happen to run it in.
Instead of doing this, let's get the script's absolute directory and build the path from that, like:

import os
os.path.dirname(os.path.abspath(__file__))
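
A minimal sketch of the suggestion above, assuming the data file sits one level above the scripts/ directory (as the "../abcd/..." path in the diff implies). The helper name `path_from_script_dir` is hypothetical and not part of the repository.

```python
import os


def path_from_script_dir(script_file, *parts):
    """Join `parts` onto the directory that contains `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    # normpath collapses ".." segments so the result is a clean absolute path
    return os.path.normpath(os.path.join(script_dir, *parts))


# Hypothetical usage inside scripts/process_abcd.py (layout is an assumption):
# FILE = path_from_script_dir(__file__, "..", "abcd", "data", "abcd_v1.1.json")
```

Because the path is anchored to the script file rather than to os.getcwd(), it resolves to the same location no matter which directory the script is launched from.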

wyshi · 2022-04-24T22:33:35Z

scripts/process_abcd.py


 sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 from utils import convert_abcd_line, MAP, EOS, decide_delex_level
 from policy_functions import delex_line

+unique_sents = set()


For global variables, use capitals: "UNIQUE_SENTS = set()"

wyshi · 2022-04-24T22:34:31Z

scripts/process_abcd.py

@@ -332,6 +356,13 @@ def parse_args():
        default=None,
        help="sample n",
    )
+    parser.add_argument(


Use action="store_true", for example:

parser.add_argument(
"--use_slow_tokenizer",
action="store_true",
help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
)

wyshi · 2022-04-24T22:34:51Z

scripts/process_abcd.py

+
+    for i, sent in enumerate(sents):
+        if sent in unique_sents:
+            sents[i] = '<SENT_MASK>'


use "<MASK>"

wyshi · 2022-04-24T22:37:38Z

save the files under data/abcd_dedup/dedup_abcd_my_delex-* rather than under scripts/

wyshi · 2022-04-25T17:14:03Z

scripts/process_abcd.py

@@ -4,15 +4,18 @@
 import argparse
 from tqdm import tqdm
 import numpy as np
+from nltk.tokenize import sent_tokenize

 sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))


This might be a bit dangerous because it changes sys.path. Instead of appending, we can just call os.path.join(os.abs_something, "abcd/data/abcd_v1.1.json").

I think some of the imports won't work if that line isn't there, since utils.py and policy_functions.py are in the directory above.

wyshi · 2022-04-25T17:14:36Z

scripts/process_abcd.py

@@ -332,6 +356,11 @@ def parse_args():
        default=None,
        help="sample n",
    )
+    parser.add_argument(
+        "--deduplicate",
+        action = "store_true",


it probably wouldn't change anything, but i'd add default=False, for readability
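
A small sketch of the point above: with action="store_true", argparse already defaults the flag to False, so an explicit default=False only documents the intent. The help text here is a placeholder, not taken from the PR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--deduplicate",
        action="store_true",
        default=False,  # redundant with store_true, kept for readability
        help="mask sentences that have already been seen",  # placeholder text
    )
    return parser


# Without the flag the value is False; passing the flag sets it to True.
args_off = build_parser().parse_args([])
args_on = build_parser().parse_args(["--deduplicate"])
print(args_off.deduplicate, args_on.deduplicate)
```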

wyshi · 2022-04-25T17:15:30Z

scripts/dedup_wiki.py

@@ -0,0 +1,64 @@
+from nltk.tokenize import sent_tokenize
+
+with open("../data/wikitext-2-raw/train.txt") as inp:


Same here: be careful when using a relative path, because it depends on which directory you run the file from.

wyshi · 2022-04-25T17:20:14Z

scripts/dedup_wiki.py

+
+wiki_dedup = wiki(txt)
+
+with open('./train.txt', 'w') as inp:


same here, relative path

wyshi · 2022-04-25T17:21:41Z

scripts/dedup_wiki.py

+            if sent in unique_sents:
+                sents[i] = '<MASK>'
+            else:
+                unique_sents.add(sent)


Is the order preserved for a set if you add to it like this?

If you are doing this, I don't think a set is necessary; you could just use a list.

I only use the set to check whether the sentence has already been seen, so the set's order doesn't affect anything.
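
To illustrate the reply above, here is a hedged sketch of the pattern (the function name `mask_repeated_sentences` is hypothetical). The sentence order comes from the list being iterated; the set is only used for membership tests, which are O(1) on average, whereas `in` on a list scans every element.

```python
def mask_repeated_sentences(sents, seen, mask="<MASK>"):
    """Return a copy of `sents` with already-seen sentences replaced by `mask`.

    `seen` is updated in place so it can be shared across documents.
    """
    out = []
    for sent in sents:
        if sent in seen:
            out.append(mask)  # repeated sentence: keep its position, hide its text
        else:
            seen.add(sent)
            out.append(sent)
    return out


seen = set()
first = mask_repeated_sentences(["Hello.", "How can I help?", "Hello."], seen)
second = mask_repeated_sentences(["How can I help?", "Thanks."], seen)
# first  is ["Hello.", "How can I help?", "<MASK>"]
# second is ["<MASK>", "Thanks."]
```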

wyshi · 2022-04-25T17:24:45Z

scripts/process_abcd.py

+        if sent in UNIQUE_SENTS:
+            sents[i] = '<MASK>'
+        else:
+            UNIQUE_SENTS.add(sent)


same here, is the order preserved for a set?

dedup abcd before delex

b3b92b5

dedup all delex

1865041

ryanshea10 assigned wyshi Apr 25, 2022

changed pathing

16934ca
