update

beyondguo · Nov 19, 2022 · 3f8e0cf
1 parent 77941c6

Showing 50 changed files with 58 additions and 219 deletions.
diff --git a/.gitignore b/.gitignore
@@ -9,6 +9,8 @@ __pycache__/
 .DS_Store
 .vscode
 augmentation_tools/__pycache__
+augmentation_clf/sta_saved_keywords
+augmentation_clf/__pycache__
 
 # wordvec weights
 weights/

diff --git a/README.md b/README.md
@@ -1,15 +1,31 @@
-# GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation
+# 💡GENIUS: Generating text using sketches as input 
 
-**基于草稿的生成模型**
+**基于草稿的文本生成模型**
 
-**GENIUS** is a powerful conditional text generation model using sketches as input, which can fill in the missing contexts for a given **sketch** (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS uses an encoder-decoder structure (based on the BART architecture) and is pre-trained on the `C4-realnewslike` corpus. 
+- Paper: [GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation](https://github.com/beyondguo/genius/blob/master/GENIUS_gby_arxiv.pdf)
 
-**GENIUS** can also be used as a **general textual data augmentation tool** for **various NLP tasks** (including sentiment analysis, topic classification, NER, and QA). 
+💡**GENIUS** is a powerful conditional text generation model using sketches as input, which can fill in the missing contexts for a given **sketch** (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS is pre-trained on a large-scale textual corpus with a novel *reconstruction from sketch* objective using an *extreme and selective masking* strategy, enabling it to generate diverse and high-quality texts given sketches.
 
+**Example 1:**
+
+- sketch: `__machine learning__my research interest__data science__`
+- BART: `The machine learning aspect of my research interest in data science.`
+- **GENIUS**: `I am a Ph.D. student in machine learning, and my research interest is in data science. I am interested in understanding how humans and machines interact and how we can improve the quality of life for people around the world.`
+
+**Example 2:**
+
+- sketch: `自然语言处理__谷歌__通用人工智能__`
+- BART: `自然语言处理是谷歌的通用人工智能技术`
+- **GENIUS**: `自然语言处理是谷歌在通用人工智能领域的一个重要研究方向，其目的是为了促进人类智能的发展。 `
+
+
+
+**GENIUS** can also be used as a general textual **data augmentation tool** for **various NLP tasks** (including sentiment analysis, topic classification, NER, and QA). 
+
+
+![image-20221119164544165](https://cdn.jsdelivr.net/gh/beyondguo/mdnice_pictures/typora/hi-genius.png)
 
-![genius-illustration](https://cdn.jsdelivr.net/gh/beyondguo/mdnice_pictures/typora/what-is-genius.png)
 
-- Paper: [genius: SkEtch-based Generative Augmentation (preprint)](https://github.com/beyondguo/SEGA/blob/master/SEGA_gby_preprint.pdf)
 
 - Models hosted in 🤗 Huggingface:
 
@@ -25,22 +41,29 @@
 
 <img src="https://cdn.jsdelivr.net/gh/beyondguo/mdnice_pictures/typora/sega-hf-api.jpg" width="50%" />
 
-**GENIUS** is able to write complete paragraphs given a sketch (or framework), which can be composed of:
-- keywords /key-phrases, like "––NLP––AI––computer––science––"
-- spans, like "Conference on Empirical Methods––submission of research papers––"
-- sentences, like "I really like machine learning––I work at Google since last year––"
-- or mixup~
+## Usage
+
+### What is a sketch?
+
+First, what is a **sketch**? As defined in our paper, a sketch is "key information consisting of textual spans, phrases, or words, concatenated by mask tokens". It's like a draft or framework when you begin to write an article. With the GENIUS model, you can input the key elements you want to mention in your writing, and the model generates coherent text based on your sketch.
 
+A sketch can be composed of:
 
-### How to use
-#### 1. If you want to generate sentences given a **sketch**
+- keywords /key-phrases, like `__NLP__AI__computer__science__`
+- spans, like `Conference on Empirical Methods__submission of research papers__`
+- sentences, like `I really like machine learning__I work at Google since last year__`
+- or a mix of the above!
+
+
+### How to use the model
+#### 1. If you already have a sketch in mind, and want to get a paragraph based on it...
 ```python
 from transformers import pipeline
 # 1. load the model with the huggingface `pipeline`
 genius = pipeline("text2text-generation", model='beyond/genius-large', device=0)
 # 2. provide a sketch (joined by <mask> tokens)
 sketch = "<mask> Conference on Empirical Methods <mask> submission of research papers <mask> Deep Learning <mask>"
-# 3. just do it!
+# 3. here we go!
 generated_text = genius(sketch, num_beams=3, do_sample=True, max_length=200)[0]['generated_text']
 print(generated_text)
 ```
@@ -49,15 +72,18 @@ Output:
 'The Conference on Empirical Methods welcomes the submission of research papers. Abstracts should be in the form of a paper or presentation. Please submit abstracts to the following email address: eemml.stanford.edu. The conference will be held at Stanford University on April 1618, 2019. The theme of the conference is Deep Learning.'
 ```
 
-#### 2. If you want to do **data augmentation** to generate new training samples
-Please check [genius/augmentation_tools](https://github.com/beyondguo/genius/tree/master/augmentation_tools), where we provide ready-to-run scripts for data augmentation for text classification/NER/MRC tasks.
+If you have many sketches, you can batch them into a Hugging Face `Dataset` object, which can be much faster.
+
+TODO: we are also building a Python package for more convenient use of GENIUS, which will be released in a few weeks.
 
+#### 2. If you have an NLP dataset (e.g. classification) and want to do data augmentation to enlarge your dataset...
 
+Please check [genius/augmentation_clf](https://github.com/beyondguo/genius/tree/master/augmentation_clf) and [genius/augmentation_ner_qa](https://github.com/beyondguo/genius/tree/master/augmentation_ner_qa), where we provide ready-to-run scripts for data augmentation for text classification/NER/MRC tasks.
 
 
----
 
-## GENIUS as A Strong Data Augmentation Tool:
+## Augmentation Experiments:
+Data augmentation is an important application of natural language generation (NLG) models, and it is also a valuable test of whether the generated text can be used in real applications.
 - Setting: Low-resource setting, where only n={50,100,200,500,1000} labeled samples are available for training. The results below are averaged over all training sizes.
 - Text Classification Datasets: [HuffPost](https://huggingface.co/datasets/khalidalt/HuffPost), [BBC](https://huggingface.co/datasets/SetFit/bbc-news), [SST2](https://huggingface.co/datasets/glue), [IMDB](https://huggingface.co/datasets/imdb), [Yahoo](https://huggingface.co/datasets/yahoo_answers_topics), [20NG](https://huggingface.co/datasets/newsgroup).
 - Base classifier: [DistilBERT](https://huggingface.co/distilbert-base-cased)
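
The sketch format described in the README above (key phrases joined by mask tokens) can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_sketch` is a hypothetical helper, and placing a `<mask>` before, between, and after the phrases simply mirrors the README example.

```python
def build_sketch(parts, mask_token="<mask>"):
    """Join key phrases into a GENIUS-style sketch string.

    `build_sketch` is a hypothetical helper for illustration only; the layout
    (a mask before, between, and after the phrases) mirrors the README example.
    """
    # Drop empty entries and strip whitespace so masks are never doubled.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    if not cleaned:
        # With no key phrases, the sketch degenerates to a single mask.
        return mask_token
    separator = f" {mask_token} "
    return f"{mask_token} {separator.join(cleaned)} {mask_token}"


sketch = build_sketch([
    "Conference on Empirical Methods",
    "submission of research papers",
    "Deep Learning",
])
print(sketch)
```

The resulting string can be passed to the `text2text-generation` pipeline in the same way as the hand-written sketch in the README snippet.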

diff --git a/ner_and_qa/easy_text_augmenter.py → _backup_scripts/easy_text_augmenter.py b/ner_and_qa/easy_text_augmenter.py → _backup_scripts/easy_text_augmenter.py
diff --git a/ner_and_qa/qa_aug_eda.py → _backup_scripts/qa_aug_eda.py b/ner_and_qa/qa_aug_eda.py → _backup_scripts/qa_aug_eda.py
diff --git a/ner_and_qa/qa_back_trans.py → _backup_scripts/qa_back_trans.py b/ner_and_qa/qa_back_trans.py → _backup_scripts/qa_back_trans.py
diff --git a/ner_and_qa/s2t_utils.py → _backup_scripts/s2t_utils.py b/ner_and_qa/s2t_utils.py → _backup_scripts/s2t_utils.py
diff --git a/augmentation_tools/README.md → augmentation_clf/README.md b/augmentation_tools/README.md → augmentation_clf/README.md
diff --git a/augmentation_tools/STA/.gitignore → augmentation_clf/STA/.gitignore b/augmentation_tools/STA/.gitignore → augmentation_clf/STA/.gitignore
diff --git a/augmentation_tools/STA/README.md → augmentation_clf/STA/README.md b/augmentation_tools/STA/README.md → augmentation_clf/STA/README.md
diff --git a/augmentation_tools/STA/__init__.py → augmentation_clf/STA/__init__.py b/augmentation_tools/STA/__init__.py → augmentation_clf/STA/__init__.py
diff --git a/augmentation_tools/STA/clf.py → augmentation_clf/STA/clf.py b/augmentation_tools/STA/clf.py → augmentation_clf/STA/clf.py
diff --git a/augmentation_tools/STA/demo.ipynb → augmentation_clf/STA/demo.ipynb b/augmentation_tools/STA/demo.ipynb → augmentation_clf/STA/demo.ipynb
diff --git a/...tion_tools/STA/extract-global-keywords.py → ...tation_clf/STA/extract-global-keywords.py b/...tion_tools/STA/extract-global-keywords.py → ...tation_clf/STA/extract-global-keywords.py
diff --git a/augmentation_tools/STA/keywords_extractor.py → augmentation_clf/STA/keywords_extractor.py b/augmentation_tools/STA/keywords_extractor.py → augmentation_clf/STA/keywords_extractor.py
diff --git a/augmentation_tools/STA/my_dataset.py → augmentation_clf/STA/my_dataset.py b/augmentation_tools/STA/my_dataset.py → augmentation_clf/STA/my_dataset.py
diff --git a/augmentation_tools/STA/run_aug.sh → augmentation_clf/STA/run_aug.sh b/augmentation_tools/STA/run_aug.sh → augmentation_clf/STA/run_aug.sh
diff --git a/augmentation_tools/STA/run_clf.sh → augmentation_clf/STA/run_clf.sh b/augmentation_tools/STA/run_clf.sh → augmentation_clf/STA/run_clf.sh
diff --git a/augmentation_tools/STA/run_eda.py → augmentation_clf/STA/run_eda.py b/augmentation_tools/STA/run_eda.py → augmentation_clf/STA/run_eda.py
diff --git a/augmentation_tools/STA/run_sta.py → augmentation_clf/STA/run_sta.py b/augmentation_tools/STA/run_sta.py → augmentation_clf/STA/run_sta.py
diff --git a/...tion_tools/STA/stopwords/en_stopwords.txt → ...tation_clf/STA/stopwords/en_stopwords.txt b/...tion_tools/STA/stopwords/en_stopwords.txt → ...tation_clf/STA/stopwords/en_stopwords.txt
diff --git a/...tion_tools/STA/stopwords/zh_stopwords.txt → ...tation_clf/STA/stopwords/zh_stopwords.txt b/...tion_tools/STA/stopwords/zh_stopwords.txt → ...tation_clf/STA/stopwords/zh_stopwords.txt
diff --git a/augmentation_tools/STA/text_augmenter.py → augmentation_clf/STA/text_augmenter.py b/augmentation_tools/STA/text_augmenter.py → augmentation_clf/STA/text_augmenter.py
diff --git a/augmentation_tools/STA/utils.py → augmentation_clf/STA/utils.py b/augmentation_tools/STA/utils.py → augmentation_clf/STA/utils.py
diff --git a/augmentation_tools/aug_filter_clf.py → augmentation_clf/aug_filter_clf.py b/augmentation_tools/aug_filter_clf.py → augmentation_clf/aug_filter_clf.py
diff --git a/augmentation_tools/backtrans_clf.py → augmentation_clf/backtrans_clf.py b/augmentation_tools/backtrans_clf.py → augmentation_clf/backtrans_clf.py
diff --git a/augmentation_tools/conditional_clm_clf.py → augmentation_clf/conditional_clm_clf.py b/augmentation_tools/conditional_clm_clf.py → augmentation_clf/conditional_clm_clf.py
diff --git a/...ntation_tools/conditional_clm_finetune.py → augmentation_clf/conditional_clm_finetune.py b/...ntation_tools/conditional_clm_finetune.py → augmentation_clf/conditional_clm_finetune.py
diff --git a/augmentation_tools/conditional_mlm_clf.py → augmentation_clf/conditional_mlm_clf.py b/augmentation_tools/conditional_mlm_clf.py → augmentation_clf/conditional_mlm_clf.py
diff --git a/...ntation_tools/conditional_mlm_finetune.py → augmentation_clf/conditional_mlm_finetune.py b/...ntation_tools/conditional_mlm_finetune.py → augmentation_clf/conditional_mlm_finetune.py
diff --git a/augmentation_tools/eda_clf.py → augmentation_clf/eda_clf.py b/augmentation_tools/eda_clf.py → augmentation_clf/eda_clf.py
diff --git a/augmentation_tools/label_desc.py → augmentation_clf/label_desc.py b/augmentation_tools/label_desc.py → augmentation_clf/label_desc.py
diff --git a/augmentation_tools/mlm_clf.py → augmentation_clf/mlm_clf.py b/augmentation_tools/mlm_clf.py → augmentation_clf/mlm_clf.py
diff --git a/augmentation_tools/run_aug.sh → augmentation_clf/run_aug.sh b/augmentation_tools/run_aug.sh → augmentation_clf/run_aug.sh
diff --git a/augmentation_tools/run_filter.sh → augmentation_clf/run_filter.sh b/augmentation_tools/run_filter.sh → augmentation_clf/run_filter.sh
diff --git a/augmentation_tools/sega_clf.py → augmentation_clf/sega_clf.py b/augmentation_tools/sega_clf.py → augmentation_clf/sega_clf.py
diff --git a/augmentation_tools/sega_finetune.py → augmentation_clf/sega_finetune.py b/augmentation_tools/sega_finetune.py → augmentation_clf/sega_finetune.py
diff --git a/augmentation_tools/sega_mixup_clf.py → augmentation_clf/sega_mixup_clf.py b/augmentation_tools/sega_mixup_clf.py → augmentation_clf/sega_mixup_clf.py
diff --git a/augmentation_tools/sega_yahoo.py → augmentation_clf/sega_yahoo.py b/augmentation_tools/sega_yahoo.py → augmentation_clf/sega_yahoo.py
diff --git a/augmentation_tools/sta_clf.py → augmentation_clf/sta_clf.py b/augmentation_tools/sta_clf.py → augmentation_clf/sta_clf.py
diff --git a/augmentation_tools/sta_extract_kws.py → augmentation_clf/sta_extract_kws.py b/augmentation_tools/sta_extract_kws.py → augmentation_clf/sta_extract_kws.py
diff --git a/ner_and_qa/filter_qa_aug.py → augmentation_ner_qa/filter_qa_aug.py b/ner_and_qa/filter_qa_aug.py → augmentation_ner_qa/filter_qa_aug.py
@@ -62,7 +62,7 @@
 print('>>>After filtering: ',len(filtered_augmented_dataset['context']))
 
 
-# 保存
+# save
 df_filter = pd.DataFrame(filtered_augmented_dataset)
 print(len(df_filter))
 df_filter.to_pickle(f'qa_data/squad_first{N_TRAIN}_aug{N_AUG}_v{v}_filtered.pkl')

diff --git a/ner_and_qa/genius_ner_aug.py → augmentation_ner_qa/genius_ner_aug.py b/ner_and_qa/genius_ner_aug.py → augmentation_ner_qa/genius_ner_aug.py
diff --git a/ner_and_qa/genius_qa_aug.py → augmentation_ner_qa/genius_qa_aug.py b/ner_and_qa/genius_qa_aug.py → augmentation_ner_qa/genius_qa_aug.py
@@ -20,7 +20,7 @@
 
 # genius model
 device = int(args.device)
-model = 'beyond/genius-base'
+model = 'beyond/genius-large'
 genius = pipeline("text2text-generation", model=model, device=device)
 
 
@@ -115,7 +115,7 @@ def get_topk(s,max_k=8):
             # m_pre_context = mask_unimportant_parts(pre_context, kws)
             _, kws = sketch_extractor.get_kws(pre_context, aspect_keywords=[question], top=get_topk(pre_context))
             m_pre_context = sketch_extractor.get_sketch_from_kws(pre_context, kws)
-        else: # 没有上文，补一个mask
+        else: # no pre-context, add a mask
             m_pre_context = '<mask> '
 
         # find and mask post-context
@@ -125,7 +125,7 @@ def get_topk(s,max_k=8):
             # m_post_context = mask_unimportant_parts(post_context, kws)
             _, kws = sketch_extractor.get_kws(post_context, aspect_keywords=[question], top=get_topk(post_context))
             m_post_context = sketch_extractor.get_sketch_from_kws(post_context, kws)
-        else: # 没有下文，补一个mask
+        else: # no post-context, add a mask
             m_post_context = ' <mask>'
 
         # concatenate into a new context, and determine the new answer start
@@ -149,8 +149,8 @@ def get_topk(s,max_k=8):
 print('** Working Hard to Augment Your Dataset......')
 m_dataset = List2Dataset(m_contexts)
 generated_contexts = []
-for _ in range(N_AUG): # 增强多次
-    for out in tqdm(genius(m_dataset, num_beams=3, do_sample=True, max_length=200, length_penalty=2, batch_size=32,repetition_penalty=2.)): # 原来200, no repetition_penalty
+for _ in range(N_AUG): 
+    for out in tqdm(genius(m_dataset, num_beams=3, do_sample=True, max_length=200, length_penalty=2, batch_size=32,repetition_penalty=2.)): 
         generated_text = out[0]['generated_text']
         generated_contexts.append(generated_text)
 
@@ -161,19 +161,20 @@ def get_topk(s,max_k=8):
     try:
         a_s_idx = c.index(a_s) # index of the answer sentence
     except Exception as e:
-        # 一个严重的问题，原始的句子不一定会原封不动地输出，可能会有些微小变化
-        # 这样原来的answer sent就不一定找得到了，最好能用近似匹配，即重合率高于某阈值即可
+        # GENIUS may slightly alter the original sentence in its output,
+        # so an exact match on the answer sentence can fail;
+        # in that case we fall back to an overlap ratio to find the right sentence
         sents = sent_tokenize(c)
         for s in sents:
             words = word_tokenize(s)
             orig_words = word_tokenize(a_s)
             n = len([w for w in words if w in orig_words])
-            # 重合率达到0.6，且answer也在该句子中，说明这个句子就对应原始答案句
+            # if overlap > 0.6 and the answer also appears in the sentence, treat it as the answer sentence
             if n/len(words) > 0.6 and a in s: 
                 a_s = s
                 a_s_idx = c.index(a_s)
                 break
-    if a_s_idx > -1: # 确认找到了答案句子
+    if a_s_idx > -1: # the answer sentence was found
         start = a_s_idx + a_s.index(a)
         assert c[start:start+len(a)] == a, '%s Answer Position Mismatch!'%i
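
The fallback in the hunk above (exact match first, then an overlap ratio over candidate sentences) can be illustrated in isolation. This is a simplified sketch under stated assumptions, not the script's actual code: `find_answer_start` is a hypothetical name, and a regex sentence split plus whitespace tokenization stand in for the NLTK `sent_tokenize` and `word_tokenize` calls used in the script.

```python
import re


def find_answer_start(context, answer_sentence, answer, threshold=0.6):
    """Return the character offset of `answer` in `context`, or -1 if not found.

    Hypothetical helper: tries an exact match on the answer sentence, then
    falls back to the word-overlap heuristic described in the diff comments.
    """
    idx = context.find(answer_sentence)
    if idx == -1:
        orig_words = set(answer_sentence.split())
        # Naive sentence split (assumption; the script uses NLTK instead).
        for sent in re.split(r"(?<=[.!?])\s+", context):
            words = sent.split()
            if not words:
                continue
            # Fraction of this sentence's words that occur in the original.
            overlap = sum(w in orig_words for w in words) / len(words)
            if overlap > threshold and answer in sent:
                answer_sentence = sent
                idx = context.find(sent)
                break
    if idx == -1 or answer not in answer_sentence:
        return -1
    return idx + answer_sentence.index(answer)


context = "The weather was nice. Paris is the lovely capital of France. We left early."
start = find_answer_start(context, "Paris is the capital of France.", "Paris")
print(start, context[start:start + len("Paris")])
```

Returning -1 when nothing matches lets the caller skip the sample instead of raising, which corresponds to the `a_s_idx > -1` check in the script.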
 

diff --git a/ner_and_qa/run_ner.py → augmentation_ner_qa/run_ner.py b/ner_and_qa/run_ner.py → augmentation_ner_qa/run_ner.py
diff --git a/ner_and_qa/run_qa.py → augmentation_ner_qa/run_qa.py b/ner_and_qa/run_qa.py → augmentation_ner_qa/run_qa.py
diff --git a/ner_and_qa/run_qa.sh → augmentation_ner_qa/run_qa.sh b/ner_and_qa/run_qa.sh → augmentation_ner_qa/run_qa.sh
diff --git a/ner_and_qa/trainer_qa.py → augmentation_ner_qa/trainer_qa.py b/ner_and_qa/trainer_qa.py → augmentation_ner_qa/trainer_qa.py
diff --git a/ner_and_qa/utils_qa.py → augmentation_ner_qa/utils_qa.py b/ner_and_qa/utils_qa.py → augmentation_ner_qa/utils_qa.py
diff --git a/ner_and_qa/ner_filter.py b/ner_and_qa/ner_filter.py
diff --git a/yake/yake.py b/yake/yake.py
@@ -1,6 +1,6 @@
 # -*- coding: utf-8 -*-
 
-"""Main module."""
+"""Main module.""" 
 
 import string
 import os