diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 46276871a..9e66acde6 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -10,7 +10,7 @@
### Checklist:
-- [ ] My code follows the code style of this project, as detailed in our [contribution guidelines](../CONTRIBUTING.md).
+- [ ] My code follows the code style of this project, as detailed in our [contribution guidelines](https://github.com/microsoft/nlp-recipes/blob/master/CONTRIBUTING.md).
- [ ] I have added tests.
- [ ] I have updated the documentation accordingly.
diff --git a/DatasetReferences.md b/DatasetReferences.md
index 97b7176b3..741417088 100644
--- a/DatasetReferences.md
+++ b/DatasetReferences.md
@@ -2,6 +2,12 @@ MICROSOFT PROVIDES THE DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTI
The datasets are provided under the original terms that Microsoft received such datasets. See below for more information about each dataset.
+### CNN/Daily Mail (CNN/DM) Dataset
+The training and evaluation data for the CNN/DM dataset are available at https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz and are released under the MIT License. This is a processed version of the data originally released by Hermann et al. (2015) in ["Teaching machines to read and comprehend"](https://arxiv.org/abs/1506.03340) and subsequently made available by Kyunghyun Cho at https://cs.nyu.edu/~kcho/DMQA/.
+
+### Preprocessed CNN/Daily Mail (CNN/DM) Dataset by BERTSUM
+The preprocessed version of the [CNN/DM dataset](#cnndm), originally published with the BERTSUM paper ["Fine-tune BERT for Extractive Summarization"](https://arxiv.org/pdf/1903.10318.pdf), can be found at https://github.com/nlpyang/BertSum and is released under the Apache License 2.0.
+
### Microsoft Research Paraphrase Corpus
>Original source: https://www.microsoft.com/en-us/download/details.aspx?id=52398
diff --git a/NOTICE.txt b/NOTICE.txt
index 76d0d9bec..1300655f4 100644
--- a/NOTICE.txt
+++ b/NOTICE.txt
@@ -665,4 +665,208 @@ https://github.com/allenai/bi-att-flow
See the License for the specific language governing permissions and
limitations under the License.
+--
+
+https://github.com/nlpyang/BertSum
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/README.md b/README.md
index 9d7fc7670..7c2944e3b 100755
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
-# NLP Best Practices
+# NLP Best Practices
In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. In the last few years, researchers have been applying newer deep learning methods to NLP. Data scientists started moving from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms which use language models pretrained on large text corpora.
@@ -16,13 +16,13 @@ We hope that the tools can significantly reduce the “time to market” by simp
In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and allow handling different tasks and languages easily. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks like [*GLUE*](https://gluebenchmark.com/leaderboard) and [*SQuAD*](https://rajpurkar.github.io/SQuAD-explorer/) leaderboards. The models can be used in a number of applications ranging from simple text classification to sophisticated intelligent chat bots.
-Note that for certain kind of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise. We strongly recommend evaluating if these can sufficiently solve your problem. If these solutions are not applicable, or the accuracy of these solutions is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:
+Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist which do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If these solutions are not applicable or their accuracy is not sufficient, resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions to address common NLP tasks:
[Text Analytics](https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/) are a set of pre-trained REST APIs which can be called for Sentiment Analysis, Key phrase extraction, Language detection and Named Entity Detection and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities.
[QnA Maker](https://azure.microsoft.com/en-us/services/cognitive-services/qna-maker/) is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.
[Language Understanding](https://azure.microsoft.com/en-us/services/cognitive-services/language-understanding-intelligent-service/) is a SaaS service to train and deploy a model as a REST API given a user-provided training set. You could do Intent Classification as well as Named Entity Extraction by performing simple steps of providing example utterances and labelling them. It supports Active Learning, so your model always keeps learning and improving.
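+
+As an illustration, the following is a minimal sketch of calling the Text Analytics sentiment endpoint over REST. The resource endpoint, key, and API version below are placeholders; check your Azure resource for the exact values:
+
+```python
+import requests
+
+# Placeholders: substitute your own Azure resource endpoint and subscription key.
+endpoint = "https://<your-resource>.cognitiveservices.azure.com"
+key = "<your-subscription-key>"
+
+# Score the sentiment of a single English document.
+documents = {"documents": [{"id": "1", "language": "en", "text": "The new model works remarkably well."}]}
+response = requests.post(
+    endpoint + "/text/analytics/v3.0/sentiment",
+    headers={"Ocp-Apim-Subscription-Key": key},
+    json=documents,
+)
+print(response.json())  # per-document sentiment labels with confidence scores
+```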
## Target Audience
-For this repository our target audience includes data scientists and machine learning engineers with varying levels of NLP knowledge as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.
+For this repository, our target audience includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.
## Focus areas
The repository aims to expand NLP capabilities along three separate dimensions
@@ -33,13 +33,13 @@ We aim to have end-to-end examples of common tasks and scenarios such as text cl
### Algorithms
We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the [transformers package](https://github.com/huggingface/transformers) from [Hugging Face](https://huggingface.co/) which allows users to easily load pretrained models and fine-tune them for different tasks.
-### Languages
+### Languages
We strongly subscribe to the multi-language principles laid down by ["Emily Bender"](http://faculty.washington.edu/ebender/papers/Bender-SDSS-2019.pdf)
* "Natural language is not a synonym for English"
-* "English isn't generic for language, despite what NLP papers might lead you to believe"
+* "English isn't generic for language, despite what NLP papers might lead you to believe"
* "Always name the language you are working on" ([Bender rule](https://www.aclweb.org/anthology/Q18-1041/))
-The repository aims to support non-English languages across all the scenarios. Pre-trianed models used in the repository such as BERT, FastText support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
+The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and FastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
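+
+For example, a minimal sketch of loading one of these multilingual checkpoints with the [transformers package](https://github.com/huggingface/transformers) (the model name below is just one of several multilingual options):
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+# Multilingual BERT shares a single vocabulary across 100+ languages.
+tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
+model = AutoModel.from_pretrained("bert-base-multilingual-cased")
+
+# The same pipeline used for English also tokenizes, e.g., Hindi text.
+print(tokenizer.tokenize("प्राकृतिक भाषा प्रसंस्करण"))
+```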
@@ -50,6 +50,7 @@ The following is a summary of the commonly used NLP scenarios covered in the rep
|-------------------------| ------------------- |-------|---|
|Text Classification |BERT, XLNet, RoBERTa| Text classification is a supervised learning method of learning and predicting the category or the class of a document given its text content. |English, Hindi, Arabic|
|Named Entity Recognition |BERT| Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. |English|
+|Text Summarization|BERTSum|Text summarization is a language generation task of summarizing the input text into a shorter paragraph of text (see the ROUGE evaluation sketch below the table).|English|
|Entailment |BERT, XLNet, RoBERTa| Textual entailment is the task of classifying the binary relation between two natural-language texts, *text* and *hypothesis*, to determine if the *text* agrees with the *hypothesis* or not. |English|
|Question Answering |BiDAF, BERT, XLNet| Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. |English|
|Sentence Similarity |BERT, GenSen| Sentence similarity is the process of computing a similarity score given a pair of text documents. |English|
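+
+The Text Summarization scenario is evaluated with ROUGE; a minimal usage sketch of the `compute_rouge_python` utility from `utils_nlp.eval` (see the summarization evaluation example notebook for details):
+
+```python
+from utils_nlp.eval import compute_rouge_python
+
+candidates = ["The stock market is doing well this year."]
+references = ["The stock market is doing really well in 2019."]
+
+# Each metric reports recall ('r'), precision ('p'), and F1 ('f').
+scores = compute_rouge_python(cand=candidates, ref=references)
+print(scores["rouge-1"], scores["rouge-2"], scores["rouge-l"])
+```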
diff --git a/examples/README.md b/examples/README.md
index 06d837fa4..f79a7aa0f 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -6,6 +6,7 @@ This folder contains examples and best practices, written in Jupyter notebooks,
|---| ------------------------ | ------------------- |---|
|[Text Classification](text_classification)|Topic Classification|BERT, XLNet, RoBERTa, DistilBERT|en, hi, ar|
|[Named Entity Recognition](named_entity_recognition) |Wikipedia NER|BERT|en|
+|[Text Summarization](text_summarization)|News Summarization, Headline Generation|Extractive: BERTSumExt<br>Abstractive: WIP, ETA: Mar. 2020|en|
|[Entailment](entailment)|MultiNLI Natural Language Inference|BERT|en|
|[Question Answering](question_answering) |SQuAD|BiDAF, BERT, XLNet, DistilBERT|en|
|[Sentence Similarity](sentence_similarity)|STS Benchmark|BERT, GenSen|en|
diff --git a/examples/named_entity_recognition/ner_wikigold_transformer.ipynb b/examples/named_entity_recognition/ner_wikigold_transformer.ipynb
index 8bbc82a7c..f077f8d62 100644
--- a/examples/named_entity_recognition/ner_wikigold_transformer.ipynb
+++ b/examples/named_entity_recognition/ner_wikigold_transformer.ipynb
@@ -233,7 +233,7 @@
"source": [
"with Timer() as t:\n",
" preds = model.predict(\n",
- " eval_dataloader=test_dataloader,\n",
+ " test_dataloader=test_dataloader,\n",
" num_gpus=None,\n",
" verbose=True\n",
" )\n",
diff --git a/examples/text_classification/tc_mnli_transformers.ipynb b/examples/text_classification/tc_mnli_transformers.ipynb
index 952f2bafa..bfbd91ffe 100644
--- a/examples/text_classification/tc_mnli_transformers.ipynb
+++ b/examples/text_classification/tc_mnli_transformers.ipynb
@@ -32,6 +32,7 @@
"from sklearn.preprocessing import LabelEncoder\n",
"from tqdm import tqdm\n",
"from utils_nlp.common.timer import Timer\n",
+ "from utils_nlp.common.pytorch_utils import dataloader_from_dataset\n",
"from utils_nlp.dataset.multinli import load_pandas_df\n",
"from utils_nlp.models.transformers.sequence_classification import (\n",
" Processor, SequenceClassifier)"
@@ -93,7 +94,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "100%|██████████| 222k/222k [01:25<00:00, 2.60kKB/s] \n"
+ "100%|██████████| 222k/222k [01:20<00:00, 2.74kKB/s] \n"
]
}
],
@@ -196,7 +197,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "/media/bleik2/miniconda3/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
+ "/media/bleik2/backup/.conda/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
" FutureWarning)\n"
]
}
@@ -232,11 +233,11 @@
{
"data": {
"text/plain": [
- "telephone 1055\n",
- "slate 1003\n",
- "travel 961\n",
- "fiction 952\n",
- "government 938\n",
+ "telephone 1043\n",
+ "slate 989\n",
+ "fiction 968\n",
+ "travel 964\n",
+ "government 945\n",
"Name: genre, dtype: int64"
]
},
@@ -385,32 +386,108 @@
+    "  model_name\n",
+    "0  bert-base-uncased\n",
+    "1  distilbert-base-uncased"
journalists at bild and paris match are `` very confident `` the video clip is real , an editor says .andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .'" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_dataset[0]['tgt_txt']" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'marseille , france ( cnn ) the french prosecutor leading an investigation into the crash of germanwings flight 9525 insisted wednesday that he was not aware of any video footage from on board the plane .marseille prosecutor brice robin told cnn that `` so far no videos were used in the crash investigation . ``the two publications described the supposed video , but did not post it on their websites .'" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "prediction[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['marseille , france ( cnn ) the french prosecutor leading an investigation into the crash of germanwings flight 9525 insisted wednesday that he was not aware of any video footage from on board the plane .',\n", + " 'marseille prosecutor brice robin told cnn that `` so far no videos were used in the crash investigation . ``',\n", + " 'he added , `` a person who has such a video needs to immediately give it to the investigators . ``',\n", + " \"robin 's comments follow claims by two magazines , german daily bild and french paris match , of a cell phone video showing the harrowing final seconds from on board germanwings flight 9525 as it crashed into the french alps .\",\n", + " 'all 150 on board were killed .',\n", + " 'paris match and bild reported that the video was recovered from a phone at the wreckage site .',\n", + " 'the two publications described the supposed video , but did not post it on their websites .',\n", + " 'the publications said that they watched the video , which was found by a source close to the investigation . ``',\n", + " \"one can hear cries of ` my god ' in several languages , `` paris match reported . ``\",\n", + " 'metallic banging can also be heard more than three times , perhaps of the pilot trying to open the cockpit door with a heavy object .',\n", + " 'towards the end , after a heavy shake , stronger than the others , the screaming intensifies .',\n", + " '`` it is a very disturbing scene , `` said julian reichelt , editor-in-chief of bild online .',\n", + " \"an official with france 's accident investigation agency , the bea , said the agency is not aware of any such video .\",\n", + " 'lt. col. jean-marc menichini , a french gendarmerie spokesman in charge of communications on rescue efforts around the germanwings crash site , told cnn that the reports were `` completely wrong `` and `` unwarranted . ``',\n", + " \"cell phones have been collected at the site , he said , but that they `` had n't been exploited yet . 
``\",\n", + " 'menichini said he believed the cell phones would need to be sent to the criminal research institute in rosny sous-bois , near paris , in order to be analyzed by specialized technicians working hand-in-hand with investigators .',\n", + " 'but none of the cell phones found so far have been sent to the institute , menichini said .',\n", + " 'asked whether staff involved in the search could have leaked a memory card to the media , menichini answered with a categorical `` no . ``',\n", + " 'reichelt told `` erin burnett : outfront `` that he had watched the video and stood by the report , saying bild and paris match are `` very confident `` that the clip is real .',\n", + " \"he noted that investigators only revealed they 'd recovered cell phones from the crash site after bild and paris match published their reports . ``\",\n", + " 'that is something we did not know before .',\n", + " \"... overall we can say many things of the investigation were n't revealed by the investigation at the beginning , `` he said .\",\n", + " 'what was mental state of germanwings co-pilot ?',\n", + " \"german airline lufthansa confirmed tuesday that co-pilot andreas lubitz had battled depression years before he took the controls of germanwings flight 9525 , which he 's accused of deliberately crashing last week in the french alps .\",\n", + " 'lubitz told his lufthansa flight training school in 2009 that he had a `` previous episode of severe depression , `` the airline said tuesday .',\n", + " 'email correspondence between lubitz and the school discovered in an internal investigation , lufthansa said , included medical documents he submitted in connection with resuming his flight training .',\n", + " \"the announcement indicates that lufthansa , the parent company of germanwings , knew of lubitz 's battle with depression , allowed him to continue training and ultimately put him in the cockpit .\",\n", + " 'lufthansa , whose ceo carsten spohr previously said lubitz was 100 % fit to fly , described its statement tuesday as a `` swift and seamless clarification `` and said it was sharing the information and documents -- including training and medical records -- with public prosecutors .',\n", + " 'spohr traveled to the crash site wednesday , where recovery teams have been working for the past week to recover human remains and plane debris scattered across a steep mountainside .',\n", + " 'he saw the crisis center set up in seyne-les-alpes , laid a wreath in the village of le vernet , closer to the crash site , where grieving families have left flowers at a simple stone memorial .',\n", + " 'menichini told cnn late tuesday that no visible human remains were left at the site but recovery teams would keep searching .',\n", + " 'french president francois hollande , speaking tuesday , said that it should be possible to identify all the victims using dna analysis by the end of the week , sooner than authorities had previously suggested .',\n", + " \"in the meantime , the recovery of the victims ' personal belongings will start wednesday , menichini said .\",\n", + " 'among those personal belongings could be more cell phones belonging to the 144 passengers and six crew on board .',\n", + " 'check out the latest from our correspondents .',\n", + " \"the details about lubitz 's correspondence with the flight school during his training were among several developments as investigators continued to delve into what caused the crash and lubitz 's possible motive for downing the jet .\",\n", + " 'a lufthansa 
spokesperson told cnn on tuesday that lubitz had a valid medical certificate , had passed all his examinations and `` held all the licenses required . ``',\n", + " \"earlier , a spokesman for the prosecutor 's office in dusseldorf , christoph kumpa , said medical records reveal lubitz suffered from suicidal tendencies at some point before his aviation career and underwent psychotherapy before he got his pilot 's license .\",\n", + " \"kumpa emphasized there 's no evidence suggesting lubitz was suicidal or acting aggressively before the crash .\",\n", + " \"investigators are looking into whether lubitz feared his medical condition would cause him to lose his pilot 's license , a european government official briefed on the investigation told cnn on tuesday .\",\n", + " \"while flying was `` a big part of his life , `` the source said , it 's only one theory being considered .\",\n", + " 'another source , a law enforcement official briefed on the investigation , also told cnn that authorities believe the primary motive for lubitz to bring down the plane was that he feared he would not be allowed to fly because of his medical problems .',\n", + " \"lubitz 's girlfriend told investigators he had seen an eye doctor and a neuropsychologist , both of whom deemed him unfit to work recently and concluded he had psychological issues , the european government official said .\",\n", + " \"but no matter what details emerge about his previous mental health struggles , there 's more to the story , said brian russell , a forensic psychologist . ``\",\n", + " \"psychology can explain why somebody would turn rage inward on themselves about the fact that maybe they were n't going to keep doing their job and they 're upset about that and so they 're suicidal , `` he said . ``\",\n", + " \"but there is no mental illness that explains why somebody then feels entitled to also take that rage and turn it outward on 149 other people who had nothing to do with the person 's problems . 
``\",\n", + " 'germanwings crash compensation : what we know .',\n", + " 'who was the captain of germanwings flight 9525 ?',\n", + " \"cnn 's margot haddad reported from marseille and pamela brown from dusseldorf , while laura smith-spark wrote from london .\",\n", + " \"cnn 's frederik pleitgen , pamela boykoff , antonia mortensen , sandrine amiel and anna-maja rappard contributed to this report .\"]" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_dataset[0]['src_txt']" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "application/scrapbook.scrap.json+json": { + "data": 0.09525, + "encoder": "json", + "name": "rouge_2_f_score", + "version": 1 + } + }, + "metadata": { + "scrapbook": { + "data": true, + "display": false, + "name": "rouge_2_f_score" + } + }, + "output_type": "display_data" + } + ], + "source": [ + "# for testing\n", + "sb.glue(\"rouge_2_f_score\", rouge_score['rouge_2_f_score'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Clean up temporary folders" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if os.path.exists(DATA_PATH):\n", + " shutil.rmtree(DATA_PATH, ignore_errors=True)\n", + "if os.path.exists(PROCESSED_DATA_PATH):\n", + " shutil.rmtree(PROCESSED_DATA_PATH, ignore_errors=True)\n", + "if os.path.exists(CACHE_DIR):\n", + " shutil.rmtree(CACHE_DIR, ignore_errors=True)\n", + "if os.path.exists(RESULT_DIR):\n", + " shutil.rmtree(RESULT_DIR, ignore_errors=True)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3.6 cm3", + "language": "python", + "name": "cm3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/text_summarization/summarization_evaluation.ipynb b/examples/text_summarization/summarization_evaluation.ipynb new file mode 100644 index 000000000..cca929807 --- /dev/null +++ b/examples/text_summarization/summarization_evaluation.ipynb @@ -0,0 +1,394 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation. \n", + "Licensed under the MIT License. \n", + "\n", + "# Summarization Evaluation\n", + "This notebook explains the metrics commonly used to evaluate text summarization results and how to use the evaluation utilities provided in the repo. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ROUGE\n", + "Recall-Oriented Understudy for Gisting Evaluation(ROUGE) is a set of metrics for evaluating automatic text summarization and machine translation results. The metrics compare machine-generated summaries or translations against one or multiple reference summaries or translations created by human. \n", + "Commonly used ROUGE metrics are ROUGE-1, ROUGE-2, and ROUGE-L\n", + "* ROUGE-1: Overlap of unigrams (single words) between machine-generated and reference summaries. \n", + "* ROUGE-2: Overlap of bigrams (two adjacent words) between machine-generated and reference summaries.\n", + "* ROUGE-L: Longest Common Subsequence (LCS), which doesn't require consecutive matches but in-sequence matches that refect sentence-level structure similarity. 
\n", + "\n", + "For each metric, recall, precision, and F1 score are computed. \n", + "\n", + "**Utilities for computing ROUGE**\n", + "* `compute_rouge_perl`: The [pyrouge](https://github.com/bheinzerling/pyrouge) package based on the ROUGE package written in perl is the most popular package for computing ROUGE scores. We provide the `compute_rouge_perl` function based on pyrouge. This function supports English only. \n", + "* `compute_rouge_python`: The [py-rouge](https://pypi.org/project/py-rouge/) package is a Python implementation of the ROUGE metric which produces almost the same results as the perl implemenation. Since it's easier to install than pyrouge and can be extended to other languages, we provide the `compute_rouge_python` function based on py-rouge. Currently, English and Hindi are supported. Supports for other languages will be added on an as-needed basis." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "\n", + "nlp_path = os.path.abspath('../../')\n", + "if nlp_path not in sys.path:\n", + " sys.path.insert(0, nlp_path)\n", + " \n", + "from utils_nlp.eval import compute_rouge_perl, compute_rouge_python" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sample inputs\n", + "Both `compute_rouge_perl` and `compute_rouge_python` takes lists of candidate summaries and reference summaries as inputs. Alternatively, you can also provide paths to files containing the candidates and references and set the `is_input_files` argument to `True`. " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "summary_candidates = [\"The stock market is doing well this year.\", \"The movie is very popular.\"]\n", + "summary_references = [\"The stock market is doing really well in 2019.\", \"The movie is very popular among millennials.\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### compute_rouge_python\n", + "To use `compute_rouge_python`, you only need to install the Python package `py-rouge` and `nltk`." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of candidates: 2\n", + "Number of references: 2\n" + ] + } + ], + "source": [ + "python_rouge_scores = compute_rouge_python(cand=summary_candidates, ref=summary_references)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ROUGE-1: {'f': 0.7696078431372548, 'p': 0.875, 'r': 0.6904761904761905}\n", + "ROUGE-2: {'f': 0.6666666666666667, 'p': 0.7857142857142857, 'r': 0.5833333333333333}\n", + "ROUGE-L: {'f': 0.7696078431372548, 'p': 0.875, 'r': 0.6904761904761905}\n" + ] + } + ], + "source": [ + "print(\"ROUGE-1: {}\".format(python_rouge_scores[\"rouge-1\"]))\n", + "print(\"ROUGE-2: {}\".format(python_rouge_scores[\"rouge-2\"]))\n", + "print(\"ROUGE-L: {}\".format(python_rouge_scores[\"rouge-l\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `compute_rouge_python` function can also support non-English languages. Currently, only Hindi is supported. Support for other languages will be added on an as-needed basis. \n", + "Note that the Hindi sample inputs are generated by translation, so they are not perfect, but suffcient for testing. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "summary_candidates_hi = [\n", + " \"शेयर बाजार इस साल बहुत अच्छा कर रहा है। 2020 के लिए भी यही उम्मीद है।\",\n", + " \"नई फिल्म बहुत लोकप्रिय है।\",\n", + "]\n", + "summary_references_hi = [\n", + " \"शेयर बाजार 2019 में वास्तव में अच्छा कर रहा है। आशा है कि 2020 भी ऐसा ही होगा।\",\n", + " \"फिल्म सदियों के बीच बहुत लोकप्रिय है।\",\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of candidates: 2\n", + "Number of references: 2\n", + "ROUGE-1: {'f': 0.5980392156862745, 'p': 0.68125, 'r': 0.5357142857142857}\n", + "ROUGE-2: {'f': 0.325, 'p': 0.3833333333333333, 'r': 0.28431372549019607}\n", + "ROUGE-L: {'f': 0.5980392156862745, 'p': 0.68125, 'r': 0.5357142857142857}\n" + ] + } + ], + "source": [ + "python_rouge_scores_hi = compute_rouge_python(cand=summary_candidates_hi, ref=summary_references_hi, language=\"hi\")\n", + "print(\"ROUGE-1: {}\".format(python_rouge_scores_hi[\"rouge-1\"]))\n", + "print(\"ROUGE-2: {}\".format(python_rouge_scores_hi[\"rouge-2\"]))\n", + "print(\"ROUGE-L: {}\".format(python_rouge_scores_hi[\"rouge-l\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### compute_rouge_perl\n", + "To use `compute_rouge_perl`, in addition to installing the Python package `pyrouge`, you also need to go through the following setup steps on a Linux machine. \n", + "**NOTE**: Set `PYROUGE_PATH` to the root directory of the cloned `pyrouge` repo and `PYTHON_PATH` to the root directory of the conda environment where you installed `pyrouge` first." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "%%bash\n", + "git clone https://github.com/andersjo/pyrouge.git\n", + "PYROUGE_PATH=#e.g./home/hlu/notebooks/summarization/pyrouge\n", + "PYTHON_PATH= #e.g./data/anaconda/envs/nlp_gpu\n", + "$PYTHON_PATH/bin/pyrouge_set_rouge_path $PYROUGE_PATH/tools/ROUGE-1.5.5\n", + "\n", + "# install XML::DOM plugin, instructions https://web.archive.org/web/20171107220839/www.summarizerman.com/post/42675198985/figuring-out-rouge\n", + "sudo cpan App::cpanminus\n", + "sudo cpanm XML::DOM\n", + "\n", + "# install XLM::Parser and its dependencies\n", + "sudo apt-get update\n", + "sudo apt-get install libexpat1-dev -y\n", + "sudo cpanm XML::Parser\n", + "\n", + "# Fix WordNet issue\n", + "# Instructions https://web.archive.org/web/20180812011301/http://kavita-ganesan.com/rouge-howto/#IamHavingWordNetExceptions\n", + "cd $PYROUGE_PATH/tools/ROUGE-1.5.5/data/\n", + "rm WordNet-2.0.exc.db\n", + "\n", + "cd WordNet-2.0-Exceptions/\n", + "./buildExeptionDB.pl . exc WordNet-2.0.exc.db\n", + "cd ..\n", + "ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2019-12-03 19:43:25,977 [MainThread ] [INFO ] Writing summaries.\n", + "2019-12-03 19:43:25,978 [MainThread ] [INFO ] Processing summaries. 
Saving system files to /tmp/tmpm29_bwie/system and model files to /tmp/tmpm29_bwie/model.\n", + "2019-12-03 19:43:25,979 [MainThread ] [INFO ] Processing files in /tmp/tmpf5p8odh5/rouge-tmp-2019-12-03-19-43-25/candidate/.\n", + "2019-12-03 19:43:25,980 [MainThread ] [INFO ] Processing cand.1.txt.\n", + "2019-12-03 19:43:25,981 [MainThread ] [INFO ] Processing cand.0.txt.\n", + "2019-12-03 19:43:25,982 [MainThread ] [INFO ] Saved processed files to /tmp/tmpm29_bwie/system.\n", + "2019-12-03 19:43:25,982 [MainThread ] [INFO ] Processing files in /tmp/tmpf5p8odh5/rouge-tmp-2019-12-03-19-43-25/reference/.\n", + "2019-12-03 19:43:25,983 [MainThread ] [INFO ] Processing ref.0.txt.\n", + "2019-12-03 19:43:25,984 [MainThread ] [INFO ] Processing ref.1.txt.\n", + "2019-12-03 19:43:25,985 [MainThread ] [INFO ] Saved processed files to /tmp/tmpm29_bwie/model.\n", + "2019-12-03 19:43:25,986 [MainThread ] [INFO ] Written ROUGE configuration to /tmp/tmps00p9hvz/rouge_conf.xml\n", + "2019-12-03 19:43:25,987 [MainThread ] [INFO ] Running ROUGE with command /home/hlu/notebooks/summarization/pyrouge/tools/ROUGE-1.5.5/ROUGE-1.5.5.pl -e /home/hlu/notebooks/summarization/pyrouge/tools/ROUGE-1.5.5/data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m /tmp/tmps00p9hvz/rouge_conf.xml\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of candidates: 2\n", + "Number of references: 2\n", + "---------------------------------------------\n", + "1 ROUGE-1 Average_R: 0.69048 (95%-conf.int. 0.66667 - 0.71429)\n", + "1 ROUGE-1 Average_P: 0.87500 (95%-conf.int. 0.75000 - 1.00000)\n", + "1 ROUGE-1 Average_F: 0.76961 (95%-conf.int. 0.70588 - 0.83334)\n", + "---------------------------------------------\n", + "1 ROUGE-2 Average_R: 0.58333 (95%-conf.int. 0.50000 - 0.66667)\n", + "1 ROUGE-2 Average_P: 0.78571 (95%-conf.int. 0.57143 - 1.00000)\n", + "1 ROUGE-2 Average_F: 0.66666 (95%-conf.int. 0.53333 - 0.80000)\n", + "---------------------------------------------\n", + "1 ROUGE-3 Average_R: 0.51428 (95%-conf.int. 0.42857 - 0.60000)\n", + "1 ROUGE-3 Average_P: 0.75000 (95%-conf.int. 0.50000 - 1.00000)\n", + "1 ROUGE-3 Average_F: 0.60577 (95%-conf.int. 0.46154 - 0.75000)\n", + "---------------------------------------------\n", + "1 ROUGE-4 Average_R: 0.41666 (95%-conf.int. 0.33333 - 0.50000)\n", + "1 ROUGE-4 Average_P: 0.70000 (95%-conf.int. 0.40000 - 1.00000)\n", + "1 ROUGE-4 Average_F: 0.51515 (95%-conf.int. 0.36363 - 0.66667)\n", + "---------------------------------------------\n", + "1 ROUGE-L Average_R: 0.69048 (95%-conf.int. 0.66667 - 0.71429)\n", + "1 ROUGE-L Average_P: 0.87500 (95%-conf.int. 0.75000 - 1.00000)\n", + "1 ROUGE-L Average_F: 0.76961 (95%-conf.int. 0.70588 - 0.83334)\n", + "---------------------------------------------\n", + "1 ROUGE-W-1.2 Average_R: 0.44238 (95%-conf.int. 0.40075 - 0.48401)\n", + "1 ROUGE-W-1.2 Average_P: 0.84981 (95%-conf.int. 0.69963 - 1.00000)\n", + "1 ROUGE-W-1.2 Average_F: 0.58095 (95%-conf.int. 0.50960 - 0.65230)\n", + "---------------------------------------------\n", + "1 ROUGE-S* Average_R: 0.44643 (95%-conf.int. 0.41667 - 0.47619)\n", + "1 ROUGE-S* Average_P: 0.76785 (95%-conf.int. 0.53571 - 1.00000)\n", + "1 ROUGE-S* Average_F: 0.55695 (95%-conf.int. 0.46875 - 0.64516)\n", + "---------------------------------------------\n", + "1 ROUGE-SU* Average_R: 0.49790 (95%-conf.int. 0.47727 - 0.51852)\n", + "1 ROUGE-SU* Average_P: 0.80000 (95%-conf.int. 0.60000 - 1.00000)\n", + "1 ROUGE-SU* Average_F: 0.60729 (95%-conf.int. 
0.53164 - 0.68293)\n", + "\n" + ] + } + ], + "source": [ + "perl_rouge_scores = compute_rouge_perl(cand=summary_candidates, ref=summary_references)" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'rouge_1_recall': 0.69048,\n", + " 'rouge_1_recall_cb': 0.66667,\n", + " 'rouge_1_recall_ce': 0.71429,\n", + " 'rouge_1_precision': 0.875,\n", + " 'rouge_1_precision_cb': 0.75,\n", + " 'rouge_1_precision_ce': 1.0,\n", + " 'rouge_1_f_score': 0.76961,\n", + " 'rouge_1_f_score_cb': 0.70588,\n", + " 'rouge_1_f_score_ce': 0.83334,\n", + " 'rouge_2_recall': 0.58333,\n", + " 'rouge_2_recall_cb': 0.5,\n", + " 'rouge_2_recall_ce': 0.66667,\n", + " 'rouge_2_precision': 0.78571,\n", + " 'rouge_2_precision_cb': 0.57143,\n", + " 'rouge_2_precision_ce': 1.0,\n", + " 'rouge_2_f_score': 0.66666,\n", + " 'rouge_2_f_score_cb': 0.53333,\n", + " 'rouge_2_f_score_ce': 0.8,\n", + " 'rouge_3_recall': 0.51428,\n", + " 'rouge_3_recall_cb': 0.42857,\n", + " 'rouge_3_recall_ce': 0.6,\n", + " 'rouge_3_precision': 0.75,\n", + " 'rouge_3_precision_cb': 0.5,\n", + " 'rouge_3_precision_ce': 1.0,\n", + " 'rouge_3_f_score': 0.60577,\n", + " 'rouge_3_f_score_cb': 0.46154,\n", + " 'rouge_3_f_score_ce': 0.75,\n", + " 'rouge_4_recall': 0.41666,\n", + " 'rouge_4_recall_cb': 0.33333,\n", + " 'rouge_4_recall_ce': 0.5,\n", + " 'rouge_4_precision': 0.7,\n", + " 'rouge_4_precision_cb': 0.4,\n", + " 'rouge_4_precision_ce': 1.0,\n", + " 'rouge_4_f_score': 0.51515,\n", + " 'rouge_4_f_score_cb': 0.36363,\n", + " 'rouge_4_f_score_ce': 0.66667,\n", + " 'rouge_l_recall': 0.69048,\n", + " 'rouge_l_recall_cb': 0.66667,\n", + " 'rouge_l_recall_ce': 0.71429,\n", + " 'rouge_l_precision': 0.875,\n", + " 'rouge_l_precision_cb': 0.75,\n", + " 'rouge_l_precision_ce': 1.0,\n", + " 'rouge_l_f_score': 0.76961,\n", + " 'rouge_l_f_score_cb': 0.70588,\n", + " 'rouge_l_f_score_ce': 0.83334,\n", + " 'rouge_w_1.2_recall': 0.44238,\n", + " 'rouge_w_1.2_recall_cb': 0.40075,\n", + " 'rouge_w_1.2_recall_ce': 0.48401,\n", + " 'rouge_w_1.2_precision': 0.84981,\n", + " 'rouge_w_1.2_precision_cb': 0.69963,\n", + " 'rouge_w_1.2_precision_ce': 1.0,\n", + " 'rouge_w_1.2_f_score': 0.58095,\n", + " 'rouge_w_1.2_f_score_cb': 0.5096,\n", + " 'rouge_w_1.2_f_score_ce': 0.6523,\n", + " 'rouge_s*_recall': 0.44643,\n", + " 'rouge_s*_recall_cb': 0.41667,\n", + " 'rouge_s*_recall_ce': 0.47619,\n", + " 'rouge_s*_precision': 0.76785,\n", + " 'rouge_s*_precision_cb': 0.53571,\n", + " 'rouge_s*_precision_ce': 1.0,\n", + " 'rouge_s*_f_score': 0.55695,\n", + " 'rouge_s*_f_score_cb': 0.46875,\n", + " 'rouge_s*_f_score_ce': 0.64516,\n", + " 'rouge_su*_recall': 0.4979,\n", + " 'rouge_su*_recall_cb': 0.47727,\n", + " 'rouge_su*_recall_ce': 0.51852,\n", + " 'rouge_su*_precision': 0.8,\n", + " 'rouge_su*_precision_cb': 0.6,\n", + " 'rouge_su*_precision_ce': 1.0,\n", + " 'rouge_su*_f_score': 0.60729,\n", + " 'rouge_su*_f_score_cb': 0.53164,\n", + " 'rouge_su*_f_score_ce': 0.68293}" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "perl_rouge_scores" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For each score, the 95% confidence interval is also computed, i.e. \"\\_cb\" and \"\\_ce\" stand for the beginning and end of the confidence interval, respectively. \n", + "In addition to ROUGE-1, ROUGE-2, ROUGE-L, the perl script computes a few other ROUGE scores. 
See details of all scores [here](https://en.wikipedia.org/wiki/ROUGE_%28metric%29). " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "nlp_gpu", + "language": "python", + "name": "nlp_gpu" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/tests/conftest.py b/tests/conftest.py index c1428c41b..b940476e8 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -85,6 +85,9 @@ def notebooks(): "deep_and_unified_understanding": os.path.join( folder_notebooks, "model_explainability", "interpret_dnn_layers.ipynb" ), + "extractive_summarization_cnndm_transformer": os.path.join( + folder_notebooks, "text_summarization", "extractive_summarization_cnndm_transformer.ipynb" + ), } return paths diff --git a/tests/integration/test_notebooks_extractive_summarization.py b/tests/integration/test_notebooks_extractive_summarization.py new file mode 100644 index 000000000..4f9e17f79 --- /dev/null +++ b/tests/integration/test_notebooks_extractive_summarization.py @@ -0,0 +1,36 @@ +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. + +import papermill as pm +import pytest +import scrapbook as sb +from tests.notebooks_common import KERNEL_NAME, OUTPUT_NOTEBOOK + +ABS_TOL = 0.02 + + +@pytest.mark.gpu +@pytest.mark.integration +def test_extractive_summarization_cnndm_transformers(notebooks, tmp): + notebook_path = notebooks["extractive_summarization_cnndm_transformer"] + pm.execute_notebook( + notebook_path, + OUTPUT_NOTEBOOK, + kernel_name=KERNEL_NAME, + parameters=dict( + QUICK_RUN=True, + TOP_N=1000, + CHUNK_SIZE=200, + USE_PREPROCESSED_DATA=False, + NUM_GPUS=1, + DATA_FOLDER=tmp, + CACHE_DIR=tmp, + BATCH_SIZE=3000, + REPORT_EVERY=50, + MAX_STEPS=1000, + WARMUP_STEPS=5e2, + MODEL_NAME="distilbert-base-uncased", + ), + ) + result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict + assert pytest.approx(result["rouge_2_f_score"], 0.1, abs=ABS_TOL) diff --git a/tests/integration/test_notebooks_text_classification.py b/tests/integration/test_notebooks_text_classification.py index 8f00107eb..97eb9d6d7 100644 --- a/tests/integration/test_notebooks_text_classification.py +++ b/tests/integration/test_notebooks_text_classification.py @@ -33,8 +33,8 @@ def test_tc_mnli_transformers(notebooks, tmp): ), ) result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict - assert pytest.approx(result["accuracy"], 0.89, abs=ABS_TOL) - assert pytest.approx(result["f1"], 0.89, abs=ABS_TOL) + assert pytest.approx(result["accuracy"], 0.885, abs=ABS_TOL) + assert pytest.approx(result["f1"], 0.885, abs=ABS_TOL) @pytest.mark.integration diff --git a/tests/smoke/test_gpu_utils.py b/tests/smoke/test_gpu_utils.py index 11418ad38..32d04a136 100644 --- a/tests/smoke/test_gpu_utils.py +++ b/tests/smoke/test_gpu_utils.py @@ -9,4 +9,3 @@ @pytest.mark.gpu def test_machine_is_gpu_machine(): assert torch.cuda.is_available() is True - diff --git a/tests/unit/test_bert_token_classification.py b/tests/unit/test_bert_token_classification.py deleted file mode 100644 index c3a46584f..000000000 --- a/tests/unit/test_bert_token_classification.py +++ /dev/null @@ -1,79 +0,0 @@ -# Copyright (c) Microsoft Corporation. All rights reserved. -# Licensed under the MIT License. 
- -import pytest - -from utils_nlp.models.bert.token_classification import ( - BERTTokenClassifier, - postprocess_token_labels, -) - - -def test_token_classifier_num_labels(): - with pytest.raises(ValueError): - BERTTokenClassifier(num_labels=1) - - -def test_token_classifier_fit_predict(tmp_path, ner_test_data): - token_classifier = BERTTokenClassifier(num_labels=6, cache_dir=tmp_path) - - # test fit, no warmup - token_classifier.fit( - token_ids=ner_test_data["INPUT_TOKEN_IDS"], - input_mask=ner_test_data["INPUT_MASK"], - labels=ner_test_data["INPUT_LABEL_IDS"], - ) - - # test fit, with warmup - token_classifier.fit( - token_ids=ner_test_data["INPUT_TOKEN_IDS"], - input_mask=ner_test_data["INPUT_MASK"], - labels=ner_test_data["INPUT_LABEL_IDS"], - warmup_proportion=0.1, - ) - # test predict, no labels - token_classifier.predict( - token_ids=ner_test_data["INPUT_TOKEN_IDS"], - input_mask=ner_test_data["INPUT_MASK"], - ) - - # test predict, with labels - token_classifier.predict( - token_ids=ner_test_data["INPUT_TOKEN_IDS"], - input_mask=ner_test_data["INPUT_MASK"], - labels=ner_test_data["INPUT_LABEL_IDS"], - ) - - # test output probabilities - predictions = token_classifier.predict( - token_ids=ner_test_data["INPUT_TOKEN_IDS"], - input_mask=ner_test_data["INPUT_MASK"], - labels=ner_test_data["INPUT_LABEL_IDS"], - probabilities=True, - ) - assert len(predictions.classes) == predictions.probabilities.shape[0] - - -def test_postprocess_token_labels(ner_test_data): - labels_no_padding = postprocess_token_labels( - labels=ner_test_data["PREDICTED_LABELS"], - input_mask=ner_test_data["INPUT_MASK"], - label_map=ner_test_data["LABEL_MAP"], - ) - - assert labels_no_padding == ner_test_data["EXPECTED_TOKENS_NO_PADDING"] - - -def test_postprocess_token_labels_remove_trailing(ner_test_data): - labels_no_padding_no_trailing = postprocess_token_labels( - labels=ner_test_data["PREDICTED_LABELS"], - input_mask=ner_test_data["INPUT_MASK"], - label_map=ner_test_data["LABEL_MAP"], - remove_trailing_word_pieces=True, - trailing_token_mask=ner_test_data["TRAILING_TOKEN_MASK"], - ) - - assert ( - labels_no_padding_no_trailing - == ner_test_data["EXPECTED_TOKENS_NO_PADDING_NO_TRAILING"] - ) diff --git a/tests/unit/test_common_pytorch_utils.py b/tests/unit/test_common_pytorch_utils.py index e2fce1e10..6b966d3e2 100644 --- a/tests/unit/test_common_pytorch_utils.py +++ b/tests/unit/test_common_pytorch_utils.py @@ -1,14 +1,15 @@ # Copyright (c) Microsoft Corporation. All rights reserved. # Licensed under the MIT License. 
+"""PyTorch utils tests.""" import pytest import torch import torch.nn as nn -from torch.nn.parallel.data_parallel import DataParallel from torch.nn.modules.container import Sequential +from torch.nn.parallel.data_parallel import DataParallel -from utils_nlp.common.pytorch_utils import get_device, move_to_device +from utils_nlp.common.pytorch_utils import get_device, move_model_to_device @pytest.fixture @@ -46,58 +47,56 @@ def test_get_device_all_gpus(): @pytest.mark.gpu def test_get_device_local_rank(): - device, gpus = get_device(local_rank=1) + device, gpus = get_device(local_rank=0) assert isinstance(device, torch.device) assert device.type == "cuda" - assert device.index == 1 + assert device.index == 0 assert gpus == 1 def test_move_to_device_cpu(model): # test when device.type="cpu" - model_cpu = move_to_device(model, torch.device("cpu")) + model_cpu = move_model_to_device(model, torch.device("cpu")) assert isinstance(model_cpu, nn.modules.container.Sequential) def test_move_to_device_cpu_parallelized(model): # test when input model is parallelized model_parallelized = nn.DataParallel(model) - model_parallelized_output = move_to_device(model_parallelized, torch.device("cpu")) + model_parallelized_output = move_model_to_device(model_parallelized, torch.device("cpu")) assert isinstance(model_parallelized_output, nn.modules.container.Sequential) def test_move_to_device_exception_not_torch_device(model): # test when device is not torch.device with pytest.raises(ValueError): - move_to_device(model, "abc") + move_model_to_device(model, "abc") def test_move_to_device_exception_wrong_type(model): # test when device.type is not "cuda" or "cpu" with pytest.raises(Exception): - move_to_device(model, torch.device("opengl")) + move_model_to_device(model, torch.device("opengl")) -@pytest.mark.skipif( - torch.cuda.is_available(), reason="Skip if we are executing the cpu tests on a gpu machine" -) +@pytest.mark.skipif(torch.cuda.is_available(), reason="Skip if we are executing the cpu tests on a gpu machine") def test_move_to_device_exception_gpu_model_on_cpu_machine(model): # test when the model is moved to a gpu but it is a cpu machine with pytest.raises(Exception): - move_to_device(model, torch.device("cuda")) + move_model_to_device(model, torch.device("cuda")) @pytest.mark.gpu def test_move_to_device_exception_cuda_zero_gpus(model): # test when device.type is cuda, but num_gpus is 0 with pytest.raises(ValueError): - move_to_device(model, torch.device("cuda"), num_gpus=0) + move_model_to_device(model, torch.device("cuda"), num_gpus=0) @pytest.mark.gpu def test_move_to_device_gpu(model): # test when device.type="cuda" - model_cuda = move_to_device(model, torch.device("cuda")) + model_cuda = move_model_to_device(model, torch.device("cuda")) num_cuda_devices = torch.cuda.device_count() if num_cuda_devices > 1: @@ -105,20 +104,17 @@ def test_move_to_device_gpu(model): else: assert isinstance(model_cuda, Sequential) - model_cuda_1_gpu = move_to_device(model, torch.device("cuda"), num_gpus=1) + model_cuda_1_gpu = move_model_to_device(model, torch.device("cuda"), num_gpus=1) assert isinstance(model_cuda_1_gpu, Sequential) - model_cuda_1_more_gpu = move_to_device( - model, torch.device("cuda"), num_gpus=num_cuda_devices + 1 - ) + model_cuda_1_more_gpu = move_model_to_device(model, torch.device("cuda"), num_gpus=num_cuda_devices + 1) if num_cuda_devices > 1: assert isinstance(model_cuda_1_more_gpu, DataParallel) else: assert isinstance(model_cuda_1_more_gpu, Sequential) - model_cuda_same_gpu = 
move_to_device(model, torch.device("cuda"), num_gpus=num_cuda_devices) + model_cuda_same_gpu = move_model_to_device(model, torch.device("cuda"), num_gpus=num_cuda_devices) if num_cuda_devices > 1: assert isinstance(model_cuda_same_gpu, DataParallel) else: assert isinstance(model_cuda_same_gpu, Sequential) - diff --git a/tests/unit/test_eval_compute_rouge.py b/tests/unit/test_eval_compute_rouge.py new file mode 100644 index 000000000..778594765 --- /dev/null +++ b/tests/unit/test_eval_compute_rouge.py @@ -0,0 +1,221 @@ +import os +import pytest +from utils_nlp.eval import compute_rouge_perl, compute_rouge_python + +ABS_TOL = 0.00001 + +R1R = 0.71429 +R1P = 0.77381 +R1F = 0.74176 +R2R = 0.44231 +R2P = 0.49231 +R2F = 0.46504 +RLR = 0.67857 +RLP = 0.73810 +RLF = 0.70605 + +R1R_hi = 0.53571 +R1P_hi = 0.68125 +R1F_hi = 0.59804 +R2R_hi = 0.28431 +R2P_hi = 0.38334 +R2F_hi = 0.325 +RLR_hi = 0.53571 +RLP_hi = 0.68125 +RLF_hi = 0.59804 + + +@pytest.fixture() +def rouge_test_data(): + ## First English testing case: + # Unigrams in candidate: 14 + # Unigrams in reference: 14 + # Unigram overlapping: 10 + # Bigrams in candidate: 13 + # Bigrams in reference: 13 + # Bigram overlapping: 5 + # LCS: 6, 3 + # ROUGE-1 R: 10/14 = 0.71429 + # ROUGE-1 P: 10/14 = 0.71429 + # ROUGE-1 F: 2/(14/10 + 14/10) = 20/28 = 0.71429 + # ROUGE-2 R: 5/13 = 0.38462 + # ROUGE-2 P: 5/13 = 0.38462 + # ROUGE-2 F: 0.38462 + # ROUGE-L R: (6+3)/(9+5) = 0.64286 + # ROUGE-L P: 0.64286 + # ROUGE-L F: 0.64286 + + ## Second English testing case: + # Unigrams in candidate: 6 + # Unigrams in reference: 7 + # Unigram overlapping: 5 + # Bigrams in candidate: 5 + # Bigrams in reference: 6 + # Bigram overlapping: 3 + # LCS: 5 + # ROUGE-1 R: 5/7 = 0.71429 + # ROUGE-1 P: 5/6 = 0.83333 + # ROUGE-1 F: 2/(7/5 + 6/5) = 10/13 = 0.76923 + # ROUGE-2 R: 3/6 = 0.5 + # ROUGE-2 P: 3/5 = 0.6 + # ROUGE-2 F: 2/(6/3 + 5/3) = 6/11 = 0.54545 + # ROUGE-L R: 5/7 = 0.71429 + # ROUGE-L P: 5/6 = 0.83333 + # ROUGE-L F: 2/(7/5 + 6/5) = 10/13 = 0.76923 + + summary_candidates = [ + "The stock market is doing very well this year. Hope the same for 2020", + "The new movie is very popular.", + ] + summary_references = [ + "The stock market is doing really well in 2019. 
Hope 2020 is the same.",
+        "The movie is very popular among millennials.",
+    ]
+
+    ## First Hindi testing case:
+    # Unigrams in candidate: 16
+    # Unigrams in reference: 18
+    # Unigram overlapping: 9
+    # Bigrams in candidate: 15
+    # Bigrams in reference: 17
+    # Bigram overlapping: 4
+    # LCS: 6, 3 (for each reference sentence, the code checks each candidate sentence)
+    # ROUGE-1 R: 9/18 = 0.5
+    # ROUGE-1 P: 9/16 = 0.5625
+    # ROUGE-1 F: 2/(18/9 + 16/9) = 18/34 = 0.52941
+    # ROUGE-2 R: 4/17 = 0.23529
+    # ROUGE-2 P: 4/15 = 0.26667
+    # ROUGE-2 F: 2/(17/4 + 15/4) = 8/32 = 0.25
+    # ROUGE-L R: (6+3)/18 = 0.5
+    # ROUGE-L P: (6+3)/16 = 0.5625
+    # ROUGE-L F: 2/(18/9 + 16/9) = 18/34 = 0.52941
+
+    ## Second Hindi testing case:
+    # Unigrams in candidate: 5
+    # Unigrams in reference: 7
+    # Unigram overlapping: 4
+    # Bigrams in candidate: 4
+    # Bigrams in reference: 6
+    # Bigram overlapping: 2
+    # LCS: 4
+    # ROUGE-1 R: 4/7 = 0.57143
+    # ROUGE-1 P: 4/5 = 0.8
+    # ROUGE-1 F: 2/(7/4 + 5/4) = 8/12 = 0.66667
+    # ROUGE-2 R: 2/6 = 0.33333
+    # ROUGE-2 P: 2/4 = 0.5
+    # ROUGE-2 F: 2/(6/2 + 4/2) = 4/10 = 0.4
+    # ROUGE-L R: 4/7 = 0.57143
+    # ROUGE-L P: 4/5 = 0.8
+    # ROUGE-L F: 2/(7/4 + 5/4) = 8/12 = 0.66667
+
+    summary_candidates_hi = [
+        "शेयर बाजार इस साल बहुत अच्छा कर रहा है। 2020 के लिए भी यही उम्मीद है।",
+        "नई फिल्म बहुत लोकप्रिय है।",
+    ]
+    summary_references_hi = [
+        "शेयर बाजार 2019 में वास्तव में अच्छा कर रहा है। आशा है कि 2020 भी ऐसा ही होगा।",
+        "फिल्म सदियों के बीच बहुत लोकप्रिय है।",
+    ]
+
+    return {
+        "candidates": summary_candidates,
+        "references": summary_references,
+        "candidates_hi": summary_candidates_hi,
+        "references_hi": summary_references_hi,
+    }
+
+
+def test_compute_rouge_perl(rouge_test_data):
+    rouge_perl = compute_rouge_perl(
+        cand=rouge_test_data["candidates"], ref=rouge_test_data["references"]
+    )
+
+    assert rouge_perl["rouge_1_recall"] == pytest.approx(R1R, abs=ABS_TOL)
+    assert rouge_perl["rouge_1_precision"] == pytest.approx(R1P, abs=ABS_TOL)
+    assert rouge_perl["rouge_1_f_score"] == pytest.approx(R1F, abs=ABS_TOL)
+    assert rouge_perl["rouge_2_recall"] == pytest.approx(R2R, abs=ABS_TOL)
+    assert rouge_perl["rouge_2_precision"] == pytest.approx(R2P, abs=ABS_TOL)
+    assert rouge_perl["rouge_2_f_score"] == pytest.approx(R2F, abs=ABS_TOL)
+    assert rouge_perl["rouge_l_recall"] == pytest.approx(RLR, abs=ABS_TOL)
+    assert rouge_perl["rouge_l_precision"] == pytest.approx(RLP, abs=ABS_TOL)
+    assert rouge_perl["rouge_l_f_score"] == pytest.approx(RLF, abs=ABS_TOL)
+
+
+def test_compute_rouge_python(rouge_test_data):
+    rouge_python = compute_rouge_python(
+        cand=rouge_test_data["candidates"], ref=rouge_test_data["references"]
+    )
+
+    assert rouge_python["rouge-1"]["r"] == pytest.approx(R1R, abs=ABS_TOL)
+    assert rouge_python["rouge-1"]["p"] == pytest.approx(R1P, abs=ABS_TOL)
+    assert rouge_python["rouge-1"]["f"] == pytest.approx(R1F, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["r"] == pytest.approx(R2R, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["p"] == pytest.approx(R2P, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["f"] == pytest.approx(R2F, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["r"] == pytest.approx(RLR, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["p"] == pytest.approx(RLP, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["f"] == pytest.approx(RLF, abs=ABS_TOL)
+
+
+def test_compute_rouge_python_hi(rouge_test_data):
+    rouge_python = compute_rouge_python(
+        cand=rouge_test_data["candidates_hi"], ref=rouge_test_data["references_hi"], language="hi"
+    )
+
+    assert rouge_python["rouge-1"]["r"] == pytest.approx(R1R_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-1"]["p"] == pytest.approx(R1P_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-1"]["f"] == pytest.approx(R1F_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["r"] == pytest.approx(R2R_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["p"] == pytest.approx(R2P_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["f"] == pytest.approx(R2F_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["r"] == pytest.approx(RLR_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["p"] == pytest.approx(RLP_hi, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["f"] == pytest.approx(RLF_hi, abs=ABS_TOL)
+
+
+def test_compute_rouge_perl_file(rouge_test_data, tmp):
+    tmp_cand_file = os.path.join(tmp, "cand.txt")
+    tmp_ref_file = os.path.join(tmp, "ref.txt")
+
+    with open(tmp_cand_file, "w") as f:
+        for s in rouge_test_data["candidates"]:
+            f.write(s + "\n")
+    with open(tmp_ref_file, "w") as f:
+        for s in rouge_test_data["references"]:
+            f.write(s + "\n")
+
+    rouge_perl = compute_rouge_perl(cand=tmp_cand_file, ref=tmp_ref_file, is_input_files=True)
+
+    assert rouge_perl["rouge_1_recall"] == pytest.approx(R1R, abs=ABS_TOL)
+    assert rouge_perl["rouge_1_precision"] == pytest.approx(R1P, abs=ABS_TOL)
+    assert rouge_perl["rouge_1_f_score"] == pytest.approx(R1F, abs=ABS_TOL)
+    assert rouge_perl["rouge_2_recall"] == pytest.approx(R2R, abs=ABS_TOL)
+    assert rouge_perl["rouge_2_precision"] == pytest.approx(R2P, abs=ABS_TOL)
+    assert rouge_perl["rouge_2_f_score"] == pytest.approx(R2F, abs=ABS_TOL)
+    assert rouge_perl["rouge_l_recall"] == pytest.approx(RLR, abs=ABS_TOL)
+    assert rouge_perl["rouge_l_precision"] == pytest.approx(RLP, abs=ABS_TOL)
+    assert rouge_perl["rouge_l_f_score"] == pytest.approx(RLF, abs=ABS_TOL)
+
+
+def test_compute_rouge_python_file(rouge_test_data, tmp):
+    tmp_cand_file = os.path.join(tmp, "cand.txt")
+    tmp_ref_file = os.path.join(tmp, "ref.txt")
+
+    with open(tmp_cand_file, "w") as f:
+        for s in rouge_test_data["candidates"]:
+            f.write(s + "\n")
+    with open(tmp_ref_file, "w") as f:
+        for s in rouge_test_data["references"]:
+            f.write(s + "\n")
+
+    rouge_python = compute_rouge_python(cand=tmp_cand_file, ref=tmp_ref_file, is_input_files=True)
+
+    assert rouge_python["rouge-1"]["r"] == pytest.approx(R1R, abs=ABS_TOL)
+    assert rouge_python["rouge-1"]["p"] == pytest.approx(R1P, abs=ABS_TOL)
+    assert rouge_python["rouge-1"]["f"] == pytest.approx(R1F, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["r"] == pytest.approx(R2R, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["p"] == pytest.approx(R2P, abs=ABS_TOL)
+    assert rouge_python["rouge-2"]["f"] == pytest.approx(R2F, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["r"] == pytest.approx(RLR, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["p"] == pytest.approx(RLP, abs=ABS_TOL)
+    assert rouge_python["rouge-l"]["f"] == pytest.approx(RLF, abs=ABS_TOL)
diff --git a/tests/unit/test_extractive_summarization.py b/tests/unit/test_extractive_summarization.py
new file mode 100644
index 000000000..797e631e5
--- /dev/null
+++ b/tests/unit/test_extractive_summarization.py
@@ -0,0 +1,109 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+import os
+
+import nltk
+nltk.download("punkt")
+import pytest
+from nltk import tokenize
+
+from utils_nlp.models.transformers.datasets import SummarizationDataset
+from utils_nlp.models.transformers.extractive_summarization import (
+    ExtractiveSummarizer,
+    ExtSumProcessedData,
+    ExtSumProcessor,
+)
+
+
+# @pytest.fixture()
+def source_data():
+    return (
+        "Boston, MA welcome to Microsoft/nlp. Welcome to text summarization."
+        "Welcome to Microsoft NERD."
+        "Look outside, what a beautiful Charles River fall view."
+    )
+
+
+# @pytest.fixture()
+def target_data():
+    return "welcome to microsoft/nlp." "Welcome to text summarization." "Welcome to Microsoft NERD."
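Stepping back to the ROUGE tests above: the module-level expected constants follow from the per-pair scores worked out in the fixture comments, averaged over the two candidate/reference pairs. A quick arithmetic check (plain Python, values copied from the comments; illustrative only):

```python
# ROUGE-1 per-pair scores from the fixture comments (English cases).
case1 = {"r": 10 / 14, "p": 10 / 14, "f": 10 / 14}          # first pair
case2 = {"r": 5 / 7, "p": 5 / 6, "f": 2 / (7 / 5 + 6 / 5)}  # second pair

# The constants are per-pair scores averaged over the corpus; note that the
# F score is the average of the per-pair F scores, not the F of averaged P/R.
assert round((case1["r"] + case2["r"]) / 2, 5) == 0.71429   # R1R
assert round((case1["p"] + case2["p"]) / 2, 5) == 0.77381   # R1P
assert round((case1["f"] + case2["f"]) / 2, 5) == 0.74176   # R1F
```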
+ + +MODEL_NAME = "distilbert-base-uncased" +NUM_GPUS = 1 + + +@pytest.fixture(scope="module") +def data_to_file(tmp_module): + source = source_data() + target = target_data() + source_file = os.path.join(tmp_module, "source.txt") + target_file = os.path.join(tmp_module, "target.txt") + f = open(source_file, "w") + f.write(source) + f.close() + f = open(target_file, "w") + f.write(target) + f.close() + train_dataset = SummarizationDataset( + source_file, target_file, [tokenize.sent_tokenize], [tokenize.sent_tokenize], nltk.word_tokenize, + ) + test_dataset = SummarizationDataset( + source_file, target_file, [tokenize.sent_tokenize], [tokenize.sent_tokenize], nltk.word_tokenize, + ) + + processor = ExtSumProcessor( + model_name=MODEL_NAME, + cache_dir=tmp_module, + max_nsents=200, + max_src_ntokens=2000, + min_nsents=0, + min_src_ntokens=1, + ) + ext_sum_train = processor.preprocess(train_dataset, train_dataset.get_target(), oracle_mode="greedy") + ext_sum_test = processor.preprocess(test_dataset, test_dataset.get_target(), oracle_mode="greedy") + + save_path = os.path.join(tmp_module, "processed") + train_files = ExtSumProcessedData.save_data(ext_sum_train, is_test=False, save_path=save_path, chunk_size=2000) + test_files = ExtSumProcessedData.save_data(ext_sum_test, is_test=True, save_path=save_path, chunk_size=2000) + print(train_files) + print(test_files) + assert os.path.exists(train_files[0]) + assert os.path.exists(test_files[0]) + return save_path + + +@pytest.mark.gpu +def test_bert_training(data_to_file, tmp_module): + + CACHE_DIR = tmp_module + ENCODER = "transformer" + BATCH_SIZE = 128 + LEARNING_RATE = 2e-3 + REPORT_EVERY = 50 + MAX_STEPS = 2e2 + WARMUP_STEPS = 1e2 + DATA_SAVED_PATH = data_to_file + result_base_path = "./results" + + train_dataset, test_dataset = ExtSumProcessedData().splits(root=DATA_SAVED_PATH) + summarizer = ExtractiveSummarizer(MODEL_NAME, ENCODER, CACHE_DIR) + summarizer.fit( + train_dataset, + num_gpus=1, + batch_size=BATCH_SIZE, + gradient_accumulation_steps=2, + max_steps=MAX_STEPS, + lr=LEARNING_RATE, + warmup_steps=WARMUP_STEPS, + verbose=True, + report_every=REPORT_EVERY, + clip_grad_norm=False, + ) + + target = [i["tgt_txt"] for i in test_dataset] + prediction = summarizer.predict(test_dataset, num_gpus=NUM_GPUS, batch_size=128) + assert len(prediction) == 1 diff --git a/tests/unit/test_models_transformers_question_answering.py b/tests/unit/test_models_transformers_question_answering.py index 010bf5c5d..7f14f0d0e 100644 --- a/tests/unit/test_models_transformers_question_answering.py +++ b/tests/unit/test_models_transformers_question_answering.py @@ -1,18 +1,20 @@ # Copyright (c) Microsoft Corporation. All rights reserved. # Licensed under the MIT License. 
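The `fit` call in the summarization test above caps optimizer updates with `MAX_STEPS`; that step arithmetic matches the new `compute_training_steps` helper added to `utils_nlp/common/pytorch_utils.py` later in this diff. A minimal sketch of its behavior (toy loader; illustrative values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from utils_nlp.common.pytorch_utils import compute_training_steps

# 64 examples in batches of 8 -> len(loader) == 8.
loader = DataLoader(TensorDataset(torch.zeros(64, 4)), batch_size=8)

# With max_steps <= 0, the step count falls back to
# len(dataloader) // gradient_accumulation_steps * num_epochs.
assert compute_training_steps(loader, num_epochs=3, gradient_accumulation_steps=2) == 12

# A positive max_steps wins outright, as with MAX_STEPS = 2e2 in the test above.
assert compute_training_steps(loader, max_steps=200) == 200
```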
-import pytest import os + +import pytest +import torch + +from utils_nlp.common.pytorch_utils import dataloader_from_dataset from utils_nlp.models.transformers.datasets import QADataset from utils_nlp.models.transformers.question_answering import ( - QAProcessor, - AnswerExtractor, CACHED_EXAMPLES_TEST_FILE, CACHED_FEATURES_TEST_FILE, + AnswerExtractor, + QAProcessor, ) -import torch - NUM_GPUS = max(1, torch.cuda.device_count()) BATCH_SIZE = 8 @@ -109,9 +111,7 @@ def qa_test_data(qa_test_df, tmp_module): feature_cache_dir=tmp_module, ) - qa_processor_distilbert = QAProcessor( - model_name="distilbert-base-uncased", cache_dir=tmp_module - ) + qa_processor_distilbert = QAProcessor(model_name="distilbert-base-uncased", cache_dir=tmp_module) train_features_distilbert = qa_processor_distilbert.preprocess( train_dataset, batch_size=BATCH_SIZE, @@ -153,15 +153,9 @@ def qa_test_data(qa_test_df, tmp_module): def test_QAProcessor(qa_test_data, tmp_module): for model_name in ["bert-base-cased", "xlnet-base-cased", "distilbert-base-uncased"]: qa_processor = QAProcessor(model_name=model_name, cache_dir=tmp_module) - qa_processor.preprocess( - qa_test_data["train_dataset"], is_training=True, feature_cache_dir=tmp_module - ) - qa_processor.preprocess( - qa_test_data["train_dataset_list"], is_training=True, feature_cache_dir=tmp_module - ) - qa_processor.preprocess( - qa_test_data["test_dataset"], is_training=False, feature_cache_dir=tmp_module - ) + qa_processor.preprocess(qa_test_data["train_dataset"], is_training=True, feature_cache_dir=tmp_module) + qa_processor.preprocess(qa_test_data["train_dataset_list"], is_training=True, feature_cache_dir=tmp_module) + qa_processor.preprocess(qa_test_data["test_dataset"], is_training=False, feature_cache_dir=tmp_module) # test unsupported model type with pytest.raises(ValueError): @@ -169,51 +163,49 @@ def test_QAProcessor(qa_test_data, tmp_module): # test training data has no ground truth exception with pytest.raises(Exception): - qa_processor.preprocess( - qa_test_data["test_dataset"], is_training=True, feature_cache_dir=tmp_module - ) + qa_processor.preprocess(qa_test_data["test_dataset"], is_training=True, feature_cache_dir=tmp_module) # test when answer start is a list, but answer text is not with pytest.raises(Exception): qa_processor.preprocess( - qa_test_data["train_dataset_start_text_mismatch"], - is_training=True, - feature_cache_dir=tmp_module, + qa_test_data["train_dataset_start_text_mismatch"], is_training=True, feature_cache_dir=tmp_module, ) # test when training data has multiple answers with pytest.raises(Exception): qa_processor.preprocess( - qa_test_data["train_dataset_multi_answers"], - is_training=True, - feature_cache_dir=tmp_module, + qa_test_data["train_dataset_multi_answers"], is_training=True, feature_cache_dir=tmp_module, ) def test_AnswerExtractor(qa_test_data, tmp_module): - # test bert + # bert qa_extractor_bert = AnswerExtractor(cache_dir=tmp_module) - qa_extractor_bert.fit(qa_test_data["train_features_bert"], cache_model=True) + train_loader_bert = dataloader_from_dataset(qa_test_data["train_features_bert"]) + test_loader_bert = dataloader_from_dataset(qa_test_data["test_features_bert"], shuffle=False) + qa_extractor_bert.fit(train_loader_bert, verbose=False, cache_model=True) # test saving fine-tuned model model_output_dir = os.path.join(tmp_module, "fine_tuned") assert os.path.exists(os.path.join(model_output_dir, "pytorch_model.bin")) assert os.path.exists(os.path.join(model_output_dir, "config.json")) - qa_extractor_from_cache 
= AnswerExtractor(
-        cache_dir=tmp_module, load_model_from_dir=model_output_dir
-    )
-    qa_extractor_from_cache.predict(qa_test_data["test_features_bert"])
+    qa_extractor_from_cache = AnswerExtractor(cache_dir=tmp_module, load_model_from_dir=model_output_dir)
+    qa_extractor_from_cache.predict(test_loader_bert, verbose=False)

+    # xlnet
+    train_loader_xlnet = dataloader_from_dataset(qa_test_data["train_features_xlnet"])
+    test_loader_xlnet = dataloader_from_dataset(qa_test_data["test_features_xlnet"], shuffle=False)
     qa_extractor_xlnet = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp_module)
-    qa_extractor_xlnet.fit(qa_test_data["train_features_xlnet"], cache_model=False)
-    qa_extractor_xlnet.predict(qa_test_data["test_features_xlnet"])
+    qa_extractor_xlnet.fit(train_loader_xlnet, verbose=False, cache_model=False)
+    qa_extractor_xlnet.predict(test_loader_xlnet, verbose=False)

-    qa_extractor_distilbert = AnswerExtractor(
-        model_name="distilbert-base-uncased", cache_dir=tmp_module
-    )
-    qa_extractor_distilbert.fit(qa_test_data["train_features_distilbert"], cache_model=False)
-    qa_extractor_distilbert.predict(qa_test_data["test_features_distilbert"])
+    # distilbert
+    train_loader_distilbert = dataloader_from_dataset(qa_test_data["train_features_distilbert"])
+    test_loader_distilbert = dataloader_from_dataset(qa_test_data["test_features_distilbert"], shuffle=False)
+    qa_extractor_distilbert = AnswerExtractor(model_name="distilbert-base-uncased", cache_dir=tmp_module)
+    qa_extractor_distilbert.fit(train_loader_distilbert, verbose=False, cache_model=False)
+    qa_extractor_distilbert.predict(test_loader_distilbert, verbose=False)


 def test_postprocess_bert_answer(qa_test_data, tmp_module):
@@ -226,8 +218,9 @@ def test_postprocess_bert_answer(qa_test_data, tmp_module):
         doc_stride=32,
         feature_cache_dir=tmp_module,
     )
+    test_loader = dataloader_from_dataset(test_features, shuffle=False)
     qa_extractor = AnswerExtractor(cache_dir=tmp_module)
-    predictions = qa_extractor.predict(test_features)
+    predictions = qa_extractor.predict(test_loader)

     qa_processor.postprocess(
         results=predictions,
@@ -260,8 +253,9 @@ def test_postprocess_xlnet_answer(qa_test_data, tmp_module):
         doc_stride=32,
         feature_cache_dir=tmp_module,
     )
+    test_loader = dataloader_from_dataset(test_features, shuffle=False)
     qa_extractor = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp_module)
-    predictions = qa_extractor.predict(test_features)
+    predictions = qa_extractor.predict(test_loader)

     qa_processor.postprocess(
         results=predictions,
diff --git a/tests/unit/test_transformers_sequence_classification.py b/tests/unit/test_transformers_sequence_classification.py
old mode 100644
new mode 100755
index 156854200..3ffb1f8b9
--- a/tests/unit/test_transformers_sequence_classification.py
+++ b/tests/unit/test_transformers_sequence_classification.py
@@ -5,6 +5,7 @@
 import pandas as pd

 from utils_nlp.models.transformers.sequence_classification import SequenceClassifier, Processor
+from utils_nlp.common.pytorch_utils import dataloader_from_dataset


 @pytest.fixture()
@@ -19,12 +20,11 @@ def test_classifier(data, tmpdir):
     num_labels = len(pd.unique(data[1]))
     model_name = "bert-base-uncased"
     processor = Processor(model_name=model_name, cache_dir=tmpdir)
-    train_dataloader = processor.create_dataloader_from_df(
-        df, "text", "label", batch_size=2, num_gpus=0
-    )
+    ds = processor.dataset_from_dataframe(df, "text", "label")
+    dl = dataloader_from_dataset(ds, batch_size=2, num_gpus=0, shuffle=True)
     classifier = SequenceClassifier(model_name=model_name,
num_labels=num_labels, cache_dir=tmpdir) - classifier.fit(train_dataloader=train_dataloader, num_epochs=1, num_gpus=0, verbose=False) - preds = classifier.predict(train_dataloader, num_gpus=0, verbose=False) + classifier.fit(train_dataloader=dl, num_epochs=1, num_gpus=0, verbose=False) + preds = classifier.predict(dl, num_gpus=0, verbose=False) assert len(preds) == len(data[1]) @@ -35,17 +35,16 @@ def test_classifier_gpu_train_cpu_predict(data, tmpdir): num_labels = len(pd.unique(data[1])) model_name = "bert-base-uncased" processor = Processor(model_name=model_name, cache_dir=tmpdir) - train_dataloader = processor.create_dataloader_from_df( - df, "text", "label", batch_size=2, num_gpus=1 - ) + ds = processor.dataset_from_dataframe(df, "text", "label") + dl = dataloader_from_dataset(ds, batch_size=2, num_gpus=1, shuffle=True) classifier = SequenceClassifier(model_name=model_name, num_labels=num_labels, cache_dir=tmpdir) - classifier.fit(train_dataloader=train_dataloader, num_epochs=1, num_gpus=1, verbose=False) + classifier.fit(train_dataloader=dl, num_epochs=1, num_gpus=1, verbose=False) assert next(classifier.model.parameters()).is_cuda is True # gpu prediction, no model move - preds = classifier.predict(train_dataloader, num_gpus=1, verbose=False) + preds = classifier.predict(dl, num_gpus=1, verbose=False) assert len(preds) == len(data[1]) # cpu prediction, need model move assert next(classifier.model.parameters()).is_cuda is True - preds = classifier.predict(train_dataloader, num_gpus=0, verbose=False) + preds = classifier.predict(dl, num_gpus=0, verbose=False) assert next(classifier.model.parameters()).is_cuda is False diff --git a/tests/unit/test_transformers_token_classification.py b/tests/unit/test_transformers_token_classification.py new file mode 100644 index 000000000..e8c780f21 --- /dev/null +++ b/tests/unit/test_transformers_token_classification.py @@ -0,0 +1,23 @@ +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. 
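The sequence-classification tests above switch from `create_dataloader_from_df` to the two-step `dataset_from_dataframe` + `dataloader_from_dataset` pattern. For reference, a minimal end-to-end sketch of that pattern on CPU (toy data; the frame contents and cache path are illustrative):

```python
import pandas as pd

from utils_nlp.common.pytorch_utils import dataloader_from_dataset
from utils_nlp.models.transformers.sequence_classification import Processor, SequenceClassifier

# Tiny frame with the text/label columns the tests use.
df = pd.DataFrame({"text": ["the movie was great", "terrible service"], "label": [1, 0]})

processor = Processor(model_name="bert-base-uncased", cache_dir="./temp")
ds = processor.dataset_from_dataframe(df, "text", "label")
dl = dataloader_from_dataset(ds, batch_size=2, num_gpus=0, shuffle=True)

classifier = SequenceClassifier(model_name="bert-base-uncased", num_labels=2, cache_dir="./temp")
classifier.fit(train_dataloader=dl, num_epochs=1, num_gpus=0, verbose=False)
preds = classifier.predict(dl, num_gpus=0, verbose=False)
```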
+ +import pytest + +from utils_nlp.common.pytorch_utils import dataloader_from_dataset +from utils_nlp.models.transformers.named_entity_recognition import TokenClassificationProcessor, TokenClassifier + + +@pytest.mark.cpu +def test_token_classifier_fit_predict(tmpdir, ner_test_data): + token_classifier = TokenClassifier(model_name="bert-base-uncased", num_labels=6, cache_dir=tmpdir) + processor = TokenClassificationProcessor(model_name="bert-base-uncased", cache_dir=tmpdir) + + # test fit, no warmup + train_dataset = processor.preprocess_for_bert( + text=ner_test_data["INPUT_TEXT"], labels=ner_test_data["INPUT_LABELS"], label_map=ner_test_data["LABEL_MAP"], + ) + train_dataloader = dataloader_from_dataset(train_dataset) + token_classifier.fit(train_dataloader) + + # test predict, no labels + _ = token_classifier.predict(train_dataloader, verbose=False) diff --git a/tools/generate_conda_file.py b/tools/generate_conda_file.py index ab6dd8fdb..613f9d928 100644 --- a/tools/generate_conda_file.py +++ b/tools/generate_conda_file.py @@ -43,19 +43,16 @@ "pytest": "pytest>=3.6.4", "pytorch": "pytorch-cpu>=1.0.0", "scipy": "scipy>=1.0.0", - "tensorflow": "tensorflow==1.12.0", "h5py": "h5py>=2.8.0", - "tensorflow-hub": "tensorflow-hub==0.5.0", - "py-xgboost": "py-xgboost<=0.80", + "tensorflow": "tensorflow==1.15.0", + "tensorflow-hub": "tensorflow-hub==0.7.0", "dask": "dask[dataframe]==1.2.2", "papermill": "papermill>=1.0.1", } CONDA_GPU = { "numba": "numba>=0.38.1", - "pytorch": "pytorch>=1.0.0", - "tensorflow": "tensorflow-gpu==1.12.0", - "cudatoolkit": "cudatoolkit==9.2", + "cudatoolkit": "cudatoolkit==10.2.89", } PIP_BASE = { @@ -78,7 +75,7 @@ "scikit-learn": "scikit-learn>=0.19.0,<=0.20.3", "seaborn": "seaborn>=0.9.0", "sklearn-crfsuite": "sklearn-crfsuite>=0.3.6", - "spacy": "spacy>=2.1.4", + "spacy": "spacy==2.1.8", "spacy-models": ( "https://github.com/explosion/spacy-models/releases/download/" "en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz" @@ -87,9 +84,18 @@ "gensim": "gensim>=3.7.0", "nltk": "nltk>=3.4", "seqeval": "seqeval>=0.0.12", + "bertsum": "git+https://github.com/daden-ms/BertSum.git@030c139c97bc57d0c31f6515b8bf9649f999a443#egg=BertSum", + "pyrouge": "pyrouge>=0.1.3", + "py-rouge": "py-rouge>=1.1", + "indic-nlp-library": "indic-nlp-library>=0.6", + "torchtext": "torchtext>=0.4.0", + "multiprocess": "multiprocess==0.70.9", + "tensorboardX": "tensorboardX==1.8", } -PIP_GPU = {} +PIP_GPU = { + "torch": "torch==1.4.0", +} PIP_DARWIN = {} PIP_DARWIN_GPU = {} diff --git a/utils_nlp/common/pytorch_utils.py b/utils_nlp/common/pytorch_utils.py index ea09f8768..2badb45e5 100644 --- a/utils_nlp/common/pytorch_utils.py +++ b/utils_nlp/common/pytorch_utils.py @@ -1,11 +1,11 @@ # Copyright (c) Microsoft Corporation. All rights reserved. # Licensed under the MIT License. 
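The `pytorch_utils.py` hunk below reworks `move_to_device` into `move_model_to_device`. A minimal sketch of the contract the unit tests earlier in this diff encode (CPU returns the bare module; multiple GPUs yield a `DataParallel` wrapper):

```python
import torch
import torch.nn as nn

from utils_nlp.common.pytorch_utils import move_model_to_device

model = nn.Sequential(nn.Linear(24, 8), nn.ReLU(), nn.Linear(8, 2))

# CPU: the model comes back as a plain (unwrapped) module.
model_cpu = move_model_to_device(model, torch.device("cpu"))
assert isinstance(model_cpu, nn.Sequential)

# CUDA: with more than one visible GPU the model is wrapped in DataParallel;
# with a single GPU it stays a plain module.
if torch.cuda.device_count() > 1:
    model_gpu = move_model_to_device(model, torch.device("cuda"))
    assert isinstance(model_gpu, nn.DataParallel)
```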
-"""Common PyTorch utilities that facilitate building Pytorch models.""" +"""Common PyTorch utilities that facilitate building PyTorch models.""" import torch -import torch.nn as nn -import warnings +from torch.utils.data import DataLoader, RandomSampler, SequentialSampler +from torch.utils.data.distributed import DistributedSampler def get_device( @@ -17,11 +17,7 @@ def get_device( # init_method="file:///distributed", ): if local_rank == -1: - num_gpus = ( - min(num_gpus, torch.cuda.device_count()) - if num_gpus is not None - else torch.cuda.device_count() - ) + num_gpus = min(num_gpus, torch.cuda.device_count()) if num_gpus is not None else torch.cuda.device_count() device = torch.device("cuda" if torch.cuda.is_available() and num_gpus > 0 else "cpu") else: torch.cuda.set_device(local_rank) @@ -32,59 +28,109 @@ def get_device( return device, num_gpus -def move_to_device(model, device, num_gpus=None): +def move_model_to_device(model, device, num_gpus=None, gpu_ids=None, local_rank=-1): """Moves a model to the specified device (cpu or gpu/s) and implements data parallelism when multiple gpus are specified. Args: - model (Module): A PyTorch model - device (torch.device): A PyTorch device - num_gpus (int): The number of GPUs to be used. Defaults to None, - all gpus are used. + model (Module): A PyTorch model. + device (torch.device): A PyTorch device. + num_gpus (int): The number of GPUs to be used. + If set to None, all available GPUs will be used. + Defaults to None. + gpu_ids (list): List of GPU IDs to be used. + If None, the first num_gpus GPUs will be used. + If not None, overrides num_gpus. + Defaults to None. + local_rank (int): Local GPU ID within a node. Used in distributed environments. + If not -1, num_gpus and gpu_ids are ignored. + Defaults to -1. Returns: - Module, DataParallel: A PyTorch Module or - a DataParallel wrapper (when multiple gpus are used). + Module, DataParallel, DistributedDataParallel: A PyTorch Module or + a DataParallel/DistributedDataParallel wrapper (when multiple gpus are used). """ - if isinstance(model, nn.DataParallel): - model = model.module - if not isinstance(device, torch.device): raise ValueError("device must be of type torch.device.") - if device.type == "cuda": - model.to(device) # inplace - if num_gpus == 0: - raise ValueError("num_gpus must be non-zero when device.type is 'cuda'") - elif num_gpus == 1: - return model - else: - # parallelize + # unwrap model + if isinstance(model, torch.nn.DataParallel): + model = model.module + # wrap in DataParallel or DistributedDataParallel + if local_rank != -1: + self.model = torch.nn.parallel.DistributedDataParallel( + self.model, device_ids=[local_rank], output_device=local_rank, find_unused_parameters=True, + ) + else: + if device.type == "cuda": + if num_gpus is not None: + if num_gpus < 1: + raise ValueError("num_gpus must be at least 1 or None") num_cuda_devices = torch.cuda.device_count() if num_cuda_devices < 1: raise Exception("CUDA devices are not available.") - elif num_cuda_devices < 2: - print("Warning: Only 1 CUDA device is available. Data parallelism is not possible.") - return model - else: - if num_gpus is None: - # use all available devices - return nn.DataParallel(model, device_ids=None) - elif num_gpus > num_cuda_devices: - print( - "Warning: Only {0} devices are available. 
" - "Setting the number of gpus to {0}".format(num_cuda_devices) - ) - return nn.DataParallel(model, device_ids=None) - else: - return nn.DataParallel(model, device_ids=list(range(num_gpus))) - elif device.type == "cpu": - if num_gpus != 0 and num_gpus is not None: - warnings.warn("Device type is 'cpu'. num_gpus is ignored.") - return model.to(device) + if gpu_ids is None: + num_gpus = num_cuda_devices if num_gpus is None else min(num_gpus, num_cuda_devices) + gpu_ids = list(range(num_gpus)) + if len(gpu_ids) > 1: + model = torch.nn.DataParallel(model, device_ids=gpu_ids) + # move to device + return model.to(device) + +def dataloader_from_dataset(ds, batch_size=32, num_gpus=None, shuffle=False, distributed=False): + """Creates a PyTorch DataLoader given a Dataset object. + + Args: + ds (torch.utils.data.DataSet): A PyTorch dataset. + batch_size (int, optional): Batch size. + If more than 1 gpu is used, this would be the batch size per gpu. + Defaults to 32. + num_gpus (int, optional): The number of GPUs to be used. Defaults to None. + shuffle (bool, optional): If True, a RandomSampler is used. Defaults to False. + distributed (book, optional): If True, a DistributedSampler is used. Defaults to False. + + Returns: + Module, DataParallel: A PyTorch Module or + a DataParallel wrapper (when multiple gpus are used). + """ + if num_gpus is None: + num_gpus = torch.cuda.device_count() + + batch_size = batch_size * max(1, num_gpus) + + if distributed: + sampler = DistributedSampler(ds) else: - raise Exception( - "Device type '{}' not supported. Currently, only cpu " - "and cuda devices are supported.".format(device.type) - ) + sampler = RandomSampler(ds) if shuffle else SequentialSampler(ds) + + return DataLoader(ds, sampler=sampler, batch_size=batch_size) + + +def compute_training_steps(dataloader, num_epochs=1, max_steps=-1, gradient_accumulation_steps=1): + """Computes the max training steps given a dataloader. + + Args: + dataloader (Dataloader): A PyTorch DataLoader. + num_epochs (int, optional): Number of training epochs. Defaults to 1. + max_steps (int, optional): Total number of training steps. + If set to a positive value, it overrides num_epochs. + Otherwise, it's determined by the dataset length, gradient_accumulation_steps, and num_epochs. + Defualts to -1. + gradient_accumulation_steps (int, optional): Number of steps to accumulate + before performing a backward/update pass. + Default to 1. + + Returns: + int: The max number of steps to be used in a training loop. 
+ """ + try: + dataset_length = len(dataloader) + except Exception: + dataset_length = -1 + if max_steps <= 0: + if dataset_length != -1 and num_epochs > 0: + max_steps = dataset_length // gradient_accumulation_steps * num_epochs + if max_steps <= 0: + raise Exception("Max steps cannot be determined.") + return max_steps diff --git a/utils_nlp/dataset/README.md b/utils_nlp/dataset/README.md index 40db2ffb6..bfbb040c9 100644 --- a/utils_nlp/dataset/README.md +++ b/utils_nlp/dataset/README.md @@ -20,6 +20,8 @@ df = load_pandas_df(DATA_FOLDER, file_split ="train", nrows = 1000) |[The Cross-Lingual NLI (XNLI) Corpus](https://www.nyu.edu/projects/bowman/xnli/)|[xnli.py](./xnli.py)| |[The STSbenchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark)|[stsbenchmark.py](./stsbenchmark.py)| |[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/)|[squad.py](./squad.py)| +|[CNN/Daily Mail(CNN/DM) Dataset](https://github.com/harvardnlp/sent-summary)|[cnndm.py](./cnndm.py)| +|[Preprocessed CNN/Daily Mail(CNN/DM) Dataset for Extractive Summarization](https://github.com/nlpyang/BertSum)|[cnndm.py](./cnndm.py)| ## Dataset References Please see [Dataset References](../../DatasetReferences.md) for notice and information regarding datasets used. diff --git a/utils_nlp/dataset/bbc_hindi.py b/utils_nlp/dataset/bbc_hindi.py index c8212cd63..c24710680 100644 --- a/utils_nlp/dataset/bbc_hindi.py +++ b/utils_nlp/dataset/bbc_hindi.py @@ -7,24 +7,21 @@ https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1 """ -import os -import pandas as pd import logging -import numpy as np +import os import tarfile - from tempfile import TemporaryDirectory + +import pandas as pd +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import LabelEncoder + +from utils_nlp.common.pytorch_utils import dataloader_from_dataset from utils_nlp.dataset.url_utils import maybe_download from utils_nlp.models.transformers.common import MAX_SEQ_LEN from utils_nlp.models.transformers.sequence_classification import Processor -from sklearn.preprocessing import LabelEncoder -from sklearn.model_selection import train_test_split - -URL = ( - "https://github.com/NirantK/hindi2vec/releases/" - "download/bbc-hindi-v0.1/bbc-hindiv01.tar.gz" -) +URL = "https://github.com/NirantK/hindi2vec/releases/" "download/bbc-hindi-v0.1/bbc-hindiv01.tar.gz" def load_pandas_df(local_cache_path=TemporaryDirectory().name): @@ -49,19 +46,9 @@ def load_pandas_df(local_cache_path=TemporaryDirectory().name): train_csv_file_path = os.path.join(local_cache_path, "hindi-train.csv") test_csv_file_path = os.path.join(local_cache_path, "hindi-test.csv") - train_df = pd.read_csv( - train_csv_file_path, - sep="\t", - encoding='utf-8', - header=None - ) + train_df = pd.read_csv(train_csv_file_path, sep="\t", encoding="utf-8", header=None) - test_df = pd.read_csv( - test_csv_file_path, - sep="\t", - encoding='utf-8', - header=None - ) + test_df = pd.read_csv(test_csv_file_path, sep="\t", encoding="utf-8", header=None) train_df = train_df.fillna("") test_df = test_df.fillna("") @@ -80,7 +67,7 @@ def load_tc_dataset( cache_dir=TemporaryDirectory().name, max_len=MAX_SEQ_LEN, batch_size=32, - num_gpus=None + num_gpus=None, ): """ Load the multinli dataset and split into training and testing datasets. @@ -105,7 +92,7 @@ def load_tc_dataset( cache_dir (str, optional): The default folder for saving cache files. Defaults to TemporaryDirectory().name. 
max_len (int, optional): Maximum length of the list of tokens. Lists longer - than this are truncated and shorter ones are padded with "O"s. + than this are truncated and shorter ones are padded with "O"s. Default value is BERT_MAX_LEN=512. batch_size (int, optional): The batch size for training and testing. Defaults to 32. @@ -114,15 +101,15 @@ def load_tc_dataset( Returns: tuple. The tuple contains four elements: - train_dataload (DataLoader): a PyTorch DataLoader instance for training. + train_dataloader (DataLoader): a PyTorch DataLoader instance for training. + + test_dataloader (DataLoader): a PyTorch DataLoader instance for testing. - test_dataload (DataLoader): a PyTorch DataLoader instance for testing. - label_encoder (LabelEncoder): a sklearn LabelEncoder instance. The label values can be retrieved by calling the `inverse_transform` function. - + test_labels (Series): a Pandas Series of testing label (in label ID format). If - the labels are in raw label values format, we will need to transform it to + the labels are in raw label values format, we will need to transform it to label IDs by using the label_encoder.transform function. """ @@ -140,12 +127,8 @@ def load_tc_dataset( if test_fraction < 0 or test_fraction >= 1.0: logging.warning("Invalid test fraction value: {}, changed to 0.25".format(test_fraction)) test_fraction = 0.25 - - train_df, test_df = train_test_split( - all_df, - train_size=(1.0 - test_fraction), - random_state=random_seed - ) + + train_df, test_df = train_test_split(all_df, train_size=(1.0 - test_fraction), random_state=random_seed) if train_sample_ratio > 1.0: train_sample_ratio = 1.0 @@ -153,7 +136,7 @@ def load_tc_dataset( elif train_sample_ratio < 0: logging.error("Invalid training sample ration: {}".format(train_sample_ratio)) raise ValueError("Invalid training sample ration: {}".format(train_sample_ratio)) - + if test_sample_ratio > 1.0: test_sample_ratio = 1.0 logging.warning("Setting the testing sample ratio to 1.0") @@ -171,42 +154,24 @@ def load_tc_dataset( test_labels = label_encoder.transform(test_df[label_col]) test_df[label_col] = test_labels - processor = Processor( - model_name=model_name, - to_lower=to_lower, - cache_dir=cache_dir - ) + processor = Processor(model_name=model_name, to_lower=to_lower, cache_dir=cache_dir) - train_dataloader = processor.create_dataloader_from_df( - df=train_df, - text_col=text_col, - label_col=label_col, - max_len=max_len, - text2_col=None, - batch_size=batch_size, - num_gpus=num_gpus, - shuffle=True, - distributed=False + train_dataset = processor.dataset_from_dataframe( + df=train_df, text_col=text_col, label_col=label_col, max_len=max_len, ) + train_dataloader = dataloader_from_dataset(train_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=True) - test_dataloader = processor.create_dataloader_from_df( - df=test_df, - text_col=text_col, - label_col=label_col, - max_len=max_len, - text2_col=None, - batch_size=batch_size, - num_gpus=num_gpus, - shuffle=False, - distributed=False + test_dataset = processor.dataset_from_dataframe( + df=test_df, text_col=text_col, label_col=label_col, max_len=max_len, ) + test_dataloader = dataloader_from_dataset(test_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=False) return (train_dataloader, test_dataloader, label_encoder, test_labels) def get_label_values(label_encoder, label_ids): """ - Get the label values from label IDs. + Get the label values from label IDs. 
Args: label_encoder (LabelEncoder): a fitted sklearn LabelEncoder instance diff --git a/utils_nlp/dataset/cnndm.py b/utils_nlp/dataset/cnndm.py new file mode 100644 index 000000000..99d486671 --- /dev/null +++ b/utils_nlp/dataset/cnndm.py @@ -0,0 +1,120 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +# This script reuses some code from https://github.com/nlpyang/BertSum + +""" + Utility functions for downloading, extracting, and reading the + CNN/DM dataset at https://github.com/harvardnlp/sent-summary. + +""" + +import glob +import nltk + +nltk.download("punkt") +from nltk import tokenize +import os +import sys +import regex as re +import torch +import torchtext +from torchtext.utils import download_from_url, extract_archive +import zipfile + + +from utils_nlp.dataset.url_utils import maybe_download +from utils_nlp.models.transformers.datasets import SummarizationDataset +from utils_nlp.models.transformers.extractive_summarization import get_dataset, get_dataloader + + + + + +def CNNDMSummarizationDataset(*args, **kwargs): + """Load the CNN/Daily Mail dataset preprocessed by harvardnlp group.""" + + REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}", + "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'} + + + def _clean(x): + return re.sub( + r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''", + lambda m: REMAP.get(m.group()), x) + + + def _remove_ttags(line): + line = re.sub(r" ", "", line) + # change to+ # pyrouge test requiresas sentence splitter + line = re.sub(r"", "", line) + return line + + + def _target_sentence_tokenization(line): + return line.split("") + + URLS = ["https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz"] + + + def _setup_datasets(url, top_n=-1, local_cache_path=".data"): + FILE_NAME = "cnndm.tar.gz" + maybe_download(url, FILE_NAME, local_cache_path) + dataset_tar = os.path.join(local_cache_path, FILE_NAME) + extracted_files = extract_archive(dataset_tar) + for fname in extracted_files: + if fname.endswith("train.txt.src"): + train_source_file = fname + if fname.endswith("train.txt.tgt.tagged"): + train_target_file = fname + if fname.endswith("test.txt.src"): + test_source_file = fname + if fname.endswith("test.txt.tgt.tagged"): + test_target_file = fname + + return ( + SummarizationDataset( + train_source_file, + train_target_file, + [_clean, tokenize.sent_tokenize], + [_clean, _remove_ttags, _target_sentence_tokenization], + nltk.word_tokenize, + top_n, + ), + SummarizationDataset( + test_source_file, + test_target_file, + [_clean, tokenize.sent_tokenize], + [_clean, _remove_ttags, _target_sentence_tokenization], + nltk.word_tokenize, + top_n, + ), + ) + + return _setup_datasets(*((URLS[0],) + args), **kwargs) + + +class CNNDMBertSumProcessedData: + """Class to load dataset preprocessed by BertSum paper at + https://github.com/nlpyang/BertSum + """ + + @staticmethod + def download(local_path=".data"): + file_name = "bertsum_data.zip" + url = "https://drive.google.com/uc?export=download&id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6" + try: + if os.path.exists(os.path.join(local_path, file_name)): + downloaded_zipfile = zipfile.ZipFile(os.path.join(local_path, file_name)) + else: + dataset_zip = download_from_url(url, root=local_path) + downloaded_zipfile = zipfile.ZipFile(dataset_zip) + except: + print("Unexpected dataset downloading or reading error:", sys.exc_info()[0]) + raise + + downloaded_zipfile.extractall(local_path) + return local_path + + diff --git a/utils_nlp/dataset/dac.py b/utils_nlp/dataset/dac.py index 
c692dfb56..c8af1ad87 100644 --- a/utils_nlp/dataset/dac.py +++ b/utils_nlp/dataset/dac.py @@ -8,18 +8,18 @@ arabic-text-classification-using-deep-learning-technics/") """ +import logging import os +from tempfile import TemporaryDirectory + import pandas as pd -import logging -import numpy as np +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import LabelEncoder -from tempfile import TemporaryDirectory +from utils_nlp.common.pytorch_utils import dataloader_from_dataset from utils_nlp.dataset.url_utils import extract_zip, maybe_download from utils_nlp.models.transformers.common import MAX_SEQ_LEN from utils_nlp.models.transformers.sequence_classification import Processor -from sklearn.model_selection import train_test_split -from sklearn.preprocessing import LabelEncoder - URL = ( "https://data.mendeley.com/datasets/v524p5dhpj/2" @@ -58,7 +58,7 @@ def load_tc_dataset( cache_dir=TemporaryDirectory().name, max_len=MAX_SEQ_LEN, batch_size=32, - num_gpus=None + num_gpus=None, ): """ Load the multinli dataset and split into training and testing datasets. @@ -92,9 +92,9 @@ def load_tc_dataset( Returns: tuple. The tuple contains four elements: - train_dataload (DataLoader): a PyTorch DataLoader instance for training. + train_dataloader (DataLoader): a PyTorch DataLoader instance for training. - test_dataload (DataLoader): a PyTorch DataLoader instance for testing. + test_dataloader (DataLoader): a PyTorch DataLoader instance for testing. label_encoder (LabelEncoder): a sklearn LabelEncoder instance. The label values can be retrieved by calling the `inverse_transform` function. @@ -104,11 +104,8 @@ def load_tc_dataset( label IDs by using the label_encoder.transform function. """ - # download and load the original dataset - all_df = load_pandas_df( - local_cache_path=local_path, - num_rows=None - ) + # download and load the original dataset + all_df = load_pandas_df(local_cache_path=local_path, num_rows=None) # set the text and label columns text_col = all_df.columns[0] @@ -123,12 +120,8 @@ def load_tc_dataset( if test_fraction < 0 or test_fraction >= 1.0: logging.warning("Invalid test fraction value: {}, changed to 0.25".format(test_fraction)) test_fraction = 0.25 - - train_df, test_df = train_test_split( - all_df, - train_size=(1.0 - test_fraction), - random_state=random_seed - ) + + train_df, test_df = train_test_split(all_df, train_size=(1.0 - test_fraction), random_state=random_seed) if train_sample_ratio > 1.0: train_sample_ratio = 1.0 @@ -136,7 +129,7 @@ def load_tc_dataset( elif train_sample_ratio < 0: logging.error("Invalid training sample ration: {}".format(train_sample_ratio)) raise ValueError("Invalid training sample ration: {}".format(train_sample_ratio)) - + if test_sample_ratio > 1.0: test_sample_ratio = 1.0 logging.warning("Setting the testing sample ratio to 1.0") @@ -149,35 +142,17 @@ def load_tc_dataset( if test_sample_ratio < 1.0: test_df = test_df.sample(frac=test_sample_ratio).reset_index(drop=True) - processor = Processor( - model_name=model_name, - to_lower=to_lower, - cache_dir=cache_dir - ) + processor = Processor(model_name=model_name, to_lower=to_lower, cache_dir=cache_dir) - train_dataloader = processor.create_dataloader_from_df( - df=train_df, - text_col=text_col, - label_col=label_col, - max_len=max_len, - text2_col=None, - batch_size=batch_size, - num_gpus=num_gpus, - shuffle=True, - distributed=False + train_dataset = processor.dataset_from_dataframe( + df=train_df, text_col=text_col, label_col=label_col, max_len=max_len, ) + 
train_dataloader = dataloader_from_dataset(train_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=True) - test_dataloader = processor.create_dataloader_from_df( - df=test_df, - text_col=text_col, - label_col=label_col, - max_len=max_len, - text2_col=None, - batch_size=batch_size, - num_gpus=num_gpus, - shuffle=False, - distributed=False + test_dataset = processor.dataset_from_dataframe( + df=test_df, text_col=text_col, label_col=label_col, max_len=max_len, ) + test_dataloader = dataloader_from_dataset(test_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=False) # the DAC dataset already converted the labels to label ID format test_labels = test_df[label_col] diff --git a/utils_nlp/dataset/multinli.py b/utils_nlp/dataset/multinli.py index 62b772cd1..adab4c925 100644 --- a/utils_nlp/dataset/multinli.py +++ b/utils_nlp/dataset/multinli.py @@ -7,18 +7,19 @@ https://www.nyu.edu/projects/bowman/multinli/ """ +import logging import os +from tempfile import TemporaryDirectory import pandas as pd -import logging +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import LabelEncoder -from tempfile import TemporaryDirectory +from utils_nlp.common.pytorch_utils import dataloader_from_dataset from utils_nlp.dataset.data_loaders import DaskJSONLoader from utils_nlp.dataset.url_utils import extract_zip, maybe_download from utils_nlp.models.transformers.common import MAX_SEQ_LEN from utils_nlp.models.transformers.sequence_classification import Processor -from sklearn.model_selection import train_test_split -from sklearn.preprocessing import LabelEncoder URL = "http://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip" DATA_FILES = { @@ -63,9 +64,7 @@ def load_pandas_df(local_cache_path=".", file_split="train"): return pd.read_json(os.path.join(local_cache_path, DATA_FILES[file_split]), lines=True) -def get_generator( - local_cache_path=".", file_split="train", block_size=10e6, batch_size=10e6, num_batches=None -): +def get_generator(local_cache_path=".", file_split="train", block_size=10e6, batch_size=10e6, num_batches=None): """ Returns an extracted dataset as a random batch generator that yields pandas dataframes. Args: @@ -85,9 +84,7 @@ def get_generator( except Exception as e: raise e - loader = DaskJSONLoader( - os.path.join(local_cache_path, DATA_FILES[file_split]), block_size=block_size - ) + loader = DaskJSONLoader(os.path.join(local_cache_path, DATA_FILES[file_split]), block_size=block_size) return loader.get_sequential_batches(batch_size=int(batch_size), num_batches=num_batches) @@ -103,7 +100,7 @@ def load_tc_dataset( cache_dir=TemporaryDirectory().name, max_len=MAX_SEQ_LEN, batch_size=32, - num_gpus=None + num_gpus=None, ): """ Load the multinli dataset and split into training and testing datasets. @@ -137,9 +134,9 @@ def load_tc_dataset( Returns: tuple. The tuple contains four elements: - train_dataload (DataLoader): a PyTorch DataLoader instance for training. + train_dataloader (DataLoader): a PyTorch DataLoader instance for training. - test_dataload (DataLoader): a PyTorch DataLoader instance for testing. + test_dataloader (DataLoader): a PyTorch DataLoader instance for testing. label_encoder (LabelEncoder): a sklearn LabelEncoder instance. The label values can be retrieved by calling the `inverse_transform` function. 
@@ -150,10 +147,7 @@ def load_tc_dataset(
     """

     # download and load the original dataset
-    all_df = load_pandas_df(
-        local_cache_path=local_path,
-        file_split="train"
-    )
+    all_df = load_pandas_df(local_cache_path=local_path, file_split="train")

     # select the examples corresponding to one of the entailment labels (neutral
     # in this case) to avoid duplicate rows, as the sentences are not unique,
@@ -169,12 +163,8 @@ def load_tc_dataset(
     if test_fraction < 0 or test_fraction >= 1.0:
         logging.warning("Invalid test fraction value: {}, changed to 0.25".format(test_fraction))
         test_fraction = 0.25
-
-    train_df, test_df = train_test_split(
-        all_df,
-        train_size=(1.0 - test_fraction),
-        random_state=random_seed
-    )
+
+    train_df, test_df = train_test_split(all_df, train_size=(1.0 - test_fraction), random_state=random_seed)

     if train_sample_ratio > 1.0:
         train_sample_ratio = 1.0
@@ -182,7 +172,7 @@ def load_tc_dataset(
     elif train_sample_ratio < 0:
         logging.error("Invalid training sample ration: {}".format(train_sample_ratio))
         raise ValueError("Invalid training sample ration: {}".format(train_sample_ratio))
-
+
     if test_sample_ratio > 1.0:
         test_sample_ratio = 1.0
         logging.warning("Setting the testing sample ratio to 1.0")
@@ -200,35 +190,17 @@ def load_tc_dataset(
     test_labels = label_encoder.transform(test_df[label_col])
     test_df[label_col] = test_labels

-    processor = Processor(
-        model_name=model_name,
-        to_lower=to_lower,
-        cache_dir=cache_dir
-    )
+    processor = Processor(model_name=model_name, to_lower=to_lower, cache_dir=cache_dir)

-    train_dataloader = processor.create_dataloader_from_df(
-        df=train_df,
-        text_col=text_col,
-        label_col=label_col,
-        max_len=max_len,
-        text2_col=None,
-        batch_size=batch_size,
-        num_gpus=num_gpus,
-        shuffle=True,
-        distributed=False
+    train_dataset = processor.dataset_from_dataframe(
+        df=train_df, text_col=text_col, label_col=label_col, max_len=max_len,
     )
+    train_dataloader = dataloader_from_dataset(train_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=True)

-    test_dataloader = processor.create_dataloader_from_df(
-        df=test_df,
-        text_col=text_col,
-        label_col=label_col,
-        max_len=max_len,
-        text2_col=None,
-        batch_size=batch_size,
-        num_gpus=num_gpus,
-        shuffle=False,
-        distributed=False
+    test_dataset = processor.dataset_from_dataframe(
+        df=test_df, text_col=text_col, label_col=label_col, max_len=max_len,
     )
+    test_dataloader = dataloader_from_dataset(test_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=False)

     return (train_dataloader, test_dataloader, label_encoder, test_labels)
diff --git a/utils_nlp/dataset/sentence_selection.py b/utils_nlp/dataset/sentence_selection.py
new file mode 100644
index 000000000..47ca22aeb
--- /dev/null
+++ b/utils_nlp/dataset/sentence_selection.py
@@ -0,0 +1,131 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+# This script reuses some code from https://github.com/nlpyang/BertSum
+
+
+import itertools
+import re
+
+
+def _get_ngrams(n, text):
+    """Calculates n-grams.
+    Args:
+        n: which n-grams to calculate
+        text: An array of tokens
+    Returns:
+        A set of n-grams
+    """
+    ngram_set = set()
+    text_length = len(text)
+    max_index_ngram_start = text_length - n
+    for i in range(max_index_ngram_start + 1):
+        ngram_set.add(tuple(text[i:i + n]))
+    return ngram_set
+
+
+def _get_word_ngrams(n, sentences):
+    """Calculates word n-grams for multiple sentences.
+ """ + assert len(sentences) > 0 + assert n > 0 + + # words = _split_into_words(sentences) + + words = sum(sentences, []) + # words = [w for w in words if w not in stopwords] + return _get_ngrams(n, words) + + +def cal_rouge(evaluated_ngrams, reference_ngrams): + reference_count = len(reference_ngrams) + evaluated_count = len(evaluated_ngrams) + + overlapping_ngrams = evaluated_ngrams.intersection(reference_ngrams) + overlapping_count = len(overlapping_ngrams) + + if evaluated_count == 0: + precision = 0.0 + else: + precision = overlapping_count / evaluated_count + + if reference_count == 0: + recall = 0.0 + else: + recall = overlapping_count / reference_count + + f1_score = 2.0 * ((precision * recall) / (precision + recall + 1e-8)) + return {"f": f1_score, "p": precision, "r": recall} + + +def combination_selection(doc_sent_list, abstract_sent_list, summary_size): + def _rouge_clean(s): + return re.sub(r'[^a-zA-Z0-9 ]', '', s) + + max_rouge = 0.0 + max_idx = (0, 0) + abstract = sum(abstract_sent_list, []) + abstract = _rouge_clean(' '.join(abstract)).split() + sents = [_rouge_clean(' '.join(s)).split() for s in doc_sent_list] + evaluated_1grams = [_get_word_ngrams(1, [sent]) for sent in sents] + reference_1grams = _get_word_ngrams(1, [abstract]) + evaluated_2grams = [_get_word_ngrams(2, [sent]) for sent in sents] + reference_2grams = _get_word_ngrams(2, [abstract]) + + impossible_sents = [] + for s in range(summary_size + 1): + combinations = itertools.combinations([i for i in range(len(sents)) if i not in impossible_sents], s + 1) + for c in combinations: + candidates_1 = [evaluated_1grams[idx] for idx in c] + candidates_1 = set.union(*map(set, candidates_1)) + candidates_2 = [evaluated_2grams[idx] for idx in c] + candidates_2 = set.union(*map(set, candidates_2)) + rouge_1 = cal_rouge(candidates_1, reference_1grams)['f'] + rouge_2 = cal_rouge(candidates_2, reference_2grams)['f'] + + rouge_score = rouge_1 + rouge_2 + if (s == 0 and rouge_score == 0): + impossible_sents.append(c[0]) + if rouge_score > max_rouge: + max_idx = c + max_rouge = rouge_score + return sorted(list(max_idx)) + + +def greedy_selection(doc_sent_list, abstract_sent_list, summary_size): + def _rouge_clean(s): + return re.sub(r'[^a-zA-Z0-9 ]', '', s) + + max_rouge = 0.0 + abstract = sum(abstract_sent_list, []) + abstract = _rouge_clean(' '.join(abstract)).split() + sents = [_rouge_clean(' '.join(s)).split() for s in doc_sent_list] + evaluated_1grams = [_get_word_ngrams(1, [sent]) for sent in sents] + reference_1grams = _get_word_ngrams(1, [abstract]) + evaluated_2grams = [_get_word_ngrams(2, [sent]) for sent in sents] + reference_2grams = _get_word_ngrams(2, [abstract]) + + selected = [] + for s in range(summary_size): + cur_max_rouge = max_rouge + cur_id = -1 + for i in range(len(sents)): + if (i in selected): + continue + c = selected + [i] + candidates_1 = [evaluated_1grams[idx] for idx in c] + candidates_1 = set.union(*map(set, candidates_1)) + candidates_2 = [evaluated_2grams[idx] for idx in c] + candidates_2 = set.union(*map(set, candidates_2)) + rouge_1 = cal_rouge(candidates_1, reference_1grams)['f'] + rouge_2 = cal_rouge(candidates_2, reference_2grams)['f'] + rouge_score = rouge_1 + rouge_2 + if rouge_score > cur_max_rouge: + cur_max_rouge = rouge_score + cur_id = i + if (cur_id == -1): + return selected + selected.append(cur_id) + max_rouge = cur_max_rouge + + return sorted(selected) diff --git a/utils_nlp/dataset/wikigold.py b/utils_nlp/dataset/wikigold.py index 4713451fb..32a0c5420 100644 --- 
a/utils_nlp/dataset/wikigold.py +++ b/utils_nlp/dataset/wikigold.py @@ -7,18 +7,19 @@ https://github.com/juand-r/entity-recognition-datasets/tree/master/data/wikigold/CONLL-format/data """ -import random +import logging import os +import random +from tempfile import TemporaryDirectory + import pandas as pd -import logging -from tempfile import TemporaryDirectory -from utils_nlp.dataset.url_utils import maybe_download +from utils_nlp.common.pytorch_utils import dataloader_from_dataset from utils_nlp.dataset.ner_utils import preprocess_conll +from utils_nlp.dataset.url_utils import maybe_download from utils_nlp.models.transformers.common import MAX_SEQ_LEN from utils_nlp.models.transformers.named_entity_recognition import TokenClassificationProcessor - URL = ( "https://raw.githubusercontent.com/juand-r/entity-recognition-datasets" "/master/data/wikigold/CONLL-format/data/wikigold.conll.txt" @@ -91,7 +92,7 @@ def load_dataset( max_len=MAX_SEQ_LEN, trailing_piece_tag="X", batch_size=32, - num_gpus=None + num_gpus=None, ): """ Load the wikigold dataset and split into training and testing datasets. @@ -116,7 +117,7 @@ def load_dataset( cache_dir (str, optional): The default folder for saving cache files. Defaults to './temp'. max_len (int, optional): Maximum length of the list of tokens. Lists longer - than this are truncated and shorter ones are padded with "O"s. + than this are truncated and shorter ones are padded with "O"s. Default value is BERT_MAX_LEN=512. trailing_piece_tag (str, optional): Tag used to label trailing word pieces. For example, "criticize" is broken into "critic" and "##ize", "critic" @@ -129,16 +130,12 @@ def load_dataset( Returns: tuple. The tuple contains four elements. - train_dataload (DataLoader): a PyTorch DataLoader instance for training. - - test_dataload (DataLoader): a PyTorch DataLoader instance for testing. - - label_map (dict): A dictionary object to map a label (str) to an ID (int). - + train_dataloader (DataLoader): a PyTorch DataLoader instance for training. + test_dataloader (DataLoader): a PyTorch DataLoader instance for testing. + label_map (dict): A dictionary object to map a label (str) to an ID (int). test_dataset (TensorDataset): A TensorDataset containing the following four tensors. 1. input_ids_all: Tensor. Each sublist contains numerical values, - i.e. token ids, corresponding to the tokens in the input - text data. + i.e. token ids, corresponding to the tokens in the input text data. 2. input_mask_all: Tensor. Each sublist contains the attention mask of the input token id list, 1 for input tokens and 0 for padded tokens, so that padded tokens are not attended to. 
@@ -155,9 +152,7 @@ def load_dataset( """ train_df, test_df = load_train_test_dfs( - local_cache_path=local_path, - test_fraction=test_fraction, - random_seed=random_seed + local_cache_path=local_path, test_fraction=test_fraction, random_seed=random_seed ) if train_sample_ratio > 1.0: @@ -166,7 +161,7 @@ def load_dataset( elif train_sample_ratio < 0: logging.error("Invalid training sample ration: {}".format(train_sample_ratio)) raise ValueError("Invalid training sample ration: {}".format(train_sample_ratio)) - + if test_sample_ratio > 1.0: test_sample_ratio = 1.0 logging.warning("Setting the testing sample ratio to 1.0") @@ -179,47 +174,34 @@ def load_dataset( if test_sample_ratio < 1.0: test_df = test_df.sample(frac=test_sample_ratio).reset_index(drop=True) - processor = TokenClassificationProcessor( - model_name=model_name, - to_lower=to_lower, - cache_dir=cache_dir - ) + processor = TokenClassificationProcessor(model_name=model_name, to_lower=to_lower, cache_dir=cache_dir) label_map = TokenClassificationProcessor.create_label_map( - label_lists=train_df['labels'], - trailing_piece_tag=trailing_piece_tag + label_lists=train_df["labels"], trailing_piece_tag=trailing_piece_tag ) train_dataset = processor.preprocess_for_bert( - text=train_df['sentence'], + text=train_df["sentence"], max_len=max_len, - labels=train_df['labels'], + labels=train_df["labels"], label_map=label_map, - trailing_piece_tag=trailing_piece_tag + trailing_piece_tag=trailing_piece_tag, ) test_dataset = processor.preprocess_for_bert( - text=test_df['sentence'], + text=test_df["sentence"], max_len=max_len, - labels=test_df['labels'], + labels=test_df["labels"], label_map=label_map, - trailing_piece_tag=trailing_piece_tag + trailing_piece_tag=trailing_piece_tag, ) - train_dataloader = processor.create_dataloader_from_dataset( - train_dataset, - shuffle=True, - batch_size=batch_size, - num_gpus=num_gpus, - distributed=False + train_dataloader = dataloader_from_dataset( + train_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=True, distributed=False ) - test_dataloader = processor.create_dataloader_from_dataset( - test_dataset, - shuffle=False, - batch_size=batch_size, - num_gpus=num_gpus, - distributed=False + test_dataloader = dataloader_from_dataset( + test_dataset, batch_size=batch_size, num_gpus=num_gpus, shuffle=False, distributed=False ) return (train_dataloader, test_dataloader, label_map, test_dataset) diff --git a/utils_nlp/eval/__init__.py b/utils_nlp/eval/__init__.py index e69de29bb..484b96e59 100644 --- a/utils_nlp/eval/__init__.py +++ b/utils_nlp/eval/__init__.py @@ -0,0 +1 @@ +from .rouge.compute_rouge import compute_rouge_perl, compute_rouge_python diff --git a/utils_nlp/eval/evaluate_summarization.py b/utils_nlp/eval/evaluate_summarization.py new file mode 100644 index 000000000..421d48f13 --- /dev/null +++ b/utils_nlp/eval/evaluate_summarization.py @@ -0,0 +1,40 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +import os +from random import random, seed + +from bertsum.others.utils import test_rouge + + +def get_rouge(predictions, targets, temp_dir, random_seed=42): + """ + function to get the rouge metric for the prediction and the reference. + + Args: + predictions (list of strings): Predictions to be compared. + target (list of strings): References + temp_dir (str): Path where temporary folders are created to host the files + generated by ROUGE application. + seed (int, optional): Random seed. Defaults to 42. 
+
+    Returns:
+        dictionary: rouge metric
+
+    """
+
+    def _write_list_to_file(list_items, filename):
+        with open(filename, "w") as filehandle:
+            for item in list_items:
+                filehandle.write("%s\n" % item)
+
+    seed(random_seed)
+    random_number = random()
+    os.makedirs(temp_dir, exist_ok=True)
+    candidate_path = os.path.join(temp_dir, "candidate" + str(random_number))
+    gold_path = os.path.join(temp_dir, "gold" + str(random_number))
+    _write_list_to_file(predictions, candidate_path)
+    _write_list_to_file(targets, gold_path)
+    rouge = test_rouge(temp_dir, candidate_path, gold_path)
+    return rouge
diff --git a/utils_nlp/eval/rouge/compute_rouge.py b/utils_nlp/eval/rouge/compute_rouge.py
new file mode 100644
index 000000000..12830a94b
--- /dev/null
+++ b/utils_nlp/eval/rouge/compute_rouge.py
@@ -0,0 +1,135 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+import os
+import shutil
+import time
+import tempfile
+
+from pyrouge import Rouge155
+from rouge import Rouge
+from .rouge_ext import RougeExt
+
+
+def compute_rouge_perl(cand, ref, is_input_files=False, verbose=False):
+    """
+    Computes ROUGE scores using the python wrapper
+    (https://github.com/bheinzerling/pyrouge) of the perl ROUGE package.
+
+    Args:
+        cand (list or str): If `is_input_files` is `False`, `cand` is a list of strings
+            containing predicted summaries. If `is_input_files` is `True`, `cand` is the path
+            to the file containing the predicted summaries.
+        ref (list or str): If `is_input_files` is `False`, `ref` is a list of strings
+            containing reference summaries. If `is_input_files` is `True`, `ref` is the path
+            to the file containing the reference summaries.
+        is_input_files (bool, optional): If True, inputs are file names. Otherwise, inputs are lists
+            of predicted and reference summaries. Defaults to False.
+        verbose (bool, optional): If True, print out all rouge scores. Defaults to False.
+
+    Returns:
+        dict: Dictionary of ROUGE scores.
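+
+    Example (illustrative; requires a local perl ROUGE installation):
+        >>> scores = compute_rouge_perl(
+        ...     cand=["the cat sat on the mat ."], ref=["a cat sat on a mat ."]
+        ... )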
+ + """ + + temp_dir = tempfile.mkdtemp() + + if is_input_files: + candidates = [line.strip() for line in open(cand, encoding="utf-8")] + references = [line.strip() for line in open(ref, encoding="utf-8")] + else: + candidates = cand + references = ref + + print("Number of candidates: {}".format(len(candidates))) + print("Number of references: {}".format(len(references))) + assert len(candidates) == len(references) + + cnt = len(candidates) + current_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + tmp_dir = os.path.join(temp_dir, "rouge-tmp-{}".format(current_time)) + + tmp_dir_candidate = tmp_dir + "/candidate/" + tmp_dir_reference = tmp_dir + "/reference/" + + os.makedirs(tmp_dir_candidate, exist_ok=True) + os.makedirs(tmp_dir_reference, exist_ok=True) + + try: + for i in range(cnt): + if len(references[i]) < 1: + continue + with open(tmp_dir_candidate + "/cand.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(candidates[i]) + with open(tmp_dir_reference + "/ref.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(references[i]) + r = Rouge155() + r.model_dir = tmp_dir_reference + r.system_dir = tmp_dir_candidate + r.model_filename_pattern = "ref.#ID#.txt" + r.system_filename_pattern = r"cand.(\d+).txt" + rouge_results = r.convert_and_evaluate() + if verbose: + print(rouge_results) + results_dict = r.output_to_dict(rouge_results) + finally: + if os.path.isdir(tmp_dir): + shutil.rmtree(tmp_dir) + return results_dict + + +def compute_rouge_python(cand, ref, is_input_files=False, language="en"): + """ + Computes ROUGE scores using the python package (https://pypi.org/project/py-rouge/). + + Args: + cand (list or str): If `is_input_files` is `False`, `cand` is a list of strings + containing predicted summaries. if `is_input_files` is `True`, `cand` is the path + to the file containing the predicted summaries. + ref (list or str): If `is_input_files` is `False`, `cand` is a list of strings + containing reference summaries. if `is_input_files` is `True`, `cand` is the path + to the file containing the reference summaries. + is_input_files (bool, optional): If True, inputs are file names. Otherwise, inputs are + lists of predicted and reference summaries. Defaults to False. + language (str, optional): Language of the input text. Supported values are "en" and + "hi". Defaults to "en". + + Returns: + dict: Dictionary of ROUGE scores. + + """ + supported_langauges = ["en", "hi"] + if language not in supported_langauges: + raise Exception( + "Language {0} is not supported. 
+            "Language {0} is not supported. Supported languages are: {1}.".format(
+                language, supported_languages
+            )
+        )
+
+    if is_input_files:
+        candidates = [line.strip() for line in open(cand, encoding="utf-8")]
+        references = [line.strip() for line in open(ref, encoding="utf-8")]
+    else:
+        candidates = cand
+        references = ref
+
+    print("Number of candidates: {}".format(len(candidates)))
+    print("Number of references: {}".format(len(references)))
+    assert len(candidates) == len(references)
+
+    if language == "en":
+        evaluator = Rouge(
+            metrics=["rouge-n", "rouge-l"], max_n=2, limit_length=False, apply_avg=True
+        )
+    else:
+        evaluator = RougeExt(
+            metrics=["rouge-n", "rouge-l"],
+            max_n=2,
+            limit_length=False,
+            apply_avg=True,
+            language=language,
+        )
+
+    scores = evaluator.get_scores(candidates, [[it] for it in references])
+
+    return scores
diff --git a/utils_nlp/eval/rouge/rouge_ext.py b/utils_nlp/eval/rouge/rouge_ext.py
new file mode 100644
index 000000000..0e3fad1a7
--- /dev/null
+++ b/utils_nlp/eval/rouge/rouge_ext.py
@@ -0,0 +1,606 @@
+# This script is adapted from https://github.com/Diego999/py-rouge/blob/master/rouge/rouge.py
+# to compute ROUGE scores for non-English languages.
+
+# Currently, the script supports Hindi.
+# Additional language support can be added by adding language specific
+# 1) sentence splitter (SENTENCE_SPLIT_DICT or the sentence_split_func argument)
+# 2) word tokenizer (WORD_TOKENIZE_DICT or the word_tokenize_func argument)
+# 3) pattern of characters to remove (REMOVE_CHAR_PATTERN_DICT or the remove_char_pattern
+#    argument)
+# 4) stemmer (STEMMER_DICT or the stemming_func argument), this is optional since
+#    stemming is not applicable to all languages
+# 5) word splitter (WORD_SPLIT_DICT or the word_split_func argument)
+
+# Major changes made to the original rouge.py include:
+# 1) Don't remove non-English or non-numeric characters
+# 2) Removed the ensure_compatibility argument as we don't need to reproduce the results of
+#    the original perl script that only supports English.
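+
+# A sketch of wiring in a new language through the constructor arguments; the helper
+# names below are hypothetical placeholders, not part of this module:
+#
+#     rouge = RougeExt(
+#         language="xx",
+#         sentence_split_func=my_sentence_splitter,  # (text, lang) -> list of sentences
+#         word_tokenize_func=my_word_tokenizer,      # (text, lang) -> list of tokens
+#         remove_char_pattern=re.compile(r"[^\w\s]"),
+#     )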
+
+
+import re
+import string
+import itertools
+import collections
+import warnings
+
+from indicnlp.tokenize import sentence_tokenize, indic_tokenize
+from ...language_utils.hi.hindi_stemmer import hi_stem
+from rouge import Rouge
+
+
+class RougeExt(Rouge):
+    DEFAULT_METRICS = {"rouge-n"}
+    DEFAULT_N = 1
+    STATS = ["f", "p", "r"]
+    AVAILABLE_METRICS = {"rouge-n", "rouge-l", "rouge-w"}
+    AVAILABLE_LENGTH_LIMIT_TYPES = {"words", "bytes"}
+
+    SENTENCE_SPLIT_DICT = {"hi": sentence_tokenize.sentence_split}
+    WORD_TOKENIZE_DICT = {"hi": indic_tokenize.trivial_tokenize}
+    REMOVE_CHAR_PATTERN_DICT = {
+        "hi": re.compile(r"([" + string.punctuation + r"\u0964\u0965" + r"])")
+    }
+    STEMMER_DICT = {"hi": hi_stem}
+    WORD_SPLIT_DICT = {}
+
+    # REMOVE_CHAR_PATTERN = re.compile('[^A-Za-z0-9]')
+
+    # Hack to not tokenize "cannot" to "can not" and consider them different as in the
+    # official ROUGE script
+    # KEEP_CANNOT_IN_ONE_WORD = re.compile('cannot')
+    # KEEP_CANNOT_IN_ONE_WORD_REVERSED = re.compile('_cannot_')
+
+    # WORDNET_KEY_VALUE = {}
+    # WORDNET_DB_FILEPATH = 'wordnet_key_value.txt'
+    # WORDNET_DB_FILEPATH_SPECIAL_CASE = 'wordnet_key_value_special_cases.txt'
+    # WORDNET_DB_DELIMITER = '|'
+    # STEMMER = None
+
+    def __init__(
+        self,
+        language,
+        metrics=None,
+        max_n=None,
+        limit_length=True,
+        length_limit=665,
+        length_limit_type="bytes",
+        apply_avg=True,
+        apply_best=False,
+        stemming=True,
+        alpha=0.5,
+        weight_factor=1.0,
+        sentence_split_func=None,
+        word_tokenize_func=None,
+        remove_char_pattern=None,
+        stemming_func=None,
+        word_split_func=None,
+    ):
+        """
+        Handle the ROUGE score computation as in the official perl script.
+
+        Note 1: Small differences might happen if the resampling of the perl script is not
+            high enough (as the average depends on this).
+        Note 2: Stemming in the official ROUGE perl script (its own Porter stemmer) is
+            slightly different from the Porter stemmer implemented in NLTK. However, the
+            special cases of DUC 2004 have been treated.
+            The solution would be to rewrite the whole perl stemming in python from
+            the original script
+
+        Args:
+            language: language of the text to be evaluated, e.g. "hi".
+            metrics: What ROUGE score to compute. Available: ROUGE-N, ROUGE-L, ROUGE-W.
+                Default: ROUGE-N
+            max_n: N-grams for ROUGE-N if specified. Default: 1
+            limit_length: If the summaries must be truncated. Default: True
+            length_limit: Length at which to truncate, expressed in the unit given by
+                length_limit_type. Default: 665 (bytes)
+            length_limit_type: Unit of length_limit. Available: words, bytes. Default: 'bytes'
+            apply_avg: If we should average the score of multiple samples. Default: True. If
+                apply_avg and apply_best are both False, each ROUGE score is returned
+                independently per sample.
+            apply_best: Take the best instead of the average. Default: False, so each ROUGE
+                score is returned independently.
+            stemming: Apply stemming to summaries. Default: True
+            alpha: Alpha used to compute the F1 score: P*R/((1-a)*P + a*R). Default: 0.5
+            weight_factor: Weight factor to be used for ROUGE-W. Official rouge score defines
+                it at 1.2. Default: 1.0
+            sentence_split_func (function, optional): Language specific function for splitting
+                sentences. Defaults to None.
+            word_tokenize_func (function, optional): Language specific function for tokenizing
+                text. Defaults to None.
+            remove_char_pattern (_sre.SRE_Pattern, optional): Language specific regular expression
+                pattern for removing special characters, e.g. punctuation. Defaults to None.
+            stemming_func (function, optional): Language specific stemmer. Defaults to None.
+            word_split_func (function, optional): Language specific word splitter. Only needed if
+                the language words are not separated by space, e.g. Chinese. Defaults to None.
+
+        Raises:
+            ValueError: raises exception if metric is not among AVAILABLE_METRICS
+            ValueError: raises exception if length_limit_type is not among
+                AVAILABLE_LENGTH_LIMIT_TYPES
+            ValueError: raises exception if weight_factor < 0
+        """
+        supported_languages = ["hi"]
+        if language not in supported_languages and not all(
+            [sentence_split_func, word_tokenize_func, remove_char_pattern]
+        ):
+            raise Exception(
+                "Language {0} is not supported. Supported languages are: {1}. Provide language "
+                "specific sentence_split_func, word_tokenize_func, remove_char_pattern, "
+                "stemming_func (optional), and word_split_func (if words are not separated by "
+                "space) to use this class".format(language, supported_languages)
+            )
+        self.metrics = metrics[:] if metrics is not None else RougeExt.DEFAULT_METRICS
+        for m in self.metrics:
+            if m not in RougeExt.AVAILABLE_METRICS:
+                raise ValueError("Unknown metric '{}'".format(m))
+
+        self.max_n = max_n if "rouge-n" in self.metrics else None
+        # Add all rouge-n metrics
+        if self.max_n is not None:
+            index_rouge_n = self.metrics.index("rouge-n")
+            del self.metrics[index_rouge_n]
+            self.metrics += ["rouge-{}".format(n) for n in range(1, self.max_n + 1)]
+        self.metrics = set(self.metrics)
+
+        self.limit_length = limit_length
+        if self.limit_length:
+            if length_limit_type not in RougeExt.AVAILABLE_LENGTH_LIMIT_TYPES:
+                raise ValueError("Unknown length_limit_type '{}'".format(length_limit_type))
+
+        self.length_limit = length_limit
+        if self.length_limit == 0:
+            self.limit_length = False
+        self.length_limit_type = length_limit_type
+        self.stemming = stemming
+
+        self.apply_avg = apply_avg
+        self.apply_best = apply_best
+        self.alpha = alpha
+        self.weight_factor = weight_factor
+        if self.weight_factor <= 0:
+            raise ValueError("ROUGE-W weight factor must be greater than 0.")
+
+        self.language = language
+        if sentence_split_func is None:
+            self.sentence_split = RougeExt.SENTENCE_SPLIT_DICT[self.language]
+        else:
+            self.sentence_split = sentence_split_func
+        if word_tokenize_func is None:
+            self.word_tokenize = RougeExt.WORD_TOKENIZE_DICT[self.language]
+        else:
+            self.word_tokenize = word_tokenize_func
+        if remove_char_pattern is None:
+            self.remove_char_pattern = RougeExt.REMOVE_CHAR_PATTERN_DICT[self.language]
+        else:
+            self.remove_char_pattern = remove_char_pattern
+        if self.language not in RougeExt.STEMMER_DICT.keys() and stemming_func is None:
+            self.stemmer = None
+            warnings.warn("Language-specific stemmer is not available. Skipping stemming.")
+        elif stemming_func is None:
+            self.stemmer = RougeExt.STEMMER_DICT[self.language]
+        else:
+            self.stemmer = stemming_func
+
+        if self.language not in RougeExt.WORD_SPLIT_DICT.keys() and word_split_func is None:
+            self.word_split = None
+        elif word_split_func is None:
+            self.word_split = RougeExt.WORD_SPLIT_DICT[self.language]
+        else:
+            self.word_split = word_split_func
+
+    def tokenize_text(self, text):
+        """
+        Tokenize text in the specific language
+
+        Args:
+            text: The string text to tokenize
+
+        Returns:
+            List of tokens of text
+        """
+        return self.word_tokenize(text, self.language)
+
+    def split_into_sentences(self, text):
+        """
+        Split text into sentences, using the specified language.
+
+        Args:
+            text: The string text to split
+
+        Returns:
+            List of sentences
+        """
+
+        return self.sentence_split(text, self.language)
+
+    def stem_tokens(self, tokens):
+        """
+        Stem each token of tokens
+
+        Args:
+            tokens: List of tokens to stem
+
+        Returns:
+            List of final stems
+        """
+        for i, token in enumerate(tokens):
+            tokens[i] = self.stemmer(token)
+
+        return tokens
+
+    def _split_into_words(self, sentences):
+        """
+        Splits multiple sentences into words and flattens the result
+
+        Args:
+            sentences: list of string
+
+        Returns:
+            A list of words (split by white space)
+        """
+        # Modified from https://github.com/pltrdy/seq2seq/blob/master/seq2seq/metrics/rouge.py
+        if self.word_split is None:
+            return list(itertools.chain(*[_.split() for _ in sentences]))
+        else:
+            return list(itertools.chain(*[self.word_split(_) for _ in sentences]))
+
+    def _get_word_ngrams_and_length(self, n, sentences):
+        """
+        Calculates word n-grams for multiple sentences.
+
+        Args:
+            n: which n-grams to calculate
+            sentences: list of string
+
+        Returns:
+            A set of n-grams, their frequency and #n-grams in sentences
+        """
+        # Modified from https://github.com/pltrdy/seq2seq/blob/master/seq2seq/metrics/rouge.py
+        assert len(sentences) > 0
+        assert n > 0
+
+        tokens = self._split_into_words(sentences)
+        return self._get_ngrams(n, tokens), tokens, len(tokens) - (n - 1)
+
+    def _get_unigrams(self, sentences):
+        """
+        Calculates uni-grams.
+
+        Args:
+            sentences: list of string
+
+        Returns:
+            A set of uni-grams and their frequency
+        """
+        assert len(sentences) > 0
+
+        tokens = self._split_into_words(sentences)
+        unigram_set = collections.defaultdict(int)
+        for token in tokens:
+            unigram_set[token] += 1
+        return unigram_set, len(tokens)
+
+    def _compute_ngrams(self, evaluated_sentences, reference_sentences, n):
+        """
+        Computes n-grams overlap of two text collections of sentences.
+        Source: http://research.microsoft.com/en-us/um/people/cyl/download/
+        papers/rouge-working-note-v1.3.1.pdf
+
+        Args:
+            evaluated_sentences: The sentences that have been picked by the
+                summarizer
+            reference_sentences: The sentences from the reference set
+            n: Size of ngram
+
+        Returns:
+            Number of n-grams for evaluated_sentences and reference_sentences, and the
+            overlapping n-gram count; repeated n-gram matches are counted multiple times.
+
+        Raises:
+            ValueError: raises exception if a param has len <= 0
+        """
+        # Modified from https://github.com/pltrdy/seq2seq/blob/master/seq2seq/metrics/rouge.py
+        if len(evaluated_sentences) <= 0 or len(reference_sentences) <= 0:
+            raise ValueError("Collections must contain at least 1 sentence.")
+
+        evaluated_ngrams, _, evaluated_count = self._get_word_ngrams_and_length(
+            n, evaluated_sentences
+        )
+        reference_ngrams, _, reference_count = self._get_word_ngrams_and_length(
+            n, reference_sentences
+        )
+
+        # Gets the overlapping ngrams between evaluated and reference
+        overlapping_ngrams = set(evaluated_ngrams.keys()).intersection(set(reference_ngrams.keys()))
+        overlapping_count = 0
+        for ngram in overlapping_ngrams:
+            overlapping_count += min(evaluated_ngrams[ngram], reference_ngrams[ngram])
+
+        return evaluated_count, reference_count, overlapping_count
+
+    def _compute_ngrams_lcs(self, evaluated_sentences, reference_sentences, weight_factor=1.0):
+        """
+        Computes ROUGE-L (summary level) of two text collections of sentences.
+ http://research.microsoft.com/en-us/um/people/cyl/download/papers/ + rouge-working-note-v1.3.1.pdf + Args: + evaluated_sentences: The sentences that have been picked by the summarizer + reference_sentence: One of the sentences in the reference summaries + weight_factor: Weight factor to be used for WLCS (1.0 by default if LCS) + Returns: + Number of LCS n-grams for evaluated_sentences, reference_sentences and intersection + of both. + intersection of both count multiple of occurences in n-grams match several times + Raises: + ValueError: raises exception if a param has len <= 0 + """ + + def _lcs(x, y): + m = len(x) + n = len(y) + vals = collections.defaultdict(int) + dirs = collections.defaultdict(int) + + for i in range(1, m + 1): + for j in range(1, n + 1): + if x[i - 1] == y[j - 1]: + vals[i, j] = vals[i - 1, j - 1] + 1 + dirs[i, j] = "|" + elif vals[i - 1, j] >= vals[i, j - 1]: + vals[i, j] = vals[i - 1, j] + dirs[i, j] = "^" + else: + vals[i, j] = vals[i, j - 1] + dirs[i, j] = "<" + + return vals, dirs + + def _wlcs(x, y, weight_factor): + m = len(x) + n = len(y) + vals = collections.defaultdict(float) + dirs = collections.defaultdict(int) + lengths = collections.defaultdict(int) + + for i in range(1, m + 1): + for j in range(1, n + 1): + if x[i - 1] == y[j - 1]: + length_tmp = lengths[i - 1, j - 1] + vals[i, j] = ( + vals[i - 1, j - 1] + + (length_tmp + 1) ** weight_factor + - length_tmp ** weight_factor + ) + dirs[i, j] = "|" + lengths[i, j] = length_tmp + 1 + elif vals[i - 1, j] >= vals[i, j - 1]: + vals[i, j] = vals[i - 1, j] + dirs[i, j] = "^" + lengths[i, j] = 0 + else: + vals[i, j] = vals[i, j - 1] + dirs[i, j] = "<" + lengths[i, j] = 0 + + return vals, dirs + + def _mark_lcs(mask, dirs, m, n): + while m != 0 and n != 0: + if dirs[m, n] == "|": + m -= 1 + n -= 1 + mask[m] = 1 + elif dirs[m, n] == "^": + m -= 1 + elif dirs[m, n] == "<": + n -= 1 + else: + raise UnboundLocalError("Illegal move") + + return mask + + if len(evaluated_sentences) <= 0 or len(reference_sentences) <= 0: + raise ValueError("Collections must contain at least 1 sentence.") + + evaluated_unigrams_dict, evaluated_count = self._get_unigrams(evaluated_sentences) + reference_unigrams_dict, reference_count = self._get_unigrams(reference_sentences) + + # Has to use weight factor for WLCS + use_WLCS = weight_factor != 1.0 + if use_WLCS: + evaluated_count = evaluated_count ** weight_factor + reference_count = 0 + + overlapping_count = 0.0 + for reference_sentence in reference_sentences: + reference_sentence_tokens = reference_sentence.split() + if use_WLCS: + reference_count += len(reference_sentence_tokens) ** weight_factor + hit_mask = [0 for _ in range(len(reference_sentence_tokens))] + + for evaluated_sentence in evaluated_sentences: + evaluated_sentence_tokens = evaluated_sentence.split() + + if use_WLCS: + _, lcs_dirs = _wlcs( + reference_sentence_tokens, evaluated_sentence_tokens, weight_factor + ) + else: + _, lcs_dirs = _lcs(reference_sentence_tokens, evaluated_sentence_tokens) + _mark_lcs( + hit_mask, + lcs_dirs, + len(reference_sentence_tokens), + len(evaluated_sentence_tokens), + ) + + overlapping_count_length = 0 + for ref_token_id, val in enumerate(hit_mask): + if val == 1: + token = reference_sentence_tokens[ref_token_id] + if evaluated_unigrams_dict[token] > 0 and reference_unigrams_dict[token] > 0: + evaluated_unigrams_dict[token] -= 1 + reference_unigrams_dict[ref_token_id] -= 1 + + if use_WLCS: + overlapping_count_length += 1 + if ( + ref_token_id + 1 < len(hit_mask) and hit_mask[ref_token_id 
+ 1] == 0 + ) or ref_token_id + 1 == len(hit_mask): + overlapping_count += overlapping_count_length ** weight_factor + overlapping_count_length = 0 + else: + overlapping_count += 1 + + if use_WLCS: + reference_count = reference_count ** weight_factor + + return evaluated_count, reference_count, overlapping_count + + def _preprocess_summary_as_a_whole(self, summary): + """ + Preprocessing (truncate text if enable, tokenization, stemming if enable, lowering) + of a summary as a whole + + Args: + summary: string of the summary + + Returns: + Return the preprocessed summary (string) + """ + sentences = self.split_into_sentences(summary) + + # Truncate + if self.limit_length: + # By words + if self.length_limit_type == "words": + summary = " ".join(sentences) + all_tokens = summary.split() # Counting as in the perls script + summary = " ".join(all_tokens[: self.length_limit]) + + # By bytes + elif self.length_limit_type == "bytes": + summary = "" + current_len = 0 + for sentence in sentences: + sentence = sentence.strip() + sentence_len = len(sentence) + + if current_len + sentence_len < self.length_limit: + if current_len != 0: + summary += " " + summary += sentence + current_len += sentence_len + else: + if current_len > 0: + summary += " " + summary += sentence[: self.length_limit - current_len] + break + else: + summary = " ".join(sentences) + + # summary = Rouge.REMOVE_CHAR_PATTERN.sub(' ', summary.lower()).strip() + summary = self.remove_char_pattern.sub(" ", summary.lower()).strip() + + # # Preprocess. Hack: because official ROUGE script bring "cannot" as "cannot" and + # "can not" as "can not", + # # we have to hack nltk tokenizer to not transform "cannot/can not" to "can not" + # if self.ensure_compatibility: + # tokens = self.tokenize_text(Rouge.KEEP_CANNOT_IN_ONE_WORD.sub('_cannot_', summary)) + # else: + # tokens = self.tokenize_text(Rouge.REMOVE_CHAR_PATTERN.sub(' ', summary)) + + # if self.stemming: + # self.stem_tokens(tokens) # stemming in-place + + # if self.ensure_compatibility: + # preprocessed_summary = [Rouge.KEEP_CANNOT_IN_ONE_WORD_REVERSED.sub( + # 'cannot', ' '.join(tokens))] + # else: + # preprocessed_summary = [' '.join(tokens)] + + # return preprocessed_summary + + tokens = self.tokenize_text(summary) + if self.stemming: + self.stem_tokens(tokens) # stemming in-place + summary = [" ".join(tokens)] + + return summary + + def _preprocess_summary_per_sentence(self, summary): + """ + Preprocessing (truncate text if enable, tokenization, stemming if enable, lowering) + of a summary by sentences + + Args: + summary: string of the summary + + Returns: + Return the preprocessed summary (string) + """ + sentences = self.split_into_sentences(summary) + + # Truncate + if self.limit_length: + final_sentences = [] + current_len = 0 + # By words + if self.length_limit_type == "words": + for sentence in sentences: + tokens = sentence.strip().split() + tokens_len = len(tokens) + if current_len + tokens_len < self.length_limit: + sentence = " ".join(tokens) + final_sentences.append(sentence) + current_len += tokens_len + else: + sentence = " ".join(tokens[: self.length_limit - current_len]) + final_sentences.append(sentence) + break + # By bytes + elif self.length_limit_type == "bytes": + for sentence in sentences: + sentence = sentence.strip() + sentence_len = len(sentence) + if current_len + sentence_len < self.length_limit: + final_sentences.append(sentence) + current_len += sentence_len + else: + sentence = sentence[: self.length_limit - current_len] + 
final_sentences.append(sentence)
+                    break
+            sentences = final_sentences
+
+        final_sentences = []
+        for sentence in sentences:
+            # sentence = Rouge.REMOVE_CHAR_PATTERN.sub(' ', sentence.lower()).strip()
+            sentence = self.remove_char_pattern.sub(" ", sentence.lower()).strip()
+
+            # # Preprocess. Hack: because official ROUGE script bring "cannot" as "cannot"
+            # and "can not" as "can not",
+            # # we have to hack nltk tokenizer to not transform "cannot/can not" to "can not"
+            # if self.ensure_compatibility:
+            #     tokens = self.tokenize_text(Rouge.KEEP_CANNOT_IN_ONE_WORD.sub(
+            #         '_cannot_', sentence))
+            # else:
+            #     tokens = self.tokenize_text(Rouge.REMOVE_CHAR_PATTERN.sub(' ', sentence))
+
+            # if self.stemming:
+            #     self.stem_tokens(tokens)  # stemming in-place
+
+            # if self.ensure_compatibility:
+            #     sentence = Rouge.KEEP_CANNOT_IN_ONE_WORD_REVERSED.sub(
+            #         'cannot', ' '.join(tokens)
+            #     )
+            # else:
+            #     sentence = ' '.join(tokens)
+
+            tokens = self.tokenize_text(sentence)
+            if self.stemming:
+                self.stem_tokens(tokens)  # stemming in-place
+            sentence = " ".join(tokens)
+            final_sentences.append(sentence)
+
+        return final_sentences
diff --git a/utils_nlp/language_utils/hi/hindi_stemmer.py b/utils_nlp/language_utils/hi/hindi_stemmer.py
new file mode 100644
index 000000000..ca5b72419
--- /dev/null
+++ b/utils_nlp/language_utils/hi/hindi_stemmer.py
@@ -0,0 +1,102 @@
+#! /usr/bin/env python3.1
+# Script was downloaded from https://research.variancia.com/hindi_stemmer/
+""" Lightweight Hindi stemmer
+Copyright © 2010 Luís Gomes.
+
+Implementation of the algorithm described in
+
+    A Lightweight Stemmer for Hindi
+    Ananthakrishnan Ramanathan and Durgesh D Rao
+    http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf
+
+    @conference{ramanathan2003lightweight,
+      title={{A lightweight stemmer for Hindi}},
+      author={Ramanathan, A. and Rao, D.},
+      booktitle={Workshop on Computational Linguistics for South-Asian Languages, EACL},
+      year={2003}
+    }
+
+Ported from HindiStemmer.java, part of Lucene.
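+
+Example (illustrative):
+    >>> hi_stem("लड़का")
+    'लड़क'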
+""" + +suffixes = { + 1: ["ो", "े", "ू", "ु", "ी", "ि", "ा"], + 2: [ + "कर", + "ाओ", + "िए", + "ाई", + "ाए", + "ने", + "नी", + "ना", + "ते", + "ीं", + "ती", + "ता", + "ाँ", + "ां", + "ों", + "ें", + ], + 3: [ + "ाकर", + "ाइए", + "ाईं", + "ाया", + "ेगी", + "ेगा", + "ोगी", + "ोगे", + "ाने", + "ाना", + "ाते", + "ाती", + "ाता", + "तीं", + "ाओं", + "ाएं", + "ुओं", + "ुएं", + "ुआं", + ], + 4: [ + "ाएगी", + "ाएगा", + "ाओगी", + "ाओगे", + "एंगी", + "ेंगी", + "एंगे", + "ेंगे", + "ूंगी", + "ूंगा", + "ातीं", + "नाओं", + "नाएं", + "ताओं", + "ताएं", + "ियाँ", + "ियों", + "ियां", + ], + 5: ["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"], +} + + +def hi_stem(word): + for L in 5, 4, 3, 2, 1: + if len(word) > L + 1: + for suf in suffixes[L]: + if word.endswith(suf): + return word[:-L] + return word + + +if __name__ == "__main__": + import sys + + if len(sys.argv) != 1: + sys.exit("{} takes no arguments".format(sys.argv[0])) + for line in sys.stdin: + print(*[hi_stem(word) for word in line.split()]) diff --git a/utils_nlp/models/bert/sequence_classification.py b/utils_nlp/models/bert/sequence_classification.py index 03a324604..8fbe416c8 100644 --- a/utils_nlp/models/bert/sequence_classification.py +++ b/utils_nlp/models/bert/sequence_classification.py @@ -13,7 +13,7 @@ from tqdm import tqdm from utils_nlp.models.bert.common import Language -from utils_nlp.common.pytorch_utils import get_device, move_to_device +from utils_nlp.common.pytorch_utils import get_device from cached_property import cached_property @@ -91,7 +91,7 @@ def fit( device, num_gpus = get_device(num_gpus) - self.model = move_to_device(self.model, device, num_gpus) + self.model = move_model_to_device(self.model, device, num_gpus) token_ids_tensor = torch.tensor(token_ids, dtype=torch.long) input_mask_tensor = torch.tensor(input_mask, dtype=torch.long) @@ -211,7 +211,7 @@ def predict( (classes, probabilities) if probabilities is True. """ device, num_gpus = get_device(num_gpus) - self.model = move_to_device(self.model, device, num_gpus) + self.model = move_model_to_device(self.model, device, num_gpus) # score self.model.eval() diff --git a/utils_nlp/models/bert/sequence_classification_distributed.py b/utils_nlp/models/bert/sequence_classification_distributed.py index ee5061158..d448515ba 100644 --- a/utils_nlp/models/bert/sequence_classification_distributed.py +++ b/utils_nlp/models/bert/sequence_classification_distributed.py @@ -14,7 +14,7 @@ from pytorch_pretrained_bert.optimization import BertAdam from tqdm import tqdm -from utils_nlp.common.pytorch_utils import get_device, move_to_device +from utils_nlp.common.pytorch_utils import get_device, move_model_to_device from utils_nlp.models.bert.common import Language try: @@ -192,7 +192,7 @@ def fit( device, num_gpus = get_device(num_gpus) - self.model = move_to_device(self.model, device, num_gpus) + self.model = move_model_to_device(self.model, device, num_gpus) if bert_optimizer is None: bert_optimizer = self.create_optimizer( @@ -277,7 +277,7 @@ def predict(self, test_loader, num_gpus=None, probabilities=False): a dictionary with classes, target labels, probabilities) if probabilities is True. 
""" device, num_gpus = get_device(num_gpus) - self.model = move_to_device(self.model, device, num_gpus) + self.model = move_model_to_device(self.model, device, num_gpus) # score self.model.eval() diff --git a/utils_nlp/models/bert/sequence_encoding.py b/utils_nlp/models/bert/sequence_encoding.py index 088a6310d..520c56a3d 100644 --- a/utils_nlp/models/bert/sequence_encoding.py +++ b/utils_nlp/models/bert/sequence_encoding.py @@ -4,19 +4,17 @@ # This script reuses code from https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples # /extract_features.py, with necessary modifications. -from pytorch_pretrained_bert.modeling import BertModel - -from utils_nlp.common.pytorch_utils import get_device, move_to_device from enum import Enum + import numpy as np import pandas as pd -import os import torch +from cached_property import cached_property +from pytorch_pretrained_bert.modeling import BertModel +from torch.utils.data import DataLoader, SequentialSampler, TensorDataset -from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset - +from utils_nlp.common.pytorch_utils import get_device, move_model_to_device from utils_nlp.models.bert.common import Language, Tokenizer -from cached_property import cached_property class PoolingStrategy(str, Enum): @@ -43,27 +41,21 @@ def __init__( pooling_strategy=PoolingStrategy.MEAN, ): """Initialize the encoder's underlying model and tokenizer - + Args: bert_model: BERT model to use for encoding. Defaults to pretrained BertModel. tokenizer: Tokenizer to use for preprocessing. Defaults to pretrained BERT tokenizer. language: The pretrained model's language. Defaults to Language.ENGLISH. - num_gpus: The number of gpus to use. Defaults to None, which forces all available GPUs to be used. + num_gpus: The number of gpus to use. Defaults to None, which forces all available GPUs to be used. cache_dir: Location of BERT's cache directory. Defaults to "." to_lower: True to lowercase before tokenization. Defaults to False. max_len: Maximum number of tokens. - layer_index: The layer from which to extract features. + layer_index: The layer from which to extract features. Defaults to the last layer; can also be a list of integers for experimentation. pooling_strategy: Pooling strategy to aggregate token embeddings into sentence embedding. """ - self.model = ( - bert_model.model.bert - if bert_model - else BertModel.from_pretrained(language, cache_dir=cache_dir) - ) - self.tokenizer = ( - tokenizer if tokenizer else Tokenizer(language, to_lower=to_lower, cache_dir=cache_dir) - ) + self.model = bert_model.model.bert if bert_model else BertModel.from_pretrained(language, cache_dir=cache_dir) + self.tokenizer = tokenizer if tokenizer else Tokenizer(language, to_lower=to_lower, cache_dir=cache_dir) self.num_gpus = num_gpus self.max_len = max_len self.layer_index = layer_index @@ -98,16 +90,17 @@ def pooling_strategy(self, pooling_strategy): def get_hidden_states(self, text, batch_size=32): """Extract the hidden states from the pretrained model - + Args: text: List of documents to extract features from. batch_size: Batch size, defaults to 32. - + Returns: - pd.DataFrame with columns text_index (int), token (str), layer_index (int), values (list[float]). + pd.DataFrame with columns: + text_index (int), token (str), layer_index (int), values (list[float]). 
""" device, num_gpus = get_device(self.num_gpus) - self.model = move_to_device(self.model, device, self.num_gpus) + self.model = move_model_to_device(self.model, device, self.num_gpus) self.model.eval() @@ -122,9 +115,7 @@ def get_hidden_states(self, text, batch_size=32): input_type_ids = torch.arange(input_ids.size(0), dtype=torch.long, device=device) eval_data = TensorDataset(input_ids, input_mask, input_type_ids) - eval_dataloader = DataLoader( - eval_data, sampler=SequentialSampler(eval_data), batch_size=batch_size - ) + eval_dataloader = DataLoader(eval_data, sampler=SequentialSampler(eval_data), batch_size=batch_size) hidden_states = {"text_index": [], "token": [], "layer_index": [], "values": []} for (input_ids_tensor, input_mask_tensor, example_indices_tensor) in eval_dataloader: @@ -142,9 +133,7 @@ def get_hidden_states(self, text, batch_size=32): hidden_states["text_index"].append(example_index.item()) hidden_states["token"].append(token) hidden_states["layer_index"].append(layer_index) - hidden_states["values"].append( - [round(x.item(), 6) for x in layer_output[i]] - ) + hidden_states["values"].append([round(x.item(), 6) for x in layer_output[i]]) # empty cache del [input_ids_tensor, input_mask_tensor, example_indices_tensor] @@ -158,7 +147,7 @@ def get_hidden_states(self, text, batch_size=32): def pool(self, df): """Pooling to aggregate token-wise embeddings to sentence embeddings - + Args: df: pd.DataFrame with columns text_index (int), token (str), layer_index (int), values (list[float]) @@ -167,31 +156,16 @@ def pool(self, df): """ def max_pool(x): - values = np.array( - [ - np.reshape(np.array(x.values[i]), self.embedding_dim) - for i in range(x.values.shape[0]) - ] - ) + values = np.array([np.reshape(np.array(x.values[i]), self.embedding_dim) for i in range(x.values.shape[0])]) m, _ = torch.max(torch.tensor(values, dtype=torch.float), 0) return m.numpy() def mean_pool(x): - values = np.array( - [ - np.reshape(np.array(x.values[i]), self.embedding_dim) - for i in range(x.values.shape[0]) - ] - ) + values = np.array([np.reshape(np.array(x.values[i]), self.embedding_dim) for i in range(x.values.shape[0])]) return torch.mean(torch.tensor(values, dtype=torch.float), 0).numpy() def cls_pool(x): - values = np.array( - [ - np.reshape(np.array(x.values[i]), self.embedding_dim) - for i in range(x.values.shape[0]) - ] - ) + values = np.array([np.reshape(np.array(x.values[i]), self.embedding_dim) for i in range(x.values.shape[0])]) return values[0] try: @@ -206,15 +180,11 @@ def cls_pool(x): except ValueError as ve: print(ve) - return ( - df.groupby(["text_index", "layer_index"])["values"] - .apply(lambda x: pool_func(x)) - .reset_index() - ) + return df.groupby(["text_index", "layer_index"])["values"].apply(lambda x: pool_func(x)).reset_index() def encode(self, text, batch_size=32, as_numpy=False): - """Computes sentence encodings - + """Computes sentence encodings + Args: text: List of documents to encode. batch_size: Batch size, defaults to 32. 
diff --git a/utils_nlp/models/bert/token_classification.py b/utils_nlp/models/bert/token_classification.py index 3965c41c1..816cb0216 100644 --- a/utils_nlp/models/bert/token_classification.py +++ b/utils_nlp/models/bert/token_classification.py @@ -16,7 +16,7 @@ from tqdm import tqdm, trange from utils_nlp.models.bert.common import Language, create_data_loader -from utils_nlp.common.pytorch_utils import get_device, move_to_device +from utils_nlp.common.pytorch_utils import get_device, move_model_to_device from cached_property import cached_property @@ -144,7 +144,7 @@ def fit( device, num_gpus = get_device(num_gpus) - self.model = move_to_device(self.model, device, num_gpus) + self.model = move_model_to_device(self.model, device, num_gpus) if num_gpus is None: num_gpus_used = torch.cuda.device_count() @@ -228,7 +228,7 @@ def predict( ) device, num_gpus = get_device(num_gpus) - self.model = move_to_device(self.model, device, num_gpus) + self.model = move_model_to_device(self.model, device, num_gpus) self.model.eval() eval_loss = 0 diff --git a/utils_nlp/models/transformers/common.py b/utils_nlp/models/transformers/common.py old mode 100644 new mode 100755 index 42aedebfb..7fce22c6b --- a/utils_nlp/models/transformers/common.py +++ b/utils_nlp/models/transformers/common.py @@ -7,13 +7,13 @@ import logging import os import random +import time +from itertools import cycle import numpy as np import torch -from tqdm import tqdm, trange -from transformers import AdamW -from transformers import get_linear_schedule_with_warmup - +from tqdm import tqdm +from transformers import AdamW, get_linear_schedule_with_warmup from transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP from transformers.modeling_distilbert import DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP from transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP @@ -22,7 +22,8 @@ from transformers.tokenization_distilbert import DistilBertTokenizer from transformers.tokenization_roberta import RobertaTokenizer from transformers.tokenization_xlnet import XLNetTokenizer -from utils_nlp.common.pytorch_utils import get_device + +from utils_nlp.common.pytorch_utils import get_device, move_model_to_device TOKENIZER_CLASS = {} TOKENIZER_CLASS.update({k: BertTokenizer for k in BERT_PRETRAINED_MODEL_ARCHIVE_MAP}) @@ -37,12 +38,7 @@ class Transformer: def __init__( - self, - model_class, - model_name="bert-base-cased", - num_labels=2, - cache_dir=".", - load_model_from_dir=None, + self, model_class, model_name="bert-base-cased", num_labels=2, cache_dir=".", load_model_from_dir=None, ): if model_name not in self.list_supported_models(): @@ -81,68 +77,55 @@ def set_seed(seed, cuda=True): if cuda and torch.cuda.is_available(): torch.cuda.manual_seed_all(seed) + @staticmethod + def get_default_optimizer(model, weight_decay, learning_rate, adam_epsilon): + no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": weight_decay, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon) + return optimizer + + @staticmethod + def get_default_scheduler(optimizer, warmup_steps, num_training_steps): + scheduler = get_linear_schedule_with_warmup( + optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_training_steps + ) + return scheduler + 
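The two defaults above encode the standard fine-tuning recipe: AdamW with weight decay disabled for biases and LayerNorm weights, plus a linear warmup schedule. A sketch of wiring them together (hyperparameter values illustrative):

    optimizer = Transformer.get_default_optimizer(
        model, weight_decay=0.01, learning_rate=5e-5, adam_epsilon=1e-8
    )
    scheduler = Transformer.get_default_scheduler(
        optimizer, warmup_steps=100, num_training_steps=1000
    )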
def fine_tune( self, train_dataloader, get_inputs, + num_gpus=None, + gpu_ids=None, max_steps=-1, - num_train_epochs=1, max_grad_norm=1.0, gradient_accumulation_steps=1, - n_gpu=1, optimizer=None, scheduler=None, - weight_decay=0.0, - learning_rate=5e-5, - adam_epsilon=1e-8, - warmup_steps=0, fp16=False, fp16_opt_level="O1", local_rank=-1, verbose=True, seed=None, + report_every=10, + clip_grad_norm=True, ): - device, num_gpus = get_device(num_gpus=n_gpu, local_rank=-1) + # get device + device, num_gpus = get_device(num_gpus=num_gpus, local_rank=local_rank) if seed is not None: Transformer.set_seed(seed, num_gpus > 0) - if max_steps > 0: - t_total = max_steps - num_train_epochs = ( - max_steps // (len(train_dataloader) // gradient_accumulation_steps) + 1 - ) - else: - t_total = len(train_dataloader) // gradient_accumulation_steps * num_train_epochs - - if optimizer is None: - no_decay = ["bias", "LayerNorm.weight"] - optimizer_grouped_parameters = [ - { - "params": [ - p - for n, p in self.model.named_parameters() - if not any(nd in n for nd in no_decay) - ], - "weight_decay": weight_decay, - }, - { - "params": [ - p - for n, p in self.model.named_parameters() - if any(nd in n for nd in no_decay) - ], - "weight_decay": 0.0, - }, - ] - optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon) - - if scheduler is None: - scheduler = get_linear_schedule_with_warmup( - optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total - ) - if fp16: try: from apex import amp @@ -150,37 +133,22 @@ def fine_tune( raise ImportError("Please install apex from https://www.github.com/nvidia/apex") self.model, optimizer = amp.initialize(self.model, optimizer, opt_level=fp16_opt_level) - if local_rank != -1: - self.model = torch.nn.parallel.DistributedDataParallel( - self.model, - device_ids=[local_rank], - output_device=local_rank, - find_unused_parameters=True, - ) - else: - if isinstance(self.model, torch.nn.DataParallel): - self.model = self.model.module - - if num_gpus > 1: - self.model = torch.nn.DataParallel(self.model, device_ids=list(range(num_gpus))) - - self.model.to(device) - self.model.train() + # move model + self.model = move_model_to_device(self.model, device, num_gpus, gpu_ids, local_rank) + # init training global_step = 0 tr_loss = 0.0 + accum_loss = 0 + self.model.train() self.model.zero_grad() - train_iterator = trange( - int(num_train_epochs), desc="Epoch", disable=local_rank not in [-1, 0] or not verbose - ) - for _ in train_iterator: - epoch_iterator = tqdm( - train_dataloader, desc="Iteration", disable=local_rank not in [-1, 0] or not verbose - ) + # train + start = time.time() + while global_step < max_steps: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=local_rank not in [-1, 0] or not verbose) for step, batch in enumerate(epoch_iterator): - batch = tuple(t.to(device) for t in batch) - inputs = get_inputs(batch, self.model_name) + inputs = get_inputs(batch, device, self.model_name) outputs = self.model(**inputs) loss = outputs[0] @@ -189,52 +157,57 @@ def fine_tune( if gradient_accumulation_steps > 1: loss = loss / gradient_accumulation_steps - if step % 10 == 0 and verbose: - tqdm.write("Loss:{:.6f}".format(loss)) - if fp16: with amp.scale_loss(loss, optimizer) as scaled_loss: scaled_loss.backward() - torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_grad_norm) else: loss.backward() - torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_grad_norm) tr_loss += loss.item() + accum_loss += loss.item() + if 
(step + 1) % gradient_accumulation_steps == 0: + global_step += 1 + + if clip_grad_norm: + if fp16: + torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_grad_norm) + else: + torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_grad_norm) + + if global_step % report_every == 0 and verbose: + end = time.time() + print( + "loss:{0:.6f}, time:{1:f}, examples:{2:.0f}, step:{3:.0f}/{4:.0f}".format( + accum_loss / report_every, end - start, len(batch), global_step, max_steps, + ) + ) + accum_loss = 0 + start = end + optimizer.step() - scheduler.step() + if scheduler: + scheduler.step() self.model.zero_grad() - global_step += 1 - if max_steps > 0 and global_step > max_steps: + if global_step > max_steps: epoch_iterator.close() break - if max_steps > 0 and global_step > max_steps: - train_iterator.close() - break - # empty cache - del [batch] - torch.cuda.empty_cache() return global_step, tr_loss / global_step - def predict(self, eval_dataloader, get_inputs, n_gpu=1, verbose=True): - device, num_gpus = get_device(num_gpus=n_gpu, local_rank=-1) - - if isinstance(self.model, torch.nn.DataParallel): - self.model = self.model.module + def predict(self, eval_dataloader, get_inputs, num_gpus, gpu_ids, verbose=True): + # get device + device, num_gpus = get_device(num_gpus=num_gpus, local_rank=-1) - if num_gpus > 1: - self.model = torch.nn.DataParallel(self.model, device_ids=list(range(num_gpus))) + # move model + self.model = move_model_to_device(self.model, device, num_gpus, gpu_ids, local_rank=-1) - self.model.to(device) + # predict self.model.eval() - - for batch in tqdm(eval_dataloader, desc="Evaluating", disable=not verbose): - batch = tuple(t.to(device) for t in batch) + for batch in tqdm(eval_dataloader, desc="Scoring", disable=not verbose): with torch.no_grad(): - inputs = get_inputs(batch, self.model_name, train_mode=False) + inputs = get_inputs(batch, device, self.model_name, train_mode=False) outputs = self.model(**inputs) logits = outputs[0] yield logits.detach().cpu().numpy() diff --git a/utils_nlp/models/transformers/datasets.py b/utils_nlp/models/transformers/datasets.py index 3c900d33b..72ab042d1 100644 --- a/utils_nlp/models/transformers/datasets.py +++ b/utils_nlp/models/transformers/datasets.py @@ -2,8 +2,9 @@ # Licensed under the MIT License. 
 import collections
+import itertools
 import torch
-from torch.utils.data import Dataset
+from torch.utils.data import Dataset, IterableDataset
 
 
 class SCDataSet(Dataset):
@@ -41,7 +42,6 @@ def __getitem__(self, idx):
                 torch.tensor(input_ids, dtype=torch.long),
                 torch.tensor(attention_mask, dtype=torch.long),
                 torch.tensor(token_type_ids, dtype=torch.long),
-
             ]
         )
         labels = self.df.iloc[idx, self.label_col]
@@ -92,7 +92,9 @@ def __init__(self, df, text1_col, text2_col, label_col, transform, **transform_a
     def __getitem__(self, idx):
         input_ids, attention_mask, token_type_ids = self.transform(
-            self.df.iloc[idx, self.text1_col], self.df.iloc[idx, self.text2_col], **self.transform_args
+            self.df.iloc[idx, self.text1_col],
+            self.df.iloc[idx, self.text2_col],
+            **self.transform_args,
         )
 
         if self.label_col is None:
@@ -111,7 +113,6 @@ def __getitem__(self, idx):
                 torch.tensor(attention_mask, dtype=torch.long),
                 torch.tensor(token_type_ids, dtype=torch.long),
                 torch.tensor(labels, dtype=torch.long),
-
             ]
         )
 
@@ -214,3 +215,90 @@ def __getitem__(self, idx):
 
     def __len__(self):
         return self.df.shape[0]
+
+
+def _line_iter(file_path):
+    with open(file_path, "r", encoding="utf8") as fd:
+        for line in fd:
+            yield line
+
+
+def _preprocess(param):
+    """
+    Helper function to preprocess a list of paragraphs.
+
+    Args:
+        param (Tuple): A tuple of (a list of strings, a list of preprocessing
+            functions, and a function to tokenize sentences into words). A paragraph
+            is represented as a single string containing multiple sentences.
+
+    Returns:
+        list of list of strings, where each string is a token or word.
+    """
+
+    sentences, preprocess_pipeline, word_tokenize = param
+    for function in preprocess_pipeline:
+        sentences = function(sentences)
+    return [word_tokenize(sentence) for sentence in sentences]
+
+
+def _create_data_from_iterator(iterator, preprocessing, word_tokenizer):
+    for line in iterator:
+        yield _preprocess((line, preprocessing, word_tokenizer))
+
+
+class SummarizationDataset(IterableDataset):
+    def __init__(
+        self,
+        source_file,
+        target_file,
+        source_preprocessing,
+        target_preprocessing,
+        word_tokenization,
+        top_n=-1,
+        **kwargs,
+    ):
+        """
+        Create a summarization dataset instance given the
+        paths of the source file and the target file.
+
+        Args:
+            source_file (str): Full path of the file which contains a list of
+                the paragraphs with line break as separator.
+            target_file (str): Full path of the file which contains a list of
+                the summaries for the paragraphs in the source file with line break as separator.
+            source_preprocessing (list of functions): A list of preprocessing functions
+                to process the paragraphs in the source file.
+            target_preprocessing (list of functions): A list of preprocessing functions
+                to process the summaries in the target file.
+            word_tokenization (function): Tokenization function for tokenizing the paragraphs
+                and summaries. The tokenization method is used for sentence selection
+                in :meth:`utils_nlp.models.transformers.extractive_summarization.ExtSumProcessor.preprocess`
+            top_n (int, optional): The number of examples from the beginning of the
+                paragraph and summary lists that will be processed by this function.
+                Defaults to -1, which means the whole lists of paragraphs and summaries
+                will be processed.
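+
+        Example (illustrative file names; assumes nltk-style helpers):
+            >>> from nltk.tokenize import sent_tokenize, word_tokenize
+            >>> train_dataset = SummarizationDataset(
+            ...     "train.source",
+            ...     "train.target",
+            ...     source_preprocessing=[sent_tokenize],
+            ...     target_preprocessing=[sent_tokenize],
+            ...     word_tokenization=word_tokenize,
+            ...     top_n=100,
+            ... )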
+ """ + + source_iter = _line_iter(source_file) + target_iter = _line_iter(target_file) + + if top_n != -1: + source_iter = itertools.islice(source_iter, top_n) + target_iter = itertools.islice(target_iter, top_n) + + self._source = _create_data_from_iterator( + source_iter, source_preprocessing, word_tokenization + ) + + self._target = _create_data_from_iterator( + target_iter, target_preprocessing, word_tokenization + ) + + def __iter__(self): + for x in self._source: + yield x + + def get_target(self): + return self._target diff --git a/utils_nlp/models/transformers/extractive_summarization.py b/utils_nlp/models/transformers/extractive_summarization.py new file mode 100644 index 000000000..1defdad5c --- /dev/null +++ b/utils_nlp/models/transformers/extractive_summarization.py @@ -0,0 +1,733 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +# This script reuses some code from https://github.com/nlpyang/BertSum + +import itertools +import logging +import os +import random + +import numpy as np +import torch +from torch.utils.data import DataLoader, Dataset, IterableDataset, SequentialSampler + +# from torch.utils.data.distributed import DistributedSampler +from transformers import BertModel, DistilBertModel + +from bertsum.models import data_loader, model_builder +from bertsum.models.data_loader import Batch +from bertsum.models.model_builder import Summarizer +from utils_nlp.common.pytorch_utils import compute_training_steps, get_device +from utils_nlp.dataset.sentence_selection import combination_selection, greedy_selection +from utils_nlp.models.transformers.common import TOKENIZER_CLASS, Transformer + +MODEL_CLASS = {"bert-base-uncased": BertModel, "distilbert-base-uncased": DistilBertModel} + +logger = logging.getLogger(__name__) + + +class Bunch(object): + """ Class which convert a dictionary to an object """ + + def __init__(self, adict): + self.__dict__.update(adict) + + +def get_dataloader(data_iter, shuffle=True, is_labeled=False, batch_size=3000): + """ + Function to get data iterator over a list of data objects. + + Args: + data_iter (generator): data generator. + shuffle (bool): whether the data is shuffled. + is_labeled (bool): specifies whether the data objects are labeled data. + batch_size (int): number of tokens per batch. + + Returns: + DataIterator + """ + + return data_loader.Dataloader(data_iter, batch_size, shuffle=shuffle, is_labeled=is_labeled) + + +def get_dataset(file): + yield torch.load(file) + + +class ExtSumProcessedIterableDataset(IterableDataset): + """Iterable dataset for extractive summarization preprocessed data + """ + + def __init__(self, file_list, is_shuffle=False): + """ Initiation function for iterable dataset for extractive summarization preprocessed data. + + Args: + file_list (list of strings): List of files that the dataset is loaded from. + is_shuffle (bool, optional): A boolean value specifies whether the list of + files is shuffled when the dataset is loaded. Defaults to False. 
+ """ + + self.file_list = file_list + self.is_shuffle = is_shuffle + + def get_stream(self): + """ get a stream of cycled data from the dataset""" + + if self.is_shuffle: + return itertools.chain.from_iterable(map(get_dataset, itertools.cycle(self.file_list))) + else: + return itertools.chain.from_iterable(map(get_dataset, itertools.cycle(random.shuffle(self.file_list)))) + + def __iter__(self): + return self.get_stream() + + +class ExtSumProcessedDataset(Dataset): + """Dataset for extractive summarization preprocessed data + """ + + def __init__(self, file_list, is_shuffle=False): + """ Initiation function for dataset for extractive summarization preprocessed data. + + Args: + file_list (list of strings): List of files that the dataset is loaded from. + is_shuffle (bool, optional): A boolean value specifies whether the list of + files is shuffled when the dataset is loaded. Defaults to False. + """ + + self.file_list = file_list + if is_shuffle: + random.shuffle(file_list) + self.data = [] + for file in file_list: + self.data.extend(torch.load(file)) + + def __len__(self): + return len(self.data) + + def __getitem__(self, idx): + return self.data[idx] + + +def get_pred(example, sent_scores, cal_lead=False, sentence_separator=" ", block_trigram=True, top_n=3): + """ + Get the summarization prediction for the paragraph example based on the scores + returned by the transformer summarization model. + + Args: + example (str): The object with "src_txt" field as the paragraph which requries + summarization. The "src_txt" is a list of strings. + sent_scores (list of floats): List of scores of how likely of the sentence is + included in the summary. + cal_lead (bool, optional): Boolean value which specifies whether the prediction uses + the first few sentences as summary. Defaults to False + sentence_separator (str, optional): Seperator used in the generated summary. + Defaults to ''. + block_trigram (bool, optional): Boolean value which specifies whether the + summary should include any sentence that has the same trigram as the + already selected sentences. Defaults to True. + top_n (int, optional): The maximum number of sentences that the summary + should included. Defaults to 3. + + Returns: + A string which is the summary for the example. 
+ """ + + def _get_ngrams(n, text): + ngram_set = set() + text_length = len(text) + max_index_ngram_start = text_length - n + for i in range(max_index_ngram_start + 1): + ngram_set.add(tuple(text[i : i + n])) + return ngram_set + + def _block_tri(c, p): + tri_c = _get_ngrams(3, c.split()) + for s in p: + tri_s = _get_ngrams(3, s.split()) + if len(tri_c.intersection(tri_s)) > 0: + return True + return False + + selected_ids = np.argsort(-sent_scores) + # selected_ids = np.argsort(-sent_scores, 1) + if cal_lead: + selected_ids = range(len(example["clss"])) + pred = [] + # target = [] + # for i, idx in enumerate(selected_ids): + _pred = [] + if len(example["src_txt"]) == 0: + pred.append("") + for j in selected_ids[: len(example["src_txt"])]: + if j >= len(example["src_txt"]): + continue + candidate = example["src_txt"][j].strip() + if block_trigram: + if not _block_tri(candidate, _pred): + _pred.append(candidate) + else: + _pred.append(candidate) + + # only select the top n + if len(_pred) == top_n: + break + + # _pred = ''.join(_pred) + _pred = sentence_separator.join(_pred) + pred.append(_pred.strip()) + # target.append(example['tgt_txt']) + return pred # , target + + +class ExtSumProcessedData: + """class loaded data preprocessed as in + :class:`utils_nlp.models.transformers.datasets.SummarizationDataset`""" + + @staticmethod + def save_data(data_iter, is_test=False, save_path="./", chunk_size=None): + """ Save the preprocessed data into files with specified chunk size + + Args: + data_iter (iterator): Data iterator returned from + :class:`utils_nlp.models.transformers.datasets.SummarizationDataset` + is_test (bool): Boolean value which indicates whether target data is included. + If it is set True, the file name contains "test", otherwise, + the file name contains "train". Defaults to False. + save_path (str): Directory where the data should be saved. Defaults to "./". + chunk_size (int): The number of examples that should be included in each file. + Defaults to None, which means only one file is used. + + Returns: + a list of strings which are the files the data is saved to. + """ + os.makedirs(save_path, exist_ok=True) + + def _chunks(iterable, chunk_size): + iterator = filter(None, iterable) + for first in iterator: + if chunk_size: + yield itertools.chain([first], itertools.islice(iterator, chunk_size - 1)) + else: + yield itertools.chain([first], itertools.islice(iterator, None)) + + chunks = _chunks(data_iter, chunk_size) + filename_list = [] + for i, chunked_data in enumerate(chunks): + filename = f"{i}_test" if is_test else f"{i}_train" + torch.save(list(chunked_data), os.path.join(save_path, filename)) + filename_list.append(os.path.join(save_path, filename)) + return filename_list + + def _get_files(self, root): + train_files = [] + test_files = [] + files = [os.path.join(root, f) for f in os.listdir(root) if os.path.isfile(os.path.join(root, f))] + for fname in files: + if fname.find("train") != -1: + train_files.append(fname) + elif fname.find("test") != -1: + test_files.append(fname) + + return train_files, test_files + + def splits(self, root): + """Get the train and test dataset from the folder + + Args: + root (str): Directory where the data can be loaded. + + Returns: + Tuple of ExtSumProcessedIterableDataset as train dataset + and ExtSumProcessedDataset as test dataset. 
+ """ + train_files, test_files = self._get_files(root) + return ( + ExtSumProcessedIterableDataset(train_files, is_shuffle=True), + ExtSumProcessedDataset(test_files, is_shuffle=False), + ) + + +class ExtSumProcessor: + """Class for preprocessing extractive summarization data.""" + + def __init__( + self, + model_name="distilbert-base-uncased", + to_lower=False, + cache_dir=".", + max_nsents=200, + max_src_ntokens=2000, + min_nsents=3, + min_src_ntokens=5, + ): + """ Initialize the preprocessor. + + Args: + model_name (str, optional): Transformer model name used in preprocessing. + check MODEL_CLASS for supported models. Defaults to "bert-base-cased". + to_lower (bool, optional): Whether to convert all letters to lower case during + tokenization. This is determined by if a cased model is used. + Defaults to False, which corresponds to a cased model. + cache_dir (str, optional): Directory to cache the tokenizer. Defaults to ".". + max_nsents (int, optional): Max number of sentences that can be used + as input. Defaults to 200. + max_src_ntokens (int, optional): Max number of tokens that be used + as input. Defaults to 2000. + min_nsents (int, optional): Minimum number of sentences that are required + as input. If the input has less number of sentences than this value, + it's skipped and cannot be used as a valid input. Defaults to 3. + min_src_ntokens (int, optional): Minimum number of tokens that are required + as an input sentence.If the input sentence has less number of tokens than + this value, it's skipped and cannot be used as a valid sentence. Defaults to 5. + + """ + self.model_name = model_name + self.tokenizer = TOKENIZER_CLASS[self.model_name].from_pretrained( + self.model_name, do_lower_case=to_lower, cache_dir=cache_dir + ) + self.sep_vid = self.tokenizer.vocab["[SEP]"] + self.cls_vid = self.tokenizer.vocab["[CLS]"] + self.pad_vid = self.tokenizer.vocab["[PAD]"] + + self.max_nsents = max_nsents + self.max_src_ntokens = max_src_ntokens + self.min_nsents = min_nsents + self.min_src_ntokens = min_src_ntokens + + @staticmethod + def list_supported_models(): + return list(TOKENIZER_CLASS.keys()) + + @property + def model_name(self): + return self._model_name + + @model_name.setter + def model_name(self, value): + if value not in self.list_supported_models(): + raise ValueError( + "Model name {} is not supported by ExtSumProcessor. " + "Call 'ExtSumProcessor.list_supported_models()' to get all supported model " + "names.".format(value) + ) + + self._model_name = value + + @staticmethod + def get_inputs(batch, device, model_name, train_mode=True): + """ + Creates an input dictionary given a model name. + + Args: + batch (object): A Batch containing input ids, segment ids, sentence class ids, + masks for the input ids, masks for sentence class ids and source text. + If train_model is True, it also contains the labels and target text. + device (torch.device): A PyTorch device. + model_name (bool, optional): Model name used to format the inputs. + train_mode (bool, optional): Training mode flag. + Defaults to True. + + Returns: + dict: Dictionary containing input ids, segment ids, sentence class ids, + masks for the input ids, masks for the sentence class ids and labels. + Labels are only returned when train_mode is True. 
+ """ + + if model_name.split("-")[0] in ["bert", "distilbert"]: + if train_mode: + batch = batch.to(device) + # labels must be the last + return { + "x": batch.src, + "segs": batch.segs, + "clss": batch.clss, + "mask": batch.mask, + "mask_cls": batch.mask_cls, + "labels": batch.labels, + } + else: + batch = Bunch(batch) + return { + "x": batch.src.to(device), + "segs": batch.segs.to(device), + "clss": batch.clss.to(device), + "mask": batch.mask.to(device), + "mask_cls": batch.mask_cls.to(device), + } + else: + raise ValueError("Model not supported: {}".format(model_name)) + + def preprocess(self, sources, targets=None, oracle_mode="greedy", selections=3): + """preprocess multiple data points + + Args: + sources (list of list of strings): List of word tokenized sentences. + targets (list of list of strings, optional): List of word tokenized sentences. + Defaults to None, which means it doesn't include summary and is + not training data. + oracle_mode (str, optional): Sentence selection method. Defaults to "greedy". + selections (int, optional): The number of sentence used as summary. Defaults to 3. + + Returns: + Iterator of dictory objects containing input ids, segment ids, sentence class ids, + labels, source text and target text. If targets is None, the label and target text + are None. + """ + + if targets is None: + for source in sources: + yield self._preprocess_single(source, None, oracle_mode, selections) + else: + for (source, target) in zip(sources, targets): + yield self._preprocess_single(source, target, oracle_mode, selections) + + def _preprocess_single(self, source, target=None, oracle_mode="greedy", selections=3): + """preprocess single data point""" + + oracle_ids = None + if target is not None: + if oracle_mode == "greedy": + oracle_ids = greedy_selection(source, target, selections) + elif oracle_mode == "combination": + oracle_ids = combination_selection(source, target, selections) + + def _preprocess(src, tgt=None, oracle_ids=None): + + if len(src) == 0: + return None + + original_src_txt = [" ".join(s) for s in src] + + labels = None + if oracle_ids is not None and tgt is not None: + labels = [0] * len(src) + for l in oracle_ids: + labels[l] = 1 + + idxs = [i for i, s in enumerate(src) if (len(s) > self.min_src_ntokens)] + + src = [src[i][: self.max_src_ntokens] for i in idxs] + src = src[: self.max_nsents] + if labels: + labels = [labels[i] for i in idxs] + labels = labels[: self.max_nsents] + + if len(src) < self.min_nsents: + return None + if labels: + if len(labels) == 0: + return None + + src_txt = [" ".join(sent) for sent in src] + # text = [' '.join(ex['src_txt'][i].split()[:self.args.max_src_ntokens]) for i in idxs] + # text = [_clean(t) for t in text] + text = " [SEP] [CLS] ".join(src_txt) + src_subtokens = self.tokenizer.tokenize(text) + src_subtokens = src_subtokens[:510] + src_subtokens = ["[CLS]"] + src_subtokens + ["[SEP]"] + + src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens) + _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid] + segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))] + segments_ids = [] + for i, s in enumerate(segs): + if i % 2 == 0: + segments_ids += s * [0] + else: + segments_ids += s * [1] + cls_ids = [i for i, t in enumerate(src_subtoken_idxs) if t == self.cls_vid] + if labels: + labels = labels[: len(cls_ids)] + + tgt_txt = None + if tgt: + tgt_txt = "".join([" ".join(tt) for tt in tgt]) + src_txt = [original_src_txt[i] for i in idxs] + return src_subtoken_idxs, labels, 
segments_ids, cls_ids, src_txt, tgt_txt + + b_data = _preprocess(source, target, oracle_ids) + + if b_data is None: + return None + indexed_tokens, labels, segments_ids, cls_ids, src_txt, tgt_txt = b_data + return { + "src": indexed_tokens, + "labels": labels, + "segs": segments_ids, + "clss": cls_ids, + "src_txt": src_txt, + "tgt_txt": tgt_txt, + } + + +class ExtractiveSummarizer(Transformer): + """class which performs extractive summarization fine tuning and prediction """ + + def __init__(self, model_name="distilbert-base-uncased", encoder="transformer", cache_dir="."): + """Initialize a ExtractiveSummarizer. + + Args: + model_name (str, optional): Transformer model name used in preprocessing. + check MODEL_CLASS for supported models. Defaults to "distilbert-base-uncased". + encoder (str, optional): Encoder algorithm used by summarization layer. + There are four options: + - baseline: it used a smaller transformer model to replace the bert model + and with transformer summarization layer. + - classifier: it uses pretrained BERT and fine-tune BERT with simple logistic + classification summarization layer. + - transformer: it uses pretrained BERT and fine-tune BERT with transformer + summarization layer. + - RNN: it uses pretrained BERT and fine-tune BERT with LSTM summarization layer. + Defaults to "transformer". + cache_dir (str, optional): Directory to cache the tokenizer. Defaults to ".". + """ + + super().__init__(model_class=MODEL_CLASS, model_name=model_name, num_labels=0, cache_dir=cache_dir) + if model_name not in self.list_supported_models(): + raise ValueError( + "Model name {} is not supported by ExtractiveSummarizer. " + "Call 'ExtractiveSummarizer.list_supported_models()' to get all supported model " + "names.".format(value) + ) + + self.model_class = MODEL_CLASS[model_name] + default_summarizer_layer_parameters = { + "ff_size": 512, + "heads": 4, + "dropout": 0.1, + "inter_layers": 2, + "hidden_size": 128, + "rnn_size": 512, + "param_init": 0.0, + "param_init_glorot": True, + } + + args = Bunch(default_summarizer_layer_parameters) + self.model = Summarizer(encoder, args, self.model_class, model_name, None, cache_dir) + + @staticmethod + def list_supported_models(): + return list(MODEL_CLASS.keys()) + + def fit( + self, + train_dataset, + num_gpus=None, + gpu_ids=None, + batch_size=3000, + local_rank=-1, + max_steps=5e5, + warmup_steps=1e5, + learning_rate=2e-3, + optimization_method="adam", + max_grad_norm=0, + beta1=0.9, + beta2=0.999, + decay_method="noam", + gradient_accumulation_steps=2, + report_every=50, + verbose=True, + seed=None, + **kwargs, + ): + """ + Fine-tune pre-trained transofmer models for extractive summarization. + + Args: + train_dataset (ExtSumProcessedIterableDataset): Training dataset. + num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will + be used. If set to 0 or GPUs are not available, CPU device will + be used. Defaults to None. + gpu_ids (list): List of GPU IDs to be used. + If set to None, the first num_gpus GPUs will be used. + Defaults to None. + batch_size (int, optional): Maximum number of tokens in each batch. + local_rank (int, optional): Local_rank for distributed training on GPUs. Defaults to + -1, which means non-distributed training. + max_steps (int, optional): Maximum number of training steps. Defaults to 5e5. + warmup_steps (int, optional): Number of steps taken to increase learning rate from 0 + to `learning_rate`. Defaults to 1e5. 
+            learning_rate (float, optional): Learning rate of the optimizer. Defaults to
+                2e-3.
+            optimization_method (string, optional): Optimization method used in fine-tuning.
+                Defaults to "adam".
+            max_grad_norm (float, optional): Maximum gradient norm for gradient clipping.
+                Defaults to 0.
+            gradient_accumulation_steps (int, optional): Number of batches to accumulate
+                gradients on between each model parameter update. Defaults to 2.
+            decay_method (string, optional): Learning rate decay method. Defaults to 'noam'.
+            report_every (int, optional): The interval by steps to print out the training log.
+                Defaults to 50.
+            beta1 (float, optional): The exponential decay rate for the first moment estimates.
+                Defaults to 0.9.
+            beta2 (float, optional): The exponential decay rate for the second-moment estimates.
+                This value should be set close to 1.0 on problems with a sparse gradient.
+                Defaults to 0.999.
+            verbose (bool, optional): Whether to print out the training log. Defaults to True.
+            seed (int, optional): Random seed used to improve reproducibility. Defaults to None.
+        """
+
+        # init optimizer
+        optimizer = model_builder.build_optim(
+            optimization_method,
+            learning_rate,
+            max_grad_norm,
+            beta1,
+            beta2,
+            decay_method,
+            warmup_steps,
+            self.model,
+            None,
+        )
+
+        # batch_size is the number of tokens in a batch
+        train_dataloader = get_dataloader(train_dataset.get_stream(), is_labeled=True, batch_size=batch_size)
+
+        # compute the max number of training steps
+        max_steps = compute_training_steps(
+            train_dataloader, max_steps=max_steps, gradient_accumulation_steps=gradient_accumulation_steps,
+        )
+
+        super().fine_tune(
+            train_dataloader=train_dataloader,
+            get_inputs=ExtSumProcessor.get_inputs,
+            num_gpus=num_gpus,
+            gpu_ids=gpu_ids,
+            max_steps=max_steps,
+            max_grad_norm=max_grad_norm,
+            gradient_accumulation_steps=gradient_accumulation_steps,
+            optimizer=optimizer,
+            scheduler=None,
+            verbose=verbose,
+            seed=seed,
+            report_every=report_every,
+            clip_grad_norm=False,
+        )
+
+    def predict(
+        self,
+        test_dataset,
+        num_gpus=1,
+        gpu_ids=None,
+        batch_size=16,
+        sentence_separator="",
+        top_n=3,
+        block_trigram=True,
+        cal_lead=False,
+        verbose=True,
+    ):
+        """
+        Predict the summarization for the input data iterator.
+
+        Args:
+            test_dataset (Dataset): Dataset for which the summary is to be predicted.
+            num_gpus (int, optional): The number of GPUs used in prediction. Defaults to 1.
+            gpu_ids (list): List of GPU IDs to be used.
+                If set to None, the first num_gpus GPUs will be used.
+                Defaults to None.
+            batch_size (int, optional): The number of test examples in each batch. Defaults to 16.
+            sentence_separator (str, optional): String to be inserted between sentences in
+                the prediction. Defaults to ''.
+            top_n (int, optional): The number of sentences that should be selected
+                from the paragraph as the summary. Defaults to 3.
+            block_trigram (bool, optional): Boolean value which specifies whether
+                the summary should exclude any sentence that has the same trigram
+                as an already selected sentence. Defaults to True.
+            cal_lead (bool, optional): Boolean value which specifies whether the
+                prediction uses the first few sentences as the summary. Defaults to False.
+            verbose (bool, optional): Whether to print out the prediction log. Defaults to True.
+
+        Returns:
+            List of strings which are the summaries.
+
+        """
+
+        def collate_fn(dict_list):
+            if dict_list is None or len(dict_list) <= 0:
+                return None
+            is_labeled = False
+            if "labels" in dict_list[0]:
+                is_labeled = True
+            tuple_batch = [list(d.values()) for d in dict_list]
+            # generate mask and mask_cls, and only select tensors for the model input;
+            # pass through whether the batch actually carries labels
+            batch = Batch(tuple_batch, is_labeled=is_labeled)
+            if is_labeled:
+                return {
+                    "src": batch.src,
+                    "segs": batch.segs,
+                    "clss": batch.clss,
+                    "mask": batch.mask,
+                    "mask_cls": batch.mask_cls,
+                    "labels": batch.labels,
+                }
+            else:
+                return {
+                    "src": batch.src,
+                    "segs": batch.segs,
+                    "clss": batch.clss,
+                    "mask": batch.mask,
+                    "mask_cls": batch.mask_cls,
+                }
+
+        test_sampler = SequentialSampler(test_dataset)
+        test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=batch_size, collate_fn=collate_fn)
+        sent_scores = self.predict_scores(test_dataloader, num_gpus=num_gpus, gpu_ids=gpu_ids)
+        sent_scores_list = list(sent_scores)
+        scores_list = []
+        for i in sent_scores_list:
+            scores_list.extend(i)
+        prediction = []
+        for i in range(len(test_dataset)):
+            temp_pred = get_pred(
+                test_dataset[i],
+                scores_list[i],
+                cal_lead=cal_lead,
+                sentence_separator=sentence_separator,
+                block_trigram=block_trigram,
+                top_n=top_n,
+            )
+            prediction.extend(temp_pred)
+        return prediction
+
+    def predict_scores(self, test_dataloader, num_gpus=1, gpu_ids=None, verbose=True):
+        """
+        Scores a dataset using a fine-tuned model and a given dataloader.
+
+        Args:
+            test_dataloader (Dataloader): Dataloader for scoring the data.
+            num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will
+                be used. If set to 0 or GPUs are not available, CPU device will be used.
+                Defaults to 1.
+            gpu_ids (list): List of GPU IDs to be used.
+                If set to None, the first num_gpus GPUs will be used.
+                Defaults to None.
+            verbose (bool, optional): Whether to print out the scoring log. Defaults to True.
+
+        Returns:
+            1darray: Numpy array of predicted sentence scores.
+        """
+
+        device, num_gpus = get_device(num_gpus=num_gpus, local_rank=-1)
+
+        preds = list(
+            super().predict(
+                eval_dataloader=test_dataloader,
+                get_inputs=ExtSumProcessor.get_inputs,
+                num_gpus=num_gpus,
+                gpu_ids=gpu_ids,
+                verbose=verbose,
+            )
+        )
+        return preds
+
+    def save_model(self, name):
+        output_model_dir = os.path.join(self.cache_dir, "fine_tuned")
+
+        os.makedirs(self.cache_dir, exist_ok=True)
+        os.makedirs(output_model_dir, exist_ok=True)
+
+        full_name = os.path.join(output_model_dir, name)
+        logger.info("Saving model checkpoint to %s", full_name)
+        # save to the full path under the cache directory, matching the log message
+        torch.save(self.model, full_name)
diff --git a/utils_nlp/models/transformers/named_entity_recognition.py b/utils_nlp/models/transformers/named_entity_recognition.py
old mode 100644
new mode 100755
index 9e11e3e14..76d0b5e37
--- a/utils_nlp/models/transformers/named_entity_recognition.py
+++ b/utils_nlp/models/transformers/named_entity_recognition.py
@@ -2,20 +2,20 @@
 # Licensed under the MIT License.
 
 import logging
+from collections import Iterable
+
 import numpy as np
 import torch
-import torch.nn as nn
-
-from collections import Iterable
 from torch.utils.data import TensorDataset
 from transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BertForTokenClassification
 
-from utils_nlp.common.pytorch_utils import get_device
-from utils_nlp.models.transformers.common import MAX_SEQ_LEN, TOKENIZER_CLASS, Transformer
-from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
-from torch.utils.data.distributed import DistributedSampler
+from transformers.modeling_distilbert import DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DistilBertForTokenClassification
+from utils_nlp.common.pytorch_utils import compute_training_steps
+from utils_nlp.models.transformers.common import MAX_SEQ_LEN, TOKENIZER_CLASS, Transformer
 
-TC_MODEL_CLASS = {k: BertForTokenClassification for k in BERT_PRETRAINED_MODEL_ARCHIVE_MAP}
+TC_MODEL_CLASS = {}
+TC_MODEL_CLASS.update({k: BertForTokenClassification for k in BERT_PRETRAINED_MODEL_ARCHIVE_MAP})
+TC_MODEL_CLASS.update({k: DistilBertForTokenClassification for k in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP})
 
 
 class TokenClassificationProcessor:
@@ -40,27 +40,36 @@ def __init__(self, model_name="bert-base-cased", to_lower=False, cache_dir="."):
         )
 
     @staticmethod
-    def get_inputs(batch, model_name, train_mode=True):
+    def get_inputs(batch, device, model_name, train_mode=True):
         """
-        Produce a dictionary object for model training or prediction.
+        Creates an input dictionary given a model name.
 
         Args:
-            model_name (str): The pretained model name.
-            train_mode (bool, optional): Whether it's for model training. Set it to False if
-                it's for testing and it won't have the 'labels' data field.
-                Defaults to True, for model training.
+            batch (tuple): A tuple containing input ids, attention mask,
+                segment ids, and labels tensors.
+            device (torch.device): A PyTorch device.
+            model_name (str): Model name used to format the inputs.
+            train_mode (bool, optional): Training mode flag.
+                Defaults to True.
 
         Returns:
-            dict: A dictionary object contains all needed information for training or testing.
+            dict: Dictionary containing input ids, segment ids, masks, and labels.
+                Labels are only returned when train_mode is True.
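+
+            Example (illustrative; assumes ``batch`` comes from a DataLoader over
+                the TensorDataset produced by ``preprocess_for_bert``):
+                >>> inputs = TokenClassificationProcessor.get_inputs(batch, device, "bert-base-cased")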
""" + batch = tuple(t.to(device) for t in batch) + if model_name.split("-")[0] in ["bert", "distilbert"]: + if train_mode: + inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} + else: + inputs = {"input_ids": batch[0], "attention_mask": batch[1]} - if model_name.split("-")[0] not in ["bert"]: - raise ValueError("Model not supported: {}".format(model_name)) + # distilbert doesn't support segment ids + if model_name.split("-")[0] not in ["distilbert"]: + inputs["token_type_ids"] = batch[2] - if train_mode: - return {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} + return inputs else: - return {"input_ids": batch[0], "attention_mask": batch[1]} + raise ValueError("Model not supported: {}".format(model_name)) @staticmethod def create_label_map(label_lists, trailing_piece_tag="X"): @@ -87,9 +96,7 @@ def create_label_map(label_lists, trailing_piece_tag="X"): label_map[trailing_piece_tag] = len(label_set) return label_map - def preprocess_for_bert( - self, text, max_len=MAX_SEQ_LEN, labels=None, label_map=None, trailing_piece_tag="X" - ): + def preprocess_for_bert(self, text, max_len=MAX_SEQ_LEN, labels=None, label_map=None, trailing_piece_tag="X"): """ Tokenize and preprocesses input word lists, involving the following steps 0. WordPiece tokenization. @@ -123,7 +130,7 @@ def preprocess_for_bert( Returns: TensorDataset: A TensorDataset containing the following four tensors. 1. input_ids_all: Tensor. Each sublist contains numerical values, - i.e. token ids, corresponding to the tokens in the input + i.e. token ids, corresponding to the tokens in the input text data. 2. input_mask_all: Tensor. Each sublist contains the attention mask of the input token id list, 1 for input tokens and 0 for @@ -144,9 +151,7 @@ def _is_iterable_but_not_string(obj): return isinstance(obj, Iterable) and not isinstance(obj, str) if max_len > MAX_SEQ_LEN: - logging.warning( - "Setting max_len to max allowed sequence length: {}".format(MAX_SEQ_LEN) - ) + logging.warning("Setting max_len to max allowed sequence length: {}".format(MAX_SEQ_LEN)) max_len = MAX_SEQ_LEN if not _is_iterable_but_not_string(text): @@ -179,9 +184,7 @@ def _is_iterable_but_not_string(obj): for t, t_labels in zip(text, labels): if len(t) != len(t_labels): raise ValueError( - "The number of words is {0}, but the number of labels is {1}.".format( - len(t), len(t_labels) - ) + "The number of words is {0}, but the number of labels is {1}.".format(len(t), len(t_labels)) ) new_labels = [] @@ -195,11 +198,7 @@ def _is_iterable_but_not_string(obj): new_tokens.append(sub_word) if len(new_tokens) > max_len: - logging.warn( - "Text after tokenization with length {} has been truncated".format( - len(new_tokens) - ) - ) + logging.warn("Text after tokenization with length {} has been truncated".format(len(new_tokens))) new_tokens = new_tokens[:max_len] new_labels = new_labels[:max_len] input_ids = self.tokenizer.convert_tokens_to_ids(new_tokens) @@ -216,9 +215,7 @@ def _is_iterable_but_not_string(obj): input_mask += padding new_labels += label_padding - trailing_token_mask_all.append( - [True if label != trailing_piece_tag else False for label in new_labels] - ) + trailing_token_mask_all.append([True if label != trailing_piece_tag else False for label in new_labels]) if label_map: label_ids = [label_map[label] for label in new_labels] @@ -233,32 +230,17 @@ def _is_iterable_but_not_string(obj): td = TensorDataset( torch.tensor(input_ids_all, dtype=torch.long), torch.tensor(input_mask_all, dtype=torch.long), - 
torch.tensor(trailing_token_mask_all, dtype=torch.bool), + torch.tensor(trailing_token_mask_all, dtype=torch.long), torch.tensor(label_ids_all, dtype=torch.long), ) else: td = TensorDataset( torch.tensor(input_ids_all, dtype=torch.long), torch.tensor(input_mask_all, dtype=torch.long), - torch.tensor(trailing_token_mask_all, dtype=torch.bool), + torch.tensor(trailing_token_mask_all, dtype=torch.long), ) return td - def create_dataloader_from_dataset( - self, dataset, shuffle=False, batch_size=32, num_gpus=None, distributed=False - ): - if num_gpus is None: - num_gpus = torch.cuda.device_count() - - batch_size = batch_size * max(1, num_gpus) - - if distributed: - sampler = DistributedSampler(dataset) - else: - sampler = RandomSampler(dataset) if shuffle else SequentialSampler(dataset) - - return DataLoader(dataset, sampler=sampler, batch_size=batch_size) - class TokenClassifier(Transformer): """ @@ -275,10 +257,7 @@ class TokenClassifier(Transformer): def __init__(self, model_name="bert-base-cased", num_labels=2, cache_dir="."): super().__init__( - model_class=TC_MODEL_CLASS, - model_name=model_name, - num_labels=num_labels, - cache_dir=cache_dir, + model_class=TC_MODEL_CLASS, model_name=model_name, num_labels=num_labels, cache_dir=cache_dir, ) @staticmethod @@ -289,7 +268,10 @@ def fit( self, train_dataloader, num_epochs=1, + max_steps=-1, + gradient_accumulation_steps=1, num_gpus=None, + gpu_ids=None, local_rank=-1, weight_decay=0.0, learning_rate=5e-5, @@ -299,73 +281,96 @@ def fit( seed=None, ): """ - Fit the TokenClassifier model using the given training dataset. + Fine-tunes a pre-trained token classification model. Args: - train_dataloader (DataLoader): DataLoader instance for training. - num_epochs (int, optional): Number of training epochs. - Defaults to 1. + train_dataloader (Dataloader): A PyTorch DataLoader to be used for training. + num_epochs (int, optional): Number of training epochs. Defaults to 1. + max_steps (int, optional): Total number of training steps. + If set to a positive value, it overrides num_epochs. + Otherwise, it's determined by the dataset length, gradient_accumulation_steps, and num_epochs. + Defualts to -1. + gradient_accumulation_steps (int, optional): Number of steps to accumulate + before performing a backward/update pass. + Default to 1. num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will - be used. If set to 0 or GPUs are not available, CPU device will - be used. Defaults to None. - local_rank (int, optional): Whether need to do distributed training. - Defaults to -1, no distributed training. - weight_decay (float, optional): Weight decay rate. - Defaults to 0. - learning_rate (float, optional): The learning rate. - Defaults to 5e-5. - adam_espilon (float, optional): The 'eps' parameter for the 'AdamW' optimizer. - Defaults to 1e-8. - warmup_steps (int, optional): Number of warmup steps for 'WarmupLinearSchedule'. - Defaults to 0. - verbose (bool, optional): Verbose model. - Defaults to False. - seed (int, optional): The seed for the transformers. - Defaults to None, use the default seed. + be used. If set to 0 or GPUs are not available, CPU device will be used. + Defaults to None. + gpu_ids (list): List of GPU IDs to be used. + If set to None, the first num_gpus GPUs will be used. + Defaults to None. + local_rank (int, optional): Local_rank for distributed training on GPUs. Defaults to + -1, which means non-distributed training. + weight_decay (float, optional): Weight decay to apply after each parameter update. 
+                Defaults to 0.0.
+            learning_rate (float, optional): Learning rate of the AdamW optimizer. Defaults to
+                5e-5.
+            adam_epsilon (float, optional): Epsilon of the AdamW optimizer. Defaults to 1e-8.
+            warmup_steps (int, optional): Number of steps taken to increase learning rate from 0
+                to `learning_rate`. Defaults to 0.
+            verbose (bool, optional): Whether to print out the training log. Defaults to True.
+            seed (int, optional): Random seed used to improve reproducibility. Defaults to None.
         """
 
+        # init optimizer
+        optimizer = Transformer.get_default_optimizer(self.model, weight_decay, learning_rate, adam_epsilon)
+
+        # compute the max number of training steps
+        max_steps = compute_training_steps(
+            train_dataloader,
+            num_epochs=num_epochs,
+            max_steps=max_steps,
+            gradient_accumulation_steps=gradient_accumulation_steps,
+        )
+
+        # init scheduler
+        scheduler = Transformer.get_default_scheduler(
+            optimizer=optimizer, warmup_steps=warmup_steps, num_training_steps=max_steps,
+        )
+
+        # fine tune
         super().fine_tune(
             train_dataloader=train_dataloader,
             get_inputs=TokenClassificationProcessor.get_inputs,
-            n_gpu=num_gpus,
-            num_train_epochs=num_epochs,
-            weight_decay=weight_decay,
-            learning_rate=learning_rate,
-            adam_epsilon=adam_epsilon,
-            warmup_steps=warmup_steps,
+            num_gpus=num_gpus,
+            gpu_ids=gpu_ids,
+            max_steps=max_steps,
+            gradient_accumulation_steps=gradient_accumulation_steps,
+            optimizer=optimizer,
+            scheduler=scheduler,
+            local_rank=local_rank,
             verbose=verbose,
             seed=seed,
         )
 
-    def predict(self, eval_dataloader, num_gpus=None, verbose=True):
+    def predict(self, test_dataloader, num_gpus=None, gpu_ids=None, verbose=True):
         """
-        Test on an evaluation dataset and get the token label predictions.
+        Scores a dataset using a fine-tuned model and a given dataloader.
 
         Args:
-            eval_dataset (TensorDataset): A TensorDataset for evaluation.
+            test_dataloader (DataLoader): DataLoader for scoring the data.
             num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will
-                be used. If set to 0 or GPUs are not available, CPU device will
-                be used. Defaults to None.
-            verbose (bool, optional): Verbose model.
-                Defaults to False.
-
-        Returns:
-            ndarray: Numpy ndarray of raw predictions. The shape of the ndarray is
-                [number_of_examples, sequence_length, number_of_labels]. Each
-                value in the ndarray is not normalized. Post-process will be needed
-                to get the probability for each class label.
+                be used. If set to 0 or GPUs are not available, CPU device will be used.
+                Defaults to None.
+            gpu_ids (list): List of GPU IDs to be used.
+                If set to None, the first num_gpus GPUs will be used.
+                Defaults to None.
+            verbose (bool, optional): Whether to print out the training log. Defaults to True.
+
+        Returns:
+            1darray: numpy array of predicted label indices.
         """
 
         preds = list(
             super().predict(
-                eval_dataloader=eval_dataloader,
+                eval_dataloader=test_dataloader,
                 get_inputs=TokenClassificationProcessor.get_inputs,
-                n_gpu=num_gpus,
+                num_gpus=num_gpus,
+                gpu_ids=gpu_ids,
                 verbose=verbose,
             )
         )
-        preds_np = np.concatenate(preds)
-        return preds_np
+        return np.concatenate(preds)
 
     def get_predicted_token_labels(self, predictions, label_map, dataset):
         """
@@ -374,21 +379,19 @@ def get_predicted_token_labels(self, predictions, label_map, dataset):
         Args:
             predictions (ndarray): A numpy ndarray produced from the `predict` function call.
                 The shape of the ndarray is [number_of_examples, sequence_length, number_of_labels].
-            label_map (dict): A dictionary object to map a label (str) to an ID (int). 
+            label_map (dict): A dictionary object to map a label (str) to an ID (int).
             dataset (TensorDataset): The TensorDataset for evaluation.
             dataset (Dataset): The test Dataset instance.
 
         Returns:
             list: A list of lists. The size of the retured list is the number of testing samples.
-                Each sublist represents the predicted label for each token.
+               Each sublist represents the predicted label for each token.
         """
 
         num_samples = len(dataset.tensors[0])
         if num_samples != predictions.shape[0]:
             raise ValueError(
-                "Predictions have {0} samples, but got {1} samples in dataset".format(
-                    predictions.shape[0], num_samples
-                )
+                "Predictions have {0} samples, but got {1} samples in dataset".format(predictions.shape[0], num_samples)
             )
 
         label_id2str = {v: k for k, v in label_map.items()}
@@ -407,7 +410,7 @@ def get_predicted_token_labels(self, predictions, label_map, dataset):
                 if attention_mask[sid] == 0:
                     break
 
-                if not trailing_mask[sid]:
+                if not bool(trailing_mask[sid]):
                     continue
 
                 label_id = seq_probs[sid].argmax()
@@ -420,13 +423,13 @@ def get_true_test_labels(self, label_map, dataset):
         Get the true testing label values.
 
         Args:
-            label_map (dict): A dictionary object to map a label (str) to an ID (int). 
+            label_map (dict): A dictionary object to map a label (str) to an ID (int).
             dataset (TensorDataset): The TensorDataset for evaluation.
             dataset (Dataset): The test Dataset instance.
 
         Returns:
             list: A list of lists. The size of the retured list is the number of testing samples.
-                Each sublist represents the predicted label for each token.
+               Each sublist represents the predicted label for each token.
         """
 
         num_samples = len(dataset.tensors[0])
diff --git a/utils_nlp/models/transformers/question_answering.py b/utils_nlp/models/transformers/question_answering.py
old mode 100644
new mode 100755
index 4f48e58d9..c0415a579
--- a/utils_nlp/models/transformers/question_answering.py
+++ b/utils_nlp/models/transformers/question_answering.py
@@ -17,38 +17,30 @@
 # Modifications copyright © Microsoft Corporation
 
-import os
-import logging
-from tqdm import tqdm
 import collections
 import json
+import logging
 import math
-import jsonlines
+import os
 
+import jsonlines
 import torch
-from torch.utils.data import TensorDataset, SequentialSampler, DataLoader, RandomSampler
-from torch.utils.data.distributed import DistributedSampler
-
-from transformers.tokenization_bert import BasicTokenizer, whitespace_tokenize
+from torch.utils.data import TensorDataset
+from tqdm import tqdm
+from transformers.modeling_albert import ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP, AlbertForQuestionAnswering
 from transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BertForQuestionAnswering
-from transformers.modeling_xlnet import (
-    XLNET_PRETRAINED_MODEL_ARCHIVE_MAP,
-    XLNetForQuestionAnswering,
-)
-from transformers.modeling_distilbert import (
-    DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
-    DistilBertForQuestionAnswering,
-)
+from transformers.modeling_distilbert import DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DistilBertForQuestionAnswering
+from transformers.modeling_xlnet import XLNET_PRETRAINED_MODEL_ARCHIVE_MAP, XLNetForQuestionAnswering
+from transformers.tokenization_bert import BasicTokenizer, whitespace_tokenize
 
-from utils_nlp.common.pytorch_utils import get_device
+from utils_nlp.common.pytorch_utils import compute_training_steps, get_device, move_model_to_device
 from utils_nlp.models.transformers.common import MAX_SEQ_LEN, TOKENIZER_CLASS, Transformer
 
 MODEL_CLASS = {}
 MODEL_CLASS.update({k: BertForQuestionAnswering for k in BERT_PRETRAINED_MODEL_ARCHIVE_MAP})
 MODEL_CLASS.update({k: XLNetForQuestionAnswering for k in XLNET_PRETRAINED_MODEL_ARCHIVE_MAP})
-MODEL_CLASS.update(
-    {k: DistilBertForQuestionAnswering for k in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP}
-)
+MODEL_CLASS.update({k: DistilBertForQuestionAnswering for k in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP})
+MODEL_CLASS.update({k: AlbertForQuestionAnswering for k in ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP})
 
 # cached files during preprocessing
 # these are used in postprocessing to generate the final answer texts
@@ -85,9 +77,7 @@ class QAProcessor:
         cache_dir (str, optional): Directory to cache the tokenizer. Defaults to ".".
     """
 
-    def __init__(
-        self, model_name="bert-base-cased", to_lower=False, custom_tokenize=None, cache_dir="."
-    ):
+    def __init__(self, model_name="bert-base-cased", to_lower=False, custom_tokenize=None, cache_dir="."):
         self.model_name = model_name
         self.tokenizer = TOKENIZER_CLASS[model_name].from_pretrained(
             model_name, do_lower_case=to_lower, cache_dir=cache_dir, output_loading_info=False
@@ -116,13 +106,14 @@ def model_type(self):
         return self._model_type
 
     @staticmethod
-    def get_inputs(batch, model_name, train_mode=True):
+    def get_inputs(batch, device, model_name, train_mode=True):
         """
         Creates an input dictionary given a model name.
 
         Args:
             batch (tuple): A tuple containing input ids, attention mask,
                 segment ids, and labels tensors.
+            device (torch.device): A PyTorch device.
             model_name (bool, optional): Model name used to format the inputs.
             train_mode (bool, optional): Training mode flag.
                 Defaults to True.
@@ -131,6 +122,7 @@ def get_inputs(batch, model_name, train_mode=True):
             dict: Dictionary containing input ids, segment ids, masks, and labels.
                 Labels are only returned when train_mode is True.
         """
+        batch = tuple(t.to(device) for t in batch)
         model_type = model_name.split("-")[0]
         inputs = {"input_ids": batch[0], "attention_mask": batch[1]}
 
@@ -191,6 +183,8 @@ def preprocess(
             directory. These files are required during postprocessing to generate the final
             answer texts from predicted answer start and answer end indices. Defaults to
             "./cached_qa_features".
+        Returns:
+            TensorDataset: A PyTorch TensorDataset.
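+
+        Example (illustrative; the argument names besides the examples are
+            assumptions, not the full documented signature):
+            >>> qa_dataset = qa_processor.preprocess(qa_examples, is_training=True)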
""" if not os.path.exists(feature_cache_dir): @@ -223,9 +217,7 @@ def preprocess( qa_examples.append(qa_example_cur) - qa_examples_json.append( - {"qa_id": qa_example_cur.qa_id, "doc_tokens": qa_example_cur.doc_tokens} - ) + qa_examples_json.append({"qa_id": qa_example_cur.qa_id, "doc_tokens": qa_example_cur.doc_tokens}) features_cur = _create_qa_features( qa_example_cur, @@ -271,28 +263,13 @@ def preprocess( start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long) end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long) qa_dataset = TensorDataset( - input_ids, - input_mask, - segment_ids, - start_positions, - end_positions, - cls_index, - p_mask, + input_ids, input_mask, segment_ids, start_positions, end_positions, cls_index, p_mask, ) else: unique_id_all = torch.tensor(unique_id_all, dtype=torch.long) - qa_dataset = TensorDataset( - input_ids, input_mask, segment_ids, cls_index, p_mask, unique_id_all - ) - - if num_gpus is not None: - batch_size = batch_size * max(1, num_gpus) - if distributed: - sampler = DistributedSampler(qa_dataset) - else: - sampler = RandomSampler(qa_dataset) if is_training else SequentialSampler(qa_dataset) + qa_dataset = TensorDataset(input_ids, input_mask, segment_ids, cls_index, p_mask, unique_id_all) - return DataLoader(qa_dataset, sampler=sampler, batch_size=batch_size) + return qa_dataset def postprocess( self, @@ -420,14 +397,7 @@ class QAResult(QAResult_): QAResultExtended_ = collections.namedtuple( "QAResultExtended", - [ - "unique_id", - "start_top_log_probs", - "start_top_index", - "end_top_log_probs", - "end_top_index", - "cls_logits", - ], + ["unique_id", "start_top_log_probs", "start_top_index", "end_top_log_probs", "end_top_index", "cls_logits",], ) @@ -489,18 +459,16 @@ def list_supported_models(): def fit( self, train_dataloader, - num_gpus=None, num_epochs=1, - learning_rate=5e-5, - max_grad_norm=1.0, max_steps=-1, gradient_accumulation_steps=1, - warmup_steps=0, + num_gpus=None, + gpu_ids=None, + local_rank=-1, weight_decay=0.0, + learning_rate=5e-5, adam_epsilon=1e-8, - fp16=False, - fp16_opt_level="O1", - local_rank=-1, + warmup_steps=0, verbose=True, seed=None, cache_model=True, @@ -509,31 +477,30 @@ def fit( Fine-tune pre-trained transofmer models for question answering. Args: - train_dataloader (Dataloader): Dataloader for the training data. - num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will - be used. If set to 0 or GPUs are not available, CPU device will - be used. Defaults to None. + train_dataloader (Dataloader): A PyTorch DataLoader to be used for training. num_epochs (int, optional): Number of training epochs. Defaults to 1. + max_steps (int, optional): Total number of training steps. + If set to a positive value, it overrides num_epochs. + Otherwise, it's determined by the dataset length, gradient_accumulation_steps, and num_epochs. + Defualts to -1. + gradient_accumulation_steps (int, optional): Number of steps to accumulate + before performing a backward/update pass. + Default to 1. + num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will + be used. If set to 0 or GPUs are not available, CPU device will be used. + Defaults to None. + gpu_ids (list): List of GPU IDs to be used. + If set to None, the first num_gpus GPUs will be used. + Defaults to None. + local_rank (int, optional): Local_rank for distributed training on GPUs. Defaults to + -1, which means non-distributed training. 
+            weight_decay (float, optional): Weight decay to apply after each parameter update.
+                Defaults to 0.0.
             learning_rate (float, optional): Learning rate of the AdamW optimizer. Defaults to
                 5e-5.
-            max_grad_norm (float, optional): Maximum gradient norm for gradient clipping.
-                Defaults to 1.0.
-            max_steps (int, optional): Maximum number of training steps. If specified,
-                `num_epochs` will be ignored. Defaults to -1.
-            gradient_accumulation_steps (int, optional): Number of batches to accumulate
-                gradients on between each model parameter update. Defaults to 1.
+            adam_epsilon (float, optional): Epsilon of the AdamW optimizer. Defaults to 1e-8.
             warmup_steps (int, optional): Number of steps taken to increase learning rate from 0
                 to `learning rate`. Defaults to 0.
-            weight_decay (float, optional): Weight decay to apply after each parameter update.
-                Defaults to 0.0.
-            adam_epsilon (float, optional): Epsilon of the AdamW optimizer. Defaults to 1e-8.
-            fp16 (bool, optional): Whether to use 16-bit (mixed) precision (through NVIDIA apex)
-                instead of 32-bit. Defaults to False.
-            fp16_opt_level (str, optional): For fp16: Apex AMP optimization level selected in
-                ['O0', 'O1', 'O2', and 'O3']. See details at https://nvidia.github.io/apex/amp.html.
-                Defaults to "O1",
-            local_rank (int, optional): Local_rank for distributed training on GPUs. Defaults to
-                -1, which means non-distributed training.
             verbose (bool, optional): Whether to print out the training log. Defaults to True.
             seed (int, optional): Random seed used to improve reproducibility. Defaults to None.
             cache_model (bool, optional): Whether to save the fine-tuned model. If True,
@@ -542,39 +509,53 @@
         """
 
+        # init optimizer
+        optimizer = Transformer.get_default_optimizer(self.model, weight_decay, learning_rate, adam_epsilon)
+
+        # compute the max number of training steps
+        max_steps = compute_training_steps(
+            train_dataloader,
+            num_epochs=num_epochs,
+            max_steps=max_steps,
+            gradient_accumulation_steps=gradient_accumulation_steps,
+        )
+
+        # init scheduler
+        scheduler = Transformer.get_default_scheduler(
+            optimizer=optimizer, warmup_steps=warmup_steps, num_training_steps=max_steps,
+        )
+
+        # fine tune
         super().fine_tune(
             train_dataloader=train_dataloader,
             get_inputs=QAProcessor.get_inputs,
+            num_gpus=num_gpus,
+            gpu_ids=gpu_ids,
             max_steps=max_steps,
-            num_train_epochs=num_epochs,
-            max_grad_norm=max_grad_norm,
             gradient_accumulation_steps=gradient_accumulation_steps,
-            n_gpu=num_gpus,
-            weight_decay=weight_decay,
-            learning_rate=learning_rate,
-            adam_epsilon=adam_epsilon,
-            warmup_steps=warmup_steps,
-            fp16=fp16,
-            fp16_opt_level=fp16_opt_level,
+            optimizer=optimizer,
+            scheduler=scheduler,
             local_rank=local_rank,
             verbose=verbose,
             seed=seed,
         )
+
         if cache_model:
             self.save_model()
 
-    def predict(self, test_dataloader, num_gpus=None, verbose=True):
+    def predict(self, test_dataloader, num_gpus=None, gpu_ids=None, verbose=True):
         """
         Predicts answer start and end logits.
 
         Args:
-            test_dataloader (QADataset): Dataloader for the testing data.
+            test_dataloader (DataLoader): DataLoader for scoring the data.
             num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will
                 be used. If set to 0 or GPUs are not available, CPU device will be used.
                 Defaults to None.
-            local_rank (int, optional): Local_rank for distributed training on GPUs. Defaults to
-                -1, which means non-distributed.
+            gpu_ids (list): List of GPU IDs to be used.
+                If set to None, the first num_gpus GPUs will be used.
+                Defaults to None.
             verbose (bool, optional): Whether to print out the predicting log. Defaults to True.
 
         Returns:
@@ -584,25 +565,16 @@ def predict(self, test_dataloader, num_gpus=None, verbose=True):
         def _to_list(tensor):
             return tensor.detach().cpu().tolist()
 
+        # get device
         device, num_gpus = get_device(num_gpus=num_gpus, local_rank=-1)
-
-        if isinstance(self.model, torch.nn.DataParallel):
-            self.model = self.model.module
-
-        if num_gpus > 1:
-            self.model = torch.nn.DataParallel(self.model, device_ids=list(range(num_gpus)))
-
-        self.model.to(device)
-        self.model.eval()
+        # move model
+        self.model = move_model_to_device(self.model, device, num_gpus, gpu_ids, local_rank=-1)
 
         all_results = []
         for batch in tqdm(test_dataloader, desc="Evaluating", disable=not verbose):
-            batch = tuple(t.to(device) for t in batch)
             with torch.no_grad():
-                inputs = QAProcessor.get_inputs(batch, self.model_name, train_mode=False)
-
+                inputs = QAProcessor.get_inputs(batch, device, self.model_name, train_mode=False)
                 outputs = self.model(**inputs)
-
                 unique_id_tensor = batch[5]
 
                 for i, u_id in enumerate(unique_id_tensor):
@@ -617,9 +589,7 @@ def _to_list(tensor):
                         )
                     else:
                         result = QAResult(
-                            unique_id=u_id.item(),
-                            start_logits=_to_list(outputs[0][i]),
-                            end_logits=_to_list(outputs[1][i]),
+                            unique_id=u_id.item(), start_logits=_to_list(outputs[0][i]), end_logits=_to_list(outputs[1][i]),
                         )
                     all_results.append(result)
         torch.cuda.empty_cache()
@@ -783,9 +753,7 @@ def postprocess_bert_answer(
 
         # Sort by the sum of the start and end logits in ascending order,
        # so that the first element is the most probable answer
-        prelim_predictions = sorted(
-            prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True
-        )
+        prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True)
 
         seen_predictions = {}
         nbest = []
@@ -818,19 +786,11 @@ def postprocess_bert_answer(
                 final_text = ""
                 seen_predictions[final_text] = True
 
-            nbest.append(
-                _NbestPrediction(
-                    text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit
-                )
-            )
+            nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit))
 
         # if we didn't include the empty option in the n-best, include it
         if unanswerable_exists:
             if "" not in seen_predictions:
-                nbest.append(
-                    _NbestPrediction(
-                        text="", start_logit=null_start_logit, end_logit=null_end_logit
-                    )
-                )
+                nbest.append(_NbestPrediction(text="", start_logit=null_start_logit, end_logit=null_end_logit))
 
             # In very rare edge cases we could only have single null prediction.
             # So we just create a nonce prediction in this case to avoid failure.
@@ -874,9 +834,7 @@ def postprocess_bert_answer(
             all_probs[example["qa_id"]] = nbest_json[0]["probability"]
         else:
             # predict "" iff the null score - the score of best non-null > threshold
-            score_diff = (
-                score_null - best_non_null_entry.start_logit - (best_non_null_entry.end_logit)
-            )
+            score_diff = score_null - best_non_null_entry.start_logit - (best_non_null_entry.end_logit)
             scores_diff_json[example["qa_id"]] = score_diff
             if score_diff > null_score_diff_threshold:
                 all_predictions[example["qa_id"]] = ""
@@ -1042,9 +1000,7 @@ def postprocess_xlnet_answer(
                 )
             )
 
-    prelim_predictions = sorted(
-        prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True
-    )
+    prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True)
 
     seen_predictions = {}
     nbest = []
@@ -1075,20 +1031,14 @@ def postprocess_xlnet_answer(
             tok_text = " ".join(tok_text.split())
             orig_text = " ".join(orig_tokens)
 
-            final_text = _get_final_text(
-                tok_text, orig_text, tokenizer.do_lower_case, verbose_logging
-            )
+            final_text = _get_final_text(tok_text, orig_text, tokenizer.do_lower_case, verbose_logging)
 
             if final_text in seen_predictions:
                 continue
 
             seen_predictions[final_text] = True
 
-            nbest.append(
-                _NbestPrediction(
-                    text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit
-                )
-            )
+            nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit))
 
     # In very rare edge cases we could have no valid predictions. So we
     # just create a nonce prediction in this case to avoid failure.
@@ -1235,9 +1185,7 @@ def _is_whitespace(c):
                 actual_text = " ".join(d_tokens[start_position : (end_position + 1)])
                 cleaned_answer_text = " ".join(whitespace_tokenize(a_text))
                 if actual_text.find(cleaned_answer_text) == -1:
-                    logger.warning(
-                        "Could not find answer: '%s' vs. '%s'", actual_text, cleaned_answer_text
-                    )
+                    logger.warning("Could not find answer: '%s' vs. '%s'", actual_text, cleaned_answer_text)
                     return
         else:
             start_position = -1
@@ -1696,9 +1644,7 @@ def _strip_spaces(text):
 
     if len(orig_ns_text) != len(tok_ns_text):
         if verbose_logging:
-            logger.info(
-                "Length not equal after stripping spaces: '%s' vs '%s'", orig_ns_text, tok_ns_text
-            )
+            logger.info("Length not equal after stripping spaces: '%s' vs '%s'", orig_ns_text, tok_ns_text)
         return orig_text
 
     # We then project the characters in `pred_text` back to `orig_text` using
diff --git a/utils_nlp/models/transformers/sequence_classification.py b/utils_nlp/models/transformers/sequence_classification.py
old mode 100644
new mode 100755
index 93668471b..e8a4a288b
--- a/utils_nlp/models/transformers/sequence_classification.py
+++ b/utils_nlp/models/transformers/sequence_classification.py
@@ -2,37 +2,25 @@
 # Licensed under the MIT License.
 
 import numpy as np
-import torch
-from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
-from torch.utils.data.distributed import DistributedSampler
-from transformers.modeling_bert import (
-    BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
-    BertForSequenceClassification,
-)
+from transformers.modeling_albert import ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP, AlbertForSequenceClassification
+from transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BertForSequenceClassification
 from transformers.modeling_distilbert import (
     DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
     DistilBertForSequenceClassification,
 )
-from transformers.modeling_roberta import (
-    ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
-    RobertaForSequenceClassification,
-)
-from transformers.modeling_xlnet import (
-    XLNET_PRETRAINED_MODEL_ARCHIVE_MAP,
-    XLNetForSequenceClassification,
-)
+from transformers.modeling_roberta import ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, RobertaForSequenceClassification
+from transformers.modeling_xlnet import XLNET_PRETRAINED_MODEL_ARCHIVE_MAP, XLNetForSequenceClassification
+
+from utils_nlp.common.pytorch_utils import compute_training_steps
 from utils_nlp.models.transformers.common import MAX_SEQ_LEN, TOKENIZER_CLASS, Transformer
 from utils_nlp.models.transformers.datasets import SCDataSet, SPCDataSet
 
 MODEL_CLASS = {}
 MODEL_CLASS.update({k: BertForSequenceClassification for k in BERT_PRETRAINED_MODEL_ARCHIVE_MAP})
-MODEL_CLASS.update(
-    {k: RobertaForSequenceClassification for k in ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP}
-)
+MODEL_CLASS.update({k: RobertaForSequenceClassification for k in ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP})
 MODEL_CLASS.update({k: XLNetForSequenceClassification for k in XLNET_PRETRAINED_MODEL_ARCHIVE_MAP})
-MODEL_CLASS.update(
-    {k: DistilBertForSequenceClassification for k in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP}
-)
+MODEL_CLASS.update({k: DistilBertForSequenceClassification for k in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP})
+MODEL_CLASS.update({k: AlbertForSequenceClassification for k in ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP})
 
 
 class Processor:
@@ -56,13 +44,14 @@ def __init__(self, model_name="bert-base-cased", to_lower=False, cache_dir="."):
         )
 
     @staticmethod
-    def get_inputs(batch, model_name, train_mode=True):
+    def get_inputs(batch, device, model_name, train_mode=True):
         """
         Creates an input dictionary given a model name.
 
         Args:
             batch (tuple): A tuple containing input ids, attention mask,
                 segment ids, and labels tensors.
+            device (torch.device): A PyTorch device.
             model_name (bool, optional): Model name used to format the inputs.
             train_mode (bool, optional): Training mode flag.
                 Defaults to True.
@@ -71,7 +60,8 @@ def get_inputs(batch, model_name, train_mode=True):
             dict: Dictionary containing input ids, segment ids, masks, and labels.
                 Labels are only returned when train_mode is True.
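+
+            Example (illustrative; assumes ``batch`` comes from a DataLoader over a
+                dataset built by ``dataset_from_dataframe``):
+                >>> inputs = Processor.get_inputs(batch, device, "bert-base-cased")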
""" - if model_name.split("-")[0] in ["bert", "xlnet", "roberta", "distilbert"]: + batch = tuple(t.to(device) for t in batch) + if model_name.split("-")[0] in ["bert", "xlnet", "roberta", "distilbert", "albert"]: if train_mode: inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]} else: @@ -103,11 +93,7 @@ def text_transform(text, tokenizer, max_len=MAX_SEQ_LEN): print("setting max_len to max allowed sequence length: {}".format(MAX_SEQ_LEN)) max_len = MAX_SEQ_LEN # truncate and add CLS & SEP markers - tokens = ( - [tokenizer.cls_token] - + tokenizer.tokenize(text)[0 : max_len - 2] - + [tokenizer.sep_token] - ) + tokens = [tokenizer.cls_token] + tokenizer.tokenize(text)[0 : max_len - 2] + [tokenizer.sep_token] # get input ids input_ids = tokenizer.convert_tokens_to_ids(tokens) # pad sequence @@ -127,7 +113,7 @@ def text_pair_transform(text_1, text_2, tokenizer, max_len=MAX_SEQ_LEN): Args: text_1 (str): Input text 1. - text_1 (str): Input text 2. + text_2 (str): Input text 2. tokenizer (PreTrainedTokenizer): A pretrained tokenizer. max_len (int, optional): Max sequence length. Defaults to 512. @@ -188,29 +174,13 @@ def _truncate_seq_pair(tokens_a, tokens_b, max_length): return input_ids, attention_mask, token_type_ids - def create_dataloader_from_df( - self, - df, - text_col, - label_col=None, - text2_col=None, - shuffle=False, - max_len=MAX_SEQ_LEN, - batch_size=32, - num_gpus=None, - distributed=False, - ): + def dataset_from_dataframe(self, df, text_col, label_col=None, text2_col=None, max_len=MAX_SEQ_LEN): if text2_col is None: - ds = SCDataSet( - df, - text_col, - label_col, - transform=Processor.text_transform, - tokenizer=self.tokenizer, - max_len=max_len, + return SCDataSet( + df, text_col, label_col, transform=Processor.text_transform, tokenizer=self.tokenizer, max_len=max_len, ) else: - ds = SPCDataSet( + return SPCDataSet( df, text_col, text2_col, @@ -220,26 +190,11 @@ def create_dataloader_from_df( max_len=max_len, ) - if num_gpus is None: - num_gpus = torch.cuda.device_count() - - batch_size = batch_size * max(1, num_gpus) - - if distributed: - sampler = DistributedSampler(ds) - else: - sampler = RandomSampler(ds) if shuffle else SequentialSampler(ds) - - return DataLoader(ds, sampler=sampler, batch_size=batch_size) - class SequenceClassifier(Transformer): def __init__(self, model_name="bert-base-cased", num_labels=2, cache_dir="."): super().__init__( - model_class=MODEL_CLASS, - model_name=model_name, - num_labels=num_labels, - cache_dir=cache_dir, + model_class=MODEL_CLASS, model_name=model_name, num_labels=num_labels, cache_dir=cache_dir, ) @staticmethod @@ -250,7 +205,10 @@ def fit( self, train_dataloader, num_epochs=1, + max_steps=-1, + gradient_accumulation_steps=1, num_gpus=None, + gpu_ids=None, local_rank=-1, weight_decay=0.0, learning_rate=5e-5, @@ -263,11 +221,21 @@ def fit( Fine-tunes a pre-trained sequence classification model. Args: - train_dataloader (Dataloader): Dataloader for the training data. + train_dataloader (Dataloader): A PyTorch DataLoader to be used for training. num_epochs (int, optional): Number of training epochs. Defaults to 1. + max_steps (int, optional): Total number of training steps. + If set to a positive value, it overrides num_epochs. + Otherwise, it's determined by the dataset length, gradient_accumulation_steps, and num_epochs. + Defualts to -1. + gradient_accumulation_steps (int, optional): Number of steps to accumulate + before performing a backward/update pass. + Default to 1. 
             num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will
                 be used. If set to 0 or GPUs are not available, CPU device will be used.
                 Defaults to None.
+            gpu_ids (list): List of GPU IDs to be used.
+                If set to None, the first num_gpus GPUs will be used.
+                Defaults to None.
             local_rank (int, optional): Local_rank for distributed training on GPUs. Defaults to
                 -1, which means non-distributed training.
             weight_decay (float, optional): Weight decay to apply after each parameter update.
@@ -281,28 +249,49 @@ def fit(
             seed (int, optional): Random seed used to improve reproducibility. Defaults to None.
         """
 
+        # init optimizer
+        optimizer = Transformer.get_default_optimizer(self.model, weight_decay, learning_rate, adam_epsilon)
+
+        # compute the max number of training steps
+        max_steps = compute_training_steps(
+            train_dataloader,
+            num_epochs=num_epochs,
+            max_steps=max_steps,
+            gradient_accumulation_steps=gradient_accumulation_steps,
+        )
+
+        # init scheduler
+        scheduler = Transformer.get_default_scheduler(
+            optimizer=optimizer, warmup_steps=warmup_steps, num_training_steps=max_steps,
+        )
+
+        # fine tune
         super().fine_tune(
             train_dataloader=train_dataloader,
             get_inputs=Processor.get_inputs,
-            n_gpu=num_gpus,
-            num_train_epochs=num_epochs,
-            weight_decay=weight_decay,
-            learning_rate=learning_rate,
-            adam_epsilon=adam_epsilon,
-            warmup_steps=warmup_steps,
+            num_gpus=num_gpus,
+            gpu_ids=gpu_ids,
+            max_steps=max_steps,
+            gradient_accumulation_steps=gradient_accumulation_steps,
+            optimizer=optimizer,
+            scheduler=scheduler,
+            local_rank=local_rank,
             verbose=verbose,
             seed=seed,
        )
 
-    def predict(self, eval_dataloader, num_gpus=None, verbose=True):
+    def predict(self, test_dataloader, num_gpus=None, gpu_ids=None, verbose=True):
         """
         Scores a dataset using a fine-tuned model and a given dataloader.
 
         Args:
-            eval_dataloader (Dataloader): Dataloader for the evaluation data.
+            test_dataloader (DataLoader): DataLoader for scoring the data.
             num_gpus (int, optional): The number of GPUs to use. If None, all available GPUs will
                 be used. If set to 0 or GPUs are not available, CPU device will be used.
                 Defaults to None.
+            gpu_ids (list): List of GPU IDs to be used.
+                If set to None, the first num_gpus GPUs will be used.
+                Defaults to None.
             verbose (bool, optional): Whether to print out the training log. Defaults to True.
 
         Returns
@@ -311,12 +300,12 @@ def predict(self, eval_dataloader, num_gpus=None, verbose=True):
 
         preds = list(
             super().predict(
-                eval_dataloader=eval_dataloader,
+                eval_dataloader=test_dataloader,
                 get_inputs=Processor.get_inputs,
-                n_gpu=num_gpus,
+                num_gpus=num_gpus,
+                gpu_ids=gpu_ids,
                 verbose=verbose,
             )
         )
         preds = np.concatenate(preds)
-        # todo generator & probs
         return np.argmax(preds, axis=1)
diff --git a/utils_nlp/models/xlnet/sequence_classification.py b/utils_nlp/models/xlnet/sequence_classification.py
index 32c239866..a7a086ea3
--- a/utils_nlp/models/xlnet/sequence_classification.py
+++ b/utils_nlp/models/xlnet/sequence_classification.py
@@ -2,23 +2,20 @@
 # Licensed under the MIT License.
"""Utilities for Xlnet Sequence Classification""" -import numpy as np +import os from collections import namedtuple + +import mlflow +import mlflow.pytorch +import numpy as np import torch import torch.nn as nn -from transformers import ( - XLNetConfig, - XLNetForSequenceClassification, - AdamW, - WarmupLinearSchedule, -) -from tqdm import tqdm from torch.utils.data import DataLoader, RandomSampler, TensorDataset -from utils_nlp.common.pytorch_utils import get_device, move_to_device +from tqdm import tqdm +from transformers import AdamW, WarmupLinearSchedule, XLNetConfig, XLNetForSequenceClassification + +from utils_nlp.common.pytorch_utils import get_device, move_model_to_device from utils_nlp.models.xlnet.common import Language -import mlflow -import mlflow.pytorch -import os class XLNetSequenceClassifier: @@ -79,9 +76,7 @@ def __init__( self.max_grad_norm = max_grad_norm # create classifier - self.config = XLNetConfig.from_pretrained( - self.language.value, num_labels=num_labels, cache_dir=cache_dir - ) + self.config = XLNetConfig.from_pretrained(self.language.value, num_labels=num_labels, cache_dir=cache_dir) self.model = XLNetForSequenceClassification(self.config) def fit( @@ -114,7 +109,7 @@ def fit( """ device, num_gpus = get_device(self.num_gpus) - self.model = move_to_device(self.model, device, self.num_gpus) + self.model = move_model_to_device(self.model, device, self.num_gpus) token_ids_tensor = torch.tensor(token_ids, dtype=torch.long) input_mask_tensor = torch.tensor(input_mask, dtype=torch.long) @@ -128,24 +123,17 @@ def fit( token_type_ids_tensor = torch.tensor(token_type_ids, dtype=torch.long) val_token_type_ids_tensor = torch.tensor(val_token_type_ids, dtype=torch.long) - train_dataset = TensorDataset( - token_ids_tensor, input_mask_tensor, token_type_ids_tensor, labels_tensor - ) + train_dataset = TensorDataset(token_ids_tensor, input_mask_tensor, token_type_ids_tensor, labels_tensor) val_dataset = TensorDataset( - val_token_ids_tensor, - val_input_mask_tensor, - val_token_type_ids_tensor, - val_labels_tensor, + val_token_ids_tensor, val_input_mask_tensor, val_token_type_ids_tensor, val_labels_tensor, ) else: train_dataset = TensorDataset(token_ids_tensor, input_mask_tensor, labels_tensor) - val_dataset = TensorDataset( - val_token_ids_tensor, val_input_mask_tensor, val_labels_tensor - ) + val_dataset = TensorDataset(val_token_ids_tensor, val_input_mask_tensor, val_labels_tensor) # define optimizer and model parameters param_optimizer = list(self.model.named_parameters()) @@ -155,10 +143,7 @@ def fit( "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], "weight_decay": self.weight_decay, }, - { - "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], - "weight_decay": 0.0, - }, + {"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0}, ] val_sampler = RandomSampler(val_dataset) @@ -181,9 +166,7 @@ def fit( train_sampler = RandomSampler(train_dataset) - train_dataloader = DataLoader( - train_dataset, sampler=train_sampler, batch_size=self.batch_size - ) + train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=self.batch_size) tr_loss = 0.0 logging_loss = 0.0 @@ -191,18 +174,13 @@ def fit( for i, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): if token_type_ids: - x_batch, mask_batch, token_type_ids_batch, y_batch = tuple( - t.to(device) for t in batch - ) + x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(t.to(device) for t in 
@@ -181,9 +166,7 @@ def fit(

             train_sampler = RandomSampler(train_dataset)

-            train_dataloader = DataLoader(
-                train_dataset, sampler=train_sampler, batch_size=self.batch_size
-            )
+            train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=self.batch_size)

             tr_loss = 0.0
             logging_loss = 0.0
@@ -191,18 +174,13 @@ def fit(

             for i, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
                 if token_type_ids:
-                    x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(
-                        t.to(device) for t in batch
-                    )
+                    x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(t.to(device) for t in batch)
                 else:
                     token_type_ids_batch = None
                     x_batch, mask_batch, y_batch = tuple(t.to(device) for t in batch)

                 outputs = self.model(
-                    input_ids=x_batch,
-                    token_type_ids=token_type_ids_batch,
-                    attention_mask=mask_batch,
-                    labels=y_batch,
+                    input_ids=x_batch, token_type_ids=token_type_ids_batch, attention_mask=mask_batch, labels=y_batch,
                 )

                 loss = outputs[0]  # model outputs are always tuple in pytorch-transformers
@@ -220,9 +198,7 @@ def fit(
                     if logging_steps > 0 and global_step % logging_steps == 0:
                         mlflow.log_metric("learning rate", scheduler.get_lr()[0], step=global_step)
                         mlflow.log_metric(
-                            "training loss",
-                            (tr_loss - logging_loss) / (logging_steps * self.batch_size),
-                            step=global_step,
+                            "training loss", (tr_loss - logging_loss) / (logging_steps * self.batch_size), step=global_step,
                         )
                         logging_loss = tr_loss
                     # model checkpointing
@@ -245,9 +221,7 @@ def fit(
                                 )
                             else:
                                 token_type_ids_batch = None
-                                val_x_batch, val_mask_batch, val_y_batch = tuple(
-                                    t.to(device) for t in val_batch
-                                )
+                                val_x_batch, val_mask_batch, val_y_batch = tuple(t.to(device) for t in val_batch)
                             val_outputs = self.model(
                                 input_ids=val_x_batch,
                                 token_type_ids=val_token_type_ids_batch,
@@ -256,9 +230,7 @@ def fit(
                             )
                             vloss = val_outputs[0]
                             val_loss += vloss.sum().item()
-                        mlflow.log_metric(
-                            "validation loss", val_loss / len(val_dataset), step=global_step
-                        )
+                        mlflow.log_metric("validation loss", val_loss / len(val_dataset), step=global_step)
                         self.model.train()

                 if verbose:
@@ -300,13 +272,7 @@ def fit(
         torch.cuda.empty_cache()

     def predict(
-        self,
-        token_ids,
-        input_mask,
-        token_type_ids=None,
-        num_gpus=None,
-        batch_size=8,
-        probabilities=False,
+        self, token_ids, input_mask, token_type_ids=None, num_gpus=None, batch_size=8, probabilities=False,
     ):
         """Scores the given dataset and returns the predicted classes.

@@ -330,7 +296,7 @@ def predict(
         """
         device, num_gpus = get_device(num_gpus)
-        self.model = move_to_device(self.model, device, num_gpus)
+        self.model = move_model_to_device(self.model, device, num_gpus)

         self.model.eval()
         preds = []
@@ -342,16 +308,11 @@ def predict(

             x_batch = torch.tensor(token_ids[start:end], dtype=torch.long, device=device)
             mask_batch = torch.tensor(input_mask[start:end], dtype=torch.long, device=device)

-            token_type_ids_batch = torch.tensor(
-                token_type_ids[start:end], dtype=torch.long, device=device
-            )
+            token_type_ids_batch = torch.tensor(token_type_ids[start:end], dtype=torch.long, device=device)

             with torch.no_grad():
                 pred_batch = self.model(
-                    input_ids=x_batch,
-                    token_type_ids=token_type_ids_batch,
-                    attention_mask=mask_batch,
-                    labels=None,
+                    input_ids=x_batch, token_type_ids=token_type_ids_batch, attention_mask=mask_batch, labels=None,
                 )
                 preds.append(pred_batch[0].cpu())
                 if i % batch_size == 0:
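The `move_to_device` to `move_model_to_device` rename recurs across `fit` and `predict` in this file. A minimal sketch of the helper pair as it is called here; the stand-in model is illustrative, and only the three positional arguments shown in this diff are assumed.

```python
# Sketch of the renamed device helpers from this diff.
import torch.nn as nn

from utils_nlp.common.pytorch_utils import get_device, move_model_to_device

model = nn.Linear(768, 2)  # stand-in for XLNetForSequenceClassification

# get_device returns (device, num_gpus), picking CUDA when available;
# move_model_to_device (formerly move_to_device) moves the model to that
# device and prepares it for multi-GPU use when num_gpus > 1.
device, num_gpus = get_device(None)
model = move_model_to_device(model, device, num_gpus)
```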