Add Wikipedia Persian Dataset (#3629)
Currently, the Open-Assistant model doesn't support Farsi. This is a
text-only dataset for learning Farsi (Persian).

One of my friends fine-tuned LLaMA on this dataset, and it could
understand Farsi grammar and word usage very well. If the Open-Assistant
team wants to add support for Farsi, this should be the first step.

I have converted the dataset to the standard format described
[here](https://projects.laion.ai/Open-Assistant/docs/data/datasets) and
uploaded it to [my Hugging Face
account](https://huggingface.co/datasets/pourmand1376/fa-wikipedia).


- #2974
pourmand1376 authored Aug 3, 2023
1 parent aa96100 commit 65f5c2b
Showing 2 changed files with 7 additions and 0 deletions.
1 change: 1 addition & 0 deletions data/datasets/__init__.py
@@ -4,6 +4,7 @@
"tv_dialogue": "sedthh/tv_dialogue", # TV and Movie dialogues and transcripts
"fd_dialogue": "sedthh/fd_dialogue", # TV and Movie dialogues and transcripts from ForeverDreaming
"tlcv2.0_oa": "pythainlp/tlcv2.0_oa", # Thai classical literature texts
"fa-wikipedia": "pourmand1376/fa-wikipedia", # Farsi Wikipedia texts
}

INSTRUCTION_DATASETS = {
6 changes: 6 additions & 0 deletions data/datasets/fa-wikipedia/README.md
@@ -0,0 +1,6 @@
This dataset was crawled from
[Farsi Wikipedia](https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C).
It is valuable, clean text data in Persian (Farsi). It contains information
about all subjects.

It has 2.53M articles.
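
The registry change in `data/datasets/__init__.py` maps a short dataset name to a Hugging Face repository ID. The sketch below is only an illustration of how such a mapping might be consumed, not the project's actual loader. The dictionary name `TEXT_DATASETS`, the helpers `resolve_repo_id` and `stream_dataset`, and the `train` split are assumptions that do not appear in the diff.

```python
# Assumption: the diff shows only the entries, not the variable they belong to,
# so the name TEXT_DATASETS is hypothetical.
TEXT_DATASETS = {
    "tlcv2.0_oa": "pythainlp/tlcv2.0_oa",  # Thai classical literature texts
    "fa-wikipedia": "pourmand1376/fa-wikipedia",  # Farsi Wikipedia texts
}


def resolve_repo_id(name, registry=TEXT_DATASETS):
    """Return the Hugging Face repository ID registered under a short name."""
    try:
        return registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown dataset {name!r}; known names: {known}") from None


def stream_dataset(name, split="train"):
    """Hypothetical helper: stream a registered dataset without a full download.

    Assumes the `datasets` package is installed and that the repository
    exposes a split with the given name.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(resolve_repo_id(name), split=split, streaming=True)


if __name__ == "__main__":
    print(resolve_repo_id("fa-wikipedia"))
```

Streaming is used in the sketch because the README reports 2.53M articles, so iterating over records may be preferable to downloading everything up front.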
