download the dataset #12

YuMiaoTHU · 2019-03-25T01:35:09Z

Thanks for your excellent work!

when I run download-bbc-articles.py, it showed that

I want to know why. Thanks for your help!

shashiongithub · 2019-03-25T10:56:42Z

Maybe access to the Web Archive URLs is restricted. If the problem persists, drop me an email.

YuMiaoTHU · 2019-03-25T11:36:29Z

Thanks for your reply! Sometimes it's hard for us to access certain websites. I still can't get the dataset. Could you send me the raw data or the processed data via Google Drive or Dropbox? Thanks for your hard work!

thinkwee · 2019-04-17T04:16:05Z

I have the same problem. The server is not stable. On the first run I downloaded about 2000 items, but when I reran the script it could not download any more.

joelowj · 2019-10-07T14:18:24Z

Hi @shashiongithub, I am having similar issues downloading the data with the script. I am currently working on a paper and would love to use the XSum dataset for my experiments. I was hoping you could share it with me through another channel. I tried contacting you by email, but the message could not be delivered to your mailbox. My email is [email protected]. Thanks a lot!

shashiongithub · 2019-11-25T16:47:39Z

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use train, development and test ids from github to split into subsets.
Let me know if you have any questions.
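
The splitting step could be sketched as follows. This is only an illustration: it assumes the ids have already been loaded into a dict mapping a split name to a list of BBC ids, and that each article is stored as one file named after its id. The function name `group_files_by_split`, the `split_ids` layout, and the `.summary` suffix default are assumptions made for the example, not something this comment specifies.

```python
from pathlib import Path


def group_files_by_split(data_dir, split_ids, suffix=".summary"):
    """Map each split name to the files in data_dir whose name (minus suffix) is one of its ids.

    split_ids is a hypothetical dict such as {"train": [...], "validation": [...], "test": [...]}.
    Files whose id appears in no split are simply left out.
    """
    data_dir = Path(data_dir)
    files = list(data_dir.glob(f"*{suffix}"))
    grouped = {}
    for split, ids in split_ids.items():
        wanted = {str(i) for i in ids}
        grouped[split] = sorted(
            p for p in files if p.name[: -len(suffix)] in wanted
        )
    return grouped
```

Files that do not belong to any split are dropped rather than reported, so it may be worth comparing the number of matched files against the length of each id list.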

shahbazsyed · 2020-04-03T07:53:12Z

Hi,
The link provided above is broken. Is there another way to get the dataset?

mingzi151 · 2020-04-04T02:09:11Z

Hey, I'm not able to open the link either. Can you please help?

shashiongithub · 2020-04-04T15:37:56Z

http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

shahbazsyed · 2020-04-06T07:58:39Z

Thanks!

fatihbeyhan · 2020-04-20T12:18:29Z

Hey! The link is broken :/ Can you share an updated one so I can download the dataset?

shashiongithub · 2020-04-22T09:13:51Z

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

sshleifer · 2020-05-22T17:56:00Z

The archive at that URL unpacks into a directory called bbc-summary-data containing files like bbc-summary-data/{bbcid}.summary.
Which code is meant to be run after that to continue preprocessing? bbcid.summary files are not mentioned in the README. Thanks!

sshleifer · 2020-05-22T18:14:32Z

First file bbc-summary-data/10000983.summary looks like this:

shashiongithub · 2020-05-22T19:13:49Z

A few things to keep in mind:

There are some extra summary files here that you should ignore
(for example, their summaries have more than one sentence).

Please use the training/dev/test ids provided here to find which one to use:
https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

There are a few mismatches (between the data here and the formats on GitHub):

bbc-summary-data/bbcid.summary --> xsum-extracts-from-downloads/bbcid.data

In each summary file:
[SN]URL[SN] => [XSUM]URL[XSUM]
[SN]TITLE[SN] => Ignore this, not used.
[SN]FIRST-SENTENCE[SN] => [XSUM]FIRST-SENTENCE[XSUM]
[SN]RESTBODY[SN] => [XSUM]RESTBODY[XSUM]

With these changes the preprocessing scripts should work.
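
The renaming and marker replacement listed above could be scripted roughly as follows. This is a sketch under assumptions: it supposes each `[SN]...[SN]` marker sits on its own line and is followed by its content up to the next marker, and it reads "Ignore this, not used" as dropping the TITLE section entirely. The function names `convert_summary_text` and `convert_directory` are made up for the example.

```python
import re
from pathlib import Path

# Sections that get an [XSUM] marker; TITLE is dropped (one reading of "not used").
KEEP = ("URL", "FIRST-SENTENCE", "RESTBODY")
MARKER = re.compile(r"^\[SN\](.+?)\[SN\]$")


def convert_summary_text(text):
    """Rewrite [SN]NAME[SN] markers as [XSUM]NAME[XSUM] and drop unused sections."""
    out = []
    skipping = False
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            name = match.group(1)
            skipping = name not in KEEP
            if not skipping:
                out.append(f"[XSUM]{name}[XSUM]")
            continue
        if not skipping:
            out.append(line)
    return "\n".join(out) + "\n"


def convert_directory(src_dir, dst_dir):
    """Write dst_dir/<bbcid>.data for every src_dir/<bbcid>.summary; return the file count."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in src_dir.glob("*.summary"):
        converted = convert_summary_text(src.read_text(encoding="utf-8"))
        (dst_dir / (src.stem + ".data")).write_text(converted, encoding="utf-8")
        count += 1
    return count
```

A call such as `convert_directory("bbc-summary-data", "xsum-extracts-from-downloads")` would then produce the directory layout named in the mapping above.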

sshleifer · 2020-05-25T17:11:30Z

Just to verify: I don't need to run prepare_bbc_data.py after doing the SN --> XSUM replacement, right?
Which field is the summary? Or is that in another file?

For context, I'm trying to replicate the results in the BART paper.

Thanks!

msadat3 · 2021-04-12T02:54:31Z

Hello,

I am also unable to access any of the links posted in this thread. Could you please post a working URL?

Update: I found that the posted URL works if we run "wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz" from a terminal. It does not work from my browser.

StevenTang1998 · 2021-08-25T01:16:26Z

The same problem here. Could you please post a new URL?

anamtaamin · 2023-04-20T06:57:52Z

Regarding the broken link reported above, try:
https://huggingface.co/datasets/xsum/resolve/main/data/XSUM-EMNLP18-Summary-Data-Original.tar.gz

mikechen66 · 2023-09-22T05:45:01Z

Using the downloaded dataset generates an error. The details are as follows.

I wrote the above-mentioned link (repeated here)

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

into xsum.py:

_URL_DATA = "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"

Then I ran the following code:

from datasets import load_dataset
raw_datasets = load_dataset("xsum.py", "raw_datasets")

It generates the following error:

ReadError: unexpected end of data

The above exception was the direct cause of the following exception:
File ~/miniconda3/envs/tf/lib/python3.10/site-packages/datasets/builder.py:1712, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1710 if isinstance(e, SchemaInferenceError) and e.context is not None:
1711 e = e.context
-> 1712 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1714 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

The dataset source may have a problem.
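
One possible cause of "unexpected end of data" is an incomplete or truncated download, though nothing in this thread confirms that. The sketch below uses only Python's standard `tarfile` module to check whether a local copy of the archive can be read through to the end. The function name `archive_is_complete` and the local file path in the usage note are assumptions for illustration.

```python
import tarfile
import zlib


def archive_is_complete(path):
    """Return True if every regular file in the tar archive can be read to its end."""
    try:
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                fileobj = tar.extractfile(member)
                # Read in 1 MiB chunks so large members are not held in memory.
                while fileobj.read(1 << 20):
                    pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # A truncated or corrupted archive typically fails in one of these ways.
        return False
```

For example, `archive_is_complete("XSUM-EMNLP18-Summary-Data-Original.tar.gz")` returning False would suggest downloading the file again before pointing `_URL_DATA` at it.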

Note: the original code, however, runs successfully:

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")

pyfisch mentioned this issue Feb 6, 2020

How to use dataset #17

Open

wonjininfo mentioned this issue Apr 11, 2020

Raw dataset #20

Closed

wonjininfo mentioned this issue May 28, 2020

Need help for summarization task for the XSum dataset (Out of range error) google-research/pegasus#7

Closed

JingqingZ mentioned this issue Dec 22, 2020

Anyone prepared xsum dataset(manual) from tfds to work with pegasus google-research/pegasus#153

Closed
