download the dataset #12

Open
YuMiaoTHU opened this issue Mar 25, 2019 · 19 comments
Comments

@YuMiaoTHU

Thanks for your excellent work!

When I run download-bbc-articles.py, it shows the following:
[image: screenshot not preserved]

I want to know why. Thanks for your help~

@shashiongithub
Collaborator

Maybe access to the WebArchive URLs is restricted! If the problem remains, drop me an email.

@YuMiaoTHU
Author

YuMiaoTHU commented Mar 25, 2019 via email

@thinkwee

I have the same problem. The server is not stable. I downloaded about 2000 articles the first time, but when I reran the script, it could not download any more.
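For an unstable server like the one described above, one common workaround is to skip what is already on disk and retry failed requests with a growing delay. This is only a minimal sketch, not the repository's download script: `pending_ids`, `fetch_with_retry`, the `.html` suffix, and the `fetch` callable are all hypothetical names chosen for illustration.

```python
import time
from pathlib import Path


def pending_ids(all_ids, out_dir, suffix=".html"):
    """Return the ids that do not yet have a file in out_dir.

    The suffix is an assumption; use whatever your download script writes.
    """
    done = {p.name[: -len(suffix)] for p in Path(out_dir).glob(f"*{suffix}")}
    return [item for item in all_ids if item not in done]


def fetch_with_retry(fetch, item, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(item), retrying on OSError with exponential backoff.

    urllib.error.URLError and socket timeouts are subclasses of OSError,
    so a urllib-based fetch function is covered by this handler.
    """
    for attempt in range(attempts):
        try:
            return fetch(item)
        except OSError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
```

Rerunning a loop over `pending_ids(...)` then only requests the articles that are still missing, so a second run does not start from scratch.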

@joelowj

joelowj commented Oct 7, 2019

Hi @shashiongithub, I am having similar issues downloading the data with the script. I am currently working on a paper and would love to use the XSum dataset for my experiments. I was hoping you could share the data with me through another channel. I tried contacting you by email, but the message could not be delivered to your mailbox. My email is [email protected]. Thanks a lot!

@shashiongithub
Collaborator

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use train, development and test ids from github to split into subsets.
Let me know if you have any questions.
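Splitting the extracted files by id could look roughly like the sketch below. It assumes the split file from GitHub is a JSON object that maps each split name to a list of ids, and that the extracted files are named `<id>.summary` (as reported later in this thread). Both are assumptions to check against your copy, and `load_split_ids` and `assign_files` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_split_ids(split_json_path):
    """Read a JSON file of the assumed form {"split name": ["id", ...], ...}."""
    with open(split_json_path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return {name: set(ids) for name, ids in raw.items()}


def assign_files(data_dir, split_ids, suffix=".summary"):
    """Group files in data_dir by the split whose id set contains their id.

    Files whose id appears in no split are left out of the result.
    """
    assigned = {name: [] for name in split_ids}
    for path in sorted(Path(data_dir).glob(f"*{suffix}")):
        bbcid = path.name[: -len(suffix)]
        for name, ids in split_ids.items():
            if bbcid in ids:
                assigned[name].append(path)
                break
    return assigned
```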

@shahbazsyed

Hi,
The link provided above is broken. Is there another way to get the dataset?

@mingzi151

Hey, I'm not able to open the link either. Can you please help?

@shashiongithub
Collaborator

http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

@shahbazsyed

Thanks!

@wonjininfo wonjininfo mentioned this issue Apr 11, 2020
@fatihbeyhan

Hey! The link is broken :/ Can you share an updated one so I can download the dataset?

@shashiongithub
Collaborator

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

@sshleifer

sshleifer commented May 22, 2020

That url creates a dir called bbc-summary-data containing files like bbc-summary-data/{bbcid}.summary.
Which code is meant to be run after that to continue preprocessing? bbcid.summary files are not mentioned in the README. Thanks!

@sshleifer

First file bbc-summary-data/10000983.summary looks like this:
[image: screenshot not preserved]

@shashiongithub
Collaborator

A few things to keep in mind:

  1. There are some extra summary files here; you should ignore them
    (they have more than one sentence in their summary, etc.).

Please use the training/dev/test ids provided here to find which ones to use:
https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

  2. There are a few mismatches (between the data here and the formats on GitHub):

bbc-summary-data/bbcid.summary --> xsum-extracts-from-downloads/bbcid.data

In each summary file:
[SN]URL[SN] => [XSUM]URL[XSUM]
[SN]TITLE[SN] => Ignore this, not used.
[SN]FIRST-SENTENCE[SN] => [XSUM]FIRST-SENTENCE[XSUM]
[SN]RESTBODY[SN] => [XSUM]RESTBODY[XSUM]

With these changes the preprocessing scripts should work.
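The renames listed above can be applied in bulk with a short script. This is a sketch under stated assumptions rather than part of the repository: `convert_markers` and `convert_directory` are hypothetical helpers, the files are assumed to be UTF-8, and the `[SN]TITLE[SN]` section is simply left in place, since the comment only says it is not used.

```python
from pathlib import Path

# Marker renames copied from the mapping above. TITLE is intentionally
# absent: it is described as unused, so this sketch leaves it untouched.
MARKER_MAP = {
    "[SN]URL[SN]": "[XSUM]URL[XSUM]",
    "[SN]FIRST-SENTENCE[SN]": "[XSUM]FIRST-SENTENCE[XSUM]",
    "[SN]RESTBODY[SN]": "[XSUM]RESTBODY[XSUM]",
}


def convert_markers(text):
    """Replace the [SN] section markers with their [XSUM] equivalents."""
    for old, new in MARKER_MAP.items():
        text = text.replace(old, new)
    return text


def convert_directory(src_dir, dst_dir, keep_ids=None):
    """Write <id>.data files into dst_dir from <id>.summary files in src_dir.

    If keep_ids is given, files whose id is not in it are skipped, which is
    one way to drop the extra summary files. Returns the number written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(src.glob("*.summary")):
        bbcid = path.stem  # "10000983.summary" -> "10000983"
        if keep_ids is not None and bbcid not in keep_ids:
            continue
        converted = convert_markers(path.read_text(encoding="utf-8"))
        (dst / f"{bbcid}.data").write_text(converted, encoding="utf-8")
        written += 1
    return written
```

A call such as `convert_directory("bbc-summary-data", "xsum-extracts-from-downloads", keep_ids=all_split_ids)` would mirror the directory mapping given above, where `all_split_ids` is whatever set of train/dev/test ids you have loaded.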

@sshleifer

  1. Verifying that I don't need to run prepare_bbc_data.py after doing the SN --> XSUM replacement, right?

  2. Which field is the summary? Or is that in another file?

For context, I'm trying to replicate the results in the BART paper.

Thanks!

@msadat3

msadat3 commented Apr 12, 2021

Hello,

I am also unable to access any of the links posted in this thread. Could you please post a working URL?

Update: I found that the posted URL works if we run "wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz" from a terminal. It does not work from my browser.
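Since the update above reports that a command-line download works where the browser does not, the same download can also be scripted with Python's standard library. This is a minimal sketch: `download` is a hypothetical helper, the timeout value is arbitrary, and nothing here guarantees the URL is still being served.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest, timeout=60):
    """Stream url to dest, writing to a temporary .part file first.

    The rename at the end means dest only appears once the whole
    response has been copied, so an interrupted run leaves no dest file.
    """
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

It would be called with the URL quoted above and a local file name of your choice, for example `download(url, "XSUM-EMNLP18-Summary-Data-Original.tar.gz")`.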

@StevenTang1998

The same problem here. Could you please post a new URL?

@anamtaamin

"hey! link is broken :/ can you share updated one for me, so i can download the dataset.."
Try this:
https://huggingface.co/datasets/xsum/resolve/main/data/XSUM-EMNLP18-Summary-Data-Original.tar.gz

@mikechen66

mikechen66 commented Sep 22, 2023

It generates an error if I use the downloaded dataset. The details are as follows.

I wrote the above-mentioned link (listed again below)

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

into xsum.py:

_URL_DATA = "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"

Then I ran the following code:

from datasets import load_dataset
raw_datasets = load_dataset("xsum.py",  "raw_datasets")

It generates the following error:

ReadError: unexpected end of data

The above exception was the direct cause of the following exception:
File ~/miniconda3/envs/tf/lib/python3.10/site-packages/datasets/builder.py:1712, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1710 if isinstance(e, SchemaInferenceError) and e.context is not None:
1711 e = e.context
-> 1712 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1714 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

The dataset source may have a problem.

Notes:

However, if I use the original code, it runs successfully.

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")
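A `ReadError: unexpected end of data` like the one above can indicate that the downloaded `.tar.gz` is truncated, although this thread does not confirm that as the cause. One way to test a local copy before pointing a loader at it is sketched below. `archive_is_complete` is a hypothetical helper, and reading the full archive twice is a deliberate trade of speed for thoroughness.

```python
import gzip
import tarfile
import zlib


def archive_is_complete(path, chunk_size=1 << 20):
    """Return True if path is a readable, untruncated .tar.gz archive.

    Step 1 decompresses the whole gzip stream, which fails on truncation
    or a checksum mismatch. Step 2 walks every tar member header.
    """
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass
        with tarfile.open(path, mode="r:gz") as archive:
            for _member in archive:
                pass
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
    return True
```

If the check returns False for the downloaded file, downloading it again is a reasonable first step before suspecting the loading script.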
