Skip to content

A small site based on the short story collection "The Garden Party: and Other Stories" by Katherine Mansfield. This site was built for a web scraping exercise for students at the University of Canterbury.

Notifications You must be signed in to change notification settings

polsci/scraping-garden-party

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

95 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scraping Garden Party

August 9, 2019
Geoff Ford
https://polsci.github.io/
https://github.com/polsci/

Introduction

This repo contains files for a small and simple Github pages website (i.e. using Jekyll): https://ucdh.github.io/scraping-garden-party/

The site contains stories from The Garden Party: and Other Stories by Katherine Mansfield. It was built by Geoff Ford as a simple website for students to work with for a web scraping exercise.

It is based on this text file from Project Gutenberg.

The scraping exercise

The scraping task is intended to get students inspecting HTML markup on a very small and simple website and to think about how to write code to extract and follow links and extract specific text.

The scraping task required students to retrieve the 15 links to short stories using CSS selectors and Python and the Requests and BeautifulSoup libraries. They then needed to write some code that looped through the links, built a full URL, retrieved each short story page and output a specific part of the short story page (specifically the text in the paragraph with class="introductory-section"). They were required to put in a 2 second delay between each request using time.sleep().

I've added a couple of things subsequently:

  • A input field in the left column - you can enter CSS selectors and it will highlight the relevant content. This has been added as a tool to demonstrate CSS selectors in class and for students to try this outside of class.
  • Author and date in HTML metadata and in the page to demonstrate that this could be scraped with text to structure our data. Dates are the dates stories were first published in periodicals from a quick read of Wikipedia OR are 1922-01-01 (the year the collection was first published). Note: only year is displayed on short story pages.
  • A link to one of Mansfield's poems, however in this page the text is loaded via AJAX. This is used to demonstrate that not all text is straightforwardly in the HTML in modern websites.

Scraping Garden Party Corpus

A corpus of Garden Party short stories is included in the directory garden-party-corpus. You can download a zip file of the Scraping Garden Party Corpus. The filenames contain a recognisable version of the short story name. The text files contain the short story without the title.

About

A small site based on the short story collection "The Garden Party: and Other Stories" by Katherine Mansfield. This site was built for a web scraping exercise for students at the University of Canterbury.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HTML 94.7%
  • SCSS 5.3%