This web scraper allows you to extract blog titles and links from a given URL. You can specify CSS selectors for the title and link elements to precisely extract the data. The scraper also supports pagination, allowing you to scrape multiple pages.
-
What is the URL? The URL is the web address of the blog or website you want to scrape. For example:
https://realpython.com/
https://example-blog.com/
Paste the URL in the Enter URL field of the GUI.
-
What is a Title Selector? The Title Selector is a CSS selector used to identify the HTML elements containing the article titles on the webpage.
- For example, if the title is inside an
<h2>
element with the classcard-title
, the selector would be:h2.card-title
- For example, if the title is inside an
-
How to find the title selector?
- Open the webpage in a browser (e.g., Google Chrome).
- Right-click on the title of an article and select Inspect.
- Find the HTML element that wraps the title (e.g.,
<h2 class="card-title">Article Title</h2>
). - Use this element's tag (e.g.,
h2
) and class (e.g.,.card-title
) as your selector.
-
What is a Link Selector? The Link Selector is a CSS selector used to identify the HTML elements containing the links to the full articles.
- For example, if the link is inside an anchor (
<a>
) tag inside theh2
element, the selector would be:h2.card-title a
- For example, if the link is inside an anchor (
-
How to find the link selector?
- Right-click on the link to the article (e.g., the title text) and select Inspect.
- Find the
<a>
tag inside the title element (e.g.,<a href="article-link">Article Title</a>
). - Use the anchor tag (
<a>
) as your selector.
- What is this field for?
If the website uses pagination (i.e., content is spread across multiple pages), you can specify how many pages to scrape.
- For example, if you want to scrape pages 1 through 5, enter
5
in the Number of Pages to Scrape field. - If you want to scrape just the first page, leave it as
1
(default value).
- For example, if you want to scrape pages 1 through 5, enter
- Enter the web address of the page you want to scrape. Example:
https://realpython.com/
- Enter the CSS selector for the title of each article. Example for
https://realpython.com/
:h2.card-title
.
- Enter the CSS selector for the link to the article. Example for
https://realpython.com/
:h2.card-title a
.
- Enter how many pages you want to scrape. For example, enter
1
for the first page, or5
for the first five pages.
- Click the Scrape button. The scraper will begin extracting titles and links from the specified URL and pages.
- Once scraping is complete, the extracted results will be displayed in the Results area.
- To save the results, click Export Results. You can export the data to CSV, JSON, or TXT files.
- The
config.json
file stores the last used settings (URL, title selector, link selector, number of pages). This allows you to easily continue scraping without re-entering these values. - If the file doesn’t exist, the program will create it with default values.
{
"url": "https://realpython.com/",
"title_selector": "h2.card-title",
"link_selector": "h2.card-title a",
"pages": 1
}