IndeedScraping 🖱️

Let's dive into the code in more detail 🕵️‍♀️ Below is a step-by-step breakdown of how the code scrapes job information from the Indeed website, using the occupation "kuchař" (cook) in the Czech Republic as the example.

The code starts by loading necessary R packages (RSelenium, wdman, openxlsx, and dplyr) to support web scraping and data manipulation.
It then sets up an automated web browser (Google Chrome) using RSelenium and assigns a random port number to it.
The script navigates to the Czech Republic's Indeed website (https://cz.indeed.com/), which is the target website for job scraping.
Before performing the job search, the script contains commented-out lines that navigate to other country-specific Indeed websites (Germany, Netherlands, Belgium, UK, Ireland, Spain, Hungary, and Switzerland). Uncomment the relevant line to scrape data from one of those countries instead.
The script locates the search box on the website using its XPath expression and enters the search term "kuchař" (which means "cook" in Czech) into the search box. Then, it simulates pressing the "Enter" key to initiate the search.
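The setup, navigation, and search steps above might look roughly like the following. This is a minimal sketch rather than the repository's actual code: the helper names `pick_port` and `start_search`, the port range, and the XPath for the search box are assumptions made for illustration. Only `rsDriver`, `navigate`, `findElement`, and `sendKeysToElement` are RSelenium calls.

```r
# Pick a random port for the Selenium server.
# The 4000-9999 range is an assumption, not taken from the script.
pick_port <- function(low = 4000L, high = 9999L) {
  sample(low:high, 1L)
}

# Hypothetical helper: start Chrome, open the site, and run a search.
# Calling it requires the RSelenium package and a local Chrome install.
start_search <- function(term = "kuchař",
                         base_url = "https://cz.indeed.com/",
                         search_xpath = "//input[@id='text-input-what']") {
  driver <- RSelenium::rsDriver(browser = "chrome", port = pick_port())
  remDr <- driver$client
  remDr$navigate(base_url)
  # The default XPath is a placeholder; inspect the live page for the real one.
  search_box <- remDr$findElement(using = "xpath", value = search_xpath)
  # Type the term, then simulate the Enter key to submit the search.
  search_box$sendKeysToElement(list(term, key = "enter"))
  driver
}
```

Calling `driver <- start_search()` would return the driver object, with the browser session available as `driver$client`. Passing a different `base_url` is one way to target another country's site without editing the function body.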
The main part of the code is inside a while (TRUE) loop, meaning it will continue running indefinitely until manually stopped. The purpose of this loop is to keep scraping multiple pages of job listings.
Within the loop, the code iterates over the job listings on the current page (up to 15 listings) and collects the following information for each job:
- company: The name of the hiring company.
- rating: The job rating (if available). Note: Some job listings may not have a rating, so it will be NA.
- location: The location of the job.
- occupation: The job title or occupation (in this case, "kuchař").
- desc: A short description of the job.
- type: The type of employment (full-time or part-time). Note: Some job listings may not have this information, so it will be NA.
- salary: The salary information (if available). Note: Some job listings may not have this information, so it will be NA.
- salaryANDtype: The employment type and salary combined (type + salary).
- date: The posting date of the job.
To extract this information from each job listing, the script uses a separate XPath expression for each element on the page. A helper function, getElementTextOrNA, handles the extraction: if an element is not found or any other error occurs, it returns NA instead of stopping the script.
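A helper of this kind can be sketched with `tryCatch`. The version below is an illustration, not the repository's exact function: it takes the lookup as an argument (`find_text`, a hypothetical parameter) so that it can be tried without a browser. With a real driver, the lookup would presumably wrap something like `remDr$findElement(using = "xpath", value = xpath)$getElementText()[[1]]`. The XPath strings and the fake page are placeholders.

```r
# Return the text found at `xpath`, or NA if the lookup fails for any reason.
# `find_text` is an assumed stand-in for the RSelenium lookup.
getElementTextOrNA <- function(find_text, xpath) {
  tryCatch(
    as.character(find_text(xpath)),
    error = function(e) NA_character_
  )
}

# Fake page used only for illustration: one element exists, others do not.
fake_page <- c("//span[@class='company']" = "Example s.r.o.")
fake_lookup <- function(xpath) {
  if (!xpath %in% names(fake_page)) stop("no such element: ", xpath)
  fake_page[[xpath]]
}

company <- getElementTextOrNA(fake_lookup, "//span[@class='company']")
rating  <- getElementTextOrNA(fake_lookup, "//span[@class='rating']")
```

Here `company` holds the element text, while `rating` is `NA` because the lookup raised an error. Returning `NA` keeps every field the same length across listings, which matters when the vectors are later combined into one data frame.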
After extracting the data from one page of job listings, the script clicks the "Next" button to move to the next page. A short pause of 1 second (Sys.sleep(1)) is added to allow the new page to load before proceeding with scraping.
The loop continues this process of scraping data from multiple pages until manually stopped.
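The loop described above runs until it is stopped by hand. As a variation (not what the script does), the sketch below ends the loop on its own once the "Next" click fails. It runs against a simulated three-page site, so `pages`, `current`, and `click_next` are hypothetical stand-ins for the real driver calls.

```r
# Simulated site: three pages of job titles. Purely illustrative data.
pages <- list(c("job A", "job B"), c("job C"), c("job D", "job E"))
current <- 1

# Stand-in for clicking "Next"; it errors on the last page, much as a
# missing button would with a real driver.
click_next <- function() {
  if (current >= length(pages)) stop("no Next button")
  current <<- current + 1
}

occupation <- character(0)
pause <- 0.1  # shortened for the demo; the script waits 1 second

while (TRUE) {
  # Collect every listing on the current page.
  occupation <- c(occupation, pages[[current]])
  # Try to advance; leave the loop once there is no next page.
  moved <- tryCatch({ click_next(); TRUE }, error = function(e) FALSE)
  if (!moved) break
  Sys.sleep(pause)
}
```

Wrapping the click in `tryCatch` means the data collected so far survives when the last page is reached, instead of the script ending with an error.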
Once the scraping process is complete, the code merges the separate vectors (company, rating, location, occupation, desc, salary, type, salaryANDtype, and date) into a single data frame named all.
Finally, the script writes the collected data to an Excel file named "cook_cz.xlsx" using the write.xlsx function from the openxlsx package.
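The last two steps can be sketched as follows. The values are placeholders rather than real scraped data, and using `data.frame()` for the merge is an assumption about how the script combines the vectors. The sketch writes to a temporary directory and skips the export if openxlsx is not installed.

```r
# Placeholder values standing in for two scraped listings.
company       <- c("Example s.r.o.", "Sample a.s.")
rating        <- c("4.1", NA)
location      <- c("City 1", "City 2")
occupation    <- c("Title 1", "Title 2")
desc          <- c("Description 1", "Description 2")
salary        <- c(NA, "Salary 2")
type          <- c("Type 1", NA)
salaryANDtype <- c("Type 1", "Salary 2")
date          <- c("Date 1", "Date 2")

# All vectors must have the same length, which is why missing fields
# are stored as NA rather than dropped.
all <- data.frame(company, rating, location, occupation, desc,
                  salary, type, salaryANDtype, date,
                  stringsAsFactors = FALSE)

# Write the result to an Excel file (here in a temporary directory).
out_path <- file.path(tempdir(), "cook_cz.xlsx")
if (requireNamespace("openxlsx", quietly = TRUE)) {
  openxlsx::write.xlsx(all, out_path)
}
```

Each vector becomes one column, named after the vector, and each listing becomes one row.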


kketevani/IndeedScraping
