cermine

Lifecycle: experimental

CZVV neither publishes a REST API nor provides exam data in easy-to-process JSON or XML formats. Instead, the data are published in separate XLSX files that are difficult to access programmatically because the links are dynamically generated by an ASP.NET server. This package provides a simple R API that allows you to gather links to all available exam data files and returns them in a tidy table.

Installation

You can install the development version of {cermine} like so:

remotes::install_github("netique/cermine")

Basic workflow

Say we want to download all item-level “Maturita” data for 2024. First, we need to load the {cermine} and {tidyverse} packages:

library(cermine)
library(tidyverse)

Then we can get the list of individual exam events. Note that all HTTP requests are cached for the current R session, so you don’t have to worry much about hammering the server. If you believe that the data have changed, you can force a refresh by setting the force argument to TRUE or by restarting your R session.

exams <- get_exam_events()
#> ℹ Extracting exam events...
#> ✔ Done extracting exam events... [5ms]

exams
#> # A tibble: 27 × 2
#>    project           year                        
#>    <chr>             <chr>                       
#>  1 Maturitní zkoušky 2024 - podzim               
#>  2 Maturitní zkoušky 2024 - Jaro                 
#>  3 Maturitní zkoušky 2023 - Podzim               
#>  4 Maturitní zkoušky 2023 - Jaro                 
#>  5 Maturitní zkoušky 2022 - Podzim               
#>  6 Maturitní zkoušky 2022 - Jaro                 
#>  7 Maturitní zkoušky 2021 - Podzim               
#>  8 Maturitní zkoušky 2021 - Jaro mimořádný termín
#>  9 Maturitní zkoušky 2021 - Jaro                 
#> 10 Maturitní zkoušky 2020 - Podzim               
#> # ℹ 17 more rows
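
To make the session-level caching mentioned above more concrete, here is a minimal sketch of the general idea: a result is stored on first use and reused until a refresh is forced. This is not {cermine}'s actual implementation; cached_fetch(), the .cache environment, and fake_request() are hypothetical names used only for illustration.

```r
# Hypothetical illustration of session-level caching; NOT cermine's internals.
.cache <- new.env(parent = emptyenv())

cached_fetch <- function(key, fetch, force = FALSE) {
  # Reuse the stored result unless a refresh is forced
  if (!force && exists(key, envir = .cache, inherits = FALSE)) {
    return(get(key, envir = .cache, inherits = FALSE))
  }
  value <- fetch()
  assign(key, value, envir = .cache)
  value
}

# A stand-in for an HTTP request that counts how often it is called
n_requests <- 0
fake_request <- function() {
  n_requests <<- n_requests + 1
  "response body"
}

cached_fetch("exam_events", fake_request)                # performs the "request"
cached_fetch("exam_events", fake_request)                # served from the cache
cached_fetch("exam_events", fake_request, force = TRUE)  # refreshes the cache
n_requests  # counts only the calls that actually reached fake_request()
```

Because the cache in this sketch lives in an environment created during the session, restarting R discards it, which matches the behaviour described above.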

Now we can keep only the entries we are interested in. Because the year column can contain the season in addition to the year itself, we use the str_detect() function from the {stringr} package to match any entry that contains “2024”. If you make a mistake here, don’t worry: the mine_links() function will tell you if you provide an option that is not available.

exams_filtered <- exams |>
  filter(project == "Maturitní zkoušky", str_detect(year, "2024"))
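
One thing to watch for when filtering by season: in the listing above, the season labels are not capitalised consistently (“podzim” vs. “Podzim”), so a case-sensitive match can silently miss rows. Below is a minimal base R sketch of a case-insensitive filter. The exams_demo data frame is a hand-made stand-in that copies a few values from the printed output; it is not produced by {cermine}.

```r
# Hand-made stand-in for a few rows of `exams` (values copied from the output above)
exams_demo <- data.frame(
  project = "Maturitní zkoušky",
  year = c("2024 - podzim", "2024 - Jaro", "2023 - Podzim", "2023 - Jaro")
)

# A case-sensitive match on "Podzim" misses "2024 - podzim"
n_case_sensitive <- sum(grepl("Podzim", exams_demo$year, fixed = TRUE))

# A case-insensitive match catches both spellings
autumn <- exams_demo[grepl("podzim", exams_demo$year, ignore.case = TRUE), ]
autumn$year
```

In a {dplyr} pipeline, the same effect should be achievable with str_detect(year, regex("podzim", ignore_case = TRUE)).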

Next, we can mine the links to the data files. This may take a while, as the function has to “submit” the ASP.NET forms to obtain HTTP responses containing the link-populated HTML pages from the server. Pass the columns of the exams_filtered tibble to the projects and years arguments (there is no need to provide only unique values, so you can use the columns as they are). Here, the data_type argument is set to “item” so that only item-level data are returned. The function returns a tibble with the project, year, data type, data file name, and the associated link. A progress bar will show up if the operation is expected to take longer than a couple of seconds.

links <- mine_links(
  projects = exams_filtered$project,
  years = exams_filtered$year,
  data_type = "item"
)

links
#> # A tibble: 13 × 5
#>    project           year          data_type file_name                  file_url
#>    <fct>             <fct>         <fct>     <chr>                      <chr>   
#>  1 Maturitní zkoušky 2024 - podzim item      MZ2024p_AJ_polozkova_data… https:/…
#>  2 Maturitní zkoušky 2024 - podzim item      MZ2024p_CJ_polozkova_data… https:/…
#>  3 Maturitní zkoušky 2024 - podzim item      MZ2024p_MA_polozkova_data… https:/…
#>  4 Maturitní zkoušky 2024 - podzim item      MZ2024p_NJ_polozkova_data… https:/…
#>  5 Maturitní zkoušky 2024 - podzim item      MZ2024p_RJ_polozkova_data… https:/…
#>  6 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_AJ_polozkova_data… https:/…
#>  7 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_CJ_polozkova_data… https:/…
#>  8 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_FJ_polozkova_data… https:/…
#>  9 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_MA_polozkova_data… https:/…
#> 10 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_MX_polozkova_data… https:/…
#> 11 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_NJ_polozkova_data… https:/…
#> 12 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_RJ_polozkova_data… https:/…
#> 13 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_SJ_polozkova_data… https:/…
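
The file names in the output above appear to follow a regular pattern, which can be handy for grouping files by term or subject. The sketch below parses the visible prefixes with a regular expression. The pattern (“MZ”, a four-digit year, a term letter, an underscore, and a two-letter subject code) is an assumption inferred from the printed rows only, and the names are truncated in the printout, so the real names continue past what is used here.

```r
# File-name prefixes as printed above; the real names continue past "data"
file_name <- c(
  "MZ2024p_AJ_polozkova_data",
  "MZ2024j_CJ_polozkova_data",
  "MZ2024j_MA_polozkova_data"
)

# ASSUMED pattern: "MZ" + year + term letter + "_" + two-letter subject code
pattern <- "^MZ([0-9]{4})([pj])_([A-Z]{2})_.*$"

parsed <- data.frame(
  file_name = file_name,
  exam_year = as.integer(sub(pattern, "\\1", file_name)),
  term      = sub(pattern, "\\2", file_name),  # "p" and "j" as seen in the output
  subject   = sub(pattern, "\\3", file_name)
)
parsed
```

If a file name does not match the assumed pattern, sub() returns it unchanged, so it is worth checking the parsed columns before relying on them.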

To download the files, you can use the download.file() function as follows. Because the files can be large, it is recommended to raise the timeout above the default of 60 seconds if a download fails. The download.file() function is not vectorized, so we use purrr::map2() to iterate over the file names and URLs. A simple for loop works just as well.

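# if a download times out, raise the limit (in seconds), e.g. to 10 minutes
# options(timeout = 60 * 10)

# map2() pairs each file name with its URL and calls the function once per pair
map2(
  links$file_name, links$file_url,
  \(file_name, file_url) {
    # mode = "wb" writes in binary mode, which XLSX files need on Windows
    download.file(file_url, file_name, mode = "wb")
  }
)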
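
For the for loop alternative, here is a minimal sketch that also skips files that already exist and records which downloads failed. The download_links() helper and its arguments are hypothetical and not part of {cermine}; the download argument exists only so that the function can be tried out without network access.

```r
# Hypothetical helper, not part of {cermine}
download_links <- function(file_names, file_urls, dest_dir = ".",
                           download = utils::download.file) {
  stopifnot(length(file_names) == length(file_urls))
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  status <- character(length(file_urls))

  for (i in seq_along(file_urls)) {
    dest <- file.path(dest_dir, file_names[i])
    if (file.exists(dest)) {
      status[i] <- "skipped"  # already downloaded earlier
      next
    }
    ok <- tryCatch(
      # download.file() returns 0 on success;
      # mode = "wb" keeps binary files such as XLSX intact on Windows
      identical(as.integer(download(file_urls[i], dest, mode = "wb")), 0L),
      error = function(e) FALSE
    )
    if (!ok) unlink(dest)  # do not leave a partial file behind
    status[i] <- if (ok) "ok" else "failed"
  }

  data.frame(file_name = file_names, status = status)
}

# Possible usage with the tibble from above (not run here):
# download_links(links$file_name, links$file_url, dest_dir = "data")
```

Running the helper a second time only retries the files that are still missing, which keeps repeated runs gentle on the server.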

Et voilà! You have downloaded the data files to your current working directory. The files are named according to the file_name column in the links tibble. You can now read them into R using the {readxl} package or any other package that can read XLSX files.
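
As a starting point for reading the files, the sketch below loads every XLSX file found in a directory into a named list. The read_downloaded() helper is hypothetical; it assumes {readxl} is installed and that the first sheet of each workbook is the one you want, which is something to verify against the actual files.

```r
# Hypothetical helper, not part of {cermine}.
# Assumes {readxl} is installed and that the first sheet holds the data.
read_downloaded <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.xlsx$",
                      full.names = TRUE, ignore.case = TRUE)
  # read_excel() reads the first sheet unless told otherwise
  data <- lapply(files, function(f) readxl::read_excel(f))
  names(data) <- basename(files)
  data
}

# Possible usage (not run here):
# item_data <- read_downloaded(".")
# names(item_data)
```

Keeping the file names as list names makes it easy to trace each table back to its row in the links tibble.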

About

Scrape raw or aggregated data for Maturita and JPZ exams by CZVV
