CZVV neither publishes a REST API nor provides exam data in easy-to-process JSON or XML formats. Instead, the data are published in separate XLSX files that are difficult to access programmatically because the links are dynamically generated by an ASP.NET server. This package provides a simple R API that allows you to gather links to all available exam data files and returns them in a tidy table.
You can install the development version of {cermine} like so:
remotes::install_github("netique/cermine")
Say we want to download all item-level “Maturita” data for 2024. First, we need to load the {cermine} and {tidyverse} packages:
library(cermine)
library(tidyverse)
Then we can get the list of individual exam events. Note that all HTTP
requests are cached for the current R session, so you don’t have to
worry much about hammering the server. If you believe that the data have
changed, you can force a refresh by setting the force argument to TRUE
or by simply restarting your R session.
exams <- get_exam_events()
#> ℹ Extracting exam events...✔ Done extracting exam events... [5ms]
exams
#> # A tibble: 27 × 2
#>    project           year
#>    <chr>             <chr>
#>  1 Maturitní zkoušky 2024 - podzim
#>  2 Maturitní zkoušky 2024 - Jaro
#>  3 Maturitní zkoušky 2023 - Podzim
#>  4 Maturitní zkoušky 2023 - Jaro
#>  5 Maturitní zkoušky 2022 - Podzim
#>  6 Maturitní zkoušky 2022 - Jaro
#>  7 Maturitní zkoušky 2021 - Podzim
#>  8 Maturitní zkoušky 2021 - Jaro mimořádný termín
#>  9 Maturitní zkoušky 2021 - Jaro
#> 10 Maturitní zkoušky 2020 - Podzim
#> # ℹ 17 more rows
Now we can filter for the entries we are interested in. Because the year
column can contain the season in addition to the year itself, we use the
str_detect() function from the {stringr} package to match any entry that
contains “2024”. But don’t worry about getting this wrong: the
mine_links() function will tell you if you’ve provided an option that is
not available.
exams_filtered <- exams |>
  filter(project == "Maturitní zkoušky", str_detect(year, "2024"))
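To illustrate why substring matching is needed here, the following minimal sketch compares exact and partial matching on a small mock tibble. The exams_mock object and the "Other project (hypothetical)" row are made up for illustration and only mimic the structure of the output above.

```r
library(dplyr)
library(stringr)
library(tibble)

# Mock data mimicking the structure of `exams`; the last row is hypothetical
exams_mock <- tribble(
  ~project,                       ~year,
  "Maturitní zkoušky",            "2024 - podzim",
  "Maturitní zkoušky",            "2024 - Jaro",
  "Maturitní zkoušky",            "2023 - Podzim",
  "Other project (hypothetical)", "2024"
)

# Exact matching misses rows where the season is appended to the year
exact <- exams_mock |>
  filter(project == "Maturitní zkoušky", year == "2024")

# Substring matching keeps both 2024 terms
partial <- exams_mock |>
  filter(project == "Maturitní zkoušky", str_detect(year, "2024"))

nrow(exact)   # 0
nrow(partial) # 2
```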
Next, we can mine the links to the data files. This may take a while, as
the function has to “submit” the ASP.NET forms to obtain HTTP responses
containing link-populated HTML from the server. Pass the corresponding
columns of the exams_filtered tibble to the projects and years arguments
(there is no need to deduplicate the values, so you can use the columns
as they are). Here, the data_type argument is set to “item” to request
item-level data only. The function returns a tibble with the project,
year, data type, data file name, and the associated link. A progress bar
will show up if the operation is expected to take longer than a couple
of seconds.
links <- mine_links(
  projects = exams_filtered$project,
  years = exams_filtered$year,
  data_type = "item"
)
links
#> # A tibble: 13 × 5
#>    project           year          data_type file_name                  file_url
#>    <fct>             <fct>         <fct>     <chr>                      <chr>
#>  1 Maturitní zkoušky 2024 - podzim item      MZ2024p_AJ_polozkova_data… https:/…
#>  2 Maturitní zkoušky 2024 - podzim item      MZ2024p_CJ_polozkova_data… https:/…
#>  3 Maturitní zkoušky 2024 - podzim item      MZ2024p_MA_polozkova_data… https:/…
#>  4 Maturitní zkoušky 2024 - podzim item      MZ2024p_NJ_polozkova_data… https:/…
#>  5 Maturitní zkoušky 2024 - podzim item      MZ2024p_RJ_polozkova_data… https:/…
#>  6 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_AJ_polozkova_data… https:/…
#>  7 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_CJ_polozkova_data… https:/…
#>  8 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_FJ_polozkova_data… https:/…
#>  9 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_MA_polozkova_data… https:/…
#> 10 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_MX_polozkova_data… https:/…
#> 11 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_NJ_polozkova_data… https:/…
#> 12 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_RJ_polozkova_data… https:/…
#> 13 Maturitní zkoušky 2024 - Jaro   item      MZ2024j_SJ_polozkova_data… https:/…
To download the files, you can use the download.file() function as
follows. Because the files can be large, it is recommended to set the
timeout to a higher value than the default 60 seconds. The
download.file() function is not vectorized, so we have to use
purrr::map2() to iterate over the file names and URLs. You can opt for
a simple for loop as well.
# optionally set the timeout to 10 minutes, if the download fails
# options(timeout = 60 * 10)
map2(
  links$file_name, links$file_url,
  \(file_name, file_url) {
    # mode = "wb" writes the XLSX files as binary, which matters on Windows
    download.file(file_url, file_name, mode = "wb")
  }
)
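The for-loop alternative mentioned above could look roughly like the sketch below. The download_links() helper and the "exam_data" directory name are hypothetical and not part of {cermine}; the helper only assumes a table with file_name and file_url columns, as returned by mine_links().

```r
# Hypothetical helper, not part of {cermine}: downloads each file listed in
# `links` into `dir`, skipping files that already exist there.
download_links <- function(links, dir = ".") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dir, links$file_name)
  for (i in seq_along(dest)) {
    if (file.exists(dest[i])) next
    # mode = "wb" keeps binary files such as XLSX intact on Windows
    download.file(links$file_url[i], dest[i], mode = "wb", quiet = TRUE)
  }
  # return the local paths invisibly so they can be reused for reading
  invisible(dest)
}

# Usage (requires network access and the `links` tibble from above):
# paths <- download_links(links, dir = "exam_data")
```

Skipping existing files makes it cheap to rerun the loop after a timeout, at the cost of not refreshing files that have changed on the server.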
Et voilà! You have downloaded the data files to your current working
directory. The files are named according to the file_name column in
the links tibble. You can now read them into R using the {readxl}
package or any other package that can read XLSX files.
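As a minimal sketch of that last step, the code below reads a set of XLSX files into a named list of tibbles with readxl::read_excel(). An example workbook shipped with {readxl} stands in for the downloaded files so that the snippet runs offline. The sheet layout of the CZVV files is not described here, so arguments such as sheet or skip may be needed for the real data.

```r
library(readxl)

# Stand-in for the downloaded files: an example workbook shipped with {readxl}.
# With the real data, something like `files <- links$file_name` would be used.
files <- readxl_example("datasets.xlsx")

# Read the first sheet of each file into a list of tibbles named by file
data_list <- lapply(files, read_excel)
names(data_list) <- basename(files)

str(data_list, max.level = 1)
```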