Refresher Capability for MBOX Downloader #284

Open
6 tasks done
ian-lastname opened this issue Mar 9, 2024 · 30 comments

@ian-lastname
Collaborator

ian-lastname commented Mar 9, 2024

1. Purpose

The purpose of this issue is to add refresh capability for the mod mbox downloader and pipermail downloader. I'll have to create a refresh function for both downloaders, as well as a parser function that parses the latest downloaded mail file. There are two mod mbox downloader functions: download_mod_mbox and download_mod_mbox_per_month. Since pagination is required for a refresh function, I will only be focusing on the download_mod_mbox_per_month function.

2. Process

I will base my changes and new code on the already existing code related to the mbox downloader and parser. For the refresh capability, I will look through Sean's jira downloader refresher to get a good idea on how I should make it. Though from what I already know about it, I will most definitely be making a new function that takes a date of some sort.

3. Endpoints

From the meeting, apparently I only have year and month to work with when it comes to endpoints. I'll do a bit more checking around just to make sure.

4. Task List

  • Double check the Jira download refresher to get a better understanding on how a refresh function would work.
  • Most likely change the naming convention of the downloaded files for the purpose of getting the latest date for the refresh function. (mailinglist_archive_yearmonth.mbox)
  • Create the download refresher (called "mbox_download_refresher")
  • Find a way to delete the duplicates that may be downloaded by the downloader refresher.
  • Make a "get latest date" function similar to the "get latest date" function that I made for Jira. Most likely called "mbox_latest_date" (parse_mbox_latest_date(mbox_save_path))
  • Edit download_pipermail so that it downloads mail as mbox files instead of just txt files, also so that it makes the downloaded files' names adhere to the naming convention (mailinglist_archive_yearmonth.mbox)

Refresher (Endpoint)

I'll be using year for the end point. For the refresher function, I'll make the upper bound endpoint the current year, getting it by some built-in function that returns the current year.

Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_mod_mbox_per_month, downloading mbox files starting from the from_year parameter, to the current real-life year.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year

Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path,verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_pipermail, which downloads all mail from a selected mailing list in a selected archive.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year. Very similar to refresh_mod_mbox; the only reason why I had to make a separate function for pipermail is that the pipermail downloader works differently from the mod mbox downloader.

New Parser: parse_mbox_latest_date(mbox_path)

  • Finds the latest downloaded mail file in a selected mbox save folder. It does so based on the yearmonth part of the file name. This function returns the name of the latest downloaded mail file for use by the refresher function.
  • Can be used for both mail downloaded via download_pipermail and download_mod_mbox_per_month since both downloader functions download the mail files as mbox files (as in, both downloaders save mail with ".mbox" as the extension)
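
A minimal sketch of the lookup described above, assuming file names end in a six-digit yearmonth followed by ".mbox". The helper name latest_mbox_file and the demo file names are hypothetical; this is not the actual parse_mbox_latest_date implementation.

```r
# Hypothetical helper: return the file name with the largest YYYYMM suffix.
# Assumes names end in "<yearmonth>.mbox", e.g. "openssl_mta_201804.mbox".
latest_mbox_file <- function(mbox_path) {
  files <- list.files(mbox_path, pattern = "[0-9]{6}\\.mbox$")
  if (length(files) == 0) return(NA_character_)
  # Pull the six digits immediately before ".mbox" out of each name.
  year_month <- sub("^.*([0-9]{6})\\.mbox$", "\\1", files)
  # YYYYMM values order correctly when compared as integers.
  files[which.max(as.integer(year_month))]
}

# Demo on a throwaway folder with empty placeholder files.
demo_dir <- file.path(tempdir(), "mbox_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("openssl_mta_201803.mbox",
                                  "openssl_mta_201804.mbox",
                                  "201712.mbox")))
latest_mbox_file(demo_dir)
```

Because only the trailing yearmonth is inspected, a bare 201712.mbox is treated the same way as a fully named file.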

Incorporating Month as an Endpoint Along With Year

Currently, the endpoint parameters for the downloader/refresher functions that take them only accept a year (e.g. 2004). Because of this, a download always starts at the beginning of the given "from" year. It is entirely possible to make the downloader start at a specified month as well as a year. The logic to do so is as follows:

  • The "from_year" and "to_year" parameters in download_mod_mbox_per_month can take in a date in a yearmonth format (i.e. January 2017 -> 201701)
  • Extract the year and month values from the parameters using as.numeric(substr([variable], 1, 4)) and as.numeric(substr([variable], 5, 6)) respectively
  • Currently, the mbox downloader loops through each month to download files from the current iterated month and year. When iterating through the months of the year extracted from the "from_year" parameter, just make it so that it starts on the extracted month. Likewise, when iterating through the months of the "to_year" parameter, just make it so that it ends on the extracted month for that parameter.
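
The month handling above can be sketched as follows. This is only an illustration under the yearmonth format described in the bullets; split_year_month and year_month_range are hypothetical names, not functions in mail.R.

```r
# Split a YYYYMM value into its year and month parts using substr().
split_year_month <- function(year_month) {
  ym <- sprintf("%06d", as.integer(year_month))
  list(year = as.numeric(substr(ym, 1, 4)),
       month = as.numeric(substr(ym, 5, 6)))
}

# Hypothetical helper: list every YYYYMM from `from` to `to`, inclusive.
year_month_range <- function(from, to) {
  start <- split_year_month(from)
  end <- split_year_month(to)
  # Put months on a single axis so the first and last year need no special case.
  first <- start$year * 12 + (start$month - 1)
  last <- end$year * 12 + (end$month - 1)
  if (first > last) return(character(0))
  idx <- first:last
  sprintf("%04d%02d", idx %/% 12L, idx %% 12L + 1L)
}

months_to_fetch <- year_month_range(201711, 201802)
```

Counting months on one axis starts the first year at the extracted month and ends the last year at its extracted month, which is the behavior the bullets ask for.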

Pipermail: Manually Prompting Pipermail Refresher to Start After a Certain Year and Month

Pipermail archives have their archived mail in txt or txt.gz formats. Here is an example of a pipermail archive
[Image: piper1]
In this picture, you can see that the downloadable versions of each mail file are viewable with a link to the txt file. Clicking on the link takes you to this page:
[Image: piper2]
As you can see, this is a raw file of all the mail messages in April 2018. Notice the naming convention of the downloadable file, which is underlined in red. The file is named on a year-month basis. You'll want to download the file whose date you want to start from, and put it in the save folder on which you will run the pipermail refresh.

Next, you will want to rename your downloaded file to the correct naming format (i.e. openssl_mta_201804.mbox as per the second picture). With that, the refresher should start from the month and year that your downloaded file is from.

Chances are, you might not even need the full naming format; as long as the file name has the yearmonth part and the correct extension (e.g. 201804.mbox should be enough to start from April 2018), it should work. You might not even need to download the file from the mail archive at all; a blank file with the correct naming convention (or at the very least yearmonth.mbox) should be sufficient, as the refresher will just delete that file and replace it with the actual mail file for that year and month.

@ian-lastname ian-lastname self-assigned this Mar 9, 2024
@ian-lastname ian-lastname changed the title Mbox Downloader Refresher Mbox Downloader Refresher (Milestone 2) Mar 11, 2024
@carlosparadis carlosparadis changed the title Mbox Downloader Refresher (Milestone 2) Refresher Capability for MBOX Downloader (Milestone 2) Mar 18, 2024
@ian-lastname
Collaborator Author

  • Explain logic behind implementing month in the parameter for from_year and to_year
  • Post hyperlink to pipermail openssl-dev archive all files and the most recent file

@carlosparadis
Member

@ian-lastname Please add here the notes requested during the last meeting Friday:

  • Screenshots / urls / examples of how the pipermail .txt file can be obtained to manually prompt your refresher to start after a given year and month

There was another item, what was it?

@ian-lastname
Collaborator Author

@ian-lastname Please add here the notes requested during the last meeting Friday:

  • Screenshots / urls / examples of how the pipermail .txt file can be obtained to manually prompt your refresher to start after a given year and month

There was another item, what was it?

I remember the other item; it was to link to the part of the pipermail refresher code that was supposed to print a warning message when a file-not-found error occurs at a given URL. It turns out I had removed the code that actually printed the warning when the error is encountered.

@carlosparadis
Member

@ian-lastname If the code already exists, could you make a commit to just place it back? I have not started reviewing your code yet.

@carlosparadis
Member

The pipermail mbox refresher has a main IF and ELSE. When the IF branch is taken, the entire code logic defaults to download_pipermail.

download_pipermail downloads the main page of the mailing list archive (e.g. https://mta.openssl/pipermail/openssl-users/). This page contains the list of all URLs of the mbox files as either .txt or .gz. Both are mbox in disguise; we only need to rename the file extensions.

download_pipermail will get the URLs, download the appropriate files, and rename them. It relies on this file to know whether .gz or .txt will be available and for what dates. Without said file, it is impossible to know which will be the case.

The Else portion of pipermail refresher will not rely on the file. Therefore, it will not know the year to end, other than system time, and will also not know whether txt, gz or both are available. In addition, the code logic for current year and last year was split into two functions. Combined with the txt or gz functions, this results in 4 functions being fired every year/month all the way to current year/month from system time. This generates a number of empty files saved, which are subsequently deleted as they are downloaded all the way to current year.

The rework of the else function should rely on the download_pipermail function, and re-obtain the list of all files, use the last file year_month, and then download only the files of either .txt or .gz according to the URLs extracted from said file. This will reduce the number of function calls to only 1 per year month, and also prevent firing for years and months that are not available (perhaps because the archive stopped storing data way before the current year date).
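
A rough sketch of that filtering step, assuming the index page lists files named "<year>-<English month name>.txt" or ".txt.gz" (as in the April 2018 example earlier in this issue). The helper name pipermail_year_month and the sample file names are made up for illustration.

```r
# Hypothetical helper: turn pipermail file names such as "2018-April.txt.gz"
# into YYYYMM strings.
pipermail_year_month <- function(file_names) {
  year <- sub("^([0-9]{4})-.*$", "\\1", file_names)
  month_name <- sub("^[0-9]{4}-([A-Za-z]+)\\..*$", "\\1", file_names)
  # month.name is base R's built-in vector "January" ... "December".
  sprintf("%s%02d", year, match(month_name, month.name))
}

# File names as they might be extracted from the index page (made up here).
available <- c("2018-March.txt.gz", "2018-April.txt.gz", "2018-May.txt.gz")
latest_local <- "201804"  # year_month of the newest file already on disk

available_ym <- pipermail_year_month(available)
# Keep only months at or after the newest local file. Months that the
# archive does not list are never requested.
to_download <- available[as.integer(available_ym) >= as.integer(latest_local)]
```

Since the candidate list comes from the index itself, the loop stops where the archive stops rather than running on to the current system date.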

@carlosparadis
Member

download_mod_mbox was not tested on a project whose data does not extend to the current date, as most Apache projects had data up to the present. I suspect there will be a problem where empty files will be saved (edit this comment later to refer to the issue Lihan or I posted about that).

@daomcgill
Collaborator

daomcgill commented Sep 12, 2024


Purpose

Rework mbox and pipermail download functions. Add refresh capability for both.

Process

Start by working on pipermail download and refresh functions. Update config files and relevant notebook. Move on to mbox download and refresh.

Task List

  • mail.R/download_pipermail: Create this function.
        - Use SSL Archive (just one of the lists, do not need all) for pipermail mailing list. Should be able to convert by changing .txt to .mbox extension. Fix so it does this.
        - Look at kaiaulu mailing list downloaders for url examples (these do not refresh).
        - Use kaiaulu Jira downloader for example, except uses URL (not API).
  • mail.R/refresh_pipermail: Edit.
        - Conform to refresher cheatsheet.
  • mail.R/convert_pipermail_to_mbox: Remove this function.
  • mail.R/download_mod_mbox_per_month: Remove this function.
  • mail.R/download_mod_mbox: Edit.
  • mail.R/refresh_mod_mbox: Edit.
  • mail.R/parse_mbox(perceval_path, mbox_path): Decide what to do with this.
  • Create notebook explaining how to use functions.

Functions

Pipermail Downloader

download_pipermail(archive_url, mailing_list, start_year_month, end_year_month, archive_type, save_folder_path):

  • Gets the year_month of all mail from the table found in archive_url. Example archive URL.
  • If year_month is within start_year_month and end_year_month parameters, download the file from URL into save_folder_path. Save file as 'kaiaulu_year_month.mbox'.

Pipermail Refresher

refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path):

  • If save_folder_path is empty, download all links in mailing_list.
  • Else, find the most recent year_month from the files in save_folder_path, using the filenames. Delete this file and set as most_recent_year_month.
  • Call download_pipermail(start_year_month = most_recent_year_month, end_year_month = current_year_month) to download months starting with most recent.

Mbox Downloader

download_mod_mbox(base_url, mailing_list, start_year_month, end_year_month, save_folder_path):

  • Downloads mod mbox within specified time range from mailing_list. Saves files as 'kaiaulu_datetime.mbox'. Example base_url.

Mbox Refresher

refresh_mod_mbox(archive_url, mailing_list, archive_type, start_year_month, save_folder_path):

  • If save_folder_path is empty, download all links in mailing_list.
  • If start_year_month = 'most_recent', find the most recent datetime from the filenames in save_folder_path. Delete the most recent one.
  • Get the datetimes from the mailing_list and download from the deleted one onwards.

Parser

parse_mbox_latest_date(mbox_path):

Libraries

  • httr
  • stringi

@daomcgill
Collaborator

daomcgill commented Sep 12, 2024

Question

I tried using the mail.R/download_mod_mbox_per_month function. When the from_year parameter for download_mod_mbox_per_month is set to 201801 and to_year is current_year (to_year is set within the function, not a user parameter), it starts downloading from 201801 and works backwards. Is this expected behavior? My assumption was that it would download files starting from 201801 and move forwards towards more recent years, ending in the current year. The resulting saved mbox file has a size of 0 bytes.
Here is what I did:

conf <- yaml::read_yaml("conf/helix.yml")
save_path_mbox <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mbox"]]
mod_mbox_url <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_url"]]
mailing_list <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mailing_list"]]
archive_url <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_url"]]
archive_type <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_type"]]
from_year <- 201801
save_folder_path <- "save_folder_mail"
refresh_mod_mbox(
    archive_url = archive_url,
    mailing_list = mailing_list,
    archive_type = archive_type,
    from_year = from_year,
    save_folder_path = save_folder_path,
    verbose = TRUE
)

Here it is still downloading, now having reached 2002:
[Image: Screenshot 2024-09-12 at 12 45 01 PM]

@daomcgill
Collaborator

I found this working link of the openssl-project Archives.

@carlosparadis
Member

@daomcgill

You can use that or any of the ones here: https://mta.openssl.org/mailman/listinfo/

The behavior of going backwards is not intended. Neither is making 4 calls to download the same file:

kaiaulu/R/mail.R

Lines 544 to 577 in d2ce222

download_txt_files_latest_downloaded_year(archive_url=archive_url,
mailing_list=mailing_list,
archive_type=archive_type,
latest_downloaded_year=latest_downloaded_year,
latest_downloaded_month=latest_downloaded_month,
current_year = current_year,
current_month = current_month,
save_folder_path=save_folder_path)
download_txt_gz_files_latest_downloaded_year(archive_url=archive_url,
mailing_list=mailing_list,
archive_type=archive_type,
latest_downloaded_year=latest_downloaded_year,
latest_downloaded_month=latest_downloaded_month,
current_year = current_year,
current_month = current_month,
save_folder_path=save_folder_path)
download_txt_files_current_year(archive_url=archive_url,
mailing_list=mailing_list,
archive_type=archive_type,
latest_downloaded_year=latest_downloaded_year,
current_year=current_year,
current_month = current_month,
save_folder_path=save_folder_path)
download_txt_gz_files_current_year(archive_url=archive_url,
mailing_list=mailing_list,
archive_type=archive_type,
latest_downloaded_year=latest_downloaded_year,
current_year = current_year,
current_month = current_month,
save_folder_path=save_folder_path)
}

I would also like to make sure your specification reflects the "refresher" concept, which this mail function has to abide by. To implement this, you will want to look at:

this comment #284 (comment) section:

Pipermail: Manually Prompting Pipermail Refresher to Start After a Certain Year and Month

Looking at the refresher cheatsheet is likely needed to understand the concept that applies to Kaiaulu downloaders, which this one should also implement: https://github.com/sailuh/kaiaulu_cheatsheet/blob/main/cheatsheets/refresher-cheatsheet.pdf

@carlosparadis
Member

@daomcgill, on the closing week of this PR, I went over what had to be fixed for this to be merged. The summary of that can be found in this comment at a logic flow level:

#284 (comment)

@carlosparadis
Member

@daomcgill

If you get a chance, would you mind checking my updated specifications? I want to make sure this part is right before I do anything else.

Ian's specification should still be the target interface we want (copying and pasting from the first message in this issue the part that is relevant to you):


Refresher (Endpoint)

I'll be using year for the end point. For the refresher function, I'll make the upper bound endpoint the current year, getting it by some built-in function that returns the current year.

Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_mod_mbox_per_month, downloading mbox files starting from the from_year parameter, to the current real-life year.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year

Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path,verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_pipermail, which downloads all mail from a selected mailing list in a selected archive.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year. Very similar to refresh_mod_mbox; the only reason why I had to make a separate function for pipermail is that the pipermail downloader works differently from the mod mbox downloader.

New Parser: parse_mbox_latest_date(mbox_path)

  • Finds the latest downloaded mail file in a selected mbox save folder. It does so based on the yearmonth part of the file name. This function returns the name of the latest downloaded mail file for use by the refresher function.
  • Can be used for both mail downloaded via download_pipermail and download_mod_mbox_per_month since both downloader functions download the mail files as mbox files (as in, both downloaders save mail with ".mbox" as the extension)

Note that the set of functions defined above offers you the logic to implement "refresh". You need a file name convention (as shown on the cheatsheet), and a function that assumes said convention to find out the latest year and month on the system (this of course assumes the user did not introduce gaps manually).

The set of 3 functions above rely on the implementation of download_mod_mbox() and download_pipermail(). Ian did not specify that in his specification, but you should (I'd also appreciate it if you format this so the header is not as big as his; this is hard to read).

Maybe you can reuse these 3 functions from him; you will need to check. download_mod_mbox() should be able to take a start_year_month and end_year_month parameter, and so should download_pipermail(). As we discussed, the download_pipermail() logic needs a redo. I believe the download_mod_mbox() outside this PR needs to incorporate the month, and the ability to select a time range so it does not download the entire year.

Try taking another pass on the specification with this, and then post a comment here pinging me (it is easier for me than e-mail). We want this exchange documented here too so it is easy to find and reference in the future.

At some point in Spring there was a joint effort to put all the signatures together: #292. However, the issue specification I am pasting in this comment is the most current one.

@carlosparadis
Member

Ian's format of specification is also generally what you want: The function signature and a few bullets giving me some idea of your logic under said function. Try to do that for the download_pipermail() and download_mod_mbox(), and also add the parameters to the signature.

@daomcgill
Collaborator

@carlosparadis could you please review updated specifications.

@carlosparadis
Member

convert_pipermail_to_mbox(filelist):

I don't believe you need this function. Just try to save the files as .mbox instead of .txt when naming them and see if parse_mbox() recognizes it.

If save_folder_path is empty, throw an error to first call download_pipermail with a specified start_year_month and end_year_month.

I am not sure you should throw any errors. If the folder is empty, it means you need to start from scratch. In pipermail you can just use the file you download with all links to infer the start date. You may need to give some thought on what your options are on mod mbox.

You need this function:

parse_mbox_latest_date(mbox_path)

The refresh should erase the most recent file and re-download it, because the mbox files are available monthly. This means the current month is always incomplete and needs to be re-downloaded.

In your notes, it mentions a script to keep running: Don't worry about this. This is done via a cron job, which lies outside R. You want a function I can point to a folder that will run on an empty folder, and if I delete one of the recent files, it will just download new files. In short, the function works for the empty case, and for the case where files are in there.

After these corrections, I think it should be fine to start coding. Just make sure the logic and purpose of every function is clear. Thanks!

daomcgill added a commit that referenced this issue Sep 15, 2024
- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_'YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis
Member

@daomcgill

Thank you for the update! I believe you are using the specification from Anthony:

#286 (comment)

mailing_list:
  mod_mbox: 
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/

  1. Is that correct? If so this is fine, except we should change from mail_key_1 to just project_key_1 etc. for consistency with the other downloaders (you will want to defer this to the pair working on the project config, but just to make sure we are all on the same page). Pipermail should also have multiple project keys, similar to mod_mbox.

  2. You will also want to update this Notebook: http://itm0.shidler.hawaii.edu/kaiaulu/articles/download_mod_mbox.html and create sections that explain how to use your downloaders and how mailing lists are organized (remember the openssl example I gave you pointing to their page, where there are multiple mailing lists, and then multiple archives?). We should add that to the text so it is self-contained. Even if it does not relate to using the function per se, it helps a newcomer understand what the data is before downloading. You should also introduce the idea of refresh to the users.

Sean did a great job on this one: https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_jira_issues.Rmd (note this is out of sync with the docs, so you will need to read the local text).

  3. Another thing I am noticing is that the error on GitHub Actions seems to be caused by the XML package not being found. Could you press the "Check" button in RStudio to run the checks locally, and see if they pass there?

  4. Make the function calls explicit for external libraries (httr:: or XML::); the data.table and stringi packages are fine as they are.

I may notice others as you proceed, but this is an iterative process hereafter. Thank you for getting things going.

@carlosparadis
Member

To be clear on item 2: You will want to name that Notebook file download_mail.Rmd, and create separate sections for mod_mbox and pipermail showcasing their use. This, in turn, will make it easier for you to test that everything is working as intended.

I forgot to mention an item 5): We do not need to download both zip and .txt. If given the option, (which you can infer from the first file downloaded with everything), you can just download the .txt. If only the zip is available, then you download the zip. But there is no reason to download both. Therefore, the total number of requests per file from the website should just be 1, after you download the file containing everything.
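
As a sketch of item 5 only, assuming the links from the index page end in ".txt" or ".txt.gz". The helper name pick_one_link_per_month and the sample links are hypothetical.

```r
# Hypothetical helper: keep one link per month, choosing the plain .txt link
# when both .txt and .txt.gz exist for the same month.
pick_one_link_per_month <- function(links) {
  # Strip the extension to get the month key, e.g. "2018-April".
  month_key <- sub("\\.txt(\\.gz)?$", "", links)
  is_gz <- grepl("\\.gz$", links)
  # Months for which an uncompressed file is offered.
  txt_keys <- month_key[!is_gz]
  # Keep every .txt link, and a .gz link only when its month has no .txt.
  keep <- !is_gz | !(month_key %in% txt_keys)
  links[keep]
}

links <- c("2018-April.txt", "2018-April.txt.gz", "2018-May.txt.gz")
chosen <- pick_one_link_per_month(links)
```

The selection happens before any request is made, so each month costs a single download and the original ordering of the links is preserved.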

In your function and in your notebook you should also mention that for pipermail, users should expect an extra file to be downloaded. We need to think where to put it, or at least update parse_mbox to ensure it does not try to read it in thinking it is an .mbox file.

Did you already fix the logic that downloaded files in reverse chronological order, and the split into one pass for the current year and one for the years prior?

daomcgill added a commit that referenced this issue Sep 17, 2024
…mail()

- Modified helix.yml to use [["mailing_list"]][["pipermail"]][["project_key_1"]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
@daomcgill
Collaborator

@carlosparadis Updated the function according to your comments. It currently downloads just the gz and then unzips it locally, before deleting the compressed one. Does that work? The files download in the correct order.
I am having issues with Build -> Check, as can be seen in my most recent commit message.

daomcgill added a commit that referenced this issue Sep 17, 2024
…process_gz_to_mbox_in_folder()

- download_pipermail: Attempts to download .txt file first. If unavailable fallback to .gz. If using .gz file, unzips and writes output in .mbox
- Added log messages
- download_pipermail: Added timeout parameter to deal with case that server takes too long to respond
- Added refresh_pipermail function
- Updated vignettes/download_mail.Rmd to include refresh_pipermail
- Added process_gz_to_mbox_in_folder function
@daomcgill
Collaborator

@carlosparadis Made changes according to your in-line comment. Please let me know if this seems sound to you. Here are my notes for the proposed changes:

Edited download_pipermail(mailing_list, start_year_month, end_year_month, save_folder_path)

  1. Create Directory: The function first ensures that the save_folder_path directory exists. If it doesn't, create the directory.
  2. Ensure Correct Mailing List URL: The mailing_list URL is verified to end with a /, which is important when constructing the links for downloading files.
  3. Download and Parse the Mailing List: The function sends a GET request to the mailing list’s URL to retrieve content. The content is parsed to extract the rows of data from the table that contains the file links.
  4. Extract Date and Links from Rows: The function loops through the table rows (skipping the header) to extract the dates and links from each row. It converts the date to YYYYMM format and checks if the date falls within the specified start_year_month and end_year_month. If a link exists for that date, it is stored for later download.
  5. File Download Process: The function tries to download the .txt version of the file first. If the .txt file is unavailable, it attempts to download the .gz version of the file. If both attempts fail, the function skips the link and logs a message.
  6. Handling .gz Files: If the .gz file is downloaded, the function unzips it and writes its contents to an .mbox file. After unzipping, the .gz file is deleted to avoid having multiple versions of the same data.
  7. File Writing: If the .txt file is available, it is downloaded directly and saved with a .mbox extension (skips step 6). The final list of downloaded .mbox files is returned.
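
Step 6 could look roughly like this in base R. It is a sketch, not the code in the commit: gz_to_mbox is a hypothetical name, the demo message is made up, and reading in text mode assumes the archive is plain text.

```r
# Hypothetical helper: decompress a downloaded .gz archive into an .mbox file
# and remove the compressed copy afterwards.
gz_to_mbox <- function(gz_path, mbox_path) {
  con <- gzfile(gz_path, open = "rt")
  # Close the connection even if reading fails.
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)
  writeLines(lines, mbox_path)
  # Delete the .gz so only one version of the data remains.
  file.remove(gz_path)
  mbox_path
}

# Demo: build a small compressed file locally, then convert it.
gz_path <- tempfile(fileext = ".txt.gz")
out <- gzfile(gz_path, open = "wt")
writeLines(c("From sender at example.org Mon Apr  2 10:00:00 2018",
             "Subject: hello"), out)
close(out)
mbox_path <- gz_to_mbox(gz_path, sub("\\.txt\\.gz$", ".mbox", gz_path))
```

Writing the .mbox before removing the .gz means a failed read leaves the compressed file in place.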

Added refresh_pipermail(mailing_list, start_year_month, save_folder_path)

  1. Create Directory: The function first checks whether the directory save_folder_path exists. If not, it creates the directory recursively.
  2. Check if Folder is Empty: If the folder is empty, it calls the download_pipermail function from start_year_month to the current month (end_year_month), which is found using Sys.Date().
  3. Find the Most Recent Month: If the folder is not empty, the function looks for files in the folder matching the pattern kaiaulu_YYYYMM.mbox. It extracts the YYYYMM parts from the filenames and finds the most recent month using max().
  4. Delete the Most Recent File: The function deletes the most recent file (assuming it's the last one downloaded). This is because we want to redownload that month to ensure it's up to date.
  5. Redownload the Most Recent to Current Month: After deleting the most recent file, the function calls download_pipermail again, starting from the most recent month up to the current month.
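
The five steps above can be sketched as a small control flow. To keep it runnable without network access, the downloader is passed in as an argument; refresh_sketch and report_range are hypothetical stand-ins, not the functions in mail.R.

```r
# Sketch of the refresh decision. `download_fn` stands in for download_pipermail
# and is called as download_fn(start_year_month, end_year_month, save_folder_path).
refresh_sketch <- function(save_folder_path, start_year_month, download_fn) {
  dir.create(save_folder_path, recursive = TRUE, showWarnings = FALSE)
  current_year_month <- format(Sys.Date(), "%Y%m")
  files <- list.files(save_folder_path, pattern = "^kaiaulu_[0-9]{6}\\.mbox$")

  if (length(files) == 0) {
    # Empty folder: start from the caller-supplied month.
    return(download_fn(start_year_month, current_year_month, save_folder_path))
  }

  # Non-empty folder: the newest month may be incomplete, so delete and refetch it.
  year_months <- sub("^kaiaulu_([0-9]{6})\\.mbox$", "\\1", files)
  most_recent <- year_months[which.max(as.integer(year_months))]
  file.remove(file.path(save_folder_path, paste0("kaiaulu_", most_recent, ".mbox")))
  download_fn(most_recent, current_year_month, save_folder_path)
}

# Stub downloader that only reports the range it was asked for.
report_range <- function(start, end, folder) c(start = start, end = end)

refresh_dir <- file.path(tempdir(), "refresh_demo")
dir.create(refresh_dir, showWarnings = FALSE)
file.create(file.path(refresh_dir, c("kaiaulu_201803.mbox", "kaiaulu_201804.mbox")))
result <- refresh_sketch(refresh_dir, "201801", report_range)
```

With two files on disk, the stub is asked to start at 201804 rather than at the 201801 passed by the caller, and only the 201804 file is deleted.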

Added process_gz_to_mbox_in_folder(folder_path)

As per your request, I added a process_gz_to_mbox_in_folder(folder_path) function. My understanding was that you want to be able to receive a folder that may contain .gz or .mbox files. Any .gz files are then unzipped and renamed to .mbox. If any .mbox file with that name already exists, it will be overwritten. Question: is this necessary? Assuming the user already has this folder containing both types of files, I could see why this would be useful. If, however, they are using the download_pipermail function, this function should never be necessary as .gz files are already processed during the download.

Note: I have not yet started working on the parser or mod mbox functions, so those are yet unchanged.
Next step: Start on the parsers?

@carlosparadis
Member

@daomcgill I have a request: the level of detail in your post is great, and it would be even better if you moved it, exactly as you stated it, into the code right around where it is implemented. It may appear excessive to have this in code, but since Kaiaulu benefits greatly from ICS 496 students, I am perfectly fine with being excessive in explaining the code (R does not come naturally to everyone).

I would also like to save you time on having to post them here for me, so I can review directly in code as you document it.

I am not sure if there is anything you need to do for the parser. At the end of the day, both pipermail and mod_mbox will give you a folder of .mbox files, and parse_mbox() wants to see that. Is there a reason you wanted to edit it? Or was that to ensure it only reads *.mbox files?

I would say try to run parse_mbox() and then proceed to download_mod_mbox() changes.

p.s.: Let's agree to continue the specification discussion here, since part of it is on the PR and part is now on the issue. For more specific inline code comments, we can use the PR, since GitHub will auto-post them there.

p.s.2: If you ever feel you are spending too much time in anything going in circles because it is not clear in text, we can set an additional call to go over it too.

Thank you for your hard work, I am really impressed!

@carlosparadis carlosparadis assigned daomcgill and unassigned daomcgill and beydlern Nov 11, 2024
@carlosparadis carlosparadis added this to the ics496-fall24-m1 milestone Nov 11, 2024
@carlosparadis carlosparadis changed the title Refresher Capability for MBOX Downloader (Milestone 2) Refresher Capability for MBOX Downloader Nov 11, 2024
carlosparadis added a commit that referenced this issue Nov 12, 2024
Moves some of the narrative to functions. Function
documentation can be accessed on demand instead of
placed on Notebook. Some of the docs was moved to
inline on the function of pipermail.

Signed-off-by: Carlos Paradis <[email protected]>
carlosparadis added a commit that referenced this issue Nov 12, 2024
carlosparadis added a commit that referenced this issue Nov 12, 2024
The rest of kaiaulu does not use this.

Signed-off-by: Carlos Paradis <[email protected]>
carlosparadis added a commit that referenced this issue Nov 12, 2024
added url tag, etc.

Signed-off-by: Carlos Paradis <[email protected]>
daomcgill added a commit that referenced this issue Nov 12, 2024
…ration

- Use refresh in exec
- Use getters in exec
- Change kaiaulu version in exec
- Fix paths in helix.yml
- Remove unused parameters
- Change cat to message

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Nov 13, 2024
- Fixed uri parameter in parse_mbox for perceval
- Small change in example

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Nov 13, 2024
Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
This reverts commit 56dff9c.
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Nov 13, 2024
daomcgill added a commit that referenced this issue Dec 1, 2024
Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator

@carlosparadis mail notebook is updated.

daomcgill added a commit that referenced this issue Dec 6, 2024
daomcgill added a commit that referenced this issue Dec 6, 2024
daomcgill added a commit that referenced this issue Dec 8, 2024
- Make start_year_month optional
- Determine start_year_month from existing files if they exist
- Return error if no existing files, and no date specified

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 8, 2024
daomcgill added a commit that referenced this issue Dec 9, 2024
- Takes file path for mbox file to parse
- No longer need to pass project_conf

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Dec 9, 2024
daomcgill added a commit that referenced this issue Dec 9, 2024
Signed-off-by: Dao McGill <[email protected]>