Refresher Capability for MBOX Downloader #284
@ian-lastname Please add here the notes requested during the last meeting Friday:
There was another item, what was it? |
I remember the other item; it was to link to the part of the code in the pipermail refresher that was supposed to print a warning message when a file-not-found error occurs at a certain URL. Turns out, I had removed the code that actually printed the warning message when the error is encountered.
@ian-lastname If the code already exists, could you make a commit to just place it back? I have not started reviewing your code yet.
The pipermail mbox refresher has a main IF and ELSE. When the IF branch is taken, the entire code logic defaults to download_pipermail. download_pipermail downloads the main page of the mailing list archive (e.g. https://mta.openssl/pipermail/openssl-users/). This page contains the list of all mbox URLs, as either .txt or .gz. Both are mbox in disguise; we only need to rename the file extensions. download_pipermail gets the URLs, downloads the appropriate files, and renames them. download_pipermail relies on this page to know whether .gz or .txt will be available, and for which dates. Without it, it is impossible to know which will be the case.
The ELSE portion of the pipermail refresher does not rely on this page. Therefore, it does not know the year to end on, other than system time, and also does not know whether .txt, .gz, or both are available. In addition, the code logic for the current year and for prior years was split into two functions. Combined with the .txt and .gz functions, this results in 4 functions being fired for every year/month, all the way to the current year/month from system time. This saves a number of empty files, which are subsequently deleted as the downloads proceed up to the current year.
The rework of the ELSE branch should rely on the download_pipermail function: re-obtain the list of all files, use the year_month of the last downloaded file, and then download only the .txt or .gz files according to the URLs extracted from that list. This will reduce the number of function calls to only 1 per year/month, and also prevent firing for years and months that are not available (perhaps because the archive stopped storing data well before the current date).
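The reworked ELSE branch described above could be sketched roughly as follows. This is only an illustration, not the Kaiaulu implementation: the function name `pipermail_links_to_months` is hypothetical, and it assumes the index page links are named like `2018-April.txt.gz` (year, English month name, then `.txt` or `.txt.gz`). It turns the scraped links into YYYYMM keys and keeps only the months at or after the last file already on disk, so no request is made for months the archive does not list.

```r
# Hypothetical helper: `links` stands in for the hrefs scraped from the
# pipermail index page. The "YYYY-Month.txt[.gz]" naming is an assumption.
pipermail_links_to_months <- function(links, from_year_month = NULL) {
  file_names <- basename(links)
  pattern <- "^([0-9]{4})-([A-Za-z]+)\\.txt(\\.gz)?$"

  # Drop anything that is not a monthly archive file (e.g. index.html)
  keep <- grepl(pattern, file_names)
  links <- links[keep]
  file_names <- file_names[keep]

  years <- as.integer(sub(pattern, "\\1", file_names))
  months <- match(sub(pattern, "\\2", file_names), month.name)

  out <- data.frame(
    url = links,
    year_month = sprintf("%04d%02d", years, months),
    stringsAsFactors = FALSE
  )
  # Discard rows whose month name was not recognized
  out <- out[!is.na(months), , drop = FALSE]

  # Keep only months at or after the last downloaded one
  if (!is.null(from_year_month)) {
    out <- out[out$year_month >= from_year_month, , drop = FALSE]
  }
  out[order(out$year_month), , drop = FALSE]
}

links <- c(
  "https://example.org/pipermail/some-list/2018-April.txt.gz",
  "https://example.org/pipermail/some-list/2018-March.txt.gz",
  "https://example.org/pipermail/some-list/2019-January.txt.gz"
)
pipermail_links_to_months(links, from_year_month = "201804")
```

Each remaining row corresponds to exactly one download, which is the "1 call per year/month" behavior the rework is aiming for.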
download_mod_mbox was not tested on a project whose data does not extend to the current date, as most Apache projects had data up to the current date. I suspect there will be a problem where empty files will be saved (edit this comment later to refer to the issue lihan posted about that, or I did).
Purpose
Rework mbox and pipermail download functions. Add refresh capability for both.
Process
Start by working on pipermail download and refresh functions. Update config files and relevant notebook. Move on to mbox download and refresh.
Task List
Functions
Pipermail Downloader
download_pipermail(archive_url, mailing_list, start_year_month, end_year_month, archive_type, save_folder_path):
Pipermail Refresher
refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path):
Mbox Downloader
download_mod_mbox(base_url, mailing_list, start_year_month, end_year_month, save_folder_path):
Mbox Refresher
refresh_mod_mbox(archive_url, mailing_list, archive_type, start_year_month, save_folder_path):
Parser
parse_mbox_latest_date(mbox_path):
Libraries
Question
I tried using the mail.R/download_mod_mbox_per_month function. When the from_year parameter for download_mod_mbox_per_month is set to 201801 and to_year is current_year (to_year is set within the function, not a user parameter), it starts downloading from 201801 and works backwards. Is this expected behavior? My assumption was that it would download files starting from 201801 and move forwards towards more recent years, ending in the current year. The resulting saved mbox file has a size of 0 bytes.
conf <- yaml::read_yaml("conf/helix.yml")
save_path_mbox <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mbox"]]
mod_mbox_url <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_url"]]
mailing_list <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mailing_list"]]
archive_url <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_url"]]
archive_type <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_type"]]
from_year <- 201801
save_folder_path <- "save_folder_mail"
refresh_mod_mbox(
  archive_url = archive_url,
  mailing_list = mailing_list,
  archive_type = archive_type,
  from_year = from_year,
  save_folder_path = save_folder_path,
  verbose = TRUE
)
I found this working link of the openssl-project Archives. |
You can use that or any of the ones here: https://mta.openssl.org/mailman/listinfo/
The behavior of going backwards is not intended. Neither is making 4 calls to download the same file: Lines 544 to 577 in d2ce222
I would also like to make sure your specification reflects the "refresher" concept, which this mail function has to abide by. To implement this, you will want to look at the following section of this comment, #284 (comment):
Looking at the refresher cheatsheet is likely needed to understand the concept that applies to Kaiaulu downloaders, which this one should also implement: https://github.com/sailuh/kaiaulu_cheatsheet/blob/main/cheatsheets/refresher-cheatsheet.pdf
@daomcgill, on the closing week of this PR, I went over what had to be fixed for this to be merged. The summary of that can be found in this comment at a logic flow level: |
Ian's specification should still be the target interface we want (copying and pasting from the first message in this issue the part that is relevant to you):
Refresher (Endpoint)
I'll be using year for the end point. For the refresher function, I'll make the upper bound endpoint the current year, getting it by some built-in function that returns the current year.
Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)
Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path,verbose=FALSE)
New Parser: parse_mbox_latest_date(mbox_path)
Note that the set of functions defined above offers you the logic to implement "refresh". You need a file name convention (as shown on the cheatsheet), and a function that assumes said convention to find out the latest year and month on the system (that of course assumes the user did not introduce gaps manually).
The set of 3 functions above relies on the implementation of download_mod_mbox() and download_pipermail(). Ian did not specify those in his specification, but you should (I'd also appreciate it if you format this so the header is not as big as his; this is hard to read). Maybe you can reuse these 3 functions from him; you will need to check.
download_mod_mbox() should be able to take start_year_month and end_year_month parameters, and so should download_pipermail(). As we discussed, the download_pipermail() logic needs a re-do. I believe the download_mod_mbox() outside this PR needs to incorporate the month, and the ability to select a time range so it does not download the entire year.
Try taking another pass on the specification with this, and then post a comment here pinging me (it is easier for me than e-mail). We want this exchange documented here too so it is easy to find and reference in the future. At some point in Spring there was a joint effort to put all the signatures together (#292); however, the specification I am pasting in this comment is the most current one.
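As a rough illustration of the "find the latest year and month on the system" function mentioned above, here is a minimal sketch. The helper name `latest_year_month_in_folder` is hypothetical, and it assumes files end in `YYYYMM.mbox` (for example `kaiaulu_201804.mbox`); the actual convention should follow the refresher cheatsheet.

```r
# Hypothetical helper: returns the most recent YYYYMM found among the
# *.mbox files of a folder, or NA if the folder has none.
# Assumes file names end in "<YYYYMM>.mbox".
latest_year_month_in_folder <- function(save_folder_path) {
  file_names <- list.files(save_folder_path, pattern = "[0-9]{6}\\.mbox$")
  if (length(file_names) == 0) {
    # Empty folder: the caller has to start from scratch
    return(NA_character_)
  }
  # Extract the six digits right before the extension
  year_months <- sub("^.*([0-9]{6})\\.mbox$", "\\1", file_names)
  # YYYYMM strings sort chronologically, so max() is the latest month
  max(year_months)
}
```

Because the result is derived only from file names, it inherits the caveat stated above: gaps introduced manually by the user are not detected.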
Ian's format of specification is also generally what you want: The function signature and a few bullets giving me some idea of your logic under said function. Try to do that for the download_pipermail() and download_mod_mbox(), and also add the parameters to the signature. |
@carlosparadis could you please review updated specifications. |
I don't believe you need this function. Just try to save the files as .mbox instead of .txt when naming them and see if parse_mbox() recognizes them.
I am not sure you should throw any errors. If the folder is empty, it means you need to start from scratch. In pipermail you can just use the file you download with all links to infer the start date. You may need to give some thought to what your options are on mod mbox. You need this function:
The refresh should erase the most recent file and re-download it, because the mbox files are available monthly. This means the current month is always incomplete and needs to be re-downloaded. Your notes mention a script to keep running: don't worry about this. This is done via a cron job, which lies outside R. You want a function I can point to a folder that will run on an empty folder, and if I delete one of the recent files, it will just download new files. In short, the function works for the empty case, and for the case where files are already there. After these corrections, I think it should be fine to start coding. Just make sure the logic and purpose of every function is clear. Thanks!
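The refresh rule above (start from scratch on an empty folder, otherwise redo the latest month and continue forward) can be sketched as a small planning step. Both function names and the `default_start_year_month` parameter are hypothetical, introduced only for this illustration; the actual download calls are left out.

```r
# Hypothetical helper: every YYYYMM from start to end, inclusive.
year_month_sequence <- function(start_year_month,
                                end_year_month = format(Sys.Date(), "%Y%m")) {
  start_date <- as.Date(paste0(start_year_month, "01"), format = "%Y%m%d")
  end_date <- as.Date(paste0(end_year_month, "01"), format = "%Y%m%d")
  if (start_date > end_date) {
    return(character(0))
  }
  format(seq(start_date, end_date, by = "month"), "%Y%m")
}

# Hypothetical helper: which months a refresh should (re-)download.
# `latest_year_month` is NA when the save folder is empty.
months_to_refresh <- function(latest_year_month,
                              default_start_year_month,
                              end_year_month = format(Sys.Date(), "%Y%m")) {
  start <- if (is.na(latest_year_month)) {
    default_start_year_month  # empty folder: start from scratch
  } else {
    latest_year_month         # latest file may be incomplete: redo it
  }
  year_month_sequence(start, end_year_month)
}

months_to_refresh("201811", "201801", end_year_month = "201902")
```

The latest month on disk is deliberately included in the plan, since a monthly file downloaded mid-month is incomplete and has to be overwritten.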
- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.
Signed-off-by: Dao McGill <[email protected]>
Thank you for the update! I believe you are using the specification from Anthony:
mailing_list:
  mod_mbox:
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
Sean did a great job on this one: https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_jira_issues.Rmd (note this is out of sync with the docs, so you will need to read the local text).
I may notice others as you proceed, but this is an iterative process hereafter. Thank you for getting things going.
To be clear on item 2: You will want to name that Notebook file download_mail.Rmd, and create separate sections for mod_mbox and pipermail showcasing their use. This, in turn, will make it easier for you to test that everything is working as intended.
I forgot to mention an item 5): We do not need to download both the zip and the .txt. If given the option (which you can infer from the first file downloaded with everything), you can just download the .txt. If only the zip is available, then you download the zip. But there is no reason to download both. Therefore, the total number of requests per file from the website should be just 1, after you download the file containing everything.
In your function and in your notebook you should also mention that, for pipermail, users should expect an extra file to be downloaded. We need to think about where to put it, or at least update parse_mbox to ensure it does not try to read it in thinking it is an .mbox file.
Did you already fix the logic of downloading files in reverse chronological order, and of downloading once for the current year and once for the years prior?
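Item 5 above amounts to a de-duplication step over the list of links. A minimal sketch, assuming the file names differ only in their `.txt` versus `.txt.gz` suffix (the helper name `pick_one_file_per_month` is made up for this example):

```r
# Hypothetical helper: given archive file names, keep one per month,
# preferring the plain .txt over the compressed .txt.gz when both exist.
pick_one_file_per_month <- function(file_names) {
  # The "stem" identifies the month, e.g. "2018-April"
  stems <- sub("\\.txt(\\.gz)?$", "", file_names)
  is_plain_txt <- grepl("\\.txt$", file_names)

  # Sort by stem, with the plain .txt first within each stem
  ordered <- order(stems, !is_plain_txt)
  file_names <- file_names[ordered]
  stems <- stems[ordered]

  # The first entry per stem is the preferred one
  file_names[!duplicated(stems)]
}

pick_one_file_per_month(
  c("2018-April.txt.gz", "2018-April.txt", "2018-March.txt.gz")
)
```

With this filter applied before downloading, each month results in a single request regardless of how many formats the archive offers.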
…mail()
- Modified helix.yml to use [["mailing_list"]][["pipermail"]][["project_key_1"]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
@carlosparadis Updated the function according to your comments. It currently downloads just the .gz and then unzips it locally, before deleting the compressed one. Does that work? The files download in the correct order.
…process_gz_to_mbox_in_folder()
- download_pipermail: Attempts to download .txt file first. If unavailable, falls back to .gz. If using .gz file, unzips and writes output in .mbox
- Added log messages
- download_pipermail: Added timeout parameter to deal with the case where the server takes too long to respond
- Added refresh_pipermail function
- Updated vignettes/download_mail.Rmd to include refresh_pipermail
- Added process_gz_to_mbox_in_folder function
@carlosparadis Made changes according to your in-line comment. Please let me know if this seems sound to you. Here are my notes for the proposed changes:
Edited download_pipermail(mailing_list, start_year_month, end_year_month, save_folder_path)
Added refresh_pipermail(mailing_list, start_year_month, save_folder_path)
Added process_gz_to_mbox_in_folder(folder_path)
As per your request, I added a process_gz_to_mbox_in_folder(folder_path) function. My understanding was that you want to be able to receive a folder that may contain .gz or .mbox files. Any .gz files are then unzipped and renamed to .mbox. If an .mbox file with that name already exists, it will be overwritten.
Question: is this necessary? Assuming the user already has this folder containing both types of files, I could see why this would be useful. If, however, they are using the download_pipermail function, this function should never be necessary, as .gz files are already processed during the download.
Note: I have not yet started working on the parser or mod mbox functions, so those are as yet unchanged.
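For readers unfamiliar with how such a folder pass might look, here is a minimal base-R sketch of the behavior described above. It is not the code from the PR: the name `gz_to_mbox_sketch` is hypothetical, and it reads each archive in text mode with `readLines()`, which is an assumption that may not suit mail with unusual encodings.

```r
# Hypothetical sketch: unzip every .gz file in a folder into a .mbox
# file with the same stem, overwriting any existing .mbox, then delete
# the compressed file.
gz_to_mbox_sketch <- function(folder_path) {
  gz_paths <- list.files(folder_path, pattern = "\\.gz$", full.names = TRUE)
  mbox_paths <- sub("(\\.txt)?\\.gz$", ".mbox", gz_paths)

  for (i in seq_along(gz_paths)) {
    # gzfile() decompresses transparently while reading
    con <- gzfile(gz_paths[i], open = "rt")
    lines <- readLines(con, warn = FALSE)
    close(con)

    # writeLines() overwrites an existing file of the same name
    writeLines(lines, mbox_paths[i])
    file.remove(gz_paths[i])
  }
  invisible(mbox_paths)
}
```

Files already ending in .mbox are left untouched, so the function is safe to run on a folder that mixes both kinds.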
@daomcgill I have a request: the level of detail in your post is great, and it would be even better if you moved it, exactly as you stated it, into the code right around where it is implemented. It may appear excessive to have it in code, but since Kaiaulu is code that benefits greatly from ICS 496 students, I am perfectly fine with us being excessive in explaining the code (R does not come naturally to everyone). I would also like to save you the time of having to post them here for me, so I can review directly in code as you document it.
I am not sure if there is anything you need to do for the parser. At the end of the day both pipermail and mod_mbox will give you a folder of .mbox files. parse_mbox() wants to see that. Is there a reason you wanted to edit it? Or was that to ensure it only reads *.mbox files? I would say try to run parse_mbox() and then proceed to the download_mod_mbox() changes.
p.s.: Let's agree to continue the specification discussion here, since a part is on the PR and a part is now on the issue. For more specific in-line code comments, we can use the PR since GitHub will auto post them there.
p.s.2: If you ever feel you are spending too much time going in circles on anything because it is not clear in text, we can set up an additional call to go over it too. Thank you for your hard work, I am really impressed!
Moves some of the narrative to functions. Function documentation can be accessed on demand instead of placed on Notebook. Some of the docs was moved to inline on the function of pipermail. Signed-off-by: Carlos Paradis <[email protected]>
Signed-off-by: Carlos Paradis <[email protected]>
The rest of kaiaulu does not use this. Signed-off-by: Carlos Paradis <[email protected]>
added url tag, etc. Signed-off-by: Carlos Paradis <[email protected]>
…ration
- Use refresh in exec
- Use getters in exec
- Change kaiaulu version in exec
- Fix paths in helix.yml
- Remove unused parameters
- Change cat to message
Signed-off-by: Dao McGill <[email protected]>
- Fixed uri parameter in parse_mbox for perceval
- Small change in example
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
@carlosparadis mail notebook is updated. |
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
- Make start_year_month optional
- Determine start_year_month from existing files if they exist
- Return error if no existing files, and no date specified
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
- Takes file path for mbox file to parse
- No longer need to pass project_conf
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
Signed-off-by: Dao McGill <[email protected]>
1. Purpose
The purpose of this issue is to add refresh capability for the mod mbox downloader and pipermail downloader. I'll have to create a refresh function for both downloaders, as well as a parser function that parses the latest downloaded mail file. There are two mod mbox downloader functions: download_mod_mbox and download_mod_mbox_per_month. Since pagination is required for a refresh function, I will only be focusing on the download_mod_mbox_per_month function.
2. Process
I will base my changes and new code on the already existing code related to the mbox downloader and parser. For the refresh capability, I will look through Sean's jira downloader refresher to get a good idea on how I should make it. Though from what I already know about it, I will most definitely be making a new function that takes a date of some sort.
3. Endpoints
From the meeting, apparently I only have year and month to work with when it comes to end points. I'll do a bit more checking around just to make sure.
4. Task List
Refresher (Endpoint)
I'll be using year for the end point. For the refresher function, I'll make the upper bound endpoint the current year, getting it by some built-in function that returns the current year.
Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)
Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path,verbose=FALSE)
New Parser: parse_mbox_latest_date(mbox_path)
Incorporating Month as an Endpoint Along With Year
Currently, the endpoint parameters for the downloader/refresher functions that take them only take a year (i.e. 2004). Due to this, the downloaders will always start at the beginning of the year when downloaded at a certain "from" year. It is 100% possible to make it so that the downloader can start at a specified month as well as a year. The logic in order to do so is as follows:
Pipermail: Manually Prompting Pipermail Refresher to Start After a Certain Year and Month
Pipermail archives have their archived mail in txt or txt.gz formats. Here is an example of a pipermail archive
In this picture, you can see that the downloadable versions of each mail file are viewable with a link to the txt file. Clicking on the link takes you to this page:
As you can see, this is a raw file of all the mail messages in April 2018. Notice the naming convention of the downloadable file, which is underlined in red. The file is named on a year-month basis. You'll want to download the file whose date you want to start from, and put it in the save folder in which you will be running the pipermail refresh on.
Next, you will want to rename your downloaded file to the correct naming format (i.e. openssl_mta_201804.mbox as per the second picture). With that, the refresher should start from the month and year that your downloaded file is from.
Chances are, you might not even need to name the file in the correct naming format; as long as you have the yearmonth part of the name and the correct extension (i.e. 201804.mbox should be enough to start from April 2018), it should work. You might not even need to manually download the file from the mail archive to begin with; just having a blank file with the correct naming convention (or at the very least yearmonth.mbox) should be sufficient, as the refresher will just delete that file, then replace it with the actual mail file for that year and month.
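If the blank-file approach does work as described, seeding the start month could be as small as the sketch below. The helper name `seed_refresh_start` is hypothetical, and whether a bare `YYYYMM.mbox` name is accepted depends on the refresher's actual file name matching, which is not confirmed here.

```r
# Hypothetical helper: create an empty placeholder "<YYYYMM>.mbox" in
# the save folder so a refresher that keys on file names starts there.
seed_refresh_start <- function(save_folder_path, year_month) {
  stopifnot(grepl("^[0-9]{6}$", year_month))
  dir.create(save_folder_path, recursive = TRUE, showWarnings = FALSE)
  placeholder <- file.path(save_folder_path, paste0(year_month, ".mbox"))
  file.create(placeholder)
  placeholder
}
```

After calling `seed_refresh_start("save_folder_mail", "201804")`, the pipermail refresher would then be run on the same folder.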