-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Commitmessage functionalities #281
base: dev
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #281 +/- ##
==========================================
+ Coverage 81.69% 81.87% +0.17%
==========================================
Files 16 16
Lines 5146 5219 +73
==========================================
+ Hits 4204 4273 +69
- Misses 942 946 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
first draft of default function for stemming including preprocessing steps Signed-off-by: Leo Sendelbach <[email protected]>
refactor preprocessing steps to be in their own method, add tokenization, lemmatization, keyword search and token count functionalities Signed-off-by: Leo Sendelbach <[email protected]>
add new file 'test-data-misc' for tests, incomplete Signed-off-by: Leo Sendelbach <[email protected]>
Also change order in 'install.R' Signed-off-by: Leo Sendelbach <[email protected]>
Signel test for each step and one for combination Signed-off-by: Leo Sendelbach <[email protected]>
One test with only lowercase, one with all preprocessing steps Signed-off-by: Leo Sendelbach <[email protected]>
Single test as there is no preprocessing for tokenization Signed-off-by: Leo Sendelbach <[email protected]>
One test with only lowercase preprocessing, one with all preprocessing steps Signed-off-by: Leo Sendelbach <[email protected]>
multiple tests for different match functions: any, all and a custom function Signed-off-by: Leo Sendelbach <[email protected]>
Single test as all functionality is covered Signed-off-by: Leo Sendelbach <[email protected]>
force matrix version 1.5.0 or higher as required for textstem to work Signed-off-by: Leo Sendelbach <[email protected]>
now also testing the possible configuration of only title vs full message Signed-off-by: Leo Sendelbach <[email protected]>
change method name from 'get.preprocessed.messages' to 'get.preprocessed.commit.messages' for consistency Signed-off-by: Leo Sendelbach <[email protected]>
Since version 1.5.4 did not work, change matrix version to 1.5.0 Signed-off-by: Leo Sendelbach <[email protected]>
install.R
Outdated
Matrix.version = installed.packages()[rownames(installed.packages()) == "Matrix", "Version"] | ||
if (compareVersion(Matrix.version, "1.3.0") == -1) { | ||
print("WARNING: Matrix version 1.3.0 or higher is necessary for using coronet. Re-install package Matrix...") | ||
install.packages("Matrix", dependencies = NA, verbose = TRUE, quiet = TRUE) | ||
if (compareVersion(Matrix.version, "1.5.0") == -1) { | ||
print("WARNING: Matrix version 1.5.0 or higher is necessary for using coronet. Re-install package Matrix...") | ||
matrix.1.5.4.url = "https://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.5-4.tar.gz" | ||
install.packages(matrix.1.5.4.url, repos = NULL, dependencies = NA, verbose = TRUE, quiet = TRUE) | ||
## redo installation of textstem, which fails if matrix is outdated or not present | ||
install.packages("textstem", dependencies = NA, verbose = TRUE, quiet = TRUE) | ||
Matrix.version = installed.packages()[rownames(installed.packages()) == "Matrix", "Version"] | ||
if (compareVersion(Matrix.version, "1.3.0") == -1) { | ||
if (compareVersion(Matrix.version, "1.5.0") == -1) { | ||
print("WARNING: Re-installation of package Matrix did not end up in the necessary package version.") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird that this worked for R4.1, but not for R4.0. Unfortunately, I did not find any other reason for why quanteda stills fails to be installed on R4.0 although Matrix is present.
I have a couple of ideas, but I don't know whether any of this helps:
(1) Try to install system dependency "libtbb-dev" as suggested by quanteda. To do so, could you please add "sudo apt-get install --assume-yes libtbb-dev" to file ".github/workflows/pull_request.yml" in which all the system dependencies are installed for our test pipeline? Maybe this helps... (Please make sure to be able to undo all these installation changes via rebase if they don't work. I don't want to have such things ending up in the coronet commit history...)
(2) If this does not help, we need to figure out why the package installation fails. To do so, we need to find a way to figure out why this fails and need to get access to the full installation logs. Actually, I don't have an idea whether we can enable verbose logs for our CI - I don't think so. The best option would be to run the package installation on a local VM or container on which R4.0 is installed. Unfortunately, we don't have such a local container available right now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding libtbb-dev does not solve the problem. I will try to run the installation on R4.0 locally next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set up a conda virtual environment for further testing.
Installing r-base 4.0.5 and running 'install.packages("quanteda")' (after installing matrix 1.5.0) results in the following error message:
** R
Error in parse(outFile) :
/tmp/RtmpNJiizT/R.INSTALL1d8d2689383b6/quanteda/R/char_select.R:53:61: unexpected '>'
52: valuetype = valuetype,
53: case_insensitive = case_insensitive) |>
^
ERROR: unable to collate and parse R files for package ‘quanteda’
I have no idea what that could imply. The only other warning was the following:
configure: WARNING: Intel TBB not installed; install TBB devel package for parallel processing or update PKG_CONFIG_PATH environment variable
I do not think this has anything to do with it since it seems like a configuration option and not something that would result in a failing installation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congratulations @Leo-Send! You have found a bug/inconsistency that very likely violates the dependencies of package quanteda
🎉
The code of quanteda
(assuming that the lines you quoted from "char_select.R" belong to this package) uses the so-called "native pipe" |>
, which has been introduced in R 4.1.0. Therefore, this does not work with R 4.0 and ends up in an error there.
We definitively should report this to quanteda
. There are two options how this can be fixed:
(1) Either they want to support older R versions and don't want to restrict the R version - then they need to replace the native pipe in their code by code that uses only language constructs that are present in R 4.0.
(2) Or they don't want to replace the native pipe - then they have to update their R version dependency in their DESCRIPTION
file to be compatible with CRAN requirements.
Option (1) would be the best option for us and won't require any changes on our side (except for the need to wait until they release their fix).
Option (2) would restrict us to discontinue the support for R 4.0 or to install an older version of quanteda
in this case.
In the end, it is up to the package maintainers which option they choose, so we need to wait for that - there is nothing that we can do in the meantime, as we don't know how they will decide. In the past, I've already reported two of such requirements violations to two different R packages (one of them also was about the native pipe). The one R package reacted to my report in a way that they fixed their dependency requirement. The other R package reacted to my report by removing the native pipe from their code as they wanted to support R 4.0 for at least one more year.
So, the question is: Would you like to report this to the package yourself, or would you like me to do that? If you would like to do that, here a few tips:
- Don't link it to our project here - just mention that you have an environment that still relies on R 4.0.5 and that you are unable to change for legacy reasons. (This might still lead package maintainers to recommend to use a more recent R version, but one could still give it a try).
- Don't mention any other packages (such as matrix) or unrelated warnings (such as the TBB warning) because this is not necessary for reproducing this problem.
- Explain that the use of the native pipe conflicts with the required R version in their
DESCRIPTION
file of their most recent release. - Cc @bockthom and @hechtlC in the issue there such that we receive notifications from the issue.
You can also draft a message first and let Christian or me check it before you post it there if you are unsure, or @hechtlC or I can report it if you don't want to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I am opening an issue with quanteda. I will, however, need to mention Matrix as the manual installation of matrix 1.5.0 is needed to reproduce the bug (because the installation would otherwise fail earlier).
install.R
Outdated
install.packages("RcppParallel", dependencies = NA, verbose = TRUE, quiet = TRUE) | ||
install.packages("stringi", dependencies = NA, verbose = TRUE, quiet = TRUE) | ||
quanteda.3.3.1.url = "https://cran.r-project.org/src/contrib/Archive/quanteda/quanteda_3.3.1.tar.gz" | ||
install.packages(quanteda.3.3.1.url, repos = NULL, dependencies = NA, verbose = TRUE, quiet = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check that installs an older version of quanteda manually, similar to the workaround for matrix. It is ugly, as this version requires additional dependencies (which apparently need to be installed seperately... I am unsure why). Nevertheless, we could use this workaround to continue supporting R4.0 without relying on quanteda fixing their bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the issue report and for finding a temporary workaround.
A few comments on this:
- Regarding the other hard-coded dependencies: Did you figure out whether setting the
dependencies
parameter of the quanteda installation toTRUE
(instead ofNA
) would help to avoid hard coding the installation of the other dependent packages? - We are not on a rush. Let's wait how the quanteda maintainers react to your issue. If they fix it on there side, we don't need this fix on our side. If not, we need to think about whether we prefer such lengthy workarounds with hard-coded packages, or whether we prefer ceasing support for older R versions - but there still is enough time to do discuss this in a couple of weeks.
- Just a note for future debugging (nothing to change): It might help to temporarily set the
quiet
parameter toFALSE
to get a more detailed error log on the CI pipeline if we end up in a situation where we don't know what's going on... - Please don't forget to remove the commit that added the system library. (And when I say "remove", I don't mean to undo the commit, but really remove it via interactive rebase).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you figure out whether setting the dependencies parameter of the quanteda installation to TRUE (instead of NA)
I tried that, but it didn't work. The installation fails because it could not get the dependencies.
It might help to temporarily set the quiet parameter to FALSE to get a more detailed error log on the CI pipeline
Since the bug was reproducible locally, I did not think it was necessary. I did debug without the quiet parameter though.
remove it via interactive rebase
On it :)
7698768
to
236dd6a
Compare
added new section in 'additional functionalities' Signed-off-by: Leo Sendelbach <[email protected]>
236dd6a
to
ee63a6e
Compare
Prerequisites
showcase.R
with respect to my changes.dev
.Description
Adds commit message functionalities to coronet, as discussed in issue #277. Includes Stemming, Tokenization, Lemmatization, keyword search and a token count metric.
Changelog