v3.4 #144
Conversation
Contains only basic computation. Tests aren't adapted. The function add.vertex.attribute.first.activity is able to handle multiple data sources now. The result is stored as a list containing all minimums, named with the corresponding data source. Signed-off-by: Klara Schlueter <[email protected]>
Concerns method add.vertex.attribute.first.activity. If no information is found for some person and data source, then the corresponding element of the list added as vertex attribute is set to NA (instead of not being created). Signed-off-by: Klara Schlueter <[email protected]>
A flag of the function add.vertex.attribute.first.activity now indicates whether the first activity should be computed and stored per single data source or "globally" over all given data sources. Signed-off-by: Klara Schlueter <[email protected]>
Improve naming and add helper methods. Signed-off-by: Klara Schlueter <[email protected]>
One of three cases is tested: given multiple data sources, compute the first activity per person and data source. Tests for the other two cases will be added in upcoming commits. Signed-off-by: Klara Schlueter <[email protected]>
Tests one of three cases: compute the first activity per person over all given data sources (one value per person). Tests for the other two cases are added in commit 9224150 and one future commit. Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Tests one of three cases: given one data source, compute the first activity per person per data source. Tests for the other two cases were added in commit 9224150 and commit d7ebb79. Signed-off-by: Klara Schlueter <[email protected]>
Rename function, rectify documentation structure, delete browser statement. Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Improve add.vertex.attribute.first.activity Reviewed-by: Claus Hunsen <[email protected]> Reviewed-by: Thomas Bock <[email protected]>
When the data for a data source are empty (i.e., an empty data.frame), we need to deal with the corresponding empty list 'minimums.per.person' in the function 'get.first.activity.data'; otherwise, the next lines of code will throw an error. The easy fix is to check whether the data are empty, add NA values to the intermediate data structure and, then, skip the next lines of code. Signed-off-by: Claus Hunsen <[email protected]>
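A minimal sketch (not the library code, and assuming illustrative columns 'author.name' and 'date') of the fix described above: when the data for a data source are empty, fill the intermediate structure with NA values and skip the aggregation that would otherwise fail.

```r
## Sketch only: mimic the empty-data handling described in the commit above.
## Column names 'author.name' and 'date' are assumptions for illustration.
get.first.activity.sketch = function(data, persons) {
    if (nrow(data) == 0) {
        ## empty data source: no first activity can be determined for anyone
        return(setNames(as.list(rep(NA, length(persons))), persons))
    }
    ## regular case: earliest activity date per person, NA if the person is absent
    minimums.per.person = lapply(persons, function(person) {
        dates = data[data[["author.name"]] == person, "date"]
        if (length(dates) == 0) NA else min(dates)
    })
    names(minimums.per.person) = persons
    return(minimums.per.person)
}

## works for empty and non-empty data alike:
empty.issues = data.frame(author.name = character(0), date = as.POSIXct(character(0)))
get.first.activity.sketch(empty.issues, persons = c("Alice", "Bob"))
```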
As a follow-up to 4a9ad23, the tests for the vertex attribute 'first.activity' are changed to contain empty issue data. This works against potential regressions when empty data sources are not properly handled. Signed-off-by: Claus Hunsen <[email protected]>
When planning the functionality for computing the vertex attribute 'first.activity', the straight-forward idea was to "merge lists" and then simply run 'min' for the global first activity if needed. When implementing the functionality in commit 334c258, the "simple" merge proved to be not that easy to perform. With some effort, in this patch, we replace the current implementation with the originally planned version. When performing a microbenchmark on BusyBox sample data (five two-year ranges of data and networks), comparing both versions of the code (see column 'expr' below), the new version proves to be faster (also due to better parallelization):

expr      min       lq     mean   median       uq      max neval
 old 83.07881 84.16226 85.79368 85.44123 87.45646 89.70994    20
 new 75.85983 78.09042 78.76604 78.71732 79.44771 83.42560    20

Additionally, the code is more concise and we do not need additional helper functions. Add namespace imports to module 'networks-covariates'. Signed-off-by: Claus Hunsen <[email protected]>
For convenience, we move the vector 'c("Base_Feature", "File_Level")' to be a global constant in the module 'data'. Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
When calling the function 'ProjectData$get.artifacts()', we currently retrieve the list of source-code artifacts only. With this patch, we can give the desired data source as a parameter to the function. To implement this generic functionality, we need to add two global map constants: 'DATASOURCE.TO.ARTIFACT.FUNCTION' and 'DATASOURCE.TO.ARTIFACT.COLUMN'. The first maps the data sources to the specific function that is needed to get the correct (filtered) artifact data. The latter maps the data sources to the specific column in the data where the artifacts are stored. This contributes work towards issue #97. Additionally, adjust some TODO items. Signed-off-by: Claus Hunsen <[email protected]>
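As a hedged illustration of the structure of these two map constants (the data sources and all values below are assumptions for illustration, not taken from the module 'data'):

```r
## Illustrative sketch only: named lists mapping data sources to an accessor
## function name and to the column holding the artifacts. All values are
## assumptions, not the real constants from the module 'data'.
DATASOURCE.TO.ARTIFACT.FUNCTION = list(
    commits = "get.commits",  # assumed accessor name
    mails   = "get.mails",    # assumed accessor name
    issues  = "get.issues"    # assumed accessor name
)
DATASOURCE.TO.ARTIFACT.COLUMN = list(
    commits = "artifact",     # assumed column name
    mails   = "thread",       # assumed column name
    issues  = "issue.id"      # assumed column name
)

## look up the artifact column for one data source:
data.source = "mails"
DATASOURCE.TO.ARTIFACT.COLUMN[[data.source]]
```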
As a follow-up to 0d184b8, the data-mapping methods of ProjectData are adapted as follows. The following methods are incorporated into 'get.author2artifact' and 'get.artifact2author', respectively, where the data source needs to be passed as a parameter: - get.thread2author, - get.author2thread, - get.issue2author, and - get.author2issue. The following method is incorporated into 'get.artifact2artifact' which retrieves data for the bipartite relations (the data source needs to be passed as a parameter): - get.commit2artifact. The following methods are removed: - get.author2commit, - get.author2mail, - get.author2file, and - get.commit2file. Affected methods and functions as well as the showcase file are adapted appropriately. The test suite is fixed due to a re-order of vertices in one test case. This fixes #97. Signed-off-by: Claus Hunsen <[email protected]>
Currently, we have the possibility to split data in a time-based manner by specifying a time period or specific bins. However, we do not have the possibility to specify the number of windows. With this commit, we add this functionality. To implement this functionality, the functions 'split.data.time.based' and 'split.get.bins.time.based' both get a new parameter 'number.windows' and the function 'generate.date.sequence' a parameter 'length.out'. Additionally, adjust function documentation appropriately. This fixes #49. Signed-off-by: Claus Hunsen <[email protected]>
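A minimal sketch (plain R, not the library implementation) of how a fixed number of windows can be derived from a date range via 'length.out':

```r
## Sketch only: derive 'number.windows' equally long windows from a date range
## by generating 'number.windows + 1' boundary dates.
start.date = as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
end.date = as.POSIXct("2016-12-31 23:59:59", tz = "UTC")
number.windows = 4

boundaries = seq(from = start.date, to = end.date, length.out = number.windows + 1)
windows = data.frame(start = head(boundaries, -1), end = tail(boundaries, -1))
windows  # four consecutive, equally sized time windows
```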
When selecting the right data for a call-graph-based artifact network, it is necessary to check additionally whether we have RangeData objects at hand – otherwise, we catch non-expressive error messages. Signed-off-by: Claus Hunsen <[email protected]>
With the new function 'delete.authors.without.bipartite.edges', the user is able to delete author vertices from a network that do not have incident bipartite edges. This fixes #76. Signed-off-by: Claus Hunsen <[email protected]>
Different from issue templates, pull-request templates in the folder '.github/PULL_REQUEST_TEMPLATE/' only work with the URL parameter 'template' [1], i.e., there is no default template that is filled in any time. Therefore, the template is moved to a place where it is recognized properly. [1] https://help.github.com/articles/creating-a-pull-request-template-for-your-repository/ Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
The 'sapply' in the function 'split.networks.time.based' does not work if the provided list of networks only contains one element: in that case, it returns a matrix containing one column instead of returning a list. However, this leads to problems when passing this matrix to the function 'get.date.from.unix.timestamp'. When using 'lapply' instead, we get a list all the time, which fixes the problem. In addition, the corresponding test in the test suite is enhanced by an additional statement which just runs the function with a list containing one element (to keep track of that for future changes). Signed-off-by: Thomas Bock <[email protected]>
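A small, self-contained illustration of the 'sapply' pitfall described above (illustrative data, not the library code):

```r
## With several elements of differing lengths, 'sapply' cannot simplify and
## returns a list; with a single element, it simplifies to a one-column matrix.
dates.many = list(net.a = 1:3, net.b = 1:5)
dates.one = list(net.a = 1:3)

class(sapply(dates.many, identity))  # "list"
class(sapply(dates.one, identity))   # "matrix" "array" -- breaks downstream code
class(lapply(dates.one, identity))   # "list" -- hence the switch to 'lapply'
```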
Extend the documentation in the source code for the function 'get.first.activity.data' by including a running example. Additionally, improve the identification of the minimal first activity across data sources (i.e., use 'do.call(c, ...)' instead of 'unlist'). Props to @bockthom for pointing this out. Signed-off-by: Claus Hunsen <[email protected]>
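A short illustration (hypothetical dates) of the 'do.call(c, ...)' vs. 'unlist' point mentioned above: 'unlist' drops the POSIXct class of dates, while 'do.call(c, ...)' preserves it, so the subsequent 'min' yields a proper date.

```r
## 'unlist' strips the POSIXct class, 'do.call(c, ...)' keeps it.
first.activities = list(
    commits = as.POSIXct("2016-07-12 15:58:59", tz = "UTC"),
    mails   = as.POSIXct("2016-07-12 16:04:40", tz = "UTC")
)

unlist(first.activities)            # named numeric vector, date class is lost
do.call(c, first.activities)        # still POSIXct
min(do.call(c, first.activities))   # the minimal first activity as a date
```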
To improve comprehensibility of the function 'merge.network.data', the code formatting is improved by re-arranging blank lines and comments. Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
The function 'save.and.load' should also be able to work without assignment of its return value. This way, the real potential of the parameter 'variable' is unleashed. Until now, the parameter was needed to access the given variable from the environment stored in a dumped file. With the solution implemented in this patch, the stored variable is assigned to the calling environment under the given name (independent of the data's existence). This way, the assignment of the return value is not necessary anymore. However, for compatibility, it is still possible to assign the return value to another variable name when calling the function. Additionally, the documentation of the function is improved by giving more details on the internals and a (hopefully) clearer wording. Thanks to @hechtlC for pointing this out, although unintentionally. Signed-off-by: Claus Hunsen <[email protected]>
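A minimal, hypothetical sketch (not the actual 'save.and.load' implementation) of the assignment idea described above: the variable stored in the dump file is assigned into the calling environment under the given name, so that assigning the return value becomes optional.

```r
## Sketch only: load or compute a variable, then assign it into the caller's
## environment under the given name ('compute' and the file layout are assumptions).
save.and.load.sketch = function(variable, dump.file, compute) {
    if (file.exists(dump.file)) {
        dumped = new.env()
        load(dump.file, envir = dumped)        # restore the dumped environment
        value = get(variable, envir = dumped)  # retrieve the stored variable
    } else {
        value = compute()                      # compute the data and dump it
        assign(variable, value)
        save(list = variable, file = dump.file)
    }
    ## make the variable available in the calling environment, too
    assign(variable, value, envir = parent.frame())
    return(invisible(value))
}

## usage: the side effect makes 'my.data' available without an explicit assignment
dump.file = file.path(tempdir(), "my-data.RData")
save.and.load.sketch("my.data", dump.file, compute = function() rnorm(10))
print(my.data)
```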
Signed-off-by: Claus Hunsen <[email protected]>
Fix minor documentation issues that have been pointed out by @bockthom in his review on PR #140. Additionally, rename function 'delete.authors.without.bipartite.edges' to 'delete.authors.without.specific.edges' and allow both types of edges via an argument. Signed-off-by: Claus Hunsen <[email protected]>
Currently, we have the possibility to split data in a time-based manner by specifying a time period, specific bins, or the number of resulting time windows. However, we do not have the latter possibility for network splitting. With this commit, we add this functionality. To implement this functionality, the functions 'split.network.time.based' and 'split.networks.time.based' both get a new parameter to handle this functionality. Additionally, streamline other 'split.*' functions for uniformity regarding the 'number.windows' and 'sliding.window' parameters. Finally, adjust function documentation appropriately. This is a follow-up for commit 40974ba. Hopefully, this really fixes #49. Thanks to @bockthom for pointing this out in his review on PR #140. Signed-off-by: Claus Hunsen <[email protected]>
When giving a parameter 'length.out' to the function 'generate.date.sequence', the resulting time period may be a fractional number. This may result in additional generated dates, i.e., the length of the date sequence is not 'length.out' anymore – with the difference between the last two dates likely being only one second. To fix this, the calculated time period is rounded up. This will decrease the length of the last time period between dates, but, at most, only by 'length.out' seconds. To introduce a test case for this, the function 'split.get.bins.time.based' is fixed to set the levels on the bin factor in the correct order. Signed-off-by: Claus Hunsen <[email protected]>
For some reason, when using the parameter 'directed = TRUE', the function 'igraph::eigen_centrality' can run into ARPACK errors (in some rare cases). For example, for a small subnet of an author network of BusyBox having 7 nodes and 12 edges, we get the following error:

Error in base::.Call(.NAME, ...) :
  At arpack.c:1174 : ARPACK error, Maximum number of iterations reached
In addition: Warning message:
In .Call("R_igraph_eigen_adjacency", graph, algorithm, which, options, :
  At arpack.c:776 : ARPACK solver failed to converge (1001 iterations, 0/1 eigenvectors converged)

Those errors were already discussed in the igraph project on GitHub, e.g.: igraph/igraph#512. According to the conversation there, to avoid those errors, the number of iterations could be increased from its default value 1000 to a higher number of iterations. Also, the ncv value could be increased. However, in the above example, this still runs into the described error. Hence, we need a more drastic action: increase the accuracy tolerance from 0 to 0.1. This leads to less precise results but avoids the error (at least in the example stated above). So, in general, we still compute 'eigen_centrality' with its default ARPACK options. If this leads to an error, compute the 'eigen_centrality' again by using the adjusted ARPACK options (maxiter = 5000 and tol = 0.1). Signed-off-by: Thomas Bock <[email protected]>
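A hedged sketch of the retry strategy described above (assuming an igraph version where 'arpack_defaults' is the default ARPACK option list): compute the centrality with default options first and, only on error, retry with maxiter = 5000 and tol = 0.1.

```r
## Sketch only: default ARPACK options first, relaxed options as a fallback.
library(igraph)

compute.eigen.centrality = function(network) {
    tryCatch(
        igraph::eigen_centrality(network, directed = TRUE)[["vector"]],
        error = function(e) {
            ## fallback: less precise (tol = 0.1), but avoids the ARPACK
            ## convergence error; assumes 'arpack_defaults' is a plain list
            arpack.options = igraph::arpack_defaults
            arpack.options[["maxiter"]] = 5000
            arpack.options[["tol"]] = 0.1
            igraph::eigen_centrality(network, directed = TRUE,
                                     options = arpack.options)[["vector"]]
        }
    )
}
```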
Thanks to @bockthom for spotting the wrong default value of the argument 'specific.edge.types'. Signed-off-by: Claus Hunsen <[email protected]>
By adding the method 'group.data.source.by.column' to the class 'ProjectData', we provide alternatives to some of the methods that have been removed in commit 23a8aa3. By providing the data source and the column in the data source to group by, the following methods can be replaced with appropriate calls to the new method: - get.author2commit and get.author2file (data.source = "commits", column = "author.name"), and - get.author2mail (data.source = "mails", column = "author.name"). Signed-off-by: Claus Hunsen <[email protected]>
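A plain-R illustration (not the 'ProjectData' method itself) of the grouping idea behind 'group.data.source.by.column': split the rows of a data source by the values of one of its columns.

```r
## Sketch only: grouping illustrative commit data by the column 'author.name'.
commit.data = data.frame(
    author.name = c("Alice", "Bob", "Alice"),
    hash = c("a1b2c3", "d4e5f6", "a7b8c9"),
    stringsAsFactors = FALSE
)

author2commit = split(commit.data, commit.data[["author.name"]])
author2commit[["Alice"]]  # all commit rows authored by 'Alice'
```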
Thanks to @bockthom for pointing out that the inline documentation has not been adapted in commit 4e211f0. Signed-off-by: Claus Hunsen <[email protected]>
Due to an error introduced in 94cc87b, the sliding-window parameter has not been handled correctly when splitting networks and data. To fix the problem, the code that disables sliding windows when splitting bins are set is moved to the correct position. Additionally, the code in the function 'split.network.time.based' is adapted to conform with the other functions regarding the if-condition on specified bins. Props to @bockthom for spotting this. Signed-off-by: Claus Hunsen <[email protected]> Signed-off-by: Thomas Bock <[email protected]>
To conform to the other splitting functions, the function 'split.networks.time.based' now gets a parameter 'bins'. The parameter is handled in the same way as in the other functions. Signed-off-by: Claus Hunsen <[email protected]> Signed-off-by: Thomas Bock <[email protected]>
Although the patch in c5413c2 is basically alright, the added code fragments are missing package prefixes. When running into the error case caught by the additional code, the user may run into further errors. Signed-off-by: Claus Hunsen <[email protected]>
The functions 'merge.networks' and 'merge.network.data' should work in a generic way, such that they also work on general-purpose networks not coming from the network library. To enable this, the case of empty networks (with respect to the number of edges and/or vertices) is now handled properly. In detail, two special cases need to be handled: - empty edge lists created from edgeless networks need to be filtered out, and - if all edge lists end up empty, an appropriate empty edge list needs to be created (instead of NULL). Passing over empty vertex data (which results in 'NULL' in the function 'merge.network.data') does not affect the correct handling of data right away. Appropriate test cases are added in the file 'tests/test-networks.R'. This fixes #142. Signed-off-by: Claus Hunsen <[email protected]>
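A minimal sketch (not the library code) of the two special cases listed above, for a list of edge-list data.frames extracted from the networks to be merged:

```r
## Sketch only: filter empty edge lists and fall back to an empty edge list
## (instead of NULL) when nothing is left to merge.
merge.edge.lists.sketch = function(edge.lists) {
    ## (1) drop empty edge lists coming from edgeless networks
    edge.lists = Filter(function(edges) nrow(edges) > 0, edge.lists)
    ## (2) if all edge lists were empty, return an empty edge list, not NULL
    if (length(edge.lists) == 0) {
        return(data.frame(from = character(0), to = character(0)))
    }
    do.call(rbind, edge.lists)
}

edgeless = data.frame(from = character(0), to = character(0))
merge.edge.lists.sketch(list(edgeless, edgeless))  # empty data.frame, not NULL
```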
Props to @klaraschlueter for spotting this. Signed-off-by: Claus Hunsen <[email protected]>
Rename the methods in the ProjectData class that are used for retrieving base data for network construction as follows to obtain more generic method names: - 'get.artifact2author' → 'group.authors.by.data.column', - 'get.author2artifact' → 'group.artifact.by.data.column', and - 'group.data.source.by.column' → 'group.data.by.column'. The functionality of the method 'get.artifact2artifact' is incorporated into the method 'group.artifact.by.data.column'. These names are a result of the discussions in PR #140 and offline discussions with @hechtlC and @bockthom. Essentially, the changes are a follow-up to commit 23a8aa3. Now, the names express more directly the basic idea behind the methods w.r.t. the function 'get.key.to.value.from.df'. However, for convenience, delegation methods with the old names are left in place for the first two methods, although they emit deprecation warnings. Signed-off-by: Claus Hunsen <[email protected]> Signed-off-by: Thomas Bock <[email protected]>
When splitting and subsetting data with the function 'get.key.to.value.from.df', it may happen that the given values for the parameters 'key' and 'value' are not valid column names of the 'base.data'. Whenever an invalid specification is given, an error message (including a stack trace) is emitted and the script is stopped. Handling this in this function automatically enables the analogous handling in the method 'ProjectData$group.data.by.column'. Signed-off-by: Claus Hunsen <[email protected]> Signed-off-by: Thomas Bock <[email protected]>
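A minimal sketch (not the library code) of the validation described above: stop with an informative error whenever 'key' or 'value' is not a column of 'base.data'.

```r
## Sketch only: validate the key/value column names before splitting the data.
get.key.to.value.sketch = function(base.data, key, value) {
    if (!all(c(key, value) %in% colnames(base.data))) {
        stop(sprintf("Invalid key/value specification ('%s' -> '%s') for the given data.",
                     key, value))
    }
    split(base.data[[value]], base.data[[key]])
}

df = data.frame(author.name = c("Alice", "Bob"), artifact = c("f1.c", "f2.c"))
get.key.to.value.sketch(df, "author.name", "artifact")  # fine
## get.key.to.value.sketch(df, "author.name", "thread") # would stop with an error
```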
Signed-off-by: Claus Hunsen <[email protected]>
Due to an unknown incompatibility with newer versions of igraph (newer than version 1.0.1), the tests on network merging, introduced in commit 26e3bef, fail with the following error: > At iterators.c:759 : Cannot create iterator, invalid vertex id, > Invalid vertex id > > 1: expect_output(str(simplify.network(g)), "IGRAPH", fixed = TRUE) at > ./tests/test-networks.R:70 > [...] Obviously, the printing of igraph objects behaves differently in newer/older versions of igraph. To fix the problem, we take advantage of the fact that testthat's function 'expect_error' can also be used to expect that an error does *not* occur. Signed-off-by: Claus Hunsen <[email protected]>
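A short illustration of the testthat idiom used in this fix: passing 'NA' as the expected error to 'expect_error' asserts that evaluating the expression does not raise an error, independent of how igraph prints its objects.

```r
## Sketch only: 'expect_error(expr, NA)' means "expect no error here".
library(testthat)

test_that("the expression runs without errors", {
    expect_error(sum(1:10), NA)  # placeholder for simplify.network(g)
})
```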
Fix minor documentation issues caused by copy-paste errors (util-data.R). Adjust the tests on network simplification and merging to be more specific (tests/test-networks.R). Remove an obsolete statement (util-core-peripheral.R). Thanks to @bockthom for these remarks. Signed-off-by: Claus Hunsen <[email protected]>
Many little improvements and fixes Reviewed-by: Thomas Bock <[email protected]> Reviewed-by: Klara Schlüter <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
As we reviewed everything before, I will merge right away when the build checks finish successfully.
3.4
Changes in detail
v3.3...d38807b
Added
- BASE.ARTIFACTS (7031d45)
- split.data.time.based (Provide the possibility to split data into activity-based equally-sized windows #49, 40974ba, a174753)
- split.network(s).time.based (Provide the possibility to split data into activity-based equally-sized windows #49, 94cc87b, a174753, 5ac1492)
- delete.authors.without.specific.edges (Provide possibility to filter in-active developers from range-level networks #76, b9319e3, 107854c, 4e211f0, 4850666)
- ProjectData$group.authors.by.data.column and ProjectData$group.artifacts.by.data.column (Determine list of artifacts more reasonably #97, 11f7189)
- ProjectData$group.data.by.column (b78f54f, 11f7189, related to Determine list of artifacts more reasonably #97)

Changed/Improved
- first.activity for better performance (40b7d87, f518890)
- RELATION.TO.DATASOURCE to module 'networks' (1ac09f6)
- ProjectData$get.artifacts to work with all data sources (Determine list of artifacts more reasonably #97, 0d184b8)
- save.and.load to work without assignment (7f6ab1a)
- get.key.to.value.from.df (5b74038, related to Determine list of artifacts more reasonably #97)

Fixed
- first.activity to handle empty data sources (4a9ad23, 425c46b)
- split.networks.time.based regarding the case that the provided list of networks only contains one element (010a935)
- generate.date.sequence (8d80fa9)