v3.4 #144

Merged
merged 51 commits into from Oct 23, 2018

Conversation

bockthom
Collaborator

3.4

Changes in detail

v3.3...d38807b

Added

Changed/Improved

Fixed

Klara and others added 30 commits August 17, 2018 18:49
Contains only the basic computation; tests are not adapted yet. The function
add.vertex.attribute.first.activity can now handle multiple data sources.
The result is stored as a list containing all minimums, named with the
corresponding data source.

Signed-off-by: Klara Schlueter <[email protected]>
Concerns the method add.vertex.attribute.first.activity. If no information is
found for some person and data source, then the corresponding element of the
list added as vertex attribute is set to NA (instead of not being created).

Signed-off-by: Klara Schlueter <[email protected]>
A flag of the function add.vertex.attribute.first.activity now indicates
whether the first activity should be computed and stored per single data
source or "globally" over all given data sources.
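
A minimal usage sketch (only the function name is taken from this change;
all argument names and values are assumptions and may not match the real
signature):

    ## hypothetical call: one first activity per person and data source
    networks.with.attr = add.vertex.attribute.first.activity(
        list.of.networks, project.data,
        activity.types = c("commits", "mails", "issues"),
        combine.activity.types = FALSE)  # TRUE: one global value over all sources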

Signed-off-by: Klara Schlueter <[email protected]>
Improve naming and add helper methods.

Signed-off-by: Klara Schlueter <[email protected]>
One of three cases is tested: given multiple data sources, compute the first
activity per person and data source. Tests for the other two cases will be
added in upcoming commits.

Signed-off-by: Klara Schlueter <[email protected]>
Tests one of three cases: compute the first activity per person
over all given data sources (one value per person). Tests for the other two
cases are added in commit 9224150 and
one future commit.

Signed-off-by: Klara Schlueter <[email protected]>
Tests one of three cases: given one data source, compute the first activity per
person per data source. Tests for the other two cases were added in commit
9224150 and commit
d7ebb79.

Signed-off-by: Klara Schlueter <[email protected]>
Rename function, rectify documentation structure, delete browser statement.

Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Improve add.vertex.attribute.first.activity

Reviewed-by: Claus Hunsen <[email protected]>
Reviewed-by: Thomas Bock <[email protected]>
When the data for a data source are empty (i.e., an empty data.frame),
we need to deal with the corresponding empty list 'minimums.per.person'
in the function 'get.first.activity.data'; otherwise, the next lines of
code will throw an error.

The easy fix is to check whether the data are empty, add NA values to
the intermediate data structure, and then skip the next lines of code.
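
A rough sketch of the shape of that fix (only 'minimums.per.person' and
'get.first.activity.data' appear in this change; the rest is assumed):

    ## inside the loop over data sources in 'get.first.activity.data' (simplified)
    if (nrow(source.data) == 0) {
        ## empty data source: record NA for every person and skip the rest
        minimums.per.person[person.names] = NA
        next
    }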

Signed-off-by: Claus Hunsen <[email protected]>
As a follow-up to 4a9ad23, the tests
for the vertex attribute 'first.activity' are changed to contain empty
issue data. This guards against potential regressions in case empty data
sources are not properly handled.

Signed-off-by: Claus Hunsen <[email protected]>
When planning the functionality for computing the vertex attribute
'first.activity', the straightforward idea was to "merge lists" and
then simply run 'min' for the global first activity if needed. When
implementing the functionality in commit 334c258, the "simple" merge
proved not to be that easy to perform.

With some effort, in this patch, we replace the current implementation
with the originally planned version.

When performing a microbenchmark on BusyBox sample data (five two-year
ranges of data and networks), comparing both versions of the code (see
column 'expr' below), the new version proves to be faster (also due to
better parallelization):
expr      min       lq     mean   median       uq      max neval
   old 83.07881 84.16226 85.79368 85.44123 87.45646 89.70994    20
   new 75.85983 78.09042 78.76604 78.71732 79.44771 83.42560    20
Additionally, the code is more concise and we no longer need extra
helper functions.
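
The underlying idea in isolation (illustrative values only, not the
actual implementation):

    ## per person, keep one first activity per data source ...
    activity.per.source = list(
        commits = as.POSIXct("2018-08-17 18:49:00", tz = "UTC"),
        mails   = as.POSIXct("2018-09-06 14:35:00", tz = "UTC"))
    ## ... and the global first activity is just the minimum over all sources
    global.first.activity = min(do.call(c, activity.per.source))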

Add namespace imports to module 'networks-covariates'.

Signed-off-by: Claus Hunsen <[email protected]>
For convenience, we move the vector 'c("Base_Feature", "File_Level")' to
be a global constant in the module 'data'.

Signed-off-by: Claus Hunsen <[email protected]>
When calling the function 'ProjectData$get.artifacts()', we currently
retrieve the list of source-code artifacts only. With this patch, we can
give the desired data source as a parameter to the function.

To implement this generic functionality, we need to add two global map
constants: 'DATASOURCE.TO.ARTIFACT.FUNCTION' and
'DATASOURCE.TO.ARTIFACT.COLUMN'. The first maps the data sources to the
specific function that is needed to get the correct (filtered) artifact
data. The latter maps the data sources to the specific column in the
data where the artifacts are stored.
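
The column map could look roughly like this (the constant name is taken
from this commit, the concrete entries are assumptions):

    ## hypothetical contents; the real mapping may differ
    DATASOURCE.TO.ARTIFACT.COLUMN = list(
        "commits" = "artifact",
        "mails"   = "thread",
        "issues"  = "issue.id")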

This works towards issue #97.

Additionally, adjust some TODO items.

Signed-off-by: Claus Hunsen <[email protected]>
As a follow-up to 0d184b8, the
data-mapping methods of ProjectData are adapted as follows.

The following methods are incorporated into 'get.author2artifact' and
'get.artifact2author', respectively, where the data source needs to be
passed as a parameter:
- get.thread2author,
- get.author2thread,
- get.issue2author, and
- get.author2issue.
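
For example, a call would then roughly look like this (the argument name
'data.source' is an assumption):

    ## former 'get.author2issue'
    author.to.issue = project.data$get.author2artifact(data.source = "issues")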

The following method is incorporated into 'get.artifact2artifact' which
retrieves data for the bipartite relations (the data source needs to be
passed as a parameter):
- get.commit2artifact.

The following methods are removed:
- get.author2commit,
- get.author2mail,
- get.author2file, and
- get.commit2file.

Affected methods and functions as well as the showcase file are adapted
appropriately. The test suite is fixed due to a re-order of vertices in
one test case.

This fixes #97.

Signed-off-by: Claus Hunsen <[email protected]>
Currently, we have the possibility to split data in a time-based manner
by specifying a time period or specific bins. However, we do not have
the possibility to specify the number of windows. With this commit, we
add this functionality.

To implement this functionality, the functions 'split.data.time.based'
and 'split.get.bins.time.based' both get a new parameter
'number.windows' and the function 'generate.date.sequence' a parameter
'length.out'.
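
A hedged usage sketch (the argument name 'number.windows' is from this
commit; everything else is assumed):

    ## split the project data into exactly four time windows of equal length
    ranges = split.data.time.based(project.data, number.windows = 4)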

Additionally, adjust function documentation appropriately.

This fixes #49.

Signed-off-by: Claus Hunsen <[email protected]>
When selecting the right data for a call-graph-based artifact network,
it is additionally necessary to check whether we have RangeData objects
at hand; otherwise, we get non-expressive error messages.

Signed-off-by: Claus Hunsen <[email protected]>
With the new function 'delete.authors.without.bipartite.edges', the user
is able to delete author vertices from a network that do not have
incident bipartite edges.
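
For illustration (a sketch, assuming the function takes the network as
its only argument):

    ## drop author vertices without any incident bipartite edge
    network = delete.authors.without.bipartite.edges(network)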

This fixes #76.

Signed-off-by: Claus Hunsen <[email protected]>
Different from issue templates, pull-request templates in the folder
'.github/PULL_REQUEST_TEMPLATE/' only work with the URL parameter
'template' [1], i.e., there is no default template that is filled in
automatically. Therefore, the template is moved to a place where it is
recognized properly.

[1]
https://help.github.com/articles/creating-a-pull-request-template-for-your-repository/

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
The 'sapply' in the function 'split.networks.time.based' does not work if
the provided list of networks contains only one element: In that case, it
returns a matrix containing one column instead of returning a list.
However, this leads to problems when passing this matrix to the function
'get.date.from.unix.timestamp'.
When using 'lapply' instead, we always get a list, which fixes the
problem.
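
The underlying R behaviour, shown in isolation:

    x = list(net1 = c(1534500000, 1534600000))  # a list with a single element
    sapply(x, range)  # simplifies the result to a one-column matrix
    lapply(x, range)  # always returns a list, no matter how many elements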

In addition, the corresponding test in the test suite is enhanced by an
additional statement which just runs the function with a list containing
one element (to keep track of that for future changes).

Signed-off-by: Thomas Bock <[email protected]>
Extend the documentation in the source code for the function
'get.first.activity.data' by including a running example.
Additionally, improve the identification of the minimal first activity
across data sources (i.e., use 'do.call(c, ...)' instead of 'unlist').
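
The reason for preferring 'do.call(c, ...)' over 'unlist' here:

    dates = list(as.POSIXct("2018-08-17 18:49:00", tz = "UTC"),
                 as.POSIXct("2018-09-06 14:35:00", tz = "UTC"))
    unlist(dates)      # drops the POSIXct class, yielding bare numeric values
    do.call(c, dates)  # keeps the POSIXct class, so 'min' returns a proper date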

Props to @bockthom for pointing this out.

Signed-off-by: Claus Hunsen <[email protected]>
To improve comprehensibility of the function 'merge.network.data', the
code formatting is improved by re-arranging blank lines and comments.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
clhunsen and others added 21 commits September 6, 2018 14:35
The function 'save.and.load' should also be able to work without
assignment of its return value. This way, the real potential of the
parameter 'variable' is unleashed. Until now, the parameter was needed
to access the given variable from the environment stored in a dumped
file.

With the solution implemented in this patch, the stored variable is
assigned to the calling environment under the given name (independently
of the data's existence). This way, the assignment of the return value
is not necessary anymore. However, for compatibility, it is still
possible to assign the return value to another variable name when
calling the function.
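
A hedged sketch of the assignment mechanism described above (not the
actual implementation; 'data' is a placeholder for the loaded object):

    ## inside 'save.and.load' (simplified): after loading or computing 'data',
    ## bind it to the given name in the caller's environment
    assign(variable, data, envir = parent.frame())
    return(invisible(data))  # assigning the return value stays optional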

Additionally, the documentation of the function is improved by giving
more details on the internals and a (hopefully) clearer wording.

Thanks to @hechtlC for pointing this out, although unintentionally.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Fix minor documentation issues that have been pointed out by @bockthom
in his review on PR #140.

Additionally, rename function 'delete.authors.without.bipartite.edges'
to 'delete.authors.without.specific.edges' and allow both types of edges
via an argument.

Signed-off-by: Claus Hunsen <[email protected]>
Currently, we have the possibility to split data in a time-based manner
by specifying a time period, specific bins, or the number of resulting
time windows. However, we do not have the latter possibility for network
splitting. With this commit, we add this functionality.

To implement this functionality, the functions
'split.network.time.based' and 'split.networks.time.based' both get a
new parameter to handle this functionality. Additionally, streamline
other 'split.*' functions for uniformity regarding the 'number.windows'
and 'sliding.window' parameters.
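
Analogously to the data splitting, a usage sketch (the argument name
'number.windows' is from this commit, the rest is assumed):

    ## split a network into exactly four time windows of equal length
    networks.per.range = split.network.time.based(network, number.windows = 4)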

Finally, adjust function documentation appropriately.

This is a follow-up for commit 40974ba.
Hopefully, this really fixes #49.
Thanks to @bockthom for pointing this out in his review on PR #140.

Signed-off-by: Claus Hunsen <[email protected]>
When giving a parameter 'length.out' to the function
'generate.date.sequence', the resulting time period may get a fractional
number. This may result in additional dates being generated, i.e., the
length of the date sequence is not 'length.out' anymore, with the
difference between the last two dates likely being only one second.

To fix this, the calculated time period is rounded up. This will
decrease the length of the last time period between dates, but, at most,
only by 'length.out' seconds.
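
The effect of the fix, in numbers (illustrative only): a range of 10
seconds split into 3 periods yields a fractional period of 3.33 seconds;
rounding it up gives

    ceiling(10 / 3)  # = 4 seconds per period

so every period but the last becomes slightly longer and the last one
correspondingly shorter.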

To introduce a test case for this, the function
'split.get.bins.time.based' is fixed to set the levels of the bin factor
in the correct order.

Signed-off-by: Claus Hunsen <[email protected]>
For some reason, when using the parameter 'directed = TRUE', the function
'igraph::eigen_centrality' can run into ARPACK errors (in some rare cases).

For example, for a small subnet of an author network of BusyBox having
7 nodes and 12 edges, we get the following error:

Error in base::.Call(.NAME, ...) :
  At arpack.c:1174 : ARPACK error, Maximum number of iterations reached
In addition: Warning message:
In .Call("R_igraph_eigen_adjacency", graph, algorithm, which, options, :
  At arpack.c:776 :ARPACK solver failed to converge (1001 iterations,
0/1 eigenvectors converged)

Those errors were already discussed in the igraph project on GitHub, e.g.:
igraph/igraph#512
According to the conversation there, to avoid those errors, the number of
iterations could be increased from its default value of 1000 to a higher
number. Also, the ncv value could be increased. However, in the above
example, this still runs into the described error.

Hence, we need a more drastic action: Increase the accuracy tolerance
from 0 to 0.1. This leads to less precise results but avoids the error
(at least in the example stated above).

So, in general, we still compute 'eigen_centrality' with its default
ARPACK options. If this leads to an error, we compute 'eigen_centrality'
again using the adjusted ARPACK options (maxiter = 5000 and tol = 0.1).
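
A hedged sketch of that fallback (the values 'maxiter = 5000' and
'tol = 0.1' are from this commit; the surrounding code is illustrative
and assumes an igraph version where 'arpack_defaults' is a plain list):

    ## use relaxed ARPACK options only if the default options fail
    arpack.options = igraph::arpack_defaults
    arpack.options[["maxiter"]] = 5000
    arpack.options[["tol"]] = 0.1
    centrality = tryCatch(
        igraph::eigen_centrality(network, directed = TRUE)$vector,
        error = function(e) {
            igraph::eigen_centrality(network, directed = TRUE,
                                      options = arpack.options)$vector
        })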

Signed-off-by: Thomas Bock <[email protected]>
Thanks to @bockthom for spotting the wrong default value of the argument
'specific.edge.types'.

Signed-off-by: Claus Hunsen <[email protected]>
By adding the method 'group.data.source.by.column' to the class
'ProjectData', we provide alternatives to some of the methods that have
been removed in commit 23a8aa3. By
providing the data source and the column in the data source to group by,
the following methods can be replaced with appropriate calls to the new
method:
    - get.author2commit and get.author2file
      (data.source = "commits", column = "author.name"), and
    - get.author2mail
      (data.source = "mails", column = "author.name").
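
For example (a sketch, assuming a ProjectData instance named
'project.data'):

    ## former 'get.author2commit' / 'get.author2file'
    author.to.commit = project.data$group.data.source.by.column(
        data.source = "commits", column = "author.name")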

Signed-off-by: Claus Hunsen <[email protected]>
Thanks to @bockthom for pointing out that the inline documentation has
not been adapted in commit 4e211f0.

Signed-off-by: Claus Hunsen <[email protected]>
Due to an error introduced in 94cc87b,
the sliding-window parameter has not been handled correctly when
splitting networks and data. To fix the problem, the code that disables
sliding windows when splitting bins are set is moved to the correct
position.

Additionally, the code in the function 'split.network.time.based' is
adapted to conform with the other functions regarding the if-condition
on specified bins.

Props to @bockthom for spotting this.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
To conform to the other splitting functions, the function
'split.networks.time.based' now gets a parameter 'bins'. The parameter
is handled in the same way as in the other functions.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
Although the patch in c5413c2 is
basically alright, the added code fragments are missing package
prefixes. When running into the error case caught by the additional code,
the user may run into further errors.
Signed-off-by: Claus Hunsen <[email protected]>
The functions 'merge.networks' and 'merge.network.data' should work in a
generic way, such that they also work on general-purpose networks not
coming from the network library. To enable this, the case of empty
networks (with respect to the number of edges and/or vertices) is now
handled properly.

In detail, two special cases need to be handled:
- empty edge lists created from edgeless networks need to be filtered
out and
- if all edge lists end up empty, an appropriate empty edge list needs
to be created (instead of NULL).
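
A sketch of these two special cases (simplified, not the actual code;
see also the note below on vertex data):

    ## drop edge lists coming from edgeless networks
    edge.data = Filter(function(edges) nrow(edges) > 0, edge.data)
    ## if nothing is left, fall back to an empty edge list instead of NULL
    if (length(edge.data) == 0) {
        edge.data = list(data.frame(from = character(0), to = character(0)))
    }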

Passing over empty vertex data (which results in 'NULL' in function
'merge.network.data') does not affect the correct handling of data right
away.

Appropriate test cases are added in the file 'tests/test-networks.R'.

This fixes #142.

Signed-off-by: Claus Hunsen <[email protected]>
Rename the methods in the ProjectData class that are used for retrieving
base data for network construction as follows to obtain more generic
method names:
- 'get.artifact2author' → 'group.authors.by.data.column',
- 'get.author2artifact' → 'group.artifact.by.data.column', and
- 'group.data.source.by.column' → 'group.data.by.column'.
The functionality of the method 'get.artifact2artifact' is incorporated
into the method 'group.artifact.by.data.column'.

These names are a result of the discussions in PR #140 and offline
discussions with @hechtlC and @bockthom. Essentially, the changes are a
follow-up to commit 23a8aa3. Now, the
names express more directly the basic idea behind the methods w.r.t. the
function 'get.key.to.value.from.df'. However, for convenience, delegation
methods with the old names are kept for the first two methods, although
they emit deprecation warnings.
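
A hedged sketch of such a delegation method (illustrative only; the
class mechanics and the exact warning are omitted):

    ## old name kept for compatibility, delegating to the new method
    get.author2artifact = function(...) {
        .Deprecated("group.artifact.by.data.column")
        return(group.artifact.by.data.column(...))
    }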

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
When splitting and subsetting data with the function
'get.key.to.value.from.df', it may happen that the given values for the
parameters 'key' and 'value' are not valid column names of the
'base.data'. Whenever an invalid specification is given, an error message
(including a stack trace) is emitted and the script is stopped.

Handling this in this function automatically enables the analogous
handling in the method 'ProjectData$group.data.by.column'.
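
A sketch of such a validation (illustrative; the actual error message
and stack-trace handling may differ):

    ## inside 'get.key.to.value.from.df' (simplified): validate columns first
    if (!all(c(key, value) %in% colnames(base.data))) {
        stop(paste("Invalid column specification:",
                   paste(c(key, value), collapse = ", ")))
    }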

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Due to an unknown incompatibility with newer versions of igraph (newer
than version 1.0.1), the tests on network merging, introduced in commit
26e3bef, fail with the following error:

> At iterators.c:759 : Cannot create iterator, invalid vertex id,
>   Invalid vertex id
>
> 1: expect_output(str(simplify.network(g)), "IGRAPH", fixed = TRUE) at
>    ./tests/test-networks.R:70
> [...]

Obviously, the printing of igraph objects behaves differently in
newer/older versions of igraph.

To fix the problem, we take advantage of the fact that testthat's
function 'expect_error' can also be used to expect that an error does
*not* occur.
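
In testthat, expecting the absence of an error looks roughly like this
(adapted from the failing expectation quoted above):

    ## passes only if the expression runs without throwing an error
    expect_error(str(simplify.network(g)), regexp = NA)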

Signed-off-by: Claus Hunsen <[email protected]>
Fix minor documentation issues caused by copy-paste errors (util-data.R).
Adjust the tests on network simplification and merging to be more
specific (tests/test-networks.R).
Remove obsolete statement (util-core-peripheral.R).

Thanks to @bockthom for these remarks.

Signed-off-by: Claus Hunsen <[email protected]>
Many little improvements and fixes

Reviewed-by: Thomas Bock <[email protected]>
Reviewed-by: Klara Schlüter <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
@bockthom bockthom added this to the v3.4 milestone Oct 22, 2018
@bockthom
Collaborator Author

As we reviewed everything before, I will merge right away when the build checks finish successfully.

@bockthom bockthom merged commit 83513c7 into master Oct 23, 2018