v3.4 #144

Merged
merged 51 commits into from Oct 23, 2018

Conversation

bockthom
Collaborator

3.4

Changes in detail

v3.3...d38807b

Added

Changed/Improved

Fixed

Klara and others added 30 commits August 17, 2018 18:49
Contains only the basic computation; tests are not adapted yet. The function
add.vertex.attribute.first.activity can now handle multiple data sources.
The result is stored as a list containing all minimums, named with the
corresponding data source.

Signed-off-by: Klara Schlueter <[email protected]>
Concerns the method add.vertex.attribute.first.activity. If no information is
found for some person and data source, then the corresponding element of the
list added as vertex attribute is set to NA (instead of not being created).

Signed-off-by: Klara Schlueter <[email protected]>
A flag of the function add.vertex.attribute.first.activity now indicates
whether the first activity should be computed and stored per single data
source or "globally" over all given data sources.
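
A minimal usage sketch (only the function name is taken from this change;
all argument names and values are assumptions and may not match the real
signature):

    ## hypothetical call: one first activity per person and data source
    networks.with.attr = add.vertex.attribute.first.activity(
        list.of.networks, project.data,
        activity.types = c("commits", "mails", "issues"),
        combine.activity.types = FALSE)  # TRUE: one global value over all sources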

Signed-off-by: Klara Schlueter <[email protected]>
Improve naming and add helper methods.

Signed-off-by: Klara Schlueter <[email protected]>
One of three cases is tested: given multiple data sources, compute the first
activity per person and data source. Tests for the other two cases will be
added in upcoming commits.

Signed-off-by: Klara Schlueter <[email protected]>
Tests one of three cases: compute the first activity per person
over all given data sources (one value per person). Tests for the other two
cases are added in commit 9224150 and
one future commit.

Signed-off-by: Klara Schlueter <[email protected]>
Tests one of three cases: given one data source, compute the first activity per
person per data source. Tests for the other two cases were added in commit
9224150 and commit
d7ebb79.

Signed-off-by: Klara Schlueter <[email protected]>
Rename function, rectify documentation structure, delete browser statement.

Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Signed-off-by: Klara Schlueter <[email protected]>
Improve add.vertex.attribute.first.activity

Reviewed-by: Claus Hunsen <[email protected]>
Reviewed-by: Thomas Bock <[email protected]>
When the data for a data source are empty (i.e., an empty data.frame),
we need to deal with the corresponding empty list 'minimums.per.person'
in the function 'get.first.activity.data'; otherwise, the next lines of
code will throw an error.

The easy fix is to check whether the data are empty, add NA values to
the intermediate data structure, and then skip the next lines of code.
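
A rough sketch of the shape of that fix (only 'minimums.per.person' and
'get.first.activity.data' appear in this change; the rest is assumed):

    ## inside the loop over data sources in 'get.first.activity.data' (simplified)
    if (nrow(source.data) == 0) {
        ## empty data source: record NA for every person and skip the rest
        minimums.per.person[person.names] = NA
        next
    }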

Signed-off-by: Claus Hunsen <[email protected]>
As a follow-up to 4a9ad23, the tests
for the vertex attribute 'first.activity' are changed to contain empty
issue data. This guards against potential regressions in case empty data
sources are not properly handled.

Signed-off-by: Claus Hunsen <[email protected]>
When planning the functionality for computing the vertex attribute
'first.activity', the straightforward idea was to "merge lists" and
then simply run 'min' for the global first activity if needed. When
implementing the functionality in commit 334c258, the "simple" merge
proved not to be that easy to perform.

With some effort, in this patch, we replace the current implementation
with the originally planned version.

When performing a microbenchmark on BusyBox sample data (five two-year
ranges of data and networks), comparing both versions of the code (see
column 'expr' below), the new version proves to be faster (also due to
better parallelization):
expr      min       lq     mean   median       uq      max neval
   old 83.07881 84.16226 85.79368 85.44123 87.45646 89.70994    20
   new 75.85983 78.09042 78.76604 78.71732 79.44771 83.42560    20
Additionally, the code is more concise and we no longer need extra
helper functions.
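
The underlying idea in isolation (illustrative values only, not the
actual implementation):

    ## per person, keep one first activity per data source ...
    activity.per.source = list(
        commits = as.POSIXct("2018-08-17 18:49:00", tz = "UTC"),
        mails   = as.POSIXct("2018-09-06 14:35:00", tz = "UTC"))
    ## ... and the global first activity is just the minimum over all sources
    global.first.activity = min(do.call(c, activity.per.source))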

Add namespace imports to module 'networks-covariates'.

Signed-off-by: Claus Hunsen <[email protected]>
For convenience, we move the vector 'c("Base_Feature", "File_Level")' to
be a global constant in the module 'data'.

Signed-off-by: Claus Hunsen <[email protected]>
When calling the function 'ProjectData$get.artifacts()', we currently
retrieve the list of source-code artifacts only. With this patch, we can
give the desired data source as a parameter to the function.

To implement this generic functionality, we need to add two global map
constants: 'DATASOURCE.TO.ARTIFACT.FUNCTION' and
'DATASOURCE.TO.ARTIFACT.COLUMN'. The first maps the data sources to the
specific function that is needed to get the correct (filtered) artifact
data. The latter maps the data sources to the specific column in the
data where the artifacts are stored.
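
The column map could look roughly like this (the constant name is taken
from this commit, the concrete entries are assumptions):

    ## hypothetical contents; the real mapping may differ
    DATASOURCE.TO.ARTIFACT.COLUMN = list(
        "commits" = "artifact",
        "mails"   = "thread",
        "issues"  = "issue.id")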

This works towards issue #97.

Additionally, adjust some TODO items.

Signed-off-by: Claus Hunsen <[email protected]>
As a follow-up to 0d184b8, the
data-mapping methods of ProjectData are adapted as follows.

The following methods are incorporated into 'get.author2artifact' and
'get.artifact2author', respectively, where the data source needs to be
passed as a parameter:
- get.thread2author,
- get.author2thread,
- get.issue2author, and
- get.author2issue.
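
For example, a call would then roughly look like this (the argument name
'data.source' is an assumption):

    ## former 'get.author2issue'
    author.to.issue = project.data$get.author2artifact(data.source = "issues")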

The following method is incorporated into 'get.artifact2artifact' which
retrieves data for the bipartite relations (the data source needs to be
passed as a parameter):
- get.commit2artifact.

The following methods are removed:
- get.author2commit,
- get.author2mail,
- get.author2file, and
- get.commit2file.

Affected methods and functions as well as the showcase file are adapted
appropriately. The test suite is fixed due to a re-order of vertices in
one test case.

This fixes #97.

Signed-off-by: Claus Hunsen <[email protected]>
Currently, we have the possibility to split data in a time-based manner
by specifying a time period or specific bins. However, we do not have
the possibility to specify the number of windows. With this commit, we
add this functionality.

To implement this functionality, the functions 'split.data.time.based'
and 'split.get.bins.time.based' both get a new parameter
'number.windows' and the function 'generate.date.sequence' a parameter
'length.out'.
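
A hedged usage sketch (the argument name 'number.windows' is from this
commit; everything else is assumed):

    ## split the project data into exactly four time windows of equal length
    ranges = split.data.time.based(project.data, number.windows = 4)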

Additionally, adjust function documentation appropriately.

This fixes #49.

Signed-off-by: Claus Hunsen <[email protected]>
When selecting the right data for a call-graph-based artifact network,
it is additionally necessary to check whether we have RangeData objects
at hand; otherwise, we get non-expressive error messages.

Signed-off-by: Claus Hunsen <[email protected]>
With the new function 'delete.authors.without.bipartite.edges', the user
is able to delete author vertices from a network that do not have
incident bipartite edges.
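
For illustration (a sketch, assuming the function takes the network as
its only argument):

    ## drop author vertices without any incident bipartite edge
    network = delete.authors.without.bipartite.edges(network)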

This fixes #76.

Signed-off-by: Claus Hunsen <[email protected]>
Different from issue templates, pull-request templates in the folder
'.github/PULL_REQUEST_TEMPLATE/' only work with the URL parameter
'template' [1], i.e., there is no default template that is filled in
automatically. Therefore, the template is moved to a place where it is
recognized properly.

[1]
https://help.github.com/articles/creating-a-pull-request-template-for-your-repository/

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
The 'sapply' in the function 'split.networks.time.based' does not work if
the provided list of networks contains only one element: In that case, it
returns a matrix containing one column instead of returning a list.
However, this leads to problems when passing this matrix to the function
'get.date.from.unix.timestamp'.
When using 'lapply' instead, we always get a list, which fixes the
problem.
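
The underlying R behaviour, shown in isolation:

    x = list(net1 = c(1534500000, 1534600000))  # a list with a single element
    sapply(x, range)  # simplifies the result to a one-column matrix
    lapply(x, range)  # always returns a list, no matter how many elements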

In addition, the corresponding test in the test suite is enhanced by an
additional statement which just runs the function with a list containing
one element (to keep track of that for future changes).

Signed-off-by: Thomas Bock <[email protected]>
Extend the documentation in the source code for the function
'get.first.activity.data' by including a running example.
Additionally, improve the identification of the minimal first activity
across data sources (i.e., use 'do.call(c, ...)' instead of 'unlist').
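
The reason for preferring 'do.call(c, ...)' over 'unlist' here:

    dates = list(as.POSIXct("2018-08-17 18:49:00", tz = "UTC"),
                 as.POSIXct("2018-09-06 14:35:00", tz = "UTC"))
    unlist(dates)      # drops the POSIXct class, yielding bare numeric values
    do.call(c, dates)  # keeps the POSIXct class, so 'min' returns a proper date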

Props to @bockthom for pointing this out.

Signed-off-by: Claus Hunsen <[email protected]>
To improve comprehensibility of the function 'merge.network.data', the
code formatting is improved by re-arranging blank lines and comments.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
clhunsen and others added 21 commits September 6, 2018 14:35
The function 'save.and.load' should also be able to work without
assignment of its return value. This way, the real potential of the
parameter 'variable' is unleashed. Until now, the parameter was needed
to access the given variable from the environment stored in a dumped
file.

With the solution implemented in this patch, the stored variable is
assigned to the calling environment under the given name (independently
of the data's existence). This way, the assignment of the return value
is not necessary anymore. However, for compatibility, it is still
possible to assign the return value to another variable name when
calling the function.
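
A hedged sketch of the assignment mechanism described above (not the
actual implementation; 'data' is a placeholder for the loaded object):

    ## inside 'save.and.load' (simplified): after loading or computing 'data',
    ## bind it to the given name in the caller's environment
    assign(variable, data, envir = parent.frame())
    return(invisible(data))  # assigning the return value stays optional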

Additionally, the documentation of the function is improved by giving
more details on the internals and a (hopefully) clearer wording.

Thanks to @hechtlC for pointing this out, although unintentionally.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Fix minor documentation issues that have been pointed out by @bockthom
in his review on PR #140.

Additionally, rename function 'delete.authors.without.bipartite.edges'
to 'delete.authors.without.specific.edges' and allow both types of edges
via an argument.

Signed-off-by: Claus Hunsen <[email protected]>
Currently, we have the possibility to split data in a time-based manner
by specifying a time period, specific bins, or the number of resulting
time windows. However, we do not have the latter possibility for network
splitting. With this commit, we add this functionality.

To implement this functionality, the functions
'split.network.time.based' and 'split.networks.time.based' both get a
new parameter to handle this functionality. Additionally, streamline
other 'split.*' functions for uniformity regarding the 'number.windows'
and 'sliding.window' parameters.
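
Analogously to the data splitting, a usage sketch (the argument name
'number.windows' is from this commit, the rest is assumed):

    ## split a network into exactly four time windows of equal length
    networks.per.range = split.network.time.based(network, number.windows = 4)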

Finally, adjust function documentation appropriately.

This is a follow-up for commit 40974ba.
Hopefully, this really fixes #49.
Thanks to @bockthom for pointing this out in his review on PR #140.

Signed-off-by: Claus Hunsen <[email protected]>
When giving a parameter 'length.out' to the function
'generate.date.sequence', the resulting time period may get a fractional
number. This may result in additional dates being generated, i.e., the
length of the date sequence is not 'length.out' anymore, with the
difference between the last two dates likely being only one second.

To fix this, the calculated time period is rounded up. This will
decrease the length of the last time period between dates, but, at most,
only by 'length.out' seconds.
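
The effect of the fix, in numbers (illustrative only): a range of 10
seconds split into 3 periods yields a fractional period of 3.33 seconds;
rounding it up gives

    ceiling(10 / 3)  # = 4 seconds per period

so every period but the last becomes slightly longer and the last one
correspondingly shorter.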

To introduce a test case for this, the function
'split.get.bins.time.based' is fixed to set the levels of the bin factor
in the correct order.

Signed-off-by: Claus Hunsen <[email protected]>
For some reason, when using the parameter 'directed = TRUE', the function
'igraph::eigen_centrality' can run into ARPACK errors (in some rare cases).

For example, for a small subnet of an author network of BusyBox having
7 nodes and 12 edges, we get the following error:

Error in base::.Call(.NAME, ...) :
  At arpack.c:1174 : ARPACK error, Maximum number of iterations reached
In addition: Warning message:
In .Call("R_igraph_eigen_adjacency", graph, algorithm, which, options, :
  At arpack.c:776 :ARPACK solver failed to converge (1001 iterations,
0/1 eigenvectors converged)

Those errors were already discussed in the igraph project on GitHub, e.g.:
igraph/igraph#512
According to the conversation there, to avoid those errors, the number of
iterations could be increased from its default value of 1000 to a higher
number. Also, the ncv value could be increased. However, in the above
example, this still runs into the described error.

Hence, we need a more drastic action: Increase the accuracy tolerance
from 0 to 0.1. This leads to less precise results but avoids the error
(at least in the example stated above).

So, in general, we still compute 'eigen_centrality' with its default
ARPACK options. If this leads to an error, we compute 'eigen_centrality'
again using the adjusted ARPACK options (maxiter = 5000 and tol = 0.1).
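
A hedged sketch of that fallback (the values 'maxiter = 5000' and
'tol = 0.1' are from this commit; the surrounding code is illustrative
and assumes an igraph version where 'arpack_defaults' is a plain list):

    ## use relaxed ARPACK options only if the default options fail
    arpack.options = igraph::arpack_defaults
    arpack.options[["maxiter"]] = 5000
    arpack.options[["tol"]] = 0.1
    centrality = tryCatch(
        igraph::eigen_centrality(network, directed = TRUE)$vector,
        error = function(e) {
            igraph::eigen_centrality(network, directed = TRUE,
                                      options = arpack.options)$vector
        })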

Signed-off-by: Thomas Bock <[email protected]>
Thanks to @bockthom for spotting the wrong default value of the argument
'specific.edge.types'.

Signed-off-by: Claus Hunsen <[email protected]>
By adding the method 'group.data.source.by.column' to the class
'ProjectData', we provide alternatives to some of the methods that have
been removed in commit 23a8aa3. By
providing the data source and the column in the data source to group by,
the following methods can be replaced with appropriate calls to the new
method:
    - get.author2commit and get.author2file
      (data.source = "commits", column = "author.name"), and
    - get.author2mail
      (data.source = "mails", column = "author.name").
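
For example (a sketch, assuming a ProjectData instance named
'project.data'):

    ## former 'get.author2commit' / 'get.author2file'
    author.to.commit = project.data$group.data.source.by.column(
        data.source = "commits", column = "author.name")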

Signed-off-by: Claus Hunsen <[email protected]>
Thanks to @bockthom for pointing out that the inline documentation has
not been adapted in commit 4e211f0.

Signed-off-by: Claus Hunsen <[email protected]>
Due to an error introduced in 94cc87b,
the sliding-window parameter has not been handled correctly when
splitting networks and data. To fix the problem, the code that disables
sliding windows when splitting bins are set is moved to the correct
position.

Additionally, the code in the function 'split.network.time.based' is
adapted to conform with the other functions regarding the if-condition
on specified bins.

Props to @bockthom for spotting this.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
To conform to the other splitting functions, the function
'split.networks.time.based' now gets a parameter 'bins'. The parameter
is handled in the same way as in the other functions.

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
Although the patch in c5413c2 is
basically alright, the added code fragments are missing package
prefixes. When running into the error case caught by the additional code,
the user may run into further errors.
Signed-off-by: Claus Hunsen <[email protected]>
The functions 'merge.networks' and 'merge.network.data' should work in a
generic way, such that they also work on general-purpose networks not
coming from the network library. To enable this, the case of empty
networks (with respect to the number of edges and/or vertices) is now
handled properly.

In detail, two special cases need to be handled:
- empty edge lists created from edgeless networks need to be filtered
out and
- if all edge lists end up empty, an appropriate empty edge list needs
to be created (instead of NULL).
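
A sketch of these two special cases (simplified, not the actual code;
see also the note below on vertex data):

    ## drop edge lists coming from edgeless networks
    edge.data = Filter(function(edges) nrow(edges) > 0, edge.data)
    ## if nothing is left, fall back to an empty edge list instead of NULL
    if (length(edge.data) == 0) {
        edge.data = list(data.frame(from = character(0), to = character(0)))
    }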

Passing over empty vertex data (which results in 'NULL' in function
'merge.network.data') does not affect the correct handling of data right
away.

Appropriate test cases are added in the file 'tests/test-networks.R'.

This fixes #142.

Signed-off-by: Claus Hunsen <[email protected]>
Rename the methods in the ProjectData class that are used for retrieving
base data for network construction as follows to obtain more generic
method names:
- 'get.artifact2author' → 'group.authors.by.data.column',
- 'get.author2artifact' → 'group.artifact.by.data.column', and
- 'group.data.source.by.column' → 'group.data.by.column'.
The functionality of the method 'get.artifact2artifact' is incorporated
into the method 'group.artifact.by.data.column'.

These names are a result of the discussions in PR #140 and offline
discussions with @hechtlC and @bockthom. Essentially, the changes are a
follow-up to commit 23a8aa3. Now, the
names express more directly the basic idea behind the methods w.r.t. the
function 'get.key.to.value.from.df'. However, for convenience, delegation
methods with the old names are kept for the first two methods, although
they emit deprecation warnings.
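
A hedged sketch of such a delegation method (illustrative only; the
class mechanics and the exact warning are omitted):

    ## old name kept for compatibility, delegating to the new method
    get.author2artifact = function(...) {
        .Deprecated("group.artifact.by.data.column")
        return(group.artifact.by.data.column(...))
    }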

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
When splitting and subsetting data with the function
'get.key.to.value.from.df', it may happen that the given values for the
parameters 'key' and 'value' are not valid column names of the
'base.data'. Whenever an invalid specification is given, an error message
(including a stack trace) is emitted and the script is stopped.

Handling this in this function automatically enables the analogous
handling in the method 'ProjectData$group.data.by.column'.
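
A sketch of such a validation (illustrative; the actual error message
and stack-trace handling may differ):

    ## inside 'get.key.to.value.from.df' (simplified): validate columns first
    if (!all(c(key, value) %in% colnames(base.data))) {
        stop(paste("Invalid column specification:",
                   paste(c(key, value), collapse = ", ")))
    }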

Signed-off-by: Claus Hunsen <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
Signed-off-by: Claus Hunsen <[email protected]>
Due to an unknown incompatibility with newer versions of igraph (newer
than version 1.0.1), the tests on network merging, introduced in commit
26e3bef, fail with the following error:

> At iterators.c:759 : Cannot create iterator, invalid vertex id,
>   Invalid vertex id
>
> 1: expect_output(str(simplify.network(g)), "IGRAPH", fixed = TRUE) at
>    ./tests/test-networks.R:70
> [...]

Obviously, the printing of igraph objects behaves differently in
newer/older versions of igraph.

To fix the problem, we take advantage of the fact that testthat's
function 'expect_error' can also be used to expect that an error does
*not* occur.
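
In testthat, expecting the absence of an error looks roughly like this
(adapted from the failing expectation quoted above):

    ## passes only if the expression runs without throwing an error
    expect_error(str(simplify.network(g)), regexp = NA)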

Signed-off-by: Claus Hunsen <[email protected]>
Fix minor documentation issues caused by copy-paste errors (util-data.R).
Adjust the tests on network simplification and merging to be more
specific (tests/test-networks.R).
Remove obsolete statement (util-core-peripheral.R).

Thanks to @bockthom for these remarks.

Signed-off-by: Claus Hunsen <[email protected]>
Many little improvements and fixes

Reviewed-by: Thomas Bock <[email protected]>
Reviewed-by: Klara Schlüter <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
@bockthom bockthom added this to the v3.4 milestone Oct 22, 2018
@bockthom
Collaborator Author

As we reviewed everything before, I will merge right away when the build checks finish successfully.

@bockthom bockthom merged commit 83513c7 into master Oct 23, 2018