Fix sliding-window creation in various situations & add tests for sliding-window functionality #184

bockthom · 2020-11-11T14:30:49Z

Description

As already discussed in #182, when calling the function split.networks.time.based (which splits several networks at the same time), the sliding.window parameter was useless: This function, in turn, calls split.network.time.based for each network, specifying fixed bins and passing the 'sliding.window' parameter to it. However, if bins are given to this function, the sliding.window parameter is ignored.
While dealing with finding an appropriate solution for that problem, it turned out that the implementation of creating sliding windows also in all the other cases lead to inconsistent behavior, especially targeting the last and second last ranges created during splitting while using sliding.window = TRUE. Hence, this led to a lot of debugging work and implementing new fixes which are described below:

Fix sliding-window creation in various situations

For that reason, I completely reworked all the splitting functions with respect to their sliding.window creation approach to use the function create.overlapping.ranges everywhere (where applicable) and result in the same, reliable way of creating sliding-window ranges. I also included proper handling of the last range: In many cases, the last (incomplete) "original" range should be ignored when the last "sliding-window" ranges already covers this last "original" range completely. For brevity reasons, I won't provide further details in the PR description. However, I have split up all my changes into small commits and added detailed explanations and also detailed examples in the commit messages. So, for more details, please have a look at the respective commit messages 😉 (1abc1b8, c34c42a, 097cebc, 9a1b651, 0fc179e, cad28bf)

Add tests for sliding-window functionality (as we did not have tests for that before)

Another major contribution of this pull request is an extension of our test suite: Previously, there was no test covering sliding windows at all - not anywhere near touching sliding windows at all. As I completely reworked many parts of the splitting module with respect to sliding windows, I also added lots of tests to be sure my changes to the splitting functionality work the way I want them to work. I have added a new test module test-split-sliding-window.R (to not even extend the lengthy test-split.R). Most of the tests in there are copied from test-split.R and adjusted to use sliding.window = TRUE. As this is not enough to test as many situations where sliding windows can cause trouble as possible, I also adjusted some other splitting-related tests in the other test modules. And here another innovation comes in: As these splitting-related tests often cope without specifying a concrete expected result by hand, we can make use of parameterized tests (i.e., running the same test multiple times without any changes except for changing fixed parameters). Using parameterized tests, we can test several functions with sliding.window = TRUE and sliding.window = FALSE without copying the test code, which is a great innovation to our test suite (at least, from my point of view). Nevertheless, our standard test library testthat does not provide the possibility to run parameterized tests. Therefore, we now use the package patrick for parameterized tests, which itself builds on top of testthat. (a3ad0a8, 2ed84ac)

Further fixes (e.g., in data-cutting functionality)

Last but not least, this Pull Request also contains additional bug fixes, not directly related to the sliding-window problem: The data cutting (performed via get.data.cut.to.same.date) actually behaves wrong with respect to the end date where to cut at, as it did not consider that end dates of ranges are exclusive, but the end date of the data cutting functionality was meant to be inclusive. Further details can also be seen in the example which is contained in the commit message of the corresponding commit (f0744c0).

[One important notice on this PR: This PR also includes the commits of PR #183 to be able to run travis builds in a reasonable way. Please ignore all commits related to Travis CI configuration R versions, as those will disappear from this PR as soon #183 will be merged. If you would like to discuss on them, please use the discussion of #183. This PR must not be merged until PR #183 is merged!]

I hope you all are happy to get all those (or more precisely: many of those) issues and bugs fixed via this PR 🎈
I look forward to your reviews 😄

Changelog

Is still missing yet. Will be added here and also to the NEWS.md as soon as most of the parts of this PR are approved by someone else 😉

Prerequisites

I adhere to the coding conventions (described here) in my code.
I have updated the copyright headers of the files I have modified.
I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
I have put signed-off tags in all commits.
I have updated the changelog file NEWS.md appropriately.
The pull request is opened against the branch dev.

When computing overlapping ranges, sometimes we ended up in having one more range in the end than needed. This led to the having ranges covering almost the same time span. Consider the following example: When constructing overlapping ranges of three month ranges which overlap fifty-fifty, we might end up in the following scenario: 2020-06-27 ——— 2020-09-26 — 2020-11-11 (2020-12-25) 2020-08-12 ————————— 2020-11-11 There is one range from 2020-06-27 to 2020-09-26, one range from 2020-08-12 to 2020-11-11, and one range from 2020-09-26 to 2020-12-25. However, when assuming that the last action in the data takes place on 2020-11-11, then we would have two ranges covering the last range: one complete ranges and half a range: 2020-08-12 to 2020-11-11 and 2020-09-26 to 2020-11-11. As can be clearly seen (and also above in the visual example), the range 2020-09-26 to 2020-11-11 is superfluous as it does not even represent a full range and it fully overlapps with the previous range. Hence we don't need this half range and can just end up with the following ranges: 2020-06-27 ——— 2020-09-26 2020-08-12 ————————— 2020-11-11 To achieve this, determine the number of complete ranges by rounding downwards instead of rounding to the nearest integer. However, if the complete time span does not even include one complete range, set the number of complete ranges to 1 because multiplying with 0 further down in the function will end up having no range at all (even not an incomplete one). However, the implementation of this corner case works correctly as there already exist some extra checks for this special case. Signed-off-by: Thomas Bock <[email protected]>

The function `get.data.cut.to.same.date` cuts the data of multiple data sources to a common date range for all the specified data sources. However, in the previous implementation, there was an off-by-one error, caused by excluding end dates in data splitting: If a range is specified, the end date of the range is exclusive (while the start of the range is inclusive). As the data cutting functionality uses the split functionality, the specified end date was excluded whereas it should be included, as the following example shows: Assume there are two data sources d1 and d2. Let d1 contain the time stamps t1 and t4, and let d2 contain the time stamps t2 and t3, with t1 < t2 < t3 < t4. Then a visualization of this example on a time line could be as follows: d1: | | d2: | | t1 t2 t3 t4 According to the proceedure of data cutting, we look for a common data range. It can be easily seen that the interval [t2,t3] completely covers data from both data sources d1 and d2. Hence, all the data needs to be cut to this interval [t2,t3]. However, what the former implementation did was the following: t2 and t3 are determined as start and end of the range the data has to be cut to. As the cutting functionality uses the splitting functionality, t2 is used as start of the range and t3 as used as the end of the range. In here the error comes in: As the end of ranges is exclusive, t3 is not included in the cut data, resulting in interval [t2, t3). To include t3 into the cut data, we need to use t3+1 as the end of the range to cut, as this leads to [t2,t3] then. In this commit, this off-by-one error at the end of the data cutting range is fixed by adding 1 second to the end. In addition, all the corresponding tests in the test suite (i.e., data cutting and network cutting) are adjusted to incorporate the end of the cutting range. Signed-off-by: Thomas Bock <[email protected]>

@clhunsen

When calling the function `split.networks.time.based` (which splits several networks at the same time), the `sliding.window` parameter was useless: This function, in turn, calls `split.network.time.based` for each network, specifying fixed bins and passing the `sliding.window` parameter to it. However, if bins are given to this function, the `sliding.window` parameter is ignored. To fix this, create the overlapping ranges need to get sliding windows and call `split.network.time.based.by.ranges` to get the networks split using sliding windows. With this fix, it is possible to use sliding windows when splitting multiple networks simultaneously. Props to @clhunsen for suggesting to use `construct.overlapping.ranges` and `split.network.time.based.by.ranges` for this fix. Signed-off-by: Thomas Bock <[email protected]> Signed-off-by: Claus Hunsen <[email protected]>

Create the sliding windows in `split.network.time.based` exactly in the same way as in `split.networks.time.based`. Signed-off-by: Thomas Bock <[email protected]>

Signed-off-by: Thomas Bock <[email protected]>

For some of the functions in the split module, some default values of parameters have been missing in the function documentation. This commit fixes that. Signed-off-by: Thomas Bock <[email protected]>

Create the sliding windows in `split.data.time.based` exactly in the same way as in `split.network.time.based` and `split.networks.time.based`. In addition, also stream line the code in `split.data.time.based` (e.g., don't execute steps that are only needed when no sliding windows are used). Signed-off-by: Thomas Bock <[email protected]>

Create the sliding windows in `split.data.activity.based` in a similar way as in `split.data.time.based`. In particular, fix some bugs in the sliding-window creation for activity-based data splitting and handle some special corner cases with respect to the last range. Signed-off-by: Thomas Bock <[email protected]>

Create the sliding windows in `split.network.activity.based` in a similar way as in `split.data.activity.based`. In particular, fix some bugs in the sliding-window creation for activity-based network splitting and handle some special corner cases with respect to the last range. Signed-off-by: Thomas Bock <[email protected]>

Add a completely new test module 'test-split-sliding-window.R', which is based on the exisiting test module 'test-split.R'. The new test module contains almost the same test setups as the exisiting test module, but it tests all the splitting functions only using sliding windows. In addition, some new test scenarios are added to test certain corner cases with respect to the last range of sliding-window splitting. In particular, the following functions are tested with sliding windows: - split.data.time.based - split.data.activity.based - split.network.time.based - split.network.activity.based Signed-off-by: Thomas Bock <[email protected]>

In this commit, additional tests for sliding-window splitting are added. For example, tests for the function 'split.networks.time.based'. As these tests don't need their expected outcomes to be determined manually, but rely on the outcomes of other, already tested functions, it is not necessary to completely copy existing tests and have them duplicated for different values of `sliding.window`. Hence, this is a good example for having parameterized tests: Parameterized tests should succeed independently of which value certain parameters take. Hence, we can just implement the test once and run it twice, for example, the first run with `sliding.window = TRUE` and the second run with `spliding.window = FALSE`. However, es our standard test library 'testthat' does not provide the possiblity to run parameterized tests, we need to add an additiona library to be able to run parameterized tests: the library 'patrick'. For more details on 'patrick', see: https://github.com/google/patrick Signed-off-by: Thomas Bock <[email protected]>

For unknown reasons, setting the layout for ggraph, which is used for plotting networks, has changed. To avoid errors due to a missing or wrongly set layout, the corresponding implementation in function 'plot.get.plot.for.network' is adjusted in to ways: 1) When creating a ggraph object, we already need to explicitly specify the layout we want to use. As we rely on igraph and its layouts, we use 'layout = "igraph"' and then we have to specify the concrete layout algorithm from igraph (as 'algorithm' paramater). 2) Opposed to previous layout specification, now a layout name has to be stated instead of naming a layout function. Unfortunately, there is no official list of supported igraph layouts. Hence, we need to rely on the following table (which comes from python and not from R): https://igraph.org/python/doc/tutorial/tutorial.html#layout-algorithms Notice that the layout algorithm needs to be given as string in form of the 'short name' given in this table. However, if there are multiple short names specified in the table, only one of them might actually work for ggraph. Now as before, the default layout is 'kk' (kamada kawai). If one wants to use another igraph layout, just set the graph attribute 'layout' to the short name of the layout and then this one will be used for creating the ggraph object. This commit fixes issue se-sic#186. Signed-off-by: Thomas Bock <[email protected]>

hechtlC

First of all thank you @bockthom for the nice fixes and additions. I have only found a couple of documentation bugs. The functionality looks good to me. If @clhunsen approves, we can merge this once you fixed the documentation bugs.

One more comment: You forgot the copyright header in the file 'tests.R'.

tests/test-split-sliding-window.R

util-plot.R

util-split.R

Signed-off-by: Thomas Bock <[email protected]>

In the function 'convert.adjacency.matrix.list.to.array', there was a typo in a variable name which led to an error when calling the function. Signed-off-by: Thomas Bock <[email protected]>

clhunsen

Thank you very much for these very valuable patches, @bockthom! I am so grateful that there are sliding-window tests now... I remember writing the "normal" splitting tests – and it stressed me out due to the amount of test code. 😅 So, let's party! 🥳 Sorry for taking so long to review this critical PR.

The changes are nice and consistent, I only have one potential problem (marked with 🐛 in the corresponding comment): There is a boolean-ization using == where it should be %in% if I understand it completely. This needs to be fixed before merging. I do not need to approve a potential change as you always do a good job, @bockthom!

I also added some further comments on things that I noticed, but no issue is critical enough to block this PR to be merged. These comments should be tackled in the future, nevertheless, I think.

util-split.R

clhunsen · 2020-12-01T07:42:00Z

util-split.R

+            ## adjust the end date of the last sliding-window range
+            name.last.sliding.window = construct.ranges(c(bins[length(bins) - 3], get.date.string(end.event.date)))


Can you please add an example to your comment on how the name is changed? Something like "2020-01-01-2020-03-01 -> 2020-01-01-2020-02-29"... Anyway, right now, the process here is rather too abstract for me to understand, sorry... 😅

Can you document in a comment that the precondition of, at least, three ranges (and four entries in bins) is always met due to the very idea that we need, at least, two "normal" ranges (and, hence, one sliding-window range) if sliding.window is enabled. I really needed to think about this why - 3 does not throws errors sometimes.
Maybe, you can integrate this into the comment I suggested above. 🙂

Sure, more documentation is needed here. I'm right there with you. I am just dealing myself with understanding what's going on there when looking at this code snippet...

Ok, I now understand again what's going on there. I won't add an example, as the example would also be confusing without stating way more example data there. However, I have added the following comments which clearly state what's going on there:

## adjust the end date of the last sliding-window range, as it might be shorter than it should be: ## The end of the last range usually is one second after the last event (as end dates are exclusive). ## In case of sliding windows, the end of the last sliding range needs to be extended to the date of the ## next event after that range (as end dates are exclusive) to get a full range as for all the previous ## ranges which end at the beginning of the next range, which is the date of the first event after the ## actual range. ## When we have sliding windows, there are, at least, three ranges (two regular ranges and one ## sliding-window range. Hence, there are always more than three elements in the bins vector, so accessing ## bins[length(bins) - 3] cannot throw errors in this case.

util-split.R

util-plot.R

clhunsen · 2020-12-01T07:57:33Z

util-split.R

+
+        ## determine end bin of last sliding-window range
+        end.event.id = items.unique[(items.unique.count - offset.end + 1)]
+        end.event.logical = (data[[activity.type]][[ id.column[[activity.type]] ]] == end.event.id)


🐛 Shouldn't it be %in% here instead of ==? 🤔 As items.unique is potentially a shorter vector than data (see Line 296 above), you may want to check for existence and not want to achieve a pair-wise comparison. Or am I missing something?

That is a good point. You are right that there could be something going wrong, but your solution does not really work. However, I have another solution.

Short explanation: end.event.logical just returns a boolean vector, stating whether a data entry's id is equal to end.event.id. Even if there are more entries which have the same id, this is not a problem - it is just important to get all the entries which belong to that id. In the next step, we get the end.event.date based on the logical. Hence, if there are multiple TRUE entries in end.event.logical, we would get multiple rows and therefore multiple dates. However, as they belong to the same id, the dates are the same. So, the only fix to add here is a unique call to make sure that there is only one element in end.event.date.

end.event.logical = (data[[activity.type]][[ id.column[[activity.type]] ]] == end.event.id) end.event.date = unique(data[[activity.type]][end.event.logical, ][["date"]])

So, I will just add the unique call and the problem should be solved.

Add more documentation to activity-based data splitting annd add a `unique` call to make sure that there is only one date even if there could be several identical dates in `end.event.date`. In addition, fix some minor inconsistencies. Signed-off-by: Thomas Bock <[email protected]>

This is a follow-up commit for commit d9e604e. In more recent package versions of igraph and ggraph, there is no layout "igraph". Instead, the layout algorithm has to be set directly to the "layout" parameter. This fixes se-sic#186. Signed-off-by: Claus Hunsen <[email protected]> Signed-off-by: Thomas Bock <[email protected]>

Signed-off-by: Thomas Bock <[email protected]>

hechtlC

Ok so now I think we can merge this PR. Thanks for the good work @bockthom .

Signed-off-by: Thomas Bock <[email protected]>

Add more documentation to activity-based data splitting annd add a `unique` call to make sure that there is only one date even if there could be several identical dates in `end.event.date`. In addition, fix some minor inconsistencies. Signed-off-by: Thomas Bock <[email protected]>

Signed-off-by: Thomas Bock <[email protected]>

Add more documentation to activity-based data splitting annd add a `unique` call to make sure that there is only one date even if there could be several identical dates in `end.event.date`. In addition, fix some minor inconsistencies. Signed-off-by: Thomas Bock <[email protected]>

bockthom added bug enhancement test labels Nov 11, 2020

bockthom added this to the v3.7 milestone Nov 11, 2020

bockthom requested review from clhunsen and hechtlC November 11, 2020 14:30

bockthom force-pushed the thomas-updates branch from d9e604e to 3c13259 Compare November 16, 2020 13:41

bockthom added 12 commits November 16, 2020 15:24

Adjust sliding-window creation in 'split.network.time.based'

a3a8e6d

Create the sliding windows in `split.network.time.based` exactly in the same way as in `split.networks.time.based`. Signed-off-by: Thomas Bock <[email protected]>

Fix wrong logging statements in 'split.network.by.bins'

820f850

Signed-off-by: Thomas Bock <[email protected]>

Add missing default values to function documentations in util-split.R

606963a

For some of the functions in the split module, some default values of parameters have been missing in the function documentation. This commit fixes that. Signed-off-by: Thomas Bock <[email protected]>

bockthom force-pushed the thomas-updates branch from 3c13259 to 720cc7b Compare November 16, 2020 14:25

hechtlC previously approved these changes Nov 25, 2020

View reviewed changes

bockthom added 2 commits November 29, 2020 21:27

Fix minor issues from review in PR se-sic#184

f0b3c10

Signed-off-by: Thomas Bock <[email protected]>

Update changelog file

1ba5551

Signed-off-by: Thomas Bock <[email protected]>

bockthom dismissed hechtlC’s stale review via 1ba5551 November 30, 2020 00:14

Fix typo in variable name 'adjacency.list'

63382b4

In the function 'convert.adjacency.matrix.list.to.array', there was a typo in a variable name which led to an error when calling the function. Signed-off-by: Thomas Bock <[email protected]>

clhunsen previously approved these changes Dec 1, 2020

View reviewed changes

bockthom and others added 3 commits December 1, 2020 12:37

Update changelog file again

4afe8b8

Signed-off-by: Thomas Bock <[email protected]>

bockthom dismissed clhunsen’s stale review via 4afe8b8 December 1, 2020 12:00

hechtlC approved these changes Dec 1, 2020

View reviewed changes

hechtlC merged commit 78e99a1 into se-sic:dev Dec 1, 2020

This was referenced Dec 1, 2020

Further improvements to splitting module #187

Closed

Version 3.7 #190

Merged

miriyusifli pushed a commit to miriyusifli/coronet that referenced this pull request Dec 12, 2020

Fix minor issues from review in PR se-sic#184

21b9b28

Signed-off-by: Thomas Bock <[email protected]>

miriyusifli pushed a commit to miriyusifli/coronet that referenced this pull request Dec 16, 2020

Fix minor issues from review in PR se-sic#184

fb1e53d

Signed-off-by: Thomas Bock <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix sliding-window creation in various situations & add tests for sliding-window functionality #184

Fix sliding-window creation in various situations & add tests for sliding-window functionality #184

bockthom commented Nov 11, 2020 •

edited

Loading

hechtlC left a comment

clhunsen left a comment

clhunsen Dec 1, 2020

bockthom Dec 1, 2020

bockthom Dec 1, 2020

clhunsen Dec 1, 2020

bockthom Dec 1, 2020

hechtlC left a comment

		## adjust the end date of the last sliding-window range
		name.last.sliding.window = construct.ranges(c(bins[length(bins) - 3], get.date.string(end.event.date)))

Fix sliding-window creation in various situations & add tests for sliding-window functionality #184

Fix sliding-window creation in various situations & add tests for sliding-window functionality #184

Conversation

bockthom commented Nov 11, 2020 • edited Loading

Description

Fix sliding-window creation in various situations

Add tests for sliding-window functionality (as we did not have tests for that before)

Further fixes (e.g., in data-cutting functionality)

Changelog

Prerequisites

hechtlC left a comment

Choose a reason for hiding this comment

clhunsen left a comment

Choose a reason for hiding this comment

clhunsen Dec 1, 2020

Choose a reason for hiding this comment

bockthom Dec 1, 2020

Choose a reason for hiding this comment

bockthom Dec 1, 2020

Choose a reason for hiding this comment

clhunsen Dec 1, 2020

Choose a reason for hiding this comment

bockthom Dec 1, 2020

Choose a reason for hiding this comment

hechtlC left a comment

Choose a reason for hiding this comment

bockthom commented Nov 11, 2020 •

edited

Loading