Further vertex attributes #92

clhunsen · 2018-02-14T14:26:15Z

To improve over PR #67, we should consider adding further vertex attributes:

commit.count.author.and.committer and commit.count.author.or.committer (see here, fixed in PR Complement commit count vertex attribute possibilities #127)
Related to here, we need to improve the first.activity attribute by enabling the user to pass several data sources. (fixed in PR Improve add.vertex.attribute.first.activity #135)
Related to here, we need to improve the active.ranges attribute by incorporating further data sources. Maybe, we should add a parameter activity.type as for add.vertex.attribute.first.activity.
Also count committers for add.vertex.attribute.artifact.editor.count (see here and Correctly incorporate committer data everywhere #84, see PR Add possibility of editor definition to add.vertex.attribute.artifact.editor.count #169)?
Currently, we only provide vertex attributes for commits, but we definitely should provide functions to add attributes for e-mails and issues. For some further information, see
here.
(moved into a separate issue, see Add further vertex attributes for artifact vertices #170)

The text was updated successfully, but these errors were encountered:

klaraschlueter · 2019-01-18T16:12:16Z

Related to here, we need to improve the active.ranges attribute by incorporating further data sources. Maybe, we should add a parameter activity.type as for add.vertex.attribute.first.activity.

While implementing the possibility of active ranges for multiple data sources, the following consistence problem occured: For some calls, NA is added for no active ranges, which is consistent to other vertex attributes, but inconsistent to an empty list, which is added for some other calls and is semantically the best choice.
Since the active ranges are computed from the persons occuring in the data sources, there are three possibilities:

If an author is active in a certain data source, the ranges are added as desired. There is no problem in this case.
If an author is inactive in a certain data source, but active in other considered data sources, the computation of active ranges "knows" about the person and adds an empty list as active ranges for the specified data source. Semantically, this is the right solution, since there where no active ranges for the certain person and data source.
(For Klara mailing in range1, but never commiting, the vertex attribute added to Klaras node is something like list(mails=list(range1)), commits = list() )
If an author is inactive in all specified data sources, the computation of active ranges does not know about the author and the function add.vertex.attriute adds an NA value.
(If Klara does not do anything in the considered ranges, but is a node in the network, then the added vertex attribute is NA)

This leads to inconsistency in the function: if only commits are considered, the active ranges for Klaras commits are NA. If commits and mails are considered, they are an empty list. In my opinion, there are three possibilities:

Tolerating the inconsistency: Since in the third case, the whole attribute is NA, while in the second case, the attribute contains other information, but an empty list for one data source, this might be okay, but I don't like it.
Adding a parameter to add.vertex.attribute to indicate, which value (NA or an empty list) should be added as default value in case no information is provided. This might affect several other vertex attributes, since all functions use add.vertex.attribute to add.
Replace the empty lists by NAs in the active ranges computation. This is semantically problematic, since no active ranges does not mean no information about active ranges, but consistent to all other cases.

I think I prefer the third solution. What do you think, @bockthom ? Do you have further ideas or reasons?

bockthom · 2019-01-21T08:58:24Z

Hm, that is not easy to decide. I see your point, @klaraschlueter.

In my opinion, there are three possibilities:

1. Tolerating the inconsistency: Since in the third case, the whole attribute is NA, while in the second case, the attribute contains other information, but an empty list for one data source, this might be okay, but I don't like it.

2. Adding a parameter to add.vertex.attribute to indicate, which value (NA or an empty list) should be added as default value in case no information is provided. This might affect several other vertex attributes, since all functions use add.vertex.attribute to add.

3. Replace the empty lists by NAs in the active ranges computation. This is semantically problematic, since no active ranges does not mean no information about active ranges, but consistent to all other cases.

I think I prefer the third solution. What do you think, @bockthom ? Do you have further ideas or reasons?

I don't like solution 2 -- this is way too much overhead in implementation for such a little inconsistency issue.

Solution 1 is not the best one -- I agree on that -- but it would be acceptable for me.

Regarding solution 3: In the case that we would agree on solution 3, I would do it the other way round: Replace NAs by empty lists. Reason: When I am the user, I expect a certain data format. And the expected data format is
list(data.source=list(possible values), optional.additional.data.source=list(values),...)

If there is no data available, I would either expect NA (as in solution 1) or a list of empty lists, to be consistent.

However, in the end, we should think about different interpretations: Is there a difference if the author does not exist in the data or whether (s)he is just inactive? If there would be a difference, can we somehow measure that difference?
I would answer my first question with "Yes", but my second question with "No" (if the author does not appear in any of the data sources, we cannot decide whether (s)he is just inactive or non-existent).

So, finally, I would go for the reverse version of solution 3 and always return a list of lists 😉

clhunsen · 2019-01-21T11:45:19Z

So, finally, I would go for the reverse version of solution 3 and always return a list of lists 😉

So, in summary, the default value should be used in all cases where no data is available for a developer:

when there is no activity of this developer in a single chosen data source and
when the developer is not active in any of the configured data sources.

And the default value is to be the empty list (list()) for each data source independently, as @bockthom said (and as far as I understood the details in this discussion). Right?

By the way, the default value does not get honored in the function get.first.activity.data, @klaraschlueter – NA is used instead. Can you please fix this? (Yes, I know that I have written this piece of code. 😉)

If you need exemplary code on this, I can send you something.

klaraschlueter · 2019-01-21T12:39:37Z

By the way, the default value does not get honored in the function get.first.activity.data, @klaraschlueter – NA is used instead. Can you please fix this? (Yes, I know that I have written this piece of code. 😉)

It is no problem to consider the default value in get.first.activity.data, the required changes are minimal, @clhunsen . However, in all other functions that add vertex attributes, the default value is only used for vertices where no data is available. This is done by passing it to split.and.add.vertex.attribute in the function add.vertex.attribute.first.activity, which is already implemented. Considering the default value instead of NA in get.first.activity.data means to use it in a slightly different case as well: If there is data for a vertice in at least one considered data source, but not in all considered data sources, and take.first.overall is FALSE.

If for example the default value is given as list(), authorA was active in mails on 2019-21-01 but not active in commits and authorB was never active, the resulting vertex attributes per author are:
authorA: list(mails = 2019-21-01, commits = list())
authorB: list()
The empty list, the default value, is used both for the vertex attribute for authorB and for the value of authorA's first activity in commits.

In conclusion, I can easily "fix" this, but I'm not sure, if this is, what we want.

klaraschlueter · 2019-01-21T12:56:47Z

However, in the end, we should think about different interpretations: Is there a difference if the author does not exist in the data or whether (s)he is just inactive? If there would be a difference, can we somehow measure that difference?
I would answer my first question with "Yes", but my second question with "No" (if the author does not appear in any of the data sources, we cannot decide whether (s)he is just inactive or non-existent).

It is possible to decide if an author is non-existent or inactive, if we consider not only the data sources about author activity, but also the graph to add the vertex attributes to. But again, this seems like to much overhead in implementation to me.

So, finally, I would go for the reverse version of solution 3 and always return a list of lists 😉

I am not sure, if you really think of a list of lists in every case, @bockthom ? In case there is no data for one data source for one author, it should be a simple empty list, shouldn't it? Besides, I will implement the reverse version of solution 3, as you suggested.

bockthom · 2019-01-21T16:38:00Z

So, finally, I would go for the reverse version of solution 3 and always return a list of lists wink

I am not sure, if you really think of a list of lists in every case, @bockthom ? In case there is no data for one data source for one author, it should be a simple empty list, shouldn't it? Besides, I will implement the reverse version of solution 3, as you suggested.

I don't remember how we differentiate between using one data source and multiple data sources. So what was the behavior so far?
Did we have just a single list if we only analyze one data source and a list of lists if we have specified, at least, two data sources? As far as I remember correctly, we always used a list containing lists, even if we only had one element. Is this correct?

klaraschlueter · 2019-01-25T09:50:28Z

Yes, @bockthom . The type of the vertex attribute only depends on the parameter take.first.over.all.activity.types in the first activity computation and on the equivalent parameter (to be implemented next) in the active ranges computation. If it is FALSE, the attribute is always a list containing for every specified data source the corresponding value. For first.activity, this is a date, for active.ranges, this is a list of ranges.
If take.first.over.all.activity.types is TRUE, the vertex attribute is only one value: the minimum of all first activities or the union of all active ranges. Both are not a list of lists.

bockthom · 2019-05-28T07:46:53Z

In PR #160, the following part of this issue was fixed:

Related to here, we need to improve the active.ranges attribute by incorporating further data sources. Maybe, we should add a parameter activity.type as for add.vertex.attribute.first.activity.

clhunsen · 2019-07-17T13:13:28Z

With the merge of PR #169, the last item in this issue is fixed. Closing here. 🎉

clhunsen added the enhancement label Feb 14, 2018

clhunsen added this to the Future milestone Feb 14, 2018

clhunsen changed the title ~~Further vertex attributes for artifacts~~ Further vertex attributes Feb 16, 2018

This was referenced Feb 16, 2018

Add vertex attributes (updated) #93

Merged

Add vertex attributes #67

Closed

bockthom mentioned this issue Mar 8, 2018

Improve performance for vertex attribute 'author.role' & fix default aggregation levels #102

Closed

clhunsen assigned klaraschlueter Apr 27, 2018

clhunsen added the fixed/in-progress label Jun 6, 2018

klaraschlueter mentioned this issue Jul 31, 2018

Improve add.vertex.attribute.first.activity #135

Merged

clhunsen mentioned this issue Aug 12, 2018

v3.3 #139

Merged

bockthom mentioned this issue Oct 22, 2018

v3.4 #144

Merged

klaraschlueter mentioned this issue Mar 20, 2019

Base active ranges computation on multiple data sources and adapt first activity #160

Merged

6 tasks

clhunsen mentioned this issue Jun 7, 2019

Version 3.5 #168

Merged

clhunsen modified the milestones: Future, v3.6 Jul 11, 2019

clhunsen mentioned this issue Jul 11, 2019

Add further vertex attributes for artifact vertices #170

Closed

10 tasks

clhunsen added fixed/future-release and removed fixed/in-progress labels Jul 17, 2019

clhunsen closed this as completed Jul 17, 2019

bockthom mentioned this issue Feb 21, 2020

Version 3.6 #176

Merged

bockthom mentioned this issue Dec 1, 2020

Add further vertex attributes for author vertices #188

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Further vertex attributes #92

Further vertex attributes #92

clhunsen commented Feb 14, 2018 •

edited

Loading

klaraschlueter commented Jan 18, 2019 •

edited

Loading

bockthom commented Jan 21, 2019

clhunsen commented Jan 21, 2019

klaraschlueter commented Jan 21, 2019

klaraschlueter commented Jan 21, 2019

bockthom commented Jan 21, 2019

klaraschlueter commented Jan 25, 2019

bockthom commented May 28, 2019

clhunsen commented Jul 17, 2019

Further vertex attributes #92

Further vertex attributes #92

Comments

clhunsen commented Feb 14, 2018 • edited Loading

klaraschlueter commented Jan 18, 2019 • edited Loading

bockthom commented Jan 21, 2019

clhunsen commented Jan 21, 2019

klaraschlueter commented Jan 21, 2019

klaraschlueter commented Jan 21, 2019

bockthom commented Jan 21, 2019

klaraschlueter commented Jan 25, 2019

bockthom commented May 28, 2019

clhunsen commented Jul 17, 2019

clhunsen commented Feb 14, 2018 •

edited

Loading

klaraschlueter commented Jan 18, 2019 •

edited

Loading