Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further vertex attributes #92

Closed
4 tasks done
clhunsen opened this issue Feb 14, 2018 · 9 comments
Closed
4 tasks done

Further vertex attributes #92

clhunsen opened this issue Feb 14, 2018 · 9 comments

Comments

@clhunsen
Copy link
Collaborator

clhunsen commented Feb 14, 2018

To improve over PR #67, we should consider adding further vertex attributes:

@clhunsen clhunsen added this to the Future milestone Feb 14, 2018
@clhunsen clhunsen changed the title Further vertex attributes for artifacts Further vertex attributes Feb 16, 2018
This was referenced Feb 16, 2018
@clhunsen clhunsen mentioned this issue Aug 12, 2018
@bockthom bockthom mentioned this issue Oct 22, 2018
@klaraschlueter
Copy link

klaraschlueter commented Jan 18, 2019

  • Related to here, we need to improve the active.ranges attribute by incorporating further data sources. Maybe, we should add a parameter activity.type as for add.vertex.attribute.first.activity.

While implementing the possibility of active ranges for multiple data sources, the following consistence problem occured: For some calls, NA is added for no active ranges, which is consistent to other vertex attributes, but inconsistent to an empty list, which is added for some other calls and is semantically the best choice.
Since the active ranges are computed from the persons occuring in the data sources, there are three possibilities:

  1. If an author is active in a certain data source, the ranges are added as desired. There is no problem in this case.
  2. If an author is inactive in a certain data source, but active in other considered data sources, the computation of active ranges "knows" about the person and adds an empty list as active ranges for the specified data source. Semantically, this is the right solution, since there where no active ranges for the certain person and data source.
    (For Klara mailing in range1, but never commiting, the vertex attribute added to Klaras node is something like list(mails=list(range1)), commits = list() )
  3. If an author is inactive in all specified data sources, the computation of active ranges does not know about the author and the function add.vertex.attriute adds an NA value.
    (If Klara does not do anything in the considered ranges, but is a node in the network, then the added vertex attribute is NA)

This leads to inconsistency in the function: if only commits are considered, the active ranges for Klaras commits are NA. If commits and mails are considered, they are an empty list. In my opinion, there are three possibilities:

  1. Tolerating the inconsistency: Since in the third case, the whole attribute is NA, while in the second case, the attribute contains other information, but an empty list for one data source, this might be okay, but I don't like it.
  2. Adding a parameter to add.vertex.attribute to indicate, which value (NA or an empty list) should be added as default value in case no information is provided. This might affect several other vertex attributes, since all functions use add.vertex.attribute to add.
  3. Replace the empty lists by NAs in the active ranges computation. This is semantically problematic, since no active ranges does not mean no information about active ranges, but consistent to all other cases.

I think I prefer the third solution. What do you think, @bockthom ? Do you have further ideas or reasons?

@bockthom
Copy link
Collaborator

Hm, that is not easy to decide. I see your point, @klaraschlueter.

In my opinion, there are three possibilities:

1. Tolerating the inconsistency: Since in the third case, the whole attribute is NA, while in the second case, the attribute contains other information, but an empty list for one data source, this might be okay, but I don't like it.

2. Adding a parameter to add.vertex.attribute to indicate, which value (NA or an empty list) should be added as default value in case no information is provided. This might affect several other vertex attributes, since all functions use add.vertex.attribute to add.

3. Replace the empty lists by NAs in the active ranges computation. This is semantically problematic, since no active ranges does not mean no information about active ranges, but consistent to all other cases.

I think I prefer the third solution. What do you think, @bockthom ? Do you have further ideas or reasons?

I don't like solution 2 -- this is way too much overhead in implementation for such a little inconsistency issue.

Solution 1 is not the best one -- I agree on that -- but it would be acceptable for me.

Regarding solution 3: In the case that we would agree on solution 3, I would do it the other way round: Replace NAs by empty lists. Reason: When I am the user, I expect a certain data format. And the expected data format is
list(data.source=list(possible values), optional.additional.data.source=list(values),...)

If there is no data available, I would either expect NA (as in solution 1) or a list of empty lists, to be consistent.

However, in the end, we should think about different interpretations: Is there a difference if the author does not exist in the data or whether (s)he is just inactive? If there would be a difference, can we somehow measure that difference?
I would answer my first question with "Yes", but my second question with "No" (if the author does not appear in any of the data sources, we cannot decide whether (s)he is just inactive or non-existent).

So, finally, I would go for the reverse version of solution 3 and always return a list of lists 😉

@clhunsen
Copy link
Collaborator Author

So, finally, I would go for the reverse version of solution 3 and always return a list of lists 😉

So, in summary, the default value should be used in all cases where no data is available for a developer:

  • when there is no activity of this developer in a single chosen data source and
  • when the developer is not active in any of the configured data sources.

And the default value is to be the empty list (list()) for each data source independently, as @bockthom said (and as far as I understood the details in this discussion). Right?


By the way, the default value does not get honored in the function get.first.activity.data, @klaraschlueterNA is used instead. Can you please fix this? (Yes, I know that I have written this piece of code. 😉)

If you need exemplary code on this, I can send you something.

@klaraschlueter
Copy link

By the way, the default value does not get honored in the function get.first.activity.data, @klaraschlueterNA is used instead. Can you please fix this? (Yes, I know that I have written this piece of code. 😉)

It is no problem to consider the default value in get.first.activity.data, the required changes are minimal, @clhunsen . However, in all other functions that add vertex attributes, the default value is only used for vertices where no data is available. This is done by passing it to split.and.add.vertex.attribute in the function add.vertex.attribute.first.activity, which is already implemented. Considering the default value instead of NA in get.first.activity.data means to use it in a slightly different case as well: If there is data for a vertice in at least one considered data source, but not in all considered data sources, and take.first.overall is FALSE.

If for example the default value is given as list(), authorA was active in mails on 2019-21-01 but not active in commits and authorB was never active, the resulting vertex attributes per author are:
authorA: list(mails = 2019-21-01, commits = list())
authorB: list()
The empty list, the default value, is used both for the vertex attribute for authorB and for the value of authorA's first activity in commits.

In conclusion, I can easily "fix" this, but I'm not sure, if this is, what we want.

@klaraschlueter
Copy link

However, in the end, we should think about different interpretations: Is there a difference if the author does not exist in the data or whether (s)he is just inactive? If there would be a difference, can we somehow measure that difference?
I would answer my first question with "Yes", but my second question with "No" (if the author does not appear in any of the data sources, we cannot decide whether (s)he is just inactive or non-existent).

It is possible to decide if an author is non-existent or inactive, if we consider not only the data sources about author activity, but also the graph to add the vertex attributes to. But again, this seems like to much overhead in implementation to me.

So, finally, I would go for the reverse version of solution 3 and always return a list of lists 😉

I am not sure, if you really think of a list of lists in every case, @bockthom ? In case there is no data for one data source for one author, it should be a simple empty list, shouldn't it? Besides, I will implement the reverse version of solution 3, as you suggested.

@bockthom
Copy link
Collaborator

So, finally, I would go for the reverse version of solution 3 and always return a list of lists wink

I am not sure, if you really think of a list of lists in every case, @bockthom ? In case there is no data for one data source for one author, it should be a simple empty list, shouldn't it? Besides, I will implement the reverse version of solution 3, as you suggested.

I don't remember how we differentiate between using one data source and multiple data sources. So what was the behavior so far?
Did we have just a single list if we only analyze one data source and a list of lists if we have specified, at least, two data sources? As far as I remember correctly, we always used a list containing lists, even if we only had one element. Is this correct?

@klaraschlueter
Copy link

Yes, @bockthom . The type of the vertex attribute only depends on the parameter take.first.over.all.activity.types in the first activity computation and on the equivalent parameter (to be implemented next) in the active ranges computation. If it is FALSE, the attribute is always a list containing for every specified data source the corresponding value. For first.activity, this is a date, for active.ranges, this is a list of ranges.
If take.first.over.all.activity.types is TRUE, the vertex attribute is only one value: the minimum of all first activities or the union of all active ranges. Both are not a list of lists.

@bockthom
Copy link
Collaborator

In PR #160, the following part of this issue was fixed:

  • Related to here, we need to improve the active.ranges attribute by incorporating further data sources. Maybe, we should add a parameter activity.type as for add.vertex.attribute.first.activity.

@clhunsen
Copy link
Collaborator Author

With the merge of PR #169, the last item in this issue is fixed. Closing here. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants