Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global: new domain list feature #468

Merged
merged 1 commit into from
Jan 9, 2024

Conversation

lnielsen
Copy link
Member

@lnielsen lnielsen commented Dec 14, 2023

New domain list feature that is used to:

  • disallow user registration from specific email domains
  • automatically verify users from specific email domains.
  • gather statistics about number of users (and their state) in each domain to help identify high-risk domains
  • connect domains with known organisations to allow for reporting and identifying low-risk domains.

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

@@ -103,6 +103,9 @@ class User(db.Model, Timestamp, UserMixin):
_email = db.Column("email", db.String(255), unique=True)
"""User email."""

domain = db.Column(db.String(255), nullable=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New domain property is added on the user to allow for making a join against the domain table (needed for statistics).

@@ -463,3 +468,153 @@ def delete_by_user(cls, method, user):
"""Unlink a user from an external id."""
with db.session.begin_nested():
cls.query.filter_by(id_user=user.id, method=method).delete()


class DomainOrg(db.Model):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for being able to do statistics on different domains. E.g. a typical case is email domains like physics.uni.org and bio.uni.org all linking to the same organisation.

name = db.Column(db.String(255), nullable=False)
"""Name of organisation."""

json = db.Column(
Copy link
Member Author

@lnielsen lnielsen Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be used on zenodo to store e.g. country_code, in_eu, in_cern_ms, in_cern_ams which can be used for statistics purposes.


id = db.Column(db.Integer(), primary_key=True, autoincrement=True)

pid = db.Column(db.String(255), unique=True, nullable=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on zenodo a ROR

)
"""Link to parent organisation."""

parent = db.relationship("DomainOrg", remote_side=[id])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allows for creating hierarchy of organisations - useful again for statistics (on Zenodo the ROR hierarchy)

return obj


class DomainCategory(db.Model):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Label a domain with a category, on Zenodo will be used to classify domains in e.g. spammer, mail-provider, org, company.

Use to control possibility and capability of users registering with this domain.
"""

flagged = db.Column(db.Boolean(), default=False, nullable=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A boolean to flag a domain automatically - e.g. if we download domains from https://github.com/disposable/disposable then domains can be flagged for further inspection by an admin. This is different from status which is used directly to control user registration from the domain.

category = db.Column(db.Integer(), db.ForeignKey(DomainCategory.id), nullable=True)
"""Category of domain."""

num_users = db.Column(db.Integer(), default=0, nullable=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be computed on say daily/weekly basis (not part of PR). Needed for easy analysis of domains and detecting domains with high number of blocked users.

@@ -216,3 +239,33 @@ def validate_username(username):
# text explaining the validation rules.
message = current_app.config["ACCOUNTS_USERNAME_RULES_TEXT"]
raise ValueError(message)


def validate_domain_form(form, field):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be needed in user-profiles to check the domain when a user want to change their email address.

@@ -44,6 +45,16 @@ def __init__(self, *args, **kwargs):
if not self.next.data:
self.next.data = request.args.get("next", "")

def validate(self, extra_validators=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During registration we check the domain. We only check when the registration forms require email confirmation.

Copy link
Contributor

@ntarocco ntarocco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Added a few comments as food for thoughts, no need to answer, feel free to merge anytime.


__tablename__ = "accounts_domain_org"

id = db.Column(db.Integer(), primary_key=True, autoincrement=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you ok with autoincrement ids, or better going for UUIDs by default? :)


__tablename__ = "accounts_domains"

id = db.Column(db.Integer(), primary_key=True, autoincrement=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here about the autoincrement

flagged = db.Column(db.Boolean(), default=False, nullable=False)
"""Flag domain - used by automatic processes to flag domain."""

flagged_source = db.Column(db.String(255), default="", nullable=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you have String(255), this field is more a keyword rather than a free text right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

prefix, domain = email.split("@", 1)
prefix = prefix.lower().strip()
domain = domain.lower().strip()
if domain[-1] == ".":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because [email protected]. is a valid email address, but we want to "normalize" the domain.

# Domain is inserted on domain list when a user confirms their email.
domain = datastore.find_domain(user.domain)
if domain is None:
domain = datastore.create_domain(user.domain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are doing a session.add() here, is a db.session.commit always happening after this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, checked. user_confirmed signal is sent from the confirm_user() function that's called here, which will do the commit:

https://github.com/inveniosoftware/flask-security-fork/blob/b99556579f4bbc9d1425e79d1f9f0d115e5baf6a/flask_security/views.py#L146-L147

return cls.query.filter_by(label=label).one_or_none()


class Domain(db.Model, Timestamp):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to add __versioned__ = {"versioning": False} to this one given the various num_* columns updated daily by the cron?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a plain model, so it doesnt' have any versioning on by default.

Copy link
Member

@slint slint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 2 minor comments

def create_domain(self, domain, **kwargs):
"""Create a new domain."""
domain = Domain.create(domain, **kwargs)
self.put(domain)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this eventually calls self.mark_changed(..., uid=domain.id), probably just callingself.db.session.add(domain) is enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, Domain.create() already does db.session.add() so line can be removed.

prefix, domain = email.split("@", 1)
prefix = prefix.lower().strip()
domain = domain.lower().strip()
if domain[-1] == ".":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because [email protected]. is a valid email address, but we want to "normalize" the domain.


def split_emailaddr(email):
"""Split email address in prefix and domain."""
prefix, domain = email.split("@", 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prefix, domain = email.split("@", 1)
prefix, domain = email.rsplit("@", 1)

To avoid bypassing with something like "user@foo"@gmail.com, which would return foo"@gmail.com as the domain (see this SO answer)

* New domain list feature that can be used to block email domains from
  registering, and automatically verifying users from other domains.
@lnielsen lnielsen merged commit 2ad7fdc into inveniosoftware:master Jan 9, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants