Skip to content

Commit

Permalink
Fix join being denied after being invited over federation (#18075)
Browse files Browse the repository at this point in the history
This also happens for rejecting an invite. Basically, any out-of-band membership transition where we first get the membership as an `outlier` and then rely on federation filling us in to de-outlier it.

This PR mainly addresses automated test flakiness, bots/scripts, and options within Synapse like [`auto_accept_invites`](https://element-hq.github.io/synapse/v1.122/usage/configuration/config_documentation.html#auto_accept_invites) that are able to react quickly (before federation is able to push us events), but also helps in generic scenarios where federation is lagging.

I initially thought this might be a Synapse consistency issue (see issues labeled with [`Z-Read-After-Write`](https://github.com/matrix-org/synapse/labels/Z-Read-After-Write)) but it seems to be an event auth logic problem. Workers probably do increase the number of possible race condition scenarios that make this visible though (replication and cache invalidation lag).

Fix #15012
(probably fixes matrix-org/synapse#15012 (#15012))
Related to matrix-org/matrix-spec#2062

Problems:

 1. We don't consider [out-of-band membership](https://github.com/element-hq/synapse/blob/develop/docs/development/room-dag-concepts.md#out-of-band-membership-events) (outliers) in our `event_auth` logic even though we expose them in `/sync`.
 1. (This PR doesn't address this point) Perhaps we should consider authing events in the persistence queue as events already in the queue could allow subsequent events to be allowed (events come through many channels: federation transaction, remote invite, remote join, local send). But this doesn't save us in the case where the event is more delayed over federation.


### What happened before?

I wrote some Complement test that stresses this exact scenario and reproduces the problem: matrix-org/complement#757

```
COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestSynapseConsistency
```


We have `hs1` and `hs2` running in monolith mode (no workers):

 1. `@charlie1:hs2` is invited and joins the room:
     1. `hs1` invites `@charlie1:hs2` to a room which we receive on `hs2` as `PUT /_matrix/federation/v1/invite/{roomId}/{eventId}` (`on_invite_request(...)`) and the invite membership is persisted as an outlier. The `room_memberships` and `local_current_membership` database tables are also updated which means they are visible down `/sync` at this point.
     1. `@charlie1:hs2` decides to join because it saw the invite down `/sync`. Because `hs2` is not yet in the room, this happens as a remote join `make_join`/`send_join` which comes back with all of the auth events needed to auth successfully and now `@charlie1:hs2` is successfully joined to the room.
 1. `@charlie2:hs2` is invited and and tries to join the room:
     1. `hs1` invites `@charlie2:hs2` to the room which we receive on `hs2` as `PUT /_matrix/federation/v1/invite/{roomId}/{eventId}` (`on_invite_request(...)`) and the invite membership is persisted as an outlier. The `room_memberships` and `local_current_membership` database tables are also updated which means they are visible down `/sync` at this point.
     1. Because `hs2` is already participating in the room, we also see the invite come over federation in a transaction and we start processing it (not done yet, see below)
     1. `@charlie2:hs2` decides to join because it saw the invite down `/sync`. Because `hs2`, is already in the room, this happens as a local join but we deny the event because our `event_auth` logic thinks that we have no membership in the room ❌ (expected to be able to join because we saw the invite down `/sync`)
     1. We finally finish processing the `@charlie2:hs2` invite event from and de-outlier it.
         - If this finished before we tried to join we would have been fine but this is the race condition that makes this situation visible.


Logs for `hs2`:

```
🗳️ on_invite_request: handling event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=False>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
🔦 _store_room_members_txn update local_current_membership: <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
📨 Notifying about new event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
✅ on_invite_request: handled event <FrozenEventV3 event_id=$PRPCvdXdcqyjdUKP_NxGF2CcukmwOaoK0ZR1WiVOZVk, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=invite, outlier=True>
🧲 do_invite_join for @user-2-charlie1:hs2 in !sfZVBdLUezpPWetrol:hs1
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$bwv8LxFnqfpsw_rhR7OrTjtz09gaJ23MqstKOcs7ygA, type=m.room.member, state_key=@user-1-alice:hs1, membership=join, outlier=True>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$oju1ts3G3pz5O62IesrxX5is4LxAwU3WPr4xvid5ijI, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=join, outlier=False>
📨 Notifying about new event <FrozenEventV3 event_id=$oju1ts3G3pz5O62IesrxX5is4LxAwU3WPr4xvid5ijI, type=m.room.member, state_key=@user-2-charlie1:hs2, membership=join, outlier=False>

...

🗳️ on_invite_request: handling event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
🔦 _store_room_members_txn update room_memberships: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
🔦 _store_room_members_txn update local_current_membership: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
📨 Notifying about new event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
✅ on_invite_request: handled event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=True>
📬 handling received PDU in room !sfZVBdLUezpPWetrol:hs1: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
📮 handle_new_client_event: handling <FrozenEventV3 event_id=$WNVDTQrxy5tCdPQHMyHyIn7tE4NWqKsZ8Bn8R4WbBSA, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=join, outlier=False>
❌ Denying new event <FrozenEventV3 event_id=$WNVDTQrxy5tCdPQHMyHyIn7tE4NWqKsZ8Bn8R4WbBSA, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=join, outlier=False> because 403: You are not invited to this room.
synapse.http.server - 130 - INFO - POST-16 - <SynapseRequest at 0x7f460c91fbf0 method='POST' uri='/_matrix/client/v3/join/%21sfZVBdLUezpPWetrol:hs1?server_name=hs1' clientproto='HTTP/1.0' site='8080'> SynapseError: 403 - You are not invited to this room.
📨 Notifying about new event <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
✅ handled received PDU in room !sfZVBdLUezpPWetrol:hs1: <FrozenEventV3 event_id=$O_54j7O--6xMsegY5EVZ9SA-mI4_iHJOIoRwYyeWIPY, type=m.room.member, state_key=@user-3-charlie2:hs2, membership=invite, outlier=False>
```
  • Loading branch information
MadLittleMods authored Jan 27, 2025
1 parent 148e935 commit 6ec5e13
Show file tree
Hide file tree
Showing 16 changed files with 1,341 additions and 442 deletions.
1 change: 1 addition & 0 deletions changelog.d/18075.bugfix
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix join being denied after being invited over federation. Also fixes other out-of-band membership transitions.
4 changes: 3 additions & 1 deletion synapse/event_auth.py
Original file line number Diff line number Diff line change
Expand Up @@ -566,6 +566,7 @@ def _is_membership_change_allowed(
logger.debug(
"_is_membership_change_allowed: %s",
{
"caller_membership": caller.membership if caller else None,
"caller_in_room": caller_in_room,
"caller_invited": caller_invited,
"caller_knocked": caller_knocked,
Expand Down Expand Up @@ -677,7 +678,8 @@ def _is_membership_change_allowed(
and join_rule == JoinRules.KNOCK_RESTRICTED
)
):
if not caller_in_room and not caller_invited:
# You can only join the room if you are invited or are already in the room.
if not (caller_in_room or caller_invited):
raise AuthError(403, "You are not invited to this room.")
else:
# TODO (erikj): may_join list
Expand Down
7 changes: 6 additions & 1 deletion synapse/events/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@
from typing_extensions import Literal
from unpaddedbase64 import encode_base64

from synapse.api.constants import RelationTypes
from synapse.api.constants import EventTypes, RelationTypes
from synapse.api.room_versions import EventFormatVersions, RoomVersion, RoomVersions
from synapse.synapse_rust.events import EventInternalMetadata
from synapse.types import JsonDict, StrCollection
Expand Down Expand Up @@ -325,12 +325,17 @@ def __str__(self) -> str:
def __repr__(self) -> str:
rejection = f"REJECTED={self.rejected_reason}, " if self.rejected_reason else ""

conditional_membership_string = ""
if self.get("type") == EventTypes.Member:
conditional_membership_string = f"membership={self.membership}, "

return (
f"<{self.__class__.__name__} "
f"{rejection}"
f"event_id={self.event_id}, "
f"type={self.get('type')}, "
f"state_key={self.get('state_key')}, "
f"{conditional_membership_string}"
f"outlier={self.internal_metadata.is_outlier()}"
">"
)
Expand Down
55 changes: 54 additions & 1 deletion synapse/events/builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@
import attr
from signedjson.types import SigningKey

from synapse.api.constants import MAX_DEPTH
from synapse.api.constants import MAX_DEPTH, EventTypes
from synapse.api.room_versions import (
KNOWN_EVENT_FORMAT_VERSIONS,
EventFormatVersions,
Expand Down Expand Up @@ -109,6 +109,19 @@ def state_key(self) -> str:
def is_state(self) -> bool:
return self._state_key is not None

def is_mine_id(self, user_id: str) -> bool:
"""Determines whether a user ID or room alias originates from this homeserver.
Returns:
`True` if the hostname part of the user ID or room alias matches this
homeserver.
`False` otherwise, or if the user ID or room alias is malformed.
"""
localpart_hostname = user_id.split(":", 1)
if len(localpart_hostname) < 2:
return False
return localpart_hostname[1] == self._hostname

async def build(
self,
prev_event_ids: List[str],
Expand Down Expand Up @@ -142,6 +155,46 @@ async def build(
self, state_ids
)

# Check for out-of-band membership that may have been exposed on `/sync` but
# the events have not been de-outliered yet so they won't be part of the
# room state yet.
#
# This helps in situations where a remote homeserver invites a local user to
# a room that we're already participating in; and we've persisted the invite
# as an out-of-band membership (outlier), but it hasn't been pushed to us as
# part of a `/send` transaction yet and de-outliered. This also helps for
# any of the other out-of-band membership transitions.
#
# As an optimization, we could check if the room state already includes a
# non-`leave` membership event, then we can assume the membership event has
# been de-outliered and we don't need to check for an out-of-band
# membership. But we don't have the necessary information from a
# `StateMap[str]` and we'll just have to take the hit of this extra lookup
# for any membership event for now.
if self.type == EventTypes.Member and self.is_mine_id(self.state_key):
(
_membership,
member_event_id,
) = await self._store.get_local_current_membership_for_user_in_room(
user_id=self.state_key,
room_id=self.room_id,
)
# There is no need to check if the membership is actually an
# out-of-band membership (`outlier`) as we would end up with the
# same result either way (adding the member event to the
# `auth_event_ids`).
if (
member_event_id is not None
# We only need to be careful about duplicating the event in the
# `auth_event_ids` list (duplicate `type`/`state_key` is part of the
# authorization rules)
and member_event_id not in auth_event_ids
):
auth_event_ids.append(member_event_id)
# Also make sure to point to the previous membership event that will
# allow this one to happen so the computed state works out.
prev_event_ids.append(member_event_id)

format_version = self.room_version.event_format
# The types of auth/prev events changes between event versions.
prev_events: Union[StrCollection, List[Tuple[str, Dict[str, str]]]]
Expand Down
5 changes: 3 additions & 2 deletions synapse/handlers/federation_event.py
Original file line number Diff line number Diff line change
Expand Up @@ -2272,8 +2272,9 @@ async def persist_events_and_notify(
event_and_contexts, backfilled=backfilled
)

# After persistence we always need to notify replication there may
# be new data.
# After persistence, we never notify clients (wake up `/sync` streams) about
# backfilled events but it's important to let all the workers know about any
# new event (backfilled or not) because TODO
self._notifier.notify_replication()

if self._ephemeral_messages_enabled:
Expand Down
4 changes: 2 additions & 2 deletions synapse/server.py
Original file line number Diff line number Diff line change
Expand Up @@ -391,15 +391,15 @@ def get_reactor(self) -> ISynapseReactor:
def is_mine(self, domain_specific_string: DomainSpecificString) -> bool:
return domain_specific_string.domain == self.hostname

def is_mine_id(self, string: str) -> bool:
def is_mine_id(self, user_id: str) -> bool:
"""Determines whether a user ID or room alias originates from this homeserver.
Returns:
`True` if the hostname part of the user ID or room alias matches this
homeserver.
`False` otherwise, or if the user ID or room alias is malformed.
"""
localpart_hostname = string.split(":", 1)
localpart_hostname = user_id.split(":", 1)
if len(localpart_hostname) < 2:
return False
return localpart_hostname[1] == self.hostname
Expand Down
161 changes: 161 additions & 0 deletions tests/federation/test_federation_devices.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
#
# This file is licensed under the Affero General Public License (AGPL) version 3.
#
# Copyright (C) 2024 New Vector, Ltd
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# See the GNU Affero General Public License for more details:
# <https://www.gnu.org/licenses/agpl-3.0.html>.
#
# Originally licensed under the Apache License, Version 2.0:
# <http://www.apache.org/licenses/LICENSE-2.0>.
#
# [This file includes modifications made by New Vector Limited]
#
#

import logging
from unittest.mock import AsyncMock, Mock

from twisted.test.proto_helpers import MemoryReactor

from synapse.handlers.device import DeviceListUpdater
from synapse.server import HomeServer
from synapse.types import JsonDict
from synapse.util import Clock
from synapse.util.retryutils import NotRetryingDestination

from tests import unittest

logger = logging.getLogger(__name__)


class DeviceListResyncTestCase(unittest.HomeserverTestCase):
def prepare(self, reactor: MemoryReactor, clock: Clock, hs: HomeServer) -> None:
self.store = self.hs.get_datastores().main

def test_retry_device_list_resync(self) -> None:
"""Tests that device lists are marked as stale if they couldn't be synced, and
that stale device lists are retried periodically.
"""
remote_user_id = "@john:test_remote"
remote_origin = "test_remote"

# Track the number of attempts to resync the user's device list.
self.resync_attempts = 0

# When this function is called, increment the number of resync attempts (only if
# we're querying devices for the right user ID), then raise a
# NotRetryingDestination error to fail the resync gracefully.
def query_user_devices(
destination: str, user_id: str, timeout: int = 30000
) -> JsonDict:
if user_id == remote_user_id:
self.resync_attempts += 1

raise NotRetryingDestination(0, 0, destination)

# Register the mock on the federation client.
federation_client = self.hs.get_federation_client()
federation_client.query_user_devices = Mock(side_effect=query_user_devices) # type: ignore[method-assign]

# Register a mock on the store so that the incoming update doesn't fail because
# we don't share a room with the user.
self.store.get_rooms_for_user = AsyncMock(return_value=["!someroom:test"])

# Manually inject a fake device list update. We need this update to include at
# least one prev_id so that the user's device list will need to be retried.
device_list_updater = self.hs.get_device_handler().device_list_updater
assert isinstance(device_list_updater, DeviceListUpdater)
self.get_success(
device_list_updater.incoming_device_list_update(
origin=remote_origin,
edu_content={
"deleted": False,
"device_display_name": "Mobile",
"device_id": "QBUAZIFURK",
"prev_id": [5],
"stream_id": 6,
"user_id": remote_user_id,
},
)
)

# Check that there was one resync attempt.
self.assertEqual(self.resync_attempts, 1)

# Check that the resync attempt failed and caused the user's device list to be
# marked as stale.
need_resync = self.get_success(
self.store.get_user_ids_requiring_device_list_resync()
)
self.assertIn(remote_user_id, need_resync)

# Check that waiting for 30 seconds caused Synapse to retry resyncing the device
# list.
self.reactor.advance(30)
self.assertEqual(self.resync_attempts, 2)

def test_cross_signing_keys_retry(self) -> None:
"""Tests that resyncing a device list correctly processes cross-signing keys from
the remote server.
"""
remote_user_id = "@john:test_remote"
remote_master_key = "85T7JXPFBAySB/jwby4S3lBPTqY3+Zg53nYuGmu1ggY"
remote_self_signing_key = "QeIiFEjluPBtI7WQdG365QKZcFs9kqmHir6RBD0//nQ"

# Register mock device list retrieval on the federation client.
federation_client = self.hs.get_federation_client()
federation_client.query_user_devices = AsyncMock( # type: ignore[method-assign]
return_value={
"user_id": remote_user_id,
"stream_id": 1,
"devices": [],
"master_key": {
"user_id": remote_user_id,
"usage": ["master"],
"keys": {"ed25519:" + remote_master_key: remote_master_key},
},
"self_signing_key": {
"user_id": remote_user_id,
"usage": ["self_signing"],
"keys": {
"ed25519:" + remote_self_signing_key: remote_self_signing_key
},
},
}
)

# Resync the device list.
device_handler = self.hs.get_device_handler()
self.get_success(
device_handler.device_list_updater.multi_user_device_resync(
[remote_user_id]
),
)

# Retrieve the cross-signing keys for this user.
keys = self.get_success(
self.store.get_e2e_cross_signing_keys_bulk(user_ids=[remote_user_id]),
)
self.assertIn(remote_user_id, keys)
key = keys[remote_user_id]
assert key is not None

# Check that the master key is the one returned by the mock.
master_key = key["master"]
self.assertEqual(len(master_key["keys"]), 1)
self.assertTrue("ed25519:" + remote_master_key in master_key["keys"].keys())
self.assertTrue(remote_master_key in master_key["keys"].values())

# Check that the self-signing key is the one returned by the mock.
self_signing_key = key["self_signing"]
self.assertEqual(len(self_signing_key["keys"]), 1)
self.assertTrue(
"ed25519:" + remote_self_signing_key in self_signing_key["keys"].keys(),
)
self.assertTrue(remote_self_signing_key in self_signing_key["keys"].values())
Loading

0 comments on commit 6ec5e13

Please sign in to comment.