drbd: allow application IO concurrently with resync requests sent to same peer

Since the previous commit ("drbd: respect peer writes held due to
conflicts when barrier received"), the receiver also waits, when a
barrier is received, for requests that are held back due to conflicts.
If such a conflict is with a resync request towards the same peer, it is
only resolved once the resync reply has been received. But that reply is
processed by the very receiver that is blocked waiting for the barrier
to complete, so it never arrives: a deadlock. Avoid this by allowing
application IO to be submitted while conflicting resync requests are
pending, but only when they relate to the same peer.

Implement this by adding "sent" and "received" flags to the intervals.
Resync requests that we initiate pass through two conflict resolution
phases. The first waits until the request can be sent, that is, until
there are no conflicting application writes from other peers. The second
occurs when we receive the reply and ensures that we do not submit
conflicting writes concurrently.
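
A minimal sketch of the two phases (simplified; drbd_conflict_send_resync_request()
is declared by this commit but its body is outside the hunks shown below, so its
exact shape here is an assumption):

/* Phase 1 (sketch): conflict resolution before sending the request.
 * Writes from the peer the request is addressed to are ignored via
 * CONFLICT_FLAG_IGNORE_SAME_PEER, because that peer orders its own
 * data stream. */
void drbd_conflict_send_resync_request(struct drbd_peer_request *peer_req)
{
	struct drbd_device *device = peer_req->peer_device->device;
	bool conflict;

	spin_lock_irq(&device->interval_lock);
	conflict = drbd_find_conflict(device, &peer_req->i,
			CONFLICT_FLAG_WRITE | CONFLICT_FLAG_IGNORE_SAME_PEER);
	if (!conflict)
		set_bit(INTERVAL_SENT, &peer_req->i.flags);
	spin_unlock_irq(&device->interval_lock);

	if (!conflict) {
		/* record the dagtag and send P_RS_DATA_REQUEST */
	}
}

Phase 2 is drbd_conflict_submit_resync_request() in drbd_receiver.c below: it
sets INTERVAL_RECEIVED, re-checks for conflicts with plain CONFLICT_FLAG_WRITE
and only submits the write to the backing disk once no conflicting IO remains.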

This change has similarities to
4dc38cd drbd: Break resync deadlock
which was reverted because a different solution was required to avoid a
distributed deadlock with 3 nodes:
c0cd45a drbd: Break distributed deadlock in request processing
ddc742b drbd: drop special-casing peer_device in al_begin_io_for_peer
However, this change is distinctly different, and this time it is valid.
The key difference is that resync requests are no longer blocked by an
active activity log extent. As a result, they only wait for local disk
writes to complete, and not for a peer ack, which is a distributed
event.

The compatibility special case in receive_Data() is no longer necessary,
because this approach removes the exclusivity between peer writes and
resync requests that we have sent and are awaiting a reply for. In fact,
the previous solution was incomplete: it still held peer writes back
when they conflicted with resync requests that had been sent and were
pending a reply. This approach handles that scenario as well.
JoelColledge committed Jun 24, 2022
1 parent 15f2580 commit da2013b
Showing 7 changed files with 167 additions and 61 deletions.
26 changes: 18 additions & 8 deletions Documentation/application-resync-synchronization.rst
@@ -56,8 +56,12 @@ scheme:
* Secondary: The application IO interval is locked while the data is being
written to the backing disk.
* Sync target: The ``P_RS_DATA_REQUEST`` interval is locked. The lock is taken
then the dagtag is recorded and the request is sent. The lock is released
when the reply has been received and the data written to the backing disk.
in two phases. Before sending the request, the interval is locked for
conflicts with other peers. Then the dagtag is recorded and the request is
sent. When the reply is received, the interval is additionally locked for
conflicts with ongoing IO, in particular writes from the same peer. The lock
is released when the reply has been received and the data written to the
backing disk.
* Sync source: The ``P_RS_DATA_REQUEST`` interval is locked. The lock is taken
when the dagtag for the request is reached. It is released when the
``P_RS_WRITE_ACK`` is received. This lock is a read lock; it is exclusive
@@ -94,12 +98,12 @@ Correctness of data

We only consider the synchronization between application and resync IO here.

The locking scheme prevents any writes to the resync request interval from when
the request is initiated until the received data is written. After the lock is
taken on the target, the dagtag is recorded and the request is sent to the
source. The source then waits until it has reached this dagtag before reading.
This ensures that the resync data is at least as new as the data on the target
when the request was made.
The locking scheme prevents any writes from other peers to the resync request
interval from when the request is initiated until the received data is written.
After the lock is taken on the target, the dagtag is recorded and the request
is sent to the source. The source then waits until it has reached this dagtag
before reading. This ensures that the resync data is at least as new as the
data on the target when the request was made.

Conflicting application writes that reach the target while the resync request
is in progress are held until the resync data has been written. Hence they
@@ -108,6 +112,12 @@ this application write when it performed the resync read, the application write
will overwrite the resync write on the target with identical data. This is
harmless.

Resync requests sent from the target are not exclusive with application writes
from the same peer. However, since the resync and application data originate
from the same node, they are transmitted in the correct order in the data
stream. Application IO defers to received resync IO, ensuring that a resync
write received before an application write is also submitted first.

Correctness of bitmap bits
--------------------------

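To make the ordering argument concrete, here is one possible sequence of events
on a sync target T resyncing from a peer S (an illustrative walk-through, not
part of the changed file):

/*
 * 1. T locks sector X against writes from other peers, sets INTERVAL_SENT,
 *    records the dagtag and sends P_RS_DATA_REQUEST to S.
 * 2. An application write W to X from S reaches T before the resync reply.
 *    W conflicts with the resync interval, but the interval is only "sent",
 *    not "received", and relates to the same peer, so W is submitted rather
 *    than held. This is what prevents the barrier deadlock.
 * 3. The reply arrives and T sets INTERVAL_RECEIVED. The submit-phase
 *    conflict check now makes the resync write wait until W has completed,
 *    and any later application write to X defers to the resync write until
 *    its data is on the backing disk.
 * 4. Because W and the resync reply both originate on S, their order in the
 *    data stream matches their order on S, so older data never overwrites
 *    newer data.
 */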
2 changes: 2 additions & 0 deletions drbd/drbd_debugfs.c
@@ -1111,6 +1111,8 @@ static void seq_printf_interval_tree(struct seq_file *m, struct rb_root *root)
char sep = ' ';

seq_printf(m, "%llus+%u %s", (unsigned long long) i->sector, i->size, drbd_interval_type_str(i));
seq_print_rq_state_bit(m, test_bit(INTERVAL_SENT, &i->flags), &sep, "sent");
seq_print_rq_state_bit(m, test_bit(INTERVAL_RECEIVED, &i->flags), &sep, "received");
seq_print_rq_state_bit(m, test_bit(INTERVAL_SUBMIT_CONFLICT_QUEUED, &i->flags), &sep, "submit-conflict-queued");
seq_print_rq_state_bit(m, test_bit(INTERVAL_SUBMITTED, &i->flags), &sep, "submitted");
seq_print_rq_state_bit(m, test_bit(INTERVAL_BACKING_COMPLETED, &i->flags), &sep, "backing-completed");
65 changes: 61 additions & 4 deletions drbd/drbd_int.h
@@ -1985,7 +1985,7 @@ extern void drbd_rs_controller_reset(struct drbd_peer_device *);
extern void drbd_rs_all_in_flight_came_back(struct drbd_peer_device *, int);
extern void drbd_check_peers(struct drbd_resource *resource);
extern void drbd_check_peers_new_current_uuid(struct drbd_device *);
extern void drbd_conflict_submit_resync_request(struct drbd_peer_request *peer_req);
extern void drbd_conflict_send_resync_request(struct drbd_peer_request *peer_req);
extern void drbd_ping_peer(struct drbd_connection *connection);
extern struct drbd_peer_device *peer_device_by_node_id(struct drbd_device *, int);
extern void repost_up_to_date_fn(struct timer_list *t);
@@ -2094,6 +2094,7 @@ extern void drbd_send_peer_ack_wf(struct work_struct *ws);
extern bool drbd_rs_c_min_rate_throttle(struct drbd_peer_device *);
extern void drbd_verify_skipped_block(struct drbd_peer_device *peer_device,
const sector_t sector, const unsigned int size);
extern void drbd_conflict_submit_resync_request(struct drbd_peer_request *peer_req);
extern void drbd_conflict_submit_peer_read(struct drbd_peer_request *peer_req);
extern void drbd_conflict_submit_peer_write(struct drbd_peer_request *peer_req);
extern int drbd_submit_peer_request(struct drbd_peer_request *);
@@ -2654,10 +2655,57 @@ static inline struct drbd_connection *first_connection(struct drbd_resource *res

#define NODE_MASK(id) ((u64)1 << (id))

/*
* drbd_interval_same_peer - determine whether "interval" is for the same peer as "i"
*
* "i" must be an interval corresponding to a drbd_peer_request.
*/
static inline bool drbd_interval_same_peer(struct drbd_interval *interval, struct drbd_interval *i)
{
struct drbd_peer_request *interval_peer_req, *i_peer_req;

/* Ensure we only call "container_of" if it is actually a peer request. */
if (interval->type == INTERVAL_LOCAL_WRITE ||
interval->type == INTERVAL_LOCAL_READ ||
interval->type == INTERVAL_PEERS_IN_SYNC_LOCK)
return false;

interval_peer_req = container_of(interval, struct drbd_peer_request, i);
i_peer_req = container_of(i, struct drbd_peer_request, i);
return interval_peer_req->peer_device == i_peer_req->peer_device;
}

/*
* drbd_should_defer_to_resync - determine whether "interval" should defer to "i"
*/
static inline bool drbd_should_defer_to_resync(struct drbd_interval *interval, struct drbd_interval *i)
{
if (!drbd_interval_is_resync(i))
return false;

/* Always defer to resync requests once the reply has been received.
* These just need to wait for conflicting local I/O to complete. This
* is necessary to ensure that resync replies received before
* application writes are submitted first, so that the resync writes do
* not overwrite newer data. */
if (test_bit(INTERVAL_RECEIVED, &i->flags))
return true;

/* If we are still waiting for a reply from the peer, only defer to the
* request if it is towards a different peer. The exclusivity between
* resync requests and application writes from another peer is
* necessary to avoid overwriting newer data with older in the resync.
* When the data in both cases is coming from the same peer, this is
* not necessary. The peer ensures that the data stream is correctly
* ordered. */
return !drbd_interval_same_peer(interval, i);
}

#define CONFLICT_FLAG_WRITE (1 << 0)
#define CONFLICT_FLAG_DEFER_TO_RESYNC (1 << 1)
#define CONFLICT_FLAG_EXCLUSIVE_UNTIL_COMPLETED (1 << 2)
#define CONFLICT_FLAG_APPLICATION_ONLY (1 << 3)
#define CONFLICT_FLAG_IGNORE_SAME_PEER (1 << 2)
#define CONFLICT_FLAG_EXCLUSIVE_UNTIL_COMPLETED (1 << 3)
#define CONFLICT_FLAG_APPLICATION_ONLY (1 << 4)

/*
* drbd_find_conflict - find conflicting interval, if any
@@ -2670,6 +2718,7 @@ static inline struct drbd_interval *drbd_find_conflict(struct drbd_device *devic
int size = interval->size;
bool write = flags & CONFLICT_FLAG_WRITE;
bool defer_to_resync = flags & CONFLICT_FLAG_DEFER_TO_RESYNC;
bool ignore_same_peer = flags & CONFLICT_FLAG_IGNORE_SAME_PEER;
bool exclusive_until_completed = flags & CONFLICT_FLAG_EXCLUSIVE_UNTIL_COMPLETED;
bool application_only = flags & CONFLICT_FLAG_APPLICATION_ONLY;

@@ -2693,7 +2742,15 @@
/* Ignore, if not yet submitted, unless we should defer to a
* resync request. */
if (!test_bit(INTERVAL_SUBMITTED, &i->flags) &&
!(defer_to_resync && drbd_interval_is_resync(i)))
!(defer_to_resync && drbd_should_defer_to_resync(interval, i)))
continue;

/* Ignore peer writes from the peer that this request relates
* to, if requested. This is only used for determining whether
* to send a request. It must not be used for determining
* whether to submit a request, because that would allow
* concurrent writes to the backing disk. */
if (ignore_same_peer && i->type == INTERVAL_PEER_WRITE && drbd_interval_same_peer(interval, i))
continue;

if (unlikely(application_only)) {
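For orientation, these are the drbd_find_conflict() flag combinations now in
play. The first two appear in the hunks below; the send-side combination is
inferred from the flag comments above, since drbd_conflict_send_resync_request()
is not among the shown hunks:

/* Submit phase, after the resync reply has arrived: wait for all
 * conflicting IO, including writes from the same peer
 * (drbd_conflict_submit_resync_request()). */
conflict = drbd_find_conflict(device, &peer_req->i, CONFLICT_FLAG_WRITE);

/* Receiving an application write: defer to resync requests according to
 * drbd_should_defer_to_resync() (receive_Data()). */
conflict = drbd_find_conflict(device, &peer_req->i,
		CONFLICT_FLAG_WRITE | CONFLICT_FLAG_DEFER_TO_RESYNC);

/* Send phase (assumed): ignore writes from the peer the request is
 * addressed to; that peer keeps its own data stream ordered. */
conflict = drbd_find_conflict(device, &peer_req->i,
		CONFLICT_FLAG_WRITE | CONFLICT_FLAG_IGNORE_SAME_PEER);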
6 changes: 6 additions & 0 deletions drbd/drbd_interval.h
@@ -20,6 +20,12 @@ enum drbd_interval_type {
};

enum drbd_interval_flags {
/* Whether this resync write has been sent yet. */
INTERVAL_SENT,

/* Whether this resync write has been received yet. */
INTERVAL_RECEIVED,

/* Whether this has been queued after conflict. */
INTERVAL_SUBMIT_CONFLICT_QUEUED,

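Taken together, the interval flags trace the lifecycle of a target-initiated
resync request (as far as these hunks show; the point where INTERVAL_SENT is
set lies outside them):

/*
 * interval created
 *   -> INTERVAL_SENT               request sent to the peer; tested in
 *                                  find_resync_request()
 *   -> INTERVAL_RECEIVED           reply arrived; set in
 *                                  drbd_conflict_submit_resync_request(),
 *                                  receive_common_ov_reply() and
 *                                  got_IsInSync()
 *   -> INTERVAL_SUBMITTED          no remaining conflicts, IO submitted
 *   -> INTERVAL_BACKING_COMPLETED  backing disk IO finished
 */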
88 changes: 49 additions & 39 deletions drbd/drbd_receiver.c
@@ -2550,7 +2550,7 @@ static struct drbd_peer_request *find_resync_request(struct drbd_peer_device *pe
* the requests. Hence looking the request up with a list traversal
* should not be too slow. */
list_for_each_entry(p, list, w.list) {
if (p->i.sector == sector && test_bit(INTERVAL_SUBMITTED, &p->i.flags)) {
if (p->i.sector == sector && test_bit(INTERVAL_SENT, &p->i.flags)) {
peer_req = p;
break;
}
@@ -2566,6 +2566,46 @@
return peer_req;
}

static void drbd_cleanup_after_failed_submit_resync_request(struct drbd_peer_request *peer_req)
{
struct drbd_peer_device *peer_device = peer_req->peer_device;
struct drbd_connection *connection = peer_device->connection;
struct drbd_device *device = peer_device->device;

drbd_err(device, "submit failed, triggering re-connect\n");
spin_lock_irq(&connection->peer_reqs_lock);
list_del(&peer_req->w.list);
list_del(&peer_req->recv_order);
spin_unlock_irq(&connection->peer_reqs_lock);

drbd_remove_peer_req_interval(peer_req);
drbd_free_peer_req(peer_req);
put_ldev(device);

change_cstate(connection, C_PROTOCOL_ERROR, CS_HARD);
}

void drbd_conflict_submit_resync_request(struct drbd_peer_request *peer_req)
{
struct drbd_peer_device *peer_device = peer_req->peer_device;
struct drbd_device *device = peer_device->device;
bool conflict;

spin_lock_irq(&device->interval_lock);
clear_bit(INTERVAL_SUBMIT_CONFLICT_QUEUED, &peer_req->i.flags);
set_bit(INTERVAL_RECEIVED, &peer_req->i.flags);
conflict = drbd_find_conflict(device, &peer_req->i, CONFLICT_FLAG_WRITE);
if (!conflict)
set_bit(INTERVAL_SUBMITTED, &peer_req->i.flags);
spin_unlock_irq(&device->interval_lock);

if (!conflict) {
int err = drbd_submit_peer_request(peer_req);
if (err)
drbd_cleanup_after_failed_submit_resync_request(peer_req);
}
}

static int recv_resync_read(struct drbd_peer_device *peer_device,
struct drbd_peer_request_details *d) __releases(local)
{
@@ -2611,9 +2651,7 @@ static int recv_resync_read(struct drbd_peer_device *peer_device,
size = peer_req->i.size;
drbd_set_all_out_of_sync(device, sector, size);

err = drbd_submit_peer_request(peer_req);
if (err)
goto out;
drbd_conflict_submit_resync_request(peer_req);
peer_req = NULL; /* since submitted, might be destroyed already */

for_each_peer_device_ref(peer_device, im, device) {
Expand All @@ -2622,16 +2660,6 @@ static int recv_resync_read(struct drbd_peer_device *peer_device,
drbd_send_out_of_sync(peer_device, sector, size);
}
return 0;
out:
/* don't care for the reason here */
drbd_err(device, "submit failed, triggering re-connect\n");
spin_lock_irq(&connection->peer_reqs_lock);
list_del(&peer_req->w.list);
list_del(&peer_req->recv_order);
spin_unlock_irq(&connection->peer_reqs_lock);

drbd_free_peer_req(peer_req);
return err;
}

/* caller must hold interval_lock */
@@ -3187,7 +3215,6 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
struct drbd_peer_request *peer_req;
struct drbd_peer_request_details d;
int err, tp;
unsigned long conflict_flags = CONFLICT_FLAG_WRITE;
bool conflict = false;

peer_device = conn_peer_device(connection, pi->vnr);
@@ -3385,16 +3412,8 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
goto out_del_list;
}
}
/* In protocol < 110 (compatibility with DRBD 8.4), we must not defer
* to resync. The peer may have received a resync request from us in
* the same "resync extent" as this write. In this case, it will block
* until it has received the corresponding barrier ack, so we must
* submit the write.
*/
if (connection->agreed_pro_version >= 110)
conflict_flags |= CONFLICT_FLAG_DEFER_TO_RESYNC;
conflict = connection->agreed_pro_version >= 110 &&
drbd_find_conflict(device, &peer_req->i, conflict_flags);
conflict = drbd_find_conflict(device, &peer_req->i,
CONFLICT_FLAG_WRITE | CONFLICT_FLAG_DEFER_TO_RESYNC);
drbd_insert_interval(&device->requests, &peer_req->i);
if (!conflict)
set_bit(INTERVAL_SUBMITTED, &peer_req->i.flags);
@@ -4066,6 +4085,8 @@ static int receive_common_ov_reply(struct drbd_connection *connection, struct pa
if (err)
goto fail;

set_bit(INTERVAL_RECEIVED, &peer_req->i.flags);

drbd_alloc_page_chain(&peer_device->connection->transport,
&peer_req->page_chain, DIV_ROUND_UP(size, PAGE_SIZE), GFP_TRY);
if (!peer_req->page_chain.head) {
@@ -8587,20 +8608,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct pac
spin_unlock_irq(&connection->peer_reqs_lock);

atomic_add(pi->size >> 9, &device->rs_sect_ev);
err = drbd_submit_peer_request(peer_req);

if (err) {
drbd_err(device, "discard submit failed, triggering re-connect\n");
spin_lock_irq(&connection->peer_reqs_lock);
list_del(&peer_req->w.list);
list_del(&peer_req->recv_order);
spin_unlock_irq(&connection->peer_reqs_lock);

drbd_remove_peer_req_interval(peer_req);
drbd_free_peer_req(peer_req);
peer_req = NULL;
put_ldev(device);
}
drbd_conflict_submit_resync_request(peer_req);

/* No put_ldev() here. Gets called in drbd_endio_write_sec_final(). */
} else {
@@ -9701,6 +9709,8 @@ static int got_IsInSync(struct drbd_connection *connection, struct packet_info *

dec_rs_pending(peer_device);

set_bit(INTERVAL_RECEIVED, &peer_req->i.flags);

spin_lock_irq(&connection->peer_reqs_lock);
list_del(&peer_req->w.list);
spin_unlock_irq(&connection->peer_reqs_lock);
10 changes: 9 additions & 1 deletion drbd/drbd_req.c
@@ -462,6 +462,11 @@ void drbd_release_conflicts(struct drbd_device *device, struct drbd_interval *re
if (test_bit(INTERVAL_SUBMITTED, &i->flags))
continue;

/* If we are waiting for a reply from the peer, then there is
* no need to process the conflict. */
if (test_bit(INTERVAL_SENT, &i->flags) && !test_bit(INTERVAL_RECEIVED, &i->flags))
continue;

dynamic_drbd_dbg(device,
"%s %s request at %llus+%u after conflict with %llus+%u\n",
test_bit(INTERVAL_SUBMIT_CONFLICT_QUEUED, &i->flags) ? "Already queued" : "Queue",
@@ -2112,7 +2117,10 @@ void drbd_do_submit_conflict(struct work_struct *ws)

list_for_each_entry_safe(peer_req, peer_req_tmp, &resync_writes, submit_list) {
list_del_init(&peer_req->submit_list);
drbd_conflict_submit_resync_request(peer_req);
if (!test_bit(INTERVAL_SENT, &peer_req->i.flags))
drbd_conflict_send_resync_request(peer_req);
else
drbd_conflict_submit_resync_request(peer_req);
}

list_for_each_entry_safe(peer_req, peer_req_tmp, &resync_reads, submit_list) {