Fix bug in handling postgres COPY command and a few others #610

Merged (4 commits) on Sep 29, 2024

mostafa
Member

@mostafa mostafa commented Sep 29, 2024

Ticket(s)

Closes #533.

Description

After hours of debugging, I discovered that the issue stemmed from a few incorrect assumptions. The first concerned chunked reads: the size of the chunk just read from the connection should be compared to ReceiveChunkSize, rather than the total amount of data received up to that point. The second was that receiving zero data from the server should automatically close the connection; this has now been fixed.

The third assumption was that every request to PostgreSQL would elicit an immediate response, but the behavior of the COPY command proved otherwise. When a client issues a COPY command (COPY ... FROM STDIN) to the server, PostgreSQL replies with a CopyInResponse. If no errors occur, the client begins sending CopyData messages, which may span multiple requests, depending on the chunk size. After each CopyData message, however, the server does not send an acknowledgment or response; it simply waits. Only when the client sends a CopyDone message at the end of the data transmission does the server reply with CommandComplete and ReadyForQuery messages. This means the client can send multiple requests without receiving an immediate response for each one. The final ReadyForQuery response from the server signals that the client can proceed with the next request.

Ref: https://www.postgresql.org/docs/current/protocol-message-formats.html

To make sure the fix works as expected, I tested the SQL dump file referenced in the issue, and it works like a charm. I also tested GatewayD with pgbench and found that concurrent reads and writes to Server.connectionToProxyMap cause fatal errors that exit the process. I changed its type to sync.Map to avoid this in the future and make GatewayD more stable. Note that after the change the tps decreases and the latency increases, which is expected:

Before:

PGPASSWORD=postgres pgbench -M extended --transactions 100 --jobs 10 --client 99 -h localhost -p 15432 -U postgres postgres
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: extended
number of clients: 99
number of threads: 10
number of transactions per client: 100
number of transactions actually processed: 9900/9900
latency average = 188.036 ms
tps = 526.495434 (including connections establishing)
tps = 527.065595 (excluding connections establishing)

After:

PGPASSWORD=postgres pgbench -M extended --transactions 100 --jobs 10 --client 99 -h localhost -p 15432 -U postgres postgres
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: extended
number of clients: 99
number of threads: 10
number of transactions per client: 100
number of transactions actually processed: 9900/9900
latency average = 197.091 ms
tps = 502.307211 (including connections establishing)
tps = 502.959156 (excluding connections establishing)
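
For context on the sync.Map change mentioned above: unsynchronized concurrent writes to a plain Go map abort the process with an unrecoverable runtime error, whereas sync.Map is safe for concurrent use. The sketch below shows the access pattern only; `proxy` and `storeConcurrently` are hypothetical names, and the map here is a stand-in for Server.connectionToProxyMap rather than GatewayD's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// proxy is a hypothetical stand-in for the value stored per connection.
type proxy struct{ id int }

// storeConcurrently registers n connections from n goroutines at once and
// returns how many entries the map holds afterwards.
func storeConcurrently(n int) int {
	var connToProxy sync.Map // safe for concurrent Store/Load/Delete
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			connToProxy.Store(id, &proxy{id: id})
			// Values come back as `any`, so a type assertion is needed.
			if v, ok := connToProxy.Load(id); ok {
				_ = v.(*proxy)
			}
		}(i)
	}
	wg.Wait()

	count := 0
	connToProxy.Range(func(_, _ any) bool {
		count++
		return true
	})
	return count
}

func main() {
	fmt.Println(storeConcurrently(99))
}
```

The extra synchronization and type assertions are one plausible source of the small throughput drop between the two runs above.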

Direct connection to database (without using GatewayD):
Surprisingly, GatewayD performs better than a direct connection in this benchmark, which is due to the warm-up when booting up. GatewayD creates 100 connections to PostgreSQL and puts them in the queue, which in turn causes 100 backend processes to be created in PostgreSQL. Compared with the direct connection below, this speeds up transactions by more than 24% and decreases the average latency by almost 20%.

PGPASSWORD=postgres pgbench -M extended --transactions 100 --jobs 10 --client 99 -h localhost -p 5432 -U postgres postgres
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: extended
number of clients: 99
number of threads: 10
number of transactions per client: 100
number of transactions actually processed: 9900/9900
latency average = 244.993 ms
tps = 404.093614 (including connections establishing)
tps = 404.478112 (excluding connections establishing)
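
The percentages quoted for the three runs can be checked with a few lines of arithmetic. This is only a sketch: `pctChange` is a hypothetical helper, and the inputs are the "including connections establishing" tps and average latency figures copied from the pgbench output above.

```go
package main

import "fmt"

// pctChange returns the relative change from base to value, in percent.
func pctChange(base, value float64) float64 {
	return (value - base) / base * 100
}

func main() {
	// Figures copied from the pgbench runs (tps, latency average in ms).
	const (
		beforeTPS, beforeLatency = 526.495434, 188.036
		afterTPS, afterLatency   = 502.307211, 197.091
		directTPS, directLatency = 404.093614, 244.993
	)

	fmt.Printf("after vs before: tps %+.2f%%, latency %+.2f%%\n",
		pctChange(beforeTPS, afterTPS), pctChange(beforeLatency, afterLatency))
	fmt.Printf("after vs direct: tps %+.2f%%, latency %+.2f%%\n",
		pctChange(directTPS, afterTPS), pctChange(directLatency, afterLatency))
}
```

The second line corresponds to the "more than 24%" and "almost 20%" figures; the first shows the cost of the change itself, roughly 4.6% lower tps and 4.8% higher latency.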

CC @sh-soltanpour

Related PRs

N/A

Development Checklist

  • I have added a descriptive title to this PR.
  • I have squashed related commits together.
  • I have rebased my branch on top of the latest main branch.
  • I have performed a self-review of my own code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added docstring(s) to my code.
  • I have made corresponding changes to the documentation (docs).
  • I have updated docs using make gen-docs command.
  • I have added tests for my changes.
  • I have signed all the commits.

Legal Checklist

@mostafa mostafa self-assigned this Sep 29, 2024

github-actions bot commented Sep 29, 2024

Overview

Image reference ghcr.io/gatewayd-io/gatewayd:15c3ad0 gatewaydio/gatewayd:latest
- digest 2ff015d36565 aec16cb6bc1b
- tag 15c3ad0 latest
- provenance 4510a51
- vulnerabilities critical: 0 high: 0 medium: 0 low: 0 critical: 0 high: 0 medium: 0 low: 0
- platform linux/amd64 linux/amd64
- size 19 MB 17 MB (-2.2 MB)
- packages 136 132 (-4)
Base Image alpine:3 (also known as: 3.20, 3.20.3, latest) alpine:3.20 (also known as: 3, 3.20.3, latest)
- vulnerabilities critical: 0 high: 0 medium: 0 low: 0 critical: 0 high: 0 medium: 0 low: 0
Packages and Vulnerabilities (23 package changes and 0 vulnerability changes)
  • ➕ 1 packages added
  • ➖ 3 packages removed
  • ♾️ 19 packages changed
  • 110 packages unchanged
Changes for packages of type apk (3 changes)
Package   Version (ghcr.io/gatewayd-io/gatewayd:15c3ad0)   Version (gatewaydio/gatewayd:latest)
ca-certificates 20240705-r0
openssl 3.3.2-r0
pax-utils 1.3.7-r2
Changes for packages of type golang (20 changes)
Package   Version (ghcr.io/gatewayd-io/gatewayd:15c3ad0)   Version (gatewaydio/gatewayd:latest)
♾️ github.com/cyphar/filepath-securejoin 0.3.2 0.3.1
♾️ github.com/docker/docker 27.3.1+incompatible 27.2.1+incompatible
♾️ github.com/gatewayd-io/gatewayd (devel) 0.9.7
♾️ github.com/gatewayd-io/gatewayd-plugin-sdk 0.3.2 0.3.1
♾️ github.com/getsentry/sentry-go 0.29.0 0.28.1
♾️ github.com/hashicorp/yamux 0.1.2 0.1.1
github.com/imdario/mergo 0.3.16
♾️ github.com/jackc/pgx/v5 5.7.1 5.7.0
♾️ github.com/klauspost/compress 1.17.10 1.17.9
♾️ github.com/prometheus/client_golang 1.20.4 1.20.3
♾️ go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp 0.55.0 0.54.0
♾️ go.opentelemetry.io/otel 1.30.0 1.29.0
♾️ go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc 1.30.0 1.29.0
♾️ go.opentelemetry.io/otel/metric 1.30.0 1.29.0
♾️ go.opentelemetry.io/otel/sdk 1.30.0 1.29.0
♾️ go.opentelemetry.io/otel/trace 1.30.0 1.29.0
♾️ golang.org/x/exp 0.0.0-20240909161429-701f63a606c0 0.0.0-20240904232852-e7e105dedf7e
♾️ google.golang.org/genproto/googleapis/rpc 0.0.0-20240924160255-9d4c2d233b61 0.0.0-20240903143218-8af14fe29dc1
♾️ google.golang.org/grpc 1.67.0 1.66.0
♾️ stdlib go1.23.1 1.23.1

@mostafa mostafa changed the title from "Fix bug in postgres COPY command" to "Fix bug in handling postgres COPY command and a few others" on Sep 29, 2024
Resolved review threads (outdated): network/client.go (1), network/proxy.go (2)
Collaborator

@sinadarbouy sinadarbouy left a comment

LGTM🎉 Great catch and fix

@mostafa mostafa merged commit 9ec6b54 into main Sep 29, 2024
5 checks passed
@mostafa mostafa deleted the fix-copy-bug branch September 29, 2024 18:22
Linked issue: Postgres COPY causes timeouts and connection getting stuck transferring data
2 participants