Skip to content

GH-46193: [Flight][Format] Extend Flight Location URI Semantics #46194

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

zeroshade
Copy link
Member

@zeroshade zeroshade commented Apr 21, 2025

Rationale for this change

Updating the documentation in Flight.proto and Flight.rst to extend the semantics of the allowed Flight location URIs.

What changes are included in this PR?

Just documentation changes. Currently, none of the Arrow Flight implementations actually implement handling of the returned URIs beyond possibly parsing them and wrapping in a Location structure. It is left to the consumer to implement the logic of whether to re-use the same client or spin up a new client with the new location etc. to perform the DoGet request against. As such, there wasn't a need to make any code/library changes to accomodate this as part of this PR.

Copy link

⚠️ GitHub issue #46193 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member

kou commented Apr 21, 2025

@github-actions crossbow submit preview-docs

Copy link

Revision: 4cf4a1f

Submitted crossbow builds: ursacomputing/crossbow @ actions-e1c3b439ea

Task Status
preview-docs GitHub Actions

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rebase on main to fix the "preview-docs" job failure?

* In these cases, the following conventions should be followed by servers and clients:
*
* - Unless otherwise specified by the 'Content-Type' header of the response,
* a client should assume the response is an Arrow IPC Stream. Usage of an IANA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* a client should assume the response is an Arrow IPC Stream. Usage of an IANA
* a client should assume the response is an Arrow IPC Streaming format. Usage of an IANA

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantically this doesn't make much sense. I'll reword it a bit to use Streaming format if that's preferred

* - Unless otherwise specified by the 'Content-Type' header of the response,
* a client should assume the response is an Arrow IPC Stream. Usage of an IANA
* media type like 'application/octet-stream' should be assumed to be an Arrow
* IPC Stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* IPC Stream.
* IPC Streaming format.

+--------------------+------------------------+
| Location | URI Scheme |
+====================+========================+
| Object storage (1) | s3:, gcs:, abfs:, etc. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, let's fully list out the schemes instead of relying on etc

For instance: is GCS gcs, or gs? (Google itself uses gs). And Azure seems to have both abfs and abfss.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think there should be an exhaustive list of both protocols and formats supported, however large the list might be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with listing these explicitly is that we would ideally document the full allowed URI syntax (such as allowed parameters for each supported scheme). Do we want to do that?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the change is essentially about delegating data access to a different protocol, feels like it should be fine to delegate the particulars of URI syntax and such to wherever that protocol is defined as well. Unless you have some particular parameter in mind (of s3 for example) that doesn't make sense in the context of flight. Realistically clients wouldn't try to reimplement each protocol (?), rather they will integrate with existing tools that probably aim to cover the full spec regardless.

I think a major upside of having an exhaustive list would be to at least signal to clients what the goalpost should be. There should be some way to state that it's lot more likely to encounter s3/parquet duo in the response rather than ftp/orc for example. Disallowing combinations like ftp/orc altogether would be a simple way to achieve this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the change is essentially about delegating data access to a different protocol, feels like it should be fine to delegate the particulars of URI syntax and such to wherever that protocol is defined as well.

That's fine for those protocols that have a well-known URI standard. But many protocols such as S3 don't seem to, and ad-hoc S3 URI syntaxes may differ on various aspects.

For example, the Arrow C++ routine for parsing S3 URIs allows for a number of query parameters such as region, endpoint_override, tls_verify_certificates... Other S3 URI implementations may define other optional parameters.

Result<S3Options> S3Options::FromUri(const Uri& uri, std::string* out_path) {
S3Options options;
const auto bucket = uri.host();
auto path = uri.path();
if (bucket.empty()) {
if (!path.empty()) {
return Status::Invalid("Missing bucket name in S3 URI");
}
} else {
if (path.empty()) {
path = bucket;
} else {
if (path[0] != '/') {
return Status::Invalid("S3 URI should be absolute, not relative");
}
path = bucket + path;
}
}
if (out_path != nullptr) {
*out_path = std::string(internal::RemoveTrailingSlash(path));
}
std::unordered_map<std::string, std::string> options_map;
ARROW_ASSIGN_OR_RAISE(const auto options_items, uri.query_items());
for (const auto& kv : options_items) {
options_map.emplace(kv.first, kv.second);
}
const auto username = uri.username();
if (!username.empty()) {
options.ConfigureAccessKey(username, uri.password());
} else {
options.ConfigureDefaultCredentials();
}
// Prefer AWS service-specific endpoint url
auto s3_endpoint_env = arrow::internal::GetEnvVar(kAwsEndpointUrlS3EnvVar);
if (s3_endpoint_env.ok()) {
options.endpoint_override = *s3_endpoint_env;
} else {
auto endpoint_env = arrow::internal::GetEnvVar(kAwsEndpointUrlEnvVar);
if (endpoint_env.ok()) {
options.endpoint_override = *endpoint_env;
}
}
bool region_set = false;
for (const auto& kv : options_map) {
if (kv.first == "region") {
options.region = kv.second;
region_set = true;
} else if (kv.first == "scheme") {
options.scheme = kv.second;
} else if (kv.first == "endpoint_override") {
options.endpoint_override = kv.second;
} else if (kv.first == "allow_bucket_creation") {
ARROW_ASSIGN_OR_RAISE(options.allow_bucket_creation,
::arrow::internal::ParseBoolean(kv.second));
} else if (kv.first == "allow_bucket_deletion") {
ARROW_ASSIGN_OR_RAISE(options.allow_bucket_deletion,
::arrow::internal::ParseBoolean(kv.second));
} else if (kv.first == "tls_ca_file_path") {
options.tls_ca_file_path = kv.second;
} else if (kv.first == "tls_ca_dir_path") {
options.tls_ca_dir_path = kv.second;
} else if (kv.first == "tls_verify_certificates") {
ARROW_ASSIGN_OR_RAISE(options.tls_verify_certificates,
::arrow::internal::ParseBoolean(kv.second));
} else {
return Status::Invalid("Unexpected query parameter in S3 URI: '", kv.first, "'");
}
}
if (!region_set && !bucket.empty() && options.endpoint_override.empty()) {
// XXX Should we use a dedicated resolver with the given credentials?
ARROW_ASSIGN_OR_RAISE(options.region, ResolveS3BucketRegion(bucket));
}
return options;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, seems like delta sharing (which effectively does something very similar) already works like that. According to spec urls for add/remove files are exclusively https, although spec claims to support S3, ADLS, or GCS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of simplifying it that way.

@paleolimbot @lidavidm @pitrou what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can only comment at a high level, but I do think it seems reasonable to always respond with http to make it easier to implement clients. It seems like this would easy to relax in the future if a use case arises where this is blocking somebody from adopting Flight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also in favor of simplifying it in the way that @joellubi suggests.

Later, if the need arises, we can consider proposals of other supported URI schemes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 22, 2025
@zeroshade zeroshade force-pushed the flight-endpoint-uris branch from 40abb75 to 474d1bc Compare April 25, 2025 15:33
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 25, 2025
@zeroshade
Copy link
Member Author

@github-actions crossbow submit preview-docs

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 25, 2025
Copy link

Revision: 1a5a21d

Submitted crossbow builds: ursacomputing/crossbow @ actions-2cce99968f

Task Status
preview-docs GitHub Actions

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants