GH-46193: [Flight][Format] Extend Flight Location URI Semantics #46194
base: main
Conversation
@github-actions crossbow submit preview-docs

Revision: 4cf4a1f Submitted crossbow builds: ursacomputing/crossbow @ actions-e1c3b439ea
Could you rebase on main to fix the "preview-docs" job failure?
format/Flight.proto
Outdated

 * In these cases, the following conventions should be followed by servers and clients:
 *
 * - Unless otherwise specified by the 'Content-Type' header of the response,
 *   a client should assume the response is an Arrow IPC Stream. Usage of an IANA
Suggested change:
- * a client should assume the response is an Arrow IPC Stream. Usage of an IANA
+ * a client should assume the response is an Arrow IPC Streaming format. Usage of an IANA
Semantically this doesn't make much sense. I'll reword it a bit to use "Streaming format" if that's preferred.
format/Flight.proto
Outdated

 * - Unless otherwise specified by the 'Content-Type' header of the response,
 *   a client should assume the response is an Arrow IPC Stream. Usage of an IANA
 *   media type like 'application/octet-stream' should be assumed to be an Arrow
 *   IPC Stream.
Suggested change:
- *   IPC Stream.
+ *   IPC Streaming format.
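The convention being discussed can be sketched as a small client-side helper. This is a hypothetical illustration, not code from this PR: the function name and the specific media-type strings (other than 'application/octet-stream', which the proposed wording mentions) are assumptions.

```python
# Hypothetical sketch of the proposed client-side convention: without a more
# specific Content-Type, the response body is assumed to be an Arrow IPC stream.

ARROW_IPC_STREAM = "arrow-ipc-stream"
PARQUET = "parquet"

def format_for_content_type(content_type=None):
    """Return the format a client should assume for a response body."""
    if content_type is None:
        # No Content-Type header: default to the Arrow IPC streaming format.
        return ARROW_IPC_STREAM
    # Ignore parameters such as '; charset=...' and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("application/octet-stream",
                      "application/vnd.apache.arrow.stream"):
        # A generic octet-stream is treated as an Arrow IPC stream
        # per the proposed wording.
        return ARROW_IPC_STREAM
    if media_type == "application/vnd.apache.parquet":
        return PARQUET
    # Unknown specific media type: surface it to the caller.
    raise ValueError(f"unsupported media type: {media_type}")
```

The point of the sketch is only that the default branch (no header, or a generic media type) must be well defined for clients to interoperate.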
docs/source/format/Flight.rst
Outdated

+--------------------+------------------------+
| Location           | URI Scheme             |
+====================+========================+
| Object storage (1) | s3:, gcs:, abfs:, etc. |
+--------------------+------------------------+
IMO, let's fully list out the schemes instead of relying on "etc."
For instance: is GCS gcs, or gs? (Google itself uses gs.) And Azure seems to have both abfs and abfss.
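The ambiguity is easy to demonstrate with standard URI parsing: each spelling is a distinct scheme as far as a generic parser is concerned, so a table that ends in "etc." leaves clients guessing which spellings they are expected to support. A small illustration (not from the PR):

```python
# Illustration: urlsplit treats gs://, gcs://, abfs://, and abfss:// as four
# unrelated schemes; nothing in URI syntax connects them to the same backend.
from urllib.parse import urlsplit

uris = [
    "s3://bucket/key",
    "gs://bucket/key",       # Google's own tooling uses gs://
    "gcs://bucket/key",      # some libraries accept gcs:// instead
    "abfs://container@account.dfs.core.windows.net/path",
    "abfss://container@account.dfs.core.windows.net/path",  # TLS variant
]
schemes = [urlsplit(u).scheme for u in uris]
print(schemes)  # → ['s3', 'gs', 'gcs', 'abfs', 'abfss']
```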
+1 I think there should be an exhaustive list of both protocols and formats supported, however large the list might be.
The problem with listing these explicitly is that we would ideally document the full allowed URI syntax (such as allowed parameters for each supported scheme). Do we want to do that?
Since the change is essentially about delegating data access to a different protocol, it feels like it should be fine to delegate the particulars of URI syntax and such to wherever that protocol is defined as well. Unless you have some particular parameter in mind (of s3, for example) that doesn't make sense in the context of Flight. Realistically, clients wouldn't try to reimplement each protocol (?); rather, they will integrate with existing tools that probably aim to cover the full spec regardless.
I think a major upside of having an exhaustive list would be to at least signal to clients what the goalpost should be. There should be some way to state that it's a lot more likely to encounter the s3/parquet duo in a response than ftp/orc, for example. Disallowing combinations like ftp/orc altogether would be a simple way to achieve this.
Since the change is essentially about delegating data access to a different protocol, feels like it should be fine to delegate the particulars of URI syntax and such to wherever that protocol is defined as well.
That's fine for those protocols that have a well-known URI standard. But many protocols such as S3 don't seem to, and ad-hoc S3 URI syntaxes may differ on various aspects.
For example, the Arrow C++ routine for parsing S3 URIs allows for a number of query parameters such as region, endpoint_override, tls_verify_certificates... Other S3 URI implementations may define other optional parameters.
arrow/cpp/src/arrow/filesystem/s3fs.cc
Lines 349 to 428 in d2ddee6
Result<S3Options> S3Options::FromUri(const Uri& uri, std::string* out_path) {
  S3Options options;

  const auto bucket = uri.host();
  auto path = uri.path();
  if (bucket.empty()) {
    if (!path.empty()) {
      return Status::Invalid("Missing bucket name in S3 URI");
    }
  } else {
    if (path.empty()) {
      path = bucket;
    } else {
      if (path[0] != '/') {
        return Status::Invalid("S3 URI should be absolute, not relative");
      }
      path = bucket + path;
    }
  }
  if (out_path != nullptr) {
    *out_path = std::string(internal::RemoveTrailingSlash(path));
  }

  std::unordered_map<std::string, std::string> options_map;
  ARROW_ASSIGN_OR_RAISE(const auto options_items, uri.query_items());
  for (const auto& kv : options_items) {
    options_map.emplace(kv.first, kv.second);
  }

  const auto username = uri.username();
  if (!username.empty()) {
    options.ConfigureAccessKey(username, uri.password());
  } else {
    options.ConfigureDefaultCredentials();
  }

  // Prefer AWS service-specific endpoint url
  auto s3_endpoint_env = arrow::internal::GetEnvVar(kAwsEndpointUrlS3EnvVar);
  if (s3_endpoint_env.ok()) {
    options.endpoint_override = *s3_endpoint_env;
  } else {
    auto endpoint_env = arrow::internal::GetEnvVar(kAwsEndpointUrlEnvVar);
    if (endpoint_env.ok()) {
      options.endpoint_override = *endpoint_env;
    }
  }

  bool region_set = false;
  for (const auto& kv : options_map) {
    if (kv.first == "region") {
      options.region = kv.second;
      region_set = true;
    } else if (kv.first == "scheme") {
      options.scheme = kv.second;
    } else if (kv.first == "endpoint_override") {
      options.endpoint_override = kv.second;
    } else if (kv.first == "allow_bucket_creation") {
      ARROW_ASSIGN_OR_RAISE(options.allow_bucket_creation,
                            ::arrow::internal::ParseBoolean(kv.second));
    } else if (kv.first == "allow_bucket_deletion") {
      ARROW_ASSIGN_OR_RAISE(options.allow_bucket_deletion,
                            ::arrow::internal::ParseBoolean(kv.second));
    } else if (kv.first == "tls_ca_file_path") {
      options.tls_ca_file_path = kv.second;
    } else if (kv.first == "tls_ca_dir_path") {
      options.tls_ca_dir_path = kv.second;
    } else if (kv.first == "tls_verify_certificates") {
      ARROW_ASSIGN_OR_RAISE(options.tls_verify_certificates,
                            ::arrow::internal::ParseBoolean(kv.second));
    } else {
      return Status::Invalid("Unexpected query parameter in S3 URI: '", kv.first, "'");
    }
  }

  if (!region_set && !bucket.empty() && options.endpoint_override.empty()) {
    // XXX Should we use a dedicated resolver with the given credentials?
    ARROW_ASSIGN_OR_RAISE(options.region, ResolveS3BucketRegion(bucket));
  }

  return options;
}
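The concern generalizes beyond C++: any client parsing the same ad-hoc s3:// syntax has to decide which query parameters it recognizes. A minimal Python rendering of the parameter handling in the excerpt above (a sketch for illustration, not the Arrow API; the recognized parameter names are copied from the C++ code):

```python
# Sketch: split an ad-hoc s3:// URI into (bucket, key, options), rejecting
# query parameters outside the set the Arrow C++ excerpt above recognizes.
# Other S3 URI implementations may accept a different set.
from urllib.parse import parse_qsl, urlsplit

KNOWN_PARAMS = {
    "region", "scheme", "endpoint_override",
    "allow_bucket_creation", "allow_bucket_deletion",
    "tls_ca_file_path", "tls_ca_dir_path", "tls_verify_certificates",
}

def parse_s3_uri(uri):
    """Return (bucket, key, options); raise on unknown query parameters."""
    parts = urlsplit(uri)
    options = dict(parse_qsl(parts.query))
    unknown = set(options) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"Unexpected query parameter in S3 URI: {sorted(unknown)}")
    return parts.netloc, parts.path.lstrip("/"), options
```

Two implementations that disagree on KNOWN_PARAMS will reject each other's URIs, which is exactly why delegating the syntax "to wherever the protocol is defined" is hard for schemes without a standard.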
FWIW, it seems like Delta Sharing (which effectively does something very similar) already works like that: according to its spec, the URLs for add/remove files are exclusively https, although the spec claims to support S3, ADLS, or GCS.
I'd be in favor of simplifying it that way.
@paleolimbot @lidavidm @pitrou what do you think?
I can only comment at a high level, but I do think it seems reasonable to always respond with http to make it easier to implement clients. It seems like this would be easy to relax in the future if a use case arises where this is blocking somebody from adopting Flight.
I'm also in favor of simplifying it in the way that @joellubi suggests.
Later, if the need arises, we can consider proposals of other supported URI schemes.
Seems reasonable to me.
Co-authored-by: Sutou Kouhei <[email protected]>
40abb75 to 474d1bc
@github-actions crossbow submit preview-docs

Revision: 1a5a21d Submitted crossbow builds: ursacomputing/crossbow @ actions-2cce99968f
Rationale for this change
Updating the documentation in Flight.proto and Flight.rst to extend the semantics of the allowed Flight location URIs.
What changes are included in this PR?
Just documentation changes. Currently, none of the Arrow Flight implementations actually implement handling of the returned URIs beyond possibly parsing them and wrapping them in a Location structure. It is left to the consumer to implement the logic of whether to reuse the same client or spin up a new client with the new location, etc., to perform the DoGet request against. As such, there wasn't a need to make any code/library changes to accommodate this as part of this PR.
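The consumer-side logic described above can be sketched roughly as dispatch on the URI scheme of each endpoint location. This is a hypothetical illustration, not code from any Flight implementation; the handler names are invented, while the grpc+tcp/grpc+tls scheme spellings follow Arrow Flight's documented location conventions.

```python
# Hypothetical sketch of consumer-side dispatch over a FlightEndpoint's
# location list: reuse or spawn a Flight client for gRPC schemes, delegate
# http(s) locations to a plain HTTP fetch, skip anything unrecognized.
from urllib.parse import urlsplit

GRPC_SCHEMES = {"grpc", "grpc+tcp", "grpc+tls", "grpc+unix"}
HTTP_SCHEMES = {"http", "https"}

def choose_handler(location_uris):
    """Pick the first location with a scheme we know how to handle.

    Returns (handler_name, uri); raises if no location is usable.
    """
    for uri in location_uris:
        scheme = urlsplit(uri).scheme
        if scheme in GRPC_SCHEMES:
            return ("flight-do-get", uri)  # perform DoGet with a Flight client
        if scheme in HTTP_SCHEMES:
            return ("http-fetch", uri)     # plain HTTP(S) download
    raise ValueError("no usable location in FlightEndpoint")
```

Under the "https only" simplification discussed above, the GRPC_SCHEMES branch would remain for ordinary Flight endpoints while every delegated-access location falls into the HTTP branch.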