Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the filename type of DownloadObjectArgs and UploadObjectArgs #149

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

z-pc
Copy link

@z-pc z-pc commented Jun 24, 2024

Hi.
I have a problem with the filename attribute type of DownloadObjectArgs and UploadObjectArgs.
The current type is std::string, this leads to a file with utf-8 characters in the path that cannot read.
Example path: G:\\Tuấn Anh\tết.txt.
So, I suggest replacing std::string with std::filesystem::path
Thanks.

Lê Xuân Tuấn Anh added 2 commits June 24, 2024 14:37
@@ -213,12 +213,12 @@ error::Error DownloadObjectArgs::Validate() const {
if (error::Error err = ObjectReadArgs::Validate()) {
return err;
}
if (!utils::CheckNonEmptyString(filename)) {
if (!utils::CheckNonEmptyString(filename.u8string())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't filename.string() work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filename.string() may make errors with the pure utf-8 encoded path. Let's consider following:

#include <iostream>
#include <string>
#include <filesystem>

int main(int argc, char* argv[])
{
	try {
		std::filesystem::path fileStr = L"G:\\Tuấn Anh\\tết.txt";
		std::string test1 = fileStr.u8string();
		std::string test2 = fileStr.string(); // >>>>>>>>>>> throw exception
	}
	catch (const std::exception& ex) {
		std::cerr << ex.what() << std::endl;
		return 1;
	}

	return 0;
}

I tried again with filename.string(), the exception was thrown.
With minio::utils::Trim(...), I don't think there is a different in result with filename.string() or filename.u8string()

@kobalicek
Copy link
Contributor

I think this is not fixing the root cause of the problem though. The C++ std::filesystem is kinda weird when it comes to UTF-8 encoding. In the past char* and std::string were considered to be using local encoding (which meant using code-pages on Windows), and this haven't changed, just Linux and others went with UTF-8 for all locales, but not Windows (Windows has unicode-aware API that uses wchar_t instead).

Changing std::string to std::filesystem::path may look as a solution, but I think it's not the right one. Maybe using std::u8string instead of std::string would be much better, but then we will hit more problems like poor support for std::u8string in the rest of the C++ standard library.

Not sure what would be the best thing to do in this case, I would like to just tell C++'s filesystem that we have indeed an UTF-8 encoded string to avoid going through mbtowc.

@z-pc
Copy link
Author

z-pc commented Jun 25, 2024

@kobalicek
I agree with you, u8string() is not yet widely used. It can cause more string-handling problems in this repository, I think.
Above all, we will have big problems with std::fstream.open(...) if u8string() is used as standard.

Currently, I only use u8string() with the object attribute:

std::filesystem::path file = L"G:\\Tuấn Anh\\tết.txt";
  minio::s3::UploadObjectArgs args;
  args.bucket = "storage";
  args.object = file.filename().u8string();

I'm also not sure which is better. However, it seems we don't have much choice.

@kobalicek
Copy link
Contributor

kobalicek commented Jun 25, 2024

If you use L"G:\\Tuấn Anh\\tết.txt" it means you are using wchar_t* string and not char* string. wchar_t has a defined encoding by platform, but char doesn't always mean UTF-8 and the problem here is UTF-8 vs 8-bit locale used by Windows.

When it comes to Windows the library code would most likely use WideChar API under the hood, so using wchar_t string seems okay, but I think we should not expose wchar_t in public API as it's not a that common outside Windows (unless you use .NET/Java that exposes UTF-16 string as part of the runtime).

I'm still not sure about the best solution here, whether exposing std::filesystem in public API makes sense or not.

@harshavardhana
Copy link
Member

what does AWS CPP library do in this scenario? @kobalicek

@z-pc
Copy link
Author

z-pc commented Jun 26, 2024

I think we should not expose wchar_t in public API

I think so, too.
It's a hard change for current users.

@kobalicek
Copy link
Contributor

what does AWS CPP library do in this scenario? @kobalicek

I have no idea.

Usually when you want UTF-8 file naming on Windows you would convert your UTF-8 string to Windows WideChar (UTF-16) and use WinAPI unicode-aware functions to access the file. And here the problem is file names on Windows platform can even have invalid surrogate pairs, etc... as the API doesn't validate the names, it just wants a sequence of 16-bit chars (similarly Linux API just wants a sequence of bytes). This means that if you use a regular validating UTF-8 -> UTF-16 conversion it could refuse your "invalid" file names. And quite frankly, I have no idea how C++ std::filesystem even deals with that - could be implementation specific, etc...

BTW check this issue regarding path names on Windows:

This means that under windows we would want to use WTF-8 encoding instead of UTF-8:

That's why I don't really know the best way of solving this in minio-cpp. Ideally we should do the conversion ourselves to be sure, but honestly that just sounds like a lot of work (basically to write the code and have it tested). Depending on std::filesystem is another option, but like everything in C++ the standard could just say this is implementation specific, which would mean that it's not guaranteed to work.

@balamurugana
Copy link
Member

Looks like there is no simple solution available. Why can't we skip supporting such special use case in {Upload,Download}Object methods? There is always a working APIs available like {Get, Put}Object methods. {Upload,Download}Object are just wrapper to these methods with additional file open/close. Let the special users use {Get,Put}Object methods.

@harshavardhana
Copy link
Member

Looks like there is no simple solution available. Why can't we skip supporting such special use case in {Upload,Download}Object methods? There is always a working APIs available like {Get, Put}Object methods. {Upload,Download}Object are just wrapper to these methods with additional file open/close. Let the special users use {Get,Put}Object methods.

yeah since this is a language specific problem, we can make this assumption and also add a doc line on what to do when a user wishes to deal with UTF-8.

@z-pc
Copy link
Author

z-pc commented Jun 27, 2024

Using Get|Put is fine, but have a lot of work to do for self-handle.
At first, I just thought Upload|DownloadObject needs a parameter type to pass to std::fstream(...), and std::filesystem is well supported.

@balamurugana
Copy link
Member

Using Get|Put is fine, but have a lot of work to do for self-handle. At first, I just thought Upload|DownloadObject needs a parameter type to pass to std::fstream(...), and std::filesystem is well supported.

  1. std::fstream(...) is already supported by {Get,Put}Object APIs. There is no point to support in {Upload,Download}Object APIs.
  2. I am unable to understand why usage of {Get,Put}Object is hard.
std::ifstream file(filename);
// construct PutObjectArgs
PutObjectResponse resp = PutObject(args);
file.close();
// Use resp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants