Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More string trimming capabilities #587

Open
richard-viney opened this issue May 11, 2024 · 19 comments · May be fixed by #696
Open

More string trimming capabilities #587

richard-viney opened this issue May 11, 2024 · 19 comments · May be fixed by #696

Comments

@richard-viney
Copy link
Contributor

The existing string.trim, string.trim_left and string.trim_right functions currently only trim whitespace.

Could we add equivalent trimming functions that take the codepoints to be trimmed as an input, similar to Erlang's string.trim/3?

E.g.

import gleam/list

/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(s: String, codepoints: List(UtfCodepoint)) -> String {
  string_trim(s, Both, codepoints_to_ints(codepoints))
}

/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_left(s: String, codepoints: List(UtfCodepoint)) -> String {
  string_trim(s, Leading, codepoints_to_ints(codepoints))
}

/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_right(
  s: String,
  codepoints: List(UtfCodepoint),
) -> String {
  string_trim(s, Trailing, codepoints_to_ints(codepoints))
}

fn codepoints_to_ints(codepoints: List(UtfCodepoint)) {
  codepoints
  |> list.map(string.utf_codepoint_to_int)
}

type Direction {
  Leading
  Trailing
  Both
}

@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(a: String, b: Direction, c: List(Int)) -> String

There are other ways to slice this, such as taking a String as the final parameter and using its codepoints as the ones to trim, rather than a List(UtfCodepoint).

I haven't looked into what the JavaScript implementation would be, but can do so.

On a related note, my understanding is that the existing methods with the words "left" and "right" in them assume left-to-right reading order, which doesn't hold for all languages (e.g. Arabic, Hebrew). Could consider using "start" and "end" instead, or adding these as aliases.

@lpil
Copy link
Member

lpil commented May 13, 2024

Sounds good, but we'll need to think of some other name as the ones used in the code example don't fit with the convention used in this repo.

Changing left and right for start and end sounds good.

@richard-viney
Copy link
Contributor Author

richard-viney commented May 20, 2024

What names would you suggest?

Exposing the existing Direction type would mean only one public function is added:

@target(erlang)
pub fn trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: String,
) -> String {
  let codepoint_ints =
    codepoints
    |> to_utf_codepoints
    |> list.map(utf_codepoint_to_int)

  do_trim_codepoints(string, direction, codepoint_ints)
}

@target(erlang)
@external(erlang, "string", "trim")
pub fn do_trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: List(Int),
) -> String

@target(javascript)
pub fn do_trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: List(Int),
) -> String {
  todo
}

@lpil
Copy link
Member

lpil commented May 20, 2024

We will not expose the direction type, it needs to be different functions.

@inoas
Copy link
Contributor

inoas commented Jul 18, 2024

maybe one of these

pub fn trim_codepoints(
  string: String
) { todo }


// and any pair:


pub fn trim_start_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_end_codepoints(
  string: String,
  by: String,
) { todo }


pub fn trim_head_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_tail_codepoints(
  string: String,
  by: String,
) { todo }


pub fn trim_beginning_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_ending_codepoints(
  string: String,
  by: String,
) { todo }

@RobiFerentz
Copy link

Has anyone started working on this?

I'm new to gleam but I'd like to try implementing this.

pub fn trim_codepoints_start(str: String, by: String) -> String {
  case starts_with(str, by) {
    True -> drop_left(str, length(by)) |> trim_codepoints_start(by)
    False -> str
  }
}

The above implementation seems to work, but I'm pretty sure there's a more performant way to do this, I'm just not used to thinking in pure functional terms.

@richard-viney
Copy link
Contributor Author

Yep I'd still like to get this through. We need to:

  1. Agree on the naming for the functions.
  2. Do the JavaScript implementation. I'm happy to take a look at this.

The code below implements the three proposed trim functions for the Erlang target via string:trim/3. JavaScript implementation still pending as mentioned above.

import gleam/list
import gleam/string

/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Both, codepoint_ints)
}

/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_start(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Leading, codepoint_ints)
}

/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_end(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Trailing, codepoint_ints)
}

fn string_to_codepoint_ints(string: String) -> List(Int) {
  string
  |> string.to_utf_codepoints
  |> list.map(string.utf_codepoint_to_int)
}

type Direction {
  Leading
  Trailing
  Both
}

@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(string: String, dir: Direction, characters: List(Int)) -> String

@lpil
Copy link
Member

lpil commented Sep 1, 2024

Are we sure we want to be trimming codepoints and not graphemes? How would we ensure that the string is still valid unicode when removing codepoints?

@richard-viney
Copy link
Contributor Author

richard-viney commented Sep 8, 2024

Looking at a few other languages, Erlang's string:trim/3 and Rust's string.trim_end_matches both operate on codepoints, i.e. they allow invalid graphemes to be created by the trimming operation. JavaScript doesn't have an equivalent configurable trim, but if you replace a specific codepoint with "" using a regex you get the same end result, i.e. regex replace doesn't respect graphemes.

I had previously thought that Gleam strings were defined to be valid UTF-8 binaries, but hadn't realised this extended to also being valid Unicode graphemes. Is that the case? If so, wouldn't the following lines need to error because they create a string with an invalid grapheme? (U+0301 is the 'Combining Acute Accent' as it modifies the preceding character):

let invalid_grapheme = "\u{0301}"
let invalid_grapheme = <<0xCC, 0x81>> |> bit_array.to_string

I tried a couple of other (what I believe to be) invalid graphemes and none were rejected from being turned into Gleam strings.

Either way, this trimming functionality takes a List(String) of graphemes to be trimmed off the input string, but if String isn't guaranteed to only contain valid graphemes then the result of the trim can't strictly be guaranteed to either.

This would look something like:

/// Removes all occurrences of the specified graphemes from the start and
/// end of a `String`.
///
pub fn trim_graphemes(string: String, graphemes: List(String)) -> String {
  todo
}

/// Removes all occurrences of the specified graphemes from the start of a
/// `String`.
///
pub fn trim_graphemes_start(string: String, graphemes: List(String)) -> String {
  todo
}

/// Removes all occurrences of the specified graphemes from the end of a
/// `String`.
///
pub fn trim_graphemes_end(string: String, graphemes: List(String)) -> String {
  todo
}

@lpil
Copy link
Member

lpil commented Sep 10, 2024

I don't really understand what the best behaviour is here. Is there a use case that motivates making it easy to have invalid graphemes?

@richard-viney
Copy link
Contributor Author

No specific use case that I'm aware of. In practice most trimming is probably not done using codepoints that can be part of multi-codepoint graphemes, hence why other languages tend to trim just based on codepoints.

Maybe it's better to make the graphemes argument a String instead of a List(String):

pub fn trim_graphemes(string: String, graphemes: String) -> String
pub fn trim_graphemes_start(string: String, graphemes: String) -> String
pub fn trim_graphemes_end(string: String, graphemes: String) -> String

The implementation then calls string.to_graphemes() on the graphemes argument and trims using that.

@richard-viney
Copy link
Contributor Author

Some alternative names:

pub fn trim_with(string: String, graphemes: String) -> String
pub fn trim_start_with(string: String, graphemes: String) -> String
pub fn trim_end_with(string: String, graphemes: String) -> String
pub fn trim_using(string: String, graphemes: String) -> String
pub fn trim_start_using(string: String, graphemes: String) -> String
pub fn trim_end_using(string: String, graphemes: String) -> String

Or could copy the Rust naming:

pub fn trim_matches(string: String, graphemes: String) -> String
pub fn trim_start_matches(string: String, graphemes: String) -> String
pub fn trim_end_matches(string: String, graphemes: String) -> String

@ollien
Copy link

ollien commented Oct 6, 2024

I've been poking at this, and just to throw a name into the ring, I've been liking trim_chars; I feel like trim_matches implies it takes a regexp, but that's just me. I'm definitely flexible on the name.

I assume we'd want to deprecate trim_left and trim_right and create aliases for consistency?

@richard-viney
Copy link
Contributor Author

Thanks for looking into this.

I don't know if introducing the term 'char' is desirable when 'codepoint' and 'grapheme' are already in use and are well-defined?

@ollien
Copy link

ollien commented Oct 6, 2024

Yeah that's reasonable. I think I'll take with, if that's ok with you!

@ollien
Copy link

ollien commented Oct 7, 2024

I had posted a PR before having read this issue thoroughly (I had been working on it before I found the issue actually), so I missed a couple things here! Apologies for not addressing the concerns more thoroughly, first.

  1. I think trimming on graphemes is fine, but perhaps a bit of a challenge. I suspect the erlang implementation will not respect this (not at a PC to check). Whether or not that's worth the complexity of having to re-implement the behavior in the stdlib, I'm not sure. It definitely feels like a nice behavior, though.
  2. I propose we take on the name trim_start_with / trim_end_with. IMO it's the clearest of the suggested names so far, but I don't care too much necessarily
  3. It's been suggested implicitly, but I'd like to propose more formally that these two functions are shipped with a corresponding trim_with, that handles bidirectional trimming. It seems odd to not include it, but to include the left/right variants.
  4. This, I feel less strongly about, but to echo my question above, we might consider deprecating trim_left/trim_right, and rename them to trim_start/trim_end` for consistency. Not sure how y'all feel about this though.

@lpil
Copy link
Member

lpil commented Oct 7, 2024

Those names are not acceptable in the stdlib as that's not the convention it uses.

We will be deprecating trim left and right.

@ollien
Copy link

ollien commented Oct 7, 2024

@lpil is there a name you'd prefer? I'm happy to use whatever name you think is conventional.

@lpil
Copy link
Member

lpil commented Oct 7, 2024

I do not have any firm ideas for a name, but I am also not convinced that these functions are worthwhile additions.

@ollien
Copy link

ollien commented Oct 7, 2024

I'll admit, it's definitely more thorny than I thought. Originally I assumed this was so uncontroversial I'd take it upon myself to implement 😅. Text encoding is always hairy.

I do think that this is a common enough need to warrant its inclusion in the standard library. I can't say I use this functionality commonly, but it is very helpful when I need it (usually some kind of parsing case).

All of that said, I'm happy to follow your lead on this. If there's something you need to help move this forward, I'm happy to do what I can. I also understand the burden of needing to maintain what outside contributors submit, so I would also understand if this juice wasn't worth the squeeze.

I'll think on names, for now; or maybe someone else has one that will stick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants