More string trimming capabilities #587

richard-viney · 2024-05-11T03:11:12Z

The existing string.trim, string.trim_left and string.trim_right functions currently only trim whitespace.

Could we add equivalent trimming functions that take the codepoints to be trimmed as an input, similar to Erlang's string.trim/3?

E.g.

import gleam/list

/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(s: String, codepoints: List(UtfCodepoint)) -> String {
  string_trim(s, Both, codepoints_to_ints(codepoints))
}

/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_left(s: String, codepoints: List(UtfCodepoint)) -> String {
  string_trim(s, Leading, codepoints_to_ints(codepoints))
}

/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_right(
  s: String,
  codepoints: List(UtfCodepoint),
) -> String {
  string_trim(s, Trailing, codepoints_to_ints(codepoints))
}

fn codepoints_to_ints(codepoints: List(UtfCodepoint)) {
  codepoints
  |> list.map(string.utf_codepoint_to_int)
}

type Direction {
  Leading
  Trailing
  Both
}

@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(a: String, b: Direction, c: List(Int)) -> String

There are other ways to slice this, such as taking a String as the final parameter and using its codepoints as the ones to trim, rather than a List(UtfCodepoint).

I haven't looked into what the JavaScript implementation would be, but can do so.

On a related note, my understanding is that the existing methods with the words "left" and "right" in them assume left-to-right reading order, which doesn't hold for all languages (e.g. Arabic, Hebrew). Could consider using "start" and "end" instead, or adding these as aliases.

The text was updated successfully, but these errors were encountered:

lpil · 2024-05-13T13:54:40Z

Sounds good, but we'll need to think of some other name as the ones used in the code example don't fit with the convention used in this repo.

Changing left and right for start and end sounds good.

richard-viney · 2024-05-20T01:48:08Z

What names would you suggest?

Exposing the existing Direction type would mean only one public function is added:

@target(erlang)
pub fn trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: String,
) -> String {
  let codepoint_ints =
    codepoints
    |> to_utf_codepoints
    |> list.map(utf_codepoint_to_int)

  do_trim_codepoints(string, direction, codepoint_ints)
}

@target(erlang)
@external(erlang, "string", "trim")
pub fn do_trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: List(Int),
) -> String

@target(javascript)
pub fn do_trim_codepoints(
  string: String,
  direction: Direction,
  codepoints: List(Int),
) -> String {
  todo
}

lpil · 2024-05-20T13:24:54Z

We will not expose the direction type, it needs to be different functions.

inoas · 2024-07-18T19:07:44Z

maybe one of these

pub fn trim_codepoints(
  string: String
) { todo }


// and any pair:


pub fn trim_start_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_end_codepoints(
  string: String,
  by: String,
) { todo }


pub fn trim_head_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_tail_codepoints(
  string: String,
  by: String,
) { todo }


pub fn trim_beginning_codepoints(
  string: String,
  by: String,
) { todo }

pub fn trim_ending_codepoints(
  string: String,
  by: String,
) { todo }

RobiFerentz · 2024-08-31T16:26:31Z

Has anyone started working on this?

I'm new to gleam but I'd like to try implementing this.

pub fn trim_codepoints_start(str: String, by: String) -> String {
  case starts_with(str, by) {
    True -> drop_left(str, length(by)) |> trim_codepoints_start(by)
    False -> str
  }
}

The above implementation seems to work, but I'm pretty sure there's a more performant way to do this, I'm just not used to thinking in pure functional terms.

richard-viney · 2024-09-01T06:44:04Z

Yep I'd still like to get this through. We need to:

Agree on the naming for the functions.
Do the JavaScript implementation. I'm happy to take a look at this.

The code below implements the three proposed trim functions for the Erlang target via string:trim/3. JavaScript implementation still pending as mentioned above.

import gleam/list
import gleam/string

/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Both, codepoint_ints)
}

/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_start(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Leading, codepoint_ints)
}

/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_end(string: String, codepoints: String) -> String {
  let codepoint_ints = codepoints |> string_to_codepoint_ints

  string_trim(string, Trailing, codepoint_ints)
}

fn string_to_codepoint_ints(string: String) -> List(Int) {
  string
  |> string.to_utf_codepoints
  |> list.map(string.utf_codepoint_to_int)
}

type Direction {
  Leading
  Trailing
  Both
}

@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(string: String, dir: Direction, characters: List(Int)) -> String

lpil · 2024-09-01T12:45:51Z

Are we sure we want to be trimming codepoints and not graphemes? How would we ensure that the string is still valid unicode when removing codepoints?

richard-viney · 2024-09-08T02:38:08Z

Looking at a few other languages, Erlang's string:trim/3 and Rust's string.trim_end_matches both operate on codepoints, i.e. they allow invalid graphemes to be created by the trimming operation. JavaScript doesn't have an equivalent configurable trim, but if you replace a specific codepoint with "" using a regex you get the same end result, i.e. regex replace doesn't respect graphemes.

I had previously thought that Gleam strings were defined to be valid UTF-8 binaries, but hadn't realised this extended to also being valid Unicode graphemes. Is that the case? If so, wouldn't the following lines need to error because they create a string with an invalid grapheme? (U+0301 is the 'Combining Acute Accent' as it modifies the preceding character):

let invalid_grapheme = "\u{0301}"
let invalid_grapheme = <<0xCC, 0x81>> |> bit_array.to_string

I tried a couple of other (what I believe to be) invalid graphemes and none were rejected from being turned into Gleam strings.

Either way, this trimming functionality takes a List(String) of graphemes to be trimmed off the input string, but if String isn't guaranteed to only contain valid graphemes then the result of the trim can't strictly be guaranteed to either.

This would look something like:

/// Removes all occurrences of the specified graphemes from the start and
/// end of a `String`.
///
pub fn trim_graphemes(string: String, graphemes: List(String)) -> String {
  todo
}

/// Removes all occurrences of the specified graphemes from the start of a
/// `String`.
///
pub fn trim_graphemes_start(string: String, graphemes: List(String)) -> String {
  todo
}

/// Removes all occurrences of the specified graphemes from the end of a
/// `String`.
///
pub fn trim_graphemes_end(string: String, graphemes: List(String)) -> String {
  todo
}

lpil · 2024-09-10T15:48:58Z

I don't really understand what the best behaviour is here. Is there a use case that motivates making it easy to have invalid graphemes?

richard-viney · 2024-09-11T02:56:12Z

No specific use case that I'm aware of. In practice most trimming is probably not done using codepoints that can be part of multi-codepoint graphemes, hence why other languages tend to trim just based on codepoints.

Maybe it's better to make the graphemes argument a String instead of a List(String):

pub fn trim_graphemes(string: String, graphemes: String) -> String
pub fn trim_graphemes_start(string: String, graphemes: String) -> String
pub fn trim_graphemes_end(string: String, graphemes: String) -> String

The implementation then calls string.to_graphemes() on the graphemes argument and trims using that.

richard-viney · 2024-09-11T04:41:07Z

Some alternative names:

pub fn trim_with(string: String, graphemes: String) -> String
pub fn trim_start_with(string: String, graphemes: String) -> String
pub fn trim_end_with(string: String, graphemes: String) -> String

pub fn trim_using(string: String, graphemes: String) -> String
pub fn trim_start_using(string: String, graphemes: String) -> String
pub fn trim_end_using(string: String, graphemes: String) -> String

Or could copy the Rust naming:

pub fn trim_matches(string: String, graphemes: String) -> String
pub fn trim_start_matches(string: String, graphemes: String) -> String
pub fn trim_end_matches(string: String, graphemes: String) -> String

ollien · 2024-10-06T16:31:40Z

I've been poking at this, and just to throw a name into the ring, I've been liking trim_chars; I feel like trim_matches implies it takes a regexp, but that's just me. I'm definitely flexible on the name.

I assume we'd want to deprecate trim_left and trim_right and create aliases for consistency?

richard-viney · 2024-10-06T22:54:25Z

Thanks for looking into this.

I don't know if introducing the term 'char' is desirable when 'codepoint' and 'grapheme' are already in use and are well-defined?

ollien · 2024-10-06T23:35:51Z

Yeah that's reasonable. I think I'll take with, if that's ok with you!

ollien · 2024-10-07T19:32:33Z

I had posted a PR before having read this issue thoroughly (I had been working on it before I found the issue actually), so I missed a couple things here! Apologies for not addressing the concerns more thoroughly, first.

I think trimming on graphemes is fine, but perhaps a bit of a challenge. I suspect the erlang implementation will not respect this (not at a PC to check). Whether or not that's worth the complexity of having to re-implement the behavior in the stdlib, I'm not sure. It definitely feels like a nice behavior, though.
I propose we take on the name trim_start_with / trim_end_with. IMO it's the clearest of the suggested names so far, but I don't care too much necessarily
It's been suggested implicitly, but I'd like to propose more formally that these two functions are shipped with a corresponding trim_with, that handles bidirectional trimming. It seems odd to not include it, but to include the left/right variants.
This, I feel less strongly about, but to echo my question above, we might consider deprecating trim_left/trim_right, and rename them to trim_start/trim_end` for consistency. Not sure how y'all feel about this though.

lpil · 2024-10-07T20:13:27Z

Those names are not acceptable in the stdlib as that's not the convention it uses.

We will be deprecating trim left and right.

ollien · 2024-10-07T20:16:15Z

@lpil is there a name you'd prefer? I'm happy to use whatever name you think is conventional.

lpil · 2024-10-07T20:20:00Z

I do not have any firm ideas for a name, but I am also not convinced that these functions are worthwhile additions.

ollien · 2024-10-07T20:26:16Z

I'll admit, it's definitely more thorny than I thought. Originally I assumed this was so uncontroversial I'd take it upon myself to implement 😅. Text encoding is always hairy.

I do think that this is a common enough need to warrant its inclusion in the standard library. I can't say I use this functionality commonly, but it is very helpful when I need it (usually some kind of parsing case).

All of that said, I'm happy to follow your lead on this. If there's something you need to help move this forward, I'm happy to do what I can. I also understand the burden of needing to maintain what outside contributors submit, so I would also understand if this juice wasn't worth the squeeze.

I'll think on names, for now; or maybe someone else has one that will stick.

ollien linked a pull request Oct 6, 2024 that will close this issue

Add charset based variants for trimming, rename from left/right to start/end #696

Open

richard-viney mentioned this issue Nov 11, 2024

Use _start and _end suffixes on string functions instead of _left and _right #732

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More string trimming capabilities #587

More string trimming capabilities #587

richard-viney commented May 11, 2024

lpil commented May 13, 2024 •

edited

Loading

richard-viney commented May 20, 2024 •

edited

Loading

lpil commented May 20, 2024

inoas commented Jul 18, 2024 •

edited

Loading

RobiFerentz commented Aug 31, 2024

richard-viney commented Sep 1, 2024

lpil commented Sep 1, 2024

richard-viney commented Sep 8, 2024 •

edited

Loading

lpil commented Sep 10, 2024

richard-viney commented Sep 11, 2024

richard-viney commented Sep 11, 2024

ollien commented Oct 6, 2024

richard-viney commented Oct 6, 2024

ollien commented Oct 6, 2024

ollien commented Oct 7, 2024

lpil commented Oct 7, 2024

ollien commented Oct 7, 2024

lpil commented Oct 7, 2024

ollien commented Oct 7, 2024

More string trimming capabilities #587

More string trimming capabilities #587

Comments

richard-viney commented May 11, 2024

lpil commented May 13, 2024 • edited Loading

richard-viney commented May 20, 2024 • edited Loading

lpil commented May 20, 2024

inoas commented Jul 18, 2024 • edited Loading

RobiFerentz commented Aug 31, 2024

richard-viney commented Sep 1, 2024

lpil commented Sep 1, 2024

richard-viney commented Sep 8, 2024 • edited Loading

lpil commented Sep 10, 2024

richard-viney commented Sep 11, 2024

richard-viney commented Sep 11, 2024

ollien commented Oct 6, 2024

richard-viney commented Oct 6, 2024

ollien commented Oct 6, 2024

ollien commented Oct 7, 2024

lpil commented Oct 7, 2024

ollien commented Oct 7, 2024

lpil commented Oct 7, 2024

ollien commented Oct 7, 2024

lpil commented May 13, 2024 •

edited

Loading

richard-viney commented May 20, 2024 •

edited

Loading

inoas commented Jul 18, 2024 •

edited

Loading

richard-viney commented Sep 8, 2024 •

edited

Loading