-
-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More string trimming capabilities #587
Comments
Sounds good, but we'll need to think of some other name as the ones used in the code example don't fit with the convention used in this repo. Changing left and right for start and end sounds good. |
What names would you suggest? Exposing the existing @target(erlang)
pub fn trim_codepoints(
string: String,
direction: Direction,
codepoints: String,
) -> String {
let codepoint_ints =
codepoints
|> to_utf_codepoints
|> list.map(utf_codepoint_to_int)
do_trim_codepoints(string, direction, codepoint_ints)
}
@target(erlang)
@external(erlang, "string", "trim")
pub fn do_trim_codepoints(
string: String,
direction: Direction,
codepoints: List(Int),
) -> String
@target(javascript)
pub fn do_trim_codepoints(
string: String,
direction: Direction,
codepoints: List(Int),
) -> String {
todo
} |
We will not expose the direction type, it needs to be different functions. |
maybe one of these pub fn trim_codepoints(
string: String
) { todo }
// and any pair:
pub fn trim_start_codepoints(
string: String,
by: String,
) { todo }
pub fn trim_end_codepoints(
string: String,
by: String,
) { todo }
pub fn trim_head_codepoints(
string: String,
by: String,
) { todo }
pub fn trim_tail_codepoints(
string: String,
by: String,
) { todo }
pub fn trim_beginning_codepoints(
string: String,
by: String,
) { todo }
pub fn trim_ending_codepoints(
string: String,
by: String,
) { todo } |
Has anyone started working on this? I'm new to gleam but I'd like to try implementing this. pub fn trim_codepoints_start(str: String, by: String) -> String {
case starts_with(str, by) {
True -> drop_left(str, length(by)) |> trim_codepoints_start(by)
False -> str
}
} The above implementation seems to work, but I'm pretty sure there's a more performant way to do this, I'm just not used to thinking in pure functional terms. |
Yep I'd still like to get this through. We need to:
The code below implements the three proposed trim functions for the Erlang target via import gleam/list
import gleam/string
/// Removes all occurrences of the specified codepoints from the start and
/// end of a `String`.
///
pub fn trim_codepoints(string: String, codepoints: String) -> String {
let codepoint_ints = codepoints |> string_to_codepoint_ints
string_trim(string, Both, codepoint_ints)
}
/// Removes all occurrences of the specified codepoints from the start of a
/// `String`.
///
pub fn trim_codepoints_start(string: String, codepoints: String) -> String {
let codepoint_ints = codepoints |> string_to_codepoint_ints
string_trim(string, Leading, codepoint_ints)
}
/// Removes all occurrences of the specified codepoints from the end of a
/// `String`.
///
pub fn trim_codepoints_end(string: String, codepoints: String) -> String {
let codepoint_ints = codepoints |> string_to_codepoint_ints
string_trim(string, Trailing, codepoint_ints)
}
fn string_to_codepoint_ints(string: String) -> List(Int) {
string
|> string.to_utf_codepoints
|> list.map(string.utf_codepoint_to_int)
}
type Direction {
Leading
Trailing
Both
}
@target(erlang)
@external(erlang, "string", "trim")
fn string_trim(string: String, dir: Direction, characters: List(Int)) -> String |
Are we sure we want to be trimming codepoints and not graphemes? How would we ensure that the string is still valid unicode when removing codepoints? |
Looking at a few other languages, Erlang's I had previously thought that Gleam strings were defined to be valid UTF-8 binaries, but hadn't realised this extended to also being valid Unicode graphemes. Is that the case? If so, wouldn't the following lines need to error because they create a string with an invalid grapheme? (U+0301 is the 'Combining Acute Accent' as it modifies the preceding character): let invalid_grapheme = "\u{0301}"
let invalid_grapheme = <<0xCC, 0x81>> |> bit_array.to_string I tried a couple of other (what I believe to be) invalid graphemes and none were rejected from being turned into Gleam strings. Either way, this trimming functionality takes a This would look something like: /// Removes all occurrences of the specified graphemes from the start and
/// end of a `String`.
///
pub fn trim_graphemes(string: String, graphemes: List(String)) -> String {
todo
}
/// Removes all occurrences of the specified graphemes from the start of a
/// `String`.
///
pub fn trim_graphemes_start(string: String, graphemes: List(String)) -> String {
todo
}
/// Removes all occurrences of the specified graphemes from the end of a
/// `String`.
///
pub fn trim_graphemes_end(string: String, graphemes: List(String)) -> String {
todo
} |
I don't really understand what the best behaviour is here. Is there a use case that motivates making it easy to have invalid graphemes? |
No specific use case that I'm aware of. In practice most trimming is probably not done using codepoints that can be part of multi-codepoint graphemes, hence why other languages tend to trim just based on codepoints. Maybe it's better to make the pub fn trim_graphemes(string: String, graphemes: String) -> String
pub fn trim_graphemes_start(string: String, graphemes: String) -> String
pub fn trim_graphemes_end(string: String, graphemes: String) -> String The implementation then calls |
Some alternative names: pub fn trim_with(string: String, graphemes: String) -> String
pub fn trim_start_with(string: String, graphemes: String) -> String
pub fn trim_end_with(string: String, graphemes: String) -> String pub fn trim_using(string: String, graphemes: String) -> String
pub fn trim_start_using(string: String, graphemes: String) -> String
pub fn trim_end_using(string: String, graphemes: String) -> String Or could copy the Rust naming: pub fn trim_matches(string: String, graphemes: String) -> String
pub fn trim_start_matches(string: String, graphemes: String) -> String
pub fn trim_end_matches(string: String, graphemes: String) -> String |
I've been poking at this, and just to throw a name into the ring, I've been liking I assume we'd want to deprecate |
Thanks for looking into this. I don't know if introducing the term 'char' is desirable when 'codepoint' and 'grapheme' are already in use and are well-defined? |
Yeah that's reasonable. I think I'll take |
I had posted a PR before having read this issue thoroughly (I had been working on it before I found the issue actually), so I missed a couple things here! Apologies for not addressing the concerns more thoroughly, first.
|
Those names are not acceptable in the stdlib as that's not the convention it uses. We will be deprecating trim left and right. |
@lpil is there a name you'd prefer? I'm happy to use whatever name you think is conventional. |
I do not have any firm ideas for a name, but I am also not convinced that these functions are worthwhile additions. |
I'll admit, it's definitely more thorny than I thought. Originally I assumed this was so uncontroversial I'd take it upon myself to implement 😅. Text encoding is always hairy. I do think that this is a common enough need to warrant its inclusion in the standard library. I can't say I use this functionality commonly, but it is very helpful when I need it (usually some kind of parsing case). All of that said, I'm happy to follow your lead on this. If there's something you need to help move this forward, I'm happy to do what I can. I also understand the burden of needing to maintain what outside contributors submit, so I would also understand if this juice wasn't worth the squeeze. I'll think on names, for now; or maybe someone else has one that will stick. |
The existing
string.trim
,string.trim_left
andstring.trim_right
functions currently only trim whitespace.Could we add equivalent trimming functions that take the codepoints to be trimmed as an input, similar to Erlang's
string.trim/3
?E.g.
There are other ways to slice this, such as taking a
String
as the final parameter and using its codepoints as the ones to trim, rather than aList(UtfCodepoint)
.I haven't looked into what the JavaScript implementation would be, but can do so.
On a related note, my understanding is that the existing methods with the words "left" and "right" in them assume left-to-right reading order, which doesn't hold for all languages (e.g. Arabic, Hebrew). Could consider using "start" and "end" instead, or adding these as aliases.
The text was updated successfully, but these errors were encountered: