fix: stringify chunk before merging chunks breaks character at the end #4433

Merged: 2 commits, Oct 13, 2024

@panoanx
Contributor

panoanx commented Oct 13, 2024

Hi there,

I like this extension and have been using it for a long time. It is really awesome!

Recently I found that formatting tex files introduces random characters. An example is:

\documentclass{article}
\usepackage{ctex}

\begin{document}
汉语又称华语[6][7],是汉族的语言[7][8],是由先秦雅言发展而来、书写使用表意文字(汉字)的东亚分析语,为汉藏语系最大的一支语族。如把整个汉语族视为单一语言,则汉语为世界上母语使用者人数最多的语言,目前全世界有五分之一人口将其作为母语或第二语言。汉语在以其做为母语的地方有不同通称,且有多种方言变体,其中以北方汉语为基础的官话最为流行,其衍生而来的现代标准汉语(有国语、普通话、新加坡标准华语等变体)是大中华区的主要通用语。此外,汉语还是联合国正式语文[9],并被上海合作组织等国际组织采用为官方语言。国际上常将“汉语”称为“中文”,这是因为大多数语言使用表音文字,对于“文”[10]与“语”[11]并不作区分,不符合汉语语法;然而汉语使用语素文字,文字[12]并不等于语言[13][注 1][原创研究?]。
\end{document}

After formatting, the file becomes:

\documentclass{article}
\usepackage{ctex}

\begin{document}
汉语又称华语[6][7],是汉族的语言[7][8],是由先秦雅言发展而来、书写使用表意文字(汉字)的东亚分析语,为汉藏语系最大的一支语族。如把整个汉语族视为单一语言,则汉语为世界上母语使用者人数最多的语言,目前全世界有五分之一人口将其作为母语或第二语言。汉语在以其做为母语的地方有不同通称,且有多种方言变体,其中以北方汉语为基础的官话最为流行,其衍生而来的现代标准汉语(有国语、普通话、新加坡标准华语等变体)是大中华区的主要通用语。此外,汉语还是联合国正式语文[9],并被上海合作组织等国际组织采用为官方语言。国际上常将“汉语”称为“中文”,这是因为大多数语言使用表音文字,对于“文”[10]与“语”[11]并不作区分,不符合汉语语法;然而汉语使用语素文字,文字[12]并不等于���言[13][注 1][原创研究?]。
\end{document}

The contents of stdoutBuffer (a string[]) were:

[
  "\\documentclass{article}\n\\usepackage{ctex}\n\n\\begin{document}\n汉语又称华语[6][7],是汉族的语言[7][8],是由先秦雅言发展而来、书写使用表意文字(汉字)的东亚分析语,为汉藏语系最大的一支语族。如把整个汉语族视为单一语言,则汉语为世界上母语使用者人数最多的语言,目前全世界有五分之一人口将其作为母语或第二语言。汉语在以其做为母语的地方有不同通称,且有多种方言变体,其中以北方汉语为基础的官话最为流行,其衍生而来的现代标准汉语(有国语、普通话、新加坡标准华语等变体)是大中华区的主要通用语。此外,汉语还是联合国正式语文[9],并被上海合作组织等国际组织采用为官方语言。国际上常将“汉语”称为“中文”,这是因为大多数语言使用表音文字,对于“文”[10]与“语”[11]并不作区分,不符合汉语语法;然而汉语使用语素文字,文字[12]并不等于�",
  "��言[13][注 1][原创研究?]。\n\\end{document}",
]

That is, 不等于语言 at the end of line 3 turns into 不等于���言. This error occurs frequently with Chinese characters because each of them takes more than one byte (typically three bytes in UTF-8).

After some investigation, I found the likely cause: each chunk from the stdout pipe was first converted with chunk.toString(), and the resulting strings were joined afterwards. If a multi-byte character straddles a chunk boundary, each chunk is decoded on its own, so the split bytes cannot be decoded and the character turns into 3 unknown characters.
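The effect can be reproduced without running the formatter. The following is a minimal sketch, not the extension's actual code: `decodePerChunk` is a hypothetical helper that imitates the decode-then-join order described above, and the chunk boundary is placed by hand in the middle of 语.

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper (not from the extension): decode every chunk on its
// own and join the resulting strings, as described in the report.
function decodePerChunk(chunks: Buffer[]): string {
    return chunks.map(chunk => chunk.toString('utf8')).join('')
}

// Each of these four characters takes 3 bytes in UTF-8, so 语 is bytes 6..8.
const encodedText = Buffer.from('等于语言', 'utf8')

// Simulate a pipe that happens to cut after the first byte of 语.
const splitChunks = [encodedText.subarray(0, 7), encodedText.subarray(7)]

const brokenText = decodePerChunk(splitChunks)

// The bytes of 语 can no longer be decoded, so U+FFFD replacement
// characters appear where 语 used to be.
console.log(brokenText)
```

Where the boundary falls depends on how the operating system delivers the pipe data, which is consistent with the problem appearing only some of the time.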

However, please note that this behaviour is not consistent (sometimes the error happens and sometimes stdout is not chunked at all), and it might differ between platforms.

So I think it is safer to first merge the chunks as Buffer objects, and only convert the result into a string after concatenation.
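A minimal sketch of that approach, assuming the chunks arrive as Node.js `Buffer` objects. `decodeAfterConcat` is a hypothetical name used only for illustration and is not taken from the extension's source.

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper: merge the raw bytes first, then decode once, so a
// character split across two chunks is reassembled before decoding.
function decodeAfterConcat(chunks: Buffer[]): string {
    return Buffer.concat(chunks).toString('utf8')
}

// Same kind of split as in the report: the boundary falls inside 语.
const wholeText = Buffer.from('不等于语言', 'utf8')
const rawChunks = [wholeText.subarray(0, 10), wholeText.subarray(10)]

const repairedText = decodeAfterConcat(rawChunks)
console.log(repairedText === '不等于语言')
```

In a child process handler this would amount to pushing each `data` chunk into a `Buffer[]` and decoding once when the process closes. Node's `StringDecoder` from `node:string_decoder` is another option when the output must be decoded incrementally.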

Thanks again for your work on this extension.

@James-Yu
Owner

Thanks for the contribution. Could you also do the same for tex-fmt? Thanks.

@panoanx
Contributor Author

panoanx commented Oct 13, 2024

Definitely. Please see 4ea5f55.

@James-Yu James-Yu merged commit 7f4f1ff into James-Yu:master Oct 13, 2024
7 checks passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2024