fix: stringify chunk before merging chunks breaks character at the end #4433

panoanx · 2024-10-13T06:47:15Z

Hi there,

I like this extension and have been using it for a long time. It is really awesome!

Recently I found that formatting tex files sometimes introduces random characters. An example is:

\documentclass{article}
\usepackage{ctex}

\begin{document}
汉语又称华语[6][7]，是汉族的语言[7][8]，是由先秦雅言发展而来、书写使用表意文字（汉字）的东亚分析语，为汉藏语系最大的一支语族。如把整个汉语族视为单一语言，则汉语为世界上母语使用者人数最多的语言，目前全世界有五分之一人口将其作为母语或第二语言。汉语在以其做为母语的地方有不同通称，且有多种方言变体，其中以北方汉语为基础的官话最为流行，其衍生而来的现代标准汉语（有国语、普通话、新加坡标准华语等变体）是大中华区的主要通用语。此外，汉语还是联合国正式语文[9]，并被上海合作组织等国际组织采用为官方语言。国际上常将“汉语”称为“中文”，这是因为大多数语言使用表音文字，对于“文”[10]与“语”[11]并不作区分，不符合汉语语法；然而汉语使用语素文字，文字[12]并不等于语言[13][注 1][原创研究？]。
\end{document}

after formatting ===>

\documentclass{article}
\usepackage{ctex}

\begin{document}
汉语又称华语[6][7]，是汉族的语言[7][8]，是由先秦雅言发展而来、书写使用表意文字（汉字）的东亚分析语，为汉藏语系最大的一支语族。如把整个汉语族视为单一语言，则汉语为世界上母语使用者人数最多的语言，目前全世界有五分之一人口将其作为母语或第二语言。汉语在以其做为母语的地方有不同通称，且有多种方言变体，其中以北方汉语为基础的官话最为流行，其衍生而来的现代标准汉语（有国语、普通话、新加坡标准华语等变体）是大中华区的主要通用语。此外，汉语还是联合国正式语文[9]，并被上海合作组织等国际组织采用为官方语言。国际上常将“汉语”称为“中文”，这是因为大多数语言使用表音文字，对于“文”[10]与“语”[11]并不作区分，不符合汉语语法；然而汉语使用语素文字，文字[12]并不等于���言[13][注 1][原创研究？]。
\end{document}

The stdoutBuffer: string[] was:

[
  "\\documentclass{article}\n\\usepackage{ctex}\n\n\\begin{document}\n汉语又称华语[6][7]，是汉族的语言[7][8]，是由先秦雅言发展而来、书写使用表意文字（汉字）的东亚分析语，为汉藏语系最大的一支语族。如把整个汉语族视为单一语言，则汉语为世界上母语使用者人数最多的语言，目前全世界有五分之一人口将其作为母语或第二语言。汉语在以其做为母语的地方有不同通称，且有多种方言变体，其中以北方汉语为基础的官话最为流行，其衍生而来的现代标准汉语（有国语、普通话、新加坡标准华语等变体）是大中华区的主要通用语。此外，汉语还是联合国正式语文[9]，并被上海合作组织等国际组织采用为官方语言。国际上常将“汉语”称为“中文”，这是因为大多数语言使用表音文字，对于“文”[10]与“语”[11]并不作区分，不符合汉语语法；然而汉语使用语素文字，文字[12]并不等于�",
  "��言[13][注 1][原创研究？]。\n\\end{document}",
]

i.e., 不等于语言 at the end of line 3 turns into 不等于���言. This error frequently occurs with Chinese characters because each one occupies more than one byte (typically three) in UTF-8.

After some investigation, I found the likely cause: each chunk from the stdout pipe was first converted with chunk.toString(), and the resulting strings were joined later. When a chunk boundary falls in the middle of a multi-byte character, the partial bytes on each side are decoded separately and turn into 3 unknown characters.
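To illustrate the cause, here is a minimal sketch that reproduces the corruption outside the extension. The chunk boundary is chosen by hand (an assumption for the demo; in practice the pipe decides where chunks end), and the variable names are illustrative only.

```typescript
import { Buffer } from 'node:buffer'

// 语 (U+8BED) is encoded in UTF-8 as three bytes: E8 AF AD.
const bytes = Buffer.from('语言', 'utf8')

// Hypothetical chunk boundary that falls after the first byte of 语.
const chunkA = bytes.subarray(0, 1)
const chunkB = bytes.subarray(1)

// Decoding each chunk on its own: every stray byte becomes U+FFFD.
const perChunk = chunkA.toString() + chunkB.toString()

// Decoding once after merging the bytes keeps the character intact.
const merged = Buffer.concat([chunkA, chunkB]).toString()

console.log(JSON.stringify(perChunk))
console.log(JSON.stringify(merged))
```

The first result contains three replacement characters followed by 言, matching the stdoutBuffer dump above; the second is the original text.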

However, please note that this behaviour is not consistent (sometimes the error happens and sometimes stdout is not chunked at all), and it might differ across platforms.

So I think it is safer to first merge the chunks as Buffers, and only convert them into a string after concatenation.
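The idea can be sketched roughly as follows. This is not the code from the commits in this PR; `collectUtf8` is a hypothetical helper name, and the sketch assumes the output is UTF-8.

```typescript
import { Buffer } from 'node:buffer'
import { Readable } from 'node:stream'

// Hypothetical helper: gather a stream's output as raw bytes and decode once at the end.
function collectUtf8(stream: Readable): Promise<string> {
    return new Promise((resolve, reject) => {
        const chunks: Buffer[] = []
        stream.on('data', (chunk: Buffer | string) => {
            // Chunks are Buffers unless an encoding was set on the stream.
            chunks.push(typeof chunk === 'string' ? Buffer.from(chunk, 'utf8') : chunk)
        })
        stream.on('error', reject)
        // Concatenate the bytes first, then decode, so a character split
        // across two chunks is reassembled before decoding.
        stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')))
    })
}
```

With a child process this would be called as `collectUtf8(proc.stdout)`, assuming `proc` was created with a piped stdout. Node's `StringDecoder` from `node:string_decoder` is another option when the output has to be decoded incrementally, since it holds back incomplete multi-byte sequences between writes.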

Thanks again for your work on this extension.

James-Yu · 2024-10-13T12:19:27Z

Thanks for the contribution. Could you also do the same for tex-fmt? Thanks.

panoanx · 2024-10-13T13:20:38Z

Definitely. Please see 4ea5f55.

fix: stringify chunk before merging chunks breaks character at the end

ba1633f

fix: stdout concatenation for tex-fmt #4433

4ea5f55

James-Yu merged commit 7f4f1ff into James-Yu:master Oct 13, 2024
7 checks passed

github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2024
