non-uniform Russian encodings in the sourcefiles #3

bamboospirit · 2019-01-25T04:20:51Z

The .f files and the .txt files don't all use the same Russian encoding. Some of them use cp866, while others use windows-1251. I tried to convert them with iconv and it almost worked. I ran it manually on each file, specifying the encoding after opening the file in a text editor to see which one was right, and the conversion to UTF-8 looked fine to me. Unfortunately, iconv deleted random parts of the code, so the result was broken.
The only reliable way would be to convert them manually (open in a text editor, select the encoding under which the Russian text shows up correctly, copy the text, switch to UTF-8, paste the whole content, save). I actually did that for a lot of files, but the rest, which I processed semi-automatically with iconv and some scripting, came out broken, so I gave up.
Also, there is at least one file with two mixed encodings: the first half is cp866 and the second half is windows-1251.
This is very troublesome because one can't read the comments in the source code; one has to switch the encoding many times. It would be very good if you could convert them all to UTF-8.
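For anyone attempting the conversion, here is a minimal Python sketch of a per-line heuristic for the cp866 / windows-1251 mix described above. It is not a tested migration tool. It decodes each line with both code pages and keeps the one that yields more letters in the basic Russian range, so a file that switches encoding halfway through is handled too. The function names are hypothetical, and the heuristic can guess wrong on very short lines (for example, a line whose only non-ASCII bytes are in 0xE0..0xEF is valid Russian in both code pages), so the output still needs a visual check.

```python
# Heuristic sketch only: helper names are hypothetical, not from SP-Forth.
CANDIDATES = ("cp866", "cp1251")  # cp1251 is Python's name for windows-1251


def cyrillic_score(text: str) -> int:
    """Count characters in the basic Russian range U+0410..U+044F."""
    return sum(1 for ch in text if "\u0410" <= ch <= "\u044f")


def decode_line(raw: bytes) -> str:
    """Decode one line with the candidate that yields more Russian letters.

    errors="replace" is needed because byte 0x98 is undefined in cp1251.
    Pure-ASCII lines score 0 in both and decode identically either way.
    """
    best = max(
        CANDIDATES,
        key=lambda enc: cyrillic_score(raw.decode(enc, errors="replace")),
    )
    return raw.decode(best, errors="replace")


def decode_mixed(data: bytes) -> str:
    """Decode line by line so a mid-file encoding switch is tolerated."""
    return "".join(decode_line(line) for line in data.splitlines(keepends=True))


def convert_file(src: str, dst: str) -> None:
    """Read src as raw bytes and write dst as UTF-8, keeping line endings."""
    with open(src, "rb") as f:
        text = decode_mixed(f.read())
    # newline="" disables newline translation, so CRLF/LF stay as they were.
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Working on raw bytes and never silently dropping input means that code outside the comments passes through unchanged, since both code pages agree with ASCII below 0x80.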

ruv · 2019-01-30T14:51:53Z

Yes, the best solution would be to convert all the files to UTF-8.

But in that case we would lose the ability to run words with non-ASCII names in the console, since the Windows console can't work properly in UTF-8 (see "Problems with reading/writing UTF-8 characters to console"). So it would require additional work to overcome this issue.

Perhaps it makes sense to create a separate repository with a UTF-8 copy for documentation purposes only.

For the moment you can only use an editor that can automatically detect text encoding.

BTW, could you please provide the name of the file that contains mixed encodings?

eekee · 2019-02-05T17:04:39Z

I used tcs of the Inferno operating system on some windows-1251 files. I didn't see any code deleted, so it might possibly be a better alternative. I'll probably try to convert everything for documentation purposes within the next few weeks, if my health allows. (Unless of course you are successful first, bamboospirit.)

Would UTF-16 work better for the Windows console? I don't like byte-order-dependent encodings, but if it works, it works. (Does SP-Forth have any big-endian ports?)

ruv · 2019-02-06T10:54:15Z

I would even like to convert the whole repository history (i.e. each commit) to UTF-8, with line-ending normalization.
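As a rough illustration of the line-ending part, here is a small Python sketch that maps CRLF and lone CR to LF on raw bytes. The function name is hypothetical, and choosing LF as the target is an assumption. It is safe to apply before decoding only because cp866, windows-1251 and UTF-8 never use the bytes 0x0D or 0x0A inside another character; the same does not hold for UTF-16.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF (assumed target)."""
    # Replace CRLF first so the CR in a CRLF pair is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```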

Regarding iconv: I have used it many times without any problems.

UTF-16 in the console would require a lot of additional work too. Also take into account that the "1 chars = 1" proposal was accepted at the 2016 Forth200x meeting. At the same time, it would be good to develop SP-Forth to the point where the char size and encoding can be specified as a build option.

SP-Forth does not have any big-endian ports, or indeed ports to any architecture other than x86 (apart from some experiments with x86-64).

ruv mentioned this issue Jan 25, 2019

Running on Linux #1

Closed
