non-uniform Russian encodings in the sourcefiles #3

Open
bamboospirit opened this issue Jan 25, 2019 · 3 comments

Comments

@bamboospirit

The .f files and the .txt files don't all use the same Russian encoding. Some of them use cp866, while others use windows-1251. I tried to convert them using iconv and it almost worked. Unfortunately iconv deleted random parts of the code, and even though the files looked correctly converted to UTF-8 (I ran it manually on each of them, and specified the right encoding after opening each file in a text editor to check which one it was), the code was broken.
The only reliable way would be to convert them manually (open in a text editor, select the encoding under which the Russian text shows up correctly, copy the text, switch to UTF-8 encoding, paste the whole content, save). I actually did that for a lot of files, but iconv broke the rest, which I processed semi-automatically with iconv and some scripting, so I gave up.
Also, there is at least one file with two mixed encodings: the first half is cp866 and the second half is windows-1251.
This is very troublesome because one can't read the comments in the source code without switching the encoding in the editor many times. It would be very good if you could convert them all to UTF-8.
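
For illustration, here is a minimal Python sketch of how the "pick whichever encoding makes the Russian text readable" step could be automated, including for a file that switches encoding halfway through. It is only a heuristic under stated assumptions: the only candidates are cp866 and windows-1251, and the correct decoding of a line is the one that yields more letters in the basic Russian range U+0410..U+044F. The names `count_russian`, `guess_encoding` and `convert_mixed` are hypothetical helpers, not part of SP-Forth or iconv.

```python
# Assumption: every non-ASCII line is either cp866 or windows-1251 (cp1251).
CANDIDATES = ("cp866", "cp1251")


def count_russian(text: str) -> int:
    """Count characters in the basic Russian alphabet range U+0410..U+044F."""
    return sum(1 for ch in text if "\u0410" <= ch <= "\u044f")


def guess_encoding(raw: bytes, default: str = "cp1251") -> str:
    """Return the candidate whose decoding contains more Russian letters.

    Ties (for example pure-ASCII input) fall back to `default`; for ASCII
    bytes both candidates decode to the same text anyway.
    """
    scores = {
        enc: count_russian(raw.decode(enc, errors="replace"))
        for enc in CANDIDATES
    }
    if scores["cp866"] == scores["cp1251"]:
        return default
    return max(scores, key=scores.get)


def convert_mixed(raw: bytes) -> str:
    """Decode line by line, so a file may change encoding partway through.

    Line endings are kept as they are. Bytes that the chosen code page does
    not define (0x98 in cp1251) become U+FFFD instead of raising an error.
    """
    parts = []
    for line in raw.splitlines(keepends=True):
        parts.append(line.decode(guess_encoding(line), errors="replace"))
    return "".join(parts)
```

To convert a file, one could read it in binary mode, pass the bytes to `convert_mixed`, and write the result with `encoding="utf-8"` and `newline=""` so that line endings are not translated. Short lines with only one or two Russian letters can be misdetected, so the output should still be reviewed by eye.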

@ruv ruv mentioned this issue Jan 25, 2019
@ruv
Contributor

ruv commented Jan 30, 2019

Yes, the best solution would be to convert all the files to UTF-8.

But in that case we would lose the ability to run words with non-ASCII names in the console, since the Windows console can't work properly in UTF-8 (see Problems with reading/writing UTF-8 characters to console). So it will require additional work to overcome this issue.

Perhaps it makes sense to create a separate repository with a UTF-8 copy for documentation purposes only.

For the moment you can only use an editor that can automatically detect text encoding.

BTW, could you please provide the name of the file that contains mixed encodings?

@eekee

eekee commented Feb 5, 2019

I used tcs of the Inferno operating system on some windows-1251 files. I didn't see any code deleted, so it might possibly be a better alternative. I'll probably try to convert everything for documentation purposes within the next few weeks, if my health allows. (Unless of course you are successful first, bamboospirit.)

Would UTF-16 work better for Windows console? I don't like byte-order dependent encodings, but if it works, it works. (Does SP-Forth have any big-endian ports?)
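
To make the byte-order concern concrete, here is a small Python sketch showing that the same word produces different bytes in little-endian and big-endian UTF-16, and that a byte order mark lets a decoder tell them apart. The word `слово` is an arbitrary example, not a name taken from SP-Forth.

```python
import codecs

word = "слово"  # arbitrary non-ASCII example, chosen for illustration only

little = word.encode("utf-16-le")  # low byte of each 16-bit code unit first
big = word.encode("utf-16-be")     # high byte of each 16-bit code unit first

# With a byte order mark in front, the generic "utf-16" codec detects the
# byte order by itself and drops the mark from the decoded text.
with_bom = codecs.BOM_UTF16_LE + little
decoded = with_bom.decode("utf-16")
```

Without the mark, the reader has to know the byte order in advance, which is the dependency being objected to here.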

@ruv
Contributor

ruv commented Feb 6, 2019

I would even like to convert the whole repository history (i.e. each commit) to UTF-8, with line-ending normalization.

Regarding iconv: I have used it many times without any problems.

UTF-16 in the console would require a lot of additional work too. Also take into account that the "1 chars = 1" proposal was accepted at the 2016 Forth200x meeting. At the same time, it would be good to develop SP-Forth to the point where the char size and encoding can be specified as a build option.
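
As a rough illustration of why char size and encoding are tied together, the sketch below counts how many bytes the same word name takes in each encoding discussed in this thread. It assumes the usual reading of "1 chars = 1", namely that one char occupies one address unit (a byte on x86); `encoded_sizes` is a hypothetical helper and the word names are arbitrary examples.

```python
def encoded_sizes(name, encodings=("cp866", "cp1251", "utf-8", "utf-16-le")):
    """Return the number of bytes `name` occupies in each encoding."""
    return {enc: len(name.encode(enc)) for enc in encodings}


russian = encoded_sizes("слово")  # five Cyrillic letters
ascii_only = encoded_sizes("DUP")  # three ASCII letters
```

With the single-byte code pages, one letter fits in one char. In UTF-8 each of these Cyrillic letters takes two bytes, and in UTF-16 every letter here takes two bytes, including the ASCII ones, so code that steps through a name one char at a time would need to be revisited.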

SP-Forth does not have any big-endian ports, or indeed ports to any architecture other than x86 (apart from some experiments with x86-64).

3 participants