support of dbf files encoded in utf-8 #9
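we can just use the built-in node/browserify string decoder (https://nodejs.org/api/string_decoder.html), which will help us with some of the crazier edge cases. I'm trying to remember why I have that regex there, so I do need to do some poking. You said something about .cpg in the original email you sent, do you know what kind of info that has?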
hm, I think we can blame @jimmyrocks for that regex, as he added it in calvinmetcalf/shapefile-js#7, but while it's supposed to remove a trailing null character, from looking at this it doesn't seem to do that.
.cpg is just a text file with the encoding for the .dbf
here's a reference: https://en.wikipedia.org/wiki/Shapefile
so the encoding param that is to be passed to parseDBF should be taken from
the .cpg file if available, otherwise utf-8 seems to be a reasonable default
not 100% sure, but it looks as if StringDecoder will work with utf-8 and
utf-16 only (will check it), while .dbf files could be encoded differently.
In Poland we've got windows1250 and iso8859-2 and I think it's a typical use
case that .dbf files are utf-8
Kind regards, Blazej
2017-03-22 13:22 GMT+01:00 Calvin Metcalf:
… we can just use the built in node/browserify string decoder
<https://nodejs.org/api/string_decoder.html> which will help us with some
of the crazier edge cases, I'm trying to remember why I have that regex
there so I do need to do some poking, you said something about .cpg in
the original email you sent do you know what kind of info that has?
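For reference, a minimal sketch of what the built-in decoder does and does not cover. It assumes Node's `string_decoder` module and uses the bytes of "Łódź" as made-up sample data. The decoder copes with a multi-byte character split across chunks, but it only accepts the encodings that `Buffer` knows, so a label like windows-1250 is not available this way.

```javascript
const { StringDecoder } = require('string_decoder');

// "Łódź" encoded as UTF-8; "Ł" is the two bytes 0xC5 0x81 (sample data for illustration).
const bytes = Buffer.from([0xc5, 0x81, 0xc3, 0xb3, 0x64, 0xc5, 0xba]);

const decoder = new StringDecoder('utf8');

// Feed the buffer in two chunks that split "Ł" in the middle.
// The decoder holds back the incomplete byte until the rest arrives.
const text =
  decoder.write(bytes.subarray(0, 1)) +
  decoder.write(bytes.subarray(1)) +
  decoder.end();

// StringDecoder only takes encodings that Buffer itself supports.
const supportsUtf16 = Buffer.isEncoding('utf16le');
const supportsWin1250 = Buffer.isEncoding('windows-1250');

console.log(text, supportsUtf16, supportsWin1250);
```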
The regex should work for a null character. For example:
It is strange that it will remove any null characters, but multiple whitespace characters, it may be better to rewrite it as:
The current code removes all null characters, despite where they are
Of course, we're not replacing it with "symbol for null", we're just replacing it with nothing, so the multiple matches at the end don't make much of a difference.
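The regex under discussion isn't shown in this thread, so the following is only a sketch of the distinction being made, not the library's code. `trimField` and `stripAllNulls` are hypothetical helper names: the first removes only trailing null and whitespace characters, the second removes every null character wherever it appears.

```javascript
// Hypothetical helper: strip only the trailing NULs and whitespace of a DBF field value.
function trimField(value) {
  return value.replace(/[\0\s]+$/, '');
}

// Hypothetical helper: remove every NUL character, regardless of position,
// then trim surrounding whitespace.
function stripAllNulls(value) {
  return value.replace(/\0/g, '').trim();
}

// Padded field values as they might come out of a fixed-width DBF column.
const padded = 'Warszawa\0\0  ';
const embedded = 'a\0b\0';

console.log(JSON.stringify(trimField(padded)));
console.log(JSON.stringify(trimField(embedded)));
console.log(JSON.stringify(stripAllNulls(embedded)));
```

For values that are only padded at the end, both approaches give the same result; they differ only when a null character sits in the middle of the value.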
.dbf files are not utf-8
that being said it looks like text-encoding is probably better than the built-in node one as it's much more aligned with what we want
ah just for fun textdecoder and cpg files use different names for the same encoding, woopie |
should the mapping be done here or in shapefile-js? |
ah so as it turns out text-encoding by default doesn't include anything beyond the basic files unless you also include their big half megabyte index, so I'm going to have to think about the best way to do it that would work for everyone
ok so we probably want to be able to parse the cpg files, so we're going to need the mappings to be able to do that. Then we'll need the text-encoder polyfill for use in node and IE, and we'll probably want to include the additional encodings in node but not IE by default, with a note saying you can include that file to get it to work with obscurer encodings in IE
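A rough sketch of the decoding side of that plan, assuming a global `TextDecoder` is available (natively or through a polyfill). `createDecoder` is a hypothetical name, and falling back to UTF-8 for unknown labels is an assumption made for illustration, not something settled in this thread.

```javascript
// Hypothetical helper: return a function that decodes bytes with the given
// encoding label, falling back to UTF-8 when the runtime lacks that encoding.
function createDecoder(label) {
  let decoder;
  try {
    decoder = new TextDecoder(label || 'utf-8');
  } catch (err) {
    // An unknown or unsupported label makes the constructor throw;
    // the fallback here is an assumption for this sketch.
    decoder = new TextDecoder('utf-8');
  }
  return (bytes) => decoder.decode(bytes);
}

// "Łódź" encoded as UTF-8 (sample data for illustration).
const sample = new Uint8Array([0xc5, 0x81, 0xc3, 0xb3, 0x64, 0xc5, 0xba]);

const decodeUtf8 = createDecoder('utf-8');
const decodeFallback = createDecoder('no-such-encoding');

console.log(decodeUtf8(sample), decodeFallback(sample));
```

Whether a label such as windows-1250 resolves depends on the runtime or on which index the polyfill ships with, which is the trade-off described above.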
hm iconv-lite might be a compromise as it has the more well known ones without as much of the crazy |
@koszny if you could point me to an example of a shapefile with non-default encoding that would be very helpful
the cpg in those just says 'UTF-8' |
duh, because that's what this pull is about. Do you have any examples with windows1250 or more exotic encodings?
then check here ftp://91.223.135.109/prg/jednostki_administracyjne.zip |
thanks I ended up figuring out how to make one with polish names in the windows1250 encoding |
and it looks like it might be tricky to use https://github.com/czhj/codepage-encoding, i.e. 'ANSI 1250' won't map to anything, I doubt all software uses the same convention to keep encoding info in the .cpg file
so I'm using that to map the code page number to its name, not the other way around. From what I can tell, the cpg file either has a number or the phrase 'UTF-8'
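As a sketch of that number-to-name direction: `cpgToLabel` and the `CODE_PAGES` table below are hypothetical, and the table is a small illustrative subset rather than the mapping from codepage-encoding. Picking up trailing digits so that 'ANSI 1250' also resolves, and defaulting to UTF-8 when nothing matches, are assumptions.

```javascript
// Illustrative subset of Windows code page numbers and their encoding labels.
// Not the table used by any library mentioned in this thread.
const CODE_PAGES = {
  1250: 'windows-1250',
  1252: 'windows-1252',
  28592: 'iso-8859-2',
  65001: 'utf-8'
};

// Hypothetical helper: turn the contents of a .cpg file into an encoding label.
function cpgToLabel(cpgText) {
  const text = (cpgText || '').trim();

  // The phrase 'UTF-8' (in any case, with or without the hyphen).
  if (/^utf-?8$/i.test(text)) return 'utf-8';

  // A bare code page number, or (assumption) a prefixed one such as 'ANSI 1250'.
  const match = /(\d+)$/.exec(text);
  if (match && CODE_PAGES[match[1]]) return CODE_PAGES[match[1]];

  // Missing or unrecognized content: default to UTF-8 (assumption).
  return 'utf-8';
}

console.log(cpgToLabel('1250\n'), cpgToLabel('UTF-8'), cpgToLabel('ANSI 1250'));
```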
oh excellent, your files don't put the codepage number in there, um how exactly did you generate them?
the file is from http://www.codgik.gov.pl/index.php/darmowe-dane/prg.html, but dunno how it was generated. |
any update on this? |
I'm waiting on inexorabletash/text-encoding#70, I'd prefer to just use that, if not I can create my own fork |
ok we should support it now |