support of dbf files encoded in utf-8 #9

koszny · 2017-03-22T00:49:46Z

support of dbf files encoded in utf-8

calvinmetcalf · 2017-03-22T12:21:59Z

we can just use the built-in node/browserify string decoder, which will help us with some of the crazier edge cases. I'm trying to remember why I have that regex there, so I do need to do some poking. you said something about .cpg in the original email you sent, do you know what kind of info that has?
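One edge case the built-in decoder handles is a multi-byte character split across two chunks. The following is a minimal sketch using Node's `string_decoder` module; the byte values are an illustrative UTF-8 sample ("Łódź"), not data from a real .dbf file, and whether the parser ever sees split characters like this is an assumption.

```javascript
const { StringDecoder } = require('string_decoder');

// "Ł" (U+0141) is encoded as the two bytes 0xC5 0x81 in UTF-8.
const decoder = new StringDecoder('utf8');

// Feed the first byte of "Ł" alone, as could happen when a buffer is
// sliced at a fixed byte offset. The decoder holds it back instead of
// emitting a broken character.
const firstHalf = decoder.write(Buffer.from([0xc5]));

// The remaining bytes complete "Ł" and add "ódź".
const secondHalf = decoder.write(Buffer.from([0x81, 0xc3, 0xb3, 0x64, 0xc5, 0xba]));

// Flush anything still buffered (nothing in this case).
const rest = decoder.end();

console.log(JSON.stringify(firstHalf));
console.log(firstHalf + secondHalf + rest);
```

Note that `StringDecoder` only accepts the encodings Node's `Buffer` supports, which is the limitation raised later in this thread.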

calvinmetcalf · 2017-03-22T12:26:50Z

hm I think we can blame @jimmyrocks for that regex as he added it in calvinmetcalf/shapefile-js#7 but while it's supposed to remove a trailing null character, from looking at this it doesn't seem to do that.

koszny · 2017-03-22T12:50:36Z

.cpg is just a text file with the encoding for the .dbf; here is a reference: https://en.wikipedia.org/wiki/Shapefile

So the encoding param that is passed to parseDBF should be taken from the .cpg file if available; otherwise utf-8 seems to be a reasonable default.

I'm not 100% sure, but it looks as if StringDecoder will work with utf-8 and utf-16 only (will check it), while .dbf files could be encoded differently. In Poland we've got windows1250 and iso8859-2, and I think it's a typical use case that .dbf files are utf-8.

Kind regards,
Blazej


jimmyrocks · 2017-03-22T12:51:04Z

The regex should work for a null character. For example:
'abcdefg\0'.replace(/\0|\s+$/g,'␀');
Will return:
"abcdefg␀"

It is strange that it removes every null character wherever it appears, but whitespace only as a trailing run. It may be better to rewrite it as:
'abcdefg\0\0\0 '.replace(/[\0|\s]+$/g,'␀');
Which returns:
"abcdefg␀"

The current code removes all null characters, regardless of where they are:
'abcdefg\0\0\0 '.replace(/\0|\s+$/g,'␀'); returns an ugly "abcdefg␀␀␀␀"
and
'abc\0defg\0\0\0 '.replace(/\0|\s+$/g,'␀'); returns "abc␀defg␀␀␀␀"

Of course, we're not replacing it with "symbol for null", we're just replacing it with nothing, so the multiple matches at the end don't make much of a difference.

koszny · 2017-03-22T12:51:16Z

Correcting my previous message: .dbf files are not utf-8


calvinmetcalf · 2017-03-22T13:01:42Z

removing the g flag would cause it to only remove the trailing null
I think that website is wrong about what that regex does so vindicated @jimmyrocks
I think there end up being 2 issues, trailing whitespace and nulls, so we may be able to split it up into 2 things
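A rough sketch of what splitting the cleanup into two steps could look like. The helper names (`stripNulls`, `trimTrailingWhitespace`, `cleanField`) are hypothetical and not taken from the library. The last line also shows what the original pattern does without the `g` flag: `replace` then acts on the first match only, which is the first null in the string and not necessarily the trailing one.

```javascript
// Hypothetical helpers; the names are illustrative, not the library's code.

// Step 1: drop every null character, wherever it appears.
function stripNulls(value) {
  return value.replace(/\0/g, '');
}

// Step 2: drop a single run of whitespace at the end of the string.
function trimTrailingWhitespace(value) {
  return value.replace(/\s+$/, '');
}

// Apply both steps in order.
function cleanField(value) {
  return trimTrailingWhitespace(stripNulls(value));
}

const padded = cleanField('abcdefg\0\0\0 ');
const embedded = cleanField('abc\0defg\0\0\0 ');

// Without the g flag only the first match is replaced, so the embedded
// null goes away and the trailing one stays.
const withoutGlobalFlag = 'abc\0defg\0'.replace(/\0|\s+$/, '');

console.log(JSON.stringify([padded, embedded, withoutGlobalFlag]));
```

Removing embedded nulls joins the text on either side of them; cutting the value at the first null would be a different design choice and is not shown here.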

calvinmetcalf · 2017-03-22T13:04:07Z

that being said it looks like text-encoding is probably better than the built-in node one as it's much more aligned with what we want
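For context, text-encoding is a polyfill of the standard `TextDecoder` API. A minimal sketch of decoding one fixed-width field with it might look like this; `decodeField` is a hypothetical helper, the record bytes are made up for illustration, and the global `TextDecoder` is assumed to be available (as in recent Node versions and modern browsers).

```javascript
// Hypothetical helper: decode one fixed-width field from a record buffer.
// `label` is an encoding label as understood by TextDecoder.
function decodeField(bytes, start, length, label) {
  const decoder = new TextDecoder(label);
  return decoder.decode(bytes.subarray(start, start + length));
}

// Illustrative 10-byte field: "Łódź" in UTF-8 (7 bytes), padded with spaces.
const record = new Uint8Array([
  0xc5, 0x81, 0xc3, 0xb3, 0x64, 0xc5, 0xba, 0x20, 0x20, 0x20,
]);

const name = decodeField(record, 0, 10, 'utf-8');

// The padding is still present after decoding and has to be trimmed separately.
console.log(JSON.stringify(name.trimEnd()));
```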

calvinmetcalf · 2017-03-22T13:28:47Z

ah just for fun, TextDecoder and cpg files use different names for the same encoding, woopie

koszny · 2017-03-22T13:33:18Z

should the mapping be done here or in shapefile-js?

calvinmetcalf · 2017-03-22T13:38:57Z

ah so as it turns out text-encoding by default doesn't include anything beyond the basic files unless you also include their big half megabyte index, so I'm going to have to think about the best way to do it that would work for everyone
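One way to cope with an environment that lacks a given encoding is to fall back to a default when the decoder cannot be constructed. This is only a sketch under that assumption: `makeDecoder` is a hypothetical helper, and it relies on the standard behaviour that the `TextDecoder` constructor throws a `RangeError` for a label it does not support.

```javascript
// Hypothetical helper: build a decoder for `label`, falling back to
// `fallbackLabel` when the environment does not support the requested one.
function makeDecoder(label, fallbackLabel = 'utf-8') {
  try {
    return new TextDecoder(label);
  } catch (err) {
    // Unsupported or unknown labels raise a RangeError; anything else is
    // unexpected and is re-thrown.
    if (err instanceof RangeError) {
      return new TextDecoder(fallbackLabel);
    }
    throw err;
  }
}

const supported = makeDecoder('utf-8');
const unsupported = makeDecoder('definitely-not-an-encoding');

console.log(supported.encoding, unsupported.encoding);
```

A silent fallback can produce garbled text for files that really are in another encoding, so a real implementation might prefer to warn or fail instead.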

calvinmetcalf · 2017-03-24T12:22:26Z

ok so we probably want to be able to parse the cpg files, so we're going to need the mappings to be able to do that. then we'll need the text-encoder polyfill for use in node and IE. we'll probably want to include the additional encodings in node but not IE by default, and have a note saying you can include that file to get it to work with more obscure encodings in IE

calvinmetcalf · 2017-03-24T16:28:12Z

hm iconv-lite might be a compromise as it has the more well known ones without as much of the crazy

calvinmetcalf · 2017-03-27T13:51:03Z

@koszny if you could point me to an example of a shapefile with non-default encoding that would be very helpful

koszny · 2017-03-27T13:53:42Z

http://download.geofabrik.de/europe/poland/malopolskie-latest-free.shp.zip

calvinmetcalf · 2017-03-27T14:24:58Z

the cpg in those just says 'UTF-8'

calvinmetcalf · 2017-03-27T14:29:45Z

duh because that's what this pull is about, do you have any examples with windows1250 or more exotic encodings?

koszny · 2017-03-27T14:44:30Z

then check here ftp://91.223.135.109/prg/jednostki_administracyjne.zip

calvinmetcalf · 2017-03-27T14:49:44Z

thanks I ended up figuring out how to make one with polish names in the windows1250 encoding

koszny · 2017-03-27T15:13:01Z

and it looks like it might be tricky to use https://github.com/czhj/codepage-encoding, e.g. 'ANSI 1250' won't map to anything. I doubt all software uses the same convention to keep encoding info in the .cpg file

calvinmetcalf · 2017-03-27T15:26:10Z

so I'm using that to map the code page number to its name, not the other way around. from what I can tell, the cpg file either has a number or the phrase 'UTF-8'
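The mapping described above could be sketched roughly as follows. Everything here is an assumption for illustration: `cpgToLabel` is a hypothetical function, the table is a tiny hand-picked subset of code pages and not the codepage-encoding package's data, and accepting the 'ANSI 1250' form is a guess based on the example mentioned earlier in this thread.

```javascript
// Illustrative subset only: Windows code page number -> TextDecoder label.
const CODE_PAGE_LABELS = {
  '1250': 'windows-1250',
  '1252': 'windows-1252',
  '28592': 'iso-8859-2',
  '65001': 'utf-8',
};

// Hypothetical helper: turn the text of a .cpg file into an encoding label.
function cpgToLabel(cpgText, fallback = 'utf-8') {
  const text = String(cpgText || '').trim();

  // The literal phrase 'UTF-8' (any case, hyphen optional).
  if (/^utf-?8$/i.test(text)) {
    return 'utf-8';
  }

  // A bare code page number, optionally prefixed with 'ANSI '.
  const match = text.match(/^(?:ANSI\s+)?(\d+)$/i);
  if (match && Object.prototype.hasOwnProperty.call(CODE_PAGE_LABELS, match[1])) {
    return CODE_PAGE_LABELS[match[1]];
  }

  // Anything unrecognised falls back to the caller's default.
  return fallback;
}

console.log(cpgToLabel('UTF-8'), cpgToLabel('1250\n'), cpgToLabel('ANSI 1250'));
```

The returned label could then be passed to a decoder; whether that decoder actually supports the label depends on the environment.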

calvinmetcalf · 2017-03-27T15:33:49Z

oh excellent, your files don't put the codepage number in there, um how exactly did you generate them?

koszny · 2017-03-27T17:55:07Z

the file is from http://www.codgik.gov.pl/index.php/darmowe-dane/prg.html, but dunno how it was generated.

ahjyrkia · 2017-04-11T07:37:05Z

any update on this?

calvinmetcalf · 2017-04-11T17:03:48Z

I'm waiting on inexorabletash/text-encoding#70, I'd prefer to just use that, if not I can create my own fork

calvinmetcalf · 2017-07-05T12:37:31Z

ok we should support it now

support of dbf files encoded in utf-8

dd69e5c

koszny force-pushed the master branch from 3310759 to dd69e5c Compare March 22, 2017 00:59

calvinmetcalf mentioned this pull request Mar 24, 2017

rewrite to be able to handle cpgs #10

Merged

calvinmetcalf closed this Jul 5, 2017
