support of dbf files encoded in utf-8 #9

Closed

@koszny koszny commented Mar 22, 2017

support of dbf files encoded in utf-8

@calvinmetcalf
Owner

We can just use the built-in node/browserify string decoder, which will help us with some of the crazier edge cases. I'm trying to remember why I have that regex there, so I do need to do some poking. You said something about .cpg in the original email you sent; do you know what kind of info that has?
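One edge case the built-in decoder handles is a multi-byte character split across two buffers. The following is a minimal sketch, assuming Node's `string_decoder` module; the sample bytes and variable names are made up for illustration and are not taken from this pull request.

```javascript
const { StringDecoder } = require('string_decoder');

// "ż" (U+017C) is two bytes in UTF-8: 0xC5 0xBC.
// Split it across two chunks, as can happen when reading in pieces.
const first = Buffer.from([0x61, 0xc5]);  // "a" + first byte of "ż"
const second = Buffer.from([0xbc, 0x62]); // second byte of "ż" + "b"

// Decoding each chunk on its own yields replacement characters.
const naive = first.toString('utf8') + second.toString('utf8');

// StringDecoder buffers the incomplete sequence until the rest arrives.
const decoder = new StringDecoder('utf8');
const text = decoder.write(first) + decoder.write(second) + decoder.end();

console.log(naive === text); // false
console.log(text);           // ażb
```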

@calvinmetcalf
Owner

Hm, I think we can blame @jimmyrocks for that regex, as he added it in calvinmetcalf/shapefile-js#7. It's supposed to remove a trailing null character, but from looking at it, it doesn't seem to do that.

@koszny
Author

koszny commented Mar 22, 2017 via email

@jimmyrocks

The regex should work for a null character. For example:
'abcdefg\0'.replace(/\0|\s+$/g,'␀');
Will return:
"abcdefg␀"

It is strange that it removes every null character, wherever it appears, but whitespace only at the end of the string. It may be better to rewrite it as:
'abcdefg\0\0\0 '.replace(/[\0\s]+$/g,'␀');
Which returns:
"abcdefg␀"

The current code removes all null characters, regardless of where they are:
'abcdefg\0\0\0 '.replace(/\0|\s+$/g,'␀'); returns an ugly "abcdefg␀␀␀␀"
and
'abc\0defg\0\0\0 '.replace(/\0|\s+$/g,'␀'); returns "abc␀defg␀␀␀␀"

Of course, we're not replacing it with "symbol for null", we're just replacing it with nothing, so the multiple matches at the end don't make much of a difference.
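The difference does show up when a null sits in the middle of a value. Here is a small sketch comparing the two patterns with an empty replacement; the variable names are illustrative only, and the trailing-only pattern is written as a character class without `|`, since a `|` inside `[...]` would match a literal pipe.

```javascript
const original = /\0|\s+$/g;     // every null, plus trailing whitespace
const trailingOnly = /[\0\s]+$/; // only a trailing run of nulls/whitespace

const sample = 'abc\0defg\0\0\0 ';

const withOriginal = sample.replace(original, '');
const withTrailingOnly = sample.replace(trailingOnly, '');

console.log(JSON.stringify(withOriginal));     // "abcdefg"
console.log(JSON.stringify(withTrailingOnly)); // "abc\u0000defg"
```

Which result is wanted depends on whether an interior null should be treated as padding or as data.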

@koszny
Author

koszny commented Mar 22, 2017 via email

@calvinmetcalf
Owner

  • removing the g flag would cause it to remove only the first match (the first null, or the trailing whitespace if there are no nulls)
  • I think that website is wrong about what that regex does, so @jimmyrocks is vindicated
  • I think there end up being 2 issues, trailing whitespace and nulls, so we may be able to split it up into 2 things
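Splitting the cleanup into two steps could look roughly like this. It is a sketch only: `cleanField` is a hypothetical helper name, not a function from this repository, and dropping every null (rather than only trailing ones) is an assumption.

```javascript
// Hypothetical helper: handle nulls and trailing whitespace separately.
function cleanField(raw) {
  const withoutNulls = raw.replace(/\0/g, ''); // step 1: drop null padding
  return withoutNulls.replace(/\s+$/, '');     // step 2: trim trailing whitespace
}

// Spaces followed by nulls: the combined regex leaves the spaces behind,
// because they are not at the end of the string when it is scanned.
const combined = 'name   \0\0'.replace(/\0|\s+$/g, '');
const twoStep = cleanField('name   \0\0');

// Without the g flag only the first match is replaced.
const noGlobal = 'abc\0defg\0 '.replace(/\0|\s+$/, '');

console.log(JSON.stringify(combined)); // "name   "
console.log(JSON.stringify(twoStep));  // "name"
console.log(JSON.stringify(noGlobal)); // "abcdefg\u0000 "
```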

@calvinmetcalf
Owner

That being said, it looks like text-encoding is probably better than the built-in node one, as it's much more aligned with what we want.
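For reference, decoding a fixed-width field with the `TextDecoder` API (which text-encoding polyfills) might look like the sketch below. It assumes a runtime with a global `TextDecoder` that supports the `fatal` option, such as a recent Node build with ICU or a modern browser; `decodeField` and `looksLikeUtf8` are hypothetical names, and the byte arrays are hand-made samples.

```javascript
// Hypothetical helper: decode one padded field and strip the padding.
function decodeField(bytes, label) {
  const text = new TextDecoder(label).decode(bytes);
  return text.replace(/[\0\s]+$/, '');
}

// Hypothetical helper: strict decoding throws on invalid UTF-8,
// which hints that the file uses some other encoding.
function looksLikeUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// "Łódź" as UTF-8, padded with spaces to 12 bytes.
const utf8Field = new Uint8Array([
  0xc5, 0x81, 0xc3, 0xb3, 0x64, 0xc5, 0xba, 0x20, 0x20, 0x20, 0x20, 0x20,
]);
// The same word as windows-1250 bytes, which is not valid UTF-8.
const cp1250Field = new Uint8Array([0xa3, 0xf3, 0x64, 0x9f, 0x20, 0x20]);

console.log(decodeField(utf8Field, 'utf-8')); // Łódź
console.log(looksLikeUtf8(utf8Field));        // true
console.log(looksLikeUtf8(cp1250Field));      // false
```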

@calvinmetcalf
Owner

Ah, just for fun, TextDecoder and .cpg files use different names for the same encoding. Whoopee.

@koszny
Author

koszny commented Mar 22, 2017

should the mapping be done here or in shapefile-js?

@calvinmetcalf
Owner

Ah, so as it turns out, text-encoding by default doesn't include anything beyond the basic encodings unless you also include their big half-megabyte index, so I'm going to have to think about the best way to do it that would work for everyone.

@calvinmetcalf
Owner

OK, so we probably want to be able to parse the .cpg files, so we're going to need the mappings to be able to do that. Then we'll need the text-encoding polyfill for use in node and IE. We'll probably want to include the additional encodings in node but not in IE by default, and have a note saying you can include that file to get it to work with more obscure encodings in IE.

@calvinmetcalf
Owner

Hm, iconv-lite might be a compromise, as it has the more well-known ones without as much of the crazy.

@calvinmetcalf
Owner

@koszny if you could point me to an example of a shapefile with a non-default encoding, that would be very helpful

@koszny
Author

koszny commented Mar 27, 2017

@calvinmetcalf
Owner

the cpg in those just says 'UTF-8'

@calvinmetcalf
Owner

Duh, because that's what this pull is about. Do you have any examples with windows1250 or more exotic encodings?

@koszny
Author

koszny commented Mar 27, 2017

then check here ftp://91.223.135.109/prg/jednostki_administracyjne.zip

@calvinmetcalf
Owner

Thanks, I ended up figuring out how to make one with Polish names in the windows1250 encoding

@koszny
Author

koszny commented Mar 27, 2017

and it looks like it might be tricky to use https://github.com/czhj/codepage-encoding, e.g. 'ANSI 1250' won't map to anything. I doubt all software uses the same convention to keep encoding info in the .cpg file

@calvinmetcalf
Owner

So I'm using that to map the code page number to its name, not the other way around. From what I can tell, the .cpg file either has a number or the phrase 'UTF-8'.
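A rough sketch of that lookup, going from the .cpg contents to a TextDecoder-style label, is shown here. Everything in it is an assumption for illustration: `cpgToLabel` and the tiny `CODE_PAGE_LABELS` table are hypothetical and are not the mapping used by this project or by codepage-encoding.

```javascript
// Hypothetical, deliberately tiny table: code page number -> encoding label.
const CODE_PAGE_LABELS = {
  1250: 'windows-1250',
  1251: 'windows-1251',
  1252: 'windows-1252',
  65001: 'utf-8',
};

// Hypothetical helper: returns a label, or null when the contents
// follow a convention this sketch does not recognise.
function cpgToLabel(cpgText) {
  const value = cpgText.trim();
  if (/^utf-?8$/i.test(value)) {
    return 'utf-8';
  }
  const match = /^(\d+)$/.exec(value);
  if (match && CODE_PAGE_LABELS[match[1]]) {
    return CODE_PAGE_LABELS[match[1]];
  }
  return null; // e.g. 'ANSI 1250' is neither a bare number nor 'UTF-8'
}

console.log(cpgToLabel('UTF-8\n'));    // utf-8
console.log(cpgToLabel('1250'));       // windows-1250
console.log(cpgToLabel('ANSI 1250'));  // null
```

Returning null instead of guessing leaves the caller free to fall back to a default encoding.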

@calvinmetcalf
Owner

Oh excellent, your files don't put the code page number in there. Um, how exactly did you generate them?

@koszny
Author

koszny commented Mar 27, 2017

the file is from http://www.codgik.gov.pl/index.php/darmowe-dane/prg.html, but dunno how it was generated.

@ahjyrkia

any update on this?

@calvinmetcalf
Owner

I'm waiting on inexorabletash/text-encoding#70. I'd prefer to just use that; if not, I can create my own fork.

@calvinmetcalf
Owner

ok we should support it now
