Normalize punctuation on input #1599

Open
jenniferward opened this issue Jun 12, 2024 · 1 comment

Comments

@jenniferward
Contributor

Some characters need to be normalized (smart quotes vs. apostrophes), but others need to be preserved (u vs. ü). Currently, ’ and ' are read as different punctuation marks. This causes mismatched city names in Institutions:
La Seu d’Urgell https://rism.online/institutions/30079707
La Seu d'Urgell https://rism.online/institutions/30005481
and duplicates in Titles/Texts:
https://muscat.rism.info/admin/standard_titles?utf8=%E2%9C%93&q%5Btitle_equals%5D=Au+sein+des+alarmes+l%E2%80%99amour+a+des+charmes&commit=Filter&order=id_desc
Au sein des alarmes l’amour a des charmes
Au sein des alarmes l'amour a des charmes
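To make the mismatch concrete, here is a minimal Python sketch that compares the two city names from the Institutions example character by character. The helper name `describe_differences` is made up for illustration; it is not part of Muscat.

```python
import unicodedata

curly = "La Seu d\u2019Urgell"   # spelling used in institution 30079707
straight = "La Seu d'Urgell"     # spelling used in institution 30005481


def describe_differences(a, b):
    """Return (index, name_in_a, name_in_b) for each position where a and b differ."""
    return [
        (i, unicodedata.name(x, "<unnamed>"), unicodedata.name(y, "<unnamed>"))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


print(curly == straight)
print(describe_differences(curly, straight))
```

The strings compare as unequal, and the only difference is at index 8: RIGHT SINGLE QUOTATION MARK in one, APOSTROPHE in the other.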

This arises especially when copying from websites or during data imports. The problem has been solved for searching (see #622) but not on the input side.

I can think of the following:

  • ‘ ’ and '
  • " " and “ ”
  • - (hyphen, en dash, em dash)

For the dashes, only one is needed in the standardized fields (the plain hyphen, I think?).
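The characters listed above could be folded with a translation table. This is only a sketch in Python, not how Muscat does or should implement it: the names `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical, and mapping every dash to the plain hyphen is an assumption, since the issue leaves that choice open.

```python
# Hypothetical mapping; which dash should be the standard is still an open question.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
})


def normalize_punctuation(text):
    """Replace the mapped punctuation; every other character is left untouched."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("Au sein des alarmes l\u2019amour a des charmes"))
print(normalize_punctuation("Z\u00fcrich"))  # the ü is preserved
```

Because only the listed code points are mapped, letters with diacritics such as ü pass through unchanged, which is the distinction the issue asks for.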

What about spaces? Sometimes they behave strangely (Excel doesn't always read the spaces as spaces), but I can't describe it further than that.
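One possible cause, which is a guess and not something confirmed in this issue, is that pasted text contains space-like characters other than the ordinary space, such as the no-break space. A small Python sketch for finding them (the function name `find_unusual_spaces` is hypothetical):

```python
import unicodedata


def find_unusual_spaces(text):
    """Return (index, code point, name) for space-like characters other than U+0020."""
    found = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Zs = space separators, Cf = invisible format characters, Cc = control characters
        if ch != " " and category in ("Zs", "Cf", "Cc"):
            found.append((i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")))
    return found


print(find_unusual_spaces("La\u00a0Seu d'Urgell"))
```

Running this over a problematic cell value would show whether the "space" is really U+0020 or something else.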

This is most important for the fields that are linked to authority files; it is not needed everywhere (for example, in notes fields).
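Limiting the normalization to selected fields could look roughly like the Python sketch below. The field names (`place`, `standard_title`, `note`) and the function `normalize_record` are invented for illustration; the real list of authority-linked fields would have to come from the application.

```python
# Hypothetical field names; not taken from the Muscat schema.
AUTHORITY_LINKED_FIELDS = {"place", "standard_title"}

# Only single quotes are mapped here to keep the sketch short.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})


def normalize_record(record):
    """Return a copy of record with only the authority-linked fields normalized."""
    return {
        field: value.translate(QUOTE_MAP) if field in AUTHORITY_LINKED_FIELDS else value
        for field, value in record.items()
    }


record = {"place": "La Seu d\u2019Urgell", "note": "copied \u2018as is\u2019"}
print(normalize_record(record))
```

The notes field keeps its typographic quotes, while the field that links to an authority record gets the plain apostrophe.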

@fjorba
Contributor

fjorba commented Jul 22, 2024

If it helps, here is the list of characters we systematically correct in our systems because we have found them in our records (still in Python 2; copy and paste may not have preserved some of them, but the comments may help):

bad_chars = {
    '\t': u' ',
    '\n': u' ',
    u'\r': '', # Macintosh newline char? (key lost in copy and paste; carriage return assumed)
    u'\xa0': u' ', # Unicode 0xA0, NO-BREAK SPACE
    u'\u200e': u' ', # Unicode 0x200E, LEFT-TO-RIGHT MARK
    u'‘': u"'", # Unicode 0x2018, LEFT SINGLE QUOTATION MARK
    u'’': u"'", # Unicode 0x2019, RIGHT SINGLE QUOTATION MARK                   
    u'´': u"'", # Unicode 0xB4, ACUTE ACCENT                                    
    u'′': u"'", # Unicode 0x2032, PRIME                                         
    u'`': u"'", # Unicode 0x60, GRAVE ACCENT                                    
    u'\222': u"'", # Unicode 0x92: PRIVATE USE TWO                              
    u'“': u'"',
    u'”': u'"',
    u'<<': u'«',
    u'&lt;&lt;': u'«',
    u'>>': u'»',
    u'&gt;&gt;': u'»',
    u'l.l': u'l·l',
    u'l•l': u'l·l',
    u'l\225l': u'l·l',
    u'&#61655;': u'·',
    u'–': u'-', # Unicode 0x2013, EN DASH                                       
    u'—': u'-', # Unicode 0x2014, EM DASH                                       
    u'‐': u'-', # Unicode 0x2010, HYPHEN                                        
}

def replace_bad_chars(line):
    for bad_char in bad_chars:
        line = line.replace(bad_char, bad_chars[bad_char])
    return line
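For comparison, a Python 3 take on the same idea might separate single characters, which `str.translate` handles in one pass, from multi-character sequences, which still need `str.replace`. This is a sketch covering only part of the table above; the names `SINGLE_CHARS`, `SEQUENCES`, and `replace_bad_chars_py3` are hypothetical.

```python
# Single characters: one pass with str.translate.
SINGLE_CHARS = str.maketrans({
    "\t": " ", "\n": " ", "\u00a0": " ",
    "\u2018": "'", "\u2019": "'", "\u00b4": "'", "\u2032": "'", "`": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2010": "-",
})

# Multi-character sequences: applied in list order with str.replace.
SEQUENCES = [
    ("&lt;&lt;", "\u00ab"), ("&gt;&gt;", "\u00bb"),
    ("<<", "\u00ab"), (">>", "\u00bb"),
    ("l.l", "l\u00b7l"), ("l\u2022l", "l\u00b7l"),  # Catalan ela geminada
]


def replace_bad_chars_py3(line):
    """Apply the single-character table first, then the ordered sequences."""
    line = line.translate(SINGLE_CHARS)
    for bad, good in SEQUENCES:
        line = line.replace(bad, good)
    return line


print(replace_bad_chars_py3("l\u2019amour\u00a0\u2013 <<Il.lusi\u00f3>>"))
```

Using a list for the sequences keeps the replacement order explicit instead of relying on dictionary iteration order.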
