Normalize punctuation on input #1599

jenniferward · 2024-06-12T09:28:25Z

Some characters need to be normalized (smart quotes vs. apostrophes) but some need to be allowed (u vs ü). Currently, ’ and ' are read as different punctuation marks. This causes misalignment in city names in Institutions:
La Seu d’Urgell https://rism.online/institutions/30079707
La Seu d'Urgell https://rism.online/institutions/30005481
and duplicates in Titles/Texts:
https://muscat.rism.info/admin/standard_titles?utf8=%E2%9C%93&q%5Btitle_equals%5D=Au+sein+des+alarmes+l%E2%80%99amour+a+des+charmes&commit=Filter&order=id_desc
Au sein des alarmes l’amour a des charmes
Au sein des alarmes l'amour a des charmes

This arises especially when copying from websites or importing data. The problem has been solved for searching (see #622) but not on the input side.

I can think of the following:

’ and '
" " and “ ”
- – — (hyphen, en dash, em dash)

For the dashes, only one is needed in the standardized fields (the plain hyphen, I think?).
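
As a rough illustration of the mapping listed above, here is a minimal Python sketch. The names `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical and not taken from Muscat, and picking the plain hyphen as the single dash is only the guess made above. Letters such as ü are not in the table, so they pass through unchanged.

```python
# Hypothetical sketch: map typographic quotes and dashes to plain equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
})


def normalize_punctuation(text):
    """Return text with smart quotes and long dashes replaced; letters are untouched."""
    return text.translate(PUNCTUATION_MAP)


# The two spellings from the Institutions example compare equal after normalization.
print(normalize_punctuation("La Seu d\u2019Urgell") == "La Seu d'Urgell")
```

Because only the listed code points are mapped, this would leave accented letters and any other punctuation exactly as entered.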

What about spaces? Sometimes they behave strangely (Excel doesn't always read the spaces as spaces), but I can't describe it further than that.
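
One way to pin this down might be to list which characters a problematic value actually contains. A frequent cause of this kind of behaviour is a NO-BREAK SPACE (U+00A0) or an invisible format character, but whether that is what happens here is an assumption. The helper name `describe_odd_characters` and the sample string are made up for illustration.

```python
import unicodedata


def describe_odd_characters(text):
    """List (index, code point, Unicode name) for each non-ASCII space or format character."""
    findings = []
    for index, char in enumerate(text):
        category = unicodedata.category(char)
        # Zs = space separators (e.g. NO-BREAK SPACE), Cf = invisible format characters.
        if char != " " and category in ("Zs", "Cf"):
            name = unicodedata.name(char, "UNKNOWN")
            findings.append((index, "U+%04X" % ord(char), name))
    return findings


sample = "La\u00a0Seu d'Urgell"  # made-up input with a no-break space after "La"
for finding in describe_odd_characters(sample):
    print(finding)
```

Running this over a value pasted from a spreadsheet would show whether the "space" is really U+0020 or something else.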

This is most important for the fields that are linked to authority files; it does not need to apply everywhere (for example, not in notes fields).

fjorba · 2024-07-22T09:21:36Z

If it helps, the list of the characters we systematically correct in our systems, because we have found them in our records, is this one (still in Python2; maybe copy and paste hasn't respected some of them, but the comment may help):

bad_chars = {
    '\t': u' ',
    '&#13;': u' ',
    u'\r': u'', # Macintosh newline char?
    u'\u00a0': u' ', # Unicode 0xA0, NO-BREAK SPACE
    u'\u200e': u' ', # Unicode 0x200E, LEFT-TO-RIGHT MARK
    u'‘': u"'", # Unicode 0x2018, LEFT SINGLE QUOTATION MARK
    u'’': u"'", # Unicode 0x2019, RIGHT SINGLE QUOTATION MARK                   
    u'´': u"'", # Unicode 0xB4, ACUTE ACCENT                                    
    u'′': u"'", # Unicode 0x2032, PRIME                                         
    u'`': u"'", # Unicode 0x60, GRAVE ACCENT                                    
    u'\222': u"'", # Unicode 0x92: PRIVATE USE TWO                              
    u'“': u'"',
    u'”': u'"',
    u'<<': u'«',
    u'&lt;&lt;': u'«',
    u'>>': u'»',
    u'&gt;&gt;': u'»',
    u'l.l': u'l·l',
    u'l•l': u'l·l',
    u'l\225l': u'l·l',
    u'&#61655;': u'·',
    u'–': u'-', # Unicode 0x2013, EN DASH                                       
    u'—': u'-', # Unicode 0x2014, EM DASH                                       
    u'‐': u'-', # Unicode 0x2010, HYPHEN                                        
}

def replace_bad_chars(line):
    for bad_char in bad_chars:
        line = line.replace(bad_char, bad_chars[bad_char])
    return line
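
For anyone wanting to try this in Python 3, a minimal sketch of the same idea follows. It is not the code that runs in the system described above: it reproduces only a subset of the table, and the names `SINGLE_CHARS`, `MULTI_CHARS` and `replace_bad_chars_py3` are hypothetical. Single characters go through `str.translate`, while multi-character sequences are replaced in an explicit order so the result does not depend on dictionary ordering.

```python
# Hypothetical Python 3 variant; only a subset of the table above is reproduced.
SINGLE_CHARS = str.maketrans({
    "\t": " ",
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
    "\u2010": "-",   # HYPHEN
})

# Multi-character sequences cannot go through str.translate, so they are
# replaced one by one, in the order listed here.
MULTI_CHARS = [
    ("&lt;&lt;", "\u00ab"),  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    ("&gt;&gt;", "\u00bb"),  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    ("<<", "\u00ab"),
    (">>", "\u00bb"),
    ("l.l", "l\u00b7l"),     # l·l written with MIDDLE DOT
]


def replace_bad_chars_py3(line):
    """Apply the multi-character replacements first, then the single-character table."""
    for bad, good in MULTI_CHARS:
        line = line.replace(bad, good)
    return line.translate(SINGLE_CHARS)


print(replace_bad_chars_py3("<<col.legi>> \u2013 d\u2019Urgell"))
```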
