Commit 01afcad

Correct an error in the previous commit that caused failure to wrap around when nothing is selected.
Remove \m escape; add [[.x--.]] character names for invalid UTF-8 bytes. Update help.
1 parent ecf13a6 commit 01afcad

3 files changed: 167 additions and 14 deletions
help.htm

Lines changed: 4 additions & 3 deletions
@@ -308,7 +308,7 @@ <h3>Unicode</h3>
 
 <p>There are no surrogate pairs in <strong>Columns++</strong> regular expressions; each code point matches as a single character. To enter any Unicode character in hexadecimal notation, use the full code point; for example, enter &#x1f642; as <code>\x{1f642}</code>. (The surrogate pair, <code>\x{d83d}\x{de42}</code>, which must be used in <strong>Notepad++</strong> search, <em>will not match</em> in <strong>Columns++</strong>.)</p>
 
-<p><strong>Scintilla</strong>, the display control used in <strong>Notepad++</strong>, represents Unicode internally as UTF-8. (This is true whether the file containing the document is UTF-8, UTF-16 or anything else other than “ANSI.”) When displaying Unicode documents that contain invalid UTF-8, Scintilla shows each byte that cannot be decoded as a hexadecimal code in reversed colors. When matching a regular expression, <strong>Columns++</strong> treats each of these error bytes as if it were the Unicode code point formed by adding <code>0xdc00</code> to the invalid byte. These code points are in the surrogate range and are invalid as Unicode characters. (It is possible to match one of these error bytes by prefixing <code>dc</code> to the hexadecimal value; e.g., <code>0xf7</code> is never a valid <em>byte</em> in UTF-8, but it can be found as <code>\x{dcf7}</code>.)</p>
+<p><strong>Scintilla</strong>, the display control used in <strong>Notepad++</strong>, represents Unicode internally as UTF-8. (This is true whether the file containing the document is UTF-8, UTF-16 or anything else other than “ANSI.”) When displaying Unicode documents that contain invalid UTF-8, Scintilla shows each byte that cannot be decoded as a hexadecimal code in reversed colors. You can match any of these bytes with <code>\i</code>; to match a specific byte, use the hexadecimal code Scintilla displays as a symbolic character name, e.g., <code>[[.xF7.]]</code>. (When matching a regular expression, <strong>Columns++</strong> treats each of these error bytes as if it were the Unicode code point formed by adding <code>0xdc00</code> to the invalid byte. These code points are in the surrogate range and are invalid as UTF-32 code units.)</p>
 
 <p>The period (<code>.</code>) matches any one code point except the characters which end lines in Scintilla: carriage return (<code>\x0d</code> or <code>\r</code>) and newline (also called line feed, <code>\x0a</code> or <code>\n</code>). This corresponds to the <a href="https://npp-user-manual.org/docs/searching/#single-character-matches">documented</a> behavior of the period, but not the actual behavior in Notepad++ (where there are several other control characters it does not match). Use <code>\X</code> to match a character including any combining code points (marks) which follow it. (In Notepad++ search, <code>.</code> and <code>\X</code> do not work as expected when the code points involved are outside the basic multilingual plane, that is, 0x10000 or greater.)</p>
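The mapping described in the revised help text (each undecodable byte becomes the code point `0xdc00` plus the byte) can be illustrated with a small decoder. This is only a sketch under stated assumptions: `decodeWithErrorBytes` is a hypothetical name, it is not the plugin's decoder, and the way it splits a broken multi-byte sequence into error bytes (one byte at a time) may differ from how Scintilla groups them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not taken from Columns++: decode UTF-8 to UTF-32 and
// map every byte that cannot be decoded to the code point 0xDC00 + byte.
std::u32string decodeWithErrorBytes(const std::string& utf8) {
    std::u32string out;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {                      // ASCII: one byte, one code point
            out.push_back(b0);
            ++i;
            continue;
        }
        std::size_t len = 0;                  // expected sequence length
        char32_t cp = 0;                      // code point being assembled
        char32_t min = 0;                     // smallest value allowed for this length
        if      ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80;    }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800;   }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        bool ok = len != 0 && i + len <= n;   // lead byte valid and enough bytes left
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;        // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, values above U+10FFFF and encoded surrogates.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;
        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(0xDC00 + b0);       // error byte: 0xF7 becomes 0xDCF7
            ++i;
        }
    }
    return out;
}
```

With this sketch, the single byte `0xF7` decodes to `0xDCF7`, which is the value the diff assigns to the symbolic name `xf7`.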

@@ -317,7 +317,6 @@ <h3>Unicode</h3>
 <table class=optionsTable>
 <tr><th>escape</th><th>negation</th><th>character class</th><th>meaning</th></tr>
 <tr><td><code>\i</code></td><td><code>\I</code></td><td><code>[[:invalid:]]</code></td><td>a byte in an invalid UTF-8 sequence</td></tr>
-<tr><td><code>\m</code></td><td><code>\M</code></td><td><code>[[:mark:]]</code></td><td>a combining mark, which displays as part of the previous character</td></tr>
 <tr><td><code>\o</code></td><td><code>\O</code></td><td><code>[[:ascii:]]</code></td><td>an ASCII character, code points 0 through 127</td></tr>
 <tr><td><code>\y</code></td><td><code>\Y</code></td><td><code>[[:defined:]]</code></td><td>any Unicode code point that is assigned and is not a surrogate or a private use character</td></tr>
 </table>
@@ -488,7 +487,9 @@ <h3>Unicode</h3>
 <tr><td><code>[[.sflo.]]</code></td> <td>1bca0</td><td>shorthand format letter overlap</td></tr>
 <tr><td><code>[[.sfco.]]</code></td> <td>1bca1</td><td>shorthand format continuing overlap</td></tr>
 <tr><td><code>[[.sfds.]]</code></td> <td>1bca2</td><td>shorthand format down step</td></tr>
-<tr><td><code>[[.sfus.]]</code></td> <td>1bca3</td><td>shorthand format up step</td></tr></table>
+<tr><td><code>[[.sfus.]]</code></td> <td>1bca3</td><td>shorthand format up step</td></tr>
+<tr><td><code>[[.x80.]]–[[.xff.]]</code></td> <td></td> <td>invalid UTF-8 bytes</td></tr>
+</table>
 
 </section>

src/ColumnsPlusPlus.h

Lines changed: 2 additions & 2 deletions
@@ -415,11 +415,11 @@ class ColumnsPlusPlusData {
 if (searchData.autoSetSelection) {
 if (sci.SelectionMode() != Scintilla::SelectionMode::Stream) return SearchRegionNotReady;
 if (sci.Selections() > 1) return SearchRegionNotReady;
-if (sci.SelectionEmpty()) return SearchRegionImpliedAll;
 }
 if (sci.IndicatorValueAt(searchData.indicator, 0)) return SearchRegionReady;
 Scintilla::Position ie = sci.IndicatorEnd(searchData.indicator, 0);
-return ie != 0 && ie != sci.Length() ? SearchRegionReady : SearchRegionNotReady;
+if (ie != 0 && ie != sci.Length()) return SearchRegionReady;
+return searchData.autoSetSelection && sci.SelectionEmpty() ? SearchRegionImpliedAll : SearchRegionNotReady;
 }
 
 void syncFindButton() {
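As read from the hunk above, the fix moves the empty-selection test so that an already marked search region is reported as ready before an empty selection can imply "search everything". The following standalone sketch models that ordering so the two cases can be compared. `RegionState`, `SearchRegionStatus` and `classifyRegion` are hypothetical stand-ins for the Scintilla queries in the real method, not names from the plugin.

```cpp
// Hypothetical model of the decision after this commit; the fields stand in
// for the Scintilla calls used in the real code.
enum SearchRegionStatus { SearchRegionNotReady, SearchRegionReady, SearchRegionImpliedAll };

struct RegionState {
    bool autoSetSelection;   // searchData.autoSetSelection
    bool streamSelection;    // SelectionMode() == SelectionMode::Stream
    int  selections;         // Selections()
    bool selectionEmpty;     // SelectionEmpty()
    bool indicatorAtStart;   // IndicatorValueAt(indicator, 0) != 0
    long indicatorEnd;       // IndicatorEnd(indicator, 0)
    long length;             // Length()
};

SearchRegionStatus classifyRegion(const RegionState& s) {
    if (s.autoSetSelection) {
        if (!s.streamSelection) return SearchRegionNotReady;
        if (s.selections > 1)   return SearchRegionNotReady;
    }
    // An existing marked region is checked first...
    if (s.indicatorAtStart) return SearchRegionReady;
    if (s.indicatorEnd != 0 && s.indicatorEnd != s.length) return SearchRegionReady;
    // ...and only when none exists does an empty selection imply the whole document.
    return s.autoSetSelection && s.selectionEmpty ? SearchRegionImpliedAll
                                                  : SearchRegionNotReady;
}
```

In this model, an empty selection combined with a marked region yields `SearchRegionReady`, whereas the pre-commit ordering would have returned `SearchRegionImpliedAll` before the indicator was examined.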

src/Unicode/UnicodeRegexTraits.cpp

Lines changed: 161 additions & 9 deletions
@@ -184,8 +184,12 @@ const utf32_regex_traits::char_class_type utf32_regex_traits::asciiMasks[] = {
 /* 7E ~ */ CatMask_Sm | mask_ascii | mask_graph | mask_punct,
 /* 7F DEL */ CatMask_Cc | mask_ascii | mask_cntrl,
 };
-
+
+
 const std::map<std::string, utf32_regex_traits::char_class_type> utf32_regex_traits::classnames = {
+
+// Unicode general categories - short names:
+
 {"c*", CatMask_Cc | CatMask_Cf | CatMask_Cn | CatMask_Co},
 {"l*", CatMask_Ll | CatMask_Lm | CatMask_Lo | CatMask_Lt | CatMask_Lu},
 {"m*", CatMask_Mc | CatMask_Me | CatMask_Mn},
@@ -222,6 +226,9 @@ const std::map<std::string, utf32_regex_traits::char_class_type> utf32_regex_tra
 {"zl", CatMask_Zl},
 {"zp", CatMask_Zp},
 {"zs", CatMask_Zs},
+
+// Unicode character class names:
+
 {"ascii" , mask_ascii},
 {"any" , 0x3fffffff00000000U},
 {"assigned" , 0x3fffffee00000000U},
@@ -236,7 +243,6 @@ const std::map<std::string, utf32_regex_traits::char_class_type> utf32_regex_tra
 {"format" , CatMask_Cf},
 {"not assigned" , CatMask_Cn},
 {"private use" , CatMask_Co},
-{"invalid" , CatMask_Cs}, // No surrogates in UTF-32, but we use some to hold invalid UTF-8 bytes
 {"lowercase letter" , CatMask_Ll},
 {"modifier letter" , CatMask_Lm},
 {"other letter" , CatMask_Lo},
@@ -262,6 +268,9 @@ const std::map<std::string, utf32_regex_traits::char_class_type> utf32_regex_tra
 {"line separator" , CatMask_Zl},
 {"paragraph separator" , CatMask_Zp},
 {"space separator" , CatMask_Zs},
+
+// POSIX/Boost class names and escapes:
+
 {"alnum" , mask_alnum },
 {"alpha" , mask_alpha },
 {"blank" , mask_blank },
@@ -270,11 +279,8 @@ const std::map<std::string, utf32_regex_traits::char_class_type> utf32_regex_tra
 {"digit" , mask_digit },
 {"graph" , mask_graph },
 {"h" , mask_horizontal},
-{"i" , CatMask_Cs },
 {"l" , mask_lower },
 {"lower" , mask_lower },
-{"m" , CatMask_Mc | CatMask_Me | CatMask_Mn},
-{"o" , mask_ascii },
 {"print" , mask_print },
 {"punct" , mask_punct },
 {"s" , mask_space },
@@ -286,19 +292,30 @@ const std::map<std::string, utf32_regex_traits::char_class_type> utf32_regex_tra
 {"w" , mask_word },
 {"word" , mask_word },
 {"xdigit" , mask_xdigit },
+
+// additional for Columns++:
+
 {"y" , 0x3fffffe600000000U},
-{"defined" , 0x3fffffe600000000U}
+{"defined" , 0x3fffffe600000000U},
+{"i" , CatMask_Cs}, // Surrogates are not valid in UTF-8, but we use xDC80-xDCFF
+{"invalid" , CatMask_Cs}, // to represent invalid UTF-8 bytes
+{"o" , mask_ascii}
+
 };
 
+
 const std::map<std::string, utf32_regex_traits::char_type> utf32_regex_traits::character_names = {
+
 {"ht" , 0x0009}, // Horizontal Tab
 {"lf" , 0x000a}, // Line Feed
 {"cr" , 0x000d}, // Carriage Return
 {"sflo" , 0x1bca0}, // Shorthand Format Letter Overlap
 {"sfco" , 0x1bca1}, // Shorthand Format Continuing Overlap
 {"sfds" , 0x1bca2}, // Shorthand Format Down Step
 {"sfus" , 0x1bca3}, // Shorthand Format Up Step
-// from Notepad++ (ScintillaEditView.h):
+
+// from Notepad++ (ScintillaEditView.h):
+
 {"nul" , 0x0000}, // Null
 {"soh" , 0x0001}, // Start of Heading
 {"stx" , 0x0002}, // Start of Text
@@ -412,7 +429,9 @@ const std::map<std::string, utf32_regex_traits::char_type> utf32_regex_traits::c
 {"iaa" , 0xfff9}, // interlinear annotation anchor
 {"ias" , 0xfffa}, // interlinear annotation separator
 {"iat" , 0xfffb}, // interlinear annotation terminator
-// other POSIX names, from Boost (regex_traits_default.hpp):
+
+// other POSIX names, from Boost (regex_traits_default.hpp):
+
 {"alert" , 0x07},
 {"backspace" , 0x08},
 {"tab" , 0x09},
@@ -466,9 +485,142 @@ const std::map<std::string, utf32_regex_traits::char_type> utf32_regex_traits::c
 {"left-curly-bracket" , 0x7b},
 {"vertical-line" , 0x7c},
 {"right-curly-bracket" , 0x7d},
-{"tilde" , 0x7e}
+{"tilde" , 0x7e},
+
+// invalid UTF-8 bytes:
+
+{"x80" , 0xdc80},
+{"x81" , 0xdc81},
+{"x82" , 0xdc82},
+{"x83" , 0xdc83},
+{"x84" , 0xdc84},
+{"x85" , 0xdc85},
+{"x86" , 0xdc86},
+{"x87" , 0xdc87},
+{"x88" , 0xdc88},
+{"x89" , 0xdc89},
+{"x8a" , 0xdc8a},
+{"x8b" , 0xdc8b},
+{"x8c" , 0xdc8c},
+{"x8d" , 0xdc8d},
+{"x8e" , 0xdc8e},
+{"x8f" , 0xdc8f},
+{"x90" , 0xdc90},
+{"x91" , 0xdc91},
+{"x92" , 0xdc92},
+{"x93" , 0xdc93},
+{"x94" , 0xdc94},
+{"x95" , 0xdc95},
+{"x96" , 0xdc96},
+{"x97" , 0xdc97},
+{"x98" , 0xdc98},
+{"x99" , 0xdc99},
+{"x9a" , 0xdc9a},
+{"x9b" , 0xdc9b},
+{"x9c" , 0xdc9c},
+{"x9d" , 0xdc9d},
+{"x9e" , 0xdc9e},
+{"x9f" , 0xdc9f},
+{"xa0" , 0xdca0},
+{"xa1" , 0xdca1},
+{"xa2" , 0xdca2},
+{"xa3" , 0xdca3},
+{"xa4" , 0xdca4},
+{"xa5" , 0xdca5},
+{"xa6" , 0xdca6},
+{"xa7" , 0xdca7},
+{"xa8" , 0xdca8},
+{"xa9" , 0xdca9},
+{"xaa" , 0xdcaa},
+{"xab" , 0xdcab},
+{"xac" , 0xdcac},
+{"xad" , 0xdcad},
+{"xae" , 0xdcae},
+{"xaf" , 0xdcaf},
+{"xb0" , 0xdcb0},
+{"xb1" , 0xdcb1},
+{"xb2" , 0xdcb2},
+{"xb3" , 0xdcb3},
+{"xb4" , 0xdcb4},
+{"xb5" , 0xdcb5},
+{"xb6" , 0xdcb6},
+{"xb7" , 0xdcb7},
+{"xb8" , 0xdcb8},
+{"xb9" , 0xdcb9},
+{"xba" , 0xdcba},
+{"xbb" , 0xdcbb},
+{"xbc" , 0xdcbc},
+{"xbd" , 0xdcbd},
+{"xbe" , 0xdcbe},
+{"xbf" , 0xdcbf},
+{"xc0" , 0xdcc0},
+{"xc1" , 0xdcc1},
+{"xc2" , 0xdcc2},
+{"xc3" , 0xdcc3},
+{"xc4" , 0xdcc4},
+{"xc5" , 0xdcc5},
+{"xc6" , 0xdcc6},
+{"xc7" , 0xdcc7},
+{"xc8" , 0xdcc8},
+{"xc9" , 0xdcc9},
+{"xca" , 0xdcca},
+{"xcb" , 0xdccb},
+{"xcc" , 0xdccc},
+{"xcd" , 0xdccd},
+{"xce" , 0xdcce},
+{"xcf" , 0xdccf},
+{"xd0" , 0xdcd0},
+{"xd1" , 0xdcd1},
+{"xd2" , 0xdcd2},
+{"xd3" , 0xdcd3},
+{"xd4" , 0xdcd4},
+{"xd5" , 0xdcd5},
+{"xd6" , 0xdcd6},
+{"xd7" , 0xdcd7},
+{"xd8" , 0xdcd8},
+{"xd9" , 0xdcd9},
+{"xda" , 0xdcda},
+{"xdb" , 0xdcdb},
+{"xdc" , 0xdcdc},
+{"xdd" , 0xdcdd},
+{"xde" , 0xdcde},
+{"xdf" , 0xdcdf},
+{"xe0" , 0xdce0},
+{"xe1" , 0xdce1},
+{"xe2" , 0xdce2},
+{"xe3" , 0xdce3},
+{"xe4" , 0xdce4},
+{"xe5" , 0xdce5},
+{"xe6" , 0xdce6},
+{"xe7" , 0xdce7},
+{"xe8" , 0xdce8},
+{"xe9" , 0xdce9},
+{"xea" , 0xdcea},
+{"xeb" , 0xdceb},
+{"xec" , 0xdcec},
+{"xed" , 0xdced},
+{"xee" , 0xdcee},
+{"xef" , 0xdcef},
+{"xf0" , 0xdcf0},
+{"xf1" , 0xdcf1},
+{"xf2" , 0xdcf2},
+{"xf3" , 0xdcf3},
+{"xf4" , 0xdcf4},
+{"xf5" , 0xdcf5},
+{"xf6" , 0xdcf6},
+{"xf7" , 0xdcf7},
+{"xf8" , 0xdcf8},
+{"xf9" , 0xdcf9},
+{"xfa" , 0xdcfa},
+{"xfb" , 0xdcfb},
+{"xfc" , 0xdcfc},
+{"xfd" , 0xdcfd},
+{"xfe" , 0xdcfe},
+{"xff" , 0xdcff}
+
 };
 
+
 const std::set<utf32_regex_traits::string_type> utf32_regex_traits::digraphs = { // from Boost
 U"ae", U"ch", U"dz", U"lj", U"ll", U"nj", U"ss",
 U"Ae", U"Ch", U"Dz", U"Lj", U"Ll", U"Nj", U"Ss",
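The 128 `x80` through `xff` entries added above follow one rule: the name is `x` plus the byte in two lowercase hexadecimal digits, and the value is `0xdc00` plus the byte. As an illustration only, the same name-to-value pairs could be generated in a loop. `invalidByteNames` is a hypothetical helper, not part of the commit, which keeps the entries in a static initializer list; `char32_t` is assumed here as a stand-in for `utf32_regex_traits::char_type`.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical helper: builds the same name -> code point pairs that the
// commit lists by hand for the invalid UTF-8 bytes 0x80 through 0xFF.
std::map<std::string, char32_t> invalidByteNames() {
    std::map<std::string, char32_t> names;
    for (unsigned int byte = 0x80; byte <= 0xFF; ++byte) {
        char name[4];                                     // 'x' + two hex digits + NUL
        std::snprintf(name, sizeof name, "x%02x", byte);  // e.g. "xf7"
        names.emplace(name, static_cast<char32_t>(0xDC00 + byte));
    }
    return names;
}
```

The keys generated here are lowercase, matching the table in the diff. How the lookup treats an uppercase spelling such as `[[.xF7.]]` from the help text is not shown in this commit.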
