^Head and tail$ regex characters broken again #2

hackerb9 · 2019-03-03T08:20:15Z

ugrep ^x should show all characters whose description begins with "X". It does not. This was so much easier in awk... ☺

ldo · 2021-09-22T07:32:22Z

Just tested it, it works fine:

ldo@theon:ugrep> ./ugrep ^x | wc -l
16
ldo@theon:ugrep> ./ugrep 'X$' | wc -l
663

hackerb9 · 2021-09-22T20:53:33Z

Thank you for taking the time to notice this bug report and test it out. I have a kludge that makes it work for most cases, but it is not always correct: ugrep '(^x)', for example, still fails.

Python does not have Awk's concept of "fields" so I'm using regular expressions to fake them when parsing UnicodeData.txt. In the future, I may fake the fields by separating each one into a separate array. I didn't implement it that way originally because it seemed needlessly ugly and I presumed a more natural solution in Python would present itself.

ldo · 2021-09-23T04:33:10Z

Hmm, looks like you are trying to construct a regex out of something that might already be a regex. Maybe it’s easier not to bother? Let the user specify an explicit option to indicate that they are already supplying a regex, which you just use as is.

The simplest kind of Unicode search I could think of was what I implemented in my Android-based Unicode browser from some years ago: do partial matches on the words of the name, in order (but not necessarily successive). This lets you type something as simple as “right arrow” and have it match things like “RIGHTWARDS ARROW”, “RIGHTWARDS PAIRED ARROWS”, “LEFT RIGHT WAVE ARROW” and so on.
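A minimal sketch of that kind of ordered partial-word match, in Python. The function name `words_match` is hypothetical, and treating "partial" as a case-insensitive substring test is an assumption; the Android app may well do it differently.

```python
def words_match(query, name):
    """Return True if every query word partially matches a word of `name`,
    in the same order, but not necessarily in adjacent words."""
    # A single shared iterator: each query word can only match a name word
    # that comes after the one matched by the previous query word.
    name_words = iter(name.upper().split())
    return all(
        any(q in w for w in name_words)   # consumes words up to the first hit
        for q in query.upper().split()
    )


if __name__ == "__main__":
    for n in ("RIGHTWARDS ARROW", "LEFT RIGHT WAVE ARROW", "ARROW POINTING RIGHTWARDS"):
        print(n, words_match("right arrow", n))
```

Because the iterator is consumed as it goes, a name that contains the words in the wrong order (the last one above) does not match.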

hackerb9 · 2021-09-23T21:50:33Z

Thanks for the suggestion, but using a user-supplied regex "as is" would not work. I'm wrapping the regex in another one in order to split the fields of the UnicodeData.txt file. It's clearly the wrong way to do things in Python, but I am not sure what the Pythonic solution would be.

ldo · 2021-09-23T22:52:39Z

First of all, you can apply the user-supplied regex after splitting the fields, to just the particular field value containing the name.

Secondly, splitting the fields can be done more easily with an expression like

line.split(";")
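Putting the two points together, a rough sketch might look like the following. The function name `search_names` and the sample lines are illustrative only, and the case-insensitive flag is an assumption based on `ugrep ^x` being expected to match names that begin with "X". Since the pattern is applied to the name field alone, `^` and `$` anchor to the name without any wrapping regex.

```python
import re


def search_names(lines, pattern):
    """Yield (codepoint, name) for UnicodeData-style lines whose name field
    matches the user's regex, used exactly as supplied."""
    rx = re.compile(pattern, re.IGNORECASE)   # assumption: ignore case
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2:
            continue                          # skip blank or malformed lines
        codepoint, name = fields[0], fields[1]
        if rx.search(name):                   # ^ and $ see only the name
            yield codepoint, name


# Sample lines in the UnicodeData.txt format, for illustration.
SAMPLE = [
    "0058;LATIN CAPITAL LETTER X;Lu;0;L;;;;;N;;;;0078;\n",
    "2717;BALLOT X;So;0;ON;;;;;N;;;;;\n",
    "1FA60;XIANGQI RED GENERAL;So;0;ON;;;;;N;;;;;\n",
]

if __name__ == "__main__":
    for pat in ("^x", "X$", "(^x)"):
        print(pat, list(search_names(SAMPLE, pat)))
```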

hackerb9 · 2021-09-25T12:39:08Z

Thank you. I had started doing something like that using a list comprehension:

with open('/usr/share/unicode/UnicodeData.txt') as f:
    c=[ s.split(';') for s in f ] 

(codepoint, character_name, category, dummy, dummy, dummy, dummy, dummy, dummy, dummy, alias, dummy, dummy, dummy, dummy) = zip(*c)

It seemed a bit ugly, so I figured I'd wait until a cleaner solution became obvious. I haven't stumbled onto one yet, and splitting the fields is more elegant than the regex kludge, so I should probably just implement it that way.

ldo · 2021-09-25T23:17:34Z

By the way, I looked at the various Unicode files when I was writing the Android app I mentioned earlier, and decided that NamesList.txt was more informative than UnicodeData.txt. The former also includes the names of code blocks, alternative character names, and references to related characters -- all of which I thought was useful to include in the app. There is a warning at the front not to do automatic parsing of that file, but I managed to get my parser-and-database-builder script (included in the app source tree) working well enough. ;)

hackerb9 · 2021-09-26T00:04:27Z

Yes, I have a branch called nameslist in which ugrep searches both UnicodeData and NamesList. I thought I had finished it and merged it into the mainline months ago, but that does not appear to be the case. Perhaps I got stuck on trying to handle regions so that a search for "food" would show pizza and so on?

I was also amused by the "Do not parse" message given that NamesList is the only source in Unicode for some pretty basic information. For example, 🍆 is known as an "eggplant" where I'm from, but UnicodeData.txt only knows it as "aubergine".
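As a rough illustration of why the aliases matter, here is a sketch that reads NamesList-style lines into a dictionary and searches both names and aliases. It assumes the usual layout (a name line is `CODE<TAB>NAME`, an alias line is `<TAB>= alias`, block headers start with `@@`) and ignores every other line type. The helper names and the sample data are hypothetical, and this is not how the nameslist branch works.

```python
def parse_nameslist(lines):
    """Map codepoint -> {"name": ..., "aliases": [...]} from NamesList-style lines."""
    entries = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("\t= ") and current is not None:
            current["aliases"].append(line[3:])      # alias of the last character
        elif line and line[0] not in "\t@;":
            code, _, name = line.partition("\t")     # a new character entry
            current = entries[code] = {"name": name, "aliases": []}
        # anything else (block headers, comments, cross references) is skipped
    return entries


def find(entries, word):
    """Return codepoints whose name or any alias contains `word`, ignoring case."""
    word = word.lower()
    return [code for code, e in entries.items()
            if word in e["name"].lower()
            or any(word in a.lower() for a in e["aliases"])]


# Illustrative sample in the assumed format.
SAMPLE = [
    "@@\t1F300\tMiscellaneous Symbols and Pictographs\t1F5FF\n",
    "1F345\tTOMATO\n",
    "1F346\tAUBERGINE\n",
    "\t= eggplant\n",
    "1F347\tGRAPES\n",
]

if __name__ == "__main__":
    print(find(parse_nameslist(SAMPLE), "eggplant"))
```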

hackerb9 · 2021-09-26T00:22:47Z

Oh, right. I see it now. Python was giving me guff about doing a multiline regex search and I shelved it, thinking of either going back to AWK or doing what you said: parse the file and convert it to a searchable database.

hackerb9 self-assigned this Mar 3, 2019
