^Head and tail$ regex characters broken again #2

Open
hackerb9 opened this issue Mar 3, 2019 · 9 comments

@hackerb9
Owner

hackerb9 commented Mar 3, 2019

ugrep ^x should show all characters whose description begins with "X". It does not. This was so much easier in awk... ☺

@hackerb9 hackerb9 self-assigned this Mar 3, 2019
@ldo

ldo commented Sep 22, 2021

Just tested it, it works fine:

ldo@theon:ugrep> ./ugrep ^x | wc -l
16
ldo@theon:ugrep> ./ugrep 'X$' | wc -l
663

@hackerb9
Owner Author

Thank you for taking the time to notice this bug report and testing it out. I have a kludge that makes it work for most cases, but it is not always correct. For example, it still fails on ugrep '(^x)'.

Python does not have Awk's concept of "fields" so I'm using regular expressions to fake them when parsing UnicodeData.txt. In the future, I may fake the fields by separating each one into a separate array. I didn't implement it that way originally because it seemed needlessly ugly and I presumed a more natural solution in Python would present itself.

@ldo

ldo commented Sep 23, 2021

Hmm, looks like you are trying to construct a regex out of something that might already be a regex. Maybe it’s easier not to bother? Let the user specify an explicit option to indicate that they are already supplying a regex, which you just use as is.

The simplest kind of Unicode search I could think of was what I implemented in my Android-based Unicode browser from some years ago: do partial matches on the words of the name, in order (but not necessarily successive). This lets you type something as simple as “right arrow” and have it match things like “RIGHTWARDS ARROW”, “RIGHTWARDS PAIRED ARROWS”, “LEFT RIGHT WAVE ARROW” and so on.
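
A minimal Python sketch of that in-order word matching, for illustration only. It assumes "partial match" means a case-insensitive substring of a word, which may not be exactly what the Android app does; `words_match` is a hypothetical name and the last sample name is made up to show a non-match.

```python
def words_match(query: str, name: str) -> bool:
    """True if each query word is a substring of some word of name, in order."""
    name_words = iter(name.upper().split())
    # any() consumes the shared iterator, so each query word has to be found
    # after the name word that matched the previous query word.
    return all(
        any(q in w for w in name_words)
        for q in query.upper().split()
    )


# The first three names come from the comment above; the last is a made-up
# non-match in which the words appear in the wrong order.
names = [
    "RIGHTWARDS ARROW",
    "RIGHTWARDS PAIRED ARROWS",
    "LEFT RIGHT WAVE ARROW",
    "ARROW POINTING RIGHTWARDS",
]
matches = [n for n in names if words_match("right arrow", n)]
print(matches)
```

Sharing one iterator across the query words is what enforces "in order, but not necessarily successive": words skipped while looking for one query word cannot be revisited by the next.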

@hackerb9
Owner Author

Thanks for the suggestion, but using a user-supplied regex "as is" would not work. I'm wrapping the regex in another one in order to split the fields of the UnicodeData.txt file. It's clearly the wrong way to do things in Python, but I am not sure what the Pythonic solution would be.

@ldo

ldo commented Sep 23, 2021

First of all, you can apply the user-supplied regex after splitting the fields, to just the particular field value containing the name.

Secondly, splitting the fields can be done more easily with an expression like

line.split(";")
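
Put together, the two points could look roughly like the sketch below. It is not ugrep's actual code: `search_names` is a hypothetical helper, case-insensitive matching is an assumption, and the sample records are only there to illustrate the 15-field UnicodeData.txt layout.

```python
import re


def search_names(pattern, lines):
    """Yield (codepoint, name) for records whose name field matches pattern."""
    rx = re.compile(pattern, re.IGNORECASE)  # assumption: ignore case
    for line in lines:
        fields = line.rstrip("\n").split(";")
        codepoint, name = fields[0], fields[1]
        # The regex only ever sees the name, so ^ and $ anchor to the name
        # itself and the user's pattern needs no wrapping.
        if rx.search(name):
            yield codepoint, name


# Illustrative records in UnicodeData.txt layout (semicolon-separated).
sample = [
    "0058;LATIN CAPITAL LETTER X;Lu;0;L;;;;;N;;;;0078;\n",
    "00D7;MULTIPLICATION SIGN;Sm;0;ON;;;;;N;;;;;\n",
    "22BB;XOR;Sm;0;ON;;;;;N;;;;;\n",
]
print(list(search_names("^x", sample)))
print(list(search_names("(^x)", sample)))
print(list(search_names("X$", sample)))
```

Because the pattern is compiled unchanged, a grouped anchor such as `(^x)` behaves the same as `^x`.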

@hackerb9
Owner Author

Thank you. I had started doing something like that using a list comprehension:

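with open('/usr/share/unicode/UnicodeData.txt', encoding='utf-8') as f:
    # One record per line, 15 semicolon-separated fields; drop the newline first.
    c = [s.rstrip('\n').split(';') for s in f]

# Transpose rows into columns; only fields 0, 1, 2 and 10 are kept.
(codepoint, character_name, category, _, _, _, _, _, _, _, alias, _, _, _, _) = zip(*c)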

It seemed a bit ugly, so I figured I'd wait until a cleaner solution became obvious. I haven't stumbled into one yet and it's more elegant than the regex kludge, so I probably should just implement it that way.

@ldo

ldo commented Sep 25, 2021

By the way, I looked at the various Unicode files when I was writing the Android app I mentioned earlier, and decided that NamesList.txt was more informative than UnicodeData.txt. The former also includes the names of code blocks, alternative character names, and references to related characters -- all of which I thought was useful to include in the app. There is a warning at the front not to do automatic parsing of that file, but I managed to get my parser-and-database-builder script (included in the app source tree) working well enough. ;)

@hackerb9
Owner Author

Yes, I have a branch called nameslist in which ugrep searches both UnicodeData and NamesList. I thought I had finished it and merged it into the mainline months ago, but that does not appear to be the case. Perhaps I got stuck on trying to handle regions so that a search for "food" would show pizza and so on?

I was also amused by the "Do not parse" message given that NamesList is the only source in Unicode for some pretty basic information. For example, 🍆 is known as an "eggplant" where I'm from, but UnicodeData.txt only knows it as "aubergine".

@hackerb9
Owner Author

Oh, right. I see it now. Python was giving me guff about doing a multiline regex search and I shelved it, thinking of either going back to AWK or doing what you said: parse the file and convert it to a searchable database.
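
One way the "parse it into a searchable structure" route might look in Python, reading NamesList line by line instead of using a multiline regex. This is only a sketch under assumptions: it takes the layout to be "codepoint, TAB, name" for an entry, with alias lines of the form "TAB, `= alias`" underneath and `@` or `;` starting header and comment lines. The sample lines are written by hand to imitate that layout, and `parse_nameslist` is a hypothetical name.

```python
import re
from collections import defaultdict


def parse_nameslist(lines):
    """Map codepoint -> list of names (primary name first, then '=' aliases)."""
    names = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line[0] in "@;":
            current = None  # blank line, block header, or comment
        elif line[0] != "\t":
            # Character entry: hex codepoint, TAB, primary name.
            current, _, name = line.partition("\t")
            names[current].append(name)
        elif current and line.startswith("\t= "):
            # Alias line belonging to the most recent character entry.
            names[current].append(line[3:])
    return dict(names)


# Hand-written lines imitating the assumed NamesList.txt layout.
sample = [
    "@@\t1F300\tMiscellaneous Symbols and Pictographs\t1F5FF\n",
    "1F346\tAUBERGINE\n",
    "\t= eggplant\n",
    "1F347\tGRAPES\n",
]
db = parse_nameslist(sample)

# The user's regex is then applied to each name and alias separately.
rx = re.compile("^egg", re.IGNORECASE)
hits = [cp for cp, ns in db.items() if any(rx.search(n) for n in ns)]
print(hits)
```

Keeping each alias as its own string means `^` and `$` anchor to a single name or alias, so no multiline flag is needed.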


2 participants