^Head and tail$ regex characters broken again #2
Just tested it, it works fine:
Thank you for taking the time to notice this bug report and testing it out. I have a kludge that makes it work for most cases, but it is not always correct. For example, Python does not have Awk's concept of "fields", so I'm using regular expressions to fake them when parsing UnicodeData.txt. In the future, I may fake the fields by separating each one into a separate array. I didn't implement it that way originally because it seemed needlessly ugly, and I presumed a more natural solution in Python would present itself.
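A hypothetical sketch (not the project's actual code) of what faking Awk-style fields with a regex can look like, and why it breaks: the user's pattern is wrapped so that it is matched inside field 2 (the character name) of a UnicodeData.txt line.

```python
import re

# Hypothetical wrapper: consume the code point field, then apply the
# user's pattern inside the name field (between the 1st and 2nd ';').
line = "0058;LATIN CAPITAL LETTER X;Lu;0;L;;;;;N;;;;0078;"

user_pattern = "LATIN.*X"
wrapped = re.compile(r"^[^;]*;(" + user_pattern + r");")
m = wrapped.search(line)
print(m.group(1))            # LATIN CAPITAL LETTER X

# The kludge fails when the user's pattern contains its own anchors:
# an embedded ^ can never match after the wrapper has already consumed
# the code point field, which is exactly this issue's symptom.
broken = re.compile(r"^[^;]*;(^X)")
miss = broken.search("2717;BALLOT X;So")
print(miss)                  # None
```

The wrapping also misbehaves whenever the user's pattern itself can cross a `;`, which is why matching against the already-split name field is the safer design.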
Hmm, looks like you are trying to construct a regex out of something that might already be a regex. Maybe it’s easier not to bother? Let the user specify an explicit option to indicate that they are already supplying a regex, which you just use as is. The simplest kind of Unicode search I could think of was what I implemented in my Android-based Unicode browser from some years ago: do partial matches on the words of the name, in order (but not necessarily successive). This lets you type something as simple as “right arrow” and have it match things like “RIGHTWARDS ARROW”, “RIGHTWARDS PAIRED ARROWS”, “LEFT RIGHT WAVE ARROW” and so on.
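The matching scheme described above can be sketched in a few lines; this is a guess at the idea rather than the Android app's actual code, using the examples from the comment:

```python
def name_matches(query: str, name: str) -> bool:
    """True if each query word partially matches some word of the
    character name, in order but not necessarily in successive
    positions."""
    words = name.upper().split()
    pos = 0
    for q in query.upper().split():
        # advance until a name word containing this query word is found
        while pos < len(words) and q not in words[pos]:
            pos += 1
        if pos == len(words):
            return False
        pos += 1
    return True

print(name_matches("right arrow", "RIGHTWARDS ARROW"))        # True
print(name_matches("right arrow", "RIGHTWARDS PAIRED ARROWS"))# True
print(name_matches("right arrow", "LEFT RIGHT WAVE ARROW"))   # True
print(name_matches("right arrow", "LEFTWARDS ARROW"))         # False
```

Because the scan is in order, "arrow right" would not match "RIGHTWARDS ARROW", which matches the "in order" constraint stated above.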
Thanks for the suggestion, but using a user-supplied regex "as is" would not work. I'm wrapping the regex in another one in order to split the fields of the UnicodeData.txt file. It's clearly the wrong way to do things in Python, but I am not sure what the Pythonic solution would be.
First of all, you can apply the user-supplied regex after splitting the fields, to just the particular field value containing the name. Secondly, splitting the fields can be done more easily with an expression like:
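The exact expression suggested here is not quoted above; a minimal sketch of one way to do the split, assuming UnicodeData.txt's layout of 15 unquoted semicolon-separated fields, would be:

```python
# UnicodeData.txt uses 15 semicolon-separated fields with no quoting,
# so a plain str.split is enough; field 0 is the code point and
# field 1 is the character name.
line = "0058;LATIN CAPITAL LETTER X;Lu;0;L;;;;;N;;;;0078;\n"
fields = line.rstrip('\n').split(';')
codepoint, name = fields[0], fields[1]
print(codepoint, name)   # 0058 LATIN CAPITAL LETTER X
```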
Thank you. I had started doing something like that using a list comprehension:

```python
with open('/usr/share/unicode/UnicodeData.txt') as f:
    c = [s.split(';') for s in f]
(codepoint, character_name, category, dummy, dummy, dummy, dummy, dummy,
 dummy, dummy, alias, dummy, dummy, dummy, dummy) = zip(*c)
```

It seemed a bit ugly, so I figured I'd wait until a cleaner solution became obvious. I haven't stumbled into one yet, and it's more elegant than the regex kludge, so I probably should just implement it that way.
By the way, I looked at the various Unicode files when I was writing the Android app I mentioned earlier, and decided that
Yes, I have a branch called nameslist in which ugrep searches both UnicodeData and NamesList. I thought I had finished it and merged it into the mainline months ago, but that does not appear to be the case. Perhaps I got stuck on trying to handle regions so that a search for "food" would show pizza and so on? I was also amused by the "Do not parse" message, given that NamesList is the only source in Unicode for some pretty basic information. For example, 🍆 is known as an "eggplant" where I'm from, but UnicodeData.txt only knows it as "aubergine".
Oh, right. I see it now. Python was giving me guff about doing a multiline regex search and I shelved it, thinking of either going back to AWK or doing what you said: parse the file and convert it to a searchable database. |
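If the sticking point was `^`/`$` anchoring across a file read into one string, Python's `re.MULTILINE` flag makes those anchors match at every line boundary; a small sketch with abbreviated sample lines (not the tool's actual code):

```python
import re

# Two abbreviated UnicodeData.txt-style lines in a single buffer.
text = (
    "0058;LATIN CAPITAL LETTER X;Lu\n"
    "2717;BALLOT X;So\n"
)

# With re.MULTILINE, ^ matches at the start of every line rather than
# only at the start of the string, so per-line anchored searches work
# on a whole file read in one go.
hits = re.findall(r"^0058;[^;]*", text, re.MULTILINE)
print(hits)   # ['0058;LATIN CAPITAL LETTER X']
```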
    ugrep ^x

should show all characters whose description begins with "X". It does not. This was so much easier in awk... ☺