Handling conventionally accepted (nonstandard) readings --> automatic typo correction, like the iOS typo correction suggestion bubble. #2
Comments
This seems to be a feature worth pursuing (and a good differentiation factor "product"-wise too). |
One workaround is to add the following terms. @lukhnos, what do you think?
|
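This is not the same as "showing suggested words when the user types the wrong character or the wrong reading", but it can indeed be solved with custom phrases. :) |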
Perhaps we can introduce some secondary suggestion table for these common typos? |
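A minimal sketch of what such a secondary suggestion table could look like, assuming a plain typo-to-phrase mapping. The names `TYPO_SUGGESTIONS` and `suggest_corrections` are hypothetical and not part of McBopomofo; the two entries are the examples given in this issue.

```python
# Hypothetical secondary table: common typo -> intended phrase.
# A real table would be loaded from a curated data file.
TYPO_SUGGESTIONS: dict[str, str] = {
    "這模": "這麼",
    "因該": "應該",
}


def suggest_corrections(text: str) -> list[tuple[str, str]]:
    """Return (typo, suggestion) pairs for every known typo found in text.

    Nothing is rewritten here; the caller decides how to present the
    suggestions (for example, in a bubble next to the composed text).
    """
    found: list[tuple[str, str]] = []
    for typo, suggestion in TYPO_SUGGESTIONS.items():
        if typo in text:
            found.append((typo, suggestion))
    return found


if __name__ == "__main__":
    print(suggest_corrections("我因該去"))
```

Keeping the table separate from the main dictionary means the regular candidates are untouched and the suggestions stay opt-in.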
@justfont recently shared the repo https://github.com/justfont/The-Write-Right-Font, which includes typo correction suggestions in a spreadsheet. I wrote the following script to generate entries for my personal user dictionary. It yields OK-ish results, IMO.

import csv

SRC = "/usr/share/fcitx5/data/mcbopomofo-data.txt"
REPLACE = "20230401.csv"


def read_mcbopomofo_dict() -> dict[str, list[str]]:
    """Read the mcbopomofo dict and return a dict of phrase -> readings"""
    with open(SRC, encoding="utf-8") as f:
        lines: list[str] = f.readlines()
    # skip the header entries; the data of interest
    # starts with `ㄅ ㄅ -5.79764489`
    lines = lines[510:]
    d: dict[str, list[str]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            # skip blank or malformed lines instead of crashing
            continue
        reading, phrase, _ = parts
        if phrase not in d:
            d[phrase] = [reading]
        else:
            d[phrase].append(reading)
    return d


def read_justfont_csv() -> list[list[str]]:
    """Read the justfont csv and return a list of phrases to replace"""
    with open(REPLACE, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=",", quotechar="|")
        phrases: list[list[str]] = []
        for row in reader:
            # "華語" = Mandarin, "替換單一字" = single-character replacement
            if row[0] == "華語" and row[1] == "替換單一字":
                # normalize typo marker
                row[3] = row[3].replace("‘", "'")
                phrases.append(row[2:4])
            if row[0] == "華語" and row[1] != "替換單一字":
                print(f"[-] only single replacement is supported: {row}")
    return phrases


if __name__ == "__main__":
    mcbopomofo_dict: dict[str, list[str]] = read_mcbopomofo_dict()
    typo_phrases: list[list[str]] = read_justfont_csv()
    output: list[str] = []
    for phrase in typo_phrases:
        # correct phrase, wrong phrase
        correct, wrong = phrase[0], phrase[1]
        if correct not in mcbopomofo_dict:
            print(f"[-] phrase {phrase} not found in mcbopomofo_dict")
            continue
        # otherwise, correct in mcbopomofo_dict
        # ensure marker is present
        if "'" not in wrong:
            # no typo marker, skip
            print(f"[-] typo marker ' not found in {phrase}")
            continue
        marker: int = wrong.find("'")
        if marker <= 0:
            print(f"[-] incorrect typo marker ' position in {phrase}")
            continue
        # the typo character is right before the marker
        idx: int = marker - 1
        typo: str = wrong[idx]
        if typo in mcbopomofo_dict:
            # for all combinations found in the dict
            for correct_reading in mcbopomofo_dict[correct]:
                for typo_reading in mcbopomofo_dict[typo]:
                    # reconstruct the replacement reading: swap in the
                    # typo character's reading at the typo position
                    replacement: list[str] = []
                    for i, s in enumerate(correct_reading.split("-")):
                        if i == idx:
                            replacement.append(typo_reading)
                        else:
                            replacement.append(s)
                    replacement_phrase: str = "-".join(replacement)
                    if correct_reading != replacement_phrase:
                        print(
                            f"[+] {correct} {correct_reading} -> {correct} {replacement_phrase} ({typo} {typo_reading} {idx})"
                        )
                        output.append(f"{correct} {replacement_phrase}")
        else:
            print(f"[-] typo {typo} not found in mcbopomofo_dict")
            continue
    # finally, dump the output
    for o in output:
        print(o)

Output:
There are quite a few limitations with this alternative-user-dict approach, though, such as 破音字 (characters with more than one reading) and (long) phrases that are not present in McBopomofo's dictionary. I believe the more proper way is to bring up the typo suggestions in the selection prompt. |
The same goes for phrases that are available in moedict but not in McBopomofo's dictionary. Ref: https://www.moedict.tw/%E7%97%85%E5%9C%A8%E8%86%8F%E8%82%93 etc.
|
Since we are talking about bopomofo here, this ticket is likely a variant of spell checking. As long as we have a good UI/UX for it (e.g., asking "do you mean this [foo]?" instead of silently correcting "fu" to "foo"), the algorithm behind it could be something like edit distance or Metaphone (https://en.wikipedia.org/wiki/Metaphone). If this is something we want to promote, I can find some time to do it. |
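To make the edit-distance idea concrete, here is a rough sketch, assuming readings are compared as plain strings of bopomofo symbols. `KNOWN_READINGS` and `did_you_mean` are hypothetical names, and the table entries are illustrative rather than taken from McBopomofo's data file. It only proposes candidates and never rewrites the input.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: inserts, deletes and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical reading -> phrase table (illustrative entries only).
KNOWN_READINGS: dict[str, str] = {
    "ㄧㄥ-ㄍㄞ": "應該",
    "ㄓㄜˋ-ㄇㄜ˙": "這麼",
}


def did_you_mean(typed_reading: str, max_distance: int = 1) -> list[str]:
    """Return phrases whose reading is close to, but not equal to, the input.

    Results are ordered by distance, so the nearest reading comes first.
    """
    hits: list[tuple[int, str]] = []
    for reading, phrase in KNOWN_READINGS.items():
        d = edit_distance(typed_reading, reading)
        if 0 < d <= max_distance:
            hits.append((d, phrase))
    return [phrase for _, phrase in sorted(hits)]


if __name__ == "__main__":
    # ㄧㄣ typed where ㄧㄥ was intended: one symbol apart
    print(did_you_mean("ㄧㄣ-ㄍㄞ"))
```

A plain symbol-level distance treats every substitution equally; weighting commonly confused finals such as ㄣ/ㄥ lower than unrelated symbols would be a natural refinement.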
For example, many people who mean to type 「這麼」 end up typing 「這模」, or type 「因該」 when they mean 「應該」.