Replies: 10 comments 4 replies
-
Is it to be done in the library (like the current "one step parse" implementation)?
Should it provide the sentence tokens?
-
So I guess we could have this API: the function would provide the dict handler with the 2D token array, get the additional disjuncts, add them to the current sentence disjuncts, and as a last step call ...
Does this logic need internal library info? If so, we need an API to get this info.
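To make the shape of such a hook concrete, here is a minimal sketch. Everything in it is hypothetical (none of these type or function names exist in the LG API); it only illustrates "hand the dict handler the 2D token array, collect extra expressions back":

```c
/* Hypothetical sketch only -- none of these names exist in the LG API.
 * The parser hands the dictionary handler the 2D token array
 * (alternatives per word position) and collects any extra expressions
 * the handler wants to add before the disjunct-building step runs. */
#include <stddef.h>

typedef struct {
    const char ***tokens;   /* tokens[word][alt], each list NULL-terminated */
    size_t num_words;
} TokenGrid;

/* Supplied by the dictionary backend (e.g. the AtomSpace one).
 * Returns the number of expressions it added, or 0 if nothing new. */
typedef size_t (*extra_expressions_cb)(const TokenGrid *grid,
                                       void *handler_data);
```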
Regarding a tokenizer for the LG library (for text, at least): you can rely on the fact that the dict lists all the possible tokens; tokens that don't appear in it are not relevant to it. A tokenizer can just find all the combinations of these tokens in the input while minimizing the number of unknown words (valid sentences would not have unknown words). What could be of higher quality than that? (Depending on the language, we may need a more sophisticated way of defining token boundaries than the current one.) Tokenizing without a complete LG dict, e.g. for language learning, is another matter entirely. However, I have absolutely no idea how to tokenize audio or vision...
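A rough sketch of that idea, as dynamic programming over the input string: pick the segmentation that minimizes the amount of "unknown" material, where a token is "known" if the dict lists it. The in_dict() stub below stands in for a real dictionary lookup; nothing here is existing LG code.

```c
/* Sketch: segment a string into dict tokens, minimizing unknown bytes.
 * in_dict() is a stub standing in for a real LG dictionary lookup. */
#include <stdbool.h>
#include <string.h>

#define MAXLEN 256   /* longest input handled by this sketch */
#define MAXTOK 32    /* longest token we try */

static bool in_dict(const char *tok) { (void)tok; return false; /* stub */ }

/* Returns the minimal number of "unknown" bytes; cut[i] records where
 * the token ending at position i starts, so the segmentation can be
 * recovered by walking cut[] backwards from position n. */
static int best_segmentation(const char *s, int cut[MAXLEN + 1])
{
    int n = (int)strlen(s);
    if (n > MAXLEN) n = MAXLEN;

    int cost[MAXLEN + 1];
    cost[0] = 0;
    for (int i = 1; i <= n; i++) {
        cost[i] = cost[i - 1] + 1;   /* worst case: byte i-1 is unknown */
        cut[i] = i - 1;
        for (int j = i - 1; j >= 0 && i - j <= MAXTOK; j--) {
            char tok[MAXTOK + 1];
            memcpy(tok, s + j, (size_t)(i - j));
            tok[i - j] = '\0';
            if (in_dict(tok) && cost[j] < cost[i]) {
                cost[i] = cost[j];   /* a known token adds no cost */
                cut[i] = j;
            }
        }
    }
    return cost[n];
}
```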
-
No, it would be better to just have another flag in Parse_Options.
Probably not. But ... I won't know until I'm ready to write the Atomese code to generate those disjuncts on the fly. It would be best if there were some location in the LG code base where you could say "gee, if only I had a disjunct like this, then everything would parse" -- then I could supply that disjunct (and its cost). That would be the "best" way. If there isn't any such "natural" or "elegant" way to do this, then ... there's always a brute-force solution: set a flag in Parse_Options, restart the parse from the very beginning, use that flag to trigger the fetching of extra disjuncts, and then everything "just works". ... but that's brute force. Is there something prettier?
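For what it's worth, the brute-force route would look roughly like this. sentence_parse() is the real LG call; the flag-setting function is purely hypothetical (hence commented out) and only marks where the "fetch extra disjuncts" switch would go:

```c
/* Brute-force sketch: parse, and on failure restart with a (hypothetical)
 * flag that tells the dict backend to fetch/generate extra disjuncts. */
#include <link-grammar/link-includes.h>

int parse_with_extra_disjuncts(Sentence sent, Parse_Options opts)
{
    int n = sentence_parse(sent, opts);
    if (n > 0) return n;   /* got linkages on the first try */

    /* Hypothetical -- no such call exists in the current API: */
    /* parse_options_set_fetch_extra_disjuncts(opts, 1); */

    /* Restart the parse from the very beginning. */
    return sentence_parse(sent, opts);
}
```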
Right. I guess I confused myself: during the learning stage, it's a chicken-and-egg problem. I don't know what the tokens are, so there's a need to explore all of them -- all possible tokenizations. (That's what the "any"/"amy" languages do, btw; thanks for that. The "any" language is a key part of the pipeline, while "amy" remains on the to-do list -- I haven't gotten around to morphology yet.)
I do, but I don't want to write long explanations here. There are long explanations elsewhere. In short, they involve the automated discovery of perceptron-like things that automatically pick out commonly recurring features. The neural-net guys keep talking about "gradient descent"; I'm following a strategy of working with a landscape of "cliffs" of infinite gradient, and trying to correctly position those cliffs.
-
The Chinese quality is worse than I would like it to be, but the Russian and English quality is decent.
It is written in Python, not Java: https://github.com/aigents/pygents/tree/main/pygents
Why crazy? You can think of it as a general-purpose segmentation framework, specialized to letter segmentation in NLP as one particular case. Background: https://arxiv.org/abs/2205.11443 and https://www.youtube.com/watch?v=AV_QQ7fqalw
Those "cliffs" are called "peaks" in the work we are citing here: https://arxiv.org/abs/2205.11443 Yes, the audio is straightforward after the text, but video needs much more careful thought...
-
Hi @akolonin
It's a figure of speech, to introduce and defend a perhaps unusual or unexpected idea. Of course, as with any figure of speech, one must be careful in its use, lest the audience misunderstand it. (I've been reading Twitter for the last hour, and it's amazing how many tweets are poorly worded and ambiguous.) Anyway ...
FYI, LG has a "language" called
There's also a language called "amy", whose intent is to split words into random morpheme-like pieces, so that morphology can be learned the same way. This idea seems to break down almost completely for Hebrew, Amharic, and Semitic languages in general. I do not have any good ideas on how these could be handled. I suppose one could segment letter-by-letter, but I fear this could become computationally very difficult, and there may be traps with non-planarity. (LG can handle non-planar parses through a trick of making them planar, in the same way that non-planar electrical circuits can be drawn on pieces of paper; I have not explored how difficult it would be to learn such non-planar parses. I suspect it might be hard.)
-
Here's a random Chinese segmentation; I have to insert spaces between every hanzi.
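For reference, the space-insertion pre-processing step can be as simple as the following sketch (UTF-8 input assumed, no validation; this is not code from the LG tree):

```c
/* Insert a space after every UTF-8 character (hanzi), so that LG sees
 * each hanzi as a separate token.  Minimal sketch, UTF-8 only, no
 * validation; not code from the LG tree. */
#include <stdio.h>
#include <string.h>

static void space_out_hanzi(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0' && o + 5 < outsz; ) {
        unsigned char c = (unsigned char)in[i];
        /* Number of bytes in this UTF-8 code point. */
        size_t len = (c < 0x80) ? 1 : (c < 0xE0) ? 2 : (c < 0xF0) ? 3 : 4;
        memcpy(out + o, in + i, len);
        o += len;
        out[o++] = ' ';
        i += len;
    }
    out[o] = '\0';
}

int main(void)
{
    char buf[256];
    space_out_hanzi("我爱你", buf, sizeof buf);
    printf("%s\n", buf);   /* prints "我 爱 你 " (note trailing space) */
    return 0;
}
```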
-
Linas, Ben,
> Which reminds me: @akolonin <https://github.com/akolonin> and
> @rvignav <https://github.com/rvignav> have a whiz-bang tokenizer that
> produces high-quality tokenization for Chinese, and they're now looking
> at other languages.

The Chinese quality is worse than I would like it to be, but the Russian
and English quality is decent.

> Their tokenizer is written in Java, so there's no way to just "drop it
> in" and have it work. But still ... the claim is that it's very
> high-precision.

It is written in Python, not Java:
https://github.com/aigents/pygents/tree/main/pygents

> My vague day-dream is as follows: to (re-)implement that algorithm in
> Atomese, and thus have tokenization done there (and somehow plugging
> into LG). But why do such a crazy thing?

Why crazy? You can think of it as a general-purpose segmentation
framework, specialized to letter segmentation in NLP as one particular
case.

Background:
https://arxiv.org/abs/2205.11443
https://www.youtube.com/watch?v=AV_QQ7fqalw
Cheers,
-Anton
On 30/06/2022 00:59, Linas Vepštas wrote:
> API
>
> It would be better to have a distinct API. There might be some
> non-trivial decision-making logic needed to trigger this step. I don't
> know what that might be, yet. So a clean separation seems best.
>
> Should it provide the sentence tokens?
>
> Yes! A 2D array, or a struct with pointers, would be fine for now.
>
> Which reminds me: @akolonin <https://github.com/akolonin> and @rvignav
> <https://github.com/rvignav> have a whiz-bang tokenizer that produces
> high-quality tokenization for Chinese, and they're now looking at
> other languages. Their tokenizer is written in Java, so there's no way
> to just "drop it in" and have it work. But still ... the claim is that
> it's very high-precision.
>
> My vague day-dream is as follows: to (re-)implement that algorithm in
> Atomese, and thus have tokenization done there (and somehow plugging
> into LG). But why do such a crazy thing? The daydream is to understand
> the tokenization algorithm well enough that an Atomese process could
> learn it "from scratch". But why is learning-from-scratch so important?
> Well, consider an audio stream: the goal is to segment the audio
> stream, identify features in it (rising tones, falling tones, other
> features), and convert those into streams of symbols ("alternatives")
> that can be passed into LG. Amazingly, you can do similar things for
> vision as well; I've got details written up elsewhere, in the "learn"
> git repo.
-
@ampli: a question about the dialect support. I'm having trouble making the following work; is it supposed to work? In the English 4.0.dict I added ...
and this loads without error. It will parse ...
... to 4.0.dialect, and that didn't work either. Was this supposed to work, i.e. is this "some minor bug", or was it never meant to work this way, i.e. is it a "major feature request"? The idea I'm shooting for is that the dictionary could contain expressions that have portions with very high dialect costs (e.g. greater than the max disjunct cost). My mental image is of a dictionary that's like a reservoir: if it's too small to yield a parse, I make the dictionary bigger by selectively lowering dialect costs. This would be done at run-time, so that there's no need to reload new expressions. Tracing the code: ...
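Assuming a dialect-setting call along the lines of parse_options_set_dialect() (treat the exact name and signature as an assumption here), the run-time "reservoir" loop would look something like this; "rare-usage" is a made-up dialect name for illustration:

```c
/* Sketch of the "reservoir" idea: parse with the small (strict) dict
 * first; on failure, switch on a dialect that lowers the very high
 * dialect costs, effectively enlarging the dictionary at run-time.
 * parse_options_set_dialect() is assumed; "rare-usage" is made up. */
#include <link-grammar/link-includes.h>

int parse_with_reservoir(Sentence sent, Parse_Options opts)
{
    int n = sentence_parse(sent, opts);
    if (n > 0) return n;          /* the small dictionary was enough */

    parse_options_set_dialect(opts, "rare-usage");
    return sentence_parse(sent, opts);
}
```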
-
There is a bug in the dict reading code, as can be seen in: ...
Note the ... In addition, while I played with dialects, I found two unrelated bugs in link-parser (to be fixed): ...
-
It was a one-line fix (PR soon). However, while testing it I encountered a new (and unrelated) very rare parsing problem when tracon encoding is done, so I started to investigate that too... (I have already found where the problem is.)
Indeed so.
-
I'd like to propose a new dictionary-lookup capability: on parse failure, re-query the dictionary for more disjuncts. (Expressions, actually; everywhere I write "disjunct" below, I really mean "expression".)
In building the AtomSpace-backed dictionary backend, I fell into a trap: during a cleanup and trimming stage, I accidentally trimmed away disjuncts that are needed to obtain valid, low-cost parses, because the trimming/cleanup subsystem is not aware of their importance. I will fix this by explicitly parsing a large number of sentences, keeping stats on the disjuncts that were actually used in those parses, and making sure those disjuncts do not get discarded.
This is an OK and necessary fix, but I'm envisioning a more interesting one. Suppose there's a word that cannot be used in a parse (a skipped word), or a word for which the only disjunct(s) are extremely high-cost. In this case, I'd like the parser to re-query the dictionary, and this time the dictionary could generate a set of plausible disjuncts on the fly.
For example, I have a list of the MI (mutual information) between word pairs. Given a word and its nearby neighbors, I could simply create a disjunct that will link to those words. The cost of that disjunct would just be the sum total of the MI costs. I imagine that in many or most cases, this could be a very good strategy for dealing with parse failures.
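A toy sketch of that construction: build one disjunct with a connector toward each nearby neighbor, and price it by the word-pair MI. The connector naming and the cost = sum of (-MI) convention are assumptions made here for illustration, not anything fixed by LG:

```c
/* Toy sketch: build a disjunct linking a word to its nearby neighbors,
 * with cost taken as the sum of -MI over the word pairs (an assumed
 * convention).  Connector names here are just the neighbor words. */
#include <stdio.h>

typedef struct {
    const char *neighbor;
    double mi;       /* mutual information of (word, neighbor) */
    int on_left;     /* nonzero if the neighbor precedes the word */
} Pair;

/* Writes something like "A- & B+ & C+" into buf; returns the total cost. */
static double make_disjunct(const Pair *pairs, int n, char *buf, size_t sz)
{
    double cost = 0.0;
    size_t used = 0;
    for (int i = 0; i < n; i++) {
        int w = snprintf(buf + used, sz - used, "%s%s%c",
                         (i > 0) ? " & " : "",
                         pairs[i].neighbor,
                         pairs[i].on_left ? '-' : '+');
        if (w < 0 || (size_t)w >= sz - used) break;   /* out of room */
        used += (size_t)w;
        cost += -pairs[i].mi;    /* higher MI => lower cost (assumed) */
    }
    return cost;
}
```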
How could this be implemented? The requirements seem to be: ...
I think that's it ...
Tag @ampli