Replies: 10 comments 4 replies
-
Is it to be done in the library (like the current "one step parse" implementation)?
Should it provide the sentence tokens?
-
So I guess we could have this API: the function would provide the dict handler with the 2D token array, get the additional disjuncts, add them to the current sentence disjuncts, and as a last step call ...
Does this logic need internal library info? If so, we need an API to get this info.
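To make the shape of such a hook concrete, here is a minimal sketch. Everything in it is hypothetical (none of these type or function names exist in the LG API); it only illustrates "hand the dict handler the 2D token array, collect extra expressions back":

```c
/* Hypothetical sketch only -- none of these names exist in the LG API.
 * The parser hands the dictionary handler the 2D token array
 * (alternatives per word position) and collects any extra expressions
 * the handler wants to add before the disjunct-building step runs. */
#include <stddef.h>

typedef struct {
    const char ***tokens;   /* tokens[word][alt], each list NULL-terminated */
    size_t num_words;
} TokenGrid;

/* Supplied by the dictionary backend (e.g. the AtomSpace one).
 * Returns the number of expressions it added, or 0 if nothing new. */
typedef size_t (*extra_expressions_cb)(const TokenGrid *grid,
                                       void *handler_data);
```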
Regarding a tokenizer for the LG library (for text, at least): you can rely on the fact that the dict lists all the possible tokens; tokens that don't appear in it are not relevant to it. A tokenizer can just find all the combinations of these tokens in the input while minimizing the number of unknown words (valid sentences would not have unknown words). What could be of higher quality than that? (Depending on the language, we may need a more sophisticated way of defining token boundaries than the current one.) Tokenizing without a complete LG dict, e.g. for language learning, is another matter entirely. However, I have absolutely no idea how to tokenize audio or vision...
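A rough sketch of that idea, as dynamic programming over the input string: pick the segmentation that minimizes the amount of "unknown" material, where a token is "known" if the dict lists it. The in_dict() stub below stands in for a real dictionary lookup; nothing here is existing LG code.

```c
/* Sketch: segment a string into dict tokens, minimizing unknown bytes.
 * in_dict() is a stub standing in for a real LG dictionary lookup. */
#include <stdbool.h>
#include <string.h>

#define MAXLEN 256   /* longest input handled by this sketch */
#define MAXTOK 32    /* longest token we try */

static bool in_dict(const char *tok) { (void)tok; return false; /* stub */ }

/* Returns the minimal number of "unknown" bytes; cut[i] records where
 * the token ending at position i starts, so the segmentation can be
 * recovered by walking cut[] backwards from position n. */
static int best_segmentation(const char *s, int cut[MAXLEN + 1])
{
    int n = (int)strlen(s);
    if (n > MAXLEN) n = MAXLEN;

    int cost[MAXLEN + 1];
    cost[0] = 0;
    for (int i = 1; i <= n; i++) {
        cost[i] = cost[i - 1] + 1;   /* worst case: byte i-1 is unknown */
        cut[i] = i - 1;
        for (int j = i - 1; j >= 0 && i - j <= MAXTOK; j--) {
            char tok[MAXTOK + 1];
            memcpy(tok, s + j, (size_t)(i - j));
            tok[i - j] = '\0';
            if (in_dict(tok) && cost[j] < cost[i]) {
                cost[i] = cost[j];   /* a known token adds no cost */
                cut[i] = j;
            }
        }
    }
    return cost[n];
}
```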
-
No, it would be better to just have another flag in Parse_Options.
Probably not. But ... I won't know until I'm ready to write the Atomese code to generate those disjuncts on the fly. It would be best if there were some location in the LG code base where you could say "gee, if only I had a disjunct like this, then everything would parse" -- then I could supply that disjunct (and its cost). That would be the "best" way. If there isn't any such "natural" or "elegant" way to do this, then ... there's always a brute-force solution: set a flag in Parse_Options, restart the parse from the very beginning, use that flag to trigger the fetching of extra disjuncts, and then everything "just works". ... but that's brute force. Is there something prettier?
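For what it's worth, the brute-force route would look roughly like this. sentence_parse() is the real LG call; the flag-setting function is purely hypothetical (hence commented out) and only marks where the "fetch extra disjuncts" switch would go:

```c
/* Brute-force sketch: parse, and on failure restart with a (hypothetical)
 * flag that tells the dict backend to fetch/generate extra disjuncts. */
#include <link-grammar/link-includes.h>

int parse_with_extra_disjuncts(Sentence sent, Parse_Options opts)
{
    int n = sentence_parse(sent, opts);
    if (n > 0) return n;   /* got linkages on the first try */

    /* Hypothetical -- no such call exists in the current API: */
    /* parse_options_set_fetch_extra_disjuncts(opts, 1); */

    /* Restart the parse from the very beginning. */
    return sentence_parse(sent, opts);
}
```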
Right. I guess I confused myself: during the learning stage, it's a chicken-and-egg problem. I don't know what the tokens are, so there's a need to explore all of them -- all possible tokenizations. (That's what the "any"/"amy" languages do, btw; thanks for that. The "any" language is a key part of the pipeline, while "amy" remains on the to-do list -- I haven't gotten around to morphology yet.)
I do, but I don't want to write long explanations here. There are long explanations elsewhere. In short, they involve the automated discovery of perceptron-like things that automatically pick out commonly recurring features. The neural-net guys keep talking about "gradient descent"; I'm following a strategy of working with a landscape of "cliffs" of infinite gradient, and trying to correctly position those cliffs.
-
The Chinese quality is worse than I would like it to be, but the Russian and English quality is decent.
It is written in Python, not Java: https://github.com/aigents/pygents/tree/main/pygents
Why crazy? You can think of it as a general-purpose segmentation framework, specialized to letter segmentation in NLP as one particular case. Background: https://arxiv.org/abs/2205.11443 and https://www.youtube.com/watch?v=AV_QQ7fqalw
Those "cliffs" are called "peaks" in the work we are citing here: https://arxiv.org/abs/2205.11443 Yes, the audio is straightforward after the text, but video needs much more careful thought...
-
Hi @akolonin
It's a figure of speech, to introduce and defend a perhaps unusual or unexpected idea. Of course, as with any figure of speech, one must be careful in its use, lest the audience misunderstand it. (I've been reading Twitter for the last hour, and it's amazing how many tweets are poorly worded and ambiguous.) Anyway ...
FYI, LG has a "language" called
There's also a language called "amy", whose intent is to split words into random morpheme-like pieces, so that morphology can be learned the same way. This idea seems to break down almost completely for Hebrew, Amharic, and Semitic languages in general. I do not have any good ideas on how these could be handled. I suppose one could segment letter-by-letter, but I fear this could become computationally very difficult, and there may be traps with non-planarity. (LG can handle non-planar parses through a trick of making them planar, in the same way that non-planar electrical circuits can be drawn on pieces of paper; I have not explored how difficult it would be to learn such non-planar parses. I suspect it might be hard.)
-
Here's a random Chinese segmentation; I have to insert spaces between every hanzi.
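For reference, the space-insertion pre-processing step can be as simple as the following sketch (UTF-8 input assumed, no validation; this is not code from the LG tree):

```c
/* Insert a space after every UTF-8 character (hanzi), so that LG sees
 * each hanzi as a separate token.  Minimal sketch, UTF-8 only, no
 * validation; not code from the LG tree. */
#include <stdio.h>
#include <string.h>

static void space_out_hanzi(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0' && o + 5 < outsz; ) {
        unsigned char c = (unsigned char)in[i];
        /* Number of bytes in this UTF-8 code point. */
        size_t len = (c < 0x80) ? 1 : (c < 0xE0) ? 2 : (c < 0xF0) ? 3 : 4;
        memcpy(out + o, in + i, len);
        o += len;
        out[o++] = ' ';
        i += len;
    }
    out[o] = '\0';
}

int main(void)
{
    char buf[256];
    space_out_hanzi("我爱你", buf, sizeof buf);
    printf("%s\n", buf);   /* prints "我 爱 你 " (note trailing space) */
    return 0;
}
```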
-
Linas, Ben,
> Which reminds me: @akolonin <https://github.com/akolonin> and
> @rvignav <https://github.com/rvignav> have a whiz-bang tokenizer that
> produces high-quality tokenization for Chinese, and they're now looking
> at other languages.

The Chinese quality is worse than I would like it to be, but the Russian
and English quality is decent.

> Their tokenizer is written in Java, so there's no way to just "drop it
> in" and have it work. But still ... the claim is that it's very
> high-precision.

It is written in Python, not Java:
https://github.com/aigents/pygents/tree/main/pygents

> My vague day-dream is as follows: to (re-)implement that algorithm in
> Atomese, and thus have tokenization done there (and somehow plugging
> into LG). But why do such a crazy thing?

Why crazy? You can think of it as a general-purpose segmentation
framework, specialized to letter segmentation in NLP as one particular
case.

Background:
https://arxiv.org/abs/2205.11443
https://www.youtube.com/watch?v=AV_QQ7fqalw
Cheers,
-Anton
On 30/06/2022 00:59, Linas Vepštas wrote:
> API
>
> It would be better to have a distinct API. There might be some
> non-trivial decision-making logic needed to trigger this step. I don't
> know what that might be, yet. So a clean separation seems best.
>
> Should it provide the sentence tokens?
>
> Yes! A 2D array, or a struct with pointers, would be fine for now.
>
> Which reminds me: @akolonin <https://github.com/akolonin> and @rvignav
> <https://github.com/rvignav> have a whiz-bang tokenizer that produces
> high-quality tokenization for Chinese, and they're now looking at
> other languages. Their tokenizer is written in Java, so there's no way
> to just "drop it in" and have it work. But still ... the claim is that
> it's very high-precision.
>
> My vague day-dream is as follows: to (re-)implement that algorithm in
> Atomese, and thus have tokenization done there (and somehow plugging
> into LG). But why do such a crazy thing? The daydream is to understand
> the tokenization algorithm well enough that an Atomese process could
> learn it "from scratch". But why is learning-from-scratch so important?
> Well, consider an audio stream: the goal is to segment the audio
> stream, identify features in it (rising tones, falling tones, other
> features), and convert those into streams of symbols ("alternatives")
> that can be passed into LG. Amazingly, you can do similar things for
> vision as well; I've got details written up elsewhere, in the "learn"
> git repo.
-
@ampli: a question about the dialect support. I'm having trouble making the following work; is it supposed to work? In the English 4.0.dict I added ...
and this loads without error. It will parse ...
... to 4.0.dialect, and that didn't work either. Was this supposed to work, i.e. is this "some minor bug", or was it never meant to work this way, i.e. is it a "major feature request"? The idea I'm shooting for is that the dictionary could contain expressions that have portions with very high dialect costs (e.g. greater than the max disjunct cost). My mental image is of a dictionary that's like a reservoir: if it's too small to yield a parse, I make the dictionary bigger by selectively lowering dialect costs. This would be done at run-time, so that there's no need to reload new expressions. Tracing the code: ...
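Assuming a dialect-setting call along the lines of parse_options_set_dialect() (treat the exact name and signature as an assumption here), the run-time "reservoir" loop would look something like this; "rare-usage" is a made-up dialect name for illustration:

```c
/* Sketch of the "reservoir" idea: parse with the small (strict) dict
 * first; on failure, switch on a dialect that lowers the very high
 * dialect costs, effectively enlarging the dictionary at run-time.
 * parse_options_set_dialect() is assumed; "rare-usage" is made up. */
#include <link-grammar/link-includes.h>

int parse_with_reservoir(Sentence sent, Parse_Options opts)
{
    int n = sentence_parse(sent, opts);
    if (n > 0) return n;          /* the small dictionary was enough */

    parse_options_set_dialect(opts, "rare-usage");
    return sentence_parse(sent, opts);
}
```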
-
There is a bug in the dict reading code, as can be seen in: ...
Note the ... In addition, while I played with dialects, I found two unrelated bugs in link-parser (to be fixed): ...
-
It was a one-line fix (PR soon). However, while testing it I encountered a new (and unrelated) very rare parsing problem when tracon encoding is done, so I started to investigate that too... (I have already found where the problem is.)
Indeed so.
-
I'd like to propose a new dictionary-lookup capability: on parse failure, re-query the dictionary for more disjuncts. (Expressions, actually; everywhere I write "disjunct" below, I really mean "expression".)
In building the AtomSpace-backed dictionary backend, I fell into a trap: during a cleanup and trimming stage, I accidentally trimmed away disjuncts that are needed to obtain valid, low-cost parses, because the trimming/cleanup subsystem is not aware of their importance. I will fix this by explicitly parsing a large number of sentences, keeping stats on the disjuncts that were actually used in those parses, and making sure those disjuncts do not get discarded.
This is an OK and necessary fix, but I'm envisioning a more interesting one. Suppose there's a word that cannot be used in a parse (a skipped word), or a word for which the only disjunct(s) are extremely high-cost. In this case, I'd like the parser to re-query the dictionary, and this time the dictionary could generate a set of plausible disjuncts on the fly.
For example, I have a list of the MI (mutual information) between word pairs. Given a word and its nearby neighbors, I could simply create a disjunct that will link to those words. The cost of that disjunct would just be the sum total of the MI costs. I imagine that in many or most cases, this could be a very good strategy for dealing with parse failures.
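A toy sketch of that construction: build one disjunct with a connector toward each nearby neighbor, and price it by the word-pair MI. The connector naming and the cost = sum of (-MI) convention are assumptions made here for illustration, not anything fixed by LG:

```c
/* Toy sketch: build a disjunct linking a word to its nearby neighbors,
 * with cost taken as the sum of -MI over the word pairs (an assumed
 * convention).  Connector names here are just the neighbor words. */
#include <stdio.h>

typedef struct {
    const char *neighbor;
    double mi;       /* mutual information of (word, neighbor) */
    int on_left;     /* nonzero if the neighbor precedes the word */
} Pair;

/* Writes something like "A- & B+ & C+" into buf; returns the total cost. */
static double make_disjunct(const Pair *pairs, int n, char *buf, size_t sz)
{
    double cost = 0.0;
    size_t used = 0;
    for (int i = 0; i < n; i++) {
        int w = snprintf(buf + used, sz - used, "%s%s%c",
                         (i > 0) ? " & " : "",
                         pairs[i].neighbor,
                         pairs[i].on_left ? '-' : '+');
        if (w < 0 || (size_t)w >= sz - used) break;   /* out of room */
        used += (size_t)w;
        cost += -pairs[i].mi;    /* higher MI => lower cost (assumed) */
    }
    return cost;
}
```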
How could this be implemented? The requirements seem to be: ...
I think that's it ...
Tag @ampli