Hard Monotonic Transducer #165
My only thought is that I am most excited about variant 2. I thought the A&G thing was outmoded also, but it's harmless if you want to do it later.
Yeah, this sounds good. I have a partial implementation of 3) in a fork somewhere from a year ago that I never finished because I got distracted :D. Will be great to see them in here. IIRC, 2) and 3) should be small variations on the implementation of 1), right? I.e., in the paper I think 2) is basically 1), but they just enforce the monotonicity constraint in the mask.
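Something like the following is how I picture the mask change (just a toy PyTorch sketch with made-up names, not library code; it assumes you are tracking a single previously attended position, e.g. under greedy hard attention):

```python
import torch


def monotonic_mask(scores: torch.Tensor, prev_index: torch.Tensor) -> torch.Tensor:
    """Masks out (via -inf) attention to positions left of the previous alignment.

    scores: (batch, src_len) unnormalized attention scores for the current target step.
    prev_index: (batch,) source position attended at the previous target step.
    """
    src_len = scores.size(1)
    positions = torch.arange(src_len, device=scores.device).unsqueeze(0)  # (1, src_len)
    allowed = positions >= prev_index.unsqueeze(1)                        # (batch, src_len)
    return scores.masked_fill(~allowed, float("-inf"))
```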
Agree
Not sure I follow: why is this not an issue with existing architectures? I am pretty sure I have a trick for this in eval, where I think I PAD the shorter one to be the length of the other. EDIT: I just realized the issue (also, my trick does not actually solve the loss issue). We normally do teacher forcing, so it is a non-issue...
I cannot remember. This would imply that all of the constraints are strictly for training, and that at inference a regular old soft attention distribution is used?
Agree, though if it's low effort, I am always a fan of having more baselines available. However, I feel like the trick for this model is fairly different from what our codebase typically does, so it might be more effort to implement than it seems. On the topic of baselines, I think Wu and Cotterell also compared to an RL baseline that samples alignments and optimizes with REINFORCE. We could also add that at some point :D. It is probably also available in their library. Both of those are very low priority, though.
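For reference, that REINFORCE variant is roughly the following (a placeholder sketch, not their implementation; the reward and baseline tensors are assumptions, e.g. negative edit distance and a running mean):

```python
import torch


def reinforce_alignment_loss(
    attn_logits: torch.Tensor,  # (batch, tgt_len, src_len) alignment logits
    reward: torch.Tensor,       # (batch,) e.g. negative edit distance of the decoded string
    baseline: torch.Tensor,     # (batch,) e.g. a running mean of recent rewards
) -> torch.Tensor:
    """Score-function estimator over sampled hard alignments."""
    dist = torch.distributions.Categorical(logits=attn_logits)
    sample = dist.sample()                       # (batch, tgt_len) sampled alignment
    log_prob = dist.log_prob(sample).sum(dim=1)  # (batch,)
    # The advantage is treated as a constant with respect to the policy parameters.
    return -((reward - baseline).detach() * log_prob).mean()
```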
Yeah, it's kinda annoying, right? I'm tempted to just repeat the last character up to the target length, but that's not going to be accurate.
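For concreteness, the two length-matching tricks being discussed look something like this (toy placeholder code, not what the library does):

```python
import torch

PAD_IDX = 0  # assumed padding index


def pad_to_length(pred: torch.Tensor, length: int) -> torch.Tensor:
    """Right-pads a 1-D prediction tensor with PAD up to `length`."""
    assert pred.size(0) <= length
    filler = torch.full((length - pred.size(0),), PAD_IDX, dtype=pred.dtype)
    return torch.cat([pred, filler])


def repeat_last_to_length(pred: torch.Tensor, length: int) -> torch.Tensor:
    """Repeats the final predicted symbol up to `length` (the idea above)."""
    assert pred.size(0) <= length
    filler = pred[-1].repeat(length - pred.size(0))
    return torch.cat([pred, filler])
```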
Probably need the constraint for inference too, unless the model just learns to zero out prior attention. What I mean is, there's a bit of duplicate work between the two (technically the outputs are taking an attention over all potential alignments), but I need to sit down for a moment to figure out how far that can be stretched without violating some assumptions.
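To spell out what I mean by an attention over all potential alignments: in the zeroth-order case the per-step likelihood is an explicit sum over source positions rather than a single expected context vector. A minimal sketch (PyTorch assumed, tensor names mine):

```python
import torch


def hard_attention_step_loglik(
    log_align: torch.Tensor,  # (batch, src_len): log p(a_t = j | ...)
    log_emit: torch.Tensor,   # (batch, src_len): log p(y_t | a_t = j, ...)
) -> torch.Tensor:
    """Exactly marginalizes the alignment at one target step."""
    # log sum_j exp(log p(a_t = j) + log p(y_t | a_t = j)), per batch element.
    return torch.logsumexp(log_align + log_emit, dim=1)  # (batch,)
```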
Yeah, it's not a major model anymore, but I think it's handy for showing the power of constraints in word-level tasks. A general focus of the library seems to be how monotonic and attention assumptions improve transduction tasks, so it may be worth including for posterity.
My RL is weak, but I believe the Edit Action Transducer employs a version of REINFORCE. (Or DAgger. It's Daumé-adjacent is what I'm saying.) So while low-priority, it may play into a general framework of student-teacher approaches to include in here (#77). It'll take a few weekends for me to parse out, but I really like the idea that any model can support a drop-in expert/policy advisor for training/exploration.
(Adding to the issues board for documentation; the PR will be out over the week.)
Wu and Cotterell's papers on strong alignment seem right up our alley for the library. There should be an implementation of https://aclanthology.org/P19-1148/, particularly the monotonic cases.
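For reference, the monotonic case boils down to a forward-algorithm dynamic program over alignments. A rough log-space sketch (my own simplification, with an assumed uniform initial alignment; not the paper's exact parameterization or the PR's code):

```python
import math

import torch


def monotonic_forward_loglik(
    log_trans: torch.Tensor,  # (batch, tgt_len, src_len, src_len): log p(a_t = j | a_{t-1} = i)
    log_emit: torch.Tensor,   # (batch, tgt_len, src_len): log p(y_t | a_t = j)
) -> torch.Tensor:
    """Marginalizes over all monotonic alignments in log space."""
    batch, tgt_len, src_len = log_emit.shape
    # Monotonicity: only allow transitions i -> j with i <= j.
    mono = torch.triu(
        torch.ones(src_len, src_len, dtype=torch.bool, device=log_emit.device)
    )
    # Initial step: uniform distribution over the first alignment (an assumption here).
    log_alpha = log_emit[:, 0] - math.log(src_len)
    for t in range(1, tgt_len):
        trans = log_trans[:, t].masked_fill(~mono, float("-inf"))
        # logsumexp over previous positions i, then add the emission at j.
        log_alpha = torch.logsumexp(log_alpha.unsqueeze(2) + trans, dim=1) + log_emit[:, t]
    return torch.logsumexp(log_alpha, dim=1)  # (batch,): log p(y | x)
```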
I currently have a version going that allows the following variants:
Things to add to the PR once it's up (this makes sense to me now; it will make sense with the accompanying PR).
@kylebgorman @Adamits Any additional preferences during development? I've been going back and forth on adding the Aharoni and Goldberg transducer too, just for completeness. (This and the Swiss transducer both supersede that one.)