Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: checkpoint on preserve CR issue #1244

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mbeckerle
Copy link
Contributor

@mbeckerle mbeckerle commented May 20, 2024

This PR is the reawakening of this issue, which a bunch of work was done on, but the PR never got completed:

See: #912

DAFFODIL-1559

checkpoint - possibly unnecessary files
@stevedlawrence
Copy link
Member

Does Daffodil ever really need the CRL -> LF canonicalization? If CR is part of an actual field and is not used as a delimiter or something, it seems we should never do this lossy mapping. Maybe we should just always disable CR -> LF canonicalization and convert CR to PUA. It makes infosets messier, but enures we don't lose data.

@mbeckerle
Copy link
Contributor Author

Does Daffodil ever really need the CRL -> LF canonicalization? If CR is part of an actual field and is not used as a delimiter or something, it seems we should never do this lossy mapping. Maybe we should just always disable CR -> LF canonicalization and convert CR to PUA. It makes infosets messier, but enures we don't lose data.

I agree. We should not, in my view, ever convert CR to LF. We should preserve it. I think we could do better than to convert it to the PUA though.

See last comment on DAFFODIL-1559. I think we could map CRLF to U+202B + LF, and isolated CR to NEL(U+2028), and back for unparsing. This would be more palatable for such data as the data would not look polluted by wierd glyphs that get assigned to the PUA characters.

No matter what we need to supply several variations here, so we need to settle on the scheme for how users parameterize the InfosetOutputters and Inputters needs to be agreed upon. The default scheme needs to be exactly what we do today. Maybe Daffodil v4.0.0 can change the default behavior for XML mappings.

@stevedlawrence
Copy link
Member

The default scheme needs to be exactly what we do today.

I thought mapping CR to LF was a regression that should be fixed? Or have we always done this mapping?

we need to settle on the scheme for how users parameterize the InfosetOutputters and Inputters needs to be agreed upon

Agreed. Interesting idea to have more complex but more human friendly mappings. It does potentially cause issues with validation since changing the mapping requires different validation pattern facets and thus a different schema.

@mbeckerle
Copy link
Contributor Author

I thought mapping CR to LF was a regression that should be fixed? Or have we always done this mapping?

Actually, I am not sure about this. We need to look at the Daffodil release before Jan 2023 and test it there.

@stevedlawrence
Copy link
Member

I thought mapping CR to LF was a regression that should be fixed? Or have we always done this mapping?

Actually, I am not sure about this. We need to look at the Daffodil release before Jan 2023 and test it there.

I tested as far back as 3.1.0 and all of them did the CR -> LF mapping. So this is not a regression. Maybe not the preferred behavior, but probably a change we wait for 4.0.0 to change.

@mbeckerle
Copy link
Contributor Author

I tested as far back as 3.1.0 and all of them did the CR -> LF mapping. So this is not a regression. Maybe not the preferred behavior, but probably a change we wait for 4.0.0 to change.

Thanks for doing that. Fixing this functionally is very easy. We will need a switch to control it, and putting in a switch/param to enable/disable/choose the behavior is the stumbling block.

I believe you suggested we do this with a new properties-file method, as opposed to extending the current tunables system to include this behavior.

If you can write that up for the dev list then it's probably pretty easy to agree on and we can implement it with this as the driving use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants