-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Preserve attributes on HTML paragraphs #10850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. As noted I didn't understand the special treatment of "align".
Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> markdown conversions and we'd need to weight that.
src/Text/Pandoc/Readers/HTML.hs
Outdated
pParaWithWrapper :: PandocMonad m => Attr -> TagParser m Blocks | ||
pParaWithWrapper (ident, classes, kvs) = do | ||
guardEnabled Ext_native_divs -- Ensure native_divs is enabled for this behavior | ||
pInhalt <- trimInlines <$> pInTags "p" inline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually use the naming convention of beginning parsers with p
; so it would be better to use something like inhalt
instead for this name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I've renamed it to contents
to follow your p
prefix convention for parsers. Thanks for catching that!
src/Text/Pandoc/Readers/HTML.hs
Outdated
let otherKVs = filter (\(k,_) -> k /= "align") kvs | ||
let validAlignKV = case alignValue of | ||
Just algn | algn `elem` ["left","right","center","justify"] -> [("align", algn)] | ||
_ -> [] | ||
let finalKVs = wrapperAttr : (validAlignKV ++ otherKVs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for treating the "align" attribute specially in this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below
src/Text/Pandoc/Readers/HTML.hs
Outdated
return (case alignValue of | ||
Just algn | algn `elem` ["left","right","center","justify"] -> | ||
B.divWith ("", [], [("align", algn)]) paraBlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the motivation for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below
d9f2474
to
469a25e
Compare
Thanks for catching that! You're right about the align logic - that was actually an idea I had initially discarded, but it seems to have somehow made its way into the pull request anyway. I'll remove that special handling. Regarding your question about paragraph attributes in the wild - you're absolutely right to be concerned. Paragraphs with classes and other attributes are extremely common in modern HTML, especially with:
A configurable approach would be ideal here. We could add a command-line option or extension setting that controls attribute preservation behavior:
This would allow users to choose between clean output (which most expect from HTML→Markdown conversion) and technical preservation (needed for specific use cases like documentation sites requiring anchor links). The majority of conversions prioritize readability over technical fidelity, so defaulting to clean output while providing flexibility makes the most sense. |
Hi @jgm, It's been about 3 weeks since my last message and I haven't heard back yet. I wanted to follow up to see how you'd like me to proceed with this pull request. I've already removed the special handling for the align attribute as discussed. The main question remaining is how to handle the broader issue of attribute preservation, particularly given how common classes and other attributes are on paragraphs in real-world HTML. Should I:
I'm happy to iterate on this or adjust the direction - just let me know what works best for the project. Thanks! |
Great work! My two cents on the matter: a new extension would probably be best, and the default should be to remove the "clutter". |
cf1207c
to
c643637
Compare
- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`. - HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag. - Add tests for HTML paragraph attribute roundtrip. - Update EPUB golden files to reflect new AST for attributed paragraphs.
Split pPara into pParaWithWrapper and pParaSimple helpers. Ensure pParaWithWrapper correctly discards invalid align attributes. Add specific tests for align attribute in HTML reader and writer.
- Update MANUAL.txt to reflect `native_divs` wrapping of attributed `<p>` tags.
- Add test cases for HTML to native, native to HTML, HTML to HTML, and HTML to HTML5 conversions - Verify preservation of id, class, and data attributes on p tags
- Treat align attribute like any other attribute - Always wrap paragraphs with attributes in divs (including align-only) - Remove validation logic for align values - Update tests to reflect consistent wrapper behavior
- Replace monolithic test file with format-specific test suite (32 new test files) - Standardize paragraph attribute processing in 31 writer modules - Add paragraph_attributes extension to control attribute preservation behavior - Update shared writer utilities for consistent attribute handling - Modify HTML tests to reflect new attribute processing logic Fixes jgm#10768
Hi @jgm, I've completed the implementation of the Implementation Summary:
Key Features:
The implementation is ready for review. All tests pass and the extension works as intended for the use cases we discussed (EPUB index processing, semantic markup preservation, etc.). Thanks for your guidance on the architecture - the extension-based approach works really well! |
Is there a reason not to just always unwrap these? The other writers are doing that, and I don't see what's gained by throwing away this information. |
Update: in response to #11014 I added (unconditional) unwrapping of wrapper Divs to the HTML writer. In addition, thinking about this issue made me realize that |
Hi @jgm! Thanks for the update and for adding the unconditional unwrapping to the HTML writer in response to #11014! You're absolutely right that Interestingly, I've actually already implemented unconditional unwrapping of wrapper Divs across all writers in this PR (including the HTML writer 😆) - so we're aligned on that approach! The current implementation unwraps wrapper Divs to plain elements regardless of any extension settings. Given that the unconditional unwrapping is already in place, would it make sense to complete this PR as-is and then implement the broader
The current implementation actually provides a solid foundation for What's your preference - finish this focused PR and follow up with the broader |
No, I think it's better not to introduce |
This PR implements the preservation of attributes on HTML paragraphs, addressing issue #10768.
HTML reader now wraps attributed <p> tags in a Div with wrapper="1".
HTML writer unwraps these Divs back to attributed <p> tags.
This approach is similar to the Djot reader/writer as discussed in #10768, ensuring that semantic information in HTML attributes on paragraphs is preserved during conversion.