Preserve attributes on HTML paragraphs #10850

Valgard · 2025-05-17T18:09:59Z

This PR implements the preservation of attributes on HTML paragraphs, addressing issue #10768.

HTML reader now wraps attributed tags in a Div with wrapper="1".
HTML writer unwraps these Divs back to attributed tags.

This approach is similar to the Djot reader/writer as discussed in #10768, ensuring that semantic information in HTML attributes on paragraphs is preserved during conversion.

jgm

Thanks for this. As noted I didn't understand the special treatment of "align".

Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> markdown conversions and we'd need to weight that.

jgm · 2025-05-28T20:26:41Z

src/Text/Pandoc/Readers/HTML.hs

+pParaWithWrapper :: PandocMonad m => Attr -> TagParser m Blocks
+pParaWithWrapper (ident, classes, kvs) = do
+  guardEnabled Ext_native_divs -- Ensure native_divs is enabled for this behavior
+  pInhalt <- trimInlines <$> pInTags "p" inline


I usually use the naming convention of beginning parsers with p; so it would be better to use something like inhalt instead for this name.

Good point! I've renamed it to contents to follow your p prefix convention for parsers. Thanks for catching that!

jgm · 2025-05-28T20:28:10Z

src/Text/Pandoc/Readers/HTML.hs

+    let otherKVs = filter (\(k,_) -> k /= "align") kvs
+    let validAlignKV = case alignValue of
+                         Just algn | algn `elem` ["left","right","center","justify"] -> [("align", algn)]
+                         _ -> []
+    let finalKVs = wrapperAttr : (validAlignKV ++ otherKVs)


What is the motivation for treating the "align" attribute specially in this way?

See my comment below

jgm · 2025-05-28T20:29:58Z

src/Text/Pandoc/Readers/HTML.hs

+    return (case alignValue of
+              Just algn | algn `elem` ["left","right","center","justify"] ->
+                            B.divWith ("", [], [("align", algn)]) paraBlock


I don't understand the motivation for this.

See my comment below

Valgard · 2025-06-08T22:20:42Z

Thanks for this. As noted I didn't understand the special treatment of "align".

Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> markdown conversions and we'd need to weight that.

Thanks for catching that! You're right about the align logic - that was actually an idea I had initially discarded, but it seems to have somehow made its way into the pull request anyway. I'll remove that special handling.

Regarding your question about paragraph attributes in the wild - you're absolutely right to be concerned. Paragraphs with classes and other attributes are extremely common in modern HTML, especially with:

CSS frameworks (Bootstrap's text-center, lead, text-muted)
Utility-first frameworks (Tailwind's text-lg, mb-4, text-gray-600)
CMS-generated content (WordPress, Drupal automatically add classes)
JavaScript hooks (js-expandable, track-click)
Semantic styling (introduction, disclaimer, highlight)

A configurable approach would be ideal here. We could add a command-line option or extension setting that controls attribute preservation behavior:

Default mode: Strip most attributes for clean, readable Markdown
Preserve mode: Keep essential attributes (IDs for anchor links, alt text for images)
Full preservation: Maintain all attributes for round-trip conversion

This would allow users to choose between clean output (which most expect from HTML→Markdown conversion) and technical preservation (needed for specific use cases like documentation sites requiring anchor links). The majority of conversions prioritize readability over technical fidelity, so defaulting to clean output while providing flexibility makes the most sense.

Valgard · 2025-07-02T20:04:10Z

Hi @jgm,

It's been about 3 weeks since my last message and I haven't heard back yet. I wanted to follow up to see how you'd like me to proceed with this pull request.

I've already removed the special handling for the align attribute as discussed. The main question remaining is how to handle the broader issue of attribute preservation, particularly given how common classes and other attributes are on paragraphs in real-world HTML.

Should I:

Implement a configurable approach for attribute handling as we discussed?
Keep the current behavior and accept that some HTML→Markdown conversions might be cluttered?
Take a different approach entirely?

I'm happy to iterate on this or adjust the direction - just let me know what works best for the project.

Thanks!

tarleb · 2025-07-03T07:37:13Z

Great work!

My two cents on the matter: a new extension would probably be best, and the default should be to remove the "clutter".

- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`. - HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag. - Add tests for HTML paragraph attribute roundtrip. - Update EPUB golden files to reflect new AST for attributed paragraphs.

Split pPara into pParaWithWrapper and pParaSimple helpers. Ensure pParaWithWrapper correctly discards invalid align attributes. Add specific tests for align attribute in HTML reader and writer.

- Update MANUAL.txt to reflect `native_divs` wrapping of attributed `` tags.

- Add test cases for HTML to native, native to HTML, HTML to HTML, and HTML to HTML5 conversions - Verify preservation of id, class, and data attributes on p tags

- Treat align attribute like any other attribute - Always wrap paragraphs with attributes in divs (including align-only) - Remove validation logic for align values - Update tests to reflect consistent wrapper behavior

- Replace monolithic test file with format-specific test suite (32 new test files) - Standardize paragraph attribute processing in 31 writer modules - Add paragraph_attributes extension to control attribute preservation behavior - Update shared writer utilities for consistent attribute handling - Modify HTML tests to reflect new attribute processing logic Fixes jgm#10768

Valgard · 2025-07-31T17:13:42Z

Hi @jgm,

I've completed the implementation of the paragraph_attributes extension as discussed. The solution follows the approach we agreed upon:

Implementation Summary:

Extension only available for HTML output format (-t html+paragraph_attributes)
HTML reader always creates wrapper Divs for paragraphs with attributes
HTML writer unwraps wrapper Divs to  tags when extension is enabled, strips attributes when disabled
All other writers unwrap wrapper Divs to plain paragraphs (attributes always discarded)
Native writer preserves complete AST unchanged

Key Features:

Clean default behavior: attributes are stripped without the extension
Opt-in functionality: users enable extension only when needed
No attribute filtering: all attributes preserved when extension is active
Consistent wrapper Div handling across all writers
Full test coverage with no regressions

The implementation is ready for review. All tests pass and the extension works as intended for the use cases we discussed (EPUB index processing, semantic markup preservation, etc.).

Thanks for your guidance on the architecture - the extension-based approach works really well!

jgm · 2025-08-01T19:18:31Z

HTML writer unwraps wrapper Divs to
tags when extension is enabled, strips attributes when disabled

Is there a reason not to just always unwrap these? The other writers are doing that, and I don't see what's gained by throwing away this information.

jgm · 2025-08-02T18:38:43Z

Update: in response to #11014 I added (unconditional) unwrapping of wrapper Divs to the HTML writer. In addition, thinking about this issue made me realize that paragraph_attributes is probably too narrowly focused. What about e.g. attributes on a ul element? I think a better approach would be an extension wrapper_divs that causes the HTML reader to create wrapper divs whenever an element contains attributes that would otherwise be ignored because there is no space for attributes in the corresponding AST constructor. Thoughts?

Valgard · 2025-08-02T21:45:59Z

Hi @jgm!

Thanks for the update and for adding the unconditional unwrapping to the HTML writer in response to #11014!

You're absolutely right that paragraph_attributes is too narrowly focused. The wrapper_divs extension approach is much more elegant and would handle <ul>, <ol>, <li>, <blockquote>, and other elements that face the same attribute preservation issue.

Interestingly, I've actually already implemented unconditional unwrapping of wrapper Divs across all writers in this PR (including the HTML writer 😆) - so we're aligned on that approach! The current implementation unwraps wrapper Divs to plain elements regardless of any extension settings.

Given that the unconditional unwrapping is already in place, would it make sense to complete this PR as-is and then implement the broader wrapper_divs extension as a follow-up? My thinking:

This PR demonstrates the pattern working across all writers with unconditional unwrapping
The broader scope (all block elements + potential consolidation of existing attribute extensions) might be better suited for a separate, focused PR
Incremental progress - we get immediate value for paragraph use cases while planning the comprehensive solution

The current implementation actually provides a solid foundation for wrapper_divs since the unwrapping mechanism is already proven to work consistently across all writers.

What's your preference - finish this focused PR and follow up with the broader wrapper_divs extension, or expand this PR to cover the full scope?

jgm · 2025-08-03T00:57:17Z

No, I think it's better not to introduce paragraph_attributes if it is just going to be obsoleted by wrapper_divs. Sorry to introduce this wrinkle into your PR!

Valgard force-pushed the main branch from 9cc9389 to f089a81 Compare May 17, 2025 18:13

jgm reviewed May 28, 2025

View reviewed changes

Valgard force-pushed the main branch 2 times, most recently from d9f2474 to 469a25e Compare June 8, 2025 22:20

Valgard force-pushed the main branch 6 times, most recently from cf1207c to c643637 Compare July 30, 2025 03:50

Valgard added 5 commits July 31, 2025 18:55

refactor(HTML): Improve pPara and align handling

ec0c62b

Split pPara into pParaWithWrapper and pParaSimple helpers. Ensure pParaWithWrapper correctly discards invalid align attributes. Add specific tests for align attribute in HTML reader and writer.

docs(HTML): Document native_divs behavior for attributed p tags

377dfbb

- Update MANUAL.txt to reflect `native_divs` wrapping of attributed `` tags.

test(HTML): add command tests for attributed p tags

f4b77f0

- Add test cases for HTML to native, native to HTML, HTML to HTML, and HTML to HTML5 conversions - Verify preservation of id, class, and data attributes on p tags

fix: remove special handling of align attribute in HTML paragraphs

9e65e19

- Treat align attribute like any other attribute - Always wrap paragraphs with attributes in divs (including align-only) - Remove validation logic for align values - Update tests to reflect consistent wrapper behavior

Valgard force-pushed the main branch from c643637 to 6672b70 Compare July 31, 2025 16:56

Valgard force-pushed the main branch from 6672b70 to 6a79d05 Compare July 31, 2025 16:59

Uh oh!

Preserve attributes on HTML paragraphs #10850

Are you sure you want to change the base?

Preserve attributes on HTML paragraphs #10850

Uh oh!

Conversation

Valgard commented May 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgm left a comment

Choose a reason for hiding this comment

Uh oh!

jgm May 28, 2025

Choose a reason for hiding this comment

Uh oh!

Valgard Jun 8, 2025

Choose a reason for hiding this comment

Uh oh!

jgm May 28, 2025

Choose a reason for hiding this comment

Uh oh!

Valgard Jun 8, 2025

Choose a reason for hiding this comment

Uh oh!

jgm May 28, 2025

Choose a reason for hiding this comment

Uh oh!

Valgard Jun 8, 2025

Choose a reason for hiding this comment

Uh oh!

Valgard commented Jun 8, 2025

Uh oh!

Valgard commented Jul 2, 2025

Uh oh!

tarleb commented Jul 3, 2025

Uh oh!

Valgard commented Jul 31, 2025

Uh oh!

jgm commented Aug 1, 2025

Uh oh!

jgm commented Aug 2, 2025

Uh oh!

Valgard commented Aug 2, 2025

Uh oh!

jgm commented Aug 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Valgard commented May 17, 2025 •

edited

Loading