Skip to content

Conversation

Valgard
Copy link

@Valgard Valgard commented May 17, 2025

This PR implements the preservation of attributes on HTML paragraphs, addressing issue #10768.

HTML reader now wraps attributed <p> tags in a Div with wrapper="1".
HTML writer unwraps these Divs back to attributed <p> tags.

This approach is similar to the Djot reader/writer as discussed in #10768, ensuring that semantic information in HTML attributes on paragraphs is preserved during conversion.

Copy link
Owner

@jgm jgm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. As noted I didn't understand the special treatment of "align".

Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> markdown conversions and we'd need to weight that.

pParaWithWrapper :: PandocMonad m => Attr -> TagParser m Blocks
pParaWithWrapper (ident, classes, kvs) = do
guardEnabled Ext_native_divs -- Ensure native_divs is enabled for this behavior
pInhalt <- trimInlines <$> pInTags "p" inline
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually use the naming convention of beginning parsers with p; so it would be better to use something like inhalt instead for this name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I've renamed it to contents to follow your p prefix convention for parsers. Thanks for catching that!

Comment on lines 636 to 640
let otherKVs = filter (\(k,_) -> k /= "align") kvs
let validAlignKV = case alignValue of
Just algn | algn `elem` ["left","right","center","justify"] -> [("align", algn)]
_ -> []
let finalKVs = wrapperAttr : (validAlignKV ++ otherKVs)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for treating the "align" attribute specially in this way?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below

Comment on lines 652 to 654
return (case alignValue of
Just algn | algn `elem` ["left","right","center","justify"] ->
B.divWith ("", [], [("align", algn)]) paraBlock
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the motivation for this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below

@Valgard Valgard force-pushed the main branch 2 times, most recently from d9f2474 to 469a25e Compare June 8, 2025 22:20
@Valgard
Copy link
Author

Valgard commented Jun 8, 2025

Thanks for this. As noted I didn't understand the special treatment of "align".

Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> markdown conversions and we'd need to weight that.

Thanks for catching that! You're right about the align logic - that was actually an idea I had initially discarded, but it seems to have somehow made its way into the pull request anyway. I'll remove that special handling.

Regarding your question about paragraph attributes in the wild - you're absolutely right to be concerned. Paragraphs with classes and other attributes are extremely common in modern HTML, especially with:

  • CSS frameworks (Bootstrap's text-center, lead, text-muted)
  • Utility-first frameworks (Tailwind's text-lg, mb-4, text-gray-600)
  • CMS-generated content (WordPress, Drupal automatically add classes)
  • JavaScript hooks (js-expandable, track-click)
  • Semantic styling (introduction, disclaimer, highlight)

A configurable approach would be ideal here. We could add a command-line option or extension setting that controls attribute preservation behavior:

  • Default mode: Strip most attributes for clean, readable Markdown
  • Preserve mode: Keep essential attributes (IDs for anchor links, alt text for images)
  • Full preservation: Maintain all attributes for round-trip conversion

This would allow users to choose between clean output (which most expect from HTML→Markdown conversion) and technical preservation (needed for specific use cases like documentation sites requiring anchor links). The majority of conversions prioritize readability over technical fidelity, so defaulting to clean output while providing flexibility makes the most sense.

@Valgard
Copy link
Author

Valgard commented Jul 2, 2025

Hi @jgm,

It's been about 3 weeks since my last message and I haven't heard back yet. I wanted to follow up to see how you'd like me to proceed with this pull request.

I've already removed the special handling for the align attribute as discussed. The main question remaining is how to handle the broader issue of attribute preservation, particularly given how common classes and other attributes are on paragraphs in real-world HTML.

Should I:

  • Implement a configurable approach for attribute handling as we discussed?
  • Keep the current behavior and accept that some HTML→Markdown conversions might be cluttered?
  • Take a different approach entirely?

I'm happy to iterate on this or adjust the direction - just let me know what works best for the project.

Thanks!

@tarleb
Copy link
Collaborator

tarleb commented Jul 3, 2025

Great work!

My two cents on the matter: a new extension would probably be best, and the default should be to remove the "clutter".

@Valgard Valgard force-pushed the main branch 6 times, most recently from cf1207c to c643637 Compare July 30, 2025 03:50
Valgard added 5 commits July 31, 2025 18:55
- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`.
- HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag.
- Add tests for HTML paragraph attribute roundtrip.
- Update EPUB golden files to reflect new AST for attributed paragraphs.
Split pPara into pParaWithWrapper and pParaSimple helpers.
Ensure pParaWithWrapper correctly discards invalid align attributes.
Add specific tests for align attribute in HTML reader and writer.
- Update MANUAL.txt to reflect `native_divs` wrapping of
  attributed `<p>` tags.
- Add test cases for HTML to native, native to HTML, HTML to HTML,
  and HTML to HTML5 conversions
- Verify preservation of id, class, and data attributes on p tags
- Treat align attribute like any other attribute
- Always wrap paragraphs with attributes in divs (including align-only)
- Remove validation logic for align values
- Update tests to reflect consistent wrapper behavior
- Replace monolithic test file with format-specific test suite
  (32 new test files)
- Standardize paragraph attribute processing in 31 writer
  modules
- Add paragraph_attributes extension to control attribute
  preservation behavior
- Update shared writer utilities for consistent attribute
  handling
- Modify HTML tests to reflect new attribute processing
  logic

Fixes jgm#10768
@Valgard
Copy link
Author

Valgard commented Jul 31, 2025

Hi @jgm,

I've completed the implementation of the paragraph_attributes extension as discussed. The solution follows the approach we agreed upon:

Implementation Summary:

  • Extension only available for HTML output format (-t html+paragraph_attributes)
  • HTML reader always creates wrapper Divs for paragraphs with attributes
  • HTML writer unwraps wrapper Divs to <p> tags when extension is enabled, strips attributes when disabled
  • All other writers unwrap wrapper Divs to plain paragraphs (attributes always discarded)
  • Native writer preserves complete AST unchanged

Key Features:

  • Clean default behavior: attributes are stripped without the extension
  • Opt-in functionality: users enable extension only when needed
  • No attribute filtering: all attributes preserved when extension is active
  • Consistent wrapper Div handling across all writers
  • Full test coverage with no regressions

The implementation is ready for review. All tests pass and the extension works as intended for the use cases we discussed (EPUB index processing, semantic markup preservation, etc.).

Thanks for your guidance on the architecture - the extension-based approach works really well!

@jgm
Copy link
Owner

jgm commented Aug 1, 2025

  • HTML writer unwraps wrapper Divs to

    tags when extension is enabled, strips attributes when disabled

Is there a reason not to just always unwrap these? The other writers are doing that, and I don't see what's gained by throwing away this information.

@jgm
Copy link
Owner

jgm commented Aug 2, 2025

Update: in response to #11014 I added (unconditional) unwrapping of wrapper Divs to the HTML writer. In addition, thinking about this issue made me realize that paragraph_attributes is probably too narrowly focused. What about e.g. attributes on a ul element? I think a better approach would be an extension wrapper_divs that causes the HTML reader to create wrapper divs whenever an element contains attributes that would otherwise be ignored because there is no space for attributes in the corresponding AST constructor. Thoughts?

@Valgard
Copy link
Author

Valgard commented Aug 2, 2025

Hi @jgm!

Thanks for the update and for adding the unconditional unwrapping to the HTML writer in response to #11014!

You're absolutely right that paragraph_attributes is too narrowly focused. The wrapper_divs extension approach is much more elegant and would handle <ul>, <ol>, <li>, <blockquote>, and other elements that face the same attribute preservation issue.

Interestingly, I've actually already implemented unconditional unwrapping of wrapper Divs across all writers in this PR (including the HTML writer 😆) - so we're aligned on that approach! The current implementation unwraps wrapper Divs to plain elements regardless of any extension settings.

Given that the unconditional unwrapping is already in place, would it make sense to complete this PR as-is and then implement the broader wrapper_divs extension as a follow-up? My thinking:

  1. This PR demonstrates the pattern working across all writers with unconditional unwrapping
  2. The broader scope (all block elements + potential consolidation of existing attribute extensions) might be better suited for a separate, focused PR
  3. Incremental progress - we get immediate value for paragraph use cases while planning the comprehensive solution

The current implementation actually provides a solid foundation for wrapper_divs since the unwrapping mechanism is already proven to work consistently across all writers.

What's your preference - finish this focused PR and follow up with the broader wrapper_divs extension, or expand this PR to cover the full scope?

@jgm
Copy link
Owner

jgm commented Aug 3, 2025

No, I think it's better not to introduce paragraph_attributes if it is just going to be obsoleted by wrapper_divs. Sorry to introduce this wrinkle into your PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants