chocolatkey
Member

Work in progress. Given the following input:

<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
	<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
	<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
	<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>

	<div>
	<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
	<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
	</div>


	<section role="doc-chapter" epub:type="chapter">
		<h1>Title of the chapter</h1>
	</section>
	<ul>
		<li>First item</li>
		<li>Second item</li>
		<li>Third item</li>
	</ul>
	<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>

	<img src="image1.avif" alt="Alternative text using the alt attribute">
	<span role="img" aria-label="Rating: 4 out of 5 stars">
		<span></span>
		<span></span>
		<span></span>
		<span></span>
		<span></span>
	</span>
	<figure aria-labelledby="cat-caption"> 
		<pre>
			/\_/\
		( o.o )
				 ^ 
		</pre>
		<figcaption id="cat-caption">
		ASCII Art of a cat face
		</figcaption>
	</figure>
</body>
</html>

the following guided nav doc is generated:

{
    "guided": [
        {
            "children": [
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image: "
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "This job requires a certain "
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "savoir faire"
                            }
                        },
                        {
                            "text": " that can only be acquired over time."
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "This is a paragraph with some very-strong bold text!"
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "And the next pagebreak is in the middle of a sentence."
                                }
                            ],
                            "role": [
                                "paragraph"
                            ]
                        }
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "Title of the chapter"
                                }
                            ],
                            "role": [
                                "heading"
                            ]
                        }
                    ],
                    "role": [
                        "chapter"
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "First item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        },
                        {
                            "children": [
                                {
                                    "text": "Second item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        },
                        {
                            "children": [
                                {
                                    "text": "Third item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        }
                    ],
                    "role": [
                        "list"
                    ]
                },
                {
                    "children": [
                        {
                            "imgref": "with_image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "description": "Alternative text using the alt attribute",
                    "imgref": "image1.avif",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "Rating: 4 out of 5 stars",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "ASCII Art of a cat face",
                    "role": [
                        "figure"
                    ]
                }
            ]
        }
    ]
}

@HadrienGardeur
Member

Looking at the results, here are a few early comments:

  • We shouldn't split text into multiple elements when we encounter another language, as we did with Content Iterator. Instead, we should use SSML on the text and indicate language changes that way.
  • SSML should also handle emphasis, which would cover at least <em> and <i>, and probably <strong> and <b> as well.
  • We seem to use too many children everywhere. For example, the <h1> element should result in a single object with a role (heading), a level (missing right now) and a text.
  • Support for pagebreaks seems to be missing, whether they stand on their own or sit within another element (which would require SSML).
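
To make the third point concrete, here is a minimal Python sketch of collapsing a node whose only child is a bare text node into a single object, and of deriving a heading level from the tag name. It is not the toolkit's code: the helper names `heading_level` and `hoist_single_text` are hypothetical, and nodes are assumed to be plain dictionaries shaped like the JSON above.

```python
import re


def heading_level(tag):
    """Return 1-6 for h1-h6, or None for any other tag name."""
    match = re.fullmatch(r"h([1-6])", tag.lower())
    return int(match.group(1)) if match else None


def hoist_single_text(node):
    """If the node's only child is a bare {"text": ...} object, merge it in."""
    children = node.get("children", [])
    if len(children) == 1 and set(children[0]) == {"text"}:
        merged = {key: value for key, value in node.items() if key != "children"}
        merged["text"] = children[0]["text"]
        return merged
    return node  # anything more complex is left untouched


# The <h1> from the sample, as it appears in the first output.
heading = {"role": ["heading"], "children": [{"text": "Title of the chapter"}]}
heading = hoist_single_text(heading)
heading["level"] = heading_level("h1")
```

With these assumptions, `heading` ends up holding a role, a text and a level, with no `children` array.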

@chocolatkey
Member Author

chocolatkey commented Oct 20, 2025

Updated input:

<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
	<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
	<p xml:lang="fr">Paragraphe avec image #1 <img src="src/image.jpg" alt="A cool image" /> et #2 <img src="src/image.jpg" alt="A second cool image" />!</p>
	<p xml:lang="fr"><img src="src/image.jpg" alt="The coolest image" /> et <img src="src/image.jpg" alt="The boring image" /></p>
	<p>A paragraph with: <img src="src/image.jpg" alt="A cool image" /><em xml:lang="fr">est cool!</em></p>
	<p><i>Simple paragraph</i></p>
	<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
	<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>
	<p>Just<br />testing<br>some<br /> breaks! And useless <span>elements</span>...</p>

	<div>
	<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
	<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
	</div>


	<section role="doc-chapter" epub:type="chapter">
		<h1>Title of the chapter</h1>
	</section>
	<ul>
		<li>First item</li>
		<li>Second item</li>
		<li>Third item</li>
	</ul>
	<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>
	<p aria-hidden="true">More Hidden text</p>
	<p aria-hidden="true">More Hidden text</p>

	<img src="image1.avif" alt="Alternative text using the alt attribute">
	<span role="img" aria-label="Rating: 4 out of 5 stars">
		<span></span>
		<span></span>
		<span></span>
		<span></span>
		<span></span>
	</span>
	<figure aria-labelledby="cat-caption"> 
		<pre>
			/\_/\
		( o.o )
			^ 
		</pre>
		<figcaption id="cat-caption">
		ASCII Art of a cat face
		</figcaption>
	</figure>
</body>
</html>

output:

{
    "guided": [
        {
            "children": [
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image:"
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image #1"
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "et #2"
                            }
                        },
                        {
                            "description": "A second cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "!"
                            }
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "description": "The coolest image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "et"
                            }
                        },
                        {
                            "description": "The boring image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "A paragraph with:"
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "ssml": "<emphasis xml:lang=\"fr\">est cool!</emphasis>"
                            }
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis level=\"reduced\">Simple paragraph</emphasis>"
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis>This job requires a certain </emphasis><lang xml:lang=\"fr\">savoir faire</lang>  that can only be acquired over time."
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis>This is a paragraph </emphasis><emphasis>with some very-</emphasis><emphasis>strong</emphasis>  bold text!"
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "Just<break/>testing<break/>some<break/> breaks! And useless elements..."
                    }
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "role": [
                                        "paragraph"
                                    ],
                                    "text": "And the next pagebreak is in the middle of a sentence."
                                }
                            ],
                            "role": [
                                "pagebreak"
                            ]
                        }
                    ]
                },
                {
                    "children": [
                        {
                            "level": 1,
                            "role": [
                                "heading"
                            ],
                            "text": "Title of the chapter"
                        }
                    ],
                    "role": [
                        "chapter"
                    ]
                },
                {
                    "children": [
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "First item"
                        },
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "Second item"
                        },
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "Third item"
                        }
                    ],
                    "role": [
                        "list"
                    ]
                },
                {
                    "description": "Alternative text using the alt attribute",
                    "imgref": "image1.avif",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "Rating: 4 out of 5 stars",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "ASCII Art of a cat face",
                    "role": [
                        "figure"
                    ]
                }
            ]
        }
    ]
}

@chocolatkey
Member Author

Notes:

  • The HTML --> SSML conversion now works as follows: <em> and <b> become <emphasis>, <i> becomes <emphasis level="reduced">, <strong> becomes <emphasis level="strong">, and <br> becomes <break>. Any change of language in the document becomes <lang xml:lang="xx">. Let me know if other mappings are needed.
  • In the example for "Title of Chapter", the roles are ["section", "chapter"], whereas the output above has just ["chapter"]. Since section is by definition more generic than chapter, this seems fine to me. It's only chapter because, currently, when an element has an ARIA role, inferring a role from the HTML tag itself is skipped.
  • @HadrienGardeur What will we do about videos? There's an audio/img/text ref but no video ref.
  • noteref and pagebreak are WIP. I'm evaluating the best way to query and link things together in the tree: whether a homegrown search will suffice or whether we need goquery.
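
As a rough illustration of the tag mapping in the first note, here is a minimal Python sketch built on the standard library's `html.parser`. It is not the toolkit's implementation and is not meant to reproduce the output above byte for byte. The names `SsmlSketch`, `html_to_ssml` and `base_lang` are hypothetical, the input is assumed to be well nested, and placing `<lang>` outside `<emphasis>` when one element carries both is an assumption of this sketch.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Tag mapping as described in the notes above.
SSML_OPEN = {
    "em": "<emphasis>",
    "b": "<emphasis>",
    "i": '<emphasis level="reduced">',
    "strong": '<emphasis level="strong">',
}
VOID_TAGS = {"br", "img"}  # elements that never get a closing tag


class SsmlSketch(HTMLParser):
    def __init__(self, base_lang="en"):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.langs = [base_lang]  # stack of languages in effect
        self.closers = []         # closing markup owed by each open element

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.out.append("<break/>")
        if tag in VOID_TAGS:
            return  # images are out of scope for this sketch
        attrs = dict(attrs)
        closing = ""
        lang = attrs.get("xml:lang") or attrs.get("lang")
        if lang and lang != self.langs[-1]:
            self.out.append(f"<lang xml:lang={quoteattr(lang)}>")
            closing = "</lang>"
        self.langs.append(lang or self.langs[-1])
        if tag in SSML_OPEN:
            self.out.append(SSML_OPEN[tag])
            closing = "</emphasis>" + closing
        self.closers.append(closing)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.closers:
            return
        self.langs.pop()
        self.out.append(self.closers.pop())

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_ssml(fragment, base_lang="en"):
    """Convert the inline content of one element to an SSML string."""
    parser = SsmlSketch(base_lang)
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


print(html_to_ssml("Just<br />testing<br>some<br /> breaks!"))
# prints: Just<break/>testing<break/>some<break/> breaks!
```

Elements with no mapping, such as a plain `<span>`, contribute only their text, which matches the "useless elements" case in the sample input.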

@HadrienGardeur
Member

HadrienGardeur commented Oct 20, 2025

Looking better overall.

I still notice objects with just children in them when we don't match the HTML element to a role though: that's the case for <body> and <div> in this example.

Given the very large number of <div> or <span> in an ebook, it would be better if we could avoid this.
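
One possible way to avoid these wrappers, sketched in Python under the assumption that nodes are plain dictionaries shaped like the JSON above: dissolve any object that carries nothing but `children` and hand its children to the parent. The function name `flatten` is hypothetical and this is not the toolkit's code.

```python
def flatten(node):
    """Return the list of nodes that should replace `node` in its parent.

    An object that carries nothing except "children" (an unmatched <div>,
    <span> or <body>) is dissolved and its children are returned instead.
    """
    children = []
    for child in node.get("children", []):
        children.extend(flatten(child))

    if set(node) <= {"children"}:
        return children  # role-less wrapper: hoist whatever it contained

    result = {key: value for key, value in node.items() if key != "children"}
    if children:
        result["children"] = children
    return [result]


# The <div> wrapper around the pagebreak paragraph from the first output.
wrapper = {
    "children": [
        {
            "role": ["paragraph"],
            "children": [
                {"text": "And the next pagebreak is in the middle of a sentence."}
            ],
        }
    ]
}
flattened = flatten(wrapper)
```

Here `flattened` is a one-item list holding the paragraph itself. The inner text object is also dissolved only if it has no keys besides `children`, so bare `{"text": ...}` nodes survive.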

The examples with an image in the middle of a sentence also make me wonder if we shouldn't have an approach similar to pagebreaks and notes, where we use a custom SSML tag instead of breaking up text into multiple objects.

This would apply to <img>, <audio> and <video>.

If we go back to this example:

<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>

The output should look like this:

{
  "role": ["paragraph"],
  "text": {
    "language": "fr",
    "ssml": "Paragraphe avec image: <readium:image id=\"image1\" />",
    "children": [
      {
        "role": ["image"],
        "id": "image1",
        "imgref": "src/image.jpg",
        "description": "A cool image"
      }
    ]
  }
}
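
For illustration only, the shape above could be assembled along these lines in Python. The helper name `paragraph_with_images`, its `parts` input (a list mixing strings and image dictionaries) and the `imageN` numbering scheme are assumptions of this sketch. The placement of `children` inside `text` simply mirrors the example above.

```python
from xml.sax.saxutils import escape, quoteattr


def paragraph_with_images(parts, language=None):
    """Build one paragraph object, replacing inline images with placeholders."""
    ssml, children = [], []
    for part in parts:
        if isinstance(part, str):
            ssml.append(escape(part))
            continue
        # Hypothetical id scheme: image1, image2, ... in document order.
        image_id = f"image{len(children) + 1}"
        ssml.append(f"<readium:image id={quoteattr(image_id)} />")
        children.append({"role": ["image"], "id": image_id, **part})

    text = {"ssml": "".join(ssml)}
    if language:
        text["language"] = language
    if children:
        text["children"] = children
    return {"role": ["paragraph"], "text": text}


paragraph = paragraph_with_images(
    [
        "Paragraphe avec image: ",
        {"imgref": "src/image.jpg", "description": "A cool image"},
    ],
    language="fr",
)
```

A paragraph with several images, like the "#1 ... et #2" sample, would get one placeholder and one child per image while remaining a single object.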

@HadrienGardeur
Member

HadrienGardeur commented Oct 20, 2025

For further contextualization, I think that we should include textref in our top-level nodes at least.

For example, if we add body as a role:

{
  "role": ["body"],
  "textref": "chapter.xhtml",
  "children": []
}

To further help with an implementation optimized for search and/or highlighting, we could also go beyond that and provide this information per node with fragments such as:

  • ID (#identifier)
  • and/or CSS selectors (#css(.content:nth-child(2)))

For example a paragraph with par1 as its identifier:

{
  "role": ["paragraph"],
  "textref": "chapter.xhtml#par1"
}
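
A small Python sketch of how such a `textref` might be composed. The function name and its parameters are hypothetical, the `#css(...)` fragment form is taken from the suggestion above, and preferring an ID over a CSS selector when both are available is an assumption.

```python
def textref(href, element_id=None, css_selector=None):
    """Compose a textref for a node; an element ID wins over a CSS selector."""
    if element_id:
        return f"{href}#{element_id}"
    if css_selector:
        return f"{href}#css({css_selector})"
    return href  # no fragment: points at the whole resource


paragraph = {
    "role": ["paragraph"],
    "textref": textref("chapter.xhtml", element_id="par1"),
}
```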

@HadrienGardeur
Member

> The following HTML --> SSML logic now takes place: <em> and <b> are turned into <emphasis>. <i> becomes <emphasis level="reduced">. <strong> becomes <emphasis level="strong">. <br> becomes <break>. Any change in language in the document becomes <lang xml:lang="xx">. Let me know if others are needed

@GoobyTheBOI any thoughts on this based on your own work?
