feature request: better <text> elements #56
Comments
This seems reasonable too. Unfortunately, it's also not that easy to implement because it would require a lot of changes to the current code base. Maybe it's easier to post-process the SVG files with an XSLT stylesheet in order to derive the desired format. I'll have a look.
I probably won't implement this feature as part of dvisvgm in the near future. However, it doesn't seem to be too complicated to create the desired output with an XSLT script. Here is a quick first attempt. It takes the output of |
I took a look at this issue, and I believe the first issue is impossible to solve with For the second issue, @mgieseki do you think it is possible to add that feature? One issue is that it's possible for a single word to be broken up into multiple |
It's indeed not easy to detect words and word boundaries from plain DVI data since spaces are realized by explicit movements of the virtual cursor which determines the position of the next character (or any other visual object) to be placed. Horizontal movements also occur in case of kerning, stretched letter spacing, inside math formulae, etc. There are some ways to guess whether a horizontal movement denotes a space or something else, e.g. based on the space-related TFM data of a font, but it's not completely reliable.
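To make the kind of guess described above concrete, here is a minimal Python sketch. It is not dvisvgm's code: the `Glyph` record, the `insert_spaces` function, and the `ratio` threshold are hypothetical names chosen for illustration, and `space_width` stands in for the space-related TFM value of the font. As noted, such a heuristic will misfire on stretched letter spacing or inside math formulae.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float        # horizontal position of the glyph origin
    advance: float  # advance width of the glyph (from the font metrics)


def insert_spaces(glyphs, space_width, ratio=0.5):
    """Join glyphs into a string and guess a space wherever the gap between
    the end of one glyph and the start of the next exceeds ratio * space_width.

    `ratio` is an assumed tuning parameter, not a value taken from dvisvgm.
    """
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.advance)
            # Small or negative gaps are treated as kerning, not as a space.
            if gap > ratio * space_width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Made-up metrics: a word gap between "b" and "c", a kern between "c" and "d".
sample = [Glyph("a", 0.0, 5.0), Glyph("b", 5.0, 5.0),
          Glyph("c", 13.3, 5.0), Glyph("d", 18.0, 5.0)]
print(insert_spaces(sample, space_width=3.33))
```

Raising `ratio` makes the guess more conservative (fewer false spaces in loosely set text), while lowering it catches tight word gaps at the cost of splitting letter-spaced words.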
The search issue is not caused by spreading the characters over several |
The <text> elements generated by dvisvgm are "suboptimal". Consider the following input:
\documentclass{article}
\begin{document}
Hallo Welt! Dies ist ein längerer Text.
\end{document}
The typical output is:
<text class='f0' x='67.746' y='63.7609'>Hallo<tspan x='93.9505'>W</tspan>
<tspan x='103.321'>elt!</tspan>
<tspan x='121.535'>Dies</tspan>
<tspan x='143.521'>ist</tspan>
<tspan x='157.374'>ein</tspan>
<tspan x='173.377'>l㑿</tspan>
<tspan x='176.116'>angerer</tspan>
<tspan x='211.474'>T</tspan>
<tspan x='217.812'>ext.</tspan>
There are two problems with this:
1. The words are not separated by spaces, which hampers searching. It would be "more than helpful" if spaces were inserted into the output, for instance using a heuristic: if the horizontal advance between two letters exceeds a certain threshold, a space is added.
2. The output is very large. This is a real problem: the pgfmanual takes about 10 MB as a PDF and about 600 MB (!) as a sequence of SVGs. Admittedly, a lot of this is due to the embedded fonts (addressed in a different feature request), but at least 100 MB is caused by the tspan elements alone.
I propose the following change (knowing that it is not trivial, but it should be doable):
For each line, use a single tspan (on a font change, use a nested tspan for the affected text; on a special or a rect, end the current tspan and start a new one afterwards) and use the dx attribute to set the spacing and kerning for each letter:
<text class='f0' x='67.746' y='63.7609'>
<tspan dx='0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 ... -1.5'>
Hallo Welt! Dies ist ein l㑿angerer Text.</tspan>
<tspan x='67.746' dy='12' dx='0 0 0 0 0 -1 ...'>
Eine zweite Zeile.</tspan>
</text>
The semantics of dx is that you specify an offset for each letter. Naturally, we will have lots of 0s separated by spaces, but this is still more compact than a tspan for every three to four letters (and compresses more easily). Also, the text stays uninterrupted in the XML, which is useful for searching and processing purposes.
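As a rough illustration of the proposal, the following Python sketch derives the dx list for one line from absolute glyph positions. It is only a sketch under assumptions: the `(char, x, advance)` tuples, and the functions `dx_values`, `fmt`, and `line_tspan`, are hypothetical and not part of dvisvgm. It also assumes the renderer advances by exactly `advance` after each glyph, so an inserted space character would need its own entry with the advance the renderer actually uses for it.

```python
from xml.sax.saxutils import escape


def dx_values(glyphs):
    """glyphs: list of (char, x, advance) tuples for one line, in reading order.

    Returns one offset per glyph: the difference between the position the
    glyph should have and the position the renderer reaches after the
    previous glyph's advance. The first glyph is placed by the x attribute.
    """
    dx = []
    pen = None
    for _, x, advance in glyphs:
        dx.append(0.0 if pen is None else x - pen)
        pen = x + advance
    return dx


def fmt(value):
    """Format a number compactly: three decimals, trailing zeros removed."""
    s = f"{value:.3f}".rstrip("0").rstrip(".")
    return "0" if s in ("", "-0") else s


def line_tspan(glyphs):
    """Build a single tspan for the line, with kerning encoded in dx."""
    text = escape("".join(c for c, _, _ in glyphs))
    dx = " ".join(fmt(v) for v in dx_values(glyphs))
    return f"<tspan x='{fmt(glyphs[0][1])}' dx='{dx}'>{text}</tspan>"


# Made-up metrics: "e" is kerned 1 unit towards "T".
line = [("T", 10.0, 7.0), ("e", 16.0, 4.0), ("x", 20.0, 5.0)]
print(line_tspan(line))
```

The text content stays contiguous, and the per-glyph adjustments live in one attribute, which is the layout the proposal above asks for.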