-
Notifications
You must be signed in to change notification settings - Fork 42
HISTORICAL: Introduction to xml2rfc Version 3
Author: | Henrik Levkowetz [email protected] |
---|---|
Date: | 20 Feb 2019 |
Contents
This document is intended to give an overview of the changes to xml2rfc
resulting from the introduction of new renderers to support vocabulary version
3 of the RFC and Internet-Draft XML
markup language.
In 1999, Marshall Rose presented an XML
markup language for RFCs and
Internet-Drafts in RFC 2629. He also at the same time provided a tool
to convert the XML
markup to the text format used for RFC and draft publication.
In December 2016, RFC 7991 presented a revised and expanded markup vocabulary
based on the original RFC 2629 vocabulary. Given that the RFC 2629 vocabulary
had received a number of extensions over the years, the de facto vocabulary
implemented by the xml2rfc
tool in 2016 was called vocabulary 'v2
' for short,
and the vocabulary in RFC 7991 was called 'v3
'.
Recent versions of xml2rfc
, starting with version 2.7.0, have gradually
introduced features to support the v3
vocabulary. This document aims at giving
an overview of how the new features have been assembled and made available
in xml2rfc
.
The v3
XML
schema is a superset of the v2
schema. This means that
any v2
input document is a valid document according to the v3
schema.
The original v3 specification, RFC 7991, did however specify a number of
elements as being deprecated, and RFC 7998 went on to specify that the
preptool
, which is run before any v3
renderer, will limit elements
appearing in the new RFC publication format to not include any deprecated
elements.
This means that while any v2
document is valid input to the xml2rfc
tool, any deprecated XML
elements will be transformed to v3
-specific
constructs before being processed by the preptool
and the v3
renderers.
More about this under Syntax Changes in v3 XML Sources below.
This distinction should be clear when looking at the xml2rfc
processing
flow.
When speaking about v3
XML
in this remainder of this document, what
is meant is XML
that does not use any of the deprecated v2
elements.
Conceptually, the following diagram shows the processing flow through both
legacy (v2
) and new v3
processing blocks. Noteworthy is that the --v3
switch
controls the path taken to reach the output renderers; it does not restrict
the input file format.
The diagram above is a simplified presentation. For some of the details in the diagram, additional explanations are in order:
- The
--v3
switch is implied when other switches specific to thev3
processing path are given, for example--prep
,--v2v3
, and- The
v3
processing path is also implicitly chosen when the inputXML
document is declared to have version 3 by theversion="3"
attribute on the<xml>
element.- The original implementation of
xml2rfc
permitted the generation of multiple output format files with in the samexml2rfc
invocation, by specifying multiple format switches. This continues to be permitted, but the--v3
modifier switches all output from the legacy formats to thev3
formats. Output ofv2
andv3
format files in the same invocation is not supported.
The full set of available output format switches are at this writing:
Usage: xml2rfc SOURCE [OPTIONS] ... Example: xml2rfc draft.xml -o Draft-1.0 --text --html Options: -h, --help show this help message and exit Formats: Any or all of the following output formats may be specified. The default is --text. The destination filename will be based on the input filename, unless an argument is given to --basename. --text outputs to a text file with proper page breaks --html outputs to an html file --nroff outputs to an nroff file --pdf (unavailable due to missing external library) --raw outputs to a text file, unpaginated --expand outputs to an XML file with all references expanded --v2v3 convert vocabulary version 2 XML to version 3 --preptool run preptool on the input --info generate a JSON file with anchor to section lookup information
with the following modifiers:
Format Options: --v3 with --text and --html: use the v3 formatter, rather than the legacy one. --legacy with --text and --html: use the legacy text formatter, rather than the v3 one.
First of all, please note that you can use v2
XML
source files with
xml2rfc
, and still request v3
output formatters. In this case, xml2rfc
will run the v2
-to-v3
converter internally, to convert any v2
elements
in your input source to the equivalent v3
constructs. Everything should
work as expected, but as long as you are using v2
elements in your
input file, there will be a conversion step, and you will not be in full
control of the XML
which is actually sent to the renderers.
If you wish to transition to v3
source files, you can always convert
v2
sources explicitly to v3
by using the --v2v3
switch, and then
continue working with the resulting XML
file. When you do so, you need to
know which constructs are no longer acceptable, and which v3
-only constructs
to use instead.
Here are the v2
elements you should not use any more, and their
replacements:
Don't use these to generate lists:
list t # to indicate list element, as a child of <list>
Instead, use one of the 3 new list types: <ul>
[1],
<ol>
[2] and <dl>
[3]. These
map directly to the identically named HTML
elements, and are used in the same
way. For <ul>
(unordered list) and <ol>
(ordered list) use <li>
[4]
to indicate individual list elements. This:
<ul> <li>First item of an unordered list</li> <li>Second item</li> </ul>
should translate to:
- First item of an unordered list
- Second item
and similarly this:
<ol> <li>First item of an ordered list</li> <li>Second item</li> </ol>
should translate to:
- First item of an ordered list
- Second item
and finally, for definition lists:
<dl> <dt newline="true">what</dt> <dd> Definition lists associate a term with a definition. </dd> </dl>
should give:
- what
- Definition lists associate a term with a definition.
These, previously used to build tables, are deprecated:
texttable ttcol c
Instead, use nested <table>
[5]/ <tbody>
[6]/ <tr>
[7] / <td>
[8] the same
way you would in HTML
:
<table> <tbody> <tr> <td>Cell 1.1</td> <td>Cell 1.2</td> <tr> <tr> <td>Cell 2.1</td> <td>Cell 2.2</td> <tr> </tbody> </table>
in order to get:
Cell 1.1 Cell 1.2 Cell 2.1 Cell 2.2
To add table headers and footers, use <thead>
and <tfoot>
.
Vocabulary version 2 had special elements associated with text before or after figures and tables. These have been deprecated:
postamble preamble
Instead, simply add regular <t>
paragraphs before and/or after the figure
or table.
Deprecated:
spanx
Instead, specific text attribute elements have been introduced:
<em>
for emphasised text (typically rendered as slanted or italic text )<strong>
for boldly rendered text<sub>
for subscripttext<sup>
for superscripttext<tt>
for 'teletype' text (typically ageneric mono-spaced text
).
These are also deprecated:
facsimile vspace
<facsimile>
has no replacement.
<vspace>
has been replaced by an attribute newline="true"
when used
with definition lists, in order to make the definition start on a new line.
For other use cases, there is no replacement. A suggestion to support <br/>
generally in any inline context was vigorously opposed by some design team
members.
The v3
vocabulary introduces the possibility of providing multiple alternative
artwork executions. Where the artwork type is different between the alternatives,
this lets the renderer pick the best available alternative, as a function of the
output format.
The current implementation of xml2rfc
will prefer SVG
type artwork over
other alternatives when rendering HTML
and PDF
output, and will prefer
ascii-art
when rendering plain text output.
In order to specify a set of alternatives for a given artwork, you enclose all
of the alternative executions within an <artset>
element:
<artset anchor="flow-chart"> <artwork type="svg" src="flowchart.svg"/> <artwork type="ascii-art" src="flowchart.txt"/> </artset>
The XML
snippet above also illustrates a few other noteworthy features of the
<artwork>
and <artset>
elements under v3
:
The
type
attribute is necessary on<artwork>
ifxml2rfc
is to select the best execution for a given output format.The
src
attribute on<artwork>
makes it easy to work with external artwork files, as produced by various drawing tools. When run through thepreptool
, all external content will be pulled into the final prepped file, in order to have a publication file without any external dependencies.When referring to artwork from document text, any reference to a particular format out of several grouped within an
<artset>
is inappropriate, as there is no guarantee that one particular<artwork>
entry will be used in the rendering.For this reason, it is best to place any
anchor
attribute on the<artset>
element instead of on the<artwork>
elements. If there are anchors on<artwork>
elements within an<artset>
element, and no anchor on<artset>
, thepreptool
will promote the first<artwork>
anchor to the<artset>
element. Remaining anchors on enclosed<artwork>
elements will be removed.
Section naming is no longer done by setting a title
attribute on a <section>
elements. Instead use the new <name>
element, placed as the first child element
of the <section>
.
In v3
XML
it is possible to specify an alternative display tag for a
reference, using the new <displayreference>
element.
This element gives a mapping between the anchor of a reference and a name that will be displayed instead.
For example, if the reference uses the anchor "RFC6949", the following would cause that anchor in the body of displayed documents to be "RFC-DEV":
<xref target="RFC6949" /> ... <displayreference target="RFC6949" to="RFC-DEV"/> ... <reference anchor='RFC6949' target='https://www.rfc-editor.org/info/rfc6949'> <front> <title>RFC Series Format Requirements and Future Development</title> ... </reference>
In order to be able to better render for instance STD references which consist of
multiple individual RFCs, <referencegroup>
provides a way to group references
under one reference anchor:
<referencegroup anchor="STD78" target="https://www.rfc-editor.org/info/std78" > <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5343.xml"/> <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5590.xml"/> <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5591.xml"/> <xi:include href="https://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6353.xml"/> </referencegroup>
The following describes the v3
Unicode handling as implemented, with
the modifications and additions described in
draft-levkowetz-xml2rfc-v3-implementation-notes. For the <u>
element
in particular, this goes beyond what is described in RFC 7991.
In v3
, the elements <author>
, <organization>
, <street>
,
<city>
, <region>
, <code>
, <country>
, <postalLine>
,
<email>
, and <seriesInfo>
may contain non-ascii characters for the
purpose of rendering author names, addresses, and reference titles correctly.
They also have an additional "ascii" attribute for the purpose of proper
rendering in ascii-only media.
In order to insert Unicode characters in any other context, v3
requires
that the Unicode string be enclosed within an <u>
element. The element
will be expanded inline based on the value of its format
attribute. This
provides a generalised means of generating the 6 methods of Unicode renderings
listed in [RFC7997], Section 3.4, and also several others found in for
instance the RFC Format Tools example rendering of RFC 7700, at https://rfc-format.github.io/draft-iab-rfc-css-bis/sample2-v2.html.
The format
attribute accepts either a simplified format
specification, or a full format string with placeholders for the
various possible Unicode expansions.
The simplified format consists of dash-separated keywords, where each
keyword represents a possible expansion of the Unicode character or
string; use for example <u "lit-num-name">foo</u>
to expand the
text to its literal value, code point values, and code point names.
A combination of up to 3 of the following keywords may be used,
separated by dashes: num
, lit
, name
, ascii
, char
. The
keywords are expanded as follows and combined, with the second and
third enclosed in parentheses (if present):
"num"
- The numeric value(s) of the element text, in U+1234 notation
"name"
- The Unicode name(s) of the element text
"lit"
- The literal element text, enclosed in quotes
"char"
- The literal element text, without quotes
"ascii"
- The provided ASCII value
In order to ensure that no specification mistakes can result for
rendering methods that cannot render all Unicode code points, "num"
must always be part of the specified format.
The default value of the format
attribute is "lit-name-num"
.
Examples:
format="num-lit": Temperature changes in the Temperature Control Protocol are indicated by the character U+0394 ("Δ"). format="num-name": Temperature changes in the Temperature Control Protocol are indicated by the character U+0394 (GREEK CAPITAL LETTER DELTA). format="num-lit-name": Temperature changes in the Temperature Control Protocol are indicated by the character U+0394 ("Δ"). format="num-name-lit": Temperature changes in the Temperature Control Protocol are indicated by the character U+0394 (GREEK CAPITAL LETTER DELTA, "Δ"). format="name-lit-num": Temperature changes in the Temperature Control Protocol are indicated by the character GREEK CAPITAL LETTER DELTA ("Δ", U+0394). format="lit-name-num": Temperature changes in the Temperature Control Protocol are indicated by the character "Δ" (GREEK CAPITAL LETTER DELTA, U+0394).
If the <u>
element encloses a Unicode string, rather than a single
code point, the rendering reflects this. The element
<u format="num-lit">ᏚᎢᎵᎬᎢᎬᏒ</u>
will be expanded to U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2
("ᏚᎢᎵᎬᎢᎬᏒ")
.
Unicode characters in document text which are not enclosed in <u>
will be replaced with a question mark (?) and a warning will be
issued.
In order to provide for cases where the simplified format above is
insufficient, without relinquishing the requirement that the number
of a code point always must be rendered, the format
attribute can
also accept a full format string. This format uses placeholders
which consist of any of the key words above enclosed in curly braces;
outside of this, any ascii text is permissible. For example,
The <u format="{lit} character ({num})">Δ</u>.
will be rendered as
The "Δ" character (U+0394).
The code in various places give special consideration to the code points with these alternative names, defined in the rfc2629-xhtml.ent file which is part of the distribution:
<!ENTITY nbsp " "><!-- U+00A0 NO-BREAK SPACE --> <!ENTITY zwsp "​"><!-- U+200B ZERO WIDTH SPACE --> <!ENTITY nbhy "‑"><!-- U+2011 NON BREAKING HYPHEN --> <!ENTITY br "
"><!-- U+2028 LINE SEPARATOR --> <!ENTITY wj "⁠"><!-- U+2060 WORD JOINER -->
If any of these entity references are used in an input file, they are converted to unicode code points during parsing, for later consideration by the various formatters. Some of these (like U+2028) are always consumed by the formatter and never visible in the end result. Others are permitted to emerge in HTML output, but not in other formats.
Now, given that the entity references mentioned above are converted to code points on parsing, they won't be visible as entity references after v2v3 conversion. The RPC has found this a bit problematic, as their editor only shows a placeholder square for all of them. Even if the v2 input received by the RPC from an author contains ' ' they would not see ' ' after v2v3 conversion. For this reason, there is a step to convert these 5 code points back to the entity references listed above before writing out a v2v3 conversion result to file.
[1] | RFC 7991, Section 2.63: ul |
[2] | RFC 7991, Section 2.34: ol |
[3] | RFC 7991, Section 2.20: dl |
[4] | RFC 7991, Section 2.29: li |
[5] | RFC 7991, Section 2.54: table |
[6] | RFC 7991, Section 2.55: tbody |
[7] | RFC 7991, Section 2.61: tr |
[8] | RFC 7991, Section 2.56: td |