Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: new YaDeserialize interface #84

Open
scottlamb opened this issue Jul 4, 2020 · 3 comments
Open

RFC: new YaDeserialize interface #84

scottlamb opened this issue Jul 4, 2020 · 3 comments

Comments

@scottlamb
Copy link
Contributor

Rough proposal for a new YaDeserialize interface below. This started from my more modest proposal in #76. I found some other issues as I dug in more, and I was encouraged by @MarcAntoine-Arnaud 's "I'm quite open to big changes in this library". I've tried to start from first principles. I'd be very happy to get feedback and revise.

My goal is to support ONVIF well, and I'm trying to use @DmitrySamoylov 's library for that purpose, so the examples here will likely be quite familiar to him.


We propose a new YaDeserialize interface with several benefits over the present one:

  • Requires less generated and monomorphized code, shrinking binaries and speeding compilation.
  • Reduces possible YaDeserialize implementation errors.
  • Efficiently supports flattened fields and SOAP bodies.
  • Allows returning richer errors from xml-rs and YaDeserialize implementations.

Proposed interface

pub fn from_reader<R, T>(source: R) -> Result<T, Error>
where R: ::std::io::Read, T: YaDeserialize + Default { ... }

pub fn from_str<R, T>(source: &str) -> Result<T, Error>
where T: YaDeserialize + Default { ... }

pub fn read<R, T>(source: R, to: &mut T) -> Result<(), Error>
where R: ::std::io::Read, T: YaDeserialize { ... }

pub fn read_str<R, T>(source: &str, to: &mut T) -> Result<(), Error>
where T: YaDeserialize { ... }

pub struct ElementReader<'a> { ... }

/// Reader for a particular element and its subtree.
/// May be dropped to ignore the element.
impl<'a> ElementReader<'a> {
  pub fn name(&self) -> &::xml::OwnedName { ... }

  pub fn namespace(&self) -> &::xml::Namespace { ... }

  pub fn depth(&self) -> usize { ... }

  /// Reads a node which is expected to only contain text.
  ///
  /// Attributes and embedded comments are ignored.
  ///
  /// Returns error if there are any child elements.
  pub fn read_str(self) -> Result<String, Error> { ... }

  /// Passes responsibility for this element (and thus its subtree) to the
  /// given `YaDeserialize`. 
  pub fn read_to(self, s: &mut YaDeserialize) -> Result<(), Error> { ... }
}

/// Deserializes an XML element and its subtree.
pub trait YaDeserialize {
  /// Processes the start of this element.
  ///
  /// `reader` refers to the element itself and may be used to examine the
  /// element's name, namespace mappings, and depth.
  ///
  /// It's recommended that implementers validate the name and namespace if
  /// and only if this element is at depth 0. Non-root elements should perform
  /// no such validation, as the parent is responsible for determining what
  /// child elements map to what types.
  ///
  /// Attributes are supplied via `attribute` calls immediately after `start`.
  fn start(&mut self, cur: &ElementReader) -> Result<(), Error> {
    Ok(())
  }

  /// Processes a given attribute of this element's start tag.
  ///
  /// Implementations which can be used as flattened fields may consume
  /// attributes by returning `Ok(None)`, or "pass" them to later
  /// flattened fields by returning `Ok(Some(attr))`. If this `YaDeserialize`
  /// is not being used as a flattened field, this distinction is irrelevant.
  fn attribute(&mut self, cur: &ElementReader, attr: ::xml::OwnedAttribute)
               -> Result<Option<::xml::OwnedAttribute>, Error> {
    Ok(Some(attr))
  }

  /// Processes a child element.
  ///
  /// Implementations which can be used as flattened fields may consume
  /// elements by returning `Ok(None)`, or "pass" them to later
  /// flattened fields by returning `Ok(Some(child))`. If this `YaDeserialize`
  /// is not being used as a flattened field, this distinction is irrelevant.
  fn element(&mut self, cur: &ElementReader, child: ElementReader)
             -> Result<Option<ElementReader>, Error> {
    Ok(Some(child))
  }

  /// Processes character data found directly within this element.
  ///
  /// Implementations which can be used as flattened fields may consume
  /// characters by returning `Ok(None)`, or "pass" them to later
  /// flattened fields by returning `Ok(Some(s))`. If this `YaDeserialize`
  /// is not being used as a flattened field, this distinction is irrelevant.
  fn characters(&mut self, cur: &ElementReader, s: String) -> Result<Option<String>, Error> {
    Ok(())
  }

  /// Processes the end of this element.
  ///
  /// Note that ownership of `name` and `namespace` is passed here. Thus an
  /// implementation which stores these may choose to do so within `end`
  /// to avoid cloning.
  ///
  /// This is the appropriate place for validating that an element is
  /// complete.
  fn end(&mut self, name: ::xml::OwnedName, namespace: ::xml::Namespace) -> Result<(), Error> {
    Ok(())
  }
}

/// Returns error on start if already `Some`.
impl YaDeserialize for Option<T> where T: YaDeserialize + Default { ... }

impl YaDeserialize for Vec<T> where T: YaDeserialize + Default { ... }

impl<'a> YaDeserialize for &'a mut YaDeserialize { ... }

// This is a useful representation for `any` and unknown fields.
#[cfg(feature = "xmltree")]
impl YaDeserialize for ::xmltree::Element { ... }

// TODO: Error types details.

Example generated code

Given the input below:

#[derive(YaDeserialize, PartialEq, Debug, Default)]
#[yaserde(root = "base")]
pub struct XmlStruct {
  #[yaserde(attribute)]
  item: String,
  sub: SubStruct,
}

#[derive(YaDeserialize, PartialEq, Debug)]
#[yaserde(root = "sub")]
pub struct SubStruct {
  #[yaserde(attribute)]
  subitem: String,

  #[yaserde(flatten)]
  extra: xmltree::Element,
}

yaserde might generate the following YaDeserialize implementations:

impl YaDeserialize for XmlStruct {
  fn attribute(&mut self, cur: &ElementReader, attr: OwnedAttribute)
               -> Result<Option<OwnedAttribute>, Error> {
    match (&self.name.namespace, &self.name.local_name) {
      (None, "item") => self.item = value,
      _ => return Ok(Some(attr)),
    };
    Ok(None)
  }

  fn element(&mut self, cur: &ElementReader, child: ElementReader)
             -> Result<Option<ElementReader>, Error> {
    let name = child.name();
    match (&name.namespace, &name.local_name) {
      (None, "sub") => reader.read_to(&mut sub)?,
      _ => return Ok(Some(child)),
    };
    Ok(None)
  }
}

impl YaDeserialize for SubStruct {
  fn start(&mut self, cur: &ElementReader) -> Result<(), Error> {
    self.extra.start(cur)
  }

  fn attribute(&mut self, cur: &ElementReader, attr: OwnedAttribute)
               -> Result<Option<OwnedAttribute>, Error> {
    match (&attr.name.local_name, &attr.name.namespace) {
      ("subitem", None) => { self.subitem = attr.value; Ok(None) },
      _ => self.extra.attribute(name, value),
    }
  }

  fn element(&mut self, cur: &ElementReader, child: ElementReader)
             -> Result<Option<ElementReader>, Error> {
    let name = child.name();
    match (&name.local_name, &name.namespace) {
      ("sub", None) => reader.read_to(&mut sub)?,
      _ => return self.extra.element(reader),
    };
    Ok(None)
  }

  fn characters(&mut self, cur: &ElementReader, s: String)
                -> Result<Option<String>, Error> {
    self.extra.characters(s)
  }

  fn end(&mut self, name: ::xml::OwnedName, namespace: ::xml::Namespace) -> Result<(), Error> {
    self.extra.end(name, namespace)
  }
}

Design points

Compile time and code bloat

The onvif-rs/schema crate is a heavy user of yaserde derive macros (containing 1,968 #[derive] lines mentioning YaDeserialize or YaSerialize). By using cargo -Ztimings, we can see the schema crate's contribution to compilation times.

Compilation times, measured with cargo +nightly-2020-06-23 -Ztimings build at revision 4a654a4:

machine --debug --release
2016 Macbook Pro 120s 246s
4 GiB Raspberry Pi 4, no zram failure failure
4 GiB Raspberry Pi 4, zram 500s 780s

We can also examine its contribution to the final binary's size. cargo +nightly-2020-06-23 bloat --release --example camera --crates attributes 940KiB to this crate.

LTO makes surprisingly little difference given that many of the schema types should be unused by the example binary. With lto = "fat", the same cargo bloat command attributes 855KiB to the schema crate.

These numbers suggest it should be a priority to reduce the lines of generated code, improving both compile time and binary size.

TODO: once implemented, verify the new interface actually improves this significantly...currently this is a leap of faith...

Structure flattening

yaserde supports "flattening" structures and enums together via the #[yaserde(flatten)] field attribute. This support is similar to serde's struct flatten. The structopt crate has a similar flattening feature.

The xsd-parser-rs crate uses yaserde's flatten attribute to support placing an enum that represents a <xs:choice> within a struct that represents a <xs:sequence>. This allows implementing the struct and enum with #[derive(YaDeserialize)]. Without this feature, the struct would likely need a custom YaDeserialize implementation. Thus it's important to keep flatten support.

The current implementation is to serialize a new XML document into a Vec<u8>. This document contains all the "unused" elements in the current struct. The Vec<u8> is then parsed and used to deserialize each flattened field. This method contributes to code bloat, as well as being inefficient at runtime.

We propose instead that the monolithic YaDeserialize::deserialize be broken apart. With one call per attribute or element, the parent struct and all flattened fields can be deserialized simultaneously without buffering. Each could be given the option to consume an attribute or element or pass the buck to the next.

TODO: do we really need to support multiple flattened fields in a struct, or is one sufficient? The latter would allow us to simplify the interface by getting rid of the Option returns.

<xsd:any> support

Some XML schemas rely on <xsd:any> to function. For example, an ONVIF analytics configuration might contain the following snippet:

<tt:AnalyticsEngineConfiguration>
  <tt:AnalyticsModule Name="MyCellMotion" Type="tt:CellMotionEngine">
    <tt:Parameters>
      <tt:SimpleItem Name="Sensitivity" Value="50"/>
      <tt:ElementItem Name="Layout">
        <tt:CellLayout Columns="22" Rows="18">
          <tt:Transformation>
            <tt:Translate x="-1.000000" y="-1.000000"/>
            <tt:Scale x="0.006250" y="0.008340"/>
          </tt:Transformation>
        </tt:CellLayout>
      </tt:ElementItem>
    </tt:Parameters>
  </tt:AnalyticsModule>
</tt:AnalyticsEngineConfiguration>

The ElementItem is generic. It doesn't refer to the CellLayout type.

This is a poor schema design, and any generated code from it will be subpar. Validation could be much stronger, and automatically generated code much better, if onvif.xsd instead contained a specific element and type for a CellMotionEngine that mentioned the type of each element, rather than using a generic AnanlyticsModule and expressing the type information only in a table in a PDF document. (In this case, ONVIF Analytics Service Specification Table B-3.)

However, XML libraries aren't being written in 2020 to support new schemas with perfect designs. They're for interoperating with legacy schemas regardless of their flaws.

yaserde users are encouraged to hand-roll YaDeserialize implementations for these cases. In this example, callers would benefit from an enum AnalyticsModule written by a human who has examined the PDF specification.

Nonetheless, it's desirable to support automatic parsing of such schemas into usable, if untyped and unergonomic, representations. This would have two benefits:

  • Deserializing and serializing the document would not lose information. In this example, a caller might want to perform a GetAnalyticsConfiguration request, make some modification to the config, and write it back with a SetAnalyticsConfiguration without damaging other portions of the configuration.
  • Debug output would show the missing information, aiding development.

It's also common for schemas to support both an explicit list of elements and the xs:any type for extensibility (both forward compatibility and third-party extensions). In this case, it'd be most beneficial to support a flattened any representation.

We propose that yaserde provide (behind a feature flag) serialization into the xmltree::Element type. While this type has some limitations (eg eminence/xmltree-rs#13), it's a reasonable way to represent a generic XML document fragment and can be improved as necessary.

This is analogous to handling of unknown fields in other serialization formats, eg. protobuf.

xs:QName

Some schemas require interpreting a QName given in attribute values or character data. For example, an ONVIF metadata stream might begin as follows:

<tt:MetaDataStream xmlns:tns1="http://www.onvif.org/ver10/topics"
                   xmlns:tt="http://www.onvif.org/ver10/schema"
                   xmlns:wsnt="http://docs.oasis-open.org/wsn/b-2">
  <tt:Event>
    <wsnt:NotificationMessage>
      <wsnt:Topic Dialect="http://www.onvif.org/ver10/tev/topicExpression/ConcreteSet">
        tns1:RuleEngine/FieldDetector/ObjectsInside
      </wsnt:Topic>

In this message, tns1 refers to the namespace http://www.onvif.org/ver10/topics via the xmls:tns1 declaration on the <tt:MetaDataStream> start tag.

A YaDeserialize implementation which receives this character data needs to have the namespace declarations available to it to parse properly. yaserde should ensure this is supplied when parsing each element. yaserde doesn't need to supply the actual implementation of qname interpretation. In fact, an ONVIF metadata stream parser may need to parse non-compliant documents such as the following:

<tt:MetadataStream xmlns:tt="http://www.onvif.org/ver10/schema"
                   xmlns:wsnt="http://docs.oasis-open.org/wsn/b-2">
  <tt:Event>
    <wsnt:NotificationMessage>
      <wsnt:Topic Dialect="http://www.onvif.org/ver10/tev/topicExpression/ConcreteSet">
        tns1:Monitoring/OperatingTime/LastClockSynchronization
      </wsnt:Topic>

This document above is missing the critical tns1 namespace mapping. Parsing it successfully requires falling back to an assumed mapping. Having such a mapping for each type of document would clearly be out-of-scope for a general-purpose de/ser-ializer.

SOAP bodies

SOAP defines a body element which contains either a fault element or arbitrary, message-specific content (which can include attributes, elements, and character data). Typically a given operation's content is of a particular type specified in the service's WSDL.

The onvif-rs crate contains logic for deserializing a SOAP envelope and body for an arbitrary operation. It doesn't use yaserde. Instead, it uses xmltree to parse the document. If the body contains a fault element, it deserializes it and returns error. Otherwise, it serializes the doucment again for the caller to pass to yaserde.

A more efficient, ergonomic interface would avoid the redundant deserialization and serialization. Instead, it would directly deserialize a succesful body into a caller-chosen YaDeserialize implementation.

To avoid code bloat, the soap logic should not have to be monomorphized for each operation. Instead, YaDeserialize should be object-safe, and it should be possible to deserialize into a pre-exisitng SoapBody object.

The simple generated class below meets these requirements. A hand-rolled YaDeserialize could additionally enforce that a fault is mutually exclusive with attributes, other elements, and character data.

#[derive(YaDeserialize)]
struct SoapBody<'a> {
  fault: Option<SoapFault>,

  #[yaserde(flatten)]
  wrapping: &'a mut YaDeserialize,
}

We additionally observe that at least in the case of ONVIF, every body consists of only one child. A similar technique would allow implementing this as a single wrapper type without requiring monomorphization.

YaDeserialize implementation errors

As discussed in #76, there's no clear contract between child and parent elements today for the state of the Deserializer. Even if there were, there's no enforcement of this contract. It's possible to have bugs in which the child consumes elements that belong to the parent. These are difficult to debug.

By introducing the ElementReader interface, we can make such bugs entirely impossible.

Error type

By passing back an error type that can wrap a Box<::std::error::Error + Send + Sync + 'static>, we can allow the caller to see more richer information from YaDeserialize implementations. This might include a backtrace or an error type enum.

Similarly, by passing back an ::xml::reader::Error, we can allow callers to see the XML parser's position information.

We might also augment errors with the stack of XML elements as of when the error is generated.

Asynchronous incremental parsing

The Rust ecosystem is moving toward asynchronous handling of network streams. It'd be nice to match this by supporting parsing an XML stream in chunks without tying up a thread. Currently this doesn't seem to be possible with any of the popular XML parser libraries:

xml5ever's xml5ever::driver::XmlParser provides a push interface that could be useful for this purpose. However, this crate is advertised as "alpha quality".

Short of using this alpha-quality crate, asynchronous operation doesn't seem possible today. Nonetheless, let's look at what an ideal interface might look like.

A half-parsed state is a stack of YaDeserialize implementations, each likely owned by the previous one. This can't be represented as a Vec<&mut YaDeserialize> due to the self-referential ownership. It can be represented as a stack of recursive async fn invocations. Thus, the most logical interface for this use case is a straightforward adaptation of the synchronous recursive interface.

We propose implementing the synchronous interface today. The interface can be converted to asynchronous later when it's needed and when the xml parsing libraries are ready.

@mlevkov
Copy link

mlevkov commented Jul 16, 2020

@scottlamb
Aside from a well-documented proposal. Thank you for putting this together. Many of the issues mentioned here are very much so been a bit of struggle while working through implementation for my project. The error feedback is always a mystery. I frequently have to spend time narrowing down to the implementation, often a human (my) assumption mistake. Only to find out that it something rather silly, which could have been resolved, if I had some feedback for its origination.

As far as flatten is concerned. I've been rolling my own implementation of the struct and made some decisions to separate out the elements and attributes into a higher level struct, which uses multiple flatten notations for its underlying sub-structs. The benefits of such are outlined in the issue #103, specifically around mis-associations for the name of an attribute and element. Not sure of the ergonomics yet, but it works for now. It deserializes into a "clearly" distinct set of elements and attributes. Here is the snapshot of the deserialized "debug" example:

adaptation_set: [
                        AdaptationSet {
                            representation_base: RepresentationBase {
                                attribute: RepresentationBaseAttributes {
                                    profiles: None,
                                    width: None,
                                    height: None,
                                    sar: None,
                                    frame_rate: None,
                                    audio_sampling_rate: None,
                                    mime_type: None,
                                    segment_profiles: None,
                                    codecs: None,
                                    maximum_sap_period: None,
                                    start_with_sap: None,
                                    max_playout_rate: None,
                                    coding_dependency: None,
                                    scan_type: None,
                                },
                                element: RepresentationBaseElements {
                                    frame_packing: [],
                                    audio_channel_configuration: [],
                                    content_protection: [],
                                    essential_property: [],
                                    supplemental_property: [],
                                    inband_event_stream: [],
                                },
                            },

and its (partial) code:

#[derive(Default, Debug, Clone, PartialEq, YaSerialize, YaDeserialize)]
#[yaserde(
    prefix = "mpd",
    namespace = "mpd: urn:mpeg:dash:schema:mpd:2011",
    default_namespace = "mpd"
)]
pub struct AdaptationSet {
    #[yaserde(flatten)]
    pub representation_base: RepresentationBase,
    #[yaserde(flatten)]
    pub attribute: AdaptationSetAttributes,
    #[yaserde(flatten)]
    pub element: AdaptationSetElements,
}

One recent discovery, however, made my assumptions about the approach question its safety. I assumed, that if the underlying elements/attributes yield no response or existence, then the library would not deserialize those, and the need for Option would not be necessary. However, the assumption was incorrect, and such has yielded a stack overflow error in the case where the library tried to parse the non-existent element with a(flatten) notation, Issue #85 .

How would I be able to associate namespaces for same tag under this proposal? In the case where I have one namespace for the entire document, but then someone tries to use it with a slightly different version of the same document, which has a different namespace. For example, the document is designed against urn:mpeg:dash:schema:mpd:2011, but is trying use it with a document that has a namespace urn:mpeg:dash:schema:mpd:2014. Ideally, I would create fields/structs that are providing support for urn:mpeg:dash:schema:mpd:2014 schema. I would not object alternating or creating namespaces specific to their intended uses, as described in this Issue #45 .

Will it provide support for handling "Xlink" or that stays mainly unchanged and up to my "custom" processing?

I'm totally in support of your proposal, it would greatly enhance the library. How does it change "serialization", if at all?

@scottlamb
Copy link
Contributor Author

@MarcAntoine-Arnaud I held off until now because I hadn't heard from you. But this is holding back my project, so I'm working through it. My question is: are you interested in this sort of change? If so, I'm happy to work with you. If not, I'll release my own crate.

@ctrlcctrlv
Copy link

I have no idea what this project is about @scottlamb @MarcAntoine-Arnaud but I thought I'd make you aware of eminence/xmltree-rs#33 as you mentioned eminence/xmltree-rs#13 repeatedly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants