Document that XmlDocument.Save() creates UTF-8 files *with BOM* if there's an explicit "UTF-8" encoding attribute #2014

@mklement0

https://github.com/dotnet/corefx/issues/34118 demonstrates that while XmlDocument.Save(string) creates BOM-less UTF-8 files in the absence of an encoding attribute, signaling UTF-8 encoding explicitly via an encoding attribute in the XML declaration unexpectedly creates a UTF-8 file with BOM.

This is problematic for two reasons:

  • From a cross-platform perspective: A document with a UTF-8 (pseudo-)BOM (Unicode signature) can cause problems in cross-platform use, because many utilities on Unix-like platforms, as well as, e.g., Java's standard libraries, neither expect nor know how to handle such a BOM.

    • While the XML standard does mandate that a compliant parser must recognize a UTF-8 BOM, the reality is that XML files are often read as plain-text files.
  • From an internal-consistency perspective: UTF-8 files should be created without BOM, as has been the default since the inception of .NET; specifying UTF-8 explicitly should only produce a BOM if explicitly requested (although the standard does allow such BOMs).

    • As an aside: a related intra-.NET inconsistency is that System.Text.Encoding.UTF8 returns an encoding that does produce a BOM, but this unexpected behavior is at least documented.
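
The aside about System.Text.Encoding.UTF8 can be checked by looking at the preamble (BOM bytes) each encoding reports. The following is a minimal sketch, assuming a plain console program; the class and method names (BomPreambleDemo, PreambleLength) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

public static class BomPreambleDemo
{
    // Returns the number of preamble (BOM) bytes the given encoding reports.
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    public static void Main()
    {
        // Encoding.UTF8 reports the 3-byte UTF-8 BOM (EF BB BF) as its preamble.
        Console.WriteLine(PreambleLength(Encoding.UTF8)); // 3

        // A UTF8Encoding constructed with encoderShouldEmitUTF8Identifier = false reports none.
        Console.WriteLine(PreambleLength(new UTF8Encoding(false))); // 0
    }
}
```

Stream-based writers that honor the preamble therefore emit a BOM when handed Encoding.UTF8, but not when handed a UTF8Encoding created with `false`.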

@krwq feels that fixing this inconsistency is too much of a breaking change, so the behavior should be documented; to summarize:

When the XmlDocument.Save(string) overload is used:

  • In the absence of an encoding attribute, the .Save(string) method creates a UTF-8 file without BOM, in line with .NET's default and suitable for cross-platform use.

  • If a UTF-8-valued encoding attribute is present, the .Save(string) method creates a UTF-8-encoded file with BOM.

    • Note that it doesn't matter whether a given document was originally read from a file / a string with an explicit encoding="UTF-8" attribute (the letter case of the "UTF-8" value doesn't matter) in its XML declaration, or whether a UTF-8 encoding attribute was created programmatically via XmlDocument.CreateXmlDeclaration().

    • @krwq demonstrates a workaround based on explicit creation of an XmlWriter instance here.
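
The behavior summarized above can be reproduced by saving to a temporary file and inspecting the first three bytes. The sketch below is an illustration under assumptions, not a definitive test: the class and method names are hypothetical, and the XmlWriter-based variant is only one way such a workaround might look; it is not necessarily identical to the one @krwq posted.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlSaveBomDemo
{
    // True if the file starts with the UTF-8 BOM bytes EF BB BF.
    public static bool HasUtf8Bom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    // Saves via XmlDocument.Save(string) and reports whether a BOM was written.
    public static bool SaveWithPathOverload(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        string path = Path.GetTempFileName();
        try { doc.Save(path); return HasUtf8Bom(path); }
        finally { File.Delete(path); }
    }

    // Possible workaround: save through an XmlWriter whose encoding is BOM-less UTF-8.
    public static bool SaveWithExplicitWriter(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        string path = Path.GetTempFileName();
        try
        {
            var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false), Indent = true };
            using (XmlWriter writer = XmlWriter.Create(path, settings))
            {
                doc.Save(writer);
            }
            return HasUtf8Bom(path);
        }
        finally { File.Delete(path); }
    }

    public static void Main()
    {
        const string noEncoding = "<?xml version=\"1.0\"?><root/>";
        const string utf8Declared = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";

        // Per the behavior described in this issue: no BOM, then BOM, then no BOM.
        Console.WriteLine(SaveWithPathOverload(noEncoding));
        Console.WriteLine(SaveWithPathOverload(utf8Declared));
        Console.WriteLine(SaveWithExplicitWriter(utf8Declared));
    }
}
```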

Finally, it's also worth mentioning that using an encoding value that isn't recognized (as one of the default / registered .NET encodings) causes an exception on calling .Save() (but not on reading).
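
The last point can be observed with a sketch along the following lines. The encoding name "no-such-encoding" and the class and method names are made up for the illustration; the sketch deliberately reports whatever exception type is thrown rather than assuming a particular one.

```csharp
using System;
using System.IO;
using System.Xml;

public static class UnknownEncodingDemo
{
    // Loads the XML from a string, then tries Save(string).
    // Returns the name of the exception type thrown by Save(), or "no exception".
    public static string TrySave(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // per the issue, reading does not fail on the unrecognized encoding name
        string path = Path.GetTempFileName();
        try
        {
            doc.Save(path);
            return "no exception";
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        Console.WriteLine(TrySave("<?xml version=\"1.0\" encoding=\"no-such-encoding\"?><root/>"));
    }
}
```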

Labels: Pri3 (low priority), area-System.Xml, untriaged (new issue has not been triaged by the area owner)
