Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allows include/import schemaLocations to have relative up path steps. #1037

Closed
wants to merge 3 commits into from

Conversation

mbeckerle
Copy link
Contributor

If schema authors put in relative path up-steps.
Then this enables reuse of DFDL schemas as XSD schemas in tools other than Daffodil itself without having to edit all the include/import schemaLocation attributes which are inter-component references.

DAFFODIL-2828

If schema authors put in relative path up-steps.
Then this enables reuse of DFDL schemas as XSD schemas
in tools other than Daffodil itself without having to
edit all the include/import schemaLocation attributes which
are inter-component references.

DAFFODIL-2828
Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change doesn't seem unreasonable, but I have some thoughts and some error edge cases, curious what others think. How far do we want to go with enforcing correct includes?

// try it as a direct classpath resolution first, and if that fails,
// try removing all the upward path steps (if any).
val optURI = Misc.getResourceRelativeOption(sysId, baseURI).orElse(
Misc.getResourceRelativeOption(removeInitialUpwardPathSteps(sysId), baseURI)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if removing all up-steps might be too aggressive and could potentially hide incorrect includes?

One example, say the importing schema was at jar:file:/path/to/foo.jar!/xsd/main.dfdl.xsd and it includes a schemaLocation of ../../../../../xsd/include.dfdl.xsd, with the intention of including /xsd/include.dfdl.xsd from a different jar on the classpath. That's too many up-steps, but if /xsd/include.dfdl.xsd is on the classpath then it will still work with this change when using Java. But that include will not work when these schemas are extracted out of jars because the up-steps go too far up. This could potentially be confusing if this include worked in one case but not another, and it might be hard to track down the issue of being a bad include because of that?

Another example, say we havejar:file:/path/to/foo.jar!/xsd/main/main.dfdl.xsd and it imports ../common/formats.dfdl.xsd, so intending to read /xsd/common/formats.dfdl.xsd out of the same jar because we never up-stepped out of that jar. If that fails to resolve, with this change we'll then look for /common/formats.dfdl.xsd on the classpath, which could potentially exist and include the wrong thing, when it should have been an error.

So I guess what I'm wondering is if we need to be more careful about the default to looking at the root if something fails? If he base is a Jar URI and there are up-steps, then maybe it should be an error if we have too many up-steps? And if there are up-steps but the up-steps don't go out of the jar, then it should be an error and we shouldn't attempt to look anywhere else?

I imagine this could become an enhancement to getResourceRelativeOption? For Jar URIs, if up-steps go too far it returns None, if up-steps go exactly to the jar it looks on the classpath (maybe looking at the current jar first?), otherwise it uses the normal relative resource behavior?

In fact, the first thing getResourceRelativeOption does is look for the resource on the classpath and completely ignores the context unless tha fails. That feels wrong to me.

Or maybe that extra complication isn't worth the effort for some probably rare edge cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the corner case you suggest does and could happen.

But not if schemas use package-like names for directories, or very unique file names.

It certainly is possible to get the number of "../" wrong, but it's also possible to cut/paste the wrong root-path component name and then forget to edit it.

One nice thing is for the relative paths within a schema, the IDE highlights them if they're wrong. Not inter-schema however.

I just tested all this with a few schemas. I used Janap128 and OTH-Gold (which are Owl's schemas). I changed all the schemaLocation paths that were relative to use relative up-steps, and the classpath/inter-component ones to have up-steps to the root.

All worked first try, once I put this same resolver change into the Owl xsat2 library.

I suggest this: we shouldn't let the perfect be the enemy of the good here. This change will really help out people trying to use DFDL schemas to validate XML data using, well, not-daffodil. Ex: C-based tooling.

If debugging include/import schemaLocations becomes problematic, then we can make changes to getResourceRelativeOption along the lines of what you are suggesting. We're writing our own class path search in that case but that would be ok because we could get much better diagnostics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it, the more it feels wrong to just throw away parts of an include path. If we're five directories deep and fail to resolve ../foo.dfdl.xsd, we probably shouldn't look for /foo.dfdl.xsd on the classpath, it's very unlikely that's what the user intended and could easily find the wrong thing.

How about this alternative approach, which doesn't handle too many up-steps but I think is maybe less complex and maintains directory namespaceing: We generate a relative URL the same way we do now. If that fails to resolve and if it's a jar URI then we extract the new path from it and look that up on the classpath.

For example, if the base URI is jar:file://path/to/foo.jar!/com/example/xsd/main.dfdl.xsd and that includes ../common.dfdl.xsd then we first look for jar:file://path/to/foo.jar!/com/example/common.dfdl.xsd. If that fails to resolve, then we look up /com/example/common.dfdl.xsd on the classpath. This way we don't lose our namespace information.

I think the code would be fairly straightforward:

def getResourceRelativeOption(...) : Option[URI] = {
{
  ...
  val completeURL =  new URL(contextURL, resName)
  ...
  val res = ...
  // new stuff here
  if (res.isEmpty && completeURL.getProtocol == "jar") {  
    val newPath = completeURL.openConnection().asInstanceOf[JarURLConnection].getEntryName
    Misc.getResourceOption(newPath)
  } else {
    res
  }
}

This doesn't detect too-many up-steps, but it doesn't throwaway relative path information so with namespaced directories makes it unlikely to find the wrong thing. And if relative steps go all the way up to the root of the jar (or beyond) then it essentially behaves the same as this PR and looks on the classpath with all relative steps gone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this won't work because a classpath can contain more than just jars. It can contain an arbitrary sequence of directories. So the URI isn't necessarily a jar URI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this. (1) The schemaLocation gets combined with the current file URI to form a combined URI. If that then resolves, great.
If it doesn't work, then by cases:

(2) If the current file URI is a jar URI, then detect the "too many up steps" error. If the up-steps take you to the root of the jar, then the schemaLocation without the up-steps is resolved against the class path.
(3) If the up-steps do not take you to the root of the jar, then fail - should have worked at step (1)

(4) If the current file URI is NOT a jar, then see if the directories path steps immediately above all the ../ moves are named "src/*/resources" where * is main or test. If so, then heuristically we are at the root, so try the schemaLocation without the up-steps against the class path. If not fail - should have worked at step (1).
If the up-steps take you to within or above the src/main/resources or src/test/resources (or any other heuristic that names the root dir of a resource tree) then you have too many up steps error.

I think other similar heuristics would look to see if the up-steps take you to a resource_managed directory.

Copy link
Member

@stevedlawrence stevedlawrence Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally to 4, we may also want to consider the flattened directory layout, where there is no {main,test}/resources and everything is in just in src/ or test/.

Though, I wonder there is a potential issue with number 4, where things might work fine when running in the project repo because it sees src/*/resources parent dirs and switches to the classpath lookup, but then fails if copied somewhere else (e.g. some product) where those src/*/resources directories don't exist and so it can't know to use the classpath?

I'm not sure what could be done about this though...

@@ -18,7 +18,8 @@
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

<xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd"/>
<!-- schemaLocation can have upward relative path step, or can be from a classpath root. -->
<xs:include schemaLocation="../org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises an interesting case. We use this DFDLGeneralFormat include in lots of schemas and the path is wrong for all of them if jars are extracted. Other resolves would expect the org directory to be in the same directory as the including schema, which id never the case.

If we go with my above suggestion, then at least everything will be broken for both Daffodil and things not using Daffodils resolver, so we're at least consistent. But that would mean every one of these DFDLGeneralFormat includes needs to be changed to have the right amount of up-steps, which is a pretty-invasive change. I guess we could special case imports starting with org/apache/dafodil and create a warning to suggest fixing this includes to use up-steps, with eventual deprecation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Well nothing will break, as it will try to use the path as is, then try it without the upsteps.

I think we might eventually want the deprecation thing on paths without the up-steps, but I'm not sure. An absolute package-qualified name is awfully simple to use. Getting the relative paths right is only needed for this cybersecurity use case where people are separating the parse/unparse from the XSD validator.

Having the up-steps on the schemaLocations makes refactoring the schema, e..g, changing locations of things harder. I would claim if a schema is only needed for parse/unparse, then one would prefer NOT to have those up-steps on the paths.

@mbeckerle
Copy link
Contributor Author

I guess I'm coming around to the notion that we need the up-step paths to be verified to be correct by daffodil. I.e., we can't have people just putting random strings of "../../../../../../../" in front of what are currently root-relative paths.

So I think we do need to know whether the upsteps take you to the root of the jar (ok. search classpath), stay within the jar (use relative only), or there are too many up-steps (error).

Fixing that will push this issue to the next Daffodil release (3.5.1) at least.

<xs:include schemaLocation="test space/test 2/multi_A_05_nons.dfdl.xsd"/>

<!-- schemaLocations can either be relative paths, or can be absolute paths -->
<xs:include schemaLocation="../../test space/test 2/multi_A_05_nons.dfdl.xsd"/>
Copy link
Member

@stevedlawrence stevedlawrence Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't about this line, but allows for threaded discussion

Just to throw it out and spur discussion, another completely different option is to punt on the issue entirely with minor tweaks to resolution:

  1. Relative imports (i.e. schemaLocations that do not start with a /) are looked up relative to the base and we never look at the classpath. If it fails to resolve relatively, the lookup fails. This works for files and intra-jar lookups and non-Daffodil resolvers that don't know about classpaths/jars since they already support relative lookups.

    Note that this breaks things like schemaLocation="org/apache/daffodil/xsd/DFDLGenralFormat.dfdl.xsd", but maybe we make exceptions for schema locations that start with org/apached/daffodil and always look those up on the classpath, and warn that it should be changed to be absolute.

  2. Absolute paths (i.e. schemaLocations that start with a /) are looked up absolutely on the classpath and if not found then looked up absolutely on the file system--no relative lookups. If a schema expects an include to be in another jar or directory on the classpath, an absolute path must be used. This works for jars and directories on the classpath, but not for non-Daffodil resolvers that don't use a classpath. In cases where non-Daffodil resolvers are, pre-processing must be done to convert those absolute paths to paths that work in the system.

    And preprocessing could maybe be done with a one-liner bash command, e.g.

    find /path/to/schemas/root -type f -print0 | \
      xargs -0 sed -i 's@schemaLocation="/@schemaLocation="/path/to/schemas/root/@'

    The modified schemas should work for both Daffodil and non-Daffodil resolvers. And a more complex script could alternatively calculate and prepend the right number of up-steps to these paths if the root changed if any resolvers don't support absolute paths, or if absolute paths could change.

This approach is very similar to what we have now, but makes the absolute path explicit. There's no guessing if a path should be on the classpath or not, it's either relative and looked up relatively, or it's not and it's looked up on the classpath/absolutely. By being explicit about requiring the /, it is also more clear which schemaLocations must be modified if classpath resolution isn't supported--it's any schemaLocation that starts with a /. Additionally, if directory levels are added or removed in the including schema, the absolute paths still work--no need to add or remove up-steps to get to the root if dirs are moved around.

I imagine some systems that use non-Daffodil resolvers could even (or might already?) do something like put their applications in a chroot or mount namespace, in which case the absolute paths might actually work unmodified? That's probably much more work than just pre-processing the paths, but is a potential alternative, and maybe even a good idea for security relevant applications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. I like the simpler rules making relative paths always look at the file system (except for an org/apache/daffodil heuristic if necessary) and absolute paths always look at the classpath.

@mbeckerle
Copy link
Contributor Author

Closing this PR. Subsumed by other work.

@mbeckerle mbeckerle closed this Aug 14, 2023
@mbeckerle mbeckerle deleted the daf-2828-schemaLocation branch August 16, 2023 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants