-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allows include/import schemaLocations to have relative up path steps. #1037
Conversation
87c9817
to
d3b76fc
Compare
If schema authors put in relative path up-steps. Then this enables reuse of DFDL schemas as XSD schemas in tools other than Daffodil itself without having to edit all the include/import schemaLocation attributes which are inter-component references. DAFFODIL-2828
d3b76fc
to
6abed68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change doesn't seem unreasonable, but I have some thoughts and some error edge cases, curious what others think. How far do we want to go with enforcing correct includes?
// try it as a direct classpath resolution first, and if that fails, | ||
// try removing all the upward path steps (if any). | ||
val optURI = Misc.getResourceRelativeOption(sysId, baseURI).orElse( | ||
Misc.getResourceRelativeOption(removeInitialUpwardPathSteps(sysId), baseURI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if removing all up-steps might be too aggressive and could potentially hide incorrect includes?
One example, say the importing schema was at jar:file:/path/to/foo.jar!/xsd/main.dfdl.xsd
and it includes a schemaLocation of ../../../../../xsd/include.dfdl.xsd
, with the intention of including /xsd/include.dfdl.xsd
from a different jar on the classpath. That's too many up-steps, but if /xsd/include.dfdl.xsd
is on the classpath then it will still work with this change when using Java. But that include will not work when these schemas are extracted out of jars because the up-steps go too far up. This could potentially be confusing if this include worked in one case but not another, and it might be hard to track down the issue of being a bad include because of that?
Another example, say we havejar:file:/path/to/foo.jar!/xsd/main/main.dfdl.xsd
and it imports ../common/formats.dfdl.xsd
, so intending to read /xsd/common/formats.dfdl.xsd
out of the same jar because we never up-stepped out of that jar. If that fails to resolve, with this change we'll then look for /common/formats.dfdl.xsd
on the classpath, which could potentially exist and include the wrong thing, when it should have been an error.
So I guess what I'm wondering is if we need to be more careful about the default to looking at the root if something fails? If he base is a Jar URI and there are up-steps, then maybe it should be an error if we have too many up-steps? And if there are up-steps but the up-steps don't go out of the jar, then it should be an error and we shouldn't attempt to look anywhere else?
I imagine this could become an enhancement to getResourceRelativeOption
? For Jar URIs, if up-steps go too far it returns None, if up-steps go exactly to the jar it looks on the classpath (maybe looking at the current jar first?), otherwise it uses the normal relative resource behavior?
In fact, the first thing getResourceRelativeOption
does is look for the resource on the classpath and completely ignores the context unless tha fails. That feels wrong to me.
Or maybe that extra complication isn't worth the effort for some probably rare edge cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the corner case you suggest does and could happen.
But not if schemas use package-like names for directories, or very unique file names.
It certainly is possible to get the number of "../" wrong, but it's also possible to cut/paste the wrong root-path component name and then forget to edit it.
One nice thing is for the relative paths within a schema, the IDE highlights them if they're wrong. Not inter-schema however.
I just tested all this with a few schemas. I used Janap128 and OTH-Gold (which are Owl's schemas). I changed all the schemaLocation paths that were relative to use relative up-steps, and the classpath/inter-component ones to have up-steps to the root.
All worked first try, once I put this same resolver change into the Owl xsat2 library.
I suggest this: we shouldn't let the perfect be the enemy of the good here. This change will really help out people trying to use DFDL schemas to validate XML data using, well, not-daffodil. Ex: C-based tooling.
If debugging include/import schemaLocations becomes problematic, then we can make changes to getResourceRelativeOption along the lines of what you are suggesting. We're writing our own class path search in that case but that would be ok because we could get much better diagnostics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about it, the more it feels wrong to just throw away parts of an include path. If we're five directories deep and fail to resolve ../foo.dfdl.xsd
, we probably shouldn't look for /foo.dfdl.xsd
on the classpath, it's very unlikely that's what the user intended and could easily find the wrong thing.
How about this alternative approach, which doesn't handle too many up-steps but I think is maybe less complex and maintains directory namespaceing: We generate a relative URL the same way we do now. If that fails to resolve and if it's a jar URI then we extract the new path from it and look that up on the classpath.
For example, if the base URI is jar:file://path/to/foo.jar!/com/example/xsd/main.dfdl.xsd
and that includes ../common.dfdl.xsd
then we first look for jar:file://path/to/foo.jar!/com/example/common.dfdl.xsd
. If that fails to resolve, then we look up /com/example/common.dfdl.xsd
on the classpath. This way we don't lose our namespace information.
I think the code would be fairly straightforward:
def getResourceRelativeOption(...) : Option[URI] = {
{
...
val completeURL = new URL(contextURL, resName)
...
val res = ...
// new stuff here
if (res.isEmpty && completeURL.getProtocol == "jar") {
val newPath = completeURL.openConnection().asInstanceOf[JarURLConnection].getEntryName
Misc.getResourceOption(newPath)
} else {
res
}
}
This doesn't detect too-many up-steps, but it doesn't throwaway relative path information so with namespaced directories makes it unlikely to find the wrong thing. And if relative steps go all the way up to the root of the jar (or beyond) then it essentially behaves the same as this PR and looks on the classpath with all relative steps gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this won't work because a classpath can contain more than just jars. It can contain an arbitrary sequence of directories. So the URI isn't necessarily a jar URI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this. (1) The schemaLocation gets combined with the current file URI to form a combined URI. If that then resolves, great.
If it doesn't work, then by cases:
(2) If the current file URI is a jar URI, then detect the "too many up steps" error. If the up-steps take you to the root of the jar, then the schemaLocation without the up-steps is resolved against the class path.
(3) If the up-steps do not take you to the root of the jar, then fail - should have worked at step (1)
(4) If the current file URI is NOT a jar, then see if the directories path steps immediately above all the ../ moves are named "src/*/resources" where * is main or test. If so, then heuristically we are at the root, so try the schemaLocation without the up-steps against the class path. If not fail - should have worked at step (1).
If the up-steps take you to within or above the src/main/resources or src/test/resources (or any other heuristic that names the root dir of a resource tree) then you have too many up steps error.
I think other similar heuristics would look to see if the up-steps take you to a resource_managed directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally to 4, we may also want to consider the flattened directory layout, where there is no {main,test}/resources and everything is in just in src/ or test/.
Though, I wonder there is a potential issue with number 4, where things might work fine when running in the project repo because it sees src/*/resources parent dirs and switches to the classpath lookup, but then fails if copied somewhere else (e.g. some product) where those src/*/resources directories don't exist and so it can't know to use the classpath?
I'm not sure what could be done about this though...
@@ -18,7 +18,8 @@ | |||
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" | |||
xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"> | |||
|
|||
<xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd"/> | |||
<!-- schemaLocation can have upward relative path step, or can be from a classpath root. --> | |||
<xs:include schemaLocation="../org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This raises an interesting case. We use this DFDLGeneralFormat include in lots of schemas and the path is wrong for all of them if jars are extracted. Other resolves would expect the org
directory to be in the same directory as the including schema, which id never the case.
If we go with my above suggestion, then at least everything will be broken for both Daffodil and things not using Daffodils resolver, so we're at least consistent. But that would mean every one of these DFDLGeneralFormat includes needs to be changed to have the right amount of up-steps, which is a pretty-invasive change. I guess we could special case imports starting with org/apache/dafodil
and create a warning to suggest fixing this includes to use up-steps, with eventual deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Well nothing will break, as it will try to use the path as is, then try it without the upsteps.
I think we might eventually want the deprecation thing on paths without the up-steps, but I'm not sure. An absolute package-qualified name is awfully simple to use. Getting the relative paths right is only needed for this cybersecurity use case where people are separating the parse/unparse from the XSD validator.
Having the up-steps on the schemaLocations makes refactoring the schema, e..g, changing locations of things harder. I would claim if a schema is only needed for parse/unparse, then one would prefer NOT to have those up-steps on the paths.
DAFFODIL-2828
DAFFODIL-2828
I guess I'm coming around to the notion that we need the up-step paths to be verified to be correct by daffodil. I.e., we can't have people just putting random strings of "../../../../../../../" in front of what are currently root-relative paths. So I think we do need to know whether the upsteps take you to the root of the jar (ok. search classpath), stay within the jar (use relative only), or there are too many up-steps (error). Fixing that will push this issue to the next Daffodil release (3.5.1) at least. |
<xs:include schemaLocation="test space/test 2/multi_A_05_nons.dfdl.xsd"/> | ||
|
||
<!-- schemaLocations can either be relative paths, or can be absolute paths --> | ||
<xs:include schemaLocation="../../test space/test 2/multi_A_05_nons.dfdl.xsd"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment isn't about this line, but allows for threaded discussion
Just to throw it out and spur discussion, another completely different option is to punt on the issue entirely with minor tweaks to resolution:
-
Relative imports (i.e. schemaLocations that do not start with a
/
) are looked up relative to the base and we never look at the classpath. If it fails to resolve relatively, the lookup fails. This works for files and intra-jar lookups and non-Daffodil resolvers that don't know about classpaths/jars since they already support relative lookups.Note that this breaks things like
schemaLocation="org/apache/daffodil/xsd/DFDLGenralFormat.dfdl.xsd"
, but maybe we make exceptions for schema locations that start withorg/apached/daffodil
and always look those up on the classpath, and warn that it should be changed to be absolute. -
Absolute paths (i.e. schemaLocations that start with a
/
) are looked up absolutely on the classpath and if not found then looked up absolutely on the file system--no relative lookups. If a schema expects an include to be in another jar or directory on the classpath, an absolute path must be used. This works for jars and directories on the classpath, but not for non-Daffodil resolvers that don't use a classpath. In cases where non-Daffodil resolvers are, pre-processing must be done to convert those absolute paths to paths that work in the system.And preprocessing could maybe be done with a one-liner bash command, e.g.
find /path/to/schemas/root -type f -print0 | \ xargs -0 sed -i 's@schemaLocation="/@schemaLocation="/path/to/schemas/root/@'
The modified schemas should work for both Daffodil and non-Daffodil resolvers. And a more complex script could alternatively calculate and prepend the right number of up-steps to these paths if the root changed if any resolvers don't support absolute paths, or if absolute paths could change.
This approach is very similar to what we have now, but makes the absolute path explicit. There's no guessing if a path should be on the classpath or not, it's either relative and looked up relatively, or it's not and it's looked up on the classpath/absolutely. By being explicit about requiring the /
, it is also more clear which schemaLocations must be modified if classpath resolution isn't supported--it's any schemaLocation that starts with a /
. Additionally, if directory levels are added or removed in the including schema, the absolute paths still work--no need to add or remove up-steps to get to the root if dirs are moved around.
I imagine some systems that use non-Daffodil resolvers could even (or might already?) do something like put their applications in a chroot or mount namespace, in which case the absolute paths might actually work unmodified? That's probably much more work than just pre-processing the paths, but is a potential alternative, and maybe even a good idea for security relevant applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. I like the simpler rules making relative paths always look at the file system (except for an org/apache/daffodil heuristic if necessary) and absolute paths always look at the classpath.
Closing this PR. Subsumed by other work. |
If schema authors put in relative path up-steps.
Then this enables reuse of DFDL schemas as XSD schemas in tools other than Daffodil itself without having to edit all the include/import schemaLocation attributes which are inter-component references.
DAFFODIL-2828