Merge pull request #36 from gerhardgossen/master
Escape redirect URLs in RealCDXExtractorOutput
Showing 3 changed files with 35 additions and 3 deletions.
28 changes: 28 additions & 0 deletions
src/test/java/org/archive/extract/RealCDXExtractorOutputTest.java
@@ -0,0 +1,28 @@
    package org.archive.extract;

    import java.net.MalformedURLException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.URL;
    import java.net.URLEncoder;

    import junit.framework.TestCase;


    public class RealCDXExtractorOutputTest extends TestCase {

        public void testEscapeResolvedUrl() throws Exception {
            String context ="http://www.uni-giessen.de/cms/studium/dateien/informationberatung/merkblattpdf";
            String spec = "http://fss.plone.uni-giessen.de/fß/studium/dateien/informationberatung/merkblattpdf/file/Mérkblatt zur Gestaltung von Nachteilsausgleichen.pdf?föo=bar#änchor";
            String escaped = RealCDXExtractorOutput.resolve(context, spec);
            assertTrue(escaped.indexOf(" ") < 0);
            URI parsed = new URI(escaped);
            assertEquals("änchor", parsed.getFragment());
        }

        public void testNoDoubleEscaping() throws Exception {
            String spec = "https://www.google.com/search?q=java+escape+url+spaces&ie=utf-8&oe=utf-8";
            String resolved = RealCDXExtractorOutput.resolve(spec, spec);
            assertTrue(spec.equals(resolved));
        }
    }
598c524
Thanks a lot for this fix! I discovered another case where broken content in the redir field breaks the CDX files:
mailto:John.Doe @Informatik.Uni-Oldenburg.DE
(notice the space before the `@`). This does not seem to be handled properly by the `resolve` method. I am not sure whether this is a problem with the `URI` class, or how to fix it in a nice way.
598c524
Crikey, you can get HTTP redirects to mailto: URIs? That works?!
Anyway, the `URI` class can be used to escape mailto: URIs if you use the seven-argument constructor:
And it seems that these fields are all available on the `URL` class, so that should work, I think.
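The snippet that originally accompanied this comment is not preserved here. As a rough sketch of the idea, and not the code that was posted, the multi-argument `java.net.URI` constructors quote illegal characters such as spaces in each component. The class and method names below (`UriEscapeSketch`, `escapeHierarchical`, `escapeOpaque`) are hypothetical and not part of `RealCDXExtractorOutput`. One assumption worth flagging: as far as I can tell from the Javadoc, the seven-argument constructor requires a path that is empty or starts with `/` when a scheme is given, so for an opaque `mailto:` URI this sketch swaps in the three-argument `URI(scheme, schemeSpecificPart, fragment)` constructor instead.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper for illustration; not part of RealCDXExtractorOutput.
final class UriEscapeSketch {

    // Hierarchical URLs (http, https, ...): split the string with URL, then
    // let the seven-argument URI constructor quote illegal characters in
    // each component separately.
    static String escapeHierarchical(String raw)
            throws MalformedURLException, URISyntaxException {
        URL u = new URL(raw);
        URI escaped = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                u.getPort(), u.getPath(), u.getQuery(), u.getRef());
        return escaped.toString();
    }

    // Opaque URIs (mailto:, ...): everything after the first colon is treated
    // as the scheme-specific part and quoted by the three-argument constructor.
    static String escapeOpaque(String raw) throws URISyntaxException {
        int colon = raw.indexOf(':');
        if (colon <= 0) {
            throw new URISyntaxException(raw, "Missing scheme");
        }
        URI escaped = new URI(raw.substring(0, colon),
                raw.substring(colon + 1), null);
        return escaped.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(escapeHierarchical("http://example.com/a b.pdf?q=1#top"));
        System.out.println(escapeOpaque("mailto:John.Doe @Informatik.Uni-Oldenburg.DE"));
    }
}
```

Note that these constructors always quote a literal `%`, so input that is already percent-encoded would be escaped a second time. Any real fix would need to guard against that, which is what `testNoDoubleEscaping` in the diff above checks for.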
598c524
See issue #37, which I filed regarding these problems. There are some more strange cases. They are not a big issue, since only a small fraction of URLs is affected (in my 3TB crawl only 19 lines are affected). Nevertheless, I thought it would be good to document these cases here.