Merge pull request #36 from gerhardgossen/master
Escape redirect URLs in RealCDXExtractorOutput
anjackson committed Dec 17, 2014
2 parents eada0e1 + 1ee18d8 commit 598c524
Showing 3 changed files with 35 additions and 3 deletions.
1 change: 1 addition & 0 deletions CHANGES.md
@@ -1,5 +1,6 @@
1.1.5
-----
* [Escape redirect URLs in RealCDXExtractorOutput](https://github.com/iipc/webarchive-commons/pull/36)
* [Tests fail on Windows](https://github.com/iipc/webarchive-commons/issues/2)
* [Test fails on Java 8](https://github.com/iipc/webarchive-commons/issues/31)

9 changes: 6 additions & 3 deletions src/main/java/org/archive/extract/RealCDXExtractorOutput.java
@@ -4,6 +4,7 @@
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;
@@ -307,12 +308,14 @@ private String extractHTMLMetaRefresh(String origUrl, MetaData m) {
return "-";
}

-private String resolve(String context, String spec) {
+static String resolve(String context, String spec) {
// TODO: test!
try {
URL cUrl = new URL(context);
-URL resolved = new URL(cUrl,spec);
-return resolved.toURI().toASCIIString();
+URL url = new URL(cUrl, spec);
+// this constructor escapes its arguments, if necessary
+URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), url.getRef());
+return uri.toASCIIString();

} catch (URISyntaxException e) {
} catch (MalformedURLException e) {
28 changes: 28 additions & 0 deletions src/test/java/org/archive/extract/RealCDXExtractorOutputTest.java
@@ -0,0 +1,28 @@
package org.archive.extract;

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

import junit.framework.TestCase;


public class RealCDXExtractorOutputTest extends TestCase {

public void testEscapeResolvedUrl() throws Exception {
String context ="http://www.uni-giessen.de/cms/studium/dateien/informationberatung/merkblattpdf";
String spec = "http://fss.plone.uni-giessen.de/fß/studium/dateien/informationberatung/merkblattpdf/file/Mérkblatt zur Gestaltung von Nachteilsausgleichen.pdf?föo=bar#änchor";
String escaped = RealCDXExtractorOutput.resolve(context, spec);
assertTrue(escaped.indexOf(" ") < 0);
URI parsed = new URI(escaped);
assertEquals("änchor", parsed.getFragment());
}

public void testNoDoubleEscaping() throws Exception {
String spec = "https://www.google.com/search?q=java+escape+url+spaces&ie=utf-8&oe=utf-8";
String resolved = RealCDXExtractorOutput.resolve(spec, spec);
assertTrue(spec.equals(resolved));
}
}
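
The behavioural difference behind this change can be reproduced outside the project. The following is a minimal sketch, not the project's code: the class name `EscapeSketch`, both method names, and the example.org URLs are made up for illustration. It contrasts the old approach (`URL.toURI()`, which parses the URL string strictly) with the new one (the five-argument `URI` constructor, which quotes illegal characters).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical stand-alone class; not part of webarchive-commons.
final class EscapeSketch {

    // Old approach: URL.toURI() parses the URL string strictly and
    // throws URISyntaxException on illegal characters such as spaces.
    @SuppressWarnings("deprecation") // URL constructors are deprecated on recent JDKs
    static String resolveStrict(String context, String spec) throws Exception {
        return new URL(new URL(context), spec).toURI().toASCIIString();
    }

    // New approach: rebuild the URI from its components so that the
    // multi-argument constructor quotes illegal characters.
    @SuppressWarnings("deprecation")
    static String resolveEscaping(String context, String spec) throws Exception {
        URL url = new URL(new URL(context), spec);
        URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(),
                url.getQuery(), url.getRef());
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        String context = "http://example.org/dir/";
        String spec = "my file.pdf?a=b c#top";
        try {
            resolveStrict(context, spec);
        } catch (URISyntaxException e) {
            System.out.println("strict parsing rejected the URL: " + e.getReason());
        }
        System.out.println(resolveEscaping(context, spec));
    }
}
```

One caveat when reusing this pattern: according to the `java.net.URI` documentation, the multi-argument constructors always quote the percent character, so input that is already percent-encoded would be encoded a second time.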

3 comments on commit 598c524

@rjoberon

Thanks a lot for this fix! I discovered another case where broken content in the redir field breaks the CDX files: mailto:John.Doe @Informatik.Uni-Oldenburg.DE (notice the space before the @). This does not seem to be handled properly by the resolve method. I am not sure whether this is a problem with the URI class, or how to fix it cleanly.
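
To separate what the `URI` class itself does from what `resolve` does, here is a minimal sketch that looks only at the `URI` class. It is an illustration under assumptions, not a proposed patch: `MailtoEscapeSketch` and `escapeOpaque` are hypothetical names, and the sketch bypasses `java.net.URL` entirely, so it says nothing about how `resolve` treats this input. It uses the three-argument constructor `URI(String scheme, String ssp, String fragment)`, which quotes illegal characters in the scheme-specific part, and does not split off a fragment.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-alone class; not part of webarchive-commons.
final class MailtoEscapeSketch {

    // Splits "scheme:rest" at the first colon and lets the three-argument
    // URI constructor quote illegal characters in the scheme-specific part.
    static String escapeOpaque(String raw) throws URISyntaxException {
        int colon = raw.indexOf(':');
        if (colon < 0) {
            throw new URISyntaxException(raw, "no scheme");
        }
        String scheme = raw.substring(0, colon);
        String ssp = raw.substring(colon + 1);
        return new URI(scheme, ssp, null).toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        String raw = "mailto:John.Doe @Informatik.Uni-Oldenburg.DE";
        try {
            new URI(raw); // strict single-argument parsing rejects the space
        } catch (URISyntaxException e) {
            System.out.println("strict parsing failed: " + e.getReason());
        }
        System.out.println(escapeOpaque(raw));
    }
}
```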

@anjackson
Member Author

Crikey, you can get HTTP redirects to mailto: URIs? That works?!

Anyway, the URI class can be used to escape mailto: URIs if you use the seven-argument constructor:

URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment)

And it seems that these fields are all available on the URL class, so that should work, I think.
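
A minimal sketch of that suggestion, assuming the fields are read from `java.net.URL` as described. The class name `SevenArgResolveSketch` and the example.org URL are made up, and the sketch is only exercised here on hierarchical http URLs; whether it handles the mailto: case is not verified. One difference from the committed code: the five-argument `URI` constructor takes an authority string, and the commit passes only `url.getHost()` there, so a port or user info would presumably be dropped, whereas the seven-argument form carries them over.

```java
import java.net.URI;
import java.net.URL;

// Hypothetical stand-alone class; not part of webarchive-commons.
final class SevenArgResolveSketch {

    // Resolves spec against context, then rebuilds the URI from all seven
    // components so that illegal characters in each part are quoted.
    @SuppressWarnings("deprecation") // URL constructors are deprecated on recent JDKs
    static String resolve(String context, String spec) throws Exception {
        URL url = new URL(new URL(context), spec);
        URI uri = new URI(
                url.getProtocol(),
                url.getUserInfo(), // null when absent
                url.getHost(),
                url.getPort(),     // -1 when absent, which URI treats as undefined
                url.getPath(),
                url.getQuery(),
                url.getRef());
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        String spec = "http://user@example.org:8080/a b?x=y z#frag";
        System.out.println(resolve(spec, spec));
    }
}
```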

@rjoberon

See issue #37, which I filed regarding these problems. There are some more strange cases. They are not a big issue, since only a minor fraction of URLs is affected (in my 3TB crawl only 19 lines are affected); nevertheless, I thought it would be good to document these cases here.
