problems in writing proper CDX files with RealCDXExtractorOutput #37

Open
rjoberon opened this issue Dec 18, 2014 · 0 comments

Thanks to @gerhardgossen's pull request #36, the most important problems with the redir field are now fixed. Investigating one of our crawls in more depth, I found further redir values that break the CDX file format because they contain spaces (I anonymized the mail addresses):

mailto: [email protected]
mailto:john.doe @Informatik.Uni-Oldenburg.DE
mailto:john.doe@blicher Tbinger Anhang
mailto:[email protected]?subject=Antrag auf SAP Zugang
E:/SmartSource Data Collector/util/content/wt_dcs.gif
ttp://find.galegroup.com/bncn/infomark.do?serQuery=Locale%28en%2C%2C%29%3AFQE%3D%28JX%2CNone%2C16%29%22Dublin Gazette%22%24&queryType=PH&type=pubIssues&prodId=BBCN&version=1.0&source=library

So the main causes I found are

  1. spaces in e-mail addresses (in all parts),
  2. links to local files (without protocol), and
  3. broken protocol names.

In short, broken URIs can cause broken CDX files, which I think should not be the case.
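
To illustrate the kind of escaping that would keep such values from splitting a line, here is a minimal sketch in Python. It is not the project's code: the function name `cdx_safe_field` is hypothetical, and both the choice of percent-encoding whitespace and the use of `-` for an empty value are assumptions about what a fix could look like.

```python
import re


def cdx_safe_field(value):
    """Return a value that cannot split a space-delimited CDX line.

    Hypothetical helper: whitespace is percent-encoded, everything else
    is left untouched, so a broken URI stays recognizable.
    """
    if value is None or value.strip() == "":
        return "-"  # assumed placeholder for an empty field

    def encode(match):
        # Percent-encode each UTF-8 byte of the whitespace character.
        return "".join("%%%02X" % byte for byte in match.group(0).encode("utf-8"))

    return re.sub(r"\s", encode, value)


# Two of the problematic redir values quoted above.
print(cdx_safe_field("E:/SmartSource Data Collector/util/content/wt_dcs.gif"))
print(cdx_safe_field("mailto:john.doe @Informatik.Uni-Oldenburg.DE"))
```

With this approach the value is not validated or repaired as a URI; it is only made safe as a single column, which seems to be the minimum the CDX writer needs.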

Another issue I found was a CDX line without a MIME type column, which causes similar problems.
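
A missing column could be guarded against in the same place where the line is assembled. The following Python sketch is only an illustration under assumptions: the field names in `FIELDS`, the function `format_cdx_line`, and the `-` placeholder are hypothetical and do not reflect the actual column layout or code of RealCDXExtractorOutput.

```python
# Hypothetical column layout, for illustration only.
FIELDS = ("urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "redirect", "offset", "filename")


def format_cdx_line(record):
    """Build one space-delimited line with exactly len(FIELDS) columns."""
    columns = []
    for name in FIELDS:
        value = record.get(name)
        # Emit a placeholder instead of dropping the column entirely.
        text = "-" if value is None or value == "" else str(value)
        if any(ch.isspace() for ch in text):
            # Refuse values that would add extra columns.
            raise ValueError("whitespace in field %r: %r" % (name, text))
        columns.append(text)
    return " ".join(columns)


# A record without a MIME type still yields a full-width line.
record = {"urlkey": "example.org/", "timestamp": "20141218000000",
          "original": "http://example.org/", "statuscode": 200}
print(format_cdx_line(record))
```

The point of the sketch is that the writer decides the column count, not the data: an absent value becomes a placeholder, and a value containing whitespace is rejected (or escaped beforehand) rather than written as is.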

rjoberon referenced this issue Dec 18, 2014
Escape redirect URLs in RealCDXExtractorOutput
ghost mentioned this issue Jul 7, 2016