Missing some zip links #739

Closed
cdeadspine opened this issue Jan 8, 2021 · 3 comments

cdeadspine commented Jan 8, 2021

Very nice plugin. I am surprised at how well it works on a very old website, with one strange exception: 6 .zip file downloads from a single page are not crawled, cached, generated, or post-processed.

(The .zip links are not present in any of these results.)
Many other downloads, such as .pdf, .mov, and picture files, are all crawled and appear in the generated site perfectly.

Crawl Queue (Detected URLs) | 4547 URLs in database
Crawl Cache | 4547 URLs in database
Generated Static Site | 4540 files, using 319.84 MB
Post-processed Static Site | 4540 files, using 320.66 MB

The links to the zips are on a page that requires login. The login supplied is the WordPress admin (entered under WP2Static -> Options), and it works for other PDF links on the same page.

The generated HTML that links to the files seems very normal; I don't know why they wouldn't be crawled properly.

<p style="text-align: center;"><a ref="magnificPopup" href="/wp-content/uploads/2014/07/File.jpg"><img loading="lazy" class="aligncenter size-medium wp-image-286" src="/wp-content/uploads/2014/07/File-300x165.jpg" alt="File" width="300" height="165" srcset="/wp-content/uploads/2014/07/File-300x165.jpg 300w, /wp-content/uploads/2014/07/File-500x276.jpg 500w, /wp-content/uploads/2014/07/File-200x110.jpg 200w, /wp-content/uploads/2014/07/File.jpg 507w" sizes="(max-width: 300px) 100vw, 300px" /></a>
<a href="/wp-content/uploads/2017/11/Patient-Awareness-Kit.zip">Patient Awareness Kit3</a> [ZIP (Multiple Documents Compressed)] <span class="thumbnail_caption clearfix" style="color: #000000;"><span style="color: #000000;"> </span></span></p>

I wonder if it is something about compression or the mention of "multiple documents compressed"? Regardless, I have to manually download
/wp-content/uploads/2017/11/Patient-Awareness-Kit.zip
from the running WordPress and place it into my static copy, and then everything works fine.

There are exactly 6 .zip files not crawled, all linked from the same statically generated index.html.

Plugins (I don't see how this could matter, because there is just a blatant a-href link to a seemingly static file that is not being crawled):


Add From Server 3.4.5
Akismet Anti-Spam 4.1.8
All in One SEO 4.0.8
All-in-One WP Migration 7.32
AMP 2.0.9
Anti-Malware Security and Brute-Force Firewall 4.19.69
Bitnami Production Console 1.2
GoDaddy Pro Sites Worker 4.9.7
Google Analytics for WordPress 7.14.0
Google Language Translator 6.0.8
Gravity Forms 2.4.22
Hello Dolly 1.7.2
Insert Headers and Footers 1.5.0
Jetpack by WordPress.com 9.2.1
PHP Compatibility Checker 1.5.0
Simple Tags 2.62
Sucuri Security 1.8.24
Ultimate Addons for Visual Composer 3.15.2
Ultimate TinyMCE 5.7
W3 Total Cache 2.0.1
WP Mail SMTP 2.5.1
WP Maintenance Mode 2.3.0
WP Serverless Forms 1.3.1
WP-Members 3.3.8
WP2Static 7.1.6
WPBakery Visual Composer 4.12
Yoast SEO 15.5

I cannot find, in the documentation or through internet searches, where the WP2Static log files are. There are a bunch of results about "Advanced tab -> Debug", but I don't see that tab anywhere on my WordPress dashboard with version 7.1.6.

I am quite sure that these 6 zip downloads are in fact the only problem. I have done my own crawling of the entire static generation result and I can see all of the internal and external broken links, all of the /wp-json /feed /comments /xmlrpc /wp.me that can be ignored.
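For anyone wanting to repeat that kind of check, here is a rough sketch of one way to do it in plain PHP. It is not part of WP2Static; the function name `find_missing_local_links` and the idea of passing in the static site's root folder are assumptions made for illustration. It only looks at root-relative `href` values (like the `.zip` link above) and reports the ones with no matching file in the generated output.

```php
<?php
/**
 * Hypothetical helper (not a WP2Static API): list root-relative href
 * targets in $html that have no matching file under $static_root.
 */
function find_missing_local_links( $html, $static_root ) {
    $missing = array();

    // Capture href values that start with a single "/" (skipping
    // protocol-relative "//host/..." URLs), stopping at a quote, "#" or "?".
    if ( preg_match_all( '~href=["\'](/(?!/)[^"\'#?]*)~i', $html, $matches ) ) {
        foreach ( array_unique( $matches[1] ) as $path ) {
            // Decode %20 and similar so the path matches the name on disk.
            $file = rtrim( $static_root, '/' ) . rawurldecode( $path );

            if ( ! file_exists( $file ) ) {
                $missing[] = $path;
            }
        }
    }

    return $missing;
}
```

Calling it with the contents of the generated index.html and the path of the generated site folder should return the six .zip paths if they really are the only files left out.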

My best guess at what to look at is that the /wp-content/uploads/2017/11/Patient-Awareness-Kit.zip URL is not itself protected by basic authentication, but the page that links to it does require basic authentication. (However, other PDF links on the same page were crawled successfully.)
So if it was a WP2Static basic authentication problem, perhaps it found the PDF files by some means other than crawling the basic-auth-protected index.html? Is there some special file-finding functionality besides brute-force crawling all pages from the home page through links?

@leonstafford
Contributor

leonstafford commented Jan 10, 2021

Hi @cdeadspine,

Thanks for trying it out, and glad it worked for you, besides those .zip files. Let me quickly check the code; it may be that we work on an "inclusion list" for which extensions/content types to include.

Ah, yep, here it is.

There would have been a reason to exclude compressed archives at some point. WP2Static aims to be overridable/more extensible than the earlier versions, so right below that, you can see:

$file_extensions_to_ignore =
            apply_filters(
                'wp2static_file_extensions_to_ignore',
                $file_extensions_to_ignore
            );

Which exposes a filter you can use to modify that list. You can put something in your theme's functions.php, like:

add_filter( 'wp2static_file_extensions_to_ignore', 'cdeadspine_allow_zips', 10, 1 );

function cdeadspine_allow_zips( $extensions ) {
    
    $extension_exclusions_without_zip = somefunctiontotransform( $extensions );
    
    return $extension_exclusions_without_zip;
}

I just pseudo-coded that, so you'd need to add some logic to remove that .zip from the array, let me know if you get stuck on that and can use a bit more brain :D

@leonstafford
Contributor

  • I fixed up order in that add_filter example.

Basically, it's a function that has an input (the exclusions) and an output (the modified exclusions) and you add some magic in between to modify the array that WP2Static exposes.
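To make the "magic in between" concrete, here is a minimal sketch of what the callback could look like. It assumes the ignore list is a flat array of extension strings; whether the entries are stored as `.zip` or `zip` is not shown in this thread, so the comparison strips a leading dot and ignores letter case to cover both. The `function_exists` guard is only there so the snippet can also be run outside WordPress.

```php
<?php
/**
 * Sketch of the filter callback: return the ignore list without zip.
 * Assumes $extensions is a flat array of strings such as ".zip" or "zip".
 */
function cdeadspine_allow_zips( $extensions ) {
    // Keep every entry except "zip" / ".zip", in any letter case.
    $kept = array_filter(
        (array) $extensions,
        function ( $extension ) {
            return strtolower( ltrim( (string) $extension, '.' ) ) !== 'zip';
        }
    );

    // Reindex so the result is a plain list again.
    return array_values( $kept );
}

// Register the callback only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'wp2static_file_extensions_to_ignore', 'cdeadspine_allow_zips', 10, 1 );
}
```

If this goes into an existing functions.php, leave out the opening `<?php` line, since the file already has one. After adding it, the site would need to be crawled again for the .zip files to be picked up.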

@john-shaffer
Contributor

The options for files to ignore are now in the advanced crawling add-on, and I think we're fine there since advanced options are out-of-scope for the main plugin. Opened elementor/wp2static-addon-advanced-crawling#5 for some further refinement, but closing this since the fix is merged.
