Fix Tumblr import #548

dmdeller · 2024-08-22T22:16:02Z

I was trying to run the Tumblr importer under Ruby 3 and the latest Jekyll, and I ran into these two problems, when using the option --rewrite_urls true:

There was a call to URI.encode() which was removed in Ruby 3. This causes an uncaught exception. This PR fixes this by replacing it with URI::Parser.new.escape().
The redirect URLs generated were incorrect, which caused a 404 when a user tries to follow one of the old Tumblr-style links.

(Neither problem occurs when not using that option. I needed that option, though.)

It was generating URLs like this: /2024/08/22/2019-04-03-how-i-handle-errors-in-ios-apps.html (Today's date is incorrectly used as the directory path, and then the actual original date of the post is incorrectly inserted into the slug)

It should have been generating URLs like this: /2019/04/03/how-i-handle-errors-in-ios-apps.html (Correctly matches where the imported post ends up after jekyll build)

The second issue was somewhat trickier to fix. I'm not familiar with the Jekyll::Document API. I couldn't figure out why it returns the wrong result. Instead, I resorted to just assembling the string manually (which required passing a little more data to rewrite_urls_and_redirects() than was being done previously).

I'm open to feedback for how to improve this further. Thanks for taking a look!

parkr · 2024-08-30T07:55:20Z

lib/jekyll-import/importers/tumblr.rb

            jekyll_url = if Jekyll.const_defined? :Post
                           Jekyll::Post.new(site, site.source, "", "tumblr/#{post[:name]}").url
                         else
-                           Jekyll::Document.new(site.in_source_dir(relative_path), :site => site, :collection => site.posts).url
+                           "/" + Date.parse(post[:date]).to_s.tr("-", "/") + "/" + post[:slug] + ".html"


I believe these two changes output the files in different places than the current code does. Perhaps we should check this with a test to ensure a proper Jekyll site is output. I believe the goal was to write all the posts into the _posts directory so that they can be paginated and have all the goodies that come with documents in collections.

When run with the flag --rewrite_urls true, the Tumblr importer outputs files in two different places:

_posts for the full text blog articles

post for the redirects. These are simple files that do not contain any text of the blog posts, but instead only contain an HTML meta tag which causes the browser to redirect to the new, canonical blog post URL. The purpose of this is to avoid breaking URLs in the style that Tumblr used to use, prior to the blog being migrated to Jekyll. Tumblr-style URLs look like https://blog.domain.example/post/1234567890/blog-post-title-slug. So, these need to be redirected and this is how it is done.

When run without the flag --rewrite_urls true, only the former file location is used, and redirects are not generated.

The changes in my PR only apply to the latter file location. The former location still continues to generate as normal, and is not affected by my changes (that is handled by different code in a different method). It is true that the latter code does not output to the standard _posts location, but as you can see, there is a reason for this.

My PR does not change this overall behavior, it is the same as before; it was already being written to both _posts and post. However, the existing code generates incorrect HTML, as I wrote earlier, and my PR fixes that. My PR does not change the directory structure or file locations.

My apologies for not writing a unit test for this. I ran into some difficulty with the development dependencies, because I don’t have MySQL, PostgreSQL, etc. installed (they are not necessary for the Tumblr importer) and I didn’t have time to look into that. As a result I was unable to get the unit tests running locally.

Let me know what you think about my above explanation, and if you would prefer, I can try to look more into getting some test coverage for this.

From my read of the code, tumblr_url is the redirect page, and jekyll_url (the line commented on) is the "new" post URL from _posts. In the diff I see, both are modified, one to url_with_slug (redirect page), and one stops using Jekyll::Document. The thing I'm considering here is the "hard-coded" new jekyll_url you have here. I think it's possible to import into a Jekyll site path that has a _config.yaml file which could alter the document url, right? Perhaps it's an edge case, but it's possible.

In any event, if it's best to follow this path, then we shouldn't check for Jekyll::Post and instead use the same jekyll_url for both the old versions of Jekyll (with Post) and the new versions of Jekyll (with Document).

Sorry about the test dependencies. One thing I'd try is to comment out the dev dependencies you don't need, and to run bundle exec ruby test/test_tumblr_importer.rb directly to run just the one file. That way, mysql & friends won't be require'd. These tests aren't particularly well-kept, so apologies for that! If you're open to it, you might also try a GitHub Codespace so you don't need to modify your own dev environment. We don't have a devcontainer with all the dependencies setup at this time, but hopefully a sudo apt-get install mysql-client etc could do the trick. I think each gem's repo has installation instructions for Linux (with dependencies) if you go this route. Ruby with C extensions is definitely not the developer's paradise!

From my read of the code, tumblr_url is the redirect page, and jekyll_url (the line commented on) is the "new" post URL from _posts.

I think that's right. In this case, jekyll_url is being used to generate the content of the HTML meta tag which redirects the user from the old URL to the new URL.

In the diff I see, both are modified, one to url_with_slug (redirect page), and one stops using Jekyll::Document.

tumblr_url contains the same value as it did before; the change is a result of the fact that I renamed :slug to :url_with_slug, which also contains the same value as it did before. :slug was a confusing name for it because it actually already contained a full URL, hence why I renamed it. Furthermore, I also needed a hash key with just the slug (not the entire URL), and the only logical name for that was :slug, which was already in use... (see lines 175-176 on the right side of the diff to see what I mean)

The thing I'm considering here is the "hard-coded" new jekyll_url you have here. I think it's possible to import into a Jekyll site path that has a _config.yaml file which could alter the document url, right? Perhaps it's an edge case, but it's possible.

Very good point. I was worried there might be something like that. Do you have any suggestion for how to do it properly? I'm afraid I'm not familiar enough with this API.

In any event, if it's best to follow this path, then we shouldn't check for Jekyll::Post and instead use the same jekyll_url for both the old versions of Jekyll (with Post) and the new versions of Jekyll (with Document).

I was hesitant to change that part of the branch because I didn't have an old version of Jekyll to test with. But I can do this if that's what you recommend.

dmdeller added 2 commits August 22, 2024 14:26

Replace API call that was removed in Ruby 3

b217606

Fix incorrect redirect URL generation

a909f3a

parkr reviewed Aug 30, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Tumblr import #548

Fix Tumblr import #548

dmdeller commented Aug 22, 2024

parkr Aug 30, 2024

dmdeller Sep 4, 2024 •

edited

Loading

parkr Sep 5, 2024

parkr Sep 5, 2024

dmdeller Sep 5, 2024

Fix Tumblr import #548

Are you sure you want to change the base?

Fix Tumblr import #548

Conversation

dmdeller commented Aug 22, 2024

parkr Aug 30, 2024

Choose a reason for hiding this comment

dmdeller Sep 4, 2024 • edited Loading

Choose a reason for hiding this comment

parkr Sep 5, 2024

Choose a reason for hiding this comment

parkr Sep 5, 2024

Choose a reason for hiding this comment

dmdeller Sep 5, 2024

Choose a reason for hiding this comment

dmdeller Sep 4, 2024 •

edited

Loading