Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Tumblr import #548

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 8 additions & 6 deletions lib/jekyll-import/importers/tumblr.rb
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,7 @@ def write_post(post, use_markdown, add_highlights)
if use_markdown
content = html_to_markdown content
if add_highlights
tumblr_url = URI.parse(post[:slug]).path
tumblr_url = URI.parse(post[:url_with_slug]).path
redirect_dir = tumblr_url.sub(%r!\/!, "") + "/"
FileUtils.mkdir_p redirect_dir
content = add_syntax_highlights(content, redirect_dir)
Expand Down Expand Up @@ -162,16 +162,18 @@ def post_to_hash(post, format)
end
{
:name => "#{date}-#{slug}.#{format}",
:date => date,
:header => {
"layout" => "post",
"title" => title,
"date" => Time.parse(post["date"]).xmlschema,
"tags" => (post["tags"] || []),
"tumblr_url" => post["url-with-slug"],
},
:content => content,
:url => post["url"],
:slug => post["url-with-slug"],
:content => content,
:url => post["url"],
:slug => slug,
:url_with_slug => post["url-with-slug"],
}
end

Expand Down Expand Up @@ -209,11 +211,11 @@ def rewrite_urls_and_redirects(posts)
# Create an initial empty file for the post so that we can instantiate a post object.
relative_path = "_posts/tumblr/#{post[:name]}"
File.write(relative_path, "")
tumblr_url = URI.parse(URI.encode(post[:slug])).path
tumblr_url = URI.parse(URI::Parser.new.escape(post[:url_with_slug])).path
jekyll_url = if Jekyll.const_defined? :Post
Jekyll::Post.new(site, site.source, "", "tumblr/#{post[:name]}").url
else
Jekyll::Document.new(site.in_source_dir(relative_path), :site => site, :collection => site.posts).url
"/" + Date.parse(post[:date]).to_s.tr("-", "/") + "/" + post[:slug] + ".html"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these two changes output the files in different places than the current code does. Perhaps we should check this with a test to ensure a proper Jekyll site is output. I believe the goal was to write all the posts into the _posts directory so that they can be paginated and have all the goodies that come with documents in collections.

Copy link
Author

@dmdeller dmdeller Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When run with the flag --rewrite_urls true, the Tumblr importer outputs files in two different places:

  1. _posts for the full text blog articles
  2. post for the redirects. These are simple files that do not contain any text of the blog posts, but instead only contain an HTML meta tag which causes the browser to redirect to the new, canonical blog post URL. The purpose of this is to avoid breaking URLs in the style that Tumblr used to use, prior to the blog being migrated to Jekyll. Tumblr-style URLs look like https://blog.domain.example/post/1234567890/blog-post-title-slug. So, these need to be redirected and this is how it is done.

When run without the flag --rewrite_urls true, only the former file location is used, and redirects are not generated.

The changes in my PR only apply to the latter file location. The former location still continues to generate as normal, and is not affected by my changes (that is handled by different code in a different method). It is true that the latter code does not output to the standard _posts location, but as you can see, there is a reason for this.

My PR does not change this overall behavior, it is the same as before; it was already being written to both _posts and post. However, the existing code generates incorrect HTML, as I wrote earlier, and my PR fixes that. My PR does not change the directory structure or file locations.

My apologies for not writing a unit test for this. I ran into some difficulty with the development dependencies, because I don’t have MySQL, PostgreSQL, etc. installed (they are not necessary for the Tumblr importer) and I didn’t have time to look into that. As a result I was unable to get the unit tests running locally.

Let me know what you think about my above explanation, and if you would prefer, I can try to look more into getting some test coverage for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my read of the code, tumblr_url is the redirect page, and jekyll_url (the line commented on) is the "new" post URL from _posts. In the diff I see, both are modified, one to url_with_slug (redirect page), and one stops using Jekyll::Document. The thing I'm considering here is the "hard-coded" new jekyll_url you have here. I think it's possible to import into a Jekyll site path that has a _config.yaml file which could alter the document url, right? Perhaps it's an edge case, but it's possible.

In any event, if it's best to follow this path, then we shouldn't check for Jekyll::Post and instead use the same jekyll_url for both the old versions of Jekyll (with Post) and the new versions of Jekyll (with Document).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the test dependencies. One thing I'd try is to comment out the dev dependencies you don't need, and to run bundle exec ruby test/test_tumblr_importer.rb directly to run just the one file. That way, mysql & friends won't be require'd. These tests aren't particularly well-kept, so apologies for that! If you're open to it, you might also try a GitHub Codespace so you don't need to modify your own dev environment. We don't have a devcontainer with all the dependencies setup at this time, but hopefully a sudo apt-get install mysql-client etc could do the trick. I think each gem's repo has installation instructions for Linux (with dependencies) if you go this route. Ruby with C extensions is definitely not the developer's paradise!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my read of the code, tumblr_url is the redirect page, and jekyll_url (the line commented on) is the "new" post URL from _posts.

I think that's right. In this case, jekyll_url is being used to generate the content of the HTML meta tag which redirects the user from the old URL to the new URL.

In the diff I see, both are modified, one to url_with_slug (redirect page), and one stops using Jekyll::Document.

tumblr_url contains the same value as it did before; the change is a result of the fact that I renamed :slug to :url_with_slug, which also contains the same value as it did before. :slug was a confusing name for it because it actually already contained a full URL, hence why I renamed it. Furthermore, I also needed a hash key with just the slug (not the entire URL), and the only logical name for that was :slug, which was already in use... (see lines 175-176 on the right side of the diff to see what I mean)

The thing I'm considering here is the "hard-coded" new jekyll_url you have here. I think it's possible to import into a Jekyll site path that has a _config.yaml file which could alter the document url, right? Perhaps it's an edge case, but it's possible.

Very good point. I was worried there might be something like that. Do you have any suggestion for how to do it properly? I'm afraid I'm not familiar enough with this API.

In any event, if it's best to follow this path, then we shouldn't check for Jekyll::Post and instead use the same jekyll_url for both the old versions of Jekyll (with Post) and the new versions of Jekyll (with Document).

I was hesitant to change that part of the branch because I didn't have an old version of Jekyll to test with. But I can do this if that's what you recommend.

end
redirect_dir = tumblr_url.sub(%r!\/!, "") + "/"
FileUtils.mkdir_p redirect_dir
Expand Down
Loading