Commit c8fcfc7

Medium post importer (from medium export)
1 parent 2d75ca8 commit c8fcfc7

7 files changed

+348
-21
lines changed

docs/content.rst

Lines changed: 2 additions & 2 deletions
@@ -439,8 +439,8 @@ For **Markdown**, one must rely on an extension. For example, using the `mdx_inc
 Importing an existing site
 ==========================
 
-It is possible to import your site from WordPress, Tumblr, Dotclear, and RSS
-feeds using a simple script. See :ref:`import`.
+It is possible to import your site from several other blogging sites
+(like WordPress, Tumblr, ..) using a simple script. See :ref:`import`.
 
 Translations
 ============

docs/importer.rst

Lines changed: 11 additions & 2 deletions
@@ -11,6 +11,7 @@ software to reStructuredText or Markdown. The supported import formats are:
 
 - Blogger XML export
 - Dotclear export
+- Medium export
 - Tumblr API
 - WordPress XML export
 - RSS/Atom feed
@@ -65,6 +66,7 @@ Optional arguments
   -h, --help            Show this help message and exit
   --blogger             Blogger XML export (default: False)
   --dotclear            Dotclear export (default: False)
+  --medium              Medium export (default: False)
   --tumblr              Tumblr API (default: False)
   --wpfile              WordPress XML export (default: False)
   --feed                Feed to parse (default: False)
@@ -80,8 +82,7 @@ Optional arguments
                         (default: False)
   --filter-author       Import only post from the specified author
   --strip-raw           Strip raw HTML code that can't be converted to markup
-                        such as flash embeds or iframes (wordpress import
-                        only) (default: False)
+                        such as flash embeds or iframes (default: False)
   --wp-custpost         Put wordpress custom post types in directories. If
                         used with --dir-cat option directories will be created
                         as "/post_type/category/" (wordpress import only)
@@ -113,6 +114,14 @@ For Dotclear::
 
     $ pelican-import --dotclear -o ~/output ~/backup.txt
 
+For Medium::
+
+    $ pelican-import --medium -o ~/output ~/medium-export/posts/
+
+The Medium export is a zip file. Unzip it, and point this tool to the
+"posts" subdirectory. For more information on how to export, see
+https://help.medium.com/hc/en-us/articles/115004745787-Export-your-account-data.
+
 For Tumblr::
 
     $ pelican-import --tumblr -o ~/output --blogname=<blogname> <api_key>
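
The Medium instructions added to docs/importer.rst (unzip the export, then point the tool at the "posts" subdirectory) could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the helper name `extract_medium_posts` is hypothetical, and it assumes the archive contains a top-level "posts" directory, as the documented example path suggests.

```python
import zipfile
from pathlib import Path


def extract_medium_posts(export_zip, destination):
    """Unzip a Medium export and return the path of its "posts" directory.

    Hypothetical helper: assumes the archive has a top-level "posts" folder.
    """
    destination = Path(destination)
    with zipfile.ZipFile(export_zip) as archive:
        archive.extractall(destination)
    posts_dir = destination / "posts"
    if not posts_dir.is_dir():
        raise FileNotFoundError(f'no "posts" directory found in {export_zip}')
    return posts_dir
```

The returned directory is what would then be passed as the final argument to `pelican-import --medium -o <output>`.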
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
2+
<hr/><h3>Title header</h3><p>A paragraph of content.</p><p>Paragraph number two.</p><p>A list:</p><ol><li>One.</li><li>Two.</li><li>Three.</li></ol><p>A link: <a data-href="https://example.com/example" href="https://example.com/example" target="_blank">link text</a>.</p><h3>Header 2</h3><p>A block quote:</p><blockquote>quote words <strong>strong words</strong></blockquote><p>after blockquote</p><figure><img data-height="282" data-image-id="image1.png" data-width="739" src="https://cdn-images-1.medium.com/max/800/image1.png"/><figcaption>A figure caption.</figcaption></figure><p>A final note: <a data-href="http://stats.stackexchange.com/" href="http://stats.stackexchange.com/" rel="noopener" target="_blank">Cross-Validated</a> has sometimes been helpful.</p><hr/><p><em>Next: </em><a data-href="https://medium.com/@username/post-url" href="https://medium.com/@username/post-url" target="_blank"><em>Next post</em>
3+
</a></p>
4+
<p>By <a href="https://medium.com/@username">User Name</a> on <a href="https://medium.com/p/medium-short-url"><time datetime="2017-04-21T17:11:55.799Z">April 21, 2017</time></a>.</p><p><a href="https://medium.com/@username/this-post-url">Canonical link</a></p><p>Exported from <a href="https://medium.com">Medium</a> on December 1, 2023.</p>
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
1+
<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>A title</title><style>
2+
* {
3+
font-family: Georgia, Cambria, "Times New Roman", Times, serif;
4+
}
5+
html, body {
6+
margin: 0;
7+
padding: 0;
8+
}
9+
h1 {
10+
font-size: 50px;
11+
margin-bottom: 17px;
12+
color: #333;
13+
}
14+
h2 {
15+
font-size: 24px;
16+
line-height: 1.6;
17+
margin: 30px 0 0 0;
18+
margin-bottom: 18px;
19+
margin-top: 33px;
20+
color: #333;
21+
}
22+
h3 {
23+
font-size: 30px;
24+
margin: 10px 0 20px 0;
25+
color: #333;
26+
}
27+
header {
28+
width: 640px;
29+
margin: auto;
30+
}
31+
section {
32+
width: 640px;
33+
margin: auto;
34+
}
35+
section p {
36+
margin-bottom: 27px;
37+
font-size: 20px;
38+
line-height: 1.6;
39+
color: #333;
40+
}
41+
section img {
42+
max-width: 640px;
43+
}
44+
footer {
45+
padding: 0 20px;
46+
margin: 50px 0;
47+
text-align: center;
48+
font-size: 12px;
49+
}
50+
.aspectRatioPlaceholder {
51+
max-width: auto !important;
52+
max-height: auto !important;
53+
}
54+
.aspectRatioPlaceholder-fill {
55+
padding-bottom: 0 !important;
56+
}
57+
header,
58+
section[data-field=subtitle],
59+
section[data-field=description] {
60+
display: none;
61+
}
62+
</style></head><body><article class="h-entry">
63+
<header>
64+
<h1 class="p-name">A name (like title)</h1>
65+
</header>
66+
<section data-field="subtitle" class="p-summary">
67+
Summary (first several words of content)
68+
</section>
69+
<section data-field="body" class="e-content">
70+
<section name="ad15" class="section section--body section--first"><div class="section-divider"><hr class="section-divider"></div><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><h3 name="20a3" id="20a3" class="graf graf--h3 graf--leading graf--title">Title header</h3><p name="e3d6" id="e3d6" class="graf graf--p graf-after--h3">A paragraph of content.</p><p name="c7a8" id="c7a8" class="graf graf--p graf-after--p">Paragraph number two.</p><p name="42aa" id="42aa" class="graf graf--p graf-after--p">A list:</p><ol class="postList"><li name="d65f" id="d65f" class="graf graf--li graf-after--p">One.</li><li name="232b" id="232b" class="graf graf--li graf-after--li">Two.</li><li name="ef87" id="ef87" class="graf graf--li graf-after--li">Three.</li></ol><p name="e743" id="e743" class="graf graf--p graf-after--p">A link: <a href="https://example.com/example" data-href="https://example.com/example" class="markup--anchor markup--p-anchor" target="_blank">link text</a>.</p><h3 name="4cfd" id="4cfd" class="graf graf--h3 graf-after--p">Header 2</h3><p name="433c" id="433c" class="graf graf--p graf-after--p">A block quote:</p><blockquote name="3537" id="3537" class="graf graf--blockquote graf-after--p">quote words <strong class="markup--strong markup--blockquote-strong">strong words</strong></blockquote><p name="00cc" id="00cc" class="graf graf--p graf-after--blockquote">after blockquote</p><figure name="edb0" id="edb0" class="graf graf--figure graf-after--p"><img class="graf-image" data-image-id="image1.png" data-width="739" data-height="282" src="https://cdn-images-1.medium.com/max/800/image1.png"><figcaption class="imageCaption">A figure caption.</figcaption></figure><p name="f401" id="f401" class="graf graf--p graf-after--p graf--trailing">A final note: <a href="http://stats.stackexchange.com/" data-href="http://stats.stackexchange.com/" class="markup--anchor markup--p-anchor" rel="noopener" target="_blank">Cross-Validated</a> has sometimes 
been helpful.</p></div></div></section><section name="09a3" class="section section--body section--last"><div class="section-divider"><hr class="section-divider"></div><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><p name="81e8" id="81e8" class="graf graf--p graf--leading"><em class="markup--em markup--p-em">Next: </em><a href="https://medium.com/@username/post-url" data-href="https://medium.com/@username/post-url" class="markup--anchor markup--p-anchor" target="_blank"><em class="markup--em markup--p-em">Next post</em>
71+
</section>
72+
<footer><p>By <a href="https://medium.com/@username" class="p-author h-card">User Name</a> on <a href="https://medium.com/p/medium-short-url"><time class="dt-published" datetime="2017-04-21T17:11:55.799Z">April 21, 2017</time></a>.</p><p><a href="https://medium.com/@username/this-post-url" class="p-canonical">Canonical link</a></p><p>Exported from <a href="https://medium.com">Medium</a> on December 1, 2023.</p></footer></article></body></html>

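The fixture's footer stores the publication time as `datetime="2017-04-21T17:11:55.799Z"`, while the tests in this commit expect the date field `"2017-04-21 17:11"`. One way that conversion could be done is sketched below; the function name `format_published` is an assumption for illustration, and the importer's actual parsing code is not shown on this page.

```python
from datetime import datetime


def format_published(iso_timestamp):
    """Convert a Medium-style UTC timestamp to a 'YYYY-MM-DD HH:MM' string.

    Hypothetical helper: assumes the timestamp always carries fractional
    seconds and a trailing "Z", as in the fixture.
    """
    parsed = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%d %H:%M")


print(format_published("2017-04-21T17:11:55.799Z"))  # 2017-04-21 17:11
```
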
pelican/tests/test_generators.py

Lines changed: 29 additions & 8 deletions
@@ -264,6 +264,7 @@ def test_generate_feeds_override_url(self):
 
     def test_generate_context(self):
         articles_expected = [
+            ["A title", "published", "medium_posts", "article"],
             ["Article title", "published", "Default", "article"],
             [
                 "Article with markdown and summary metadata multi",
@@ -391,13 +392,24 @@ def test_generate_categories(self):
         # terms of process order will define the name for that category
         categories = [cat.name for cat, _ in self.generator.categories]
         categories_alternatives = (
-            sorted(["Default", "TestCategory", "Yeah", "test", "指導書"]),
-            sorted(["Default", "TestCategory", "yeah", "test", "指導書"]),
+            sorted(
+                ["Default", "TestCategory", "medium_posts", "Yeah", "test", "指導書"]
+            ),
+            sorted(
+                ["Default", "TestCategory", "medium_posts", "yeah", "test", "指導書"]
+            ),
         )
         self.assertIn(sorted(categories), categories_alternatives)
         # test for slug
         categories = [cat.slug for cat, _ in self.generator.categories]
-        categories_expected = ["default", "testcategory", "yeah", "test", "zhi-dao-shu"]
+        categories_expected = [
+            "default",
+            "testcategory",
+            "medium_posts",
+            "yeah",
+            "test",
+            "zhi-dao-shu",
+        ]
         self.assertEqual(sorted(categories), sorted(categories_expected))
 
     def test_do_not_use_folder_as_category(self):
@@ -549,7 +561,8 @@ def test_period_archives_context(self):
             granularity: {period["period"] for period in periods}
             for granularity, periods in period_archives.items()
         }
-        expected = {"year": {(1970,), (2010,), (2012,), (2014,)}}
+        self.maxDiff = None
+        expected = {"year": {(1970,), (2010,), (2012,), (2014,), (2017,)}}
         self.assertEqual(expected, abbreviated_archives)
 
         # Month archives enabled:
@@ -570,14 +583,15 @@ def test_period_archives_context(self):
             for granularity, periods in period_archives.items()
         }
         expected = {
-            "year": {(1970,), (2010,), (2012,), (2014,)},
+            "year": {(1970,), (2010,), (2012,), (2014,), (2017,)},
             "month": {
                 (1970, "January"),
                 (2010, "December"),
                 (2012, "December"),
                 (2012, "November"),
                 (2012, "October"),
                 (2014, "February"),
+                (2017, "April"),
             },
         }
         self.assertEqual(expected, abbreviated_archives)
@@ -602,14 +616,15 @@ def test_period_archives_context(self):
             for granularity, periods in period_archives.items()
         }
         expected = {
-            "year": {(1970,), (2010,), (2012,), (2014,)},
+            "year": {(1970,), (2010,), (2012,), (2014,), (2017,)},
             "month": {
                 (1970, "January"),
                 (2010, "December"),
                 (2012, "December"),
                 (2012, "November"),
                 (2012, "October"),
                 (2014, "February"),
+                (2017, "April"),
             },
             "day": {
                 (1970, "January", 1),
@@ -619,6 +634,7 @@ def test_period_archives_context(self):
                 (2012, "October", 30),
                 (2012, "October", 31),
                 (2014, "February", 9),
+                (2017, "April", 21),
             },
         }
         self.assertEqual(expected, abbreviated_archives)
@@ -836,8 +852,12 @@ def test_standard_metadata_in_default_metadata(self):
 
         categories = sorted([category.name for category, _ in generator.categories])
         categories_expected = [
-            sorted(["Default", "TestCategory", "yeah", "test", "指導書"]),
-            sorted(["Default", "TestCategory", "Yeah", "test", "指導書"]),
+            sorted(
+                ["Default", "TestCategory", "medium_posts", "yeah", "test", "指導書"]
+            ),
+            sorted(
+                ["Default", "TestCategory", "medium_posts", "Yeah", "test", "指導書"]
+            ),
         ]
         self.assertIn(categories, categories_expected)
 
@@ -864,6 +884,7 @@ def test_article_order_by(self):
         generator.generate_context()
 
         expected = [
+            "A title",
             "An Article With Code Block To Test Typogrify Ignore",
             "Article title",
             "Article with Nonconformant HTML meta tags",

pelican/tests/test_importer.py

Lines changed: 83 additions & 0 deletions
@@ -21,6 +21,10 @@
     get_attachments,
     tumblr2fields,
     wp2fields,
+    mediumpost2fields,
+    mediumposts2fields,
+    strip_medium_post_content,
+    medium_slug,
 )
 from pelican.utils import path_to_file_url, slugify
 
@@ -708,3 +712,82 @@ def get_posts(api_key, blogname, offset=0):
                 posts,
                 posts,
             )
+
+
+class TestMediumImporter(TestCaseWithCLocale):
+    def setUp(self):
+        super().setUp()
+        self.test_content_root = "pelican/tests/content"
+        # The content coming out of parsing is similar, but not the same.
+        # Beautiful soup rearranges the order of attributes, for example.
+        # So, we keep a copy of the content for the test.
+        content_filename = f"{self.test_content_root}/medium_post_content.txt"
+        with open(content_filename, encoding="utf-8") as the_content_file:
+            # Many editors and scripts add a final newline, so live with that
+            # in our test
+            the_content = the_content_file.read()
+            assert the_content[-1] == "\n"
+            the_content = the_content[:-1]
+        self.post_tuple = (
+            "A title",
+            the_content,
+            # slug:
+            "2017-04-21-medium-post",
+            "2017-04-21 17:11",
+            "User Name",
+            None,
+            (),
+            "published",
+            "article",
+            "html",
+        )
+
+    def test_mediumpost2field(self):
+        """Parse one post"""
+        post_filename = f"{self.test_content_root}/medium_posts/2017-04-21_-medium-post--d1bf01d62ba3.html"
+        val = mediumpost2fields(post_filename)
+        self.assertEqual(self.post_tuple, val, val)
+
+    def test_mediumposts2field(self):
+        """Parse all posts in an export directory"""
+        posts = [
+            fields
+            for fields in mediumposts2fields(f"{self.test_content_root}/medium_posts")
+        ]
+        self.assertEqual(1, len(posts))
+        self.assertEqual(self.post_tuple, posts[0])
+
+    def test_strip_content(self):
+        """Strip out unhelpful tags"""
+        html_doc = (
+            "<section>This keeps <i>lots</i> of <b>tags</b>, but not "
+            "the <section>section</section> tags</section>"
+        )
+        soup = BeautifulSoup(html_doc, "html.parser")
+        self.assertEqual(
+            "This keeps <i>lots</i> of <b>tags</b>, but not the section tags",
+            strip_medium_post_content(soup),
+        )
+
+    def test_medium_slug(self):
+        # Remove hex stuff at the end
+        self.assertEqual(
+            "2017-04-27_A-long-title",
+            medium_slug(
+                "medium-export/posts/2017-04-27_A-long-title--2971442227dd.html"
+            ),
+        )
+        # Remove "--DRAFT" at the end
+        self.assertEqual(
+            "2017-04-27_A-long-title",
+            medium_slug("medium-export/posts/2017-04-27_A-long-title--DRAFT.html"),
+        )
+        # Remove both (which happens)
+        self.assertEqual(
+            "draft_How-to-do", medium_slug("draft_How-to-do--DRAFT--87225c81dddd.html")
+        )
+        # If no hex stuff, leave it alone
+        self.assertEqual(
+            "2017-04-27_A-long-title",
+            medium_slug("medium-export/posts/2017-04-27_A-long-title.html"),
+        )
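
The four cases in test_medium_slug describe the expected behaviour: take the file name without directory or extension, then drop a trailing "--DRAFT" marker and/or a trailing "--<hex id>". The implementation of `medium_slug` is not visible on this page, so the following is only a sketch of one function that would satisfy those cases; the name `medium_slug_sketch` and the regular expression are assumptions.

```python
import re
from pathlib import PurePosixPath

# Optional "--DRAFT" followed by an optional "--<lowercase hex>" at the end.
_MEDIUM_SUFFIX = re.compile(r"(?:--DRAFT)?(?:--[0-9a-f]+)?$")


def medium_slug_sketch(path):
    """Derive a slug from a Medium export file name (illustrative only)."""
    stem = PurePosixPath(path).stem  # drop directories and ".html"
    return _MEDIUM_SUFFIX.sub("", stem)


print(medium_slug_sketch("draft_How-to-do--DRAFT--87225c81dddd.html"))
# draft_How-to-do
```

One caveat of this approach: a title that genuinely ends in "--" plus hex-looking characters would also be trimmed. Note too that the slug expected in post_tuple ("2017-04-21-medium-post") differs from the bare file-name stem, which suggests a further normalization step that this sketch does not attempt.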
