How to handle empty lines in md #8

Open
maelle opened this issue Sep 4, 2018 · 3 comments

maelle commented Sep 4, 2018

I am trying to parse this Markdown file (the drake README.md, see `md` below).

It is full of empty lines, I guess because knitr rendered it from the Rmd. It renders well on GitHub, but when I parse it I cannot recover the structure that is in the .Rmd: the table is either split into separate blocks or, if I remove the empty lines, glued to the rest of the README.

rmd <- "https://raw.githubusercontent.com/ropensci/drake/master/README.Rmd"

md <- "https://raw.githubusercontent.com/ropensci/drake/master/README.md"


library("magrittr")
rmd %>%
  readLines() %>%
  commonmark::markdown_xml(extensions = TRUE) %>%
  xml2::read_xml()
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#>  [1] <thematic_break/>
#>  [2] <heading level="2">\n  <text>output:</text>\n  <softbreak/>\n  <tex ...
#>  [3] <html_block>&lt;!-- README.md is generated from README.Rmd. Please  ...
#>  [4] <code_block info="{r knitrsetup, echo = FALSE}">knitr::opts_chunk$s ...
#>  [5] <code_block info="{r mainexample, echo = FALSE}">suppressMessages(s ...
#>  [6] <html_block>&lt;center&gt;\n&lt;img src="https://ropensci.github.io ...
#>  [7] <html_block>&lt;table class="table"&gt;&lt;thead&gt;&lt;tr class="h ...
#>  [8] <heading level="1">\n  <text>The drake R package </text>\n  <html_i ...
#>  [9] <paragraph>\n  <code>drake</code>\n  <text> — or, Data Frames in R  ...
#> [10] <heading level="1">\n  <text>What gets done stays done.</text>\n</h ...
#> [11] <paragraph>\n  <text>Too many data science projects follow a </text ...
#> [12] <list type="ordered" start="1" delim="period" tight="true">\n  <ite ...
#> [13] <paragraph>\n  <text>It is hard to avoid restarting from scratch.</ ...
#> [14] <html_block>&lt;center&gt;\n&lt;a href="https://twitter.com/fossilo ...
#> [15] <paragraph>\n  <text>With </text>\n  <code>drake</code>\n  <text>,  ...
#> [16] <list type="ordered" start="1" delim="period" tight="true">\n  <ite ...
#> [17] <heading level="1">\n  <text>How it works</text>\n</heading>
#> [18] <paragraph>\n  <text>To set up a project, load your packages,</text ...
#> [19] <code_block info="{r mainpackages}">library(drake)\nlibrary(dplyr)\ ...
#> [20] <paragraph>\n  <text>load your custom functions,</text>\n</paragraph>
#> ...

md %>%
  readLines() %>%
  commonmark::markdown_xml(extensions = FALSE) %>%
  xml2::read_xml()
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#>  [1] <html_block>&lt;!-- README.md is generated from README.Rmd. Please  ...
#>  [2] <html_block>&lt;center&gt;\n</html_block>
#>  [3] <html_block>&lt;img src="https://ropensci.github.io/drake/images/in ...
#>  [4] <html_block>&lt;/center&gt;\n</html_block>
#>  [5] <html_block>&lt;table class="table"&gt;\n</html_block>
#>  [6] <html_block>&lt;thead&gt;\n</html_block>
#>  [7] <html_block>&lt;tr class="header"&gt;\n</html_block>
#>  [8] <html_block>&lt;th align="left"&gt;\n</html_block>
#>  [9] <paragraph>\n  <text>Release</text>\n</paragraph>
#> [10] <html_block>&lt;/th&gt;\n</html_block>
#> [11] <html_block>&lt;th align="left"&gt;\n</html_block>
#> [12] <paragraph>\n  <text>Usage</text>\n</paragraph>
#> [13] <html_block>&lt;/th&gt;\n</html_block>
#> [14] <html_block>&lt;th align="left"&gt;\n</html_block>
#> [15] <paragraph>\n  <text>Development</text>\n</paragraph>
#> [16] <html_block>&lt;/th&gt;\n</html_block>
#> [17] <html_block>&lt;/tr&gt;\n</html_block>
#> [18] <html_block>&lt;/thead&gt;\n</html_block>
#> [19] <html_block>&lt;tbody&gt;\n</html_block>
#> [20] <html_block>&lt;tr class="odd"&gt;\n</html_block>
#> ...

md %>%
  readLines() %>%
  .[. != ""] %>%
  commonmark::markdown_xml(extensions = FALSE) %>%
  xml2::read_xml()
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <html_block>&lt;!-- README.md is generated from README.Rmd. Please e ...
#> [2] <html_block>&lt;center&gt;\n&lt;img src="https://ropensci.github.io/ ...

Created on 2018-09-04 by the reprex package (v0.2.0).
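
A possible explanation for the two `md` outputs above, offered as a sketch rather than a confirmed diagnosis: in CommonMark, an HTML block opened by a block-level tag such as `<table>` ends at the next blank line. The toy input below is made up for illustration (it is not the drake README), and `count_html_blocks()` is a hypothetical helper, not part of commonmark.

```r
# Hypothetical helper: count the <html_block> nodes that commonmark produces
# for a character vector of Markdown lines.
count_html_blocks <- function(lines) {
  xml <- commonmark::markdown_xml(lines)
  hits <- gregexpr("<html_block", xml, fixed = TRUE)[[1]]
  sum(hits > 0L)  # gregexpr() returns -1 when there is no match
}

# Toy table with a blank line after every tag, mimicking the knitr output.
with_blank <- c(
  "<table>", "", "<tr>", "", "<td>", "", "Release", "",
  "</td>", "", "</tr>", "", "</table>"
)
# The same table with the blank lines dropped.
without_blank <- with_blank[with_blank != ""]

count_html_blocks(with_blank)     # one block per tag, plus a paragraph for the text
count_html_blocks(without_blank)  # the whole table stays in a single block
```

If this reading is right, dropping every blank line in the file also removes the blank line that would have closed the table's block, which would explain why the rest of the README ends up glued to it.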


maelle commented Sep 4, 2018

For context, I'm trying to parse the README that GitHub considers to be the preferred one for a repository (https://developer.github.com/v3/repos/contents/#get-the-readme). I must be missing something: surely, if GitHub can render this table, there is a way for me to parse the Markdown file correctly. 🤔


maelle commented Sep 5, 2018

Possibly related: commonmark/commonmark-spec#490


maelle commented Sep 5, 2018

For my very specific use case I'll use a regex to extract the HTML of the first table, but that seems suboptimal, of course!
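
A minimal sketch of what that regex workaround could look like, assuming the README contains a plain `<table>...</table>` with no nested tables. `extract_first_table()` is a hypothetical helper and the input lines are a toy stand-in for the real README, not its actual content.

```r
# Hypothetical helper (not part of commonmark or xml2): return the raw HTML of
# the first <table>...</table> in a character vector of Markdown lines.
extract_first_table <- function(lines) {
  # Collapse to a single string so the pattern can span several lines.
  text <- paste(lines, collapse = "\n")
  # (?s) lets "." match newlines; ".*?" is lazy, so the match stops at the
  # first closing </table> tag. Nested tables are not handled.
  m <- regexpr("(?s)<table\\b.*?</table>", text, perl = TRUE)
  if (m == -1L) {
    return(NA_character_)
  }
  regmatches(text, m)
}

# Toy input: a table full of blank lines, followed by ordinary Markdown.
md_lines <- c(
  "<table class=\"table\">", "", "<thead>", "", "<tr class=\"header\">", "",
  "<th align=\"left\">", "", "Release", "", "</th>", "",
  "</tr>", "", "</thead>", "", "</table>", "", "# The drake R package"
)

table_html <- extract_first_table(md_lines)

# Blank lines do not matter to an HTML parser, so the extracted fragment can be
# handed to xml2 directly (guarded in case xml2 is not installed).
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_html(table_html)
  headers <- xml2::xml_text(xml2::xml_find_all(doc, ".//th"), trim = TRUE)
  print(headers)
}
```

The heading after the table is left out of `table_html`, so the rest of the README can still be parsed as Markdown on its own. Regex matching of HTML is fragile in general, which fits the "suboptimal" caveat above.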
