Using Markdown and Pandoc for Publication

The other day I was involved in an editing job in which I had to edit 18 articles written in Microsoft Word (doc/docx format) and convert them into pdf format (for printing as a book) and html format (for web publishing). Manuscripts written by people not proficient in the doc(x) format are notorious for heterogeneous formatting and errors, which makes converting them into other formats a nightmare. I accomplished the task with the help of a few pieces of open-source software, in the following steps:
  1. Install the appropriate software.
  2. Make a folder to hold all the markdown files. That folder becomes the working directory of the R project (say, WORK). Make the following subfolders: fig to hold figures, html to hold the final html files, html/fig, which is a copy of the fig subfolder and is referenced by the html files, and pdf to hold the final pdf files. Make a folder .pandoc/templates in the HOME folder to hold the Pandoc templates (default.html(5) and default.latex).
  3. Open the doc(x) documents (say, doc1.doc(x)) with LibreOffice Writer. Save any figures in the fig folder in png format.
  4. Save the documents in html format (say, doc1.html).
  5. Convert the html documents into markdown format with Pandoc.
  6. Edit the markdown files in any text editor.
  7. Build a YAML file in the WORK folder holding the variables used throughout all the documents (say, my.yaml). Any document-specific YAML can be inserted in the md file itself.
  8. Build a css file (say, my.css) in the WORK/html folder, containing all the formatting rules needed for the html output.
  9. Convert the markdown files into pdf and html format with Pandoc.
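
The folder layout from step 2 can be created in one go from the terminal. The following is only a minimal sketch: make_layout is a helper name I made up for illustration, and its two arguments stand for whatever working directory and Pandoc data folder you choose.

```shell
# Sketch: create the folder layout described in step 2.
# make_layout is a hypothetical helper name, not part of any tool.
# $1 = working directory (e.g. WORK)
# $2 = Pandoc data directory (e.g. "$HOME/.pandoc")
make_layout() {
  work_dir=$1
  data_dir=$2
  # fig: source figures; html and pdf: final outputs;
  # html/fig: copy of fig that the html files reference
  mkdir -p "$work_dir/fig" "$work_dir/html/fig" "$work_dir/pdf"
  # templates: holds default.html(5) and default.latex
  mkdir -p "$data_dir/templates"
}

# Example call (uncomment to run):
# make_layout WORK "$HOME/.pandoc"
```

Because mkdir -p does not complain about folders that already exist, the function can be run again safely. Once the figures are final, the contents of WORK/fig still have to be copied into WORK/html/fig.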

Installing the appropriate software

The following software was used (clicking on the hyperlinks leads to the sites from which each can be downloaded):
  1. Ubuntu 12.04 64 bit
  2. R version 3.1.1
  3. RStudio 0.98.932
  4. LibreOffice Writer 4.1.0.4
  5. Pandoc. It comes pre-installed with the current version of RStudio.
  6. Pandoc templates. Tailor-made templates can also be found on many other sites.

Working on doc(x) in LibreOffice Writer

After opening the doc1.doc(x) file in LibreOffice Writer, we save each picture in it to the WORK/fig folder under an appropriate name, preferably in .png format.
We then save the file as doc1.html using LibreOffice Writer.

Converting html into markdown format

We use Pandoc to convert html into markdown format.
We open the terminal, change to the WORK folder, and enter the following to create doc1.md:
pandoc doc1.html -o doc1.md
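
With 18 documents, typing this command once per file gets tedious. The following is a minimal sketch of a loop that does the same for every html file in the current folder. The helper name html_to_md is my own invention, and the sketch assumes that pandoc is on the PATH; the -f and -t options simply state the input and output formats explicitly.

```shell
# Sketch: convert every .html file in the current folder into markdown.
# html_to_md is a hypothetical helper name; it assumes pandoc is on the PATH.
html_to_md() {
  for f in *.html; do
    # If there are no .html files the pattern stays literal, so skip it
    [ -e "$f" ] || continue
    # doc1.html becomes doc1.md, and so on
    pandoc "$f" -f html -t markdown -o "${f%.html}.md"
  done
}

# Example call from inside the WORK folder (uncomment to run):
# html_to_md
```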

Making an appropriate Pandoc template

We copy default.html and default.latex into the HOME/.pandoc/templates folder, as mentioned above.
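
If no ready-made templates are at hand, Pandoc can print its own built-in defaults with the -D option, and these can serve as a starting point. This is only one possible route and not necessarily how the templates used here were obtained. In the minimal sketch below, save_default_templates is a helper name I made up, and pandoc is assumed to be on the PATH.

```shell
# Sketch: write Pandoc's built-in default templates into a templates folder.
# save_default_templates is a hypothetical helper name; it assumes pandoc is on the PATH.
# $1 = templates folder (e.g. "$HOME/.pandoc/templates")
save_default_templates() {
  tpl_dir=$1
  mkdir -p "$tpl_dir"
  # -D FORMAT prints the default template for that output format
  pandoc -D html  > "$tpl_dir/default.html"
  pandoc -D latex > "$tpl_dir/default.latex"
}

# Example call (uncomment to run):
# save_default_templates "$HOME/.pandoc/templates"
```

Note that the function overwrites any default.html and default.latex already present in that folder, so it should be run before the templates are customised, not after.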
We open default.html in a text editor. The following is an example of the template:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"$if(lang)$ lang="$lang$" xml:lang="$lang$"$endif$>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
$for(author-meta)$
  <meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
  <meta name="date" content="$date-meta$" />
$endif$
  <title>$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$</title>
  <style type="text/css">code{white-space: pre;}</style>
$if(quotes)$
  <style type="text/css">q { quotes: "“" "”" "‘" "’"; }</style>
$endif$
$if(highlighting-css)$
  <style type="text/css">
$highlighting-css$
  </style>
$endif$
$for(css)$
  <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
$endfor$
$if(math)$
  $math$
$endif$
$for(header-includes)$
  $header-includes$
$endfor$
</head>
<body>
$for(include-before)$
$include-before$
$endfor$
$if(title)$
<div id="$idprefix$header">
<h1 class="title">$title$</h1>
$if(subtitle)$
<h1 class="subtitle">$subtitle$</h1>
$endif$

<div class="author"><b>$author$</b></div>

<div class="affil"><i>$affiliation$</i></div>

$if(date)$
<h3 class="date">$date$</h3>
$endif$
</div>
$endif$
$if(toc)$
<div id="$idprefix$TOC">
$toc$
</div>
$endif$
$body$
$for(include-after)$
$include-after$
$endfor$
</body>
</html>
The following features can be seen in the above template:
  1. $---$: These are variables, whose values are supplied by the YAML document (described later). When a variable is a collection of named fields (like author with name and address in YAML), the name of the author can be accessed as $author.name$ and the address of the author as $author.address$.
  2. $if(---)$ --- $else$ --- $endif$ construct: This is the branching code of the template. One example is given below:
    $if(date)$
    <h3 class="date">$date$</h3>
    $endif$
    
    The above construct means that if the date variable is given in the YAML, it is entered in the html document as an h3 element with class “date” (whose formatting can be controlled in the css file).
  3. $for(---)$ --- $endfor$ construct: This is the looping code of the template. One example is given below:
    $for(css)$
    <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
    $endfor$
    
    The above construct loops over the css variable, which is a collection of values. It inserts the given html statement <link ---- /> once for each element of the css variable.
  4. $body$ construct: This variable holds the entire content of the doc1.md file after Pandoc has converted it into html. We cannot change anything inside $body$ from within the template. If we want to assign a class (or an id) to an element of the md file, we have to do it by inserting a raw html statement, as depicted below.
    ## header 2
    The normal statement
    <p class="myclass">Content of the special paragraph.  It can **contain** markdown codes.</p>
    Another normal statement
    ## Another header 2
    
A similar template is available for latex, which can also be modified by the user.
The details of Pandoc templates can be found here.
Readers are requested to suggest any further resources on the above (I am not aware of them).

Editing markdown file

The doc1.md file is edited manually in the RStudio editor. There are many resources on Pandoc Markdown, such as this and this.

Making a YAML file

The YAML code can be put inside the individual markdown files (for variables that differ from one markdown file to another) or in a separate file, as described above.
The minimum content of my.yaml is as follows:
---
css: my.css
---
The details of the YAML language can be found here.
In summary, the following points are worth noting:
  1. A YAML block is delimited as follows:
    ---
    YAML CODE
    ---
    
  2. Each variable (denoted as $variable$ in the Pandoc template) is written simply as variable in YAML, and a value is assigned to it as follows:
    ---
    variable: value
    ---
    
  3. The following is an example of a complex variable (equivalent to a list in R).
    ---
    author:
      name: xxx
      address: yyy
    ---
    
    The name of the author is accessed in the Pandoc template as $author.name$. Note the indentation in front of name and address. It must be made with spaces, not tabs.
  4. The following is an example of a collection (equivalent to a vector in R).
    ---
    css:
     - my1.css
     - my2.css
    ---
    
    The variable css has two values associated with it (my1.css and my2.css).
    $for(css)$
      <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
    $endfor$
    
    This code segment in the Pandoc template accesses both values of css and inserts one line each for my1.css and my2.css.
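
The points above can be combined into a single file. The following minimal sketch writes such a file from the terminal. The helper name write_yaml is made up for illustration, and xxx, yyy, my1.css and my2.css are the same placeholder values used in the examples above, not real data.

```shell
# Sketch: write a shared YAML metadata file for Pandoc.
# write_yaml is a hypothetical helper name; the values are placeholders.
# $1 = path of the YAML file to create (e.g. my.yaml)
write_yaml() {
  # The quoted 'EOF' keeps the shell from expanding anything in the text.
  # Indentation inside the block uses spaces only, never tabs.
  cat > "$1" <<'EOF'
---
author:
  name: xxx
  address: yyy
css:
 - my1.css
 - my2.css
---
EOF
}

# Example call from inside the WORK folder (uncomment to run):
# write_yaml my.yaml
```

Writing the file this way keeps the delimiters, the space-only indentation and the list syntax in one place, which is convenient when the same metadata is shared by many documents.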

Converting resulting md files into html and pdf format

Finally, the resulting md files (together with the fig, css, template and yaml files) can be converted into html and pdf format by entering the following commands in the terminal:
pandoc doc1.md my.yaml -s --data-dir=/home/HOME/.pandoc -o html/doc1.html  # for html file output
pandoc doc1.md my.yaml -s --data-dir=/home/HOME/.pandoc -o pdf/doc1.pdf  # for pdf file output
If many md files are present, as in the project I was doing, the whole process can be automated with an R script such as the following:
library(stringr)  # provides str_sub()

# All files in the working directory whose names end in ".md"
md_files <- list.files(pattern = "\\.md$")

# Convert one markdown file into pdf and html by calling pandoc
foo <- function(x) {
  base <- str_sub(x, 1L, -4L)  # file name without the ".md" extension
  s.pdf <- paste0("pandoc ", x, " my.yaml -s --data-dir=/home/HOME/.pandoc -o pdf/", base, ".pdf")
  s.htm <- paste0("pandoc ", x, " my.yaml -s --data-dir=/home/HOME/.pandoc -o html/", base, ".html")
  system(s.pdf)
  system(s.htm)
}

invisible(lapply(md_files, foo))
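
The same batch conversion can also be done without R, directly in the terminal. The following is a minimal sketch and not the script I actually ran. The helper name md_to_outputs is made up, and the sketch assumes that pandoc is on the PATH, that my.yaml sits in the current folder, and that the pdf and html subfolders already exist. Pandoc produces pdf through LaTeX by default, so a LaTeX installation is assumed as well.

```shell
# Sketch: convert every .md file in the current folder into pdf and html.
# md_to_outputs is a hypothetical helper name; see the assumptions above.
# $1 = Pandoc data directory (e.g. "$HOME/.pandoc")
md_to_outputs() {
  data_dir=$1
  for f in *.md; do
    # If there are no .md files the pattern stays literal, so skip it
    [ -e "$f" ] || continue
    base=${f%.md}   # file name without the ".md" extension
    pandoc "$f" my.yaml -s --data-dir="$data_dir" -o "pdf/$base.pdf"
    pandoc "$f" my.yaml -s --data-dir="$data_dir" -o "html/$base.html"
  done
}

# Example call from inside the WORK folder (uncomment to run):
# md_to_outputs "$HOME/.pandoc"
```

Quoting "$f" and the output paths keeps the loop working for file names that contain spaces, which pasted-together command strings do not handle.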

Conclusion

The method described above proved very efficient, in terms of both time and human effort, for bringing all the documents into a uniform format.
YOUR COMMENTS/CRITICISMS ARE WELCOME.
BYE.

Addition

Here is another excellent link, demonstrating how to automate the output depending on the output of the document.



About Me
I am a clinical hematologist practising in India with data analysis and R as my passions.