Art of Statistical Inference


This post was written by me a few years ago, when I started learning the art and science of data analysis. It should be a good starter for amateur data analysts.

Introduction

What is statistics? There are about a dozen definitions of statistics given by different authors, of which the following seems adequate: “A field of study concerned with (1) collection, organization, summarization and analysis of data, and (2) drawing inferences about a body of data when only a part of the data is observed.” All of us clinicians subconsciously perform statistical analyses many times daily in our practice, especially Bayesian analysis, in which we work out the probability of outcomes from our prior experience and the present parameters. The aim of this article is to convey the beauty of statistics without complicating the issues too much.

Scope of statistics in biology

Statistics forms the backbone of any research activity. In biology and medical science, the following are the commonly used statistical methods:

  1. Descriptive statistics.

  2. Comparing measures of central tendencies between two or more populations.

    a. Parametric tests, e.g. t test, ANOVA, etc.

    b. Non-parametric tests, e.g. Kruskal-Wallis test, Mann-Whitney test, etc.

  3. Analysis of proportions (Chi-square test, 2x2 table analysis).

  4. Regression analyses

    a. Univariate vs multivariate regression analyses.

    b. Linear, polynomial, non-linear regression analyses.

    c. Generalised Linear Models (logistic regression).

  5. Survival analysis (life tables, log-rank analysis, Kaplan-Meier curves, Cox proportional hazards regression analysis).

  6. Multivariate data analysis (MANOVA, Cluster analysis, etc.).

  7. Time series analysis and Markov chain models.

  8. Computer simulation studies (Monte Carlo simulations).

The first five of the above branches are used in day-to-day clinical research, depending on the characteristics of the data and the research question asked. This article does not aim to delve into the details of each of the above areas, but will try to explain the stepping-stones of statistical inference, from which we will be able to climb up to a specific area of interest after a bit of further reading.

The mathematics of probability forms the backbone of statistics (e.g., what is the probability that this patient is going to improve, that this person is suffering from tuberculosis, that the null hypothesis is acceptable, etc.?). In the next few paragraphs, we will try to understand it.

Basics of probability

Definitions

We carry out an experiment or trial and note the outcomes. An experiment is called random when the outcome is not known beforehand with certainty, but is known to come from a finite or infinite set of possible, mutually exclusive outcomes (occurrence of one outcome rules out occurrence of another in that experiment). A number between 0 and 1, inclusive, is assigned to each outcome and is known as the probability of occurrence of that particular outcome. We can add up the probabilities of mutually exclusive outcomes to get the probability of composite outcomes (events), and the probabilities of all mutually exclusive outcomes add up to 1. These are the requirements for a number to be called a probability.

If we assign a number (possibly a decimal) to each outcome, we obtain a random variable (RV), a central concept in probability.

For example, measuring BP in a patient (or flipping a coin) is a random experiment. All possible BPs (heads or tails) form the set of mutually exclusive outcomes, each associated with a probability, not exactly known for BP (0.5 for heads and 0.5 for tails), and these probabilities add up to 1. If we assign a number to each outcome, say actual BP minus 10 (1 for heads and 0 for tails), the set of outcomes becomes a RV.

Theoretical probability vs Experimental probability

Probability values known from probability theory (such as those for heads and tails when flipping an unbiased coin) are known as theoretical probabilities. We usually approximate the theoretical probability of an outcome by carrying out many experiments and taking the ratio between the number of occurrences of that outcome and the number of experiments carried out. This is the experimental probability, and it tends to the theoretical probability as the number of experiments tends to infinity.
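
A minimal R sketch of this convergence, assuming a fair coin whose theoretical probability of heads is 0.5:

# Experimental probability of heads approaches the theoretical value (0.5)
# as the number of coin flips grows.
set.seed(1)
for (n in c(10, 100, 1000, 100000)) {
  flips <- sample(c("H", "T"), size = n, replace = TRUE)
  cat(n, "flips: experimental P(heads) =", mean(flips == "H"), "\n")
}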

Concept of probability distribution and density function

The RV along with the probabilities of all of its possible values is called the probability distribution. The mathematical function that describes this distribution is called the distribution function, and after mathematical modification (differentiation), we get the probability density function (pdf) for a continuous RV.

Expectation and variance

The expectation of a RV is the value we expect to find on average if we do the experiment many times (ideally infinitely many). It is the same as the mean of a population. It is calculated by adding up the products of the values of the RV with their probabilities. The expectation of a RV that is the sum of two or more RVs is the sum of the expectations of the individual RVs.

Variance is a measure of dispersion and is the expected value of the squared difference between the value of the RV and its expectation. Conceptually, we carry out the experiment many times, and after each experiment we calculate the squared difference between the value of the RV and the expectation of the RV; the variance is the average of all these values, so it is itself an expectation. The variance of a RV that is the sum of two or more independent RVs is the sum of the variances of the individual RVs.
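
As a small sketch, consider the RV given by a fair die roll; its expectation and variance can be computed directly from the definitions above and checked by simulation:

# Fair die: values 1..6, each with probability 1/6
x <- 1:6
p <- rep(1/6, 6)

# Expectation: sum of value times probability
ex <- sum(x * p)                 # 3.5

# Variance: expected squared deviation from the expectation
vx <- sum((x - ex)^2 * p)        # 35/12, about 2.917

# Check by simulating many rolls
set.seed(1)
rolls <- sample(x, size = 1e5, replace = TRUE, prob = p)
c(mean = mean(rolls), var = var(rolls))   # close to 3.5 and 2.917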

Parametric distribution

There are many distribution functions that actually denote families of probability distribution functions and take a particular form depending on the input values of so-called parameters. These are known as parametric distributions. Examples include the binomial (parameter: p), normal (parameters: mean, standard deviation), exponential (parameter: lambda) and gamma (parameters: a, b). In many statistical analyses, our aim is to estimate the parameters (say mean and standard deviation) from a given sample which we assume to come from a particular parametric distribution (say the normal distribution), as will be described later on.
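
A brief sketch of how parameter values pick out one member of a parametric family, using R's built-in distribution functions (the parameter values below are arbitrary):

# Binomial: parameters size (n) and prob (p)
dbinom(3, size = 10, prob = 0.4)      # P(X = 3)

# Normal: parameters mean and standard deviation
dnorm(120, mean = 110, sd = 15)       # density at 120

# Exponential: parameter rate (lambda)
pexp(2, rate = 0.5)                   # P(X <= 2)

# Gamma: parameters shape (a) and rate (b)
rgamma(5, shape = 2, rate = 1)        # 5 random draws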

Scope of RV

Any numerical quantity that does not have a fixed value and has a distribution function associated with it is a RV. That means parameter estimators and sample means (as will be discussed later on) are also RVs and have distribution functions, expectations and variances associated with them. Similarly, the definition of the experiment can change with context: when calculating the distribution of sample means, our experiment changes from taking the blood pressure of a single person to randomly selecting a sample of, say, 1000 persons, checking their blood pressures and calculating their mean.

Conditional, marginal and joint probability

Conditional probability is the probability of event 'A' occurring given that event 'B' has occurred after conducting an experiment. For instance, after rolling a die, the probability of getting a 2 is 1/6 (one outcome, {2}, out of {1,2,3,4,5,6}). However, given that the roll is even (event 'B'), the probability of it being 2 (event 'A') is 1/3 (one outcome out of {2,4,6}). Conversely, given that it is 2 (event 'A'), the probability of it being even (event 'B') is 1. The marginal and joint probabilities of two RVs can be laid out in a table; a sketch of one such table follows below.
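
As a sketch, consider two hypothetical RVs defined on a die roll, chosen to be consistent with the numbers quoted in the next paragraph: Y is the face value minus 1, and X is 1 if the face is even and 0 otherwise. Their joint, marginal and conditional probabilities can be tabulated in R:

# Hypothetical RVs on a die roll: Y = face value minus 1 (0..5),
# X = 1 if the face is even, 0 if odd.
face <- 1:6
Y <- face - 1
X <- as.integer(face %% 2 == 0)

joint <- table(Y, X) / 6       # joint probabilities P(Y = y, X = x)
joint

margin.table(joint, 1)         # marginal distribution of Y (each 1/6)
margin.table(joint, 2)         # marginal distribution of X (1/2 and 1/2)

# Conditional P(Y = 0 | X = 0) = P(Y = 0, X = 0) / P(X = 0) = (1/6)/(1/2) = 1/3
joint["0", "0"] / margin.table(joint, 2)["0"]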

Association and dependence

Two RVs are independent if the marginal probability distribution of one RV is the same as its conditional probability distribution given any value of the second RV. For example, the marginal probability of Y=0 (previous example) is 1/6. The conditional probability of Y=0 given X=0 is 1/6 divided by 1/2, that is 1/3. Here the conditional probability of Y=0 given X=0 and the marginal probability of Y=0 are different, implying that X and Y are dependent RVs, as expected of course. If the joint probabilities of all possible values of the RVs are equal to the products of their marginal probabilities, we call those RVs independent.

Covariance and correlation are two measures of linear association (dependence) of two RVs. Covariance is the expected value of the product of (the difference between RV1 and its expected value) and (the difference between RV2 and its expected value). Correlation is a mathematical modification of covariance that restricts values to between -1 and 1. A value of 1 means complete positive correlation, -1 means complete negative correlation (the values of the RVs are negatively related) and 0 means no linear correlation. It is important to note that correlation and covariance only indicate the amount of linear association. Two RVs may be exactly related to each other by some non-linear function (say RV1 is the square of RV2) and still have a correlation of less than 1, as shown in the sketch below.
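
A short sketch: two RVs that are perfectly related by a non-linear function can still show a linear correlation near zero.

set.seed(1)
rv2 <- runif(1000, min = -1, max = 1)
rv1 <- rv2^2                  # RV1 is exactly the square of RV2

cov(rv1, rv2)                 # close to 0
cor(rv1, rv2)                 # close to 0 despite perfect (non-linear) dependence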

It is vitally important to note that the statistical measures of association and dependence do not indicate anything about causality.

Statistical inference

Concept of population and samples

After understanding the concept of a random variable, the next most important concepts in statistics are those of population and sample. A population denotes all possible values of a RV together with its probability distribution. A population, like a RV, can be finite or infinite, continuous or discrete. If we knew the population exactly, the rest of this article would not be required. In real-life situations, we define the population at the beginning of the experiment, meaning we assign all possible values to the RV denoting the population, but we are not aware of its probability distribution.

We carry out the experiment many times to get a random sample of the underlying population, and we aim to infer the population probability distribution (and other parameters) with the help of the sample, assuming that the sample behaves approximately like the population. This is called statistical inference and is the heart of statistics. The assumption becomes more accurate as the sample size increases (ideally, as it reaches infinity).

We call a sample random when (1) every member of the sample is taken from the underlying population and has the same (unknown) probability distribution, and (2) every member of the sample is independent of the others.

It is important to note that sample characteristics change if new random samples are collected. The probability distribution of a sample statistic (e.g. sample mean, sample variance, sample range) is known as its sampling distribution, and has an associated expectation and variance.

Data description

Data are the values in the sample. Before we make any inference from the data, we need to describe them. It is very difficult, and usually impossible, to analyze raw data directly. One of the easiest ways to describe a dataset is by way of some low-dimensional summary, known as a statistic. Data need to be analyzed for centrality, dispersion and shape. Commonly used measures of centrality are the sample mean, median and mode. Commonly used measures of dispersion are the variance, standard deviation, range and inter-quartile range. Commonly used measures of shape are skewness (which measures asymmetry about the mean: zero skewness meaning symmetry, negative skew meaning a longer left tail, positive skew a longer right tail) and kurtosis (which measures peakedness). We are not going to dwell further on these measures; a brief sketch follows below.
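
A minimal sketch of these summaries in R, on simulated data (the values below are arbitrary and only illustrate the functions):

set.seed(1)
x <- rnorm(200, mean = 120, sd = 15)   # e.g. simulated systolic BP readings

# Centrality
mean(x); median(x)

# Dispersion
var(x); sd(x); range(x); IQR(x)

# Shape: skewness computed from its definition (third standardised moment)
mean((x - mean(x))^3) / sd(x)^3        # near 0 for roughly symmetric data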

Instead of reducing the dataset, it is often helpful to display the whole data graphically. Visual analysis of data is an extremely useful and easy preliminary step in the overall process. It allows us to assess qualitatively the probability distribution (general structure) of the data, differences between two or more datasets, relationships between two datasets, etc. Often we can obtain the answer from a mere visual inspection of the data. Useful techniques for assessing the structure of data and comparing two or more datasets are histograms, kernel density estimators, strip charts and box plots. Q-Q plots and normality plots are useful for estimating deviation of data from the normal distribution (normally distributed data will give a linear plot). Scatter plots are extremely useful for depicting the relationship between two datasets.
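
A quick sketch of these plots using base R graphics, again on simulated data:

set.seed(1)
grp1 <- rnorm(100, mean = 120, sd = 15)
grp2 <- rnorm(100, mean = 130, sd = 15)

hist(grp1)                                       # overall structure of one dataset
boxplot(grp1, grp2, names = c("grp1", "grp2"))   # comparing two datasets
qqnorm(grp1); qqline(grp1)                       # deviation from normality (roughly linear here)
plot(grp1, grp2)                                 # scatter plot of one dataset against another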

Concept of likelihood function and maximum likelihood estimator

Understanding the likelihood function and the maximum likelihood estimate is central to quantitative statistical inference.

Once we have a sample (\( y_1, y_2, \dots, y_n \)) of size n from a population with a probability distribution (which may not be known, and may only be guessed from theoretical knowledge or from the central limit theorem) with parameter(s) θ, our aim is to estimate the value(s) of θ which explain the data best, to an arbitrary accuracy. We form the likelihood function, \( l(\theta) \), which is a pdf with unknown θ and known \( y_i \)s. This function indicates how well the data are explained by that particular θ. It is mathematically easier to manipulate if we assume the sample to be random (independent). A fundamental principle of statistical inference is that if \( l(\theta_1) > l(\theta_2) \), then θ1 explains the data better than θ2, and the amount of support is directly proportional to \( l(\theta) \).

We obtain the value of θ which maximizes the likelihood function by differentiation or numerically. This value of θ is known as the maximum likelihood estimator (MLE). This is the most general approach to statistical inference. The MLE is not the actual θ, which is unknown to us; it is just an estimate. It is therefore a RV and has a probability distribution associated with it. A numerical sketch follows below.
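
A small numerical sketch for an exponential sample: the log-likelihood is maximised over lambda, and the result is compared with the known closed-form MLE, 1/(sample mean).

set.seed(1)
y <- rexp(100, rate = 2)               # sample from Exponential(lambda = 2)

# Log-likelihood of lambda given the observed sample
loglik <- function(lambda) sum(dexp(y, rate = lambda, log = TRUE))

mle <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
c(numerical_mle = mle, closed_form = 1 / mean(y))   # essentially equal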

Bayesian inference

The essence of Bayesian inference is the continuous updating of the pdf of a parameter θ as new knowledge about it arrives.

We construct a pdf, p(θ), to reflect our knowledge, making p(θ) large for those values of θ that seem most likely and small for those that seem least likely according to our state of knowledge. Different people can have different states of knowledge and hence different pdfs. This is known as the prior distribution.

We carry out an experiment to obtain a random sample to update our knowledge about the distribution of θ. For that we form the likelihood function, i.e. the pdf of the random sample given θ. We can then get the pdf of θ given that particular random sample by a bit of mathematical manipulation (Bayes' theorem). This new, updated pdf of θ is known as the posterior distribution.
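
A minimal sketch of this update for a proportion theta, using a Beta prior and binomial data (the prior and the data below are arbitrary); the Beta prior is conjugate to the binomial likelihood, so the posterior is again a Beta distribution.

# Prior knowledge about theta encoded as Beta(a, b)
a <- 2; b <- 2

# New data: 7 successes in 10 trials
succ <- 7; fail <- 3

# Posterior is Beta(a + successes, b + failures)
a_post <- a + succ
b_post <- b + fail

theta <- seq(0, 1, by = 0.01)
plot(theta, dbeta(theta, a_post, b_post), type = "l", ylab = "density")  # posterior
lines(theta, dbeta(theta, a, b), lty = 2)                                # dashed: prior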

A few results for statistical inference (theory of large samples)

It is intuitively obvious that large samples are better than small ones and, less obviously, that with enough data one is eventually led to the right answer. A few theorems for large samples are:

  1. If we take a random sample of size n from a distribution with mean μ, the expected value of the sample mean is μ.

  2. If we take a random sample of size n from a distribution with variance \( \sigma^2 \), the variance of the sample mean (the square of the standard error of the mean) is \( \sigma^2/n \). This means that as n increases, we become more accurate in estimating the population mean.

  3. As the sample size increases to infinity, the sample mean converges to the population mean (the law of large numbers).

  4. As the sample size increases to infinity, the probability distribution of the sample mean approaches a normal distribution with mean = population mean (μ) and variance = \( \sigma^2/n \). μ can be estimated by the sample mean and \( \sigma^2 \) (approximately) by the sample variance. This holds irrespective of the probability distribution of the population. This theorem is the Central Limit Theorem and is the backbone of parametric statistical inference.

The next obvious question is how big the sample size should be for the central limit theorem to be valid. There is no definite answer: the more asymmetric the population distribution, the larger the sample size needs to be. A simulation sketch follows below.
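
A short simulation sketch of the central limit theorem: sample means drawn from a markedly non-normal (exponential) population look approximately normal once n is reasonably large.

set.seed(1)
n <- 50                                      # sample size
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

hist(sample_means, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)   # CLT approximation
qqnorm(sample_means); qqline(sample_means)                # roughly linear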

Concept of confidence intervals/likelihood set

As described previously, calculating the MLE is the most general way of doing statistical inference, and the MLE is only an estimate of the actual, unknown parameter (θ). Therefore, we need to quantify the accuracy of the MLE as an estimate of θ. In other words, we want to know what other values of θ, in addition to the MLE, provide a reasonably good explanation of the data. Here comes the concept of the likelihood set (LS). \( LS_\alpha \) is the set of θs for which \( l(\theta)/l(\mathrm{MLE}) \geq \alpha \), where α is a number strictly between 0 and 1, usually taken as 0.1 by convention; α quantifies “reasonably well”. \( LS_\alpha \) is usually an interval.

As already discussed, the MLE is a random variable and so has a probability distribution function, \( F_{MLE} \), associated with it. In addition to the LS, \( F_{MLE} \) can be used to assess the accuracy of the MLE as an estimator of θ. If \( F_{MLE} \) is tightly concentrated around θ, the MLE is a highly accurate estimator of θ. If the MLE of a parameter is the same as the sample mean, as it is for the binomial, normal, Poisson and exponential distributions, the central limit theorem applies. This implies that in large samples the MLE can be described as normally distributed, with mean estimated by the sample mean and variance estimated by the sample variance divided by n.

For any normal distribution, about 95% of the mass lies within ±2 standard deviations (the square root of the variance) of the mean. So the interval MLE ± 2 standard deviations of \( F_{MLE} \) is a reasonable estimation interval for the population parameter (θ). This interval is the 95% confidence interval (95% CI). It can be shown that \( LS_{0.1} \) and the 95% CI are essentially the same in large samples; in smaller samples, the LS gives a more accurate measure of θ. A sketch comparing the two follows below.
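
A sketch comparing the 0.1 likelihood set with the usual large-sample 95% CI for a binomial proportion (arbitrary data: 60 successes out of 100 trials):

n <- 100; x <- 60
theta <- seq(0.001, 0.999, by = 0.001)

lik <- dbinom(x, size = n, prob = theta)
mle <- x / n

# Likelihood set: all theta with l(theta)/l(MLE) >= 0.1
ls_01 <- range(theta[lik / dbinom(x, size = n, prob = mle) >= 0.1])

# Large-sample 95% CI: MLE +/- 2 standard errors
se <- sqrt(mle * (1 - mle) / n)
ci_95 <- c(mle - 2 * se, mle + 2 * se)

rbind(likelihood_set = ls_01, conf_int = ci_95)   # very similar intervals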

Concept of hypothesis testing

Hypothesis testing is one of the commonest forms of statistical inference in routine practice. In each instance, there are two mutually exclusive (though not necessarily exhaustive) hypotheses: \( H_0 \), the null hypothesis, and \( H_a \), the alternative hypothesis. By tradition, \( H_0 \) states that the current theory is right and \( H_a \) states that our current theory needs updating. The fundamental idea is to see whether the data are “compatible” with the specific \( H_0 \). If so, there is no reason to doubt \( H_0 \); otherwise there is reason to doubt \( H_0 \) and possibly to consider \( H_a \) in its stead. Typically, there is a four-step process:

  1. Formulate a scientific null hypothesis and translate it into statistical terms.

  2. Choose a low dimensional statistic, say \( w = w(y_1, y_2, \dots, y_n) \), such that the distribution of w is specified under \( H_0 \) and likely to be different under \( H_a \).

  3. Calculate, or approximate, the distribution of w under \( H_0 \).

  4. Check whether the observed value of w, calculated from \( y_1, y_2, \dots, y_n \), is compatible with its distribution under \( H_0 \).

To illustrate in more detail, let us consider testing a new blood pressure medication. The \( H_0 \) is that the new medication is no more effective than the old. We will consider two ways the study may be conducted and see how to test the hypotheses in each.

METHOD 1: A large number of patients are enrolled in a study and their blood pressures are measured (large sample size). Half are randomly chosen to receive the new medication (treatment) and the other half receive the old (control). Blood pressure is measured for all patients at baseline and again after a pre-specified time. Let \( Y_{c,i} \) be the change in blood pressure after receiving control in patient i and \( Y_{t,j} \) be the change in blood pressure after receiving treatment in patient j. \( Y_{c,1}, Y_{c,2}, \dots, Y_{c,n} \) is a random sample from distribution \( f_c \) and \( Y_{t,1}, Y_{t,2}, \dots, Y_{t,n} \) is a random sample from distribution \( f_t \). Usually \( f_c \) and \( f_t \) are unknown. The expected value of the control population is \( \mu_c \) and of the treatment population \( \mu_t \); the variance of the control population is \( \sigma^2_c \) and of the treatment population \( \sigma^2_t \). None of these parameters is known. The translation of the hypotheses into statistical terms is \( H_0: \mu_t = \mu_c \) and \( H_a: \mu_t \neq \mu_c \). Because we are testing a difference in means, let the low dimensional statistic w be the difference between sample means (\( \bar{Y}_t - \bar{Y}_c \)). If the sample size is reasonably large, the central limit theorem says w approximately follows a normal distribution with mean 0 and variance \( \sigma^2_w \) under \( H_0 \), with \( \sigma^2_w = (\sigma^2_t + \sigma^2_c)/n \). The mean 0 comes from \( H_0 \), and the variance \( \sigma^2_w \) comes from adding variances of independent random variables. \( \sigma^2_t \) and \( \sigma^2_c \), and hence \( \sigma^2_w \), can be estimated from the sample variances. So we calculate w from the data and see whether it is within about 2 or 3 standard deviations of where \( H_0 \) says it should be. If it isn't, that is evidence that the sample does not belong to the population described by \( H_0 \). The p value is the probability, computed under \( H_0 \), of obtaining a value of w at least as extreme as the one observed. A simulation sketch follows below.
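
A simulation sketch of METHOD 1 (all numbers are hypothetical): w is the difference in sample means, compared with its approximate normal distribution under H0; the built-in t.test gives essentially the same answer.

set.seed(1)
n <- 200
y_c <- rnorm(n, mean = -5, sd = 10)   # change in BP on control
y_t <- rnorm(n, mean = -8, sd = 10)   # change in BP on treatment

w <- mean(y_t) - mean(y_c)
se_w <- sqrt(var(y_t) / n + var(y_c) / n)   # estimated standard deviation of w

z <- w / se_w                               # how many SDs away from 0?
p_value <- 2 * pnorm(-abs(z))
c(w = w, z = z, p = p_value)

# The same comparison using the built-in two-sample test
t.test(y_t, y_c)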

METHOD 2: A large number of patients are enrolled in a study and their blood pressures are measured. They are matched together in pairs according to relevant medical characteristics, the two patients in a pair being chosen to be as similar to each other as possible. In each pair, one patient is randomly chosen to receive the new medication (treatment); the other receives the old (control). After a pre-specified amount of time their blood pressures are measured again. Let \( Y_{t,j} \) and \( Y_{c,i} \) be the change in blood pressure for the j'th treatment and i'th control patients. The researcher records \( X_i = 1 \) if \( Y_{t,j} > Y_{c,i} \), or 0 otherwise. \( X_1, X_2, \dots, X_n \) is a sample of size n from a Bernoulli distribution with parameter θ. The translation of the hypotheses into statistical terms is \( H_0: \theta = 0.5 \) and \( H_a: \theta \neq 0.5 \). Let the low dimensional statistic w be the mean of all the Xs. Under \( H_0 \), the sum of the Xs follows a binomial distribution with parameters θ = 0.5 and n, so we can find the values of w corresponding to its 0.025 and 0.975 quantiles. If w falls within that range, the data are compatible with \( H_0 \); otherwise this is suggestive evidence against \( H_0 \) (we can still be wrong 5% of the time). If the sample size is reasonably large, the central limit theorem says that w approximately follows a normal distribution with mean 0.5 and variance \( \sigma^2_w \) under \( H_0 \). The population variance can be estimated by the sample variance. So, as before, we calculate w from the data and see whether it is within about 2 or 3 standard deviations of where \( H_0 \) says it should be. If it isn't, that is evidence against \( H_0 \). This is an alternative way to analyze the same data, and it is the worse of the two methods: we lose information in converting the difference in BP (continuous data) into dichotomous data (BP decreased or not), so the hypothesis test becomes less efficient. A sketch using the exact binomial test follows below.
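
A sketch of METHOD 2 with hypothetical numbers: suppose that out of 200 matched pairs, the treated patient improved more than the control in 120. The exact binomial test and the large-sample normal approximation give similar answers.

n_pairs <- 200
x <- 120                     # pairs in which treatment beat control (X_i = 1)

# Exact test of H0: theta = 0.5 against Ha: theta != 0.5
binom.test(x, n_pairs, p = 0.5)

# Large-sample version: under H0, w = x / n_pairs is approximately N(0.5, 0.25 / n_pairs)
w <- x / n_pairs
z <- (w - 0.5) / sqrt(0.25 / n_pairs)
2 * pnorm(-abs(z))           # two-sided p value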

Conclusion

This article has presented the basic concepts of probability and statistical inference in brief. It has not dealt with more advanced topics such as order statistics and non-parametric analysis. The aim has been to open up a new horizon of statistical thinking amongst us, the ultimate aim of which is to develop our own custom-made methods of statistical inference for the problem at hand.


  1. Hi everybody,

    I recently read about the relationship between the Bernoulli process and the Poisson distribution. I wrote a post explaining the process with a simulation.

    Click here to visit the post.

    Hope you like it.

    Comments are welcome.

    Bye.

    Dr Suman

  2. Dear all, recently I went through the R6 OO system in R and was fascinated by its sleek network of environments. I have written a post on whatever I could understand of the package. Click on http://rpubs.com/sumprain/R6 for the post.

    Comments and criticisms are welcome.


  3. Yesterday, I delivered a talk on "Interpretation of Results of Clinical Research" at the Annual Alumni Meet of the Hematology Department of the All India Institute of Medical Sciences, New Delhi, India. Here is the link for the same: https://github.com/sumprain/blog/tree/master/aiims_presentation.

    Comments and criticisms are welcome.

    Dr Suman Kumar (@sumankumarpram1)

  4. Recently I finished working on a deterministic, compartmental model of erythropoiesis (how red blood cells are produced and destroyed) in R using the deSolve package. I have also made a Shiny application for the simulation. The model can be used as a primer for developing more complicated models, e.g. for competing erythropoiesis in post bone marrow transplant settings.

    Please visit http://rpubs.com/sumprain/erythropoeisis_model for the manuscript.

    The model and its description are available at the GitHub site, https://github.com/sumprain/blog/tree/master/erythropoeisis_model.

    The description of the model is given in the model_description.html, model_description.Rmd and model_description.md files.

    The Shiny application can be found in the shiny folder.

    The work.R file is the source file for the R code.

    Please feel free to comment and post your criticisms.

    Dr Suman Kumar (@sumankumarpram1)

  5. Dear all,

    Click on the GitHub site to see my new post. It is about a new alternative measure for comparing the difference in performance between two interventions.

    Any comments and criticisms are welcome.

    Dr Suman Kumar Pramanik
    (@sumankumarpram1)


  6. Using Markdown and Pandoc for Publication

    The other day I was involved in an editing job, in which I was supposed to edit 18 articles written in Microsoft Word (doc/docx format) and convert them into pdf format (for printing as a book) and html format (for web publishing). Manuscripts written by people not proficient in the doc(x) format are notorious for formatting heterogeneity and errors, making conversion of documents into different formats a nightmare. I accomplished the task with the help of a couple of open source software tools, with the following steps:
    1. Installing the appropriate software.
    2. Making a folder to keep all the markdown files. That folder becomes the working directory for the R project (say, WORK). Make the following subfolders: fig to hold figures, html to hold the final html files, html/fig (a copy of the fig subfolder, referenced by the html files) and pdf to hold the final pdf files. Make a folder .pandoc/templates in the HOME folder to hold the Pandoc templates (default.html(5) and default.latex).
    3. Opening the doc(x) documents (say, doc1.doc(x)) with LibreOffice Writer and saving any figures in the fig folder in png format.
    4. Saving the documents in html format (say, doc1.html).
    5. Converting the html documents into markdown format with Pandoc.
    6. Modifying the markdown files in any text editor.
    7. Building a YAML file in the WORK folder holding all the variables to be used throughout the documents (say, my.yaml). Any document-specific YAML can be inserted in the md file itself.
    8. Building a css file (say, my.css) in the WORK/html folder, containing all the necessary formatting codes for html output.
    9. Converting the markdown files into pdf and html format with Pandoc.

    Installing the appropriate software

    The following software was used (clicking on the hyperlinks will lead to the sites from where each can be downloaded):
    1. Ubuntu 12.04 64 bit
    2. R version 3.1.1
    3. R Studio 0.98.932
    4. LibreOffice Writer 4.1.0.4
    5. Pandoc. It comes pre-installed with current version of R Studio.
    6. Pandoc templates. There are many other sites where tailor-made templates can be found.

    Working on doc(x) in LibreOffice Writer

    After opening the doc1.doc(x) file in LibreOffice Writer, we save any pictures in it to WORK/fig, giving them appropriate names, preferably in .png format.
    We then save the file as doc1.html using LibreOffice Writer.

    Converting html into markdown format

    We take the help of Pandoc to convert the html into markdown format.
    We open the terminal, change to the WORK folder and enter the following to create doc1.md:
    pandoc doc1.html -o doc1.md
    

    Making appropriate Pandoc template

    We copy default.html and default.latex into the HOME/.pandoc/templates folder as described before.
    We open default.html in a text editor. The following is an example of the template:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"$if(lang)$ lang="$lang$" xml:lang="$lang$"$endif$>
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <meta http-equiv="Content-Style-Type" content="text/css" />
      <meta name="generator" content="pandoc" />
    $for(author-meta)$
      <meta name="author" content="$author-meta$" />
    $endfor$
    $if(date-meta)$
      <meta name="date" content="$date-meta$" />
    $endif$
      <title>$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$</title>
      <style type="text/css">code{white-space: pre;}</style>
    $if(quotes)$
      <style type="text/css">q { quotes: "“" "”" "‘" "’"; }</style>
    $endif$
    $if(highlighting-css)$
      <style type="text/css">
    $highlighting-css$
      </style>
    $endif$
    $for(css)$
      <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
    $endfor$
    $if(math)$
      $math$
    $endif$
    $for(header-includes)$
      $header-includes$
    $endfor$
    </head>
    <body>
    $for(include-before)$
    $include-before$
    $endfor$
    $if(title)$
    <div id="$idprefix$header">
    <h1 class="title">$title$</h1>
    $if(subtitle)$
    <h1 class="subtitle">$subtitle$</h1>
    $endif$
    
    <div class="author"><b>$author$</b></div>
    
    <div class="affil"><i>$affiliation$</i></div>
    
    $if(date)$
    <h3 class="date">$date$</h3>
    $endif$
    </div>
    $endif$
    $if(toc)$
    <div id="$idprefix$TOC">
    $toc$
    </div>
    $endif$
    $body$
    $for(include-after)$
    $include-after$
    $endfor$
    </body>
    </html>
    
    The following features can be seen in the above code segment:
    1. $---$: These are variables whose values are provided through a YAML document (described later). Sometimes, when a variable is a collection (like author -> name & address in YAML), the name of the author can be accessed as $author.name$ and the address as $author.address$.
    2. $if(---)$ --- $else$ --- $endif$ construct: This is the branching code for the template. One example is below:
      $if(date)$
      <h3 class="date">$date$</h3>
      $endif$
      
      The above construct means that if the date variable is given in the YAML, it will be entered in the html document as an h3 element with class “date” (whose formatting can be manipulated in the css file).
    3. $for(---)$ --- $endfor$ construct: This is the looping code for the template. One example is below:
      $for(css)$
      <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
      $endfor$
      
      The above construct loops over the css variable, which is a collection. It inserts the given html statement <link ---- /> once for each element of the css variable.
    4. $body$ construct: This variable contains all the contents of the doc1.md file after conversion into html by Pandoc. We cannot change anything denoted by the $body$ variable inside the template. If we want to assign a new class (or id) to any element inside the md file, we have to do it by inserting a raw html statement, as depicted below.
      ## header 2
      The normal statement
      <p class="myclass">Content of the special paragraph.  It can **contain** markdown codes.</p>
      Another normal statement
      ## Another header 2
      
    A similar template is available for latex, which can be modified by the user.
    The details of Pandoc templates can be found in the Pandoc documentation.
    The reader is requested to add any more resources for the above (I am not aware of them).

    Editing markdown file

    The doc1.md file is edited manually in the R Studio editor in the standard way. Many resources on Pandoc Markdown are available online.

    Make a YAML file

    The YAML code can be put inside individual markdown files (for variables which differ from file to file) or in a separate file, as described above.
    The minimum content of my.yaml should be as follows:
    ---
    css: my.css
    ---
    
    The details of the YAML language can be found in its documentation.
    In summary, the following points are evident:
    1. The YAML construct is delimited with the following
      ---
      YAML CODE
      ---
      
    2. Each variable (denoted as $variable$ in the Pandoc template) is written as variable, and the following is the code for assigning a value to it.
      ---
      variable: value
      ---
      
    3. The following is an example of complex variable (equivalent to list in R).
      ---
      author:
        name: xxx
        address: yyy
      ---
      
      The name of the author is accessed in the Pandoc template as $author.name$. Note the indentation in front of name and address; indentation must be made with spaces, not tabs.
    4. The following is an example of collection (equivalent of vector in R).
    ---
    css:
     - my1.css
     - my2.css
    ---
    
    The variable css has two values associated with it (my1.css and my2.css).
    $for(css)$
      <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
    $endfor$
    
    The above code segment in the Pandoc template will access both values of css and insert one line each for my1.css and my2.css.

    Converting resulting md files into html and pdf format

    Finally, the resulting md files (together with the fig, css, template and yaml files) can be converted into html and pdf format using the following commands in the terminal:
    pandoc doc1.md my.yaml -s --data-dir=/home/HOME/.pandoc -o html/doc1.html   # for html file output
    pandoc doc1.md my.yaml -s --data-dir=/home/HOME/.pandoc -o pdf/doc1.pdf     # for pdf file output
    
    If many md files are present, as in the project I was doing, the whole process may be automated with a short R script, such as the following:
    library(stringr)   # provides str_sub

    # All markdown files in the working directory
    files <- as.list(list.files()[grep("\\.md$", list.files())])

    foo <- function(x) {
      # Build and run the pandoc commands for pdf and html output
      s.pdf <- paste0("pandoc ", x, " my.yaml -s --data-dir=/home/HOME/.pandoc -o pdf/", str_sub(x, 1L, -4L), ".pdf")
      s.htm <- paste0("pandoc ", x, " my.yaml -s --data-dir=/home/HOME/.pandoc -o html/", str_sub(x, 1L, -4L), ".html")
      system(s.pdf)
      system(s.htm)
    }

    lapply(files, foo)
    

    Conclusion

    The above-described method was very efficient in terms of the time taken and the human effort expended in formatting all the documents uniformly.
    YOUR COMMENTS/CRITICISMS ARE WELCOME.
    BYE.

    Addition

    Another excellent link demonstrates how to automate the output depending on the desired output format of the document.

  7. Clarifying difference between Ratio and Interval Scale of Measurement


    Introduction

    Recently, while preparing a lecture on scales of measurement and types of statistical data, I came across the two scales of measurement used when numbers denote a quantitative variable. It took me some time to clarify the difference between the “Interval” and “Ratio” scales of measurement. I am writing down what I understand of the above-mentioned scales.

    Process of measuring a variable

    The first step in measuring a variable is to understand the concept we want to measure, i.e., to define the variable at a conceptual level. Then we need to make an operational definition of the variable, which includes the following steps:

    1. Setting up of a domain of all the possible values the variable can assume.

    2. Understanding the meaning of different values the variable can assume.

      • Different values can just mean different classes (categories) - Nominal scale
      • Different values can mean different classes (categories) with some ordering (direction of difference) between the classes - Ordinal scale
      • Different values can mean different classes (categories) with ordering and specified distance between them (direction and magnitude/distance of difference) - Interval and Ratio scale
    3. Checking if a real origin (“0”) exists for the variable in the particular scale. Origin (“0”) should mean absolute absence of the variable.

    4. Designing a device which will measure the variable.

    5. Validating the measurement from the device.

    Prerequisites for Ratio Scale

    There are two prerequisites for a measurement scale to be a Ratio Scale:

    1. Presence of “Real Origin”.
    2. Scale is uniformly spaced across its full domain.

    What happens when the above prerequisites are met?

    Let us assume that we have made numerical observations \( A_{ratio} \) and \( B_{ratio} \) for a variable on a ratio scale and that \( B_{ratio} > A_{ratio} \). There are two valid ways to denote the difference between A and B, and two transformations worth examining:

    1. Arithmetic difference between \( A_{ratio} \) and \( B_{ratio} \): It is denoted by \( B_{ratio} - A_{ratio} \). It is a valid measure of difference because of the fact that the scale is uniformly spaced across the domain.

    2. Ratio difference between \( A_{ratio} \) and \( B_{ratio} \): It is denoted by \( B_{ratio}/A_{ratio} \). It indicates that \( B_{ratio} \) is \( B_{ratio}/A_{ratio} \) times larger than \( A_{ratio} \). This is a valid measure of difference because the origin is an absolute one and is the same for both observations. Note that there is no unit, as the result is a ratio. It is also equivalent to the arithmetic difference of the log-transformed observations, \( \log(B_{ratio}) - \log(A_{ratio}) \).

    3. Location transformation: If we shift the observations by \( x \) units, we get \( Ax_{ratio} = A_{ratio} + x \) and \( Bx_{ratio} = B_{ratio} + x \). Arithmetic difference between the two transformed observations, \( Bx_{ratio} - Ax_{ratio} = B_{ratio} - A_{ratio} \), which is the same as original observations.

    4. Scale transformation: If we multiply each of the observations by \( x \) units, we get \( Ax_{ratio} = A_{ratio} \cdot x \) and \( Bx_{ratio} = B_{ratio} \cdot x \). Ratio difference between the two transformed observations, \( Bx_{ratio}/Ax_{ratio} = B_{ratio}/A_{ratio} \), which is the same as original observations.

    So, for a ratio scale, both the arithmetic difference and the ratio difference are valid measures of difference between observations: the arithmetic difference is unchanged by location transformation and the ratio difference is unchanged by scale transformation.

    General transformation of measuring scale

    Any transformation (\( X_{trans} \)) of the original ratio scale, say \( X_{ratio} \) can be depicted as follows

    \[ X_{ratio} = f(X_{trans},S(X_{trans}),L(X_{trans})) \]

    where, \( S(X_{trans}) \) denotes scale transformation parameter as a function wrt location in transformed scale and \( L(X_{trans}) \) denotes location transformation parameter as a function wrt location in transformed scale.

    If we assume constant \( S \) and \( L \) wrt location in transformed scale, one of the simplest scale transformation will be:

    \[ X_{ratio} = (X_{trans} + L) \cdot S \]

    where, \( S \neq 0 \)

    and the interval scale of measurement (\( X_{int} \)) will be the one with \( L \neq 0 \), in addition to the above constraints.

    What happens in Interval Scale?

    In an interval scale, zero does not mean absolute nothingness; it is arbitrarily chosen and corresponds to a distance of \( L \) from the real origin on the ratio scale.

    We continue our example from the above section:

    Let us say that we make two observations in interval scale, \( A_{int} \) and \( B_{int} \), and want to assess difference between both the observations as done earlier.

    Observation \( A_{int} \) will be mapped to \( (A_{int} + L) \cdot S \) and observation \( B_{int} \) to \( (B_{int} + L) \cdot S \) on the ratio scale. We have to use the values on the ratio scale for comparison, as that scale has the “real origin”.

    1. Arithmetic difference between \( A_{int} \) and \( B_{int} \): It shows that the arithmetic difference measured in interval scale is linearly related to the difference measured in ratio scale and that the arithmetic difference in ratio scale is independent of absolute values of \( A_{int} \) and \( B_{int} \). Moreover, interval scale is uniformly spaced across the full domain. Because of the above reasons, arithmetic difference measured in interval scale is a valid way of representing difference between observations \( A_{int} \) and \( B_{int} \). \[ [B_{int} - A_{int}]_{int\ scale} = [(B_{int} + L) \cdot S - (A_{int} + L) \cdot S]_{ratio\ scale} = [(B_{int} - A_{int}) \cdot S]_{ratio\ scale} \]
    2. Ratio difference between \( A_{int} \) and \( B_{int} \): Unlike the arithmetic difference, ratio difference in interval scale is dependent on the absolute values of \( A_{int} \) and \( B_{int} \), with ratio approaching \( B_{int}/A_{int} \) with \( B_{int}, A_{int} >> L \). So, ratio difference is not a valid measure of difference between two observations in interval scale. \[ [B_{int}/A_{int}]_{int\ scale} = [(B_{int} + L) \cdot S/(A_{int} + L) \cdot S]_{ratio\ scale} = [(B_{int} + L)/(A_{int} + L)]_{ratio\ scale} \]
    3. Location transformation: If we shift the observations by x units on the interval scale, we get \( Ax_{int} = ((A_{int} + x) + L) \cdot S \) and \( Bx_{int} = ((B_{int} + x) + L) \cdot S \). The arithmetic difference between the two transformed observations, \( [Bx_{int} - Ax_{int}]_{int\ scale} = [(B_{int} - A_{int}) \cdot S]_{ratio\ scale} = [B_{int} - A_{int}]_{int\ scale} \), remains the same as for the original observations on the interval scale. So, the arithmetic difference remains the same under location transformation.
    4. Scale transformation: If we multiply each of the observations by x units on the original scale, we get \( Ax_{int} = (A_{int} \cdot x + L) \cdot S \) and \( Bx_{int} = (B_{int} \cdot x + L) \cdot S \). The ratio difference between the two transformed observations, \( [Bx_{int}/Ax_{int}]_{int\ scale} = [(B_{int} \cdot x + L)/(A_{int} \cdot x + L)]_{ratio\ scale} \neq [B_{int}/A_{int}]_{int\ scale} \). With scale transformation, the ratio difference becomes different from the ratio difference of the original observations on the interval scale.

    So, for interval scale, only arithmetic difference is a valid measure of difference between observations.
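
As a concrete sketch in R, take temperature as the standard illustration: degrees Celsius behave as an interval scale and kelvin as a ratio scale, with K = C + 273.15 (that is, L = 273.15 and S = 1 in the notation above).

a_c <- 10; b_c <- 20                        # two temperatures in degrees Celsius (interval scale)
a_k <- a_c + 273.15; b_k <- b_c + 273.15    # the same temperatures in kelvin (ratio scale)

b_c - a_c                                   # 10: arithmetic difference, same on both scales
b_k - a_k                                   # 10

b_c / a_c                                   # 2: appears "twice as hot", but ...
b_k / a_k                                   # ~1.035: the ratio on the ratio scale disagrees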

    Conclusion

    The aim of this post is to express what I understand of the interval and ratio scale of measurements. Comments, suggestions and criticisms are welcome.

    Bye.


  8. Is difference in proportion appropriate measure to compare performance of a drug over another one?


    Introduction

    For the past few weeks, a question has lingered in my mind: “Is the traditional approach of assessing the difference in proportion (both as an arithmetic difference and as a ratio) between intervention A and intervention B an appropriate way to ascertain the relative performance of the two interventions?” The question came to my mind after observing practitioners dogmatically approving a particular intervention over another one based on a so-called statistically significant difference (p << 0.05, or a narrow confidence interval not including the value of no effect, usually 0 for a difference or 1 for a ratio) in the proportion of successes (say, patients surviving at a certain fixed time point) between the two groups.

    I will discuss the actual meaning of p values and confidence intervals for differences in proportions. I will also try to highlight the ill effects and limitations of using confidence intervals to decide which of two interventions is better in a properly conducted experimental design (no fallacy in study design and conduct).

    I will be using simulation techniques to explain my point.

    A clinical scenario

    Let us assume that we have discovered a new drug B which is more effective than the standard-of-care drug A, and we are interested in proving that drug B is more effective than drug A. We assume that our outcome of interest is “the proportion of patients dead at the end of 6 months after starting treatment”. Let us assume that patients on drug A have a 6-month mortality of 60% and patients on drug B have a 6-month mortality of 40% (an absolute reduction in mortality of 20 percentage points). These are population parameters, usually unknown to us.

    We take 2n patients, randomly assign drug A and drug B to n patients each, and observe them for 6 months (we assume appropriate randomisation, complete follow-up and independence of all patients) to observe the proportion of deaths in each group.

    We will use 95% CI for all the measures throughout this post.

    library(plyr)
    library(ggplot2)
    
    ## Package SparseM (0.99) loaded.
    ##     To cite, see citation("SparseM")
    
    library(Hmisc, quietly = T)
    
    ## Loading required package: splines
    ## 
    ## Attaching package: 'Hmisc'
    ## 
    ## The following objects are masked from 'package:plyr':
    ## 
    ##     is.discrete, summarize
    ## 
    ## The following objects are masked from 'package:base':
    ## 
    ##     format.pval, round.POSIXt, trunc.POSIXt, units
    
    N <- list(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 1000, 2000, 10000)
    
    res.ci <- function(n, seed, prob.a1, prob.b1) {
    
      samp.n <- function(n, prob.a1, prob.b1) {
      a <- sample(c(0,1), size = n, replace = T, prob = c(1 - prob.a1, prob.a1))
      b <- sample(c(0,1), size = n, replace = T, prob = c(1 - prob.b1, prob.b1))
      df.wide <- data.frame(deathA = a, deathB = b)
      df.narrow <- data.frame(treat = c(rep("a", length = n), rep("b", length = n)), death = c(a, b))
      return(list(df.wide = df.wide, df.narrow = df.narrow))
    }
    
      samp <- samp.n(n, prob.a1, prob.b1)
    
      # calculate ci of propA and propB separately
    
      bt <- function(x) {
    
        r <- binom.test(table(factor(x, levels = c(1,0))))$conf.int
        return(c(lcl = r[1], ucl = r[2]))
      }
    
      ciA <- bt(samp$df.wide$deathA)
      ciB <- bt(samp$df.wide$deathB)
    
      # compute ci of difference in proportion and p value
    
      matAB <- laply(.data = samp$df.wide, .fun = function(x) return(c(`1` = sum(x), `0` = sum(x == 0))))
    
      propDiff <- prop.test(matAB)
    
      # Get number of patients who died in A but not in B
    
      # logistic regression analysis: OR and their CIs
    
      l.mod <- glm(death ~ treat, data = samp$df.narrow, family = binomial(link = "logit"), x = T, y = T)
    
      or.b <- exp(coef(l.mod)[2])
      conf.int.b <- exp(confint(l.mod)[2,])
    
      # Prediction of l.mod for "a" and "b" with CI
    
      nd <- data.frame(treat = c("a", "b"))
    
      pred.lp <- predict(l.mod, nd, se.fit = T, type = "link")
    
      pred.prop <- plogis(pred.lp$fit)
      pred.ucl.prop <- plogis(pred.lp$fit + 1.96*pred.lp$se.fit)
      pred.lcl.prop <- plogis(pred.lp$fit - 1.96*pred.lp$se.fit)
    
      res.df <- data.frame(num = nrow(samp$df.wide), lclA = ciA[1], uclA = ciA[2], lclB = ciB[1], uclB = ciB[2], lclDiff = propDiff$conf.int[1], uclDiff = propDiff$conf.int[2], pDiff = propDiff$p.value, orB = or.b, lclOR = conf.int.b[1], uclOR = conf.int.b[2], pred.propA = pred.prop[1], pred.lcl.propA = pred.lcl.prop[1], pred.ucl.propA = pred.ucl.prop[1], pred.propB = pred.prop[2], pred.lcl.propB = pred.lcl.prop[2], pred.ucl.propB = pred.ucl.prop[2])
    
      names(res.df) <- Cs(num, lclA, uclA, lclB, uclB, lclDiff, uclDiff, pDiff, orB, lclOR, uclOR, pred.propA, pred.lcl.propA, pred.ucl.propA, pred.propB, pred.lcl.propB, pred.ucl.propB)
    
      return(res.df)
    
    }
    
    pa1 <- 0.6
    pb1 <- 0.4
    s <- 3333
    set.seed(s)
    sim.res <- ldply(.data = N, .fun = res.ci, prob.a1 = pa1, prob.b1 = pb1)
    

    What do confidence intervals and p values depict?

    Below, we plot graphs depicting different measures of the difference in proportions and their confidence intervals with respect to sample size.

    library(gridExtra)
    
    p <- ggplot(aes(x = factor(num)), data = sim.res) + xlab("Sample size")  + theme_bw()
    
    p1 <-  p + geom_linerange(aes(ymin = lclA, ymax = uclA), alpha = 0.4, lwd = 1.5) + geom_linerange(aes(ymin = lclB, ymax = uclB), color = "red", alpha = 0.4, lwd = 1.5) + geom_hline(yintercept = c(pa1,pb1), color = c("black", "red"), linetype = "dashed") +  ylab("Proportion dead")
    
    p2 <- p + geom_linerange(aes(ymin = lclDiff, ymax = uclDiff), lwd = 1.5, colour = "blue") + geom_hline(yintercept = c(0, pa1 - pb1), color = c("red", "blue"), linetype = "dashed") +  ylab("Difference in proportion")
    
    p3 <- p + geom_point(aes(y = pDiff), pch = "x", size = 5) + geom_hline(yintercept = 0.05, color = "red", linetype = "dashed") +  ylab("p value of difference in Proportion") + scale_y_log10()
    
    p4 <- p + geom_crossbar(aes(y = pred.propA, ymin = pred.lcl.propA, ymax = pred.ucl.propA), alpha = 0.4, lwd = 0.2, fill = "gray") + geom_crossbar(aes(y = pred.propB, ymin = pred.lcl.propB, ymax = pred.ucl.propB), fill = "red", alpha = 0.4, lwd = 0.2) + geom_hline(yintercept = c(pa1,pb1), color = c("black", "red"), linetype = "dashed") +  ylab("Proportion dead \n as predicted from \n logistic regression model")
    
    grid.arrange(p1, p2, p3, p4, nrow = 4)
    

    (Figure: the four panels described below.)

    Depiction of each of the figures

    The X axis of each figure depicts the sample size (as a factor) from 10 to 10000.

    1. Topmost figure: It depicts the 95% confidence intervals of the proportion dead for drug A (black) and drug B (red). The dashed horizontal lines depict the actual proportion dead in each group.

    2. Second figure: It depicts the confidence interval for the difference between the proportions dead for patients receiving drug A and drug B. The blue dashed line depicts the actual difference (0.2) and the red dashed line depicts the line of no difference (difference of 0).

    3. Third figure: It depicts the p value (on a log scale) for the difference between the proportions dead in the two groups (drug A and drug B). The red dashed line depicts the significance level (p = 0.05).

    4. Lowermost figure: It depicts the predicted proportion dead for drug A and drug B and their confidence intervals from the fitted logistic regression model. This figure is almost identical to the topmost one.

    As can be seen from the above figures, confidence intervals depict how sure we are that the population proportion (or the difference in proportions) lies within a given range. The CI becomes narrower with increasing sample size, so with increasing sample size we simply become more and more sure of the population parameters. Figures 2 and 3 depict the confidence interval for the difference between the proportions in the two groups and the corresponding p values. We see that with increasing sample size the confidence interval surrounds the actual difference in proportions more tightly and converges on the actual difference (0.2). In doing so, the lower limit of the interval gets further away from the line of no difference (0), with the actual difference (0.2) as its limit. Now, the p value depends on the distance between the lower limit of the confidence interval and the line of no effect: its value reduces as that distance increases.

    We can see that confidence intervals can be made arbitrarily narrow and the p value can be made smaller than any given significance level by increasing the sample size appropriately. A research team that has invented a new drug and wants the new drug to be proved better can do so by appropriately selecting the sample size. Also, all of the above-mentioned measures only state that the population proportions dead are 0.6 for drug A and 0.4 for drug B and that the difference is 0.2, thereby meaning drug B is better than drug A. A proportion is a property of a group of patients; none of the above parameters tells us anything about the performance of drug B versus drug A when compared head to head.

    How to tackle the situation?

    Another way to state the problem is: “If drug A is given to a patient and drug B is given to an identical patient, what is the probability that only the patient on drug A dies, that only the patient on drug B dies, that the patients on both drugs die, and that neither of them dies?”

    Mathematically, if we assume that the patient receiving drug A (A) and the patient receiving drug B (B) behave as Bernoulli trials with probabilities of death pA and pB respectively, and that both are independent events, then \( P(A \cap B) = P(A) \times P(B) \). So the probabilities of the various events are as follows:

    \( P(A = 1 \cap B = 0) = 0.6 \times (1-0.4) = 0.36 \) (patient on drug A dies but patient on drug B survives) … 1

    \( P(A = 0 \cap B = 1) = (1-0.6) \times 0.4 = 0.16 \) (patient on drug A survives but patient on drug B dies) … 2

    \( P(A = 1 \cap B = 1) = 0.6 \times 0.4 = 0.24 \) (both patients die, equivalent response) … 3

    \( P(A = 0 \cap B = 0) = (1-0.6) \times (1-0.4) = 0.24 \) (both patients survive, equivalent response) … 4

    We can see that only 36% of the time does the patient taking drug A die while the patient on drug B survives (eqn 1). So drug B is better than drug A only 36% of the time. Similarly, 16% of the time the patient taking drug B dies while the patient on drug A survives, making drug A better than drug B. 48% of the time the performance of drug A and drug B is equivalent (either both die or both survive). So, despite the fact that drug B is significantly better than drug A (as depicted by a highly significant p value and a narrow confidence interval), only 36% of the time (less than 50%) does the patient on drug B survive while the patient on drug A dies. A short calculation of these probabilities follows below.
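
A small sketch computing these four probabilities directly from the assumed mortality rates (pA = 0.6, pB = 0.4, independence as assumed above):

pA <- 0.6; pB <- 0.4                  # 6-month mortality on drug A and drug B

probs <- c(
  only_A_dies  = pA * (1 - pB),       # 0.36: drug B better in this pair
  only_B_dies  = (1 - pA) * pB,       # 0.16: drug A better in this pair
  both_die     = pA * pB,             # 0.24: equivalent outcome
  neither_dies = (1 - pA) * (1 - pB)  # 0.24: equivalent outcome
)
probs
sum(probs)                            # 1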

    Also, despite the fact that drug B is better than drug A by 20 percentage points (40% death rate vs 60% death rate), on head-to-head comparison the two drugs give different outcomes only 52% of the time and equivalent outcomes 48% of the time. The following figures depict the same.

    set.seed(3333)
    
    foo <- function(nSamp, nsim, probA, probB) {
      sampA <- replicate(n = nsim, sample(c(0,1), size = nSamp, replace = T, prob = c(1 - probA, probA)))
      sampB <- replicate(n = nsim, sample(c(0,1), size = nSamp, replace = T, prob = c(1 - probB, probB)))
    
      deathOnlyA <- (sampA > sampB)
      deathOnlyB <- (sampB > sampA)
      deathBoth <- ((sampA == 1) & (sampB == 1))
      deathNone <- ((sampA == 0) & (sampB == 0))
      deathConcordant <- (sampA == sampB)
      deathDiscordant <- (sampA != sampB)
    
      L <- list(deathOnlyA, deathOnlyB, deathBoth, deathNone, deathConcordant, deathDiscordant)
    
      props <- lapply(L, function(y) colSums(y)/nSamp)
    
      meds <- lapply(props, function(x) median(x))
      lcls <- lapply(props, function(x) quantile(x, 0.025))
      ucls <- lapply(props, function(x) quantile(x, 0.975))
    
      return(data.frame(num = nSamp, medOnlyA = meds[[1]], medOnlyB = meds[[2]], medBoth = meds[[3]], medNone = meds[[4]], medConcordant = meds[[5]], medDiscordant = meds[[6]], 
                        lclOnlyA = lcls[[1]], lclOnlyB = lcls[[2]], lclBoth = lcls[[3]], lclNone = lcls[[4]], lclConcordant = lcls[[5]], lclDiscordant = lcls[[6]],
                        uclOnlyA = ucls[[1]], uclOnlyB = ucls[[2]], uclBoth = ucls[[3]], uclNone = ucls[[4]], uclConcordant = ucls[[5]], uclDiscordant = ucls[[6]]))
    }
    
    res <- ldply(.data = N, .fun = foo, nsim = 100, probA = 0.6, probB = 0.4)
    
    p <- ggplot(data = res, aes(x = factor(num))) + xlab("Sample size") + theme_bw()
    
    p + geom_linerange(aes(ymin = lclOnlyA, ymax = uclOnlyA), alpha = 0.4, colour = "red", lwd = 2) + ylab("Proportion deaths only with drug A") + geom_hline(yintercept = c(0.36, 0.5), colour = c("red", "black"), linetype = "dashed")
    

    (Figure: proportion of deaths only with drug A, by sample size.)

    p + geom_crossbar(aes(y = medConcordant, ymin = lclConcordant, ymax = uclConcordant), fill = "red", alpha = 0.4) + geom_crossbar(aes(y = medDiscordant, ymin = lclDiscordant, ymax = uclDiscordant), fill = "green", alpha = 0.5) + ylab("Proportion concordant performance (red)\n and discordant performance (green)") + geom_hline(yintercept = 0.5, linetype = "dashed")
    

    (Figure: proportions of concordant (red) and discordant (green) performance, by sample size.)

    Conclusion

    The traditional way of analysing the performance of a given drug based on a difference in proportions is not adequate; a more appropriate metric should be used by researchers and checked by peer reviewers.

I am not sure about the adequacy and correctness of the above methodology. Comments and criticisms are welcome.

    Bye.

    Suman.

    Session Information

    sessionInfo()
    
    ## R version 3.0.2 (2013-09-25)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## 
    ## locale:
    ##  [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C              
    ##  [3] LC_TIME=en_IN.UTF-8        LC_COLLATE=en_IN.UTF-8    
    ##  [5] LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8   
    ##  [7] LC_PAPER=en_IN.UTF-8       LC_NAME=C                 
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    ## [11] LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C       
    ## 
    ## attached base packages:
    ## [1] splines   grid      stats     graphics  grDevices utils     datasets 
    ## [8] methods   base     
    ## 
    ## other attached packages:
    ## [1] gridExtra_0.9.1 Hmisc_3.14-4    Formula_1.1-1   survival_2.37-7
    ## [5] lattice_0.20-24 ggplot2_1.0.0   plyr_1.8        knitr_1.6      
    ## 
    ## loaded via a namespace (and not attached):
    ##  [1] cluster_1.14.4      colorspace_1.2-4    dichromat_2.0-0    
    ##  [4] digest_0.6.4        evaluate_0.5.5      formatR_0.10       
    ##  [7] gtable_0.1.2        labeling_0.2        latticeExtra_0.6-26
    ## [10] lubridate_1.3.3     MASS_7.3-29         memoise_0.1        
    ## [13] munsell_0.4.2       proto_0.3-10        RColorBrewer_1.0-5 
    ## [16] reshape2_1.2.2      scales_0.2.3        SparseM_0.99       
    ## [19] stringr_0.6.2       tools_3.0.2
    

  9. Preventing escaping in HTML


    library(xtable)
    
    ## 
    ## Attaching package: 'xtable'
    ## 
    ## The following objects are masked from 'package:Hmisc':
    ## 
    ##     label, label<-
    
    library(stringr)
    library(whisker)
    

    Problem statement

Being a novice in the R language, the problem I faced might be a novice one, but I spent hours working on it.
I was building an HTML report from a PostgreSQL database: the report gathers text from the database and places it in the HTML output. I was using Rhtml in RStudio and inserting text at specified positions in the Rhtml document through code chunks, as this gives more control over layout than Rmd.
One of the database fields contained text with multiple newline characters (\n), marking the places where I had pressed ENTER. An example is shown below:
      string 1\nstring 2\nstring 3\nstring 4
    
I ran into problems when trying to render this text as follows in the resulting HTML document.
      1.  string 1
      2.  string 2
      3.  string 3
      4.  string 4
    

The method I first tried (which failed)

I assigned the whole string to a variable.
    s <- "string 1\nstring 2\nstring 3\nstring 4"
    
I substituted each \n with <br>, the HTML tag for a line break, using str_replace_all:
    s1 <- str_replace_all(string = s, pattern = "\\n", replacement = "<br>")
    s1
    
    ## [1] "string 1<br>string 2<br>string 3<br>string 4"
    
Then I tried to put the string s1 into the HTML document as follows. (In reality I was working with a data frame with multiple rows and wanted to present the data as a table.)
    print(xtable(data.frame(s1)), type = "html")
    
The rendered table came out as:
s1
1 string 1&lt br&gt string 2&lt br&gt string 3&lt br&gt string 4
The <br> was not converted into line breaks; it appeared in the output verbatim (escaped). I checked the underlying HTML code and found the following:
      <TABLE border=1>
        <TR> <TH>  </TH> <TH> s1 </TH>  </TR>
        <TR>
          <TD align="right"> 1 </TD>
          <TD> string 1&lt br&gt  string 2&lt br&gt  string 3&lt br&gt  string 4</TD>
        </TR>
      </TABLE>
    
What happened internally was that, while printing the table, xtable escaped the < and > characters into &lt and &gt respectively. The problem I faced was how to prevent the <br> from being escaped, so that the line breaks would be rendered.
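(As an aside, there may be a more direct fix: print.xtable accepts a sanitize.text.function argument, and passing it an identity function should leave the <br> tags untouched. A minimal sketch, which I have not explored further:)

# assumption: disabling xtable's sanitization keeps the <br> tags intact
print(xtable(data.frame(s1)), type = "html",
      sanitize.text.function = function(x) x)  # identity: do not escape < and >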

    Method 1

I split the original string s and trimmed the resulting components.
    s2 <- str_trim(unlist(str_split(s, "\\n")))
    s2
    
    ## [1] "string 1" "string 2" "string 3" "string 4"
    
I made a data frame out of the character vector and printed the required output.
    d <- data.frame(str_c(seq(from = 1, by = 1, along.with = s2), "."), s2)
    print(xtable(d), type = "html", include.colnames = F, include.rownames = F, 
        html.table.attributes = "style='border-width:0;'")
    
    1. string 1
    2. string 2
    3. string 3
    4. string 4
which is the required output!
However, I have not actually prevented <br> from being escaped; I have merely bypassed the issue.

    Method 2

This method uses {{Mustache}} templates and their R implementation, the whisker package.
I used the string s1 and took the following steps:
    l <- list(s1 = s1)
    html.templ <- "<table><tr><td>{{{s1}}}</td></tr></table>"
    cat(whisker.render(template = html.templ, data = l))
    
    ## <table><tr><td>string 1<br>string 2<br>string 3<br>string 4</td></tr></table>
    
The triple braces ({{{ }}}) prevent the <br> from being escaped.
When this HTML is included in the report, the rendered output of the same call is:
    cat(whisker.render(template = html.templ, data = l))
    
    string 1
    string 2
    string 3
    string 4
The code is much shorter and cleaner.

    Concluding remarks

I would like to know whether there are any other techniques to prevent < and > from being escaped during HTML rendering.

    Session Information

    sessionInfo()
    
    ## R version 3.0.2 (2013-09-25)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## 
    ## locale:
    ##  [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C              
    ##  [3] LC_TIME=en_IN.UTF-8        LC_COLLATE=en_IN.UTF-8    
    ##  [5] LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8   
    ##  [7] LC_PAPER=en_IN.UTF-8       LC_NAME=C                 
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    ## [11] LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C       
    ## 
    ## attached base packages:
    ## [1] datasets  grid      grDevices splines   graphics  utils     stats    
    ## [8] methods   base     
    ## 
    ## other attached packages:
    ##  [1] whisker_0.3-2    xtable_1.7-1     knitr_1.5        mypackage_1.0   
    ##  [5] devtools_1.4.1   dplyr_0.1.2      ggplot2_0.9.3.1  rms_4.0-0       
    ##  [9] SparseM_0.99     Hmisc_3.13-0     Formula_1.1-1    cluster_1.14.4  
    ## [13] car_2.0-19       stringr_0.6.2    lubridate_1.3.3  lattice_0.20-24 
    ## [17] epicalc_2.15.1.0 nnet_7.3-7       MASS_7.3-29      survival_2.37-4 
    ## [21] foreign_0.8-57   deSolve_1.10-8  
    ## 
    ## loaded via a namespace (and not attached):
    ##  [1] assertthat_0.1     colorspace_1.2-4   dichromat_2.0-0   
    ##  [4] digest_0.6.4       evaluate_0.5.1     formatR_0.10      
    ##  [7] gtable_0.1.2       httr_0.2           labeling_0.2      
    ## [10] memoise_0.1        munsell_0.4.2      parallel_3.0.2    
    ## [13] plyr_1.8           proto_0.3-10       RColorBrewer_1.0-5
    ## [16] Rcpp_0.11.0        RCurl_1.95-4.1     reshape2_1.2.2    
    ## [19] scales_0.2.3       tools_3.0.2
    
    Bye and regards.

  10. Publishing in GitHub


    I struggled to make my first repository in GitHub. I finally found out the steps to do so.
• Make your project folder (SRCFOLDER) on your local machine and add the required files to it. One of the required files is README.md, which will contain an overview of the project.
• Initialise git inside SRCFOLDER. This creates a .git folder in SRCFOLDER.
SRCFOLDER$ git init
    
• Add the files to git.
SRCFOLDER$ git add .
    
• Commit the files to git. They will be committed to the master branch (HEAD).
SRCFOLDER$ git commit -m "initial commit"
    
• Make a new repository on GitHub (sumprain/XXX). It should be completely empty (no README or licence created on GitHub).
• Copy the HTTPS address of the GitHub repository from the right-hand side of the repository page. It is https://github.com/sumprain/XXX.git.
• Link the GitHub repository to SRCFOLDER as a remote named origin. (Cloning the repository here would instead create a nested copy SRCFOLDER/XXX with its own .git folder, which is not what we want.)
SRCFOLDER$ git remote add origin https://github.com/sumprain/XXX.git
    
• Push the contents of SRCFOLDER to sumprain/XXX.
SRCFOLDER$ git push origin master
    
• When the contents of SRCFOLDER change later, commit the changes and push again.
    Thanks.
Update: The chapter on Git and GitHub in R Packages by Hadley Wickham is one of the best guides available.
