%%%%%%%%%%%%%
% Sets up the document using my standard MDACC color scheme
% for PDF presentations. Also loads my standard set of packages.
\input{/Resources/myScheme}
\def\rcode#1{\texttt{#1}}

%%%%%%%%%%%%%
% add PDF metadata
\hypersetup{
  pdftitle={Sweave: First Steps Toward Reproducible Analyses},
  pdfsubject={Introduction to Sweave},
  pdfauthor={Kevin R. Coombes, Department of Bioinformatics and
    Computational Biology, UT M.D. Anderson Cancer Center},
  pdfkeywords={microarray,R,Sweave,reproducible,standards},
  pdfpagemode={None},
  pdfpagetransition={Replace},
  linkcolor=PaleBlue,
  citecolor=Gray7,
  pagecolor=PaleBlue,
  urlcolor=PaleBlue
}

%%%%%%%%%%%%%
% set up standard headers
\leftheader{\hyperlink{start}{\textsc{Introduction to Microarrays}}}
\rightheader{\textsf{\thepage}}
\MyLogo{{\copyright\ Copyright 2006 Kevin R.~Coombes}} % left footer
\rightfooter{\textsc{Bioinformatics and Computational Biology}}

%%%%%%%%%%%%%
\begin{document}

%%%%%%%%%%%%%
% prepare title page
\title{\textcolor{Lemon}{Sweave: First Steps Toward Reproducible Analyses}}
\author{\textcolor{white}{Kevin R. Coombes}\\
  \textcolor{white}{Department of Bioinformatics and Computational Biology}\\
  \textcolor{white}{Division of Quantitative Sciences}\\
  \textcolor{white}{UT M. D. Anderson Cancer Center}\\
  \textcolor{Orange}{\tt kcoombes@mdanderson.org}}
\date{\textcolor{Lemon}{5 February 2007}}
\LogoOff
\maketitle

%%%%%%%%%
\foilhead{\hypertarget{start}{The First Problem: Reproducibility}}
\LogoOn

Researcher contacts analyst: \textcolor{Lemon}{``I just read this
interesting paper. Can you perform the same analysis on my data?''}

Analyst reads paper. Finds algorithms described by biologists in
English sentences that occupy a minimal amount of space in the methods
section.

Analyst gets public data from the paper. Takes wild guesses at actual
algorithms and parameters. Is unable to reproduce reported results.

Analyst considers switching to career like bicycle repair, where
reproducibility is less of an issue.
%%%%%%%%%
\foilhead{Alternate Forms of the Same Problem}

\begin{enumerate}
\item Remember that microarray analysis you did six months ago? We ran
  a few more arrays. Can you add them to the project and repeat the
  same analysis?
\item The statistical analyst who looked at the data I generated
  previously is no longer available. Can you get someone else to
  analyze my new data set using the same methods (and thus producing a
  report I can expect to understand)?
\item Please write/edit the methods sections for the
  abstract/paper/grant proposal I am submitting based on the analysis
  you did several months ago.
\end{enumerate}

%%%%%%%%%
\foilhead{The Code/Documentation Mismatch}

Most of our analyses are performed using R. We can usually find an R
workspace in a directory containing the raw data, the report, and one
or more R scripts.

\textcolor{Orange}{There is no guarantee that the objects in the R
workspace were actually produced by those R scripts. Nor that the
report matches the code. Nor the R objects.}

Because R is interactive, unknown commands could have been typed at
the command line, or the commands in the script could have been cut
and pasted in a different order.

This problem is even worse if the software used for the analysis has a
fancy modern GUI. It is impossible to document how you used the GUI in
such a way that someone else could produce the exact same results, on
the same data, six months later.

%%%%%%%%%
\foilhead{The Second Problem: Academic Anarchy}

Every faculty member and (almost) every statistical analyst has his or
her own favorite methods for analyzing each of the many kinds of data
that we see in bioinformatics.

This contributes to the reproducibility problem, since everyone uses
different methods and different code. But it also causes analyses to
take longer and makes it harder to shift resources (i.e., people)
around.

The obvious solution is to decide on standard methods for basic tasks.
If we can also develop reusable templates that implement these
standards, then we can speed up those analyses and have a chance to
produce better documentation as well.

%%%%%%%%%
\foilhead{The Solution: Sweave}

{\large
\begin{center}
Sweave = R $+$ LaTeX.
\end{center}
}

This talk was prepared using Sweave. So was
\href{basicTemplate.pdf}{this standard report}.

If you already know both R and LaTeX, then the ten-second version of
this talk takes only two slides:
\begin{enumerate}
\item Prepare a LaTeX document. Give it an ``Rnw'' extension instead
  of ``tex''. Say it is called ``myfile.Rnw''.
\item Insert an R code chunk starting with a line of the form \texttt{$<<>>=$}.
\item Terminate the R code chunk with a line that begins with an
  ``at'' sign (\texttt{@}) followed by a space or a newline.
\end{enumerate}

%%%%%%%%%
\foilhead{Using Sweave}

To produce the final document:
\begin{enumerate}
\item In an R session, issue the command
  \begin{center}\texttt{Sweave("myfile.Rnw")}\end{center}
  This executes the R code and inserts the input commands, the
  computed output, and the figures into a LaTeX file called
  ``myfile.tex''.
\item In a UNIX or DOS window (or using your favorite graphical
  interface), issue the command
  \begin{center}\texttt{pdflatex myfile}\end{center}
  This produces a PDF file that you can use as you wish.
\end{enumerate}

%%%%%%%%%
\foilhead{Viewing The Results}

Here is a simple example, showing how the R input commands can
generate output that is automatically included in the LaTeX output of
Sweave.

<<>>=
options(width=54)
@
<<>>=
x <- rnorm(30)
y <- rnorm(30)
mean(x)
cor(x,y)
@

%%%%%%%%%%%%%
\foilhead{A Figure}

Next, we are going to insert a figure. First, we can look at the R
commands that are used to produce the figure.

<<>>=
x <- seq(0, 6*pi, length.out=450)
par(bg="white", lwd=2, cex=1.3, mai=c(1.2, 1.2, 0.2, 0.2))
plot(x, sin(x), type='l')
abline(h=0, col='blue')
@

On the next slide, we can look at the actual figure. (Part of the
point of this example is to illustrate that you can separate the input
from the output.
You can even completely hide the input in the source file and just
include the output in the report.)

%%%%%%%%%%%%%
\foilhead{Sine Curve}

\begin{center}
% Chunk options here are assumed (fig=TRUE, echo=FALSE); the commands
% repeat the plotting code shown on the previous slide.
<<fig=TRUE,echo=FALSE>>=
x <- seq(0, 6*pi, length.out=450)
par(bg="white", lwd=2, cex=1.3, mai=c(1.2, 1.2, 0.2, 0.2))
plot(x, sin(x), type='l')
abline(h=0, col='blue')
@
\end{center}

%%%%%%%%%%%%%
\foilhead{A Table}

% The chunk label (makeTable) and the chunk options on this slide and
% the next are assumed.
<<makeTable,eval=FALSE>>=
library(xtable)
x <- data.frame(matrix(rnorm(12), nrow=3, ncol=4))
dimnames(x) <- list(c('A', 'B', 'C'), c('C1', 'C2', 'C3', 'C4'))
tab <- xtable(x, digits=c(0, 3, 3, 3, 3))
tab
@
<<echo=FALSE,results=tex>>=
<<makeTable>>
@

%%%%%%%%%%%%%
\foilhead{A Table, Repeated}

Again, we want to point out that you can show the results, including
tables, without showing the commands that generate them.

<<echo=FALSE,results=tex>>=
tab
@

%%%%%%%%%%%%%
\foilhead{Weaving R Into HTML}

The \rcode{R2HTML} package includes a driver that allows you to weave
R commands, output, and figures into HTML documents instead of LaTeX
documents. Of course, you must prepare the HTML part of the source
appropriately, but the document is then built using the commands:
\begin{verbatim}
library(R2HTML)
Sweave("myfile.Rnw", driver=RweaveHTML)
\end{verbatim}

It is worth noting that the \rcode{xtable} package can also generate
HTML tables, using the commands
\begin{verbatim}
tab <- xtable(mydata)
print(tab, type="html")
\end{verbatim}

%%%%%%%%%%%%%
\foilhead{Sweave Details}

For the remainder of this talk, we will:
\begin{itemize}
\item Look at the source ``Rnw'' file that produced this talk.
\item Look at the \href{basicTemplate.pdf}{standard report} of an
  analysis of a simple Affymetrix experiment.
\item Look at the source files that produced that report, and explain
  how they can be reused.
\end{itemize}

\end{document}