1.1 Tools
The process of preparing programs for a digital computer is especially attractive because it not only can be economically and scientifically rewarding, it can also be an aesthetic experience much like composing poetry or music. (Knuth 1968, v)
1.1.1 R
R is a free software environment for statistical computing and graphics. It was developed in the Department of Statistics at the University of Auckland, and its code is available under the GNU (GNU's Not Unix) GPL [4] license. The R Foundation is currently based at the Vienna University of Economics and Business, in Austria. R was influenced by languages like S and Scheme and follows a minimalist object-oriented paradigm, which specifies a small default kernel accompanied by packages for language extension. As of the closing date of this material, 2025-07-28, there are 22498 official packages available.
R is an interpreted language, which means it evaluates each command interactively, as it is given. This makes R excellent for exploring data and prototyping solutions, but the same quality makes it slower for large data sets and for programs that involve many steps. For computing of this type, programs are commonly written and run in C or C++, but this need not be a concern now: the skills needed to optimize complex code are refined with practice.
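The trade-off described above is not specific to R; it also shows up in Python, the other interpreted language used in this material. The sketch below is only an illustration (the function names are made up for this example): it adds the same numbers with an explicit interpreted loop and with the built-in sum(), whose loop runs in compiled C code, and times both.

```python
import timeit


def loop_sum(n):
    """Add 0..n-1 one interpreted step at a time."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    """Same result, but the loop runs inside compiled C code."""
    return sum(range(n))


n = 1_000_000
assert loop_sum(n) == builtin_sum(n)

# Timings depend on the machine, so no specific values are claimed here.
t_loop = timeit.timeit(lambda: loop_sum(n), number=5)
t_builtin = timeit.timeit(lambda: builtin_sum(n), number=5)
print(f"interpreted loop: {t_loop:.3f} s, built-in sum: {t_builtin:.3f} s")
```

On most machines the built-in version is noticeably faster, which is the same reason heavy computations are often delegated to C or C++.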
It is recommended to update R and its packages every work cycle. On Windows, it is also recommended to install the Rtools version that matches the installed version of R. The packages used in this course can be installed and updated with the code below. On Unix-like operating systems, it is recommended to run the R instructions below in a terminal, after starting R with the sudo R command followed by the system password. On macOS it may be necessary to install some additional components available at macOS Tools.
# packages to be installed on Linux (Debian, Ubuntu)
sudo apt-get install libfreetype6-dev libpng-dev libtiff5-dev \
libjpeg-dev libharfbuzz-dev libfribidi-dev libfontconfig1-dev \
libxml2-dev libcurl4-openssl-dev libmagick++-dev libgmp3-dev \
libgsl-dev glpk-utils libglpk-dev libgit2-dev cmake cargo \
libpoppler-cpp-dev libtesseract-dev tesseract-ocr-eng \
libleptonica-dev libavfilter-dev libfftw3-dev libiodbc2-dev \
unixodbc unixodbc-dev
sudo apt autoremove
# additional packages used in the course
chooseCRANmirror(ind = 11) # https://cran.fiocruz.br
install.packages('arrangements', dep = TRUE)
install.packages('basedosdados', dep = TRUE)
install.packages('BFpack', dep = TRUE)
install.packages('bookdown', dep = TRUE)
install.packages('bootstrap', dep = TRUE)
install.packages('chisq.posthoc.test', dep = TRUE)
install.packages('coronavirus', dep = TRUE)
install.packages('DescTools', dep = TRUE)
install.packages('devtools', dep = TRUE)
install.packages('DT', dep = TRUE)
install.packages('edfReader', dep = TRUE)
install.packages('EnvStats', dep = TRUE)
install.packages('factoextra', dep = TRUE)
install.packages('fbst', dep = TRUE)
install.packages('FSA', dep = TRUE)
install.packages('HDInterval', dep = TRUE)
install.packages('hdrcde', dep = TRUE)
install.packages('HistData', dep = TRUE)
install.packages('klaR', dep = TRUE)
install.packages('LearnBayes', dep = TRUE)
install.packages('magick', dep = TRUE)
install.packages('markovchain', dep = TRUE)
install.packages('performance', dep = TRUE)
install.packages('philentropy', dep = TRUE)
install.packages('pracma', dep = TRUE)
install.packages('rgl', dep = TRUE)
install.packages('rje', dep = TRUE)
install.packages('SHELF', dep = TRUE)
install.packages('skimr', dep = TRUE)
install.packages('sos', dep = TRUE)
install.packages('symmetry', dep = TRUE)
install.packages('tidyverse', dep = TRUE)
install.packages('unitquantreg', dep = TRUE)
install.packages('VGAM', dep = TRUE)
install.packages('VIM', dep = TRUE)
# install.packages('voice', dep = TRUE)
install.packages('XML', dep = TRUE)
devtools::install_github('filipezabala/desempateTecnico')
devtools::install_github('filipezabala/jurimetrics')
devtools::install_github('filipezabala/voice')
devtools::install_github('kassambara/ggcorrplot')
devtools::install_github('stefano-meschiari/latex2exp')
devtools::install_github('gadenbuie/tweetrmd')
devtools::install_github('rstudio/webshot2')
update.packages(ask = FALSE) # update weekly
If installing one or more packages returns a non-zero exit status message, try running the installation again. If the problem persists, carefully read the messages displayed during execution, which indicate what may be missing. Copying and pasting the error messages into a search engine such as Google is an important step, as technical forums like Stack Overflow provide great troubleshooting suggestions.
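For readers unfamiliar with the term, an exit status is the integer a program reports when it finishes; zero means success and anything else signals a failure. The sketch below is a minimal illustration in Python, not part of the R installation itself: it starts a child process that fails on purpose (the error text is hypothetical) and shows how the status and the messages can be inspected.

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command and return (exit status, captured error output)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stderr


# A child process that fails on purpose, standing in for a failed installation.
# The message 'missing libxml2' is a made-up example.
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('missing libxml2\\n'); sys.exit(1)",
]

status, messages = run_and_report(failing)
if status != 0:
    print(f"non-zero exit status {status}; messages to read: {messages.strip()}")
```

The captured messages are the part worth reading and searching for, since they usually name the missing component.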
1.1.2 RStudio
RStudio is an integrated development environment (IDE) for R and Python. It enables the creation of automated presentations and reports in various formats such as PDF, HTML and DOCX, mixing languages such as R, LaTeX, Markdown, C, C++, Python, SQL, HTML, CSS, JavaScript, Stan and D3. It occupies about 740MB on disk and is available in Desktop and Server editions, along with their respective preview builds, bringing together R’s features in a parsimonious way.
1.1.2.1 Online R
- https://webr.r-wasm.org/latest/ (Version 4.5.1, Documentation)
- https://colab.research.google.com/#create=true&language=r (Version 4.5.0)
- https://www.jdoodle.com/execute-r-online (Version 4.3.2)
- https://www.programiz.com/r/online-compiler (Version 4.2.3)
- https://jupyter.org/try (Version 4.1.3, choose the ‘R’ tab)
- https://www.mycompiler.io/new/r (Version 4.1.2)
- https://rdrr.io/snippets (Version 4.0.3)
- https://filipezabala.com/r (Version 3.4.0)
1.1.2.2 CRAN Task Views
CRAN Task Views aim to provide information about CRAN (Comprehensive R Archive Network) packages related to a particular topic. It is recommended to check the subjects of interest in the CRAN Task Views for a more complete approach using the R language.
1.1.3 Python
Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, classes, dynamic typing, and very high-level dynamic data types. It supports several programming paradigms in addition to object-oriented programming, such as procedural and functional programming. It has interfaces to many system calls and libraries, as well as many window systems, and is extensible in C or C++. It can also be used as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants, including Linux and macOS, and on Windows.
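As a small illustration of the paradigms mentioned above, the sketch below computes the same mean in procedural, functional and object-oriented style. The data and the Sample class are invented for this example.

```python
from functools import reduce

data = [2, 4, 4, 5, 7]

# Procedural: explicit steps that update a running total.
total = 0
for x in data:
    total += x
mean_procedural = total / len(data)

# Functional: combine the values with a function, without mutating state.
mean_functional = reduce(lambda acc, x: acc + x, data, 0) / len(data)


# Object-oriented: bundle the data together with its behaviour.
class Sample:
    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)


mean_oo = Sample(data).mean()
print(mean_procedural, mean_functional, mean_oo)  # 4.4 4.4 4.4
```

All three styles can coexist in the same script, which is why Python is described as multi-paradigm.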
The Python code in this material was generated and adapted from the original R code via Gemini Advanced 2.0 Flash (AI 2023) and DeepSeek-V3 (DeepSeek 2023). After testing, feel free to contribute suggestions and, ideally, code improvements. According to Gemini, the average line-to-line ratio of Python to R code was 2.45, meaning the Python code had, on average, 2.45 times the number of lines of the original R code (excluding blank lines and comments). When considering characters, the average ratio was 1.86, meaning the Python code is 86% longer in terms of characters.
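The ratios above can be reproduced for any pair of snippets. The sketch below is one possible way to do the counting, assuming that "excluding blank lines and comments" means dropping empty lines and lines that start with #; how the reported averages were actually computed is not documented here. The two snippets are hypothetical and not taken from the course material.

```python
def count_code(source, comment_prefix="#"):
    """Return (lines, characters), ignoring blank lines and full-line comments."""
    kept = [
        line
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    ]
    return len(kept), sum(len(line) for line in kept)


# Hypothetical snippets used only to illustrate the arithmetic.
r_code = """
# mean of a vector
x <- c(1, 2, 3)
mean(x)
"""
py_code = """
# mean of a list
import statistics

x = [1, 2, 3]
print(statistics.mean(x))
"""

r_lines, r_chars = count_code(r_code)
py_lines, py_chars = count_code(py_code)
line_ratio = py_lines / r_lines
char_ratio = py_chars / r_chars
print(f"line ratio {line_ratio:.2f}, character ratio {char_ratio:.2f}")
print(f"Python is {100 * (char_ratio - 1):.0f}% longer in characters")
```

A character ratio of 1.86 translates to "86% longer" through the same last step: subtract 1 and multiply by 100.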
1.1.3.1 Asking for help
For more information, see https://docs.python.org/.
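Help is also available from inside a Python session. The sketch below uses only standard-library tools: help() prints documentation interactively, pydoc.render_doc() returns the same text as a string, and dir() lists the attributes of an object.

```python
import pydoc

# help(str.split) would print this text interactively; render_doc returns it.
doc_text = pydoc.render_doc(str.split, renderer=pydoc.plaintext)
print(doc_text)

# Docstrings can be read directly from the object.
print(len.__doc__)

# dir() lists what an object offers, here the string methods starting with 's'.
print([name for name in dir(str) if name.startswith("s")])
```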
1.1.3.2 SymPy
SymPy is a Python library for symbolic mathematics. According to the documentation, the goal is to become a complete computer algebra system (CAS), keeping the code as simple as possible to be understandable and easily extensible. SymPy Gamma is a web application based on Google App Engine that executes and displays the results of SymPy expressions, as well as additional related calculations, in a similar manner to Wolfram|Alpha.
1.1.3.3 Python in R Markdown
The reticulate package includes a Python engine for R Markdown that runs Python snippets in a single Python session embedded in your R session, allowing access to objects created in Python snippets from R and vice versa.
Exercise 1.2 Read the reticulate documentation available at https://rstudio.github.io/reticulate/.
1.1.4 Jupyter
Jupyter is a platform that uses open standards and web services for interactive computing in a multitude of programming languages. There are several applications available at this link, which allow you to run different languages in a dynamic and customizable environment.
1.1.4.1 Google Colab
Colab is a product of Google Research. Colab allows writing and running Python code through the browser. It is a preset Jupyter notebook service that provides access to computational resources, including GPUs.
1.1.5 Stan
Stan is an open source platform for high-performance statistical computing and modeling. Its name is a tribute to the mathematician and physicist Stanislaw Ulam, one of the authors of the Monte Carlo method. It is used for data analysis and prediction in the social, biological and physical sciences, engineering and business. Stan’s math library provides probability functions and linear algebra. Additional R packages provide expression-based linear modeling, posterior visualization, and leave-one-out cross-validation. There are interfaces to several popular computing environments, such as RStan (R), PyStan (Python) and Stan.jl (Julia), among others. Using the language one can obtain:
- Full Bayesian statistical inference with MCMC sampling (NUTS, HMC)
- Approximate Bayesian Inference with Variational Inference (ADVI)
- Penalized maximum likelihood estimation with optimization (L-BFGS)
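To give an idea of what MCMC sampling in the first item means, the sketch below implements a plain random-walk Metropolis sampler using only the Python standard library. It is deliberately much simpler than the NUTS and HMC algorithms that Stan uses and is not Stan code; the model (a binomial proportion with a uniform prior), the data (7 successes in 20 trials) and all function names are assumptions made for this illustration.

```python
import math
import random


def log_posterior(theta, successes, trials):
    """Log of binomial likelihood times a Uniform(0, 1) prior, up to a constant."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)


def metropolis(successes, trials, n_draws=20000, step=0.1, seed=1):
    """Random-walk Metropolis draws from the posterior of theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, trials)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws[n_draws // 10:]  # drop the first 10% as warm-up


draws = metropolis(successes=7, trials=20)
estimate = sum(draws) / len(draws)
exact = (7 + 1) / (20 + 2)  # mean of the Beta(8, 14) posterior
print(f"MCMC posterior mean {estimate:.3f}, exact {exact:.3f}")
```

For this toy model the posterior is known in closed form, so the sampled mean can be checked against the exact value; Stan is aimed at models where no such closed form exists.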
1.1.6 JASP
JASP (Jeffreys’s Amazing Statistics Program) is an open source project supported by the University of Amsterdam. With a friendly interface, it offers statistical analysis procedures with classical and Bayesian approaches. It occupies about 1.1GB on disk, and was developed with the publication of results in mind. Among its main features are:
- Dynamic update of all results
- Spreadsheet layout and a drag-and-drop interface
- Annotated output to communicate your results
- Integration with the Open Science Framework (OSF)
- Support for APA format (copy graphs and tables directly into Word)
1.1.8 JAMOVI
JAMOVI is an open source project described as a “3rd generation statistical spreadsheet”. It aims to be an alternative to expensive statistical products, such as SPSS and SAS, providing access to the latest developments in statistical methodology. It integrates with the R statistical language, and can be used remotely or on the desktop.
1.1.9 PSPP
PSPP is a program for statistical analysis of data. It interprets commands in the SPSS language and produces tabular output in ASCII, PostScript or HTML format. It allows you to perform descriptive statistics, t-tests, ANOVA, linear and logistic regression, measures of association, cluster analysis, reliability and factor analysis, non-parametric tests and much more. It occupies about 160MB on disk, and can be used through the graphical interface or through syntax on the command line. A brief list of some PSPP features follows:
- Support for more than 1 billion cases (rows)
- Support for more than 1 billion variables (columns)
- SPSS compatible syntax and data files
- A choice of terminal or graphical user interface
- A choice of text, PostScript, PDF, OpenDocument or HTML output formats
- Interoperability with Gnumeric, LibreOffice and other free software
- Easy import of data from spreadsheets, text files and database sources
- The ability to open, analyze and edit two or more sets of data simultaneously. They can also be merged, joined or concatenated.
- A user interface that supports all common character sets and has been translated into many languages
- Fast statistical procedures, even on very large datasets
- No license fees or expiration period
- Portability: works on different computers and operating systems
1.1.10 LibreOffice Calc
LibreOffice is a free office suite and successor to OpenOffice(.org). It includes several applications, of which the spreadsheet program Calc stands out. It has the following functionalities:
- Functions, which can be used to create formulas to perform complex calculations in data
- Database functions to organize, store and filter data
- Statistics tools, to perform complex data analysis
- Dynamic graphics, including a wide range of 2D and 3D graphics
- Macros for recording and executing repetitive tasks; supported scripting languages include LibreOffice Basic, Python, BeanShell and JavaScript
- Ability to open, edit and save Microsoft Excel spreadsheets
- Import and export spreadsheets in multiple formats, including HTML (HyperText Markup Language), CSV (Comma Separated Value(s)), PDF (Portable Document Format) and DIF (Data Interchange Format)
1.1.11 Kedro
Kedro is an open-source Python framework hosted by the Linux Foundation (LF AI & Data).
1.1.12 EDF Browser
EDF Browser is a universal, cross-platform, free and open-source viewer, annotator, and toolbox intended for, but not limited to, time-series storage files such as EEG, EMG, ECG, and BioImpedance recordings.
The edfReader package (Vis 2019) reads EDF (European Data Format) and EDF+ files in R. See the package vignette.
(Gramfort et al. 2013) presents MNE, an open-source Python package for exploring, visualizing, and analyzing human neurophysiological data such as MEG, EEG, sEEG, ECoG, and NIRS. Special mention goes to the function mne.io.read_raw_edf(), which reads EDF and EDF+ files.
1.1.13 Tabula
Tabula is a tool for liberating data tables locked inside PDF files. According to the documentation it will always be free and open source.
- Works on Mac, Windows and Linux
- Allows you to extract data into a CSV or Microsoft Excel spreadsheet using a simple interface
- Only works on text-based PDFs, not scanned documents
- All processing takes place on your local machine
- Used to drive investigative reporting in news organizations of all sizes, including ProPublica, The Times of London, Foreign Policy and La Nación
- Researchers of all types use Tabula to transform PDF reports into Excel spreadsheets, CSVs and JSON files for use in analysis and database applications
1.1.15 Nomograms
(d’Ocagne 1899) is considered a milestone in the study of nomography. In its second edition, (d’Ocagne 1921, v) defines [6] the theme as “the general study of the dimensioned graphical representation of equations with \(n\) variables, with a view to the construction of graphic tables that translate the mathematical laws (νόμος) of which these equations constitute the analytical expression. These tables, called nomograms, allow, through a simple reading, guided by the immediate observation of a certain positional relationship between dimensioned geometric elements, to obtain the value of one of these \(n\) variables that corresponds to a system of values given for the other \(n-1\)”.
(Khovanskii 1979, 7) points out [7] that “any nomogram is composed of simple elements: scales, binary fields, families of lines, lines and points. Scales are found on double decimeters, thermometers, in various physical devices. A typical example of a binary field is the grid of parallels and meridians on geographic maps.”
Example 1.1 The Pickett N525-ES StatRule slide rule is an example of the application of nomography methods to statistical calculations. It can be found at http://solo.dc3.com/VirtRule/n525es/virtual-n525-es.html.
Example 1.2 (Fagan 1975) presents a solution to Bayes' rule involving
- \(Pr(D)\), the probability that the patient has the disease before testing
- \(Pr(D|T)\), the probability that the patient has the disease after the positive test result
- \(Pr(T|D)\), the probability of a positive test result if the patient has the disease
- \(Pr(T|\bar{D})\), the probability of a positive test result if the patient does not have the disease
If \(Pr(T|D)=100\%\) and \(Pr(T|\bar{D})=10\%\), then \(\frac{Pr(T|D)}{Pr(T|\bar{D})}=10\). For \(Pr(D) = 10\%\), a line drawn between these values returns something close to \(Pr(D|T)=53\%\).
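The value read from the nomogram can be checked numerically. The sketch below applies the odds form of Bayes' rule (posterior odds = prior odds × likelihood ratio) to the numbers of this example; the function and argument names are chosen for this illustration only.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """Pr(D|T) from Pr(D), Pr(T|D) and Pr(T|not D) via the odds form of Bayes' rule."""
    likelihood_ratio = sensitivity / false_positive_rate
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Pr(D) = 10%, Pr(T|D) = 100%, Pr(T|not D) = 10%, so the likelihood ratio is 10.
p = posterior_probability(prior=0.10, sensitivity=1.00, false_positive_rate=0.10)
print(f"Pr(D|T) = {p:.1%}")  # Pr(D|T) = 52.6%
```

The prior odds are 1/9, the posterior odds are 10/9, and the posterior probability is 10/19, which agrees with the value of about 53% read from the nomogram.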

References
[4] The GNU General Public License is a series of widely used free software licenses that guarantee end users the four freedoms to run, study, share, and modify the software.
[5] Probabilistic programming is a programming paradigm in which probabilistic models are defined and the inference of these models is done automatically, usually using numerical methods.
[6] La Nomographie a pour objet l’étude générale de la représentation graphique cotée des équations à n variables, en vue de la construction de tables graphiques traduisant les lois (νόμος) mathématiques dont ces équations constituent l’expression analytique. Ces tables, dites nomogrammes, permettent, au moyen d’une simple lecture, guidée par la constatation immédiate d’une certaine relation de position entre éléments géométriques cotés, d’avoir la valeur d’une de ces n variables qui correspond à un système de valeurs données pour les n-1 autres.
[7] Tout abaque est constitué d’éléments simples : échelles, champs binaires, familles de lignes, lignes et points. On rencontre les échelles sur les doubles-décimètres, les thermomètres, dans divers appareils de physique. Un exemple type de champ binaire est le réseau de parallèles et de méridiens des cartes de géographie.