1.1 Tools
The process of preparing programs for a digital computer is especially attractive because it not only can be economically and scientifically rewarding, it can also be an aesthetic experience much like composing poetry or music. (Knuth 1968, v)
1.1.1 R
R is a free software environment for statistical computing and graphics. It was developed in the Department of Statistics at the University of Auckland, and its code is available under the GNU (GNU's Not Unix) GPL4 license. The R Foundation is currently based at the University of Economics and Business in Vienna, Austria. R was influenced by languages such as S and Scheme and follows a minimalist object-oriented paradigm, which specifies a small default kernel accompanied by packages that extend the language. As of this material's closing date, 2023-11-24, there were 20,095 official packages available.
R is an interpreted language, which means that each command is parsed and evaluated interactively, as it is entered. This makes R great for exploring data and prototyping solutions, but the same quality makes it slower for large datasets and for programs that involve many steps. Computing of that kind is commonly written in C or C++, but this need not be a concern now. The skills needed to optimize complex code come with practice, as processes must become increasingly transparent, auditable and therefore reliable.
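As an illustration of this trade-off, the minimal sketch below (using only base R) compares a vectorized sum, which is delegated to compiled code, with an explicit interpreted loop over the same data.
# vectorized versus interpreted loop: same result, very different run times
x <- rnorm(1e6)                      # one million pseudo-random values
system.time(sum(x))                  # vectorized: runs in compiled code
system.time({ s <- 0; for (i in seq_along(x)) s <- s + x[i]; s })  # interpreted loop: noticeably slower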
It is recommended to update R and its packages every work cycle. On Windows, it is also recommended to install Rtools matching the installed version of R. The packages used in this course can be installed and updated with the code below. On a Unix-like operating system, it is recommended to run these instructions in a terminal after issuing the sudo R command followed by the system password. On macOS it may be necessary to install some additional components available at macOS Tools.
# additional packages used in the course
chooseCRANmirror(ind = 11) # Brazil
install.packages('arrangements', dep = T) # sudo apt-get install libgmp3-dev
install.packages('bookdown', dep = T)
install.packages('bootstrap', dep = T)
install.packages('chisq.posthoc.test', dep = T)
install.packages('coronavirus', dep = T)
install.packages('DescTools', dep = T)
install.packages('devtools', dep = T)
install.packages('DT', dep = T)
install.packages('EnvStats', dep = T)
install.packages('factoextra', dep = T)
install.packages('FSA', dep = T)
install.packages('hdrcde', dep = T)
install.packages('HistData', dep = T)
install.packages('klaR', dep = T)
install.packages('LearnBayes', dep = T)
install.packages('performance', dep = T)
install.packages('pracma', dep = T)
install.packages('rgl', dep = T)
install.packages('rje', dep = T)
install.packages('SHELF', dep = T)
install.packages('skimr', dep = T)
install.packages('symmetry', dep = T)
install.packages('tidyverse', dep = T)
install.packages('unitquantreg', dep = T)
install.packages('VGAM', dep = T)
install.packages('voice', dep = T)
install.packages('XML', dep = T)
devtools::install_github('filipezabala/jurimetrics')
devtools::install_github('filipezabala/desempateTecnico')
devtools::install_github('kassambara/ggcorrplot')
devtools::install_github('stefano-meschiari/latex2exp')
devtools::install_github('gadenbuie/tweetrmd')
update.packages(ask = F) # update weekly
If installing one or more packages returns a 'non-zero exit status' message, try running the installation again. If the problem persists, read carefully the messages displayed during execution, which indicate what may be missing. Copying and pasting the error messages into a search engine is an important step, as technical forums such as Stack Overflow provide good troubleshooting suggestions.
Online R
- https://colab.to/r (Version 4.3.1)
- https://www.programiz.com/r/online-compiler/ (Version 4.2.3)
- https://jupyter.org/try (Version 4.1.3, choose the ‘R’ tab)
- https://www.mycompiler.io/new/r (Version 4.1.2)
- https://www.jdoodle.com/execute-r-online/ (Version 4.1.2)
- https://www.kaggle.com/ (Version 4.0.5)
- https://rdrr.io/snippets/ (Version 4.0.3)
- https://ideone.com (Version 3.3.2, in the bottom-left button change from ‘Java’ to ‘R’)
CRAN Task Views
CRAN Task Views aim to provide information about packages on CRAN (the Comprehensive R Archive Network) related to a particular topic. It is recommended to check the subjects of interest in the CRAN Task Views for a more complete approach using the R language.
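As an illustration, the ctv package can install or update every package listed in a Task View; the sketch below assumes ctv is available from CRAN and uses the 'Bayesian' view only as an example.
install.packages('ctv', dep = T)      # helper package for CRAN Task Views
ctv::available.views()                # list the available Task Views
ctv::install.views('Bayesian')        # install all packages of the 'Bayesian' view
ctv::update.views('Bayesian')         # later, install only the missing or outdated ones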
1.1.2 RStudio
RStudio is an integrated development environment (IDE) for R and Python. It enables the creation of automatic presentations and reports in various formats such as pdf, html and docx, mixing languages such as R, LaTeX, markdown, C, C++, Python, SQL, HTML, CSS, JavaScript, Stan and D3. It occupies about 740 MB on disk and is available in Desktop and Server versions, along with their respective preview releases, bringing together R's features in a parsimonious way.
1.1.3 Python
Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, classes, dynamic typing, and very high-level dynamic data types. It supports several programming paradigms in addition to object-oriented programming, such as procedural and functional programming. It has interfaces to many system calls and libraries, as well as many window systems, and is extensible in C or C++. It can also be used as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants, including Linux and macOS, and on Windows.
Python in R Markdown
The reticulate package includes a Python engine for R Markdown that runs Python chunks in a single Python session embedded in the R session, allowing objects created in Python chunks to be accessed from R and vice versa.
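The sketch below is a minimal console-level illustration of this object sharing, assuming reticulate and a working Python installation; in an R Markdown document the same exchange happens between R and Python chunks via the py and r objects.
library(reticulate)                                   # assumes reticulate and Python are installed
py_run_string("squares = [i**2 for i in range(5)]")   # run code in the embedded Python session
py$squares                                            # access the Python object from R: 0 1 4 9 16
py$from_r <- c(1.5, 2.5)                              # create a Python object from R
py_run_string("print(from_r)")                        # the value is visible on the Python side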
Exercise 1.1 Read the reticulate documentation available at https://rstudio.github.io/reticulate/.
1.1.4 Jupyter
Jupyter is a platform that uses open standards and web services for interactive computing in a multitude of programming languages. Several applications are available on the project website, allowing you to run different languages in a dynamic and customizable environment.
1.1.5 Google Colab
Colab is a product of Google Research that allows writing and running Python code through the browser. It is a pre-configured Jupyter notebook service that provides access to computational resources, including GPUs.
1.1.6 JASP
JASP (Jeffreys’s Amazing Statistics Program) is an open source project supported by the University of Amsterdam. With a friendly interface, it offers statistical analysis procedures with classical and Bayesian approaches. It occupies about 1.1 GB on disk and is designed to produce publication-ready analyses. Among its main features are
- Dynamic update of all results
- Spreadsheet layout and a drag-and-drop interface
- Annotated output to communicate your results
- Integration with the Open Science Framework (OSF)
- Support for APA format (copy graphs and tables directly into Word)
1.1.7 JAMOVI
JAMOVI is an open source project referred to as a “3rd generation statistical spreadsheet”. It aims to be an alternative to costly statistical products such as SPSS and SAS, providing access to the latest developments in statistical methodology. It integrates with the R statistical language and can be accessed remotely or via desktop.
1.1.8 PSPP
PSPP is a program for statistical analysis of data. It interprets commands in the SPSS language and produces tabular output in ASCII, PostScript or HTML format. It allows you to perform descriptive statistics, t-tests, ANOVA, linear and logistic regression, measures of association, cluster analysis, reliability and factor analysis, non-parametric tests and much more. It occupies about 160 MB on disk and can be used through the graphical interface or syntax via the command line. A brief list of some of the PSPP features follows below:
- Support for more than 1 billion cases (rows)
- Support for more than 1 billion variables (columns)
- SPSS compatible syntax and data files
- A choice of terminal or graphical user interface
- A choice of text output formats, postscript, pdf, opendocument or html
- Interoperability with Gnumeric, LibreOffice and other free software
- Easy import of data from spreadsheets, text files and database sources
- The ability to open, analyze and edit two or more datasets simultaneously; they can also be merged, joined or concatenated
- A user interface that supports all common character sets and has been translated into many languages
- Fast statistical procedures, even on very large datasets
- No license fees or expiration period
- Portability: works on different computers and operating systems
1.1.9 LibreOffice Calc
LibreOffice is a free office suite and the successor to OpenOffice(.org). It includes several applications, among which the spreadsheet program Calc stands out. Calc has the following functionalities:
- Functions, which can be used to create formulas to perform complex calculations on data
- Database functions to organize, store and filter data
- Statistics tools, to perform complex data analysis
- Dynamic charts, including a wide range of 2D and 3D chart types
- Macros for recording and executing repetitive tasks; supported scripting languages include LibreOffice Basic, Python, BeanShell and JavaScript
- Ability to open, edit and save Microsoft Excel spreadsheets
- Import and export spreadsheets in multiple formats, including HTML (HyperText Markup Language), CSV (Comma Separated Value(s)), PDF (Portable Document Format) and DIF (Data Interchange Format)
1.1.10 Stan
Stan is an open source platform for high-performance statistical computing and modeling, named as a tribute to the mathematician and physicist Stanislaw Ulam, one of the authors of the Monte Carlo method. It is used for data analysis and prediction in the social, biological and physical sciences, engineering and business. Stan’s math library provides probability functions and linear algebra. Additional R packages provide expression-based linear modeling, posterior visualization, and leave-one-out cross-validation. There are interfaces to several popular computing environments, such as RStan (R), PyStan (Python) and Stan.jl (Julia), among others. Using the language one can obtain (a minimal RStan sketch follows the list below):
- Full Bayesian statistical inference with MCMC sampling (NUTS, HMC)
- Approximate Bayesian Inference with Variational Inference (ADVI)
- Maximum likelihood estimation penalized with optimization (L-BFGS)
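As a minimal sketch, assuming the rstan package and a C++ toolchain are installed, the snippet below fits a Bernoulli model to ten binary observations via NUTS/HMC sampling; it uses the current Stan array syntax, which requires a recent rstan release.
library(rstan)                                        # R interface to Stan
model_code <- "
data { int<lower=0> N; array[N] int<lower=0, upper=1> y; }
parameters { real<lower=0, upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"
data_list <- list(N = 10, y = c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1))   # illustrative data
fit <- stan(model_code = model_code, data = data_list, chains = 2, iter = 1000)  # NUTS/HMC
print(fit)                                            # posterior summary of theta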
1.1.11 Tabula
Tabula is a tool for liberating data tables locked inside PDF files. According to its documentation, it will always be free and open source.
- Works on Mac, Windows and Linux
- Allows you to extract data into a CSV or Microsoft Excel spreadsheet using a simple interface
- Only works on text-based PDFs, not scanned documents
- All processing takes place on the local machine
- Used to drive investigative reporting in news organizations of all sizes, including ProPublica, The Times of London, Foreign Policy and La Nación
- Researchers of all types use Tabula to transform PDF reports into Excel spreadsheets, CSVs and JSON files for use in analysis and database applications
1.1.13 Nomograms
(d’Ocagne 1899) is considered a milestone in the study of nomography. In its second edition, (d’Ocagne 1921, v) defines6 the theme as “the general study of the scaled graphical representation of equations in \(n\) variables, with a view to the construction of graphic tables that translate the mathematical laws (νόμος) of which these equations constitute the analytical expression. These tables, called nomograms, allow one, by means of a simple reading, guided by the immediate observation of a certain positional relationship between scaled geometric elements, to obtain the value of one of these \(n\) variables that corresponds to a system of values given for the other \(n-1\)”.
Example 1.1 The Pickett N525-ES StatRule slide rule is an example of the application of nomography methods to statistical calculations. It can be found at http://solo.dc3.com/VirtRule/n525es/virtual-n525-es.html.
(Khovanskii 1979, 7) points out7 that “any nomogram is composed of simple elements: scales, binary fields, families of lines, lines and points. Scales are found on rulers (double decimeters), thermometers, and in various physical instruments. A typical example of a binary field is the grid of parallels and meridians on geographic maps.”
Example 1.2 (Fagan 1975) presents a solution to Bayes' rule involving
- \(Pr(D)\), the probability that the patient has the disease before testing
- \(Pr(D|T)\), the probability that the patient has the disease after the positive test result
- \(Pr(T|D)\), the probability of a positive test result if the patient has the disease
- \(Pr(T|\bar{D})\), the probability of a positive test result if the patient does not have the disease
If \(Pr(T|D)=100\%\) and \(Pr(T|\bar{D})=10\%\), then \(\frac{Pr(T|D)}{Pr(T|\bar{D})}=10\). For \(Pr(D) = 10\%\), a line drawn between these values returns something close to \(Pr(D|T)=53\%\).
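The reading from the nomogram can be checked by a direct application of Bayes' theorem, sketched below with base R only.
# Pr(D|T) = Pr(T|D) Pr(D) / [ Pr(T|D) Pr(D) + Pr(T|not D) Pr(not D) ]
pD   <- 0.10                                       # Pr(D), pre-test probability
sens <- 1.00                                       # Pr(T|D)
fpr  <- 0.10                                       # Pr(T|not D)
(pDT <- sens * pD / (sens * pD + fpr * (1 - pD)))  # 0.5263..., close to the 53% read off the nomogram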
(Wikipedia 2021) defines a nomogram as “a graphical calculating device, a two-dimensional diagram designed to allow the approximate graphical computation of a mathematical function.” According to ChatGPT8 (OpenAI 2021) on 2023-07-06, “a nomogram is a graphical calculation tool that allows for the estimation of unknown values based on known values. It consists of a set of calibrated scales or lines that represent different variables or parameters. By aligning the scales or lines corresponding to known values, the nomogram provides a visual representation of the relationship between these variables and allows for the determination of the unknown value.”
Example 1.3 A request was made to ChatGPT on 2023-07-06:
The following was returned:
Certainly! I can provide you with a basic Python function to generate a statistical nomogram. However, please note that creating a complete and robust nomogram generator with all possible features would require more extensive development and understanding of the specific statistical model you want to represent. The function below will provide you with a starting point:
import matplotlib.pyplot as plt

def generate_nomogram(variables, coefficients):
    # variables: a list of variable names
    # coefficients: a list of corresponding coefficients

    # Calculate the maximum and minimum values for the nomogram axes
    min_value = min(coefficients)
    max_value = max(coefficients)

    # Create the nomogram plot
    fig, ax = plt.subplots()
    ax.set_xlim(min_value, max_value)
    ax.set_ylim(0, len(variables))
    ax.set_xlabel('Points')
    ax.set_ylabel('Variables')

    # Plot the variables and corresponding coefficients
    for i, variable in enumerate(variables):
        coefficient = coefficients[i]
        ax.plot([min_value, coefficient], [i, i], 'k-')
        ax.text(min_value - 0.1, i, variable, ha='right', va='center')

    plt.show()

# Example usage
variables = ['Variable 1', 'Variable 2', 'Variable 3']
coefficients = [0.5, 1.2, -0.8]
generate_nomogram(variables, coefficients)
In this example, the function generate_nomogram takes two lists as input: variables and coefficients. The variables list contains the names of the variables, and the coefficients list contains the corresponding coefficients associated with each variable. The function calculates the minimum and maximum values of the coefficients to determine the range of the nomogram axes.
The function then creates a plot using matplotlib.pyplot, sets the x-axis limits to the minimum and maximum coefficients and the y-axis limits to the number of variables. It labels the x-axis as “Points” and the y-axis as “Variables”.
Finally, the function loops through the variables and coefficients, plotting a line segment for each variable and annotating the variable name. The lines are drawn from the minimum value to the corresponding coefficient on the x-axis.
You can customize and expand this function based on your specific requirements, such as adding additional formatting, legends, or specific scaling for the axes.
Exercise 1.2 Validate the code in Example 1.3.
References
The GNU General Public License is a series of widely used free software licenses that guarantee end users the four freedoms to run, study, share, and modify the software.↩︎
Probabilistic programming is a programming paradigm in which probabilistic models are defined and the inference of these models is done automatically, usually using numerical methods.↩︎
La Nomographie a pour objet l’étude générale de la représentation graphique cotée des équations à n variables, en vue de la construction de tables graphiques traduisant les lois (νόμος) mathématiques dont ces équations constituent l’expression analytique. Ces tables, dites nomogrammes, permettent, au moyen d’une simple lecture, guidée par la constatation immédiate d’une certaine relation de position entre éléments géométriques cotés, d’avoir la valeur d’une de ces n variables qui correspond à un système de valeurs données pour les n-1 autres.↩︎
Tout abaque est constitué d’éléments simples : échelles, champs binaires, familles de lignes, lignes et points. On rencontre les échelles sur les doubles-décimètres, les thermomètres, dans divers appareils de physique. Un exemple type de champ binaire est le réseau de parallèles et de méridiens des cartes de géographie.↩︎
please, define a nomogram↩︎