1.1 Tools
The process of preparing programs for a digital computer is especially attractive because it not only can be economically and scientifically rewarding, it can also be an aesthetic experience much like composing poetry or music. (Knuth 1968, v)
1.1.1 R
R is a free software environment for statistical computing and graphics. It was developed in the Department of Statistics at the University of Auckland, and its code is available under the GNU (GNU's Not Unix) GPL4 license. The R Foundation is currently based at the University of Economics and Business in Vienna, Austria. R was influenced by languages such as S and Scheme and follows a minimalist object-oriented paradigm, which specifies a small default kernel accompanied by packages that extend the language. As of this material's closing date, 2023-11-24, there were 20,095 official packages available.
R is an interpreted language, which means that each command is parsed and evaluated interactively, as it is entered. This makes R great for exploring data and prototyping solutions, but the same quality makes it slower for large datasets and for programs that involve many steps. Computing of that kind is commonly written in C or C++, but this need not be a concern now. The skills needed to optimize complex code come with practice, as processes must become increasingly transparent, auditable and therefore reliable.
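As an illustration of this trade-off, the minimal sketch below (using only base R) compares a vectorized sum, which is delegated to compiled code, with an explicit interpreted loop over the same data.
# vectorized versus interpreted loop: same result, very different run times
x <- rnorm(1e6)                      # one million pseudo-random values
system.time(sum(x))                  # vectorized: runs in compiled code
system.time({ s <- 0; for (i in seq_along(x)) s <- s + x[i]; s })  # interpreted loop: noticeably slower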
It is recommended to update R and its packages every work cycle. On Windows, it is also recommended to install Rtools matching the installed version of R. The packages used in this course can be installed and updated with the code below. On a Unix-like operating system, it is recommended to run these instructions in a terminal after issuing the sudo R command followed by the system password. On macOS it may be necessary to install some additional components available at macOS Tools.
# additional packages used in the course
chooseCRANmirror(ind = 11) # Brazil
install.packages('arrangements', dep = T) # sudo apt-get install libgmp3-dev
install.packages('bookdown', dep = T)
install.packages('bootstrap', dep = T)
install.packages('chisq.posthoc.test', dep = T)
install.packages('coronavirus', dep = T)
install.packages('DescTools', dep = T)
install.packages('devtools', dep = T)
install.packages('DT', dep = T)
install.packages('EnvStats', dep = T)
install.packages('factoextra', dep = T)
install.packages('FSA', dep = T)
install.packages('hdrcde', dep = T)
install.packages('HistData', dep = T)
install.packages('klaR', dep = T)
install.packages('LearnBayes', dep = T)
install.packages('performance', dep = T)
install.packages('pracma', dep = T)
install.packages('rgl', dep = T)
install.packages('rje', dep = T)
install.packages('SHELF', dep = T)
install.packages('skimr', dep = T)
install.packages('symmetry', dep = T)
install.packages('tidyverse', dep = T)
install.packages('unitquantreg', dep = T)
install.packages('VGAM', dep = T)
install.packages('voice', dep = T)
install.packages('XML', dep = T)
devtools::install_github('filipezabala/jurimetrics')
devtools::install_github('filipezabala/desempateTecnico')
devtools::install_github('kassambara/ggcorrplot')
devtools::install_github('stefano-meschiari/latex2exp')
devtools::install_github('gadenbuie/tweetrmd')
update.packages(ask = F) # update weekly
If installing one or more packages returns a 'non-zero exit status' message, try running the installation again. If the problem persists, read carefully the messages displayed during execution, which indicate what may be missing. Copying and pasting the error messages into a search engine is an important step, as technical forums such as Stack Overflow provide good troubleshooting suggestions.
Online R
- https://colab.to/r (Version 4.3.1)
- https://www.programiz.com/r/online-compiler/ (Version 4.2.3)
- https://jupyter.org/try (Version 4.1.3, choose the ‘R’ tab)
- https://www.mycompiler.io/new/r (Version 4.1.2)
- https://www.jdoodle.com/execute-r-online/ (Version 4.1.2)
- https://www.kaggle.com/ (Version 4.0.5)
- https://rdrr.io/snippets/ (Version 4.0.3)
- https://ideone.com (Version 3.3.2, in the bottom-left button change from ‘Java’ to ‘R’)
CRAN Task Views
CRAN Task Views aim to provide information about packages on CRAN (the Comprehensive R Archive Network) related to a particular topic. It is recommended to check the subjects of interest in the CRAN Task Views for a more complete approach using the R language.
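As an illustration, the ctv package can install or update every package listed in a Task View; the sketch below assumes ctv is available from CRAN and uses the 'Bayesian' view only as an example.
install.packages('ctv', dep = T)      # helper package for CRAN Task Views
ctv::available.views()                # list the available Task Views
ctv::install.views('Bayesian')        # install all packages of the 'Bayesian' view
ctv::update.views('Bayesian')         # later, install only the missing or outdated ones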
1.1.2 RStudio
RStudio is an integrated development environment (IDE) for R and Python. It enables the creation of automatic presentations and reports in various formats such as pdf, html and docx, mixing languages such as R, LaTeX, markdown, C, C++, Python, SQL, HTML, CSS, JavaScript, Stan and D3. It occupies about 740 MB on disk and is available in Desktop and Server versions, along with their respective preview releases, bringing together R's features in a parsimonious way.
1.1.3 Python
Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, classes, dynamic typing, and very high-level dynamic data types. It supports several programming paradigms in addition to object-oriented programming, such as procedural and functional programming. It has interfaces to many system calls and libraries, as well as many window systems, and is extensible in C or C++. It can also be used as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants, including Linux and macOS, and on Windows.
Python in R Markdown
The reticulate package includes a Python engine for R Markdown that runs Python chunks in a single Python session embedded in the R session, allowing objects created in Python chunks to be accessed from R and vice versa.
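The sketch below is a minimal console-level illustration of this object sharing, assuming reticulate and a working Python installation; in an R Markdown document the same exchange happens between R and Python chunks via the py and r objects.
library(reticulate)                                   # assumes reticulate and Python are installed
py_run_string("squares = [i**2 for i in range(5)]")   # run code in the embedded Python session
py$squares                                            # access the Python object from R: 0 1 4 9 16
py$from_r <- c(1.5, 2.5)                              # create a Python object from R
py_run_string("print(from_r)")                        # the value is visible on the Python side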
Exercise 1.1 Read the reticulate documentation available at https://rstudio.github.io/reticulate/.
1.1.4 Jupyter
Jupyter is a platform that uses open standards and web services for interactive computing in a multitude of programming languages. Several applications are available on the project website, allowing you to run different languages in a dynamic and customizable environment.
1.1.5 Google Colab
Colab is a product of Google Research that allows writing and running Python code through the browser. It is a pre-configured Jupyter notebook service that provides access to computational resources, including GPUs.
1.1.6 JASP
JASP (Jeffreys’s Amazing Statistics Program) is an open source project supported by the University of Amsterdam. With a friendly interface, it offers statistical analysis procedures with classical and Bayesian approaches. It occupies about 1.1 GB on disk and is designed to produce publication-ready analyses. Among its main features are
- Dynamic update of all results
- Spreadsheet layout and a drag-and-drop interface
- Annotated output to communicate your results
- Integration with the Open Science Framework (OSF)
- Support for APA format (copy graphs and tables directly into Word)
1.1.7 JAMOVI
JAMOVI is an open source project referred to as a “3rd generation statistical spreadsheet”. It aims to be an alternative to costly statistical products such as SPSS and SAS, providing access to the latest developments in statistical methodology. It integrates with the R statistical language and can be accessed remotely or via desktop.
1.1.8 PSPP
PSPP is a program for statistical analysis of data. It interprets commands in the SPSS language and produces tabular output in ASCII, PostScript or HTML format. It allows you to perform descriptive statistics, t-tests, ANOVA, linear and logistic regression, measures of association, cluster analysis, reliability and factor analysis, non-parametric tests and much more. It occupies about 160 MB on disk and can be used through the graphical interface or syntax via the command line. A brief list of some of the PSPP features follows below:
- Support for more than 1 billion cases (rows)
- Support for more than 1 billion variables (columns)
- SPSS compatible syntax and data files
- A choice of terminal or graphical user interface
- A choice of text output formats, postscript, pdf, opendocument or html
- Interoperability with Gnumeric, LibreOffice and other free software
- Easy import of data from spreadsheets, text files and database sources
- The ability to open, analyze and edit two or more datasets simultaneously; they can also be merged, joined or concatenated
- A user interface that supports all common character sets and has been translated into many languages
- Fast statistical procedures, even on very large datasets
- No license fees or expiration period
- Portability: works on different computers and operating systems
1.1.9 LibreOffice Calc
LibreOffice is a free office suite and the successor to OpenOffice(.org). It includes several applications, among which the spreadsheet program Calc stands out. Calc has the following functionalities:
- Functions, which can be used to create formulas to perform complex calculations on data
- Database functions to organize, store and filter data
- Statistics tools, to perform complex data analysis
- Dynamic charts, including a wide range of 2D and 3D chart types
- Macros for recording and executing repetitive tasks; supported scripting languages include LibreOffice Basic, Python, BeanShell and JavaScript
- Ability to open, edit and save Microsoft Excel spreadsheets
- Import and export spreadsheets in multiple formats, including HTML (HyperText Markup Language), CSV (Comma Separated Value(s)), PDF (Portable Document Format) and DIF (Data Interchange Format)
1.1.10 Stan
Stan is an open source platform for high-performance statistical computing and modeling, named as a tribute to the mathematician and physicist Stanislaw Ulam, one of the authors of the Monte Carlo method. It is used for data analysis and prediction in the social, biological and physical sciences, engineering and business. Stan’s math library provides probability functions and linear algebra. Additional R packages provide expression-based linear modeling, posterior visualization, and leave-one-out cross-validation. There are interfaces to several popular computing environments, such as RStan (R), PyStan (Python) and Stan.jl (Julia), among others. Using the language one can obtain (a minimal RStan sketch follows the list below):
- Full Bayesian statistical inference with MCMC sampling (NUTS, HMC)
- Approximate Bayesian Inference with Variational Inference (ADVI)
- Maximum likelihood estimation penalized with optimization (L-BFGS)
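As a minimal sketch, assuming the rstan package and a C++ toolchain are installed, the snippet below fits a Bernoulli model to ten binary observations via NUTS/HMC sampling; it uses the current Stan array syntax, which requires a recent rstan release.
library(rstan)                                        # R interface to Stan
model_code <- "
data { int<lower=0> N; array[N] int<lower=0, upper=1> y; }
parameters { real<lower=0, upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"
data_list <- list(N = 10, y = c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1))   # illustrative data
fit <- stan(model_code = model_code, data = data_list, chains = 2, iter = 1000)  # NUTS/HMC
print(fit)                                            # posterior summary of theta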
1.1.11 Tabula
Tabula is a tool for liberating data tables locked inside PDF files. According to its documentation, it will always be free and open source.
- Works on Mac, Windows and Linux
- Allows you to extract data into a CSV or Microsoft Excel spreadsheet using a simple interface
- Only works on text-based PDFs, not scanned documents
- All processing takes place on the local machine
- Used to drive investigative reporting in news organizations of all sizes, including ProPublica, The Times of London, Foreign Policy and La Nación
- Researchers of all types use Tabula to transform PDF reports into Excel spreadsheets, CSVs and JSON files for use in analysis and database applications
1.1.13 Nomograms
(d’Ocagne 1899) is considered a milestone in the study of nomography. In its second edition, (d’Ocagne 1921, v) defines6 the theme as “the general study of the scaled graphical representation of equations in \(n\) variables, with a view to the construction of graphic tables that translate the mathematical laws (νόμος) of which these equations constitute the analytical expression. These tables, called nomograms, allow one, by means of a simple reading, guided by the immediate observation of a certain positional relationship between scaled geometric elements, to obtain the value of one of these \(n\) variables that corresponds to a system of values given for the other \(n-1\)”.
Example 1.1 The Pickett N525-ES StatRule slide rule is an example of the application of nomography methods to statistical calculations. It can be found at http://solo.dc3.com/VirtRule/n525es/virtual-n525-es.html.
(Khovanskii 1979, 7) points out7 that “any nomogram is composed of simple elements: scales, binary fields, families of lines, lines and points. Scales are found on rulers (double decimeters), thermometers, and in various physical instruments. A typical example of a binary field is the grid of parallels and meridians on geographic maps.”
Example 1.2 (Fagan 1975) presents a solution to Bayes' rule involving
- \(Pr(D)\), the probability that the patient has the disease before testing
- \(Pr(D|T)\), the probability that the patient has the disease after the positive test result
- \(Pr(T|D)\), the probability of a positive test result if the patient has the disease
- \(Pr(T|\bar{D})\), the probability of a positive test result if the patient does not have the disease
If \(Pr(T|D)=100\%\) and \(Pr(T|\bar{D})=10\%\), then \(\frac{Pr(T|D)}{Pr(T|\bar{D})}=10\). For \(Pr(D) = 10\%\), a line drawn between these values returns something close to \(Pr(D|T)=53\%\).
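The reading from the nomogram can be checked by a direct application of Bayes' theorem, sketched below with base R only.
# Pr(D|T) = Pr(T|D) Pr(D) / [ Pr(T|D) Pr(D) + Pr(T|not D) Pr(not D) ]
pD   <- 0.10                                       # Pr(D), pre-test probability
sens <- 1.00                                       # Pr(T|D)
fpr  <- 0.10                                       # Pr(T|not D)
(pDT <- sens * pD / (sens * pD + fpr * (1 - pD)))  # 0.5263..., close to the 53% read off the nomogram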
(Wikipedia 2021) defines a nomogram as “a graphical calculating device, a two-dimensional diagram designed to allow the approximate graphical computation of a mathematical function.” According to ChatGPT8 (OpenAI 2021) on 2023-07-06, “a nomogram is a graphical calculation tool that allows for the estimation of unknown values based on known values. It consists of a set of calibrated scales or lines that represent different variables or parameters. By aligning the scales or lines corresponding to known values, the nomogram provides a visual representation of the relationship between these variables and allows for the determination of the unknown value.”
Example 1.3 A request was made to ChatGPT on 2023-07-06:
The following was returned:
Certainly! I can provide you with a basic Python function to generate a statistical nomogram. However, please note that creating a complete and robust nomogram generator with all possible features would require more extensive development and understanding of the specific statistical model you want to represent. The function below will provide you with a starting point:
import matplotlib.pyplot as plt

def generate_nomogram(variables, coefficients):
    # variables: a list of variable names
    # coefficients: a list of corresponding coefficients

    # Calculate the maximum and minimum values for the nomogram axes
    min_value = min(coefficients)
    max_value = max(coefficients)

    # Create the nomogram plot
    fig, ax = plt.subplots()
    ax.set_xlim(min_value, max_value)
    ax.set_ylim(0, len(variables))
    ax.set_xlabel('Points')
    ax.set_ylabel('Variables')

    # Plot the variables and corresponding coefficients
    for i, variable in enumerate(variables):
        coefficient = coefficients[i]
        ax.plot([min_value, coefficient], [i, i], 'k-')
        ax.text(min_value - 0.1, i, variable, ha='right', va='center')

    plt.show()

# Example usage
variables = ['Variable 1', 'Variable 2', 'Variable 3']
coefficients = [0.5, 1.2, -0.8]
generate_nomogram(variables, coefficients)
In this example, the function generate_nomogram takes two lists as input: variables and coefficients. The variables list contains the names of the variables, and the coefficients list contains the corresponding coefficients associated with each variable. The function calculates the minimum and maximum values of the coefficients to determine the range of the nomogram axes.
The function then creates a plot using matplotlib.pyplot, sets the x-axis limits to the minimum and maximum coefficients and the y-axis limits to the number of variables. It labels the x-axis as “Points” and the y-axis as “Variables”.
Finally, the function loops through the variables and coefficients, plotting a line segment for each variable and annotating the variable name. The lines are drawn from the minimum value to the corresponding coefficient on the x-axis.
You can customize and expand this function based on your specific requirements, such as adding additional formatting, legends, or specific scaling for the axes.
Exercise 1.2 Validate the code in Example 1.3.
References
The GNU General Public License is a series of widely used free software licenses that guarantee end users the four freedoms to run, study, share, and modify the software.↩︎
Probabilistic programming is a programming paradigm in which probabilistic models are defined and the inference of these models is done automatically, usually using numerical methods.↩︎
La Nomographie a pour objet l’étude générale de la représentation graphique cotée des équations à n variables, en vue de la construction de tables graphiques traduisant les lois (νόμος) mathématiques dont ces équations constituent l’expression analytique. Ces tables, dites nomogrammes, permettent, au moyen d’une simple lecture, guidée par la constatation immédiate d’une certaine relation de position entre éléments géométriques cotés, d’avoir la valeur d’une de ces n variables qui correspond à un système de valeurs données pour les n-1 autres.↩︎
Tout abaque est constitué d’éléments simples : échelles, champs binaires, familles de lignes, lignes et points. On rencontre les échelles sur les doubles-décimètres, les thermomètres, dans divers appareils de physique. Un exemple type de champ binaire est le réseau de parallèles et de méridiens des cartes de géographie.↩︎
please, define a nomogram↩︎