| Title: | Summarise and Explore Continuous, Categorical and Date Variables |
|---|---|
| Description: | Explore continuous, date and categorical variables with summary statistics, visualisations, and frequency tables. Brings the ease and simplicity of the sum and tab commands from 'Stata' to 'R', including support for two-way cross-tabulations, hypothesis tests, duplicate and missing data exploration, and automated HTML or PDF exploratory reports. |
| Authors: | Alexander Stockdale [aut, cre, cph] |
| Maintainer: | Alexander Stockdale <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-05-17 08:58:25 UTC |
| Source: | https://github.com/alstockdale/sumvar |
Summarises the minimum, maximum, median, and interquartile range of a date variable, optionally stratified by a grouping variable. Produces a histogram and (optionally) a density plot.
    dist_date(data, var, by = NULL)
| Argument | Description |
|---|---|
| data | A data frame or tibble. |
| var | The date variable to summarise. |
| by | Optional grouping variable. |
A tibble with summary statistics for the date variable.
See also dist_sum for continuous variables.
    # Example ungrouped
    df <- tibble::tibble(
      dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
    )
    dist_date(df, dt)

    # Example grouped
    df2 <- tibble::tibble(
      dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE),
      grp = sample(1:2, 100, TRUE)
    )
    dist_date(df2, dt, grp)
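To make the reported quantities concrete, the sketch below computes the minimum, maximum, median and interquartile range of a Date vector in base R, with and without a grouping vector. It is an illustration only and not sumvar's implementation: the helper name `summarise_dates` is hypothetical, and the quantile method (R's default, type 7, applied to the underlying day counts) is an assumption.

```r
# Base R sketch (not sumvar's code): summarise a Date vector.
set.seed(1)
dt <- as.Date("2020-01-01") + sample(0:1000, 100, TRUE)

# Hypothetical helper: min, quartiles, median and max of a Date vector.
summarise_dates <- function(x) {
  x <- x[!is.na(x)]
  # Quantiles are taken on the numeric day counts, then converted back to Date.
  q <- stats::quantile(as.numeric(x), probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(
    n      = length(x),
    min    = min(x),
    p25    = as.Date(q[1], origin = "1970-01-01"),
    median = as.Date(q[2], origin = "1970-01-01"),
    p75    = as.Date(q[3], origin = "1970-01-01"),
    max    = max(x)
  )
}

date_summary <- summarise_dates(dt)
print(date_summary)

# Stratified version: one summary row per level of a grouping vector.
grp <- sample(1:2, 100, TRUE)
by_group <- do.call(rbind, lapply(split(dt, grp), summarise_dates))
print(by_group)
```

Working on day counts keeps the quantile arithmetic well defined regardless of how a given R version handles Date objects in `quantile()`.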
Summarises a continuous variable, returning a tibble of descriptive statistics and a plot. When a grouping variable is supplied, results are stratified by group.
For two groups, a t-test and Wilcoxon rank-sum test are reported. For three or more groups, a one-way ANOVA and Kruskal-Wallis test are reported.
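The four tests named above all exist in base R's stats package, so the reported p-values can be approximated as in the sketch below. This is an assumption-laden illustration rather than the package's own code: in particular, `t.test()` defaults to the Welch (unequal variance) form, and whether dist_sum uses that form or the pooled-variance form is not stated here.

```r
# Base R sketch of the group comparisons (not sumvar's code).
set.seed(42)
age   <- rnorm(100, mean = 30, sd = 10)
sex   <- factor(sample(c("male", "female"), 100, replace = TRUE))
group <- factor(sample(c("a", "b", "c", "d"), 100, replace = TRUE))

# Two groups: independent samples t-test and Wilcoxon rank-sum test.
p_ttest  <- stats::t.test(age ~ sex)$p.value       # Welch t-test by default
p_wilcox <- stats::wilcox.test(age ~ sex)$p.value

# Three or more groups: one-way ANOVA F-test and Kruskal-Wallis test.
p_anova   <- summary(stats::aov(age ~ group))[[1]][["Pr(>F)"]][1]
p_kruskal <- stats::kruskal.test(age ~ group)$p.value

print(c(p_ttest = p_ttest, p_wilcox = p_wilcox,
        p_anova = p_anova, p_kruskal = p_kruskal))
```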
    dist_sum(data, var, by = NULL)
| Argument | Description |
|---|---|
| data | The data frame or tibble |
| var | The continuous variable to summarise |
| by | Optional grouping variable |
A tibble with one row per group (or one row when ungrouped) containing the following columns:
n: Number of non-missing observations.
n_miss: Number of missing (NA) values.
median: Median value.
p25, p75: 25th and 75th percentiles (interquartile range boundaries).
mean: Arithmetic mean.
sd: Standard deviation.
ci_lower, ci_upper: Lower and upper bounds of the 95% confidence interval for the mean. Uses the t-distribution when n < 30, and the Z-distribution when n >= 30.
min, max: Minimum and maximum observed values.
n_outliers: Count of values more than 1.5 x IQR below Q1 or above Q3 (Tukey fence method).
shapiro_p: P-value from the Shapiro-Wilk test of normality. Returns NA when n < 3 or n > 5000 (outside the valid range of the test).
normal: Logical. TRUE if shapiro_p > 0.05, indicating no significant departure from normality at the 5% level.
p_ttest: Shown when two groups are compared. P-value from an independent-samples t-test of whether the means of the two groups differ. Assumes approximately normal distributions or large samples.
p_wilcox: Shown when two groups are compared. P-value from the Wilcoxon rank-sum test (Mann-Whitney U test), a non-parametric alternative to the t-test. Preferred over p_ttest when data are skewed, ordinal, or contain outliers, as it compares ranks rather than means and does not assume normality.
p_anova: Shown when three or more groups are compared. P-value from a one-way analysis of variance (ANOVA) F-test, testing whether at least one group mean differs from the others. Assumes approximately normal distributions and equal variances across groups.
p_kruskal: Shown when three or more groups are compared. P-value from the Kruskal-Wallis test, a non-parametric alternative to one-way ANOVA. Preferred over p_anova when data are skewed or the normality assumption is not met, as it compares rank distributions and does not assume normality.
All p-values are reported on the first row only; remaining rows contain NA.
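The rules for ci_lower/ci_upper, n_outliers and shapiro_p described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: the helper name `describe` is hypothetical, and Q1 and Q3 are taken from R's default quantile method (type 7), which may differ from what dist_sum uses.

```r
# Base R sketch of selected dist_sum columns (not sumvar's code).
describe <- function(x, level = 0.95) {
  n_miss <- sum(is.na(x))
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  s <- sd(x)

  # Critical value: t-distribution when n < 30, standard normal otherwise.
  tail_p <- 1 - (1 - level) / 2
  crit <- if (n < 30) qt(tail_p, df = n - 1) else qnorm(tail_p)
  half_width <- crit * s / sqrt(n)

  # Tukey fences: values beyond 1.5 x IQR from the quartiles count as outliers.
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  n_outliers <- sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)

  # Shapiro-Wilk is only defined for 3 <= n <= 5000.
  shapiro_p <- if (n < 3 || n > 5000) NA_real_ else shapiro.test(x)$p.value

  data.frame(
    n = n, n_miss = n_miss, mean = m, sd = s,
    ci_lower = m - half_width, ci_upper = m + half_width,
    p25 = q[1], p75 = q[2], n_outliers = n_outliers,
    shapiro_p = shapiro_p, normal = shapiro_p > 0.05
  )
}

set.seed(123)
x <- c(rnorm(40, mean = 30, sd = 10), NA, NA)
res <- describe(x)                      # n >= 30, so the Z critical value is used
small <- describe(c(1, 2, 3, 4, 100))   # n < 30, so the t critical value is used
print(res)
print(small)
```

In the small sample the quartiles are 2 and 4, so the fences sit at -1 and 7 and only the value 100 is counted as an outlier.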
    # Four groups: one-way ANOVA and Kruskal-Wallis tests are reported
    example_data <- dplyr::tibble(
      id = 1:100,
      age = rnorm(100, mean = 30, sd = 10),
      group = sample(c("a", "b", "c", "d"), size = 100, replace = TRUE)
    )
    dist_sum(example_data, age, group)

    # Two groups: t-test and Wilcoxon rank-sum test are reported
    example_data <- dplyr::tibble(
      id = 1:100,
      age = rnorm(100, mean = 30, sd = 10),
      sex = sample(c("male", "female"), size = 100, replace = TRUE)
    )
    dist_sum(example_data, age, sex)
    summary <- dist_sum(example_data, age, sex) # Save summary statistics as a tibble.
Reports the number of duplicate values found within a variable. If var is omitted (the default, NULL), all variables in the data frame are explored. The function accepts input from a dplyr pipe (%>%) and returns the results as a tibble,
e.g. example_data %>% dup(variable)
    dup(data, var = NULL)
| Argument | Description |
|---|---|
| data | The data frame or tibble |
| var | The variable to assess |
A tibble with the number and percentage of duplicate values found, and the number and percentage of missing (NA) values.
    example_data <- dplyr::tibble(
      id = 1:200,
      age = round(rnorm(200, mean = 30, sd = 50), digits = 0)
    )
    example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
    dup(example_data, age)

    # It is also possible to pass a whole data frame to dup and it will explore all variables.
    example_data <- dplyr::tibble(
      age = round(rnorm(200, mean = 30, sd = 50), digits = 0),
      sex = sample(c("Male", "Female"), 200, TRUE),
      favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE)
    )
    example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
    example_data$sex[sample(1:200, size = 32)] <- NA # Replace 32 values with missing.
    dup(example_data)
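As a rough guide to what a duplicate and missing-value count involves, here is a base R sketch. It is not dup's implementation, and the definitions are assumptions: a value counts as a duplicate when it repeats an earlier non-missing value (so the first occurrence is not counted), and percentages use the total number of rows as the denominator. The helper name `count_duplicates` is hypothetical.

```r
# Base R sketch of duplicate and missing counts (not sumvar's code).
count_duplicates <- function(x) {
  n_total <- length(x)
  n_missing <- sum(is.na(x))
  non_missing <- x[!is.na(x)]
  # duplicated() flags each value that repeats an earlier one.
  n_dup <- sum(duplicated(non_missing))
  data.frame(
    n = n_total,
    n_duplicates = n_dup,
    pct_duplicates = round(100 * n_dup / n_total, 1),
    n_missing = n_missing,
    pct_missing = round(100 * n_missing / n_total, 1)
  )
}

# Single variable: 1 and 2 repeat, giving three duplicates and one NA.
toy <- count_duplicates(c(1, 1, 2, NA, 2, 2, 3))
print(toy)

# Whole data frame: apply the same count to every column.
df <- data.frame(
  age = c(30, 30, 41, NA),
  sex = c("Male", "Female", "Male", NA)
)
all_vars <- do.call(rbind, lapply(df, count_duplicates))
print(all_vars)
```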
Analyses a data frame or tibble, summarising all continuous, date and categorical variables, missing data and duplicate values, and produces an HTML or PDF report.
    explorer(
      data,
      output_file = NULL,
      format = c("html", "pdf"),
      progress = TRUE,
      id_var = NULL
    )
| Argument | Description |
|---|---|
| data | A data frame or tibble to explore. |
| output_file | The name of the output file. Default uses |
| format | Output format, either |
| progress | If |
| id_var | Character vector of column names to treat as IDs (not summarised). |
Outputs an HTML or PDF summary report. PDF output typically takes longer to generate.
For PDF output, a LaTeX distribution must be installed. TinyTeX is recommended. To install, run:
install.packages("tinytex")
tinytex::install_tinytex()
    ## Not run:
    library(dplyr) # provides %>% and across()

    # Build example data from mtcars with some factors and a date column:
    cars_example <- mtcars %>%
      dplyr::mutate(
        dplyr::across(c(vs, am, gear, carb, cyl), as.factor),
        date_var = as.Date("2025-06-01") + sample(-300:300, nrow(mtcars), replace = TRUE),
        id = dplyr::row_number()
      )

    # To run explorer on the example data:
    explorer(cars_example)                                # with progress bar
    explorer(cars_example, progress = FALSE)              # omit progress bar
    explorer(cars_example, format = "pdf")                # PDF output
    explorer(cars_example, format = "pdf", id_var = "id") # Identify ID variable
    ## End(Not run)
The sumvar package explores continuous, date and categorical variables. sumvar brings the ease and simplicity of the "sum" and "tab" commands from Stata to R.
To explore a continuous variable, use dist_sum(). You can stratify by a grouping variable: df %>% dist_sum(var, group)
To explore dates, use dist_date(); usage is the same as dist_sum().
To summarise a single categorical variable use tab1(), e.g. df %>% tab1(var). For a two-way table, use tab(), e.g. df %>% tab(var1, var2). Both include options for frequentist hypothesis tests.
Explore duplicates and missing values with dup().
All functions are tidyverse/dplyr-friendly and accept the %>% pipe, outputting results as a tibble. You can save outputs for further manipulation, e.g. summary <- df %>% dist_sum(var).
Maintainer: Alexander Stockdale [email protected] [copyright holder]
Useful links:
Report bugs at https://github.com/alstockdale/sumvar/issues
Creates a cross-tabulation of two categorical variables with row or column
percentages, row and column totals, and optional hypothesis tests. Prints a
formatted table to the console (similar to Stata's tab command) with
the column variable name displayed as a spanning header above its levels.
    tab(
      data,
      variable1,
      variable2,
      show = c("row", "col", "n"),
      test = c("both", "chi", "exact", "none"),
      totals = TRUE,
      dp = 1
    )
| Argument | Description |
|---|---|
| data | The data frame or tibble. |
| variable1 | The row variable (first categorical variable). |
| variable2 | The column variable (second categorical variable). |
| show | What to display in each cell: |
| test | Hypothesis test(s) to report: |
| totals | If |
| dp | Number of decimal places for percentages. Default is |
Invisibly returns a wide-format tibble with:
First column: levels of variable1 (plus "Total" if totals = TRUE).
For each level of variable2: {level}_n (integer count) and, when show != "n", {level}_pct (numeric percentage).
total_n: row totals (when totals = TRUE).
p_chi, p_fisher: p-values on the first row, NA elsewhere (when test = "both"; individual tests add test, statistic, p_value instead).
    example_data <- dplyr::tibble(
      group1 = sample(c("a", "b", "c"), 100, replace = TRUE),
      group2 = sample(c("male", "female"), 100, replace = TRUE)
    )
    tab(example_data, group1, group2)
    tab(example_data, group1, group2, show = "col")
    tab(example_data, group1, group2, test = "none")
    result <- tab(example_data, group1, group2)
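For readers who want to see the pieces behind a two-way table, the base R sketch below builds the counts, row and column percentages, margins, and the chi-squared and Fisher's exact p-values. It is an illustration under assumptions, not tab's implementation: it assumes the "chi" and "exact" options correspond to `chisq.test()` and `fisher.test()` with their default settings, and it does not reproduce tab's console layout or returned tibble.

```r
# Base R sketch of a two-way cross-tabulation (not sumvar's code).
set.seed(10)
group1 <- sample(c("a", "b", "c"), 100, replace = TRUE)
group2 <- sample(c("male", "female"), 100, replace = TRUE)

counts <- table(group1, group2)

# Row and column percentages, rounded to one decimal place.
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
col_pct <- round(100 * prop.table(counts, margin = 2), 1)

# Counts with row and column totals added as a "Sum" margin.
with_totals <- addmargins(counts)

# Hypothesis tests of association between the two variables.
p_chi <- chisq.test(counts)$p.value
p_fisher <- fisher.test(counts)$p.value

print(with_totals)
print(row_pct)
print(c(p_chi = p_chi, p_fisher = p_fisher))
```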
Summarises frequencies and percentages for a categorical variable.
The function accepts input from a dplyr pipe (%>%) and returns the results as a tibble, e.g. example_data %>% tab1(variable)
    tab1(data, variable, ..., dp = 1)
| Argument | Description |
|---|---|
| data | The data frame or tibble |
| variable | The categorical variable you would like to summarise |
| ... | Not used. Passing additional variables raises an informative error suggesting |
| dp | The number of decimal places for percentages (default=1) |
A tibble with frequencies and percentages
    example_data <- dplyr::tibble(
      id = 1:100,
      group = sample(c("a", "b", "c", "d"), size = 100, replace = TRUE)
    )
    example_data$group[sample(1:100, size = 10)] <- NA # Replace 10 with missing
    tab1(example_data, group)
    summary <- tab1(example_data, group) # Save summary statistics as a tibble.
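A one-way frequency table of this kind can be approximated in base R as sketched below. This is an illustration rather than tab1's code, and it rests on assumptions: missing values are shown as their own row and are included in the percentage denominator, which may not match how tab1 treats them.

```r
# Base R sketch of a one-way frequency table (not sumvar's code).
group <- c(rep("a", 30), rep("b", 25), rep("c", 20), rep("d", 15), rep(NA, 10))

# useNA = "ifany" keeps missing values as a separate category.
freq <- table(group, useNA = "ifany")

out <- data.frame(
  level = names(freq),
  n = as.integer(freq),
  pct = round(100 * as.integer(freq) / sum(freq), 1)
)
print(out)
```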