Introducing {samplezoo}

By Nicholas Vietto

December 4, 2024

While preparing lecture materials for undergraduate statistics, I kept looking up documentation on how to generate distributions. This might just be a ‘me problem’ (having the memory of a goldfish 😅), but I figured, “Hey, maybe someone else has this issue too.” So I developed {samplezoo} to save time and help educators focus on teaching statistical concepts creatively rather than on setup.

Install

install.packages("samplezoo")

library(samplezoo)

How to use

The three dataset sizes are:

  • small: Generates a data frame with 100 samples.
  • medium: Generates a data frame with 1,000 samples.
  • large: Generates a data frame with 10,000 samples.

data <- samplezoo("small")
data <- samplezoo("medium")
data <- samplezoo("large")
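A quick way to confirm what each size label returns is to loop over the labels and print the dimensions. This is only a sketch: the expected row counts are copied from the descriptions above, and the `requireNamespace()` guard is my addition so the snippet does nothing if the package is not installed.

```r
# Expected row counts, taken from the size descriptions above
sizes <- c(small = 100, medium = 1000, large = 10000)

# Only run the check if {samplezoo} is installed
if (requireNamespace("samplezoo", quietly = TRUE)) {
  for (size in names(sizes)) {
    data <- samplezoo::samplezoo(size)
    cat(size, ":", nrow(data), "rows,", ncol(data), "columns\n")
  }
}
```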

What’s in each column

The output will be a data frame with the following columns:

norm: Normal distribution with mean and standard deviation parameters

norm2: Slight variation of the normal distribution

norm3: Another slight variation of the normal distribution

binom: Binomial (Bernoulli) distribution

neg: Negative binomial distribution

pois: Poisson distribution

exp: Exponential distribution

unif: Uniform distribution

beta: Beta distribution

gamma: Gamma distribution

chisq: Chi-squared distribution

t_dist: Student’s t-distribution
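For comparison, here is roughly what building a similar data frame by hand looks like in base R, which is the lookup work the package is meant to save. The parameter values below are hypothetical placeholders I picked for illustration; they are not the values {samplezoo} uses, and the two normal variants are left out for brevity.

```r
set.seed(123)  # make the draws reproducible
n <- 100       # same row count as the "small" size

# One column per distribution; every parameter here is an illustrative assumption
by_hand <- data.frame(
  norm   = rnorm(n, mean = 50, sd = 15),
  binom  = rbinom(n, size = 1, prob = 0.5),   # size = 1 gives Bernoulli draws
  neg    = rnbinom(n, size = 5, prob = 0.3),
  pois   = rpois(n, lambda = 3),
  exp    = rexp(n, rate = 0.1),
  unif   = runif(n, min = 0, max = 1),
  beta   = rbeta(n, shape1 = 2, shape2 = 5),
  gamma  = rgamma(n, shape = 2, rate = 0.5),
  chisq  = rchisq(n, df = 5),
  t_dist = rt(n, df = 10)
)

str(by_hand)
```

Each `r*()` function has its own argument names (`lambda`, `rate`, `shape1`, `df`, and so on), which is exactly the kind of detail that is easy to forget between lectures.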

Example - Negative Binomial Distribution

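library(tidyverse)  # loads ggplot2, used for the plot below
library(samplezoo)

set.seed(123)  # make the random draws reproducible

data <- samplezoo("medium")  # 1,000 samples

# Histogram of the negative binomial column
data |>
  ggplot(aes(neg)) +
  geom_histogram(binwidth = 1, fill = "orange", alpha = 0.7, color = "black") +
  theme_minimal() +
  labs(y = "Frequency") +
  theme(axis.title.x = element_blank())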
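If you want a number to go with the histogram, comparing the sample mean and variance is a quick follow-up, since a negative binomial sample typically has a variance larger than its mean. The sketch below uses the `neg` column when {samplezoo} is installed and otherwise falls back to `rnbinom()` with hypothetical parameters of my own choosing, so the exact values will differ from the package's.

```r
set.seed(123)

# Use the package column if available; otherwise simulate a stand-in
neg <- if (requireNamespace("samplezoo", quietly = TRUE)) {
  samplezoo::samplezoo("medium")$neg
} else {
  rnbinom(1000, size = 5, prob = 0.3)  # hypothetical parameters, not the package's
}

# Mean and variance side by side
c(mean = mean(neg), variance = var(neg))
```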

Final Remarks

I initially considered adding a parameter to introduce missing data into the data frames, but I decided against it to keep the package simple. For those interested in adding features like missing values, I recommend checking out the {messy} package.

Finally, there’s a lot that could be added to {samplezoo} (e.g., more columns in the ‘large’ dataset to make better use of its size, or the underscore spacing I forgot to add to the norm2 and norm3 column names 🙄), but I’m going to hold off for a bit and see whether people find it valuable.

Thanks for reading!
