9.2 Chapter 2

Bar plot: A graph of multiple rectangles where the height shows the count or proportion of observations within each category of one categorical variable; the width of the rectangles is arbitrary and does not encode data.

Boxplot: An alternative graphical summary for a quantitative variable that consists of the box (25th to 75th percentile), the line in the box (median), the tails or whiskers (extending to the most extreme observed values within 1.5*IQR of the box), and points (outliers that are farther than 1.5*IQR from the box).
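
As an illustration of how the pieces of a boxplot fit together, here is a minimal Python sketch. The function name `boxplot_summary` and the data are made up for this example, and the choice of quartile method ("inclusive") is an assumption, since different software computes percentiles slightly differently.

```python
import statistics

def boxplot_summary(values):
    """Return the pieces of a boxplot for a list of numbers (illustrative helper)."""
    # Quartile conventions differ between tools; "inclusive" is an assumption here.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme observed values that are inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up values
summary = boxplot_summary(data)
print(summary)
```

With these values, the whiskers stop at 1 and 9 and the value 30 is drawn as a separate point.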

Center of a Distribution: The typical value, which can be measured with the mode, mean, or median.

Conditional distribution: The relative frequencies within subsets, typically calculated based on a two-way table for two categorical variables.

Direction: Whether a relationship is positive, negative, or neutral.

Distribution: The way something is spread out (the way in which values vary).

Form: The type of relationship we observe (linear, curved, none).

Frequency: The count of observations.

Histogram: A graph to show the distribution of a quantitative variable. The x-axis is a number line that is divided into intervals called bins. The height of the bars shows either the frequency within intervals (counts of units that fall into that bin/interval) or the density (scaled so that the area of each bar equals the fraction of units that fall into that bin/interval). Gaps between bars are meaningful: they indicate an absence of values within an interval.
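
The difference between frequency and density heights can be sketched in a few lines of Python. The data, bin edges, and function name `histogram_heights` are hypothetical, and the sketch assumes the common convention that density is the fraction in a bin divided by the bin width, so that the bar areas sum to 1.

```python
def histogram_heights(values, start, width, n_bins):
    """Bin values into [start + i*width, start + (i+1)*width) intervals."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:  # values outside the bins are ignored
            counts[idx] += 1
    total = sum(counts)
    fractions = [c / total for c in counts]
    # Assumed convention: density = fraction / bin width (bar area = fraction).
    densities = [f / width for f in fractions]
    return counts, fractions, densities

data = [1, 2, 2, 3, 3, 3, 4, 9]  # made-up values
counts, fractions, densities = histogram_heights(data, start=0, width=2, n_bins=5)
print(counts)      # frequency heights
print(densities)   # density heights
```

Here the bin from 6 to 8 has a count of 0, which appears as a gap between bars in the plot.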

Interquartile Range: 75th percentile - 25th percentile

Marginal Distribution: If you are looking at the numerical summary of two categorical variables, the marginal distribution is the relative frequency of one of the variables, ignoring the other.
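
To make the contrast with the conditional distribution concrete, the following Python sketch computes both from a small two-way table. The table, its category labels, and the variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical two-way table of counts: rows are groups of X, columns are outcomes of Y.
table = {
    "A": {"yes": 30, "no": 10},
    "B": {"yes": 20, "no": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of X: relative frequency of each group, ignoring Y.
marginal_x = {
    group: sum(row.values()) / grand_total for group, row in table.items()
}

# Conditional distribution of Y within each group of X:
# relative frequencies computed within each subset (row).
conditional_y_given_x = {
    group: {outcome: count / sum(row.values()) for outcome, count in row.items()}
    for group, row in table.items()
}

print(marginal_x)
print(conditional_y_given_x)
```

The marginal distribution uses the grand total as the denominator, while each conditional distribution uses only the total for its own subset.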

Mean: The average value calculated as the sum of values divided by the total number of values.

Median: The middle value if you ordered all of the values.
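
A short Python sketch with made-up values shows how the mean and median can disagree when one value is far from the rest. It uses the standard library `statistics` module.

```python
import statistics

values = [1, 2, 3, 4, 100]  # made-up values with one extreme observation

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value after sorting

print(mean_value, median_value)
```

The single extreme value pulls the mean upward, while the median stays at the middle of the ordered values.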

Mode: A “peak”/“bump” in the distribution.

Mosaic plot: A graph of rectangles where the relative height of the bars shows the conditional distribution (relative frequency within subsets), and the width of the bars shows the marginal distribution (relative frequency of the X variable, ignoring the other variable).

Outlier: A point far from the rest of the observations.

Relative Frequency: The count of observations divided by the total number of observations.

Scatterplot: A graphical display for two quantitative variables in which each point represents one observation's pair of values on a coordinate plane.

Shape: When describing a distribution of a quantitative variable, you should comment on whether it is symmetric or skewed and how many modes it has.

Side-by-side bar plot: A graph of many rectangles where the height shows the count of the categories within groups in the data (for two categorical variables). The rectangles are typically colored according to the grouping categorical variable.

Simpson’s Paradox: A situation in which you come to two different conclusions if you look at results overall versus within subsets.
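
The reversal can be seen with a small numerical example. The counts below are invented purely for illustration (they do not come from any study), and the labels "A", "B", "subset 1", and "subset 2" are hypothetical.

```python
# Invented counts: (successes, total) for options A and B within two subsets.
data = {
    "subset 1": {"A": (9, 10), "B": (80, 100)},
    "subset 2": {"A": (30, 100), "B": (2, 10)},
}

# Success rates within each subset.
within = {
    subset: {option: s / n for option, (s, n) in options.items()}
    for subset, options in data.items()
}

# Overall success rates, pooling the subsets together.
overall = {}
for option in ("A", "B"):
    successes = sum(data[subset][option][0] for subset in data)
    total = sum(data[subset][option][1] for subset in data)
    overall[option] = successes / total

print(within)
print(overall)
```

In this made-up example, A has the higher rate within each subset, yet B has the higher rate overall, because the two options are spread very unevenly across the subsets.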

Skewed Left: A distribution is left-skewed if there is a long left tail.

Skewed Right: A distribution is right-skewed if it has a long right tail.

Spread (or variation): The measure of how much the values vary. Are the values concentrated around one or more values or spread out?

Stacked bar plot: A graph of rectangles where the height of the entire bar shows the marginal distribution (frequency of the X variable, ignoring the other variable), and the relative heights within one rectangle show conditional distributions (frequencies within subsets).

Stacked bar plot based on proportions: A graph of rectangles where the relative heights within one rectangle show conditional distributions (frequencies within subsets).

Standard deviation: The square root of the average squared deviation from the mean (the sample version divides by n - 1 rather than n).

Strength: The compactness of points around the average relationship.

Symmetric Distribution: A distribution is symmetric if you fold it in half and the sides match up.

Trimmed means: Drop the lowest k% and the highest k% of the values and take the mean of the rest.
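
A minimal Python sketch of a trimmed mean follows. The function name `trimmed_mean` is hypothetical, and two details are assumptions rather than part of the definition: k% is dropped from each end, and the number of dropped values is rounded down.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end (illustrative)."""
    ordered = sorted(values)
    # Assumption: round the number of dropped values down to a whole number.
    k = int(len(ordered) * percent / 100)
    if 2 * k >= len(ordered):
        raise ValueError("trimming would remove every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values
plain_mean = sum(data) / len(data)
trimmed = trimmed_mean(data, 10)
print(plain_mean, trimmed)
```

With these values the ordinary mean is 14.5, while the 10% trimmed mean is 5.5 because the extreme value 100 (and the smallest value 1) are dropped.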

Variance: The square of the standard deviation.

Z-score: A standardized value that gives the number of standard deviations a value is from the mean.
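
The standard deviation, variance, and z-score entries are linked, which a short Python sketch can show. The data are made up, and the sketch assumes the divide-by-n ("population") versions from the `statistics` module so the numbers come out round; the sample versions (`statistics.stdev`, `statistics.variance`) divide by n - 1 and would give slightly larger values.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean_value = statistics.mean(values)
# Assumption: divide-by-n versions; stdev/variance would divide by n - 1.
sd = statistics.pstdev(values)
variance = statistics.pvariance(values)

# Z-score: how many standard deviations each value is from the mean.
z_scores = [(v - mean_value) / sd for v in values]

print(mean_value, sd, variance)
print(z_scores)
```

Here the mean is 5 and the standard deviation is 2, so the value 9 has a z-score of 2: it sits two standard deviations above the mean.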