Descriptive Statistics

Understanding Descriptive Statistics

Descriptive statistics is a branch of statistics that focuses on summarizing and describing the features of a dataset. It provides simple summaries of the sample and of the measurements taken on it. These summaries can form the basis of the initial description of the data as part of a more extensive analysis, or they can be sufficient on their own for a particular intended use.

Descriptive statistics are commonly divided into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, which identify the center of a data set. Measures of variability include the standard deviation, the variance, the range (based on the minimum and maximum values), and the interquartile range. Kurtosis and skewness, which describe the shape of the distribution rather than its spread, are usually reported alongside them.

Measures of Central Tendency

Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way “central” to the set.

Mean: The mean, often referred to as the average, is the sum of all values divided by the total number of values. It is highly sensitive to outliers.
Median: The median is the middle value when the data is ordered from the smallest to the largest. It is less affected by outliers and skewed data.
Mode: The mode is the most frequently occurring value in a data set. A data set can have more than one mode when two or more values share the highest frequency.
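
The three measures above can be compared on a small example. The following is a minimal sketch using Python's standard statistics module; the sample values are hypothetical and were chosen so that one large outlier (90) is present and two values tie for the highest frequency.

```python
import statistics

# Hypothetical sample: nine values, one of which (90) is an outlier.
values = [2, 3, 3, 4, 5, 5, 6, 7, 90]

mean = statistics.mean(values)        # sum of values / number of values
median = statistics.median(values)    # middle value of the sorted data
modes = statistics.multimode(values)  # every value sharing the top frequency

# Drop the outlier to see how much each measure moves.
trimmed = [v for v in values if v != 90]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(f"with outlier:    mean={mean:.2f}, median={median}, modes={modes}")
print(f"without outlier: mean={trimmed_mean:.3f}, median={trimmed_median}")
```

In this sample the mean falls from about 13.89 to 4.375 when the outlier is removed, while the median only moves from 5 to 4.5, which illustrates why the median is described as less affected by outliers. The modes are 3 and 5, since each appears twice.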

Measures of Variability

Variability refers to how spread out the scores in a data set are, or how much the scores vary from each other and from the mean.

Standard Deviation: The standard deviation is a measure of the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.
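Variance: Variance is the average of the squared differences from the mean (dividing by n for a full population, or by n - 1 when estimating from a sample). It measures, in squared units, how far the values in the data set lie from the mean on average. The standard deviation is the square root of the variance.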
Range: The range is the difference between the highest and lowest values in a data set.
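Quartiles: Quartiles divide the ordered data set into four parts, each containing roughly the same number of observations. The first quartile (Q1) is the median of the lower half of the data set, and the third quartile (Q3) is the median of the upper half of the data set. Software packages use slightly different conventions for computing quartiles, so reported values can differ on small data sets.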
Interquartile Range (IQR): The IQR is the range between the first and the third quartiles (Q3 - Q1) and represents the middle 50% of the data.
Kurtosis: Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. Data with high kurtosis tend to have heavy tails, or outliers.
Skewness: Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values, while negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.
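
The measures in this list can be computed directly from their definitions. The sketch below is illustrative rather than definitive, and it makes several assumptions that the text does not specify: the function names describe_spread and quartiles_by_halves are hypothetical, the variance divides by n (the population form), the middle value is left out of both halves when the number of values is odd, and skewness and kurtosis are computed as standardized third and fourth moments, with kurtosis reported as excess over the normal distribution's value of 3. It also assumes at least two values that are not all identical.

```python
import math
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]  # when n is odd, the middle value is in neither half
    return statistics.median(lower), statistics.median(upper)


def describe_spread(values):
    """Summarize spread and shape; assumes len(values) >= 2 and nonzero spread."""
    n = len(values)
    mean = statistics.fmean(values)
    deviations = [x - mean for x in values]

    variance = sum(d ** 2 for d in deviations) / n  # population variance
    std_dev = math.sqrt(variance)
    q1, q3 = quartiles_by_halves(values)

    # Standardized third and fourth moments about the mean.
    skewness = (sum(d ** 3 for d in deviations) / n) / std_dev ** 3
    excess_kurtosis = (sum(d ** 4 for d in deviations) / n) / std_dev ** 4 - 3

    return {
        "variance": variance,
        "sample_variance": statistics.variance(values),  # divides by n - 1
        "std_dev": std_dev,
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Hypothetical sample with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe_spread(sample)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For this sample the population variance is 4 and the standard deviation is 2, the range is 7, and the IQR is 6 - 4 = 2. The skewness is positive (about 0.66), consistent with the longer tail toward the larger values 7 and 9.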

Graphical Summaries

In addition to numerical measures, descriptive statistics can also be represented graphically. Common graphs used in descriptive statistics include:

Histograms: A histogram is a graphical representation of the distribution of numerical data, where the data is divided into bins or intervals.
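Box Plots: Box plots, or box-and-whisker plots, show the distribution of quantitative data through the median, the quartiles, and the overall range, and they flag outliers as individual points. The mean is not part of a standard box plot, although some tools can overlay it as an extra marker.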
Scatter Plots: Scatter plots display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
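
To make the idea of bins concrete, here is a minimal sketch of the counting step behind a histogram, written in plain Python with a text rendering instead of a plotting library. The function name histogram_counts, the choice of equal-width bins, and the sample data are assumptions made for illustration; plotting tools typically offer several binning strategies. The sketch assumes the data contain at least two distinct values.

```python
def histogram_counts(values, num_bins):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for x in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin, which is closed on the right.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical data set with one value (9) far from the rest.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(data, num_bins=4)

# Render each bin as a row of '#' characters, one per observation.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f} | {'#' * count}")
```

With four bins over this data the edges are 1, 3, 5, 7, and 9, and the counts are 3, 5, 1, and 1. The nearly empty upper bins make the isolated value 9 easy to spot, which is the kind of feature a histogram is meant to reveal.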

Use of Descriptive Statistics

Descriptive statistics are used in a wide range of fields, including business, data science, and research. They are usually the first step in data analysis, presenting the basic features of the data in a study as quantitative descriptions in a manageable form. In a research study, descriptive statistics give an overview of the data and serve as a foundation for further analysis, such as inferential statistics.

Descriptive statistics are also useful for identifying trends, which can be important for decision-making in businesses and organizations. For example, a company might use descriptive statistics to learn about the average age of its customers, the median income of a community, or the most common purchase patterns.

In summary, descriptive statistics are a critical part of initial data analysis, providing a useful summary that can support and inform further data analysis and decision-making processes.