EasyUnitConverter.com

Scatter Plot Calculator

Analyze scatter plot data to find correlation, regression line, centroid, and outliers. Enter x,y pairs to get comprehensive statistical analysis of the relationship between two variables. See also our Correlation Calculator, Linear Regression Calculator, and Outlier Calculator.

How to Use the Scatter Plot Calculator

A scatter plot displays the relationship between two quantitative variables by plotting data points on a coordinate plane. This calculator analyzes your scatter plot data to determine the strength and direction of the relationship, fit a regression line, identify the centroid (center of mass), and detect potential outliers using residual analysis.

Enter your data as x,y pairs using the input rows. You can add or remove rows as needed. The calculator requires at least 3 points but works best with 10 or more. After clicking Calculate, you'll see the correlation coefficient, regression equation, centroid coordinates, and any outliers (points whose residuals differ from the mean residual by more than 2 residual standard deviations).

The centroid (x̄, ȳ) is the point through which the regression line always passes. It represents the "average" location of all data points. Outliers are identified by their residuals — the vertical distance from each point to the regression line. Points with unusually large residuals may indicate measurement errors, special cases, or the need for a non-linear model.

When interpreting scatter plot results, consider both the correlation strength and the visual pattern. A high correlation together with a clear linear trend supports a linear model. If the correlation is moderate but the scatter shows a curved pattern, a non-linear model (quadratic, exponential) may be more appropriate. Always examine outliers carefully: they may represent important phenomena or data quality issues that deserve investigation.

Formula

Centroid:

(x̄, ȳ) = (Σxᵢ/n, Σyᵢ/n)

Regression Line:

ŷ = b₀ + b₁x (passes through centroid)

where b₁ = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b₀ = ȳ - b₁x̄

Residual:

eᵢ = yᵢ - ŷᵢ = yᵢ - (b₀ + b₁xᵢ)

Outlier Detection (residual method):

Outlier if |eᵢ - ē| > 2 × s_e

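where s_e = std dev of residuals and ē = mean residual (ē = 0 for a least-squares line with an intercept, so the rule reduces to |eᵢ| > 2 × s_e)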

Correlation:

r = (nΣxy - ΣxΣy) / √[(nΣx² - (Σx)²)(nΣy² - (Σy)²)]

Example Calculation

Data: (1,2.1), (2,3.8), (3,6.2), (4,7.9), (5,10.5)

Centroid: (3.0, 6.1)

Slope: b₁ = 2.09, Intercept: b₀ = -0.17

Equation: ŷ = -0.17 + 2.09x

Correlation: r = 0.9975 (very strong positive)

R² = 0.995 (99.5% variance explained)

Residuals: 0.18, -0.21, 0.10, -0.29, 0.22

Residual std dev: 0.23 (sample standard deviation, n - 1 denominator)

No outliers detected (largest |residual| is 0.29, below 2 × 0.23 = 0.46)
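The example can be re-derived from the raw data with a short script. This is a minimal sketch of the formulas in the Formula section, not the calculator's actual code: the function name `analyze` and the dictionary keys are made up for illustration, and the residual standard deviation is assumed to be the sample version (n - 1 denominator).

```python
from math import sqrt
from statistics import mean, stdev

def analyze(points, k=2.0):
    """Centroid, least-squares line, r, residuals and flagged outliers."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)

    # Sums of squares and cross-products about the centroid.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)

    b1 = sxy / sxx                # slope
    b0 = y_bar - b1 * x_bar       # intercept; forces the line through the centroid
    r = sxy / sqrt(sxx * syy)     # algebraically equal to the raw-sums formula

    residuals = [y - (b0 + b1 * x) for x, y in points]
    e_bar = mean(residuals)
    s_e = stdev(residuals)        # assumption: sample std dev (n - 1)
    outliers = [p for p, e in zip(points, residuals) if abs(e - e_bar) > k * s_e]

    return {"centroid": (x_bar, y_bar), "b0": b0, "b1": b1, "r": r,
            "residuals": residuals, "s_e": s_e, "outliers": outliers}

example = [(1, 2.1), (2, 3.8), (3, 6.2), (4, 7.9), (5, 10.5)]
result = analyze(example)
for key, value in result.items():
    print(key, value)
```

Because b₀ is computed as ȳ - b₁x̄, the residuals sum to zero up to rounding, which is a quick sanity check on any hand calculation.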

Reference Table

Pattern | r Range | Description
Strong Positive Linear | 0.8 to 1.0 | Points cluster tightly around upward line
Moderate Positive | 0.5 to 0.8 | Upward trend with noticeable scatter
Weak Positive | 0.2 to 0.5 | Slight upward trend, much scatter
No Correlation | -0.2 to 0.2 | Random cloud, no discernible pattern
Weak Negative | -0.5 to -0.2 | Slight downward trend
Strong Negative | -1.0 to -0.8 | Points cluster around downward line
Non-Linear (Quadratic) | ≈ 0 | U-shape or inverted U, r misleading
Heteroscedastic | Varies | Spread increases with x (fan shape)
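The r ranges in the table can be turned into a small lookup. The sketch below is illustrative only. The function name `describe_r` is hypothetical, shared boundaries such as 0.8 are assigned to the stronger band, and the "Moderate Negative" band (-0.8 to -0.5) is assumed by symmetry because the table does not list it. The last two table rows describe shapes that r alone cannot identify, so they are not covered.

```python
def describe_r(r):
    """Label a correlation coefficient using the ranges from the reference table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = abs(r)
    if size < 0.2:
        return "No Correlation"
    if size >= 0.8:
        strength = "Strong"
    elif size >= 0.5:
        strength = "Moderate"   # negative side assumed by symmetry
    else:
        strength = "Weak"
    direction = "Positive" if r > 0 else "Negative"
    return f"{strength} {direction}"

for value in (0.95, 0.6, 0.1, -0.3, -0.9):
    print(value, describe_r(value))
```

A label from this function says nothing about curvature or changing spread, so it complements a look at the plot rather than replacing it.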

Frequently Asked Questions

What is a scatter plot?

A scatter plot (or scatter diagram) is a graph that displays the relationship between two quantitative variables. Each observation is represented as a point with coordinates (x, y). The pattern of points reveals the type (linear, curved), direction (positive, negative), and strength of the relationship. Scatter plots are the first step in correlation and regression analysis, helping you visually assess whether a linear model is appropriate.

What is the centroid of a scatter plot?

The centroid is the point (x̄, ȳ): the mean of all x-values paired with the mean of all y-values. It represents the "center of gravity" of the data cloud. The least-squares regression line always passes through the centroid. This property is useful for quickly sketching the regression line: plot the centroid, then use the slope to draw the line through it. The vertical and horizontal lines through the centroid divide the scatter plot into four quadrants.

How are outliers detected in scatter plots?

This calculator uses the residual method: a point is flagged as an outlier if the absolute value of its residual (the signed vertical distance from the regression line) exceeds 2 standard deviations of all residuals. Other methods include leverage (unusual x-values), Cook's distance (influence on the regression), and studentized residuals. Outliers can be legitimate extreme values, measurement errors, or indicators that the linear model is inappropriate for certain data ranges.
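The difference between the residual method and leverage can be seen on a small made-up data set. In the sketch below, the last point has an extreme x-value and pulls the fitted line toward itself, so its residual stays modest while its leverage is large. The data, the function names, the sample (n - 1) standard deviation, and the common rule of thumb "leverage above 4/n" for a line with two parameters are all assumptions for illustration, not the calculator's documented behavior.

```python
from statistics import mean, stdev

def fit(xs, ys):
    """Least-squares intercept and slope, plus x̄ and Sxx for reuse."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1, x_bar, sxx

def residual_flags(xs, ys, k=2.0):
    """Indices whose residual is more than k residual std devs from the mean residual."""
    b0, b1, _, _ = fit(xs, ys)
    e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    e_bar, s_e = mean(e), stdev(e)
    return [i for i, ei in enumerate(e) if abs(ei - e_bar) > k * s_e]

def leverage_flags(xs):
    """Indices whose leverage h = 1/n + (x - x̄)²/Sxx exceeds 4/n."""
    n = len(xs)
    x_bar = mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [i for i, x in enumerate(xs) if 1 / n + (x - x_bar) ** 2 / sxx > 4 / n]

# Eight points close to y = x, plus one point far to the right and well below that trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 7.8, 10.0]

print("residual method:", residual_flags(xs, ys))
print("leverage method:", leverage_flags(xs))
```

On this data the residual method flags nothing, while the leverage check singles out the last point, which is why a point that looks unremarkable in a residual list can still dominate the fit.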

What if my scatter plot shows a curved pattern?

If points follow a curve rather than a line, linear regression is inappropriate. The correlation coefficient r may be near zero even though a strong non-linear relationship exists. Consider: (1) polynomial regression (quadratic, cubic), (2) exponential or logarithmic models, (3) transforming variables (log, square root), or (4) non-parametric methods. Always plot your data before computing statistics — a scatter plot reveals patterns that numbers alone cannot.
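A tiny example shows how r can hide a perfect non-linear relationship. The sketch below uses made-up points lying exactly on y = x², symmetric about x = 0, and compares the correlation of y with x against the correlation of y with the squared term x² (the quadratic term from option 1). The helper name `pearson_r` is illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation from deviations about the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]                      # exact U-shape

r_linear = pearson_r(xs, ys)                   # y against x
r_quadratic = pearson_r([x ** 2 for x in xs], ys)  # y against x²

print("r for y vs x: ", r_linear)
print("r for y vs x²:", r_quadratic)
```

Here r is 0 for y against x even though y is fully determined by x, and 1 once the squared term is used as the predictor.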

What is heteroscedasticity in a scatter plot?

Heteroscedasticity means the spread (variance) of y-values changes across different x-values — the scatter plot looks like a fan or cone shape. This violates a key assumption of linear regression (constant variance). It doesn't bias the regression coefficients but makes standard errors unreliable. Solutions include: weighted least squares, log transformation of y, or robust standard errors. A residual plot (residuals vs. x) makes heteroscedasticity easier to detect.
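A rough numeric companion to the residual plot is to compare the size of the residuals at low and high x. The sketch below is a crude split-sample comparison on made-up data, not a formal test. It fits one line, orders the residuals by x, and divides the mean squared residual in the upper half by that in the lower half. The function name `spread_ratio` and both data sets are hypothetical.

```python
from statistics import mean

def spread_ratio(xs, ys):
    """Mean squared residual for the high-x half divided by the low-x half."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Order residuals by x so the list splits into low-x and high-x halves.
    ordered = sorted(zip(xs, ys))
    residuals = [y - (b0 + b1 * x) for x, y in ordered]
    half = len(residuals) // 2
    low = mean([e * e for e in residuals[:half]])
    high = mean([e * e for e in residuals[-half:]])
    return high / low

xs = list(range(1, 21))
# Fan shape: the deviation from y = 2x grows in proportion to x.
fan = [2 * x + (0.1 * x if x % 2 else -0.1 * x) for x in xs]
# Constant spread: the deviation from y = 2x is always 0.5 in size.
flat = [2 * x + (0.5 if x % 2 else -0.5) for x in xs]

print("fan ratio: ", spread_ratio(xs, fan))
print("flat ratio:", spread_ratio(xs, flat))
```

A ratio well above 1 suggests the spread grows with x, while a ratio near 1 is consistent with constant variance. How far from 1 counts as meaningful depends on the sample size, so treat the number as a prompt to inspect the residual plot.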

How many data points do I need for a meaningful scatter plot?

While you can plot any number of points, meaningful statistical analysis generally needs at least 10-20 observations. With fewer than 10 points, correlation estimates are unstable and single outliers have excessive influence. Moderate correlations need more data than is often assumed: a sample correlation of r ≈ 0.3 reaches statistical significance at the 5% level (two-sided) only with roughly 45 or more points, and reliably detecting a true correlation of that size (80% power) takes about 85. Larger samples (50-100+) give more reliable estimates of the true relationship and better outlier detection. Quality matters too: ensure data covers the full range of both variables.
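How many points are "enough" depends on the size of the correlation you hope to detect, the significance level, and the power you want. One widely used planning approximation is based on the Fisher z-transformation. The sketch below assumes a two-sided test at the 5% level with 80% power by default. The function name `n_for_correlation` is illustrative, and the result is an approximation rather than an exact requirement.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect a true correlation r (Fisher z method)."""
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = standard_normal.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(abs(r))) ** 2 + 3)

for target in (0.1, 0.3, 0.5):
    print(target, n_for_correlation(target))
```

Under these assumptions a true correlation of 0.3 calls for about 85 points, and the requirement grows quickly as the target correlation shrinks.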