Scatter Plot Maker

Create scatter plots with trend lines and correlation analysis. See also our Correlation Calculator, Linear Regression Calculator, and Bar Graph Maker.

My Scatter Plot

Trend: y = 0.927x + 1.600 | r = 0.9655 | r² = 0.9321

How to Use the Scatter Plot Maker

Enter X and Y values for each data point. Click Add Point to add more rows. Enable the trend line to see a linear regression fit through your data. The chart displays the regression equation (y = mx + b) along with the correlation coefficient (r) and coefficient of determination (r²). The plot auto-scales to fit all data points and draws grid lines for reference.

Features

SVG scatter plot with data point markers
Linear regression trend line (dashed red)
Correlation coefficient (r) calculation
R-squared value for goodness of fit
Grid lines and axis labels
Auto-scaling axes
Add/remove data points dynamically

Interpreting Results

The correlation coefficient r ranges from -1 to +1. Values near +1 indicate strong positive correlation (both variables increase together). Values near -1 show strong negative correlation (one increases as the other decreases). Values near 0 suggest little or no linear relationship. R² tells you what proportion of the variation in Y is explained by the linear relationship with X, and is often quoted as a percentage.

A common mistake is interpreting correlation as causation. Even a strong r value (0.9+) does not prove that changes in X cause changes in Y. There may be confounding variables, reverse causality, or coincidental relationships. Always consider the logical mechanism behind any observed correlation.

The trend line equation (y = mx + b) can be used for prediction. The slope (m) tells you how much Y changes for each unit increase in X. The intercept (b) is the predicted Y value when X equals zero. However, predictions outside your data range (extrapolation) become increasingly unreliable.
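As a rough illustration of using the equation for prediction, the sketch below plugs X values into y = mx + b and flags any prediction that falls outside the observed X range. The function names, coefficients, and observed X values are made up for this example and are not taken from the tool's implementation.

```python
def predict(x, slope, intercept):
    """Return the trend-line prediction y = slope * x + intercept."""
    return slope * x + intercept


def is_extrapolation(x, x_values):
    """True when x lies outside the range of the observed X values."""
    return x < min(x_values) or x > max(x_values)


# Hypothetical fitted coefficients and observed X values, for illustration only.
slope, intercept = 0.927, 1.600
observed_x = [1, 2, 3, 4, 5]

for x in (3, 10):
    if is_extrapolation(x, observed_x):
        note = "extrapolated, treat with caution"
    else:
        note = "within data range"
    print(f"x = {x}: predicted y = {predict(x, slope, intercept):.3f} ({note})")
```

The check only compares against the minimum and maximum X. It says nothing about how quickly reliability degrades beyond that range.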

Look for outliers — points that are far from the trend line. Outliers can significantly affect the regression calculation. A single extreme point can pull the trend line toward it, potentially misrepresenting the relationship in the rest of your data. Consider whether outliers represent errors or genuine extreme observations.

Frequently Asked Questions

What is a scatter plot used for?

Scatter plots visualize the relationship between two numeric variables. They help identify patterns, trends, correlations, clusters, and outliers in data. Each point represents one observation.

How is the trend line calculated?

The trend line uses ordinary least squares (OLS) linear regression, which minimizes the sum of squared vertical distances between points and the line. The formula finds the slope and intercept that best fit your data.

What is a good r-squared value?

It depends on the field. In physical sciences, r² > 0.9 is often expected. In social sciences, r² > 0.3 may be meaningful. Values closer to 1 indicate the model explains more of the data variation.

Can I have multiple groups of data?

This tool plots a single set of data points. For multiple groups, use our Line Graph Maker where you can create separate series with different colors.

Does correlation imply causation?

No. A strong correlation between X and Y does not mean X causes Y. There could be confounding variables, reverse causation, or coincidental relationships. Scatter plots show association, not causation.

What if my data isn't linear?

If your data follows a curved pattern, a linear trend line may fit poorly, often with a low r² and points that fall systematically above or below the line. Consider transforming variables (log, square root) or using polynomial regression for non-linear relationships.

About Scatter Plots and Regression Analysis

Scatter plots are fundamental to statistical analysis, revealing relationships between variables that are invisible in summary statistics alone. Anscombe's Quartet famously demonstrated that four datasets with nearly identical means, variances, and correlations can have completely different scatter plot patterns, which shows why visualization is essential.
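To make the Anscombe example concrete, the sketch below compares the first two of the four datasets. The helper names are chosen for this example. The two datasets share the same X values and give almost the same mean and correlation, yet the first is roughly linear and the second follows a smooth curve.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient from mean-centered sums."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Datasets I and II of Anscombe's Quartet (they share the same X values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, ys in (("I", y1), ("II", y2)):
    print(f"Dataset {name}: mean y = {mean(ys):.2f}, r = {pearson_r(x, ys):.3f}")
```

Plotting both datasets in the tool shows the difference that the summary numbers hide.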

Linear regression finds the best-fit straight line through data points by minimizing the sum of squared residuals (vertical distances between points and the line). The ordinary least squares (OLS) method has closed-form solutions: slope = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and intercept = (Σy - slope × Σx) / n. This tool calculates these values automatically.
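The closed-form formulas above translate almost directly into code. The following is a minimal sketch, not the tool's actual implementation. The function name `ols_fit` and the sample points are assumptions made for illustration.

```python
def ols_fit(xs, ys):
    """Closed-form OLS slope and intercept computed from running sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        # Happens when every X value is the same: the line would be vertical.
        raise ValueError("all X values are identical; slope is undefined")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = ols_fit(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```

The guard on the denominator matters in practice: if every point has the same X value, the formula divides by zero.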

The correlation coefficient (Pearson's r) measures the strength and direction of linear association. As a rule of thumb, values of ±0.7 to ±1.0 indicate strong correlation, ±0.4 to ±0.7 moderate correlation, and values below ±0.4 weak correlation. However, r only detects linear relationships: data that follow a symmetric parabola or a circle can show r ≈ 0 despite a clear relationship.
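A short sketch of both points: it computes Pearson's r, applies the rule-of-thumb labels quoted above, and shows that a symmetric parabola gives r = 0 even though Y is fully determined by X. The function names and data are assumptions for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient from mean-centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def strength_label(r):
    """Rule-of-thumb label: strong (>= 0.7), moderate (>= 0.4), else weak."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    return "weak"


# Made-up data: a perfect line, and a parabola symmetric around x = 0.
linear_x, linear_y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
parabola_x = [-2, -1, 0, 1, 2]
parabola_y = [x * x for x in parabola_x]

for name, xs, ys in (("line", linear_x, linear_y),
                     ("parabola", parabola_x, parabola_y)):
    r = pearson_r(xs, ys)
    print(f"{name}: r = {r:.3f} ({strength_label(r)})")
```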

R-squared (r²) represents the proportion of variance in Y explained by X. An r² of 0.85 means that 85% of the variation in Y can be predicted from X using the linear model. The remaining 15% is unexplained variance due to other factors, measurement error, or inherent randomness. Higher r² values indicate better model fit.
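One way to see what "proportion of variance explained" means is to compute r² as one minus the ratio of residual variation to total variation. The sketch below does this for a small made-up dataset. The function name `r_squared` is an assumption for this example.

```python
def r_squared(xs, ys):
    """r² = 1 - SS_res / SS_tot for a least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of Y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(f"r² = {r_squared(xs, ys):.2f}")
```

For simple linear regression this value equals the square of Pearson's r.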

Outliers deserve special attention in scatter plot analysis. A single extreme point can dramatically influence the regression line and inflate or deflate the correlation coefficient. Always examine your scatter plot visually to identify potential outliers and consider whether they represent data errors, special cases, or legitimate extreme observations.
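The effect of a single extreme point is easy to demonstrate by fitting the same data with and without it. The data below are made up: five points lie exactly on y = x, and one added point sits far above that line. The helper name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept using mean-centered sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Five points on y = x, plus one extreme point at (6, 20).
clean_x, clean_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
outlier_x, outlier_y = clean_x + [6], clean_y + [20]

slope_clean, _ = fit_line(clean_x, clean_y)
slope_outlier, _ = fit_line(outlier_x, outlier_y)
print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_outlier:.2f}")
```

In this example one point out of six triples the slope, which is why comparing the fit with and without a suspect point is a useful quick check.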

Non-linear patterns in scatter plots require different analytical approaches. If your data curves upward (exponential growth), consider taking the logarithm of the Y values. If it follows a U-shape or parabola, polynomial regression (quadratic) may be appropriate. The residuals (differences between observed and predicted values) help reveal whether a linear model is appropriate.
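As a sketch of the log transformation, the example below fits a straight line to ln(Y) for made-up data that grow exponentially (y = 2 · 3^x), then converts the slope and intercept back into a growth factor and starting value. The helper name `fit_line` and the data are assumptions for illustration.

```python
from math import exp, log


def fit_line(xs, ys):
    """Least-squares slope and intercept using mean-centered sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up exponential data: y = 2 * 3**x.
xs = [0, 1, 2, 3, 4]
ys = [2, 6, 18, 54, 162]

# Fit ln(y) = intercept + slope * x, which is linear when y grows exponentially.
log_ys = [log(y) for y in ys]
slope, intercept = fit_line(xs, log_ys)

growth_factor = exp(slope)      # multiplier applied to y per unit increase in x
initial_value = exp(intercept)  # predicted y at x = 0
print(f"y ≈ {initial_value:.2f} * {growth_factor:.2f}^x")
```

This approach only works when all Y values are positive, since the logarithm is undefined otherwise.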

Sample size affects the reliability of correlation estimates. With very few data points (under 10), correlation values can be misleading, because random chance alone may produce high r values. Generally, 30+ observations provide more stable correlation estimates. For publishable research, a statistical significance test assesses whether the observed correlation is unlikely to be due to chance alone.
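One common significance test for Pearson's r converts it to a t statistic, t = r · sqrt((n − 2) / (1 − r²)), which is compared against a t distribution with n − 2 degrees of freedom. The sketch below computes only the statistic. The function name is an assumption, and looking up the p-value is left to a statistics table or library.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    if n < 3:
        raise ValueError("need at least 3 data points")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * sqrt((n - 2) / (1 - r * r))


# The same r = 0.6 observed with a small and a larger sample.
for n in (5, 30):
    t = correlation_t_statistic(0.6, n)
    print(f"r = 0.6, n = {n}: t = {t:.3f} with {n - 2} degrees of freedom")
```

For reference, the two-sided 5% critical value is about 3.18 for 3 degrees of freedom and about 2.05 for 28. By that threshold r = 0.6 would not count as significant with 5 points but would with 30.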

Heteroscedasticity occurs when the spread of Y values changes as X increases. For example, income variation might increase with age (young people have similar low incomes, but older people range from low to very high). This violates the constant-variance assumption of standard regression: the fitted line can still be reasonable, but standard errors and significance tests become unreliable. Weighted regression or a transformation of variables may help.
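Weighted regression can be sketched as follows. Each point gets a weight, commonly the inverse of its assumed variance, so that noisier points pull less on the line. The function name, the data, and the choice of weights (1/x², which assumes the spread of Y grows in proportion to X) are all assumptions for illustration.

```python
def weighted_fit(xs, ys, weights):
    """Weighted least-squares slope and intercept for a straight line."""
    total_w = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total_w
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data whose scatter grows with x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.3, 7.4, 11.0]

equal_weights = [1.0] * len(xs)            # reduces to ordinary least squares
inverse_var = [1.0 / (x * x) for x in xs]  # assumes spread proportional to x

for label, weights in (("unweighted", equal_weights),
                       ("weighted", inverse_var)):
    slope, intercept = weighted_fit(xs, ys, weights)
    print(f"{label}: y = {slope:.3f}x + {intercept:.3f}")
```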

In practical applications, scatter plots help answer questions like: Does advertising spend correlate with sales? Does study time predict exam scores? Is there a relationship between temperature and ice cream consumption? The visual pattern combined with the correlation statistic provides both intuitive and quantitative answers.

Multiple regression extends simple linear regression to include more than one predictor variable. While this tool shows the relationship between two variables, real-world outcomes usually depend on many factors simultaneously. Scatter plots of each predictor against the outcome provide initial insights before building more complex statistical models.

Residual plots (residuals plotted against X) and plots of predicted versus actual values help diagnose whether a linear model is appropriate. If the residuals show a pattern rather than random scatter, the linear model may be inadequate. This diagnostic check is essential in rigorous statistical analysis.
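To illustrate what a patterned residual looks like, the sketch below fits a straight line to made-up data that actually follow y = x² and lists the residuals. The function name `residuals` is an assumption for this example.

```python
def residuals(xs, ys):
    """Residuals (observed minus predicted) from a least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up curved data: y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]

for x, res in zip(xs, residuals(xs, ys)):
    print(f"x = {x}: residual = {res:+.1f}")
```

The residuals are positive at both ends and negative in the middle. A U-shaped pattern like this suggests that a straight line is missing curvature in the data.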

The ecological fallacy is a common error when interpreting scatter plots of aggregated data. A correlation found at the group level may not apply to individuals within those groups. For example, countries with higher average income may have higher average life expectancy, but that does not mean a wealthier individual is necessarily healthier than a middle-class individual.

Modern data science uses scatter plot matrices (pairs plots) to visualize relationships between all combinations of variables simultaneously. While this tool handles one X-Y pair, the principles of correlation and regression apply identically when examining multiple variable pairs in larger datasets.

Leverage points are data points with extreme X values that can have outsized influence on the regression line. A high-leverage point that is also an outlier (far from the trend line) is typically an influential point, meaning that removing it would noticeably change the fitted line. Identifying and appropriately handling such points is crucial for reliable regression analysis in scientific and business contexts.
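For simple linear regression, the leverage of point i can be computed from the X values alone as h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below computes it for made-up data and flags points above 4/n, a common rule of thumb (twice the average leverage) rather than a strict cutoff. The function name is an assumption.

```python
def leverages(xs):
    """Leverage h_i = 1/n + (x_i - mean)^2 / Sxx for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Made-up X values: four clustered points and one far to the right.
xs = [1, 2, 3, 4, 20]
threshold = 4 / len(xs)  # rule-of-thumb cutoff, assumed for this example

for x, h in zip(xs, leverages(xs)):
    flag = "high leverage" if h > threshold else ""
    print(f"x = {x:>2}: h = {h:.3f} {flag}")
```

The leverages always sum to 2 in simple regression (one for the slope, one for the intercept), so a single value close to 1 means that point largely determines the line by itself.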