Regression is a statistical method that attempts to determine the strength and behaviour of the relationship between one dependent variable (usually denoted by Y) and a set of one or more other variables (known as independent variables). Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship between the variables by minimizing the sum of squared differences between the observed and predicted values of the dependent variable.
If your data shows a linear relationship between the X and Y variables, it is useful to find the line that best fits that relationship. The least squares regression line is the line that makes the vertical distances from the data points to the line as small as possible. It’s called “least squares” because the best-fitting line is the one that minimizes the sum of the squared errors (the residual sum of squares). Another name for the line is the “linear regression equation” (because the result is a linear equation). The line has the equation ŷ = a + bx, where a denotes the intercept, b is the slope, x is the independent variable and ŷ is the predicted value of the dependent variable. R² measures how well the regression line fits the data. Once the intercept and slope have been estimated using least squares, various indices are studied to determine the reliability of these estimates. One of the most popular of these reliability indices is the correlation coefficient.
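As a rough illustration of how the intercept and slope of the least squares line can be obtained, here is a minimal sketch using the standard closed-form formulas for a single predictor. The function name `least_squares_line` and the data points are made up for this example and are not taken from the Life Expectancy data.

```python
import numpy as np


def least_squares_line(x, y):
    """Return (a, b) for the line y_hat = a + b * x that minimizes
    the sum of squared vertical distances to the points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line always passes through (x_mean, y_mean)
    a = y_mean - b * x_mean
    return a, b


# Made-up points that lie close to a straight line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = least_squares_line(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```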
Correlation quantifies the direction and strength of the relationship between two numeric variables, X and Y. The correlation coefficient, or simply the correlation, is an index that always lies between -1 and 1. When the value is near zero, there is little or no linear relationship. As the correlation gets closer to plus or minus one, the relationship is stronger. A value of +1 indicates a perfect positive linear relationship and -1 indicates a perfect negative linear relationship between two variables.¹
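To make the range of the correlation coefficient concrete, the sketch below computes the Pearson correlation by hand for two made-up series, one perfectly increasing with x and one perfectly decreasing. The helper name `pearson_r` is an assumption for this example. NumPy's `np.corrcoef` should give the same values.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos = pearson_r(x, 2 * x + 1)   # y rises exactly in step with x
r_neg = pearson_r(x, -3 * x + 7)  # y falls exactly in step with x

print(r_pos, r_neg)
```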
The squared correlation (R²) has a special meaning in simple linear regression. It represents the proportion of the variation in Y that is explained (accounted for) by the variation in X. It is defined as the sum of squares due to the regression divided by the total sum of squares of Y adjusted for the mean. R² does not measure the magnitude of the slope and does not measure the appropriateness of a linear model. It measures the strength of the linear component of the model. When there is an intercept in the regression, |corr| = sqrt(R²) and sign(corr) = sign(regression slope of Y on X). So if the correlation is positive, the regression slope of Y on X is positive too.
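The relationship between R², the correlation and the sign of the slope can be checked numerically. The following is a small sketch on made-up, decreasing data. It fits a line with `np.polyfit`, computes R² as the regression sum of squares over the mean-adjusted total sum of squares, and compares it with the squared correlation.

```python
import numpy as np

# Made-up data where y tends to fall as x rises
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.1, 7.8, 7.2, 5.1, 4.4, 2.9])

# polyfit returns coefficients from highest degree down: slope, then intercept
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

ss_regression = np.sum((y_hat - y.mean()) ** 2)  # explained variation
ss_total = np.sum((y - y.mean()) ** 2)           # total variation about the mean
r_squared = ss_regression / ss_total

r = np.corrcoef(x, y)[0, 1]

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
print(f"sign of r = {np.sign(r)}, sign of slope = {np.sign(b)}")
```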
Ordinary least squares implementation in Python
OLS can be carried out with various Python packages such as statsmodels, NumPy, pandas and SciPy. For this article, we will be exploring the statsmodels package.
The data used is the Life Expectancy data from Kaggle. It has 22 columns covering Year, Country, Life Expectancy and features that might affect life expectancy. For this project, we studied the influence of Alcohol on Life Expectancy for Nigeria from 2005 to 2013. Rather than the raw values, we fit the model on the pct_change() or diff() of each series, because two series can both rise or fall over time without actually being correlated. The pandas pct_change() function computes the percentage change from the immediately previous row by default, which is useful for comparing changes across the elements of a time series.
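Here is a small sketch of what `pct_change()` and `diff()` produce. The numbers and the column names (`year`, `alcohol`, `life_expectancy`) are invented for illustration and are not the real Kaggle values or headers. Note that `pct_change()` returns a fraction, so 0.25 means a 25% increase, and that the first row has no previous row and is therefore dropped.

```python
import pandas as pd

# Made-up numbers for illustration only; column names are hypothetical
df = pd.DataFrame(
    {
        "year": [2008, 2007, 2006, 2005],
        "alcohol": [9.9, 9.0, 10.0, 8.0],
        "life_expectancy": [52.0, 51.5, 51.0, 50.0],
    }
)

# Put the rows in chronological order so "previous row" means "previous year"
series = df.sort_values("year").set_index("year")

# Relative change from the previous year (first year becomes NaN and is dropped)
pct = series.pct_change().dropna()

# Absolute change from the previous year
diffs = series.diff().dropna()

print(pct)
print(diffs)
```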
Use the add_constant function of statsmodels (imported as sm) to add a column of 1s so that the model can estimate an intercept. The value of the correlation coefficient is unchanged if either X or Y is multiplied by a positive constant or if a constant is added.
The first element in the fitted model's parameters array (fit.params), alpha, is the intercept and the second element is the slope.
To see more details about the model such as the R², adjusted R², F-statistic, log-likelihood and other relevant statistics, you can print a summary of the model.
fit.summary()
For the full notebook for this article, check out the GitHub gist: https://gist.github.com/AniekanInyang/06a7fbb8940b59371e5fb5d7f1f6af88
I hope this was helpful and that you are able to apply OLS to your time series data to see how they correlate over time. Feel free to reach out on LinkedIn, Twitter or send an email: contactaniekan at gmail dot com if you want to chat about this or anything else.
Stay safe.