Personally, I think the hardest part of starting out in data science is answering the question: where do I start? Most people worry about mathematics: how much do I actually need? What kind of math? How deep should I go? Almost all resources agree that you should know some statistics, linear algebra, calculus and optimization, and they usually throw a few books at you, because reading books doesn't hurt, right? I think it actually can hurt: if you spend too much time and mental energy on understanding the books, you have none left for the things that actually drive your value as a data scientist, such as coding, data wrangling, and familiarity with basic tech (git, SQL, docker...). Five hours of practicing pandas can have a large impact on the value you bring as a data scientist; five hours spent on determinants is, at this stage, of doubtful value. But is the recommendation really to never look at math because you don't need it?

Not at all, but we should take into account something that economists call the law of diminishing returns (https://en.wikipedia.org/wiki/Diminishing_returns). Simply put, each additional hour you put into learning a particular skill is worth less and less, because you have already picked all the low-hanging fruit. Take pandas data wrangling as an example: if you spend one more hour on it after using it for a couple of years, you will not change the way you use it and will probably not discover anything new, because you have already mastered most of it. How does this relate to learning math for data science? In the beginning, math has quite a bad cost-benefit ratio, but as you learn pandas, sklearn, SQL... their returns diminish and math moves into a better position, so you can finally pick up that Linear Algebra book by Gilbert Strang you have heard so much about. Unfortunately, this could mean that you don't touch any math at all in the first few months. This hands-on-first philosophy is becoming popular thanks to projects such as fastai, which helps you build state-of-the-art DL models with a reasonable level of understanding and without any math.

Although I like this approach in general, I think there is a little bit of math with a good enough cost/benefit ratio to be worth learning even as a beginner. This is my attempt to provide those "math" basics: the ones that were useful in my data science practice and that I wish I had known when I started.

Everything that follows is basic statistics together with the basics of linear models. If you understand the relationships between the two deeply enough, your intuition about linear models and data analysis will grow rapidly and help you discover flaws in your analyses and models more quickly. It can also help guide modeling and interpretation.

Simple statistics

I recommend simply memorizing the following few basic statistics.


Sample mean

$$ \bar{x}=\frac{1}{n} \sum_{i=1}^{n} x_{i} $$

Sample variance

$$ s_{y}^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2} $$

Sample standard deviation

$$ s_{y}={\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}} $$

Sample covariance

$$ s_{x y}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right) $$

Pearson correlation coefficient

$$ r_{x y}=\frac{s_{x y}}{s_{x} s_{y}} $$

Pearson correlation coefficient in long form

$$ r_{x y} = \frac{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}} $$
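The formulas so far translate almost one to one into NumPy. The sketch below computes each statistic by hand on a small made-up sample (the numbers are an assumption for illustration only) and compares the results with NumPy's built-in functions, where `ddof=1` selects the $n-1$ denominator.

```python
import numpy as np

# Small made-up sample, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Sample means.
x_bar = x.sum() / n
y_bar = y.sum() / n

# Sample variances and standard deviations (note the n - 1 denominator).
var_x = ((x - x_bar) ** 2).sum() / (n - 1)
var_y = ((y - y_bar) ** 2).sum() / (n - 1)
s_x, s_y = np.sqrt(var_x), np.sqrt(var_y)

# Sample covariance and the two forms of the Pearson correlation coefficient.
s_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)
r_short = s_xy / (s_x * s_y)
r_long = ((x - x_bar) * (y - y_bar)).sum() / np.sqrt(
    ((x - x_bar) ** 2).sum() * ((y - y_bar) ** 2).sum()
)

# NumPy equivalents; ddof=1 selects the n - 1 denominator.
assert np.isclose(var_x, np.var(x, ddof=1))
assert np.isclose(s_y, np.std(y, ddof=1))
assert np.isclose(s_xy, np.cov(x, y)[0, 1])
assert np.isclose(r_short, r_long)
assert np.isclose(r_short, np.corrcoef(x, y)[0, 1])

print(x_bar, var_x, s_xy, r_short)
```

Note that `np.var` and `np.std` default to `ddof=0` (dividing by $n$), while `np.cov` defaults to the $n-1$ denominator, so it is worth being explicit.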

Z-score (standardization of a variable to zero mean and unit variance)

$$ z_{i}=\frac{x_{i}-\overline{x}}{s_{x}} $$

The Pearson correlation coefficient can also be written as the average cross-product of the z-scores (with $n-1$ in the denominator)

$$ r_{x y}=\frac{1}{n-1} \sum_{i=1}^{n} z_{x, i} \cdot z_{y, i} $$
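To see the z-score form in action, here is a minimal sketch on made-up data. The helper name `z_score` is hypothetical (it does not come from any library), and it uses the sample standard deviation so that it matches the definition above.

```python
import numpy as np

# Small made-up sample, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)


def z_score(v):
    """Standardize v to zero mean and unit sample variance (n - 1 denominator)."""
    return (v - v.mean()) / v.std(ddof=1)


z_x, z_y = z_score(x), z_score(y)

# Correlation as the (n - 1)-averaged cross-product of z-scores.
r_from_z = (z_x * z_y).sum() / (n - 1)

# The standardized variable has zero mean and unit sample variance.
assert np.isclose(z_x.mean(), 0.0) and np.isclose(z_x.var(ddof=1), 1.0)
# The z-score form agrees with NumPy's correlation coefficient.
assert np.isclose(r_from_z, np.corrcoef(x, y)[0, 1])
```

If the z-scores are computed with the population standard deviation instead (dividing by $n$), the cross-products have to be averaged with $1/n$ to get the same $r_{xy}$.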

A nice way to explore correlation and build your intuition: https://rpsychologist.com/d3/correlation/


Normal equations for multiple linear regression

Multiple regression

$$ y_{i} = \beta_{1}+\beta_{2} x_{2 i}+\beta_{3} x_{3 i}+\cdots+\beta_{k} x_{k i}+\varepsilon_{i} \quad(i=1, \cdots, n) $$

Multiple regression - matrix notation

$$ y=X \beta+\varepsilon $$

If $b$ is an estimate of $\beta$, the unobserved errors $\varepsilon$ are replaced by the residuals $e$

$$ y=X b+e $$

Residuals

$$ e=y-X b $$

Sum of squares of the residuals

$$ \begin{aligned} S(b) &=\sum e_{i}^{2}=e^{\prime} e=(y-X b)^{\prime}(y-X b) \\ &=y^{\prime} y-y^{\prime} X b-b^{\prime} X^{\prime} y+b^{\prime} X^{\prime} X b \end{aligned} $$

The least squares estimator is obtained by minimizing S(b)

$$ \frac{\partial S}{\partial b}=-2 X^{\prime} y+2 X^{\prime} X b $$

We set these derivatives equal to zero, which gives the normal equations

$$ X^{\prime} X b=X^{\prime} y $$

Solving for b (assuming $X^{\prime} X$ is invertible) gives the least squares estimator

$$ b=\left(X^{\prime} X\right)^{-1} X^{\prime} y $$
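The derivation above can be checked numerically. The sketch below simulates data with arbitrarily chosen "true" coefficients (an assumption for illustration), solves the normal equations, and confirms that the gradient of $S(b)$ is zero at the solution. It uses `np.linalg.solve` rather than forming the inverse explicitly; that is a numerical convenience of this sketch, not something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Design matrix X: a column of ones (intercept beta_1) plus two regressors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])             # "true" coefficients, chosen arbitrarily
y = X @ beta + rng.normal(scale=0.1, size=n)  # y = X beta + epsilon

# Solve the normal equations X'X b = X'y for b.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals and the gradient of S(b) evaluated at the solution.
e = y - X @ b
gradient = -2 * X.T @ y + 2 * X.T @ X @ b

# Cross-check against NumPy's least squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(b, b_lstsq)
assert np.allclose(gradient, 0.0, atol=1e-8)
print(b)
```

The zero gradient is the same statement as $X^{\prime} e = 0$: the residuals are orthogonal to every column of $X$, including the column of ones, so they sum to zero.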

Relationship between regression and correlation coefficient for the univariate case

b for simple regression (univariate x) can be written as:

$$ b=\frac{\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sum\left(x_{i}-\overline{x}\right)^{2}} $$
$$ b= \frac{s_{x y}}{s_{x}^{2}} $$
$$ b=r_{x y} \frac{s_{y}}{s_{x}} $$
$$ \frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum\left(x_{i}-\bar{x}\right)^{2}}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}} * \frac{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}} $$
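As a quick sanity check, the three expressions for the slope can be evaluated on a small made-up sample (the data is an assumption for illustration) and compared with the slope returned by `np.polyfit`.

```python
import numpy as np

# Small made-up sample, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
x_bar, y_bar = x.mean(), y.mean()

# Slope from the sums of squares and cross-products.
b_sums = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()

# Slope as covariance divided by the variance of x.
b_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Slope as correlation times the ratio of standard deviations.
r_xy = np.corrcoef(x, y)[0, 1]
b_corr = r_xy * np.std(y, ddof=1) / np.std(x, ddof=1)

# Reference: degree-1 least squares fit; polyfit returns [slope, intercept].
b_fit = np.polyfit(x, y, deg=1)[0]

assert np.allclose([b_sums, b_cov, b_corr], b_fit)
print(b_sums, r_xy)
```

Here the slope and the correlation differ because the standard deviations of the two variables differ.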

If x and y have the same variance (for example, after both are standardized):

$$ \sum\left(x_{i}-\overline{x}\right)^{2} = \sum\left(y_{i}-\overline{y}\right)^{2} $$

Then

$$ b=r_{x y} $$

Check empirically that b == coefficient of correlation if x std. == y std.

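# tm.makeTimeDataFrame is a pandas testing helper that newer pandas releases may no longer provide.
# It returns random data, so the numbers below change from run to run.
import pandas._testing as tm
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

df = tm.makeTimeDataFrame(freq='M')
df[['A', 'B']].corr()
          A         B
A  1.000000  0.042426
B  0.042426  1.000000
# Standardize both variables so that they have equal variance
X_scaled = StandardScaler().fit_transform(df[['A']])
y_scaled = StandardScaler().fit_transform(df[['B']])
lin_reg = LinearRegression().fit(X_scaled, y_scaled)
lin_reg.coef_
array([[0.04242638]])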

Relationship between regression coefficient and coefficient of determination

Simple regression (univariate x) could be written as:

$$ b=\frac{\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sum\left(x_{i}-\overline{x}\right)^{2}} $$
$$ y_{i}=a+b x_{i}+e_{i} $$

The difference from the mean $(y_{i}-\overline{y})$ can be decomposed into a sum of two components:

  • a component corresponding to the difference from the mean of the explanatory variable $(x_{i}-\overline{x})$
  • an unexplained component described by the residual $e_{i}$

$$ y_{i}-\overline{y}=b\left(x_{i}-\overline{x}\right)+e_{i} $$

Squaring both sides and summing over $i$, the cross term $2 b \sum\left(x_{i}-\overline{x}\right) e_{i}$ vanishes because least squares residuals are uncorrelated with x, which leaves:

$$ \begin{array}{c}{\sum\left(y_{i}-\overline{y}\right)^{2}=b^{2} \sum\left(x_{i}-\overline{x}\right)^{2}+\sum e_{i}^{2}} \\ {S S T=S S E+S S R}\end{array} $$

Note: SST = total sum of squares, SSE = explained sum of squares, SSR = sum of squared residuals. Sometimes you may encounter the meanings switched, with SSE as the sum of squared errors and SSR as the explained (regression) sum of squares.

$$ R^{2}=\frac{S S E}{S S T}=\frac{b^{2} \sum\left(x_{i}-\overline{x}\right)^{2}}{\sum\left(y_{i}-\overline{y}\right)^{2}} $$
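The decomposition is easy to verify numerically. The sketch below uses a small made-up sample (an assumption for illustration) and the standard intercept formula $a=\overline{y}-b \overline{x}$, which is not derived in the text above.

```python
import numpy as np

# Small made-up sample, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
x_bar, y_bar = x.mean(), y.mean()

# Least squares slope and intercept for simple regression.
b = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
a = y_bar - b * x_bar  # standard intercept formula, assumed here

# Residuals of the fitted line.
e = y - (a + b * x)

sst = ((y - y_bar) ** 2).sum()           # total sum of squares
sse = b ** 2 * ((x - x_bar) ** 2).sum()  # explained sum of squares
ssr = (e ** 2).sum()                     # sum of squared residuals
cross = ((x - x_bar) * e).sum()          # cross term, zero for least squares

r_squared = sse / sst

assert np.isclose(cross, 0.0)
assert np.isclose(sst, sse + ssr)
print(sst, sse, ssr, r_squared)
```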

Substituting the expression for b shows that the coefficient of determination is really just the squared correlation between x and y (this holds only for simple regression)

$$ R^{2}=\frac{\left(\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)\right)^{2}}{\sum\left(x_{i}-\overline{x}\right)^{2} \sum\left(y_{i}-\overline{y}\right)^{2}} $$
$$ r_{x y} = \frac{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}} $$
$$ r_{x y}^{2} = R^{2} $$
$$ R^{2}=1-\frac{\sum e_{i}^{2}}{\sum\left(y_{i}-\overline{y}\right)^{2}} $$
$$ R^{2}=1-\frac{S S R}{S S T} $$
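One detail worth seeing in numbers: squaring discards the sign, so $R^{2}$ alone gives only the magnitude of the correlation, and the sign comes from the slope. The sketch below uses made-up, negatively related data (an assumption for illustration).

```python
import numpy as np

# Made-up data with a negative relationship, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 4.0, 5.0, 2.0, 1.0])

# Degree-1 least squares fit; polyfit returns [slope, intercept].
b, a = np.polyfit(x, y, deg=1)
e = y - (a + b * x)

# Coefficient of determination from residuals: R^2 = 1 - SSR / SST.
ssr = (e ** 2).sum()
sst = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ssr / sst

r_xy = np.corrcoef(x, y)[0, 1]

# R^2 equals the squared correlation ...
assert np.isclose(r_squared, r_xy ** 2)
# ... and the correlation is recovered only together with the sign of the slope.
assert np.isclose(r_xy, np.sign(b) * np.sqrt(r_squared))
print(r_xy, r_squared)
```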

Check empirically that squared correlation coef. == coef. of determination

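import altair as alt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# df is the random time-series frame created earlier
df[['A', 'B']].corr() ** 2
        A       B
A  1.0000  0.0018
B  0.0018  1.0000
X, y = df[['A']], df['B']
lin_reg = LinearRegression().fit(X, y)
r2_score(y, lin_reg.predict(X))
0.0017999979073189953
# Plot r2 = r ** 2 for r between -1 and 1
df_r = pd.DataFrame(data=np.linspace(-1, 1, 10), columns=['r']).assign(r2=lambda x: x['r'] ** 2)
alt.Chart(df_r).mark_line().encode(x='r', y='r2').interactive()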

Relationship between coefficient of determination and mean squared error

The coefficient of determination is just one minus the mean squared error divided by the variance of y (the population variance, with $n$ in the denominator)

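$$ R^{2}(y, \hat{y})=1-\frac{\sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\overline{y}\right)^{2}} $$
$$ \operatorname{MSE}(y, \hat{y})=\frac{1}{n_{\text { samples }}} \sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\hat{y}_{i}\right)^{2} $$

Dividing the numerator and the denominator of the fraction in $R^{2}$ by $n_{\text { samples }}$ gives

$$ R^{2}(y, \hat{y})=1-\frac{\operatorname{MSE}(y, \hat{y})}{\frac{1}{n_{\text { samples }}} \sum_{i=0}^{n_{\text { samples }}-1}\left(y_{i}-\overline{y}\right)^{2}} $$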

Check empirically that 1 - mean squared error / variance of y == coef. of determination

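import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_reg = LinearRegression().fit(X, y)
r2_score(y, lin_reg.predict(X))
0.0017999979073189953
# ddof=0 divides by n, matching the 1/n in MSE (y.var() would divide by n - 1)
1 - mean_squared_error(y, lin_reg.predict(X)) / np.var(y, ddof=0)
0.0017999979073189953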