
[Datacamp] Introduction to Regression with statsmodels

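What is regression?

A statistical model that describes the relationship between an explanatory variable (also called an independent variable, X) and a response variable (also called a dependent variable, Y), so that Y can be predicted from X.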

Types of regression

Linear regression: the response variable is numeric (real-valued)
Logistic regression: the response variable is logical, i.e. it is judged as true or false
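
The difference between the two types can be sketched with statsmodels, which offers `ols` for a numeric response and `logit` for a true/false response. This is only a rough sketch on synthetic data; the column names `x` and `churned` and the coefficients 0.5 and 2 are made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic data (hypothetical): the chance of churned == 1 rises with x
rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=300)
prob = 1 / (1 + np.exp(-(0.5 + 2 * x)))
toy = pd.DataFrame({"x": x, "churned": (rng.random(300) < prob).astype(int)})

# logit models a 0/1 response; disp=0 silences the optimizer log
logit_fit = logit("churned ~ x", data=toy).fit(disp=0)

# predict() returns probabilities between 0 and 1, not raw 0/1 labels
predicted_prob = logit_fit.predict(toy)
print(logit_fit.params)
```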

# Look at the relationship between two variables

import seaborn as sns

sample_df = sns.load_dataset('taxis', cache=True, data_home=None) # explanation: t.ly/aymd
sns.regplot(x='distance', y='fare', data=sample_df, ci=None)

# ci controls the confidence-interval band drawn around the regression line; ci=None turns it off

Python packages for regression

statsmodels: optimized for insight; this DataCamp course mostly uses this package
scikit-learn: optimized for prediction

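# Sample linear regression with statsmodels
# (response_var, expla_var and dataset are placeholders for your own column names and DataFrame)

# Import
from statsmodels.formula.api import ols # ordinary least squares

# Create the model object
linear_model = ols('response_var ~ expla_var', data=dataset)

# Fit the model (fit() returns a results object, which replaces the unfitted model here)
linear_model = linear_model.fit()

# Print the parameters of the fitted model
print(linear_model.params)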

In the parameters computed by ols, Intercept is the bias, i.e. the y-intercept, and the coefficient listed under the explanatory variable's name is the slope.
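
A minimal sketch of reading the intercept and slope from a fitted model, assuming a small synthetic dataset. The column names `x` and `y` and the true values 2 and 3 are made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (hypothetical): y = 2 + 3 * x plus a little noise
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy_df["y"] = 2 + 3 * toy_df["x"] + rng.normal(0, 0.5, size=50)

toy_fit = ols("y ~ x", data=toy_df).fit()

# params is a pandas Series indexed by "Intercept" and the variable name
intercept = toy_fit.params["Intercept"]  # y-intercept (bias)
slope = toy_fit.params["x"]              # change in y per unit of x
print(intercept, slope)
```

The estimates should land close to 2 and 3, since that is how the data was generated.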

Categorical explanatory variables

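Use sns.displot to check the distribution of the response variable (fare here) within each category.

import matplotlib.pyplot as plt

# col: categorical feature; one panel is drawn per category
# (pickup_zone has far too many distinct values for a readable grid, so pickup_borough is used instead)
sns.displot(data=sample_df,
         x='fare',
         col='pickup_borough', # split the data by categories
         bins=10)

# Show the plot
plt.show()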

Extrapolating

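Extrapolating means predicting for values outside the range of the observed data, and it needs care. The prediction can be completely useless, for example a negative body weight.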
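
The danger can be sketched with a tiny made-up dataset. The `length_cm` and `mass_g` columns and their values are hypothetical, chosen only so that the fitted line crosses zero.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical measurements: lengths between 20 and 40 cm only
fish = pd.DataFrame({"length_cm": [20, 25, 30, 35, 40],
                     "mass_g": [210, 440, 720, 930, 1200]})
fish_fit = ols("mass_g ~ length_cm", data=fish).fit()

# 30 cm lies inside the observed range; 10 cm is far outside it
new_lengths = pd.DataFrame({"length_cm": [10, 30]})
predicted_mass = fish_fit.predict(new_lengths)
print(predicted_mass)
```

The prediction for 30 cm is sensible, while the one for 10 cm comes out negative, which no real mass can be.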

Attributes of Models

# prediction on original dataset
model.fittedvalues # = model.predict(data)

# measuring inaccuracy: residual = actual value - predicted value
model.resid

# coefficients/parameters
model.params

# summary method
model.summary()
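
A short sketch checking what these attributes hold, assuming a synthetic dataset with made-up columns `x` and `y`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (hypothetical)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"x": np.arange(30, dtype=float)})
demo["y"] = 1.5 * demo["x"] + rng.normal(0, 2, size=30)

model = ols("y ~ x", data=demo).fit()

# fittedvalues holds the predictions on the data the model was fitted to
same_predictions = np.allclose(model.fittedvalues, model.predict(demo))

# resid holds actual value minus predicted value
same_residuals = np.allclose(model.resid, demo["y"] - model.fittedvalues)

print(same_predictions, same_residuals)
```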

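Regression to the mean

Why residuals (errors) occur

Imperfection of the model: factors that the explanatory variable cannot explain are not reflected in the model
Randomness

Regression to the mean explains why the predictions for extreme values are not themselves as extreme (reference post)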
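
A rough simulation of the effect, assuming made-up father and son heights where a son inherits only half of his father's deviation from the average. All numbers here are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical heights in cm: sons keep half of the father's deviation from 175, plus noise
rng = np.random.default_rng(2)
n = 1000
father = rng.normal(175, 7, size=n)
son = 175 + 0.5 * (father - 175) + rng.normal(0, 6, size=n)
heights = pd.DataFrame({"father": father, "son": son})

height_fit = ols("son ~ father", data=heights).fit()
slope = height_fit.params["father"]

# Prediction for a very tall father (195 cm)
tall = pd.DataFrame({"father": [195.0]})
predicted_son = height_fit.predict(tall).iloc[0]
print(slope, predicted_son)
```

Because the slope is below 1, the predicted son is taller than average but less extreme than his 195 cm father.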

Transforming variables

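Things to try when the data does not follow a linear relationship, or when its range is skewed to one side

Square or cube: e.g. when length data is used to predict something related to volume
Square root: for right-skewed data, i.e. when the right tail of the distribution is long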
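
A minimal sketch of the square-root case on synthetic, right-skewed data. The columns `x`, `sqrt_x`, and `y` and the generating formula are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic right-skewed x (hypothetical); y grows with the square root of x
rng = np.random.default_rng(3)
skewed = pd.DataFrame({"x": rng.exponential(scale=100, size=200)})
skewed["y"] = 4 * np.sqrt(skewed["x"]) + rng.normal(0, 2, size=200)

# Transform the explanatory variable before fitting
skewed["sqrt_x"] = np.sqrt(skewed["x"])

raw_fit = ols("y ~ x", data=skewed).fit()
sqrt_fit = ols("y ~ sqrt_x", data=skewed).fit()

r2_raw = raw_fit.rsquared
r2_sqrt = sqrt_fit.rsquared
print(r2_raw, r2_sqrt)
```

On this data the transformed model fits better. When the response is transformed as well, predictions need to be back-transformed to the original scale.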

Quantifying model fit

Ways to evaluate a model

Coefficient of determination = R-squared
Residual standard error
Root-mean-square error
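
The three metrics can be read from a fitted statsmodels result roughly as follows. This is a sketch on synthetic data with made-up columns `x` and `y`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (hypothetical)
rng = np.random.default_rng(4)
data = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
data["y"] = 3 + 2 * data["x"] + rng.normal(0, 1.5, size=100)

fit = ols("y ~ x", data=data).fit()

# Coefficient of determination: share of the variance in y explained by the model
r_squared = fit.rsquared

# Residual standard error: sqrt of the mean squared residual, dividing by the degrees of freedom
rse = np.sqrt(fit.mse_resid)

# Root-mean-square error: the same idea, but dividing by the number of observations
rmse = np.sqrt(np.mean(fit.resid ** 2))

print(r_squared, rse, rmse)
```

With a single explanatory variable, R-squared equals the squared correlation between `x` and `y`, and RMSE is always slightly smaller than RSE.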

Visualizing model fit

A residual plot, a Q-Q plot, and a scale-location plot let you check the distribution of the residuals (errors) by eye.

Leverage & Influence

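Leverage measures how unusual or extreme the explanatory variable's value is for each observation. In the figure, since the $x$ values are clustered around 20, the points marked in red can be called high leverage.

Influence measures how much the regression model would change if a given observation were left out of the fit. (The distance of a point from the fitted model is its residual.)

Observations with high leverage or high influence are candidates to be treated as outliers.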
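
Both quantities can be pulled from a fitted statsmodels result with `get_influence().summary_frame()`. Here is a sketch on synthetic data where the $x$ values cluster around 20 and one extreme point is added by hand; all values are made up.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (hypothetical): 30 points with x near 20, plus one extreme point at x = 60
rng = np.random.default_rng(5)
x = np.append(20 + rng.normal(0, 1, size=30), 60.0)
y = np.append(5 + 2 * x[:30] + rng.normal(0, 1, size=30), 50.0)
points = pd.DataFrame({"x": x, "y": y})

fit = ols("y ~ x", data=points).fit()
diagnostics = fit.get_influence().summary_frame()

# hat_diag is the leverage; cooks_d is Cook's distance, a common influence measure
points["leverage"] = diagnostics["hat_diag"]
points["cooks_dist"] = diagnostics["cooks_d"]

most_leverage = points["leverage"].idxmax()
most_influence = points["cooks_dist"].idxmax()
print(points.sort_values("cooks_dist", ascending=False).head())
```

The hand-added point (row 30) tops both rankings: its `x` is far from the cluster, and it sits well off the trend of the other points.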

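I learned regression models with scikit-learn first, so the statsmodels workflow felt a little awkward. (a post comparing the two packages) The second hurdle was, of all things, math-related English: "raise N to the power of 3" (or "raise N to the 3rd power") simply means $N^3$, e.g. $2^3$ when N is 2. I only knew the word "exponential", so this chapter taught me an easier expression.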
