Prediction loss function
The mean of the squared residuals. In big data analysis it is also called the MSE (Mean Squared Error).
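Written out, with yhat_i denoting the model's prediction for observation i:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```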
Log-linear model
A linear regression model in which the log of the dependent variable y is taken.
log(y) is used because a nonlinear relationship between the y variable and the x variable is assumed.
Even though the dependent variable is logged, the goal of prediction is still y itself.
(A log-linear model can be used only when y > 0.)
After obtaining the OLS estimates alpha_hat and beta_hat, the predicted value of log(y) is
log(y)_hat = alpha_hat + beta_hat * x.
If the error term u in log(y) = alpha + beta*x + u is assumed to follow u ~ N(0, sigma^2), then y = exp(alpha + beta*x + u) follows a log-normal distribution.
The mean of a log-normally distributed y is
E[y|x] = exp(alpha + beta*x + sigma^2/2).
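A quick simulation illustrates why the sigma^2/2 correction is needed. This is a sketch, not from the original post; the parameter values and sample size are chosen purely for illustration:

```r
# If log(y) ~ N(mu, sigma^2), then E[y] = exp(mu + sigma^2/2), not exp(mu):
# exponentiating the fitted values alone underestimates the mean of y.
set.seed(1)
mu    <- 3
sigma <- 0.5
y <- exp(rnorm(1e6, mean = mu, sd = sigma))

mean(y)                 # simulated mean of y, close to the corrected formula
exp(mu)                 # naive retransformation: too small
exp(mu + sigma^2 / 2)   # log-normal mean formula
```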
# use the haven package to read .dta files in R
#install.packages("haven")
library(haven)
## Warning: package 'haven' was built under R version 3.5.3
data1<- haven::read_dta(file = "B_data2_1.dta")
# linear model
plot(price~crime, data = data1)
reg1<-lm(price ~ crime, data=data1)
summary(reg1)
##
## Call:
## lm(formula = price ~ crime, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.937  -5.449  -1.987   2.545  29.827
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.01338    0.40978  58.600   <2e-16 ***
## crime       -0.41585    0.04401  -9.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.496 on 504 degrees of freedom
## Multiple R-squared:  0.1505, Adjusted R-squared:  0.1488
## F-statistic: 89.28 on 1 and 504 DF,  p-value: < 2.2e-16
abline(reg1, col= "red")
# loss function (MSE : Mean Squared Error)
#str(reg1) # inspect the components stored in reg1
data1$predicted <- reg1$fitted.values
#data1$predicted
model1_MSE <- mean((data1$price - data1$predicted)^2); model1_MSE
## [1] 71.89939
reg2 <- lm(price ~ rooms, data=data1)
#str(reg2) # inspect the components stored in reg2
data1$predicted <- reg2$fitted.values
model1_2_MSE <- mean((data1$price - data1$predicted)^2); model1_2_MSE
## [1] 43.66259
# model 1 and model 1-2 compare different independent variables for the same y
# multiple linear regression model
reg3 <- lm(price ~ rooms + crime, data=data1)
#str(reg3) # inspect the components stored in reg3
data1$predicted <- reg3$fitted.values
model2_MSE <- mean((data1$price - data1$predicted)^2); model2_MSE
## [1] 38.72546
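The three in-sample MSE computations above repeat the same pattern; it can be factored into a small helper function. The name `mse` and the mtcars illustration are mine, not from the post:

```r
# Hypothetical helper: in-sample MSE of a fitted lm model
mse <- function(model, data, yname) {
  mean((data[[yname]] - predict(model, newdata = data))^2)
}

# Illustration with the built-in mtcars data
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
mse(m1, mtcars, "mpg")
mse(m2, mtcars, "mpg")  # adding a predictor never raises in-sample MSE
```

With the post's data, `mse(reg1, data1, "price")` would reproduce `model1_MSE` in the same way.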
# log-linear regression model
# assumes a nonlinear relationship between the dependent and independent variables
reg2 <- lm(log(price)~ crime, data=data1)
summary(reg2)
##
## Call:
## lm(formula = log(price) ~ crime, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.1736 -0.2041 -0.0301  0.1750  1.4538
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.124063   0.016786  186.11   <2e-16 ***
## crime       -0.025131   0.001803  -13.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.348 on 504 degrees of freedom
## Multiple R-squared:  0.2783, Adjusted R-squared:  0.2768
## F-statistic: 194.3 on 1 and 504 DF,  p-value: < 2.2e-16
# Method 1 : yhat = exp(alpha + beta*x)
# this is the wrong (biased) method
data1$yhat1 <- exp(reg2$fitted.values)
# Method 2: yhat = exp(alpha + beta*x + sigma^2/2)
# Residual standard error: 0.348 on 504 degrees of freedom (from summary(reg2) above)
# estimate of sigma (the residual standard error)
sig <- 0.348
data1$yhat2 <- exp(reg2$coefficients[1] + reg2$coefficients[2]*data1$crime + (sig^2)/2)
model3_MSE <- mean((data1$price - data1$yhat2)^2); model3_MSE
## [1] 68.65176
# model 1 : 71.89
# model 3 : 68.65, so model 3 is the better fit
###################################################
# out of sample prediction
# step 1 : random sampling
set.seed(1234) # make the random draw reproducible
bs <- sample(1:504, 354, replace = FALSE)
# step 2 : construct train and test sets
train <- data1[bs,]
test <- data1[-bs,]
# step 3 : Estimate LRM
reg4 <- lm(price ~ crime, data=train)
summary(reg4)
##
## Call:
## lm(formula = price ~ crime, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.243  -5.823  -2.235   3.099  29.315
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.31532    0.51585  47.136  < 2e-16 ***
## crime       -0.39315    0.05314  -7.398 1.03e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.963 on 352 degrees of freedom
## Multiple R-squared:  0.1346, Adjusted R-squared:  0.1321
## F-statistic: 54.73 on 1 and 352 DF,  p-value: 1.025e-12
# step 4 : obtain the predicted values from the test set
yhat_test <- predict(object = reg4, newdata = test) # apply the fitted model to the test set
model_1_prediction <- mean((test$price - yhat_test)^2); model_1_prediction
## [1] 53.92957
# apply the log-linear model
# note where the train set is used and where the test set is used
reg5 <- lm(log(price)~ crime, data=train)
summary(reg5)
##
## Call:
## lm(formula = log(price) ~ crime, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.17990 -0.21665 -0.03749  0.19752  1.23997
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.129977   0.020761  150.76   <2e-16 ***
## crime       -0.022794   0.002139  -10.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3607 on 352 degrees of freedom
## Multiple R-squared:  0.2439, Adjusted R-squared:  0.2418
## F-statistic: 113.6 on 1 and 352 DF,  p-value: < 2.2e-16
# estimate of sigma: residual standard error from summary(reg5)
sig <- 0.3607
test$yhat4 <- exp(reg5$coefficients[1] + reg5$coefficients[2]*test$crime + (sig^2)/2)
# MSE of the log-linear model, evaluated on the test set
model_4_prediction <- mean((test$price - test$yhat4)^2); model_4_prediction
## [1] 51.74055
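The four steps of the out-of-sample comparison can be condensed into one function. This is a sketch with an assumed name (`oos_mse`), shown on the built-in mtcars data since B_data2_1.dta is not bundled with R:

```r
# Hypothetical wrapper: fit on the train set, compute MSE on the test set
oos_mse <- function(formula, train, test, yname) {
  fit  <- lm(formula, data = train)
  yhat <- predict(fit, newdata = test)
  mean((test[[yname]] - yhat)^2)
}

set.seed(1234)                            # reproducible split
idx <- sample(seq_len(nrow(mtcars)), 22)  # roughly 70% of rows for training
tr  <- mtcars[idx, ]
te  <- mtcars[-idx, ]
oos_mse(mpg ~ wt, tr, te, "mpg")
```

With the post's data, `oos_mse(price ~ crime, train, test, "price")` would reproduce `model_1_prediction`.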