Loss function (prediction loss)

Mean of squared residuals:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

In big-data analysis this is also called the MSE (Mean Squared Error).
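As a minimal sketch with made-up numbers (the vectors below are hypothetical, not from the data set used later), the MSE is just the average of the squared differences between observed and predicted values:

```r
# Hypothetical observed and predicted values
y    <- c(10, 12, 15, 11)
yhat <- c(11, 12, 14, 13)

# MSE = mean of squared residuals
mse <- mean((y - yhat)^2)
mse  # (1 + 0 + 1 + 4) / 4 = 1.5
```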



Log-linear model

A linear regression model in which the dependent variable y is log-transformed: log(y) = α + βx + ε

log(y) is used because a nonlinear relationship between the y and x variables is assumed.

Even though the dependent variable is log-transformed, the goal is still to predict y itself.

(The log-linear model can be used only when y > 0.)

After obtaining the OLS estimators α̂ and β̂, the predicted value of log(y) is α̂ + β̂x.

If the error term ε is assumed to follow N(0, σ²), then y follows a log-normal distribution.


The mean of y, which follows a log-normal distribution, is

E[y | x] = exp(α + βx + σ²/2)
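This correction term σ²/2 can be checked by simulation; a self-contained sketch with arbitrary values for the mean and standard deviation of log(y):

```r
# Sketch: for log-normal y, E[y] = exp(mu + sigma^2/2), not exp(mu)
set.seed(1)
mu    <- 2     # stands in for alpha + beta*x (arbitrary value)
sigma <- 0.5   # arbitrary error standard deviation

e <- rnorm(1e6, mean = 0, sd = sigma)  # error term of log(y)
y <- exp(mu + e)                       # y is log-normally distributed

mean(y)              # simulated mean of y
exp(mu + sigma^2/2)  # corrected formula, theoretical mean (about 8.37)
exp(mu)              # naive back-transform: underestimates E[y]
```

The naive exp(mu) here is about 7.39, visibly below the simulated mean, which is why Method 1 below is biased.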

# use the haven package to read .dta files in R
#install.packages("haven")
library(haven)

## Warning: package 'haven' was built under R version 3.5.3

data1<- haven::read_dta(file = "B_data2_1.dta")


# linear model
plot(price~crime, data = data1)
reg1<-lm(price ~ crime, data=data1)
summary(reg1)

##
## Call:
## lm(formula = price ~ crime, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -16.937  -5.449  -1.987   2.545  29.827
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.01338    0.40978  58.600   <2e-16 ***
## crime       -0.41585    0.04401  -9.449   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.496 on 504 degrees of freedom
## Multiple R-squared:  0.1505, Adjusted R-squared:  0.1488
## F-statistic: 89.28 on 1 and 504 DF,  p-value: < 2.2e-16

abline(reg1, col= "red")


# loss function (MSE : Mean Squared Errors)

#str(reg1) # inspect the values stored in reg1
data1$predicted <- reg1$fitted.values
#data1$predicted
model1_MSE <- mean((data1$price - data1$predicted)^2) ;model1_MSE

## [1] 71.89939

reg2<-lm(price ~ rooms, data=data1)
#str(reg2) # inspect the values stored in reg2
data1$predicted <- reg2$fitted.values
model1_2_MSE <- mean((data1$price - data1$predicted)^2) ;model1_2_MSE

## [1] 43.66259

# model 1 vs. model 2 compares models with different independent variables


# multiple linear regression model
reg3<-lm(price ~ rooms + crime, data=data1)
#str(reg3) # inspect the values stored in reg3
data1$predicted <- reg3$fitted.values
model2_MSE <- mean((data1$price - data1$predicted)^2) ;model2_MSE

## [1] 38.72546

# log-linear regression model
# assume a nonlinear relationship between the dependent and independent variables
reg2 <- lm(log(price)~ crime, data=data1)
summary(reg2)

##
## Call:
## lm(formula = log(price) ~ crime, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.1736 -0.2041 -0.0301  0.1750  1.4538
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.124063   0.016786  186.11   <2e-16 ***
## crime       -0.025131   0.001803  -13.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.348 on 504 degrees of freedom
## Multiple R-squared:  0.2783, Adjusted R-squared:  0.2768
## F-statistic: 194.3 on 1 and 504 DF,  p-value: < 2.2e-16

# Method 1 : yhat = exp(alpha + beta*x)
# this method is incorrect (it ignores the error-term variance)
data1$yhat1 <- exp(reg2$fitted.values)

# Method 2: yhat = exp(alpha + beta*x + sigma^2/2)
summary(reg2)

##
## Call:
## lm(formula = log(price) ~ crime, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.1736 -0.2041 -0.0301  0.1750  1.4538
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.124063   0.016786  186.11   <2e-16 ***
## crime       -0.025131   0.001803  -13.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.348 on 504 degrees of freedom
## Multiple R-squared:  0.2783, Adjusted R-squared:  0.2768
## F-statistic: 194.3 on 1 and 504 DF,  p-value: < 2.2e-16

# Residual standard error: 0.348 on 504 degrees of freedom
# estimate of sigma, taken from the residual standard error above

sig <- 0.348
data1$yhat2 <- exp(reg2$coefficients[1]
                   +reg2$coefficients[2]*data1$crime + (sig^2)/2)
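Instead of hard-coding the residual standard error and writing out the coefficients, the same Method 2 correction can be expressed with `summary()$sigma` and `predict()`. A self-contained sketch on simulated data (the variables x, y, and the true coefficients 1 and 0.2 are made up for illustration):

```r
# Sketch with simulated data: corrected back-transform via predict()
set.seed(42)
x <- runif(500, 0, 10)
y <- exp(1 + 0.2 * x + rnorm(500, sd = 0.3))  # true log-linear relationship

fit <- lm(log(y) ~ x)
sig <- summary(fit)$sigma               # residual standard error, no hard-coding
yhat <- exp(predict(fit) + sig^2 / 2)   # Method 2 in one line

mean((y - yhat)^2)                      # MSE of the corrected predictions
```

Reading sigma from `summary(fit)$sigma` avoids the rounding introduced by typing 0.348 by hand.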


model3_MSE <- mean((data1$price - data1$yhat2)^2) ;model3_MSE

## [1] 68.65176

# model 1 : 71.89
# model 3 : 68.65
# hence the log-linear model (model 3) fits better in-sample

###################################################
# out of sample prediction
# random sampling
set.seed(1234) # so the random draw is reproducible
bs <- sample(1:504, 354, replace = FALSE)

# step 2 : construct train and test sets
train <- data1[bs,]
test<-data1[-bs,]


# step 3 : Estimate LRM

reg4 <- lm(price ~ crime, data=train)
summary(reg4)

##
## Call:
## lm(formula = price ~ crime, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.243  -5.823  -2.235   3.099  29.315
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.31532    0.51585  47.136  < 2e-16 ***
## crime       -0.39315    0.05314  -7.398 1.03e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.963 on 352 degrees of freedom
## Multiple R-squared:  0.1346, Adjusted R-squared:  0.1321
## F-statistic: 54.73 on 1 and 352 DF,  p-value: 1.025e-12

# step 4 : obtain the predicted values from the test set
yhat_test <- predict(object = reg4, newdata = test) # apply the trained model to the test set
model_1_prediction <- mean((test$price-yhat_test)^2) ; model_1_prediction

## [1] 53.92957

# apply the log-linear model
# note where the train set and the test set are each used
reg5 <- lm(log(price)~ crime, data=train)
summary(reg5)

##
## Call:
## lm(formula = log(price) ~ crime, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.17990 -0.21665 -0.03749  0.19752  1.23997
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.129977   0.020761  150.76   <2e-16 ***
## crime       -0.022794   0.002139  -10.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3607 on 352 degrees of freedom
## Multiple R-squared:  0.2439, Adjusted R-squared:  0.2418
## F-statistic: 113.6 on 1 and 352 DF,  p-value: < 2.2e-16

sig <- 0.3607
test$yhat4 <- exp(reg5$coefficients[1]
                  +reg5$coefficients[2]*test$crime + (sig^2)/2)

# MSE of the log-linear model,
# evaluated on the test set
model_4_prediction <- mean((test$price -test$yhat4)^2) ;model_4_prediction

## [1] 51.74055


