R: Statistics-Related Functions

2019. 3. 2. 16:23

mean(x)

Computes the mean.

cumsum(x)

#cumulative sum

Computes the cumulative sum.

> cumsum(1:10)
 [1]  1  3  6 10 15 21 28 36 45 55

var(x)

Computes the sample variance.

Equivalent to sum((x-mean(x))^2)/(length(x)-1).
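The equivalence above can be checked directly. This is a minimal sketch; the vector x is hypothetical data chosen only for illustration.

```r
# Hypothetical sample, not taken from the post
x <- c(7, 11, 15, 19, 42)

# Sample variance written out by hand (divides by n - 1)
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

# Both comparisons should be TRUE
isTRUE(all.equal(var(x), manual_var))
isTRUE(all.equal(sd(x), sqrt(manual_var)))  # sd is the square root of var
```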

sd(x)

#standard deviation

Computes the sample standard deviation.

quantile(x, probs)

Computes sample quantiles. With no probs argument it returns the minimum, the three quartiles, and the maximum.

> quantile(x)
  0%  25%  50%  75% 100% 
   7   11   15   19   42 

fivenum(x)

Returns the minimum, Q1, Q2 (median), Q3, and maximum (Tukey's five-number summary).

> fivenum(x)
[1]  7 11 15 19 42

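quantile(x) and fivenum(x) agree on the data above, but they are not always identical: fivenum uses Tukey's hinges, while quantile interpolates (type 7 by default). A small sketch with hypothetical data where the two differ:

```r
x <- 1:6  # hypothetical data, not from the post

five <- fivenum(x)   # hinges:           1.00 2.00 3.50 5.00 6.00
q    <- quantile(x)  # default (type 7): 1.00 2.25 3.50 4.75 6.00

# Arbitrary quantiles are requested through the probs argument
tails <- quantile(x, probs = c(0.1, 0.9))
```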

summary(x)

Returns the minimum, Q1, Q2 (median), mean, Q3, and maximum.

> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.00   11.00   15.00   17.52   19.00   42.00 

cor(x,y)

#correlation

Computes the correlation coefficient (Pearson by default).

> cor(x,y) # correlation coefficient
[1] 0.9008112

runif(n, min, max)

# random uniform

Generates n random numbers from a uniform distribution between min and max.

> runif(10,1,10)
 [1] 4.448011 9.894589 1.112249 2.571152 4.286473 8.525339
 [7] 8.576296 2.287501 3.849232 3.099007

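rnorm(n, mean, sd)

# random normal distribution

Generates n random numbers from a normal distribution with the given mean and standard deviation (the third argument is the standard deviation, not the variance).

> x <- rnorm(20, 5, 3); x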
 [1]  1.275738  2.122108  2.395099  2.289289  3.317601
 [6]  5.415063  4.130932  8.699943  4.211740 10.140457
[11]  2.873645  3.661168 11.720456  3.069014  4.998389
[16]  1.160205  4.620774  4.379887  4.600128  5.893047

pnorm(x, mean, sd)

Returns the probability P(X <= x) for a normal distribution with the given mean and standard deviation.

> pnorm(2,0,1)

[1] 0.9772499

qnorm(p, mean, sd)

Returns the x that satisfies p = P(X <= x).

> qnorm(0.05,0,1)
[1] -1.644854
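pnorm and qnorm are inverses of each other, which the following short sketch illustrates (the variable names are chosen here for illustration):

```r
# P(X <= 2) for a standard normal distribution
p <- pnorm(2, mean = 0, sd = 1)

# Feeding that probability back into qnorm recovers 2
x_back <- qnorm(p, mean = 0, sd = 1)

# A common use: the two-sided 95% critical value, about 1.96
z <- qnorm(0.975)
```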

qqnorm(x)

Draws a normal Q-Q plot for checking whether x follows a normal distribution.

qqline(x)

Adds to the normal Q-Q plot a line through the first and third quartiles.
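The two functions are normally used together. A minimal sketch with simulated data (the seed and sample size are arbitrary assumptions):

```r
set.seed(1)                        # arbitrary seed so the sketch is repeatable
x <- rnorm(100, mean = 5, sd = 3)  # simulated data, not from the post

qq <- qqnorm(x)  # draws the plot and invisibly returns the plotted coordinates
qqline(x)        # adds the line through the first and third quartiles
```

If the points stay close to the line, a normal distribution is a plausible description of x.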

apply(x, 1 or 2, function)

Applies a function to data with two or more dimensions.

1 applies the function along the rows.

2 applies the function along the columns.

> mat <- matrix(rnorm(50,10,2),ncol=5 )
> apply(mat,1,mean)
 [1]  9.343895  9.037576 10.275932  9.432533 10.404708
 [6] 11.799625 10.756710  9.740977 10.368899 11.231161
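The difference between the two margins is easier to see on a small, fixed matrix. A sketch with hypothetical data:

```r
# 2 rows and 3 columns, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
mat <- matrix(1:6, nrow = 2)

row_means <- apply(mat, 1, mean)  # one value per row:    3 4
col_means <- apply(mat, 2, mean)  # one value per column: 1.5 3.5 5.5
```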

t.test(X ... , var.equal = T or F, paired = T or F, conf.level = 0.95)

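Performs a t-test on one or two samples.

Options

var.equal = T assumes equal variances

paired = T runs a paired comparison

conf.level = 0.95 sets the confidence level of the interval, which corresponds to testing at the 0.05 significance level

# Two-sample t-test on samples A and B with the null hypothesis that the difference in means = 0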

A <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97, 80.05, 80.03, 80.02, 80.00, 80.02)

B <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)

> t.test(A,B, var.equal=T)
 
    Two Sample t-test
 
data:  A and B
t = 3.4722, df = 19, p-value = 0.002551
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01669058 0.06734788
sample estimates:
mean of x mean of y 
 80.02077  79.97875 

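The test statistic t is 3.4722.

The degrees of freedom are n1 + n2 - 2 = 13 + 8 - 2 = 19.

The p-value (0.002551) is below 0.05, so the null hypothesis is rejected at the 0.05 significance level.

We can therefore conclude that the two population means are not equal.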
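The statistic, degrees of freedom, and p-value can be reproduced by hand from the pooled variance. A sketch using the same A and B (the intermediate variable names are chosen here for illustration):

```r
A <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04,
       79.97, 80.05, 80.03, 80.02, 80.00, 80.02)
B <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)

n1 <- length(A)
n2 <- length(B)

# Pooled variance under the equal-variance assumption
sp2 <- ((n1 - 1) * var(A) + (n2 - 1) * var(B)) / (n1 + n2 - 2)

t_stat  <- (mean(A) - mean(B)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df      <- n1 + n2 - 2
p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value

# Should agree with the built-in test
res <- t.test(A, B, var.equal = TRUE)
```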

# Paired t-test on samples x and y

x <- c(70, 80, 72, 76, 76, 76, 72, 78, 82, 64, 74, 92, 74, 68, 84)

y <- c(68, 72, 62, 70, 58, 66, 68, 52, 64, 72, 74, 60, 74, 72, 74)

> t.test (x,y, paired=T, conf.level=0.95)
 
    Paired t-test
 
data:  x and y
t = 3.1054, df = 14, p-value = 0.007749
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.722083 14.877917
sample estimates:
mean of the differences 
                    8.8 

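The test statistic t is 3.1054.

The degrees of freedom are 15 - 1 = 14.

The p-value (0.007749) is below 0.05, so the null hypothesis is rejected at the 0.05 significance level.

The paired test shows a significant difference between the population means of x and y.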
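A paired t-test is the same as a one-sample t-test on the pairwise differences. A sketch with the same x and y (d and the result names are introduced here for illustration):

```r
x <- c(70, 80, 72, 76, 76, 76, 72, 78, 82, 64, 74, 92, 74, 68, 84)
y <- c(68, 72, 62, 70, 58, 66, 68, 52, 64, 72, 74, 60, 74, 72, 74)

d <- x - y  # pairwise differences

# t statistic written out by hand: mean difference over its standard error
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))

paired     <- t.test(x, y, paired = TRUE)
one_sample <- t.test(d, mu = 0)  # same test expressed on the differences
```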

prop.test(x = c(x1, x2, ...), n = c(n1, n2, ...))

Tests hypotheses about population proportions.

# Test of the null hypothesis that the difference between two population proportions = 0

> prop.test (x=c(88,126), n=c(100,150))
 
    2-sample test for equality of proportions with
    continuity correction
 
data:  c(88, 126) out of c(100, 150)
X-squared = 0.48811, df = 1, p-value = 0.4848
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.05492737  0.13492737
sample estimates:
prop 1 prop 2 
  0.88   0.84 

The p-value (0.4848) is well above 0.05, so the null hypothesis cannot be rejected.

The data give no evidence that the two population proportions differ.
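For two groups, prop.test is equivalent to a chi-squared test on the 2x2 table of successes and failures. A sketch with the same counts (the table layout and variable names are assumptions made for illustration):

```r
successes <- c(88, 126)
trials    <- c(100, 150)

# 2x2 table: one row per group, columns are successes and failures
tab <- cbind(success = successes, failure = trials - successes)

pt_res <- prop.test(x = successes, n = trials)
cs_res <- chisq.test(tab)  # Yates' continuity correction is on by default for 2x2
```

Both calls should report the same X-squared statistic and p-value.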

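lm(formula)

# Linear model

Fits a linear model.

# Fit a simple linear regression model with dependent variable y and independent variable x

# ~ means "model the left-hand side as a function of the right-hand side"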

x <- c(3, 3, 4, 5, 6, 6, 7, 8, 8, 9)

y <- c(9, 5, 12, 9, 14, 16, 22, 18, 24, 22)

> fit <- lm(y~x)
> summary(fit)
 
Call:
lm(formula = y ~ x)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-3.6333 -2.0128 -0.3741  2.0428  3.8851 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.0709     2.7509  -0.389 0.707219    
x             2.7408     0.4411   6.214 0.000255 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 2.821 on 8 degrees of freedom
Multiple R-squared:  0.8284,    Adjusted R-squared:  0.8069 
F-statistic: 38.62 on 1 and 8 DF,  p-value: 0.0002555

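(Intercept) is the intercept.

The p-value of the intercept is large (0.707), so the null hypothesis that the intercept = 0 cannot be rejected.

The p-value of x is very small (0.000255), so the null hypothesis that the slope = 0 can be rejected.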
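For simple linear regression the least-squares estimates have a closed form, so the coefficients can be checked by hand. A sketch with the same x and y:

```r
x <- c(3, 3, 4, 5, 6, 6, 7, 8, 8, 9)
y <- c(9, 5, 12, 9, 14, 16, 22, 18, 24, 22)

# Closed-form least-squares estimates
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

fit <- lm(y ~ x)
coef(fit)  # should match c(intercept, slope)
```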

resid(fitted model)

residuals(fitted model)

Extracts the residuals from the fitted regression model.

> resid (fit)
         1          2          3          4          5 
 1.8484108 -2.1515892  2.1075795 -3.6332518 -1.3740831 
         6          7          8          9         10 
 0.6259169  3.8850856 -2.8557457  3.1442543 -1.5965770 

fitted(fitted model)

Extracts the fitted y values from the fitted regression model.

# Compare the actual y values with yhat

> y
 [1]  9  5 12  9 14 16 22 18 24 22
> fitted(fit)
        1         2         3         4         5 
 7.151589  7.151589  9.892421 12.633252 15.374083 
        6         7         8         9        10 
15.374083 18.114914 20.855746 20.855746 23.596577 
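Fitted values and residuals always add back up to the observed y, and the fitted values are just the regression line evaluated at x. A short sketch with the same data:

```r
x <- c(3, 3, 4, 5, 6, 6, 7, 8, 8, 9)
y <- c(9, 5, 12, 9, 14, 16, 22, 18, 24, 22)
fit <- lm(y ~ x)

# observed = fitted + residual
rebuilt <- fitted(fit) + resid(fit)

# fitted = intercept + slope * x
b <- coef(fit)
yhat <- b[["(Intercept)"]] + b[["x"]] * x
```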

coef(fitted model)

Returns the regression coefficients of the fitted regression model.

> coef(fit)
(Intercept)           x 
  -1.070905    2.740831 
> fit$coefficients
(Intercept)           x 
  -1.070905    2.740831 

fit$coefficients gives the same result.

confint(fitted model, level = 0.95)

Returns confidence intervals for the regression coefficients.

> confint(fit, level=0.95)
                2.5 %   97.5 %
(Intercept) -7.414523 5.272713
x            1.723735 3.757928
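Each interval is the estimate plus or minus a t critical value times its standard error. A sketch that rebuilds the table by hand from the same data (the helper variable names are chosen here):

```r
x <- c(3, 3, 4, 5, 6, 6, 7, 8, 8, 9)
y <- c(9, 5, 12, 9, 14, 16, 22, 18, 24, 22)
fit <- lm(y ~ x)

coefs <- coef(summary(fit))           # matrix with estimates and standard errors
est   <- coefs[, "Estimate"]
se    <- coefs[, "Std. Error"]
tcrit <- qt(0.975, df.residual(fit))  # two-sided 95% critical value, 8 df here

manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
```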

anova(fitted model)

Shows the ANOVA table of the fitted regression model.

> anova(fit)
Analysis of Variance Table
 
Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x          1 307.247 307.247  38.615 0.0002555 ***
Residuals  8  63.653   7.957                      
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
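The ANOVA table ties back to the summary(fit) output: R-squared is the share of the total sum of squares explained by x, and the F value is the ratio of the two mean squares. A sketch with the same data:

```r
x <- c(3, 3, 4, 5, 6, 6, 7, 8, 8, 9)
y <- c(9, 5, 12, 9, 14, 16, 22, 18, 24, 22)
fit <- lm(y ~ x)

tab <- anova(fit)
ss  <- tab[["Sum Sq"]]   # regression and residual sums of squares
ms  <- tab[["Mean Sq"]]  # the corresponding mean squares

r2     <- ss[1] / sum(ss)  # should equal Multiple R-squared
f_stat <- ms[1] / ms[2]    # should equal the F value in the table
```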

abline(fitted model)
# add straight line to a plot

Draws the regression line on an existing plot.

> abline(fit)
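abline only adds to a plot that is already open, so a scatter plot has to be drawn first. A minimal sketch with the regression data used earlier:

```r
x <- c(3, 3, 4, 5, 6, 6, 7, 8, 8, 9)
y <- c(9, 5, 12, 9, 14, 16, 22, 18, 24, 22)
fit <- lm(y ~ x)

plot(x, y)   # scatter plot of the raw data
abline(fit)  # overlay the fitted regression line
```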

chisq.test(x,p = c(...), correct = T or F)

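Chi-squared test

x = a vector or matrix of counts

p = the probabilities used for Pearson's goodness-of-fit test

correct = whether to apply Yates' continuity correction to a 2x2 table; the 2x2 example below uses correct=FALSE to turn it off

# Pearson's chi-squared goodness-of-fit test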

x<-matrix(c(773,231,238,59),nrow=1,ncol=4); x

> chi<-chisq.test(x,p=c(9/16,3/16,3/16,1/16)); chi
 
    Chi-squared test for given probabilities
 
data:  x
X-squared = 9.2714, df = 3, p-value = 0.02589
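The goodness-of-fit statistic is the sum of (observed - expected)^2 / expected over the categories. A sketch that recomputes it by hand for the same counts (passed as a plain vector here, which is an assumption of this sketch):

```r
obs <- c(773, 231, 238, 59)
p   <- c(9, 3, 3, 1) / 16  # hypothesized 9:3:3:1 ratio

expected <- sum(obs) * p
stat     <- sum((obs - expected)^2 / expected)
p_value  <- pchisq(stat, df = length(obs) - 1, lower.tail = FALSE)

res <- chisq.test(obs, p = p)  # should report the same statistic and p-value
```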

# Test of homogeneity
x<-matrix(c(31,17,109,122),ncol=2); x
> chi<-chisq.test(x,correct=FALSE); chi
 
    Pearson's Chi-squared test
 
data:  x
X-squared = 4.8114, df = 1, p-value = 0.02827
 


# Test of independence (table is a contingency table whose definition is not shown in this post)
> chisq.test(table)
 
    Pearson's Chi-squared test
 
data:  table
X-squared = 1.6851, df = 3, p-value = 0.6402

chi$observed

Observed counts

chi$expected

Expected counts

chi$residuals

Pearson residuals

chi$statistic

Test statistic

One-way ANOVA (oneway.test)

# One-way ANOVA (oneway.test)
 
a<-c(64,72,68,77,56,95)
b<-c(78,91,97,82,85,77)
c<-c(75,93,78,71,63,76)
d<-c(55,66,49,64,70,68)
 
 
data<-data.frame(a,b,c,d); data
 
data.stack<-stack(data); data.stack
 
oneway.test(values~ind, data=data.stack, var.equal=T)
boxplot(values~ind, data=data.stack)
 
# One-way ANOVA (aov)
type<-c("a","a","a","a","a","a","b","b","b","b","b","b","c","c","c","c","c","c","d","d","d","d","d","d")
y<-c(64,72,68,77,56,95,78,91,97,82,85,77,75,93,78,71,63,76,55,66,49,64,70,68)
 
type.factor<-as.factor(type); type.factor
data.aov<-aov(y~type.factor); data.aov
summary(data.aov)
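With var.equal = TRUE, oneway.test and aov run the same classical F test, so their F statistics should match. A sketch with the same four groups (built directly inside data.frame to avoid a variable named c):

```r
groups <- data.frame(
  a = c(64, 72, 68, 77, 56, 95),
  b = c(78, 91, 97, 82, 85, 77),
  c = c(75, 93, 78, 71, 63, 76),
  d = c(55, 66, 49, 64, 70, 68)
)
long <- stack(groups)  # columns: values (numeric) and ind (factor)

ow      <- oneway.test(values ~ ind, data = long, var.equal = TRUE)
aov_fit <- aov(values ~ ind, data = long)

# F value of the group term from the aov summary table
f_aov <- summary(aov_fit)[[1]][["F value"]][1]
```

Leaving var.equal at its default of FALSE makes oneway.test use Welch's correction instead, and the two results would then differ.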

Multiple confidence intervals
 
type<-c("a","a","a","b","b","b","c","c")
y<-c(92.4,91.6,92.8,91.3,91.0,91.7,93.1,93.5)
type.factor<-as.factor(type)
data.aov<-aov(y~type.factor)
summary(data.aov)
boxplot(y~type.factor)
 
 
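# Multiple t confidence intervals (Bonferroni adjustment for the 3 pairwise comparisons)
# 92.267, 91.333, 93.300 are the group means; 0.464 is approximately sqrt(MSE); 8-3 is the error degrees of freedom
t<-qt(1-(0.05/(2*3)),8-3); t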
c((92.267-91.333)-t*0.464*sqrt(1/3+1/3),(92.267-91.333)+t*0.464*sqrt(1/3+1/3))
c((92.267-93.300)-t*0.464*sqrt(1/3+1/2),(92.267-93.300)+t*0.464*sqrt(1/3+1/2))
c((91.333-93.300)-t*0.464*sqrt(1/3+1/2),(91.333-93.300)+t*0.464*sqrt(1/3+1/2))
 
 
# Individual confidence intervals
t<-qt(1-(0.05/2),8-3); t
 
c((92.267-91.333)-t*0.464*sqrt(1/3+1/3),(92.267-91.333)+t*0.464*sqrt(1/3+1/3))
c((92.267-93.300)-t*0.464*sqrt(1/3+1/2),(92.267-93.300)+t*0.464*sqrt(1/3+1/2))
c((91.333-93.300)-t*0.464*sqrt(1/3+1/2),(91.333-93.300)+t*0.464*sqrt(1/3+1/2))
 
 
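The hard-coded means and the 0.464 above can instead be taken from the fitted model, which avoids rounding. A sketch under the same Bonferroni setup; pair_ci is a hypothetical helper written for this example, not a base R function:

```r
type <- factor(c("a", "a", "a", "b", "b", "b", "c", "c"))
y    <- c(92.4, 91.6, 92.8, 91.3, 91.0, 91.7, 93.1, 93.5)
fit  <- aov(y ~ type)

means <- tapply(y, type, mean)                  # group means
n     <- tapply(y, type, length)                # group sizes
s     <- sqrt(deviance(fit) / df.residual(fit)) # sqrt(MSE)

k     <- 3  # number of pairwise comparisons
tcrit <- qt(1 - 0.05 / (2 * k), df.residual(fit))

# Hypothetical helper: Bonferroni interval for the difference of two group means
pair_ci <- function(i, j) {
  diff <- means[[i]] - means[[j]]
  half <- tcrit * s * sqrt(1 / n[[i]] + 1 / n[[j]])
  c(lower = diff - half, upper = diff + half)
}

ci_ab <- pair_ci("a", "b")
ci_ac <- pair_ci("a", "c")
ci_bc <- pair_ci("b", "c")
```

Replacing tcrit with qt(1 - 0.05 / 2, df.residual(fit)) gives the individual (unadjusted) intervals.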

Wilcoxon rank-sum test

a<-c(31.8,39.1)

b<-c(35.5,27.6,21.3)

wilcox.test(a,b,alternative="greater",correct=FALSE)

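Signed-rank test (paired=TRUE requires a and b to have the same length, so the vectors above must be replaced with paired data)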

wilcox.test(a,b,alternative="greater",paired=TRUE)

Spearman's rank correlation coefficient

cor.test(x,y,method="spearman")
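Spearman's coefficient is Pearson's correlation computed on the ranks. A sketch that reuses the paired-test vectors from earlier as example data; exact = FALSE is set here because the data contain ties:

```r
x <- c(70, 80, 72, 76, 76, 76, 72, 78, 82, 64, 74, 92, 74, 68, 84)
y <- c(68, 72, 62, 70, 58, 66, 68, 52, 64, 72, 74, 60, 74, 72, 74)

rho       <- cor(x, y, method = "spearman")
rho_ranks <- cor(rank(x), rank(y))  # Pearson correlation of the ranks

# cor.test reports the same estimate together with a p-value
res <- cor.test(x, y, method = "spearman", exact = FALSE)
```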
