cor: correlation
cor():
cor(x, y , method = c("pearson", "kendall", "spearman"))
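Computes the correlation between x and y. x and y should be numeric vectors; if they are matrices, the correlations between the columns of x and the columns of y are computed. method selects the correlation coefficient; the default is Pearson's correlation coefficient.
For how the Pearson and Spearman coefficients are calculated, see the notebook.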
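A minimal sketch of the three call styles described above. The heights and weights are copied from the `test` data frame used in the lm() section; the variable names `r_pearson`, `r_spearman` and `r_on_ranks` are only illustrative.

```r
# Heights and weights copied from the `test` data frame used later in these notes
height <- c(180, 165, 175, 173, 160, 165)
weight <- c(75, 58, 72, 68, 60, 55)

# Pearson (the default) measures linear association
r_pearson <- cor(height, weight)

# Spearman is Pearson's correlation computed on the ranks (ties get average ranks)
r_spearman <- cor(height, weight, method = "spearman")
r_on_ranks <- cor(rank(height), rank(weight))

# Passing a matrix returns the column-by-column correlation matrix
m <- cbind(height, weight)
print(cor(m))

print(c(pearson = r_pearson, spearman = r_spearman, on_ranks = r_on_ranks))
```

The squared Pearson coefficient here equals the Multiple R-squared that lm() reports for the same two variables.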
cor.test():

[Figure: cor.test() usage]
cor() returns only the correlation coefficient; cor.test() also reports the p-value of the significance test.
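As a rough illustration of the difference, the sketch below runs cor.test() on the same height and weight values used in the lm() section and pulls out the pieces of the result. The element names `estimate`, `p.value` and `conf.int` are part of the object cor.test() returns.

```r
height <- c(180, 165, 175, 173, 160, 165)
weight <- c(75, 58, 72, 68, 60, 55)

res <- cor.test(height, weight)  # Pearson by default

print(res$estimate)  # correlation coefficient, same value as cor(height, weight)
print(res$p.value)   # p-value for H0: the true correlation is 0
print(res$conf.int)  # 95% confidence interval (reported for Pearson)
```

For a simple regression with one predictor, this p-value coincides with the p-value of the slope in summary(lm(height ~ weight)).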
lm(): linear fitting
lm() returns an object holding the fitted model; use summary() to inspect it.
> test
height weight gender BMI
tom 180 75 male 23.14815
cindy 165 58 female 21.30395
jimmy 175 72 male 23.51020
sam 173 68 male 22.72044
lucy 160 60 female 23.43750
lily 165 55 female 20.20202
> result=lm(height~weight,data=test)
> result
Call:
lm(formula = height ~ weight, data = test)
Coefficients:
(Intercept) weight
115.34 0.84
# The fitted equation is height = 115.34 + 0.84 * weight
> summary(result)
Call:
lm(formula = height ~ weight, data = test)
Residuals:
tom cindy jimmy sam lucy lily
1.6529 0.9336 -0.8270 0.5332 -5.7465 3.4537
Coefficients:
Estimate Std. Error t value Pr(>|t|) # Pr(>|t|) is the p-value (significance) of each coefficient
(Intercept) 115.3441 12.5825 9.167 0.000786 ***
weight 0.8400 0.1933 4.346 0.012198 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.519 on 4 degrees of freedom
Multiple R-squared: 0.8252, Adjusted R-squared: 0.7815
# R-squared (goodness of fit) and adjusted R-squared; the closer to 1, the better the fit
F-statistic: 18.89 on 1 and 4 DF, p-value: 0.0122
# p-value for the model as a whole; a value below 0.05 is conventionally taken as significant
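The numbers printed by summary() can also be pulled out of the fitted object directly. A minimal sketch, rebuilding only the height and weight columns of `test`; `new_person` is a hypothetical data frame introduced just for the prediction step.

```r
test <- data.frame(
  height = c(180, 165, 175, 173, 160, 165),
  weight = c(75, 58, 72, 68, 60, 55),
  row.names = c("tom", "cindy", "jimmy", "sam", "lucy", "lily")
)
result <- lm(height ~ weight, data = test)

print(coef(result))                  # intercept and slope
print(summary(result)$r.squared)     # Multiple R-squared
print(summary(result)$coefficients)  # estimates, std. errors, t values, p-values
print(residuals(result))             # one residual per row of `test`

# Predicted height for a hypothetical person weighing 65
new_person <- data.frame(weight = 65)
pred <- predict(result, newdata = new_person)
print(pred)
```

predict() applies the fitted equation, so the result is the intercept plus the slope times 65.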
Testing whether a data set is normally distributed
Normally distributed data
Method 1
> data=rnorm(1000)
> hist(data,prob=T)
> lines(density(data))

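[Figure: output]
Plot a histogram of the data with its density curve; if the shape is roughly bell-shaped, the data can be judged to be approximately normal.
Method 2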
qqnorm(data)
qqline(data)
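[Figure: Q-Q plot output]
If the points fall roughly along the reference line drawn by qqline() (for this standard normal sample, close to y = x), the data can be judged to be approximately normal.
Method 3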
> shapiro.test(data)
Shapiro-Wilk normality test
data: data
W = 0.99786, p-value = 0.229
# p > 0.05: normality is not rejected; p < 0.05: the data deviate significantly from a normal distribution
Data that are not normally distributed
Method 1
> a=c(rep(1,10),rep(2,5),rep(3,4),6,8,10,12,20)
> hist(a,breaks=seq(0.5,21,by=1),prob=TRUE)
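> lines(density(a),col="blue") # clearly a skewed distribution
> abline(v=median(a),col="red") # add the median as a red line
> abline(v=mean(a),col="green") # add the mean as a green line
> # for a skewed distribution, the median is more informative than the mean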

[Figure: Method 1 output]
Method 2
> qqnorm(a)
> qqline(a)

[Figure: Method 2 output]
Clearly a skewed distribution.
Method 3
> shapiro.test(a)
Shapiro-Wilk normality test
data: a
W = 0.63575, p-value = 1.589e-06
# p-value < 0.05: normality is rejected; the data are skewed
For data that are not normally distributed, the t-test is not appropriate; use the rank sum test instead.
> a=c(rep(1,10),rep(2,5),rep(3,4),6,8,10,12,20)
> b=c(rep(2,7),rep(3,5),rep(5,8),8,10,18,25)
# neither a nor b is normally distributed
> t.test(a,b)
Welch Two Sample t-test
data: a and b
t = -1.2025, df = 44.761, p-value = 0.2355
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.681665 1.181665
sample estimates:
mean of x mean of y
3.666667 5.416667
# The t-test is not appropriate for non-normal data; here p-value > 0.05, so it reports no significant difference
> wilcox.test(a,b,exact = FALSE) # rank sum test, a nonparametric test
Wilcoxon rank sum test with continuity correction
data: a and b
W = 162.5, p-value = 0.008673
alternative hypothesis: true location shift is not equal to 0
# For non-normal data, use the rank sum test; here p-value < 0.05, so the difference is significant
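To make the W statistic less of a black box, the sketch below recomputes it by hand from the pooled ranks and compares it with the value returned by wilcox.test(). The names `rank_sum_a`, `W_manual` and `W_test` are illustrative.

```r
a <- c(rep(1, 10), rep(2, 5), rep(3, 4), 6, 8, 10, 12, 20)
b <- c(rep(2, 7), rep(3, 5), rep(5, 8), 8, 10, 18, 25)

# Rank all observations together; tied values receive the average rank
r <- rank(c(a, b))
rank_sum_a <- sum(r[seq_along(a)])

# W is the rank sum of the first sample minus its smallest possible value
W_manual <- rank_sum_a - length(a) * (length(a) + 1) / 2

W_test <- unname(wilcox.test(a, b, exact = FALSE)$statistic)
print(c(manual = W_manual, wilcox.test = W_test))
```

Because the test uses only ranks, the few large values in a and b do not dominate the result the way they dominate the means in t.test().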
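Proportion test
Suppose the global influenza mortality rate is known to be 10%, and a survey in the United States finds that 51 of 400 influenza patients died. Is the US influenza mortality rate significantly higher than the global rate?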
> prop.test(51,400,p=0.1,alternative = "greater")
1-sample proportions test with continuity correction
data: 51 out of 400, null probability 0.1
X-squared = 3.0625, df = 1, p-value = 0.04006
alternative hypothesis: true p is greater than 0.1
95 percent confidence interval:
0.101422 1.000000
sample estimates:
p
0.1275
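The X-squared value and p-value in this output can be reproduced by hand, which shows what the continuity correction does. A minimal sketch for the one-sample, one-sided case only; `chi_sq` and `p_one_sided` are illustrative names.

```r
x <- 51    # observed deaths
n <- 400   # patients surveyed
p0 <- 0.1  # null (global) mortality rate

# Chi-squared statistic with continuity correction:
# shrink |observed - expected| by 0.5 before squaring
chi_sq <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

# One-sided p-value: upper tail of the standard normal at sqrt(chi_sq),
# valid here because the observed proportion is above p0
p_one_sided <- pnorm(sqrt(chi_sq), lower.tail = FALSE)

res <- prop.test(x, n, p = p0, alternative = "greater")
print(c(manual_stat = chi_sq, prop_test_stat = unname(res$statistic)))
print(c(manual_p = p_one_sided, prop_test_p = res$p.value))
```

Since the p-value is below 0.05, the US mortality rate in this example is judged significantly higher than 10%.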
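Chi-squared test and Fisher's exact test
[Figure: chi-squared test example: does the rate of tracheitis differ significantly between smokers and non-smokers]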
> data=rbind(c(50,250),c(8,10))
> rownames(data)=c("cig","non-cig")
> colnames(data)=c("qiguanyan","non-qiguanyan")
> chisq.test(data) # chi-squared test; note that the contingency table (here a matrix) is passed in directly
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 7.0225, df = 1, p-value = 0.008049
Warning message:
In chisq.test(data) : Chi-squared approximation may be incorrect
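# This is a warning, not an error. It appears because an expected cell count is too small (the non-cig group has only 18 subjects). When this warning appears, use Fisher's exact test below instead.
> fisher.test(data) # Fisher's exact test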
Fisher's Exact Test for Count Data
data: data
p-value = 0.007489
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.08452923 0.77224505
sample estimates:
odds ratio
0.25151
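The warning from chisq.test() can be traced back to the expected counts. The sketch below computes them by hand and compares them with the table stored in the test result. The threshold of 5 is the usual rule of thumb and is stated here as an assumption, not something taken from these notes.

```r
data <- rbind(c(50, 250), c(8, 10))
rownames(data) <- c("cig", "non-cig")
colnames(data) <- c("qiguanyan", "non-qiguanyan")

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(data), colSums(data)) / sum(data)
print(expected)

# chisq.test() stores the same table; suppressWarnings() hides the known warning
res <- suppressWarnings(chisq.test(data))
print(res$expected)

# Rule of thumb (assumption): fall back to Fisher's exact test
# when any expected count is below 5
use_fisher <- any(expected < 5)
p_value <- if (use_fisher) fisher.test(data)$p.value else res$p.value
print(c(use_fisher = use_fisher, p_value = p_value))
```

Here the smallest expected count belongs to the non-cig / qiguanyan cell, which is why the approximation is flagged.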
[Figure: chi-squared test example: does the number of patients at each cancer stage differ between men and women]
ANOVA example
step1: take a quick look at the data
> cholesterol
trt response
1 1time 3.8612
2 1time 10.3868
3 1time 5.9059
4 1time 3.0609
5 1time 7.7204
6 1time 2.7139
7 1time 4.9243
8 1time 2.3039
9 1time 7.5301
10 1time 9.4123
11 2times 10.3993
12 2times 8.6027
13 2times 13.6320
14 2times 3.5054
15 2times 7.7703
16 2times 8.6266
17 2times 9.2274
18 2times 6.3159
19 2times 15.8258
20 2times 8.3443
21 4times 13.9621
22 4times 13.9606
23 4times 13.9176
24 4times 8.0534
25 4times 11.0432
26 4times 12.3692
27 4times 10.3921
28 4times 9.0286
29 4times 12.8416
30 4times 18.1794
31 drugD 16.9819
32 drugD 15.4576
33 drugD 19.9793
34 drugD 14.7389
35 drugD 13.5850
36 drugD 10.8648
37 drugD 17.5897
38 drugD 8.8194
39 drugD 17.9635
40 drugD 17.6316
41 drugE 21.5119
42 drugE 27.2445
43 drugE 20.5199
44 drugE 15.7707
45 drugE 22.8850
46 drugE 23.9527
47 drugE 21.5925
48 drugE 18.3058
49 drugE 20.3851
50 drugE 17.3071
> boxplot(response~trt,data=cholesterol)

[Figure: output]
The box plot suggests that the response differs between treatments.
step2: check normality and homogeneity of variance
> shapiro.test(cholesterol$response)
# Test for normality; p > 0.05 means no significant departure from a normal distribution
Shapiro-Wilk normality test
data: cholesterol$response
W = 0.97722, p-value = 0.4417
> bartlett.test(response~trt,data=cholesterol)
# Test homogeneity of variance; p > 0.05 means the group variances do not differ significantly
Bartlett test of homogeneity of variances
data: response by trt
Bartlett's K-squared = 0.57975, df = 4, p-value = 0.9653
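The check above applies shapiro.test() to the pooled response. The normality assumption of ANOVA concerns the residuals within groups, so a common alternative is to test the residuals of the fitted model. A minimal sketch, using the built-in PlantGrowth data as a stand-in for cholesterol (an assumption made so the example runs without extra packages):

```r
# PlantGrowth ships with R; it stands in for `cholesterol` here (assumption)
fit_pg <- aov(weight ~ group, data = PlantGrowth)

# Normality of the residuals rather than of the pooled response
res_norm <- shapiro.test(residuals(fit_pg))

# Homogeneity of variance across groups, same call shape as in the notes
var_homog <- bartlett.test(weight ~ group, data = PlantGrowth)

print(c(shapiro_p = res_norm$p.value, bartlett_p = var_homog$p.value))
```

Both p-values are read the same way as above: values over 0.05 give no evidence against the assumption.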
step3: analysis of variance
> fit<-aov(response~trt,data=cholesterol)
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
trt 4 1351.4 337.8 32.43 9.82e-13 ***
Residuals 45 468.8 10.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
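# Pr(>F) is the p-value; it is below 0.001, which is highly significant, so the treatment groups differ and post-hoc multiple comparisons are justified
step4: post-hoc multiple comparisons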
> TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = response ~ trt, data = cholesterol)
$trt
diff lwr upr p adj
2times-1time 3.44300 -0.6582817 7.544282 0.1380949
4times-1time 6.59281 2.4915283 10.694092 0.0003542
drugD-1time 9.57920 5.4779183 13.680482 0.0000003
drugE-1time 15.16555 11.0642683 19.266832 0.0000000
4times-2times 3.14981 -0.9514717 7.251092 0.2050382
drugD-2times 6.13620 2.0349183 10.237482 0.0009611
drugE-2times 11.72255 7.6212683 15.823832 0.0000000
drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
drugE-4times 8.57274 4.4714583 12.674022 0.0000037
drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
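# diff: difference between the two group means
# lwr, upr: lower and upper bounds of the confidence interval (95% by default)
# p adj: p-value adjusted for multiple comparisons
# In this example, 2times-1time, 4times-2times and drugD-4times show no significant difference; all other pairs differ significantly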
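Rather than reading the table by eye, the non-significant pairs can be filtered out of the TukeyHSD() result. A minimal sketch: the data frame is rebuilt from the values listed in step1 so the block runs on its own, and the 0.05 cutoff and the names `tk`, `sig` and `not_sig` are illustrative choices.

```r
# Rebuild the cholesterol data from the listing in step1
lv <- c("1time", "2times", "4times", "drugD", "drugE")
cholesterol <- data.frame(
  trt = factor(rep(lv, each = 10), levels = lv),
  response = c(
    3.8612, 10.3868, 5.9059, 3.0609, 7.7204, 2.7139, 4.9243, 2.3039, 7.5301, 9.4123,
    10.3993, 8.6027, 13.6320, 3.5054, 7.7703, 8.6266, 9.2274, 6.3159, 15.8258, 8.3443,
    13.9621, 13.9606, 13.9176, 8.0534, 11.0432, 12.3692, 10.3921, 9.0286, 12.8416, 18.1794,
    16.9819, 15.4576, 19.9793, 14.7389, 13.5850, 10.8648, 17.5897, 8.8194, 17.9635, 17.6316,
    21.5119, 27.2445, 20.5199, 15.7707, 22.8850, 23.9527, 21.5925, 18.3058, 20.3851, 17.3071
  )
)

fit <- aov(response ~ trt, data = cholesterol)

# TukeyHSD() returns a list with one matrix per factor; rows are group pairs
tk <- TukeyHSD(fit)$trt

# Split the pairs at the 0.05 level using the adjusted p-values
not_sig <- rownames(tk)[tk[, "p adj"] >= 0.05]
sig <- rownames(tk)[tk[, "p adj"] < 0.05]

print(not_sig)
print(sig)
```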
step5: plot the results
> library(multcomp)
> par(mar=c(5,4,6,2))
> tuk<-glht(fit,linfct=mcp(trt="Tukey"))
> plot(cld(tuk,level=0.05),col="lightgreen")

[Figure: output]
Groups that share a letter are not significantly different.
