library(tidyverse)
# 显示每一类的数量
%>%
diamonds count(cut)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
count()
计算各类别数量tab()
计算各类别数量、百分比、累计百分比tab1()
同时列出多个变量的上述统计量summarise()
统计个别统计量summ()
快速生成大量变量的多个统计量count()
library(tidyverse)
# 显示每一类的数量
%>%
diamonds count(cut)
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
tab()
和 tab1()
这两个函数来自于我正在开发的 statart 包,暂时只能通过 source()
函数调用。
# 在 count() 的基础上,多了很多信息
%>%
diamonds tab(cut)
# A tibble: 6 × 4
cut n percent cum
<chr> <int> <dbl> <dbl>
1 Fair 1610 2.98 2.98
2 Good 4906 9.10 12.1
3 Very Good 12082 22.4 34.5
4 Premium 13791 25.6 60.0
5 Ideal 21551 40.0 100
6 total 53940 100 NA
# 自动转换成列联表
%>%
diamonds tab(cut, clarity)
# A tibble: 6 × 10
cut I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF total
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 Fair 210 466 408 261 170 69 17 9 1610
2 Good 96 1081 1560 978 648 286 186 71 4906
3 Very Good 84 2100 3240 2591 1775 1235 789 268 12082
4 Premium 205 2949 3575 3357 1989 870 616 230 13791
5 Ideal 146 2598 4282 5071 3589 2606 2047 1212 21551
6 total 741 9194 13065 12258 8171 5066 3655 1790 53940
# 同时 tab1() 多个变量
%>%
diamonds tab1(cut:clarity)
$cut
# A tibble: 6 × 4
value n percent cum
<chr> <int> <dbl> <dbl>
1 Fair 1610 2.98 2.98
2 Good 4906 9.10 12.1
3 Very Good 12082 22.4 34.5
4 Premium 13791 25.6 60.0
5 Ideal 21551 40.0 100
6 total 53940 100 NA
$color
# A tibble: 8 × 4
value n percent cum
<chr> <int> <dbl> <dbl>
1 D 6775 12.6 12.6
2 E 9797 18.2 30.7
3 F 9542 17.7 48.4
4 G 11292 20.9 69.3
5 H 8304 15.4 84.7
6 I 5422 10.1 94.8
7 J 2808 5.21 100
8 total 53940 100 NA
$clarity
# A tibble: 9 × 4
value n percent cum
<chr> <int> <dbl> <dbl>
1 I1 741 1.37 1.37
2 SI2 9194 17.0 18.4
3 SI1 13065 24.2 42.6
4 VS2 12258 22.7 65.4
5 VS1 8171 15.1 80.5
6 VVS2 5066 9.39 89.9
7 VVS1 3655 6.78 96.7
8 IF 1790 3.32 100
9 total 53940 100 NA
# 同时 tab1() 多个变量,并返回成一个数据
%>%
diamonds tab1(cut:clarity, .append = TRUE)
# A tibble: 23 × 5
variable value n percent cum
<chr> <chr> <int> <dbl> <dbl>
1 cut Fair 1610 2.98 2.98
2 cut Good 4906 9.10 12.1
3 cut Very Good 12082 22.4 34.5
4 cut Premium 13791 25.6 60.0
5 cut Ideal 21551 40.0 100
6 cut total 53940 100 NA
7 color D 6775 12.6 12.6
8 color E 9797 18.2 30.7
9 color F 9542 17.7 48.4
10 color G 11292 20.9 69.3
11 color H 8304 15.4 84.7
12 color I 5422 10.1 94.8
13 color J 2808 5.21 100
14 color total 53940 100 NA
15 clarity I1 741 1.37 1.37
16 clarity SI2 9194 17.0 18.4
17 clarity SI1 13065 24.2 42.6
18 clarity VS2 12258 22.7 65.4
19 clarity VS1 8171 15.1 80.5
20 clarity VVS2 5066 9.39 89.9
21 clarity VVS1 3655 6.78 96.7
22 clarity IF 1790 3.32 100
23 clarity total 53940 100 NA
summarise()
# 平均钻石价格
%>%
diamonds summarise(mean(price))
# A tibble: 1 × 1
`mean(price)`
<dbl>
1 3933.
# 指定变量名
%>%
diamonds summarise(price = mean(price))
# A tibble: 1 × 1
price
<dbl>
1 3933.
summ()
为了解决 summarise()
太原始的问题,我尽可能兼顾简便和实用,设计了一个新的函数 summ()
。下面我来演示一些基础功能:
这个函数来自于我正在开发的 statart 包,暂时只能通过 source()
函数调用。
# 统计 price 的变量类型、非缺失数、类别数、均值、标准差、最小值和最大值
%>%
diamonds summ(price)
# A tibble: 1 × 8
variable type n unique mean sd min max
<chr> <chr> <int> <int> <dbl> <dbl> <int> <int>
1 price int 53940 11602 3933. 3989. 326 18823
解释一下上面的统计量:
variable
是变量名type
是变量类型(如变量有单位,则为单位)n
是该变量非缺失值的数量unique
是变量的类别数(有多少个不重复的取值)
unique
\(\leq\) n
unique
越大,意味着这个变量越“连续”,信息越丰富unique
数也最大。mean
是均值sd
是标准差min
是最小值max
是最大值# 同时统计多个变量
%>%
diamonds summ(x:z)
# A tibble: 3 × 8
variable type n unique mean sd min max
<chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 x dbl 53940 554 5.73 1.12 0 10.7
2 y dbl 53940 552 5.73 1.14 0 58.9
3 z dbl 53940 375 3.54 0.706 0 31.8
在 x, y, z 这三个维度上,钻石的 z(高度)最小,而 x 和 y 的均值几乎一样。
# 展示所有统计量
%>%
diamonds summ(x:z, .detail = TRUE) %>%
print(width = Inf)
# A tibble: 3 × 18
variable type n unique miss_n valid_pct min q1 median mean mad
<chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 x dbl 53940 554 0 1 0 4.71 5.7 5.73 1.38
2 y dbl 53940 552 0 1 0 4.72 5.71 5.73 1.36
3 z dbl 53940 375 0 1 0 2.91 3.53 3.54 0.845
sd q3 max iqr skew kurtosis se
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.12 6.54 10.7 1.83 0.379 -0.618 0.00483
2 1.14 6.54 58.9 1.82 2.43 91.2 0.00492
3 0.706 4.04 31.8 1.13 1.52 47.1 0.00304
可以看到,其实 x 和 y 在四分位数(q1 和 q3)上没有什么差别,但是它们的最大值(max)、偏度(skew)和峰度(kurtosis)上相差很多。我猜想,对于一些不规则的钻石,可能会把短边选为 x,而长边选为 y。
# 自己选择一些统计量
%>%
diamonds summ(x:z, .stat = c("valid_pct", "mean", "se"))
# A tibble: 3 × 5
variable type valid_pct mean se
<chr> <chr> <dbl> <dbl> <dbl>
1 x dbl 1 5.73 0.00483
2 y dbl 1 5.73 0.00492
3 z dbl 1 3.54 0.00304
valid_pct 表示非缺失值的比例,这里为1,说明完全无缺失。se 是标准误,也就是对均值的误差估计。
# 一次性选择所有变量
%>%
diamonds summ(everything())
Warning in summ(., everything()):
cut, color, clarity are factor variables.
They are summarised (***), but the statistics may be misleading.
Consider using `tab1()` instead.
# A tibble: 10 × 8
variable type n unique mean sd min max
<chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 carat dbl 53940 273 0.798 0.474 0.2 5.01
2 cut*** ord 53940 5 3.90 1.12 1 5
3 color*** ord 53940 7 3.59 1.70 1 7
4 clarity*** ord 53940 8 4.05 1.65 1 8
5 depth dbl 53940 184 61.7 1.43 43 79
6 table dbl 53940 127 57.5 2.23 43 95
7 price int 53940 11602 3933. 3989. 326 18823
8 x dbl 53940 554 5.73 1.12 0 10.7
9 y dbl 53940 552 5.73 1.14 0 58.9
10 z dbl 53940 375 3.54 0.706 0 31.8
请注意,因为 cut, color, clarity 都是定序变量,所以 summ()
它们比较牵强。对它们使用前文的 tab1()
会更为合适。
Rows: 1,000
Columns: 14
$ date <date> 2024-02-08, 2024-02-07, 2024-02-06, 2024-02-05, 2024-02-04,…
$ time <dttm> 2024-02-09 11:59:59, 2024-02-09 11:59:58, 2024-02-09 11:59:…
$ duration1 <drtn> 3 hours, 4 hours, 5 hours, 6 hours, 7 hours, 8 hours, 9 hou…
$ duration2 <time> 12:34:56, 12:34:56, 12:34:56, 12:34:56, 12:34:56, 12:34:56,…
$ string <chr> "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC…
$ logical <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FA…
$ unit1 [m] 1 [m], 2 [m], 3 [m], 4 [m], 5 [m], 6 [m], 7 [m], 8 [m], 9 [m],…
$ unit2 [m^2] 1 [m^2], 2 [m^2], 3 [m^2], 4 [m^2], 5 [m^2], 6 [m^2], 7 [m^2…
$ factor <fct> a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, …
$ order <ord> a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, a, b, …
$ double <dbl> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, …
$ integer <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ half_miss <dbl> 1, NA, 3, NA, 5, NA, 7, NA, 9, NA, 11, NA, 13, NA, 15, NA, 1…
$ all_miss <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
%>%
complex_tb codebook()
Warning in un[i] <- at$units: number of items to replace is not a multiple of
replacement length
Warning in un[i] <- at$units: number of items to replace is not a multiple of
replacement length
# A tibble: 14 × 4
variable type n unique
<chr> <chr> <int> <int>
1 date date 1000 1000
2 time datetime 1000 1000
3 duration1 duration [hours] 1000 1000
4 duration2 time [secs] 1000 1
5 string character 1000 1
6 logical logical 1000 2
7 unit1 units [m] 1000 1000
8 unit2 units [m^2] 1000 1000
9 factor factor 1000 2
10 order ordered 1000 2
11 double double 1000 1000
12 integer integer 1000 1000
13 half_miss double 500 500
14 all_miss double 0 0
%>%
complex_tb summ(everything())
Warning in summ(., everything()):
string is non-numeric.
Use `tab()` instead.
Warning in summ(., everything()):
date, time are date or datetime variables.
Use `summ_date()` or `summ_all()` instead.
Warning in summ(., everything()):
factor, order are factor variables.
They are summarised (***), but the statistics may be misleading.
Consider using `tab1()` instead.
Warning in summ(., everything()):
all_miss is entirely missing and thus removed.
# A tibble: 10 × 8
variable type n unique mean sd min max
<chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 duration1 [hours] 1000 1000 502. 289. 3 1002
2 duration2 [secs] 1000 1 45296 0 45296 45296
3 logical lgl 1000 2 0.5 0.500 0 1
4 unit1 [m] 1000 1000 500. 289. 1 1000
5 unit2 [m^2] 1000 1000 500. 289. 1 1000
6 factor*** fct 1000 2 1.5 0.500 1 2
7 order*** ord 1000 2 1.5 0.500 1 2
8 double dbl 1000 1000 50.0 28.9 0.1 100
9 integer int 1000 1000 500. 289. 1 1000
10 half_miss dbl 500 500 500 289. 1 999