Data Science with R (03/21) - A Blog for Growing "Adult Culture, Knowledge, and Awareness"

We will study R using the following book as our basis.

Rによるデータサイエンス(第2版):データ解析の基礎から最新手法まで

Author: 金明哲
Publisher: 森北出版

Previous
4. Data Visualization
Supplement: Spec Information
Next

Previous

power-of-awareness.com

4. Data Visualization

Drawing graphs helps us examine the structure that lies behind the data. The basic types of graph are introduced below.

4.1 Bar chart

#######################
### Creating graphs ###
#######################

library(ggplot2)
library(ggsci)
library(reshape2)

### Bar chart

# Use VADeaths: death rates per 1000 population in Virginia, USA, in 1940
# Rows: age groups (bins)
# Columns: region x sex
data(VADeaths)

# Inspect the data
print(VADeaths)

# Reshape the matrix into long format (one row per age group x region/sex pair)
VADeaths_melt <- melt(VADeaths)
colnames(VADeaths_melt) <- c("Age","Region","Mortality")

g <- ggplot(data = VADeaths_melt, aes(x = Region, y = Mortality, fill = Age))
g <- g + geom_col(position = "dodge")
g <- g + scale_fill_nejm()
g <- g + ylab("Mortality [per 1000]")
g <- g + theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
               legend.title=element_text(size = 7),
               legend.text=element_text(size = 7))

plot(g)

4.2 Pie chart

library("ggplot2")
library("stringr")

data(iris)

# Draw a single stacked bar and convert it to polar coordinates
# (after_stat() requires ggplot2 >= 3.3.0; it replaces the deprecated ..count.. notation)
g <- ggplot(data = iris, aes(fill = factor(round(Sepal.Length)))) +
  geom_bar(stat = "count",
           aes(x = 1),
           position = position_stack(reverse = TRUE), color = "black") +
  # Count of each class, placed in the middle of its slice
  geom_text(stat = "count",
            aes(x = 1.1, y = after_stat(count),
                label = after_stat(count)),
            position = position_stack(vjust = 0.5, reverse = TRUE)) +
  # Class label (rounded sepal length in cm), placed outside the circle
  geom_text(stat = "count",
            aes(x = 1.55, y = after_stat(count),
                label = after_stat(str_c(fill, "cm"))),
            position = position_stack(vjust = 0.5, reverse = TRUE)) +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        axis.title = element_blank(),
        axis.text  = element_blank(),
        axis.ticks = element_blank()) +
  coord_polar(theta = "y") # display in polar coordinates

plot(g)
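
As a rough check on the pie chart above, the slice sizes can be reproduced without drawing anything. The following is only a sketch in base R; the names len_class, counts and shares are illustrative and do not appear in the original code. Note that round() in R rounds halves to the even digit, so 4.5 becomes 4 while 5.5 becomes 6.

```r
# Class each flower by its sepal length rounded to the nearest cm,
# in the same way as factor(round(Sepal.Length)) in the plot
len_class <- round(iris$Sepal.Length)

# Number of flowers in each class (the value printed inside each slice)
counts <- table(len_class)

# Share of the whole circle taken by each slice
shares <- prop.table(counts)

print(counts)
print(round(100 * shares, 1))   # shares in percent
```

The angle of each slice is simply its share multiplied by 360 degrees.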

4.3 Histogram

A histogram groups the data into classes (bins) and shows the frequency of each class as a bar.

library("ggplot2")

data(iris)

g <- ggplot(data = iris, aes(x = Sepal.Length)) + theme_minimal() + 
  geom_histogram(position = "identity",colour = "gray10", fill = "green")
plot(g)
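
When no bin count is given, geom_histogram() falls back to 30 bins and prints a message suggesting a better value. One common rule of thumb is Sturges' rule; using it here is an assumption made for illustration and is not taken from the book. The sketch below uses base R only, and the names n_bins and h are illustrative.

```r
x <- iris$Sepal.Length

# Sturges' rule: ceiling(log2(n) + 1) classes
n_bins <- nclass.Sturges(x)

# Tabulate the classes without drawing anything.
# hist() treats a single number as a suggestion and picks round break points.
h <- hist(x, breaks = n_bins, plot = FALSE)

print(n_bins)
print(data.frame(lower = head(h$breaks, -1),
                 upper = tail(h$breaks, -1),
                 count = h$counts))
```

The resulting value could then be passed to geom_histogram() as bins = n_bins.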

4.4 Line chart

library("ggplot2")

data(VADeaths)

# Remove spaces from the column names (e.g. "Rural Male" -> "RuralMale")
colnames(VADeaths) <- gsub(" ","",colnames(VADeaths))

df_VADeaths <- data.frame()

# Reshape the data into long format; this time it is done by hand
for(i in seq_len(ncol(VADeaths))){
  df_VADeaths <- rbind(df_VADeaths,data.frame("Att" = rep(colnames(VADeaths)[i],nrow(VADeaths)),
                                              "Age" = as.factor(rownames(VADeaths)),
                                              "Mortality" = VADeaths[,i]))
}

# Line chart
g <- ggplot(data = df_VADeaths,aes(x = Age, y = Mortality, color = Att, group = Att)) + geom_line() +
  geom_point() + theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
                       legend.title=element_text(size = 7),
                       legend.text=element_text(size = 7))
plot(g)
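
The loop above builds the long-format table by hand. As an alternative sketch, base R can do the same reshaping in one step by treating the matrix as a contingency table. The name long_VADeaths is illustrative, and the column names passed to setNames() simply mirror the ones used above.

```r
# Start from the untouched dataset shipped with R
va <- datasets::VADeaths

# Convert the 5 x 4 matrix to long format: one row per (age group, column) pair
long_VADeaths <- as.data.frame(as.table(va), stringsAsFactors = TRUE)
long_VADeaths <- setNames(long_VADeaths, c("Age", "Att", "Mortality"))

print(head(long_VADeaths))
```

The result has the same three columns as df_VADeaths, so it could be handed to the same ggplot() call.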

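4.5 Box plot

A box plot is a way of examining how data are distributed (center, dispersion, outliers) without assuming any particular distributional form.

Figure: schematic of a box plot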

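Note that each whisker ends at the most extreme observation lying between $Q_1-1.5\times\mathrm{IQR}$ and $Q_3+1.5\times\mathrm{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles (the edges of the box) and $\mathrm{IQR}=Q_3-Q_1$. Values smaller (larger) than these limits are regarded as outliers. This is merely the default setting, however, and it can be changed (in geom_boxplot(), through the coef argument).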
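
To make the whisker rule concrete, the sketch below computes the limits by hand for the pooled Sepal.Width values of iris (not split by species, unlike the plots that follow). It uses base R only, and all variable names are illustrative.

```r
x <- iris$Sepal.Width

# First and third quartiles (the box edges) and the interquartile range
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Limits beyond which points are drawn individually as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whisker ends: the most extreme observations still inside the limits
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Observations outside the limits
outliers <- sort(x[x < lower_fence | x > upper_fence])

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```

The whiskers therefore stop at actual data values, which is why they are usually shorter than 1.5 times the IQR.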

data(iris)

ggplot(iris,aes(y = Sepal.Length, x = Species)) + geom_boxplot() + theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
        legend.title=element_text(size = 7),
        legend.text=element_text(size = 7))
ggplot(iris,aes(y = Sepal.Width, x = Species)) + geom_boxplot() + theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
        legend.title=element_text(size = 7),
        legend.text=element_text(size = 7))
ggplot(iris,aes(y = Petal.Length, x = Species)) + geom_boxplot() + theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
        legend.title=element_text(size = 7),
        legend.text=element_text(size = 7))
ggplot(iris,aes(y = Petal.Width, x = Species)) + geom_boxplot() + theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
        legend.title=element_text(size = 7),
        legend.text=element_text(size = 7))

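4.6 Scatter plot

A scatter plot places each sample in two- (or three-) dimensional space, using its variable values as coordinates, in order to examine the relationship between two or three variables. It is used in particular to examine whether the variables are correlated.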

data(iris)

# Scatter plot of Sepal.Length against Sepal.Width
g1 <- ggplot(data = iris, aes(x = Sepal.Length,y = Sepal.Width, colour = Species)) + 
  geom_point() + theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
        legend.title=element_text(size = 7),
        legend.text=element_text(size = 7))
plot(g1)

# Scatter plot of Petal.Length against Petal.Width
g2 <- ggplot(data = iris, aes(x = Petal.Length,y = Petal.Width, colour = Species)) + 
  geom_point() + theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
        legend.title=element_text(size = 7),
        legend.text=element_text(size = 7))
plot(g2)
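
The visual impression from the two scatter plots can be backed up with a number. The sketch below computes the Pearson correlation coefficient with base R's cor(); the variable names are illustrative. For the sepal measurements the pooled coefficient is weakly negative, while it is positive within every species, which is why coloring the points by Species matters.

```r
# Pearson correlation for each pair shown in the scatter plots (all species pooled)
r_sepal <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_petal <- cor(iris$Petal.Length, iris$Petal.Width)

print(round(c(sepal = r_sepal, petal = r_petal), 3))

# The sepal coefficient computed separately within each species
r_sepal_by_species <- sapply(split(iris, iris$Species),
                             function(d) cor(d$Sepal.Length, d$Sepal.Width))

print(round(r_sepal_by_species, 3))
```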

The $\mathrm{GGally}$ package also makes it easy to draw a scatter plot matrix.

library("ggplot2")
library("GGally")

data(iris)
# aes_string() is deprecated in recent ggplot2 versions; aes() gives the same mapping
ggpairs(iris, aes(colour = Species))

Supplement: Spec Information

Edition	Windows 10 Home
Version	20H2
Processor	Intel(R) Core(TM) i5-1035G4 CPU @ 1.10GHz 1.50 GHz
Installed RAM	8.00 GB
System type	64-bit operating system, x64-based processor
R version	3.6.3 (2020-02-29)
RStudio version	1.2.5033

Next

power-of-awareness.com
