Rによるデータサイエンス（12/21） - 「大人の教養・知識・気付き」を伸ばすブログ

　Rについて

Rによるデータサイエンス(第2版):データ解析の基礎から最新手法まで

作者:金明哲
森北出版

Amazon

をベースに学んでいく。
　今回は線形判別分析（PP.161-169）を扱う。

前回
13.　線形判別分析
- 13.1　線形判別分析の基礎
- 13.2　ケーススタディ
補足　スペック情報

前回

power-of-awareness.com

13.　線形判別分析

　識別・認識に相当する能力を機械的に実現する研究分野をパターン認識という。ここではパターン認識でも最も古典的な手法である判別分析を扱う。判別分析は個体がどの集団に属するかが明確である学習データを用いて判別モデルを構築し、そのモデルを用いて所属不明の個体がどのグループに帰属するかを判別する手法である。判別分析は線形判別分析と非線形判別分析の大別される。

13.1　線形判別分析の基礎

　線形判別分析は群*1を分ける境界を直線あるいは超平面として与え、以下のような線形関数を用いて所属する群を判別する。

$\begin{aligned} f(\boldsymbol{x})=a_1+a_1x_1+\cdots+a_p x_p,\boldsymbol{x}={}^{t}(x_1,\cdots,x_p) \end{aligned}$

　 $\mathrm{Fisher}$ が提案した線形判別分析では、 $\boldsymbol{x}\sim\mathcal{N}_p(\boldsymbol{\mu},\Sigma)$ に従い各群の母分散が等しいと仮定し、群間の分散と郡内の分散の比を最大化することで係数を推定する。各種パラメータは以下のとおりである：

$\begin{aligned} 係数ベクトル：　{}^{t}A&=\begin{bmatrix}a_1,a_2,\cdots,a_p\end{bmatrix},\\ 群間の分散：　BS&=\displaystyle{\frac{1}{g-1}\sum_{i=1}^{g}m_i(\bar{X}_i-\bar{X}){}^{t}(\bar{X}_i-\bar{X})},\\ 群内の分散：　WS&=\displaystyle{\frac{1}{m-g}\sum_{i=1}^{g}\sum_{j=1}^{m_i}(X_{ij}-\bar{X}_i){}^{t}(X_{ij}-\bar{X}_i)},\\ \end{aligned}$

線形判別関数の係数は群間分散と群内分散の比率

$\begin{aligned} V&=\displaystyle{\frac{BS}{WS}}=\displaystyle{\frac{{}^{t}A(BS)A}{{}^{t}A(WS)A}} \end{aligned}$

を最大にすることで求める。最大値を求めるため各係数での偏微分を $0$ とした方程式を解くと $(WS)^{-1}(BS)$ の固有値となる。第1固有ベクトルを第1判別関数の係数とする。

13.2　ケーススタディ

library("MASS") # 線形判別用のパッケージ
library("ggplot2")
library("ggridges")


##########################
### テストデータの作成 ###
##########################
data(iris)

vc_iris_lab <- c(rep("S",50),rep("C",50),rep("V",50)) # ラベル
df_iris1 <- data.frame(iris[,1:4],Species = vc_iris_lab)

### テストデータはデータセットの奇数行とする
even_n <- 2 *(1:75) -1

df_iris_train <- df_iris1[even_n,]
df_iris_test <- df_iris1[-even_n,]


##################
### 推定の実施 ###
##################
apply(iris[,1:4],2,sd) # 母分散をチェック

Z_lda <- lda(formula = formula(Species ~.), data = df_iris_train)

vc_coefs <- Z_lda$scaling[,1] # 係数の推定値

vc_cts <- apply(Z_lda$means%*%Z_lda$scaling,2,mean) # 定数項

vc_lda_scores <- as.vector(Z_lda$scaling[,1] %*% t(df_iris_train[,1:4]) - vc_cts[1]) # 判別得点：判別関数で得られる値

df_iris1_g <- data.frame(score = vc_lda_scores,df_iris_train)


### 汎化性能をチェック
table(df_iris_train[,5], predict(Z_lda)$class)

for(i in 2:5){
  df_coef <- data.frame(Species = df_iris1_g[df_iris1_g[,"Species"]=="S",6],
                        val = df_iris1_g[df_iris1_g[,"Species"]=="S",i])
  df_coef <- rbind(df_coef,data.frame(Species = df_iris1_g[df_iris1_g[,"Species"]=="C",6],
                                      val = df_iris1_g[df_iris1_g[,"Species"]=="C",i]))
  df_coef <- rbind(df_coef,data.frame(Species = df_iris1_g[df_iris1_g[,"Species"]=="V",6],
                                      val = df_iris1_g[df_iris1_g[,"Species"]=="V",i]))

  # 変数の分布  
  g <- ggplot(data = df_coef,mapping = aes(x = val, group = Species, color = Species)) + geom_density()
  g <- g + ylab("密度") + xlab(colnames(iris)[i-1])
  g <- g + theme_classic()
  g <- g + theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
                   legend.title=element_text(size = 7),
                   legend.text=element_text(size = 7))
  plot(g)
}

# 判別得点の分布
g <- ggplot(data = df_iris1_g,mapping = aes(x = score, y = Species, group = Species, fill = Species)) + geom_density_ridges(alpha = 0.5)
g <- g + labs(x = "種類",y = "判別得点", group = "", fill = "")
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5),legend.position = "bottom",
               axis.title = element_text(size = 12),
               legend.title=element_text(size = 12),
               legend.text=element_text(size = 12))
plot(g)


df_scores_prd <- data.frame(score_1 = as.vector(Z_lda$scaling[,1] %*% t(df_iris_train[,1:4]) - vc_cts[1]),
                            score_2 = as.vector(Z_lda$scaling[,2] %*% t(df_iris_train[,1:4]) - vc_cts[2]),
                            Species = df_iris1_g[,"Species"])

g <- ggplot(data = df_scores_prd,mapping = aes(x = score_1, y = score_2, color = Species, group = Species, label = Species))
g <- g + geom_text()
g <- g + labs(x = "第1判別得点",y = "第2判別得点", group = "", color = "")
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5),legend.position = "none",
               axis.title = element_text(size = 12),
               legend.title=element_text(size = 12),
               legend.text=element_text(size = 12))
g <- g + geom_hline(yintercept = 0, linetype = 3)
g <- g + geom_vline(xintercept = 0, linetype = 3)
plot(g)

## テストデータに当てはめ
vc_predict <- predict(object = Z_lda,newdata = df_iris_test[,-5])
df_test_prd <- data.frame(target = df_iris_test[,"Species"],
                          score_1 = as.vector(Z_lda$scaling[,1] %*% t(df_iris_test[,-5]) - mean(Z_lda$means%*%Z_lda$scaling[,1])),
                          score_2 = as.vector(Z_lda$scaling[,2] %*% t(df_iris_test[,-5]) - mean(Z_lda$means%*%Z_lda$scaling[,2])),
                          pred = vc_predict$class)

table(df_test_prd$target,df_test_prd$pred)


g <- ggplot(data = df_test_prd,mapping = aes(x = score_1, y = score_2, color = target, group = target, label = target))
g <- g + geom_text()
g <- g + labs(x = "第1判別得点",y = "第2判別得点", group = "", color = "")
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5),legend.position = "none",
               axis.title = element_text(size = 12),
               legend.title=element_text(size = 12),
               legend.text=element_text(size = 12))
g <- g + geom_hline(yintercept = 0, linetype = 3)
g <- g + geom_vline(xintercept = 0, linetype = 3)
plot(g)

# 交差診断を実施
iris_CV <- lda(formula = formula(Species ~.),data =iris, CV = T)
lda_tbl <- table(iris[,5],iris_CV$class)

sum(diag(lda_tbl))/sum(lda_tbl)
1 - sum(diag(lda_tbl))/sum(lda_tbl)

補足　スペック情報

エディション	Windows 10 Home
バージョン	20H2
プロセッサ	Intel(R) Core(TM) i5-1035G4 CPU @ 1.10GHz 1.50 GHz
実装 RAM	8.00 GB
システムの種類	64 ビットオペレーティングシステム、x64 ベースプロセッサ
R　バージョン	4.1.3 (March, 2022)
RStudio　バージョン	2022.02.2 Build 485

*1:グループのこと。

前回

13. 線形判別分析

13.1 線形判別分析の基礎

13.2 ケーススタディ

補足 スペック情報

13.　線形判別分析

13.1　線形判別分析の基礎

13.2　ケーススタディ

補足　スペック情報