Data Analysis with Python (Part 02/X) - A Blog for Growing Your "Grown-Up Culture, Knowledge, and Awareness"

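Since I need to get up to speed with $\mathrm{Python}$ quickly for work, I am studying it by working through the following book:

Python for Data Analysis, 2nd Edition (Japanese edition: Pythonによるデータ分析入門 第2版)

Author: Wes McKinney
Publisher: O'Reilly Japan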

Previous post:

power-of-awareness.com


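2. Python's Built-in Data Structures, Functions, and Files

2.1 Data Structures and Sequences

2.1.4 Dictionaries

The dictionary is arguably one of the most important of $\mathrm{Python}$'s built-in data structures. A dictionary maps keys to values, and its elements are accessed, inserted, and set with the same bracket syntax used for lists and tuples, except that you index by key rather than by position.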

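# Defining a dictionary
empty_dict = {}

d1 = {'a' : 'some value', 'b' : [1,2,3,4]}
# Elements are accessed, inserted, and set with the same bracket syntax as lists and tuples

# The update method merges another dictionary into this one in place
# (values for keys that already exist, such as 'b', are overwritten)
d1.update({'b' : 'foo', 'c' : 12})
print(d1)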
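
The bracket syntax mentioned in the comment above is not shown in the snippet, so here is a minimal sketch of it. The dictionary name `d` and its contents are illustrative assumptions, not taken from the book excerpt.

```python
# Illustrative dictionary; the name and values are assumptions for this sketch.
d = {'a': 'some value', 'b': [1, 2, 3, 4]}

d[7] = 'an integer'        # insert a new key/value pair
d['a'] = 'another value'   # overwrite the value of an existing key
print(d['b'])              # look up a value by key
print('b' in d)            # membership tests check keys, not values

del d[7]                   # delete a key with the del keyword
removed = d.pop('a')       # pop removes the key and returns its value
print(removed)
print(list(d.keys()), list(d.values()))
```

Both `del` and `pop` remove a key; `pop` is the one to use when the removed value is still needed.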

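It is common to end up with two sequences that you want to pair up element by element into a dictionary.

### Building a dictionary from two sequences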

key_list = ['a','b','c','d','e']
value_list = list(range(1,6))

mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value
print(mapping)
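
Because a dictionary is essentially a collection of (key, value) pairs, the loop above can likely be shortened by passing the zipped pairs straight to `dict`. A minimal sketch using the same `key_list` and `value_list`:

```python
key_list = ['a', 'b', 'c', 'd', 'e']
value_list = list(range(1, 6))

# dict() accepts an iterable of (key, value) pairs, which is what zip yields
mapping = dict(zip(key_list, value_list))
print(mapping)
```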

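To look up the value for a key, falling back to a default value when the key is absent, use the get method:

### If the key is in the dictionary, return the corresponding value
### (otherwise return a default value)

# Illustrative values so that the snippet runs; they are not from the original
some_dict = {'a': 1, 'b': 2}
key = 'c'
default_value = 0

value = some_dict.get(key, default_value)

# The line above is equivalent to:
if key in some_dict:
    value = some_dict[key]
else:
    value = default_value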

### Categorize a list of words by their first letter

words = ['apple', 'bat', 'bar', 'atom', 'book']

by_letter = {}

# (1) Step by step, with an explicit if/else
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)
print(by_letter)

# (2) Using the setdefault method
by_letter = {}

for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
print(by_letter)
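
A third option, not used in the snippets above, is `collections.defaultdict` from the standard library, which creates the default value automatically the first time a key is accessed. A minimal sketch with the same `words` list:

```python
from collections import defaultdict

words = ['apple', 'bat', 'bar', 'atom', 'book']

# Missing keys are initialized by calling list(), i.e. with an empty list
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

print(dict(by_letter))
```

This keeps the loop body to a single line, at the cost of an import and of a dictionary that silently creates entries on lookup.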

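Dictionary values can be any $\mathrm{Python}$ object. Keys, however, must be immutable objects such as scalar values (int, float, string) or tuples, and every element of a tuple used as a key must itself be immutable. The precise requirement is that the key be hashable.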
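
One way to check whether an object can serve as a key is to call the built-in `hash` on it. The helper name `is_hashable` below is a hypothetical name introduced only for this sketch.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('string'))
print(is_hashable((1, 2, (2, 3))))
print(is_hashable((1, 2, [2, 3])))   # this tuple contains a mutable list

d = {}
d[tuple([1, 2, 3])] = 5   # convert a list to a tuple to use it as a key
print(d)
```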

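2.1.5 Sets

A set is an unordered collection of unique elements. It can be created either with the set function or with a set literal in curly braces.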

set1 = set([2,2,2,1,3,3])
print(set1)

set2 = {2,2,2,1,3,3}
print(set2)

Sets support the usual mathematical set operations, such as union and intersection.

a = {1,2,3,4,5}
b = {3,4,5,6,7,8}

# Union
c = a.union(b) # returns a new set; use .update instead to modify a in place
print(c)
c = a | b
print(c)

# Intersection
d = a.intersection(b)
print(d)
d = a & b
print(d)

Function	Alternative syntax	Description
a.add(x)	N/A	Add element x to the set a.
a.clear()	N/A	Remove all elements, leaving a as the empty set.
a.remove(x)	N/A	Remove element x from the set a.
a.pop()	N/A	Remove and return an arbitrary element of a; raises KeyError if a is empty.
a.union(b)	a | b	Return the union of a and b.
a.update(b)	a |= b	Replace the contents of a with the union of a and b.
a.intersection(b)	a & b	Return the intersection of a and b.
a.intersection_update(b)	a &= b	Replace the contents of a with the intersection of a and b.
a.difference(b)	a - b	Return the elements of a that are not in b.
a.difference_update(b)	a -= b	Replace the contents of a with the elements of a that are not in b.
a.symmetric_difference(b)	a ^ b	Return the symmetric difference*1 of a and b.
a.symmetric_difference_update(b)	a ^= b	Replace the contents of a with the symmetric difference of a and b.
a.issubset(b)	a <= b	Return True if every element of a is contained in b.
a.issuperset(b)	a >= b	Return True if every element of b is contained in a.
a.isdisjoint(b)	N/A	Return True if a and b have no elements in common.
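
A few of the operations in the table that the earlier snippet does not exercise can be sketched as follows, reusing the same sets `a` and `b`. The names `diff`, `sym`, `c`, and `e` are chosen here for illustration.

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

diff = a - b     # elements of a that are not in b
sym = a ^ b      # elements in exactly one of a and b
print(diff, sym)

# In-place variants modify the left-hand set, so work on copies here
c = a.copy()
c |= b           # same as c.update(b)
e = a.copy()
e &= b           # same as e.intersection_update(b)
print(c, e)

print({1, 2, 3} <= a)          # subset test
print(a >= {1, 2, 3})          # superset test
print(a.isdisjoint({9, 10}))   # no shared elements
```

For large sets, the in-place forms avoid building a new set object on every operation.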

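As with dictionary keys, the elements of a set must be immutable (more precisely, hashable). To store list-like elements in a set, convert each list to a tuple first.

2.1.6 Comprehensions

A list comprehension builds a new list by filtering a collection and transforming the elements that pass the filter. It takes the following form: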
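
A minimal sketch of this restriction and the tuple workaround. The variable names are illustrative, and the `frozenset` part is an addition beyond the text above: it is the built-in immutable counterpart of `set`, useful when sets themselves need to be set elements.

```python
my_data = [1, 2, 3, 4]

# A list is unhashable, so it cannot be placed in a set directly
try:
    bad = {my_data}
except TypeError as exc:
    print('TypeError:', exc)

# Converting the list to a tuple makes it usable as a set element
my_set = {tuple(my_data)}
print(my_set)

# frozenset allows sets to be nested inside a set
nested = {frozenset({1, 2}), frozenset({2, 1})}
print(len(nested))   # the two frozensets are equal, so only one is kept
```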

[expr for val in collection if condition]

# List comprehension: uppercase every string longer than two characters
strings = ['a','as','bat','car','dove','python']
result = [x.upper() for x in strings if len(x) >2]
print(result)
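
To make the template concrete, the comprehension above corresponds to the following explicit loop. This is a sketch for comparison; `result_loop` and `result_comp` are names introduced here.

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# Explicit loop: filter with the condition, then append the transformed value
result_loop = []
for x in strings:
    if len(x) > 2:
        result_loop.append(x.upper())

# The same logic as a list comprehension
result_comp = [x.upper() for x in strings if len(x) > 2]

print(result_comp)
print(result_loop == result_comp)
```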

A dictionary comprehension has the same structure, with a key expression and a value expression separated by a colon:

{key-expr : value-expr for value in collection if condition}

# Dictionary comprehension
keys = ['a','b','c','d','e']
values = range(1,6)

result = {key:value for key,value in zip(keys,values)}
print(result)

# Map each string to its position in the list (the string is the key, the index is the value)
loc_mapping = {val : index for index, val in enumerate(strings)}
print(loc_mapping)

A set comprehension looks like a list comprehension, but with curly braces instead of square brackets:

{expr for val in collection if condition}

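# Set comprehension: collect the distinct string lengths
strings = ['a','as','bat','car','dove','python']
unique_length = {len(x) for x in strings}
print(unique_length)

del unique_length
unique_length = set(map(len, strings)) # the same result, more concisely, using map
print(unique_length)

# Nested comprehension: flatten a list of tuples into a single list
some_tuples = [(1,2,3),(4,5,6),(7,8,9)]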

flattened = [x for tup in some_tuples for x in tup]

print(flattened)
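
In a nested comprehension the for clauses are written in the same order as the equivalent nested loops. A minimal sketch of that equivalence, plus the contrasting case of a comprehension inside a comprehension, which keeps the nesting instead of flattening it (`flattened_loop` and `as_lists` are names introduced here):

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Equivalent nested loops: the outer loop comes first, as in the comprehension
flattened_loop = []
for tup in some_tuples:
    for x in tup:
        flattened_loop.append(x)

flattened = [x for tup in some_tuples for x in tup]
print(flattened == flattened_loop)

# A comprehension inside a comprehension produces a list of lists instead
as_lists = [[x for x in tup] for tup in some_tuples]
print(as_lists)
```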

*1: The set of all elements that belong to exactly one of the two sets.
