판다스: 그룹핑

import pandas as pd

census.xlsx 파일 열기

df = pd.read_excel('census.xlsx')

df.head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

소득 수준에 따라 그룹짓기

g1 = df.groupby('income')

그룹 보기

g1.groups

{'<=50K': Int64Index([    0,     1,     2,     3,     4,     5,     6,    12,    13,
                15,
             ...
             32548, 32549, 32550, 32551, 32552, 32553, 32555, 32556, 32558,
             32559],
            dtype='int64', length=24720),
 '>50K': Int64Index([    7,     8,     9,    10,    11,    14,    19,    20,    25,
                27,
             ...
             32530, 32532, 32533, 32536, 32538, 32539, 32545, 32554, 32557,
             32560],
            dtype='int64', length=7841)}

특정 그룹만 보기

g1.get_group('>50K').head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	45	United-States	>50K
8	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	50	United-States	>50K
9	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	40	United-States	>50K
10	37	Private	280464	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	Black	Male	0	80	United-States	>50K
11	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	40	India	>50K

그룹별 통계

import numpy as np

df['education_num'].mean()  # 전체 평균

10.0806793403151

g1['education_num'].mean()

income
<=50K     9.595065
>50K     11.611657
Name: education_num, dtype: float64

g1['education_num'].describe()

	count	mean	std	min	25%	50%	75%	max
income
<=50K	24720.0	9.595065	2.436147	1.0	9.0	9.0	10.0	16.0
>50K	7841.0	11.611657	2.385129	2.0	10.0	12.0	13.0	16.0

nth

g1['capital_gain'].nth(0)

income
<=50K    2174
>50K        0
Name: capital_gain, dtype: int64

g1['capital_gain'].nth(-1)

income
<=50K        0
>50K     15024
Name: capital_gain, dtype: int64

agg

aggregate: 모으다(모아서 정리하다)

g1.agg({'education_num': np.mean})

	education_num
income
<=50K	9.595065
>50K	11.611657

그룹별 education_num 평균과 표준편차

g1.agg({'education_num': [np.mean, np.std]})

	education_num
	mean	std
income
<=50K	9.595065	2.436147
>50K	11.611657	2.385129

그룹별 capital_gain 평균

g1.agg({'capital_gain': np.mean})

	capital_gain
income
<=50K	148.752468
>50K	4006.142456

모두 적용

g1.agg(
    {
        'education_num': [np.mean, np.std],
        'capital_gain': np.mean
    }
)

	education_num		capital_gain
	mean	std	mean
income
<=50K	9.595065	2.436147	148.752468
>50K	11.611657	2.385129	4006.142456

실습

성별(sex)에 따라 education_num의 평균(np.mean)과 표준편차(np.std)를 구하세요.

SELECT AVG(eduation_num), STD(education_num) FROM df GROUP BY sex;

df.groupby('sex').agg({'education_num': [np.mean, np.std]})

	education_num
	mean	std
sex
Female	10.035744	2.379954
Male	10.102891	2.662630

race와 income으로 그룹

g2 = df.groupby(['race', 'income'])

rie = g2.agg({'education_num': np.mean})

rie.to_excel('rie.xlsx')

rie

		education_num
race	income
Amer-Indian-Eskimo	<=50K	9.061818
Amer-Indian-Eskimo	>50K	11.222222
Asian-Pac-Islander	<=50K	10.440367
Asian-Pac-Islander	>50K	12.398551
Black	<=50K	9.261235
Black	>50K	11.077519
Other	<=50K	8.560976
Other	>50K	11.600000
White	<=50K	9.627422
White	>50K	11.612196

reset_index

rie.reset_index()

	race	income	education_num
0	Amer-Indian-Eskimo	<=50K	9.061818
1	Amer-Indian-Eskimo	>50K	11.222222
2	Asian-Pac-Islander	<=50K	10.440367
3	Asian-Pac-Islander	>50K	12.398551
4	Black	<=50K	9.261235
5	Black	>50K	11.077519
6	Other	<=50K	8.560976
7	Other	>50K	11.600000
8	White	<=50K	9.627422
9	White	>50K	11.612196

unique

컬럼에 포함된 값의 종류를 알려준다.

df['race'].unique()

array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
       'Other'], dtype=object)

판다스: 그룹핑

census.xlsx 파일 열기

소득 수준에 따라 그룹짓기

그룹 보기

특정 그룹만 보기

그룹별 통계

nth

agg

그룹별 education_num 평균과 표준편차

그룹별 capital_gain 평균

모두 적용

실습

race와 income으로 그룹

reset_index

unique

category