[python-ds] 판다스: 인덱스 :: 마인드스케일

import pandas as pd

df = pd.read_excel('census.xlsx')

df.head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

인덱스

df.index

RangeIndex(start=0, stop=32561, step=1)

.loc과 .iloc

.loc: 판다스 인덱스 기준 .iloc: 순서 기준으로

df.loc[0]  # 인덱스 0번

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object

df.iloc[0]  # 0번째

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object

정렬을 하면 12318번 행이 제일 처음(0번째)에 나옴

df.sort_values('age').head()

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	hours_per_week	native_country	income
12318	17	Private	127366	11th	7	Never-married	Sales	Own-child	White	Female	8	United-States	<=50K
6312	17	Private	132755	11th	7	Never-married	Sales	Own-child	White	Male	15	United-States	<=50K
30927	17	Private	108470	11th	7	Never-married	Other-service	Own-child	Black	Male	17	United-States	<=50K
12787	17	Local-gov	308901	11th	7	Never-married	Adm-clerical	Own-child	White	Female	15	United-States	<=50K
25755	17	?	47407	11th	7	Never-married	?	Own-child	White	Male	10	United-States	<=50K

df.sort_values('age').loc[0]  # 0번행

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object

df.sort_values('age').iloc[0]  # 나이순으로 정렬했을 때 0번째 나오는 12318번행

age                          17
workclass               Private
fnlwgt                   127366
education                  11th
education_num                 7
marital_status    Never-married
occupation                Sales
relationship          Own-child
race                      White
sex                      Female
capital_gain                  0
capital_loss                  0
hours_per_week                8
native_country    United-States
income                    <=50K
Name: 12318, dtype: object

행과 열 모두 가리키기

df.loc[0, 'age']  # 0번 행, age 열

df.iloc[0, 0]  # 0번째 행, 0번째 열

문자열로 된 인덱스

import numpy as np

am = df.groupby('race').agg({'age': np.mean, 'education_num': np.mean})

am

	age	education_num
race
Amer-Indian-Eskimo	37.173633	9.311897
Asian-Pac-Islander	37.746872	10.960539
Black	37.767926	9.486236
Other	33.457565	8.841328
White	38.769881	10.135246

am.loc['Black', 'age']

37.7679257362356

am.iloc[2, 0]

37.7679257362356

슬라이싱

.loc은 Python의 일반적인 인덱싱과 달리 5:7이라고 하면 5, 6, 7을 모두 포함한다.

df.loc[5:7]

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	hours_per_week	native_country	income
5	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	40	United-States	<=50K
6	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	16	Jamaica	<=50K
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	45	United-States	>50K

df.loc[5:7, 'age':'education']

	age	workclass	fnlwgt	education
5	37	Private	284582	Masters
6	49	Private	160187	9th
7	52	Self-emp-not-inc	209642	HS-grad

.iloc은 Python의 일반적인 인덱싱과 마찬가지로 5:7이라고 하면 6까지만 포함하고 7은 포함하지 않는다.

df.iloc[5:7]

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	income
5	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	<=50K
6	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	0	16	Jamaica	<=50K

df.iloc[5:7,0:4]

	age	workclass	fnlwgt	education
5	37	Private	284582	Masters
6	49	Private	160187	9th

실습

education_num 순으로 정렬을 해보세요
위의 정렬한 데이터에서 첫 100명(0~99)을 뽑아보세요
위에서 뽑은 사람들의 age의 평균을 구해보세요.

df.sort_values('education_num').iloc[0:100].agg({'age': np.mean})

age    43.46
dtype: float64

무작위 고르기

df['age'].mean()

38.58164675532078

df.sample(1000)['age'].mean()

38.384

인덱스 바꾸기

d1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

d1

	a	b
0	1	4
1	2	5
2	3	6

특정 컬럼을 set_index를 이용해 인덱스로 바꿀 수 있다.

d2 = d1.set_index('b')

d2

	a
b
4	1
5	2
6	3

d2.loc[4]

a    1
Name: 4, dtype: int64

reset_index를 하면 자동으로 0, 1, 2, ...으로 인덱스가 붙는다.

d2.reset_index()

	b	a
0	4	1
1	5	2
2	6	3

수동 지정

d3 = d1.copy()

d3.index = ['x', 'y', 'z']

d3

	a	b
1	1	4
0	2	5
2	3	6

멀티 인덱스

d4 = d1.copy()

d4.index = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])

d4

		a	b
x	1	1	4
x	2	2	5
y	1	3	6

groupby를 할 때 결과가 멀티 인덱스 형태가 되는 경우가 많음

result = df.groupby(['race', 'sex']).agg({'age': [np.mean, np.median]})

result

		age
		mean	median
race	sex
Amer-Indian-Eskimo	Female	37.117647	36
Amer-Indian-Eskimo	Male	37.208333	35
Asian-Pac-Islander	Female	35.089595	33
...	...	...	...
Other	Male	34.654321	32
White	Female	36.811618	35
White	Male	39.652498	38

10 rows × 2 columns

멀티 인덱스는 (a, b, c, ..) 순으로 표시

result.loc[('White', 'Female')]

age  mean      36.811618
     median    35.000000
Name: (White, Female), dtype: float64

result[('age', 'median')]

race                sex
Amer-Indian-Eskimo  Female    36
                    Male      35
Asian-Pac-Islander  Female    33
                              ..
Other               Male      32
White               Female    35
                    Male      38
Name: (age, median), Length: 10, dtype: int64