판다스: 인덱스 :: Python 빅데이터 처리와 시각화 - mindscale
Skip to content

판다스: 인덱스

import pandas as pd
df = pd.read_excel('census.xlsx')
df.head()
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

인덱스

df.index
RangeIndex(start=0, stop=32561, step=1)

.loc과 .iloc

.loc: 판다스 인덱스 기준 .iloc: 순서 기준으로

df.loc[0]  # 인덱스 0번
age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object
df.iloc[0]  # 0번째
age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object

정렬을 하면 12318번 행이 제일 처음(0번째)에 나옴

df.sort_values('age').head()
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
12318 17 Private 127366 11th 7 Never-married Sales Own-child White Female 0 0 8 United-States <=50K
6312 17 Private 132755 11th 7 Never-married Sales Own-child White Male 0 0 15 United-States <=50K
30927 17 Private 108470 11th 7 Never-married Other-service Own-child Black Male 0 0 17 United-States <=50K
12787 17 Local-gov 308901 11th 7 Never-married Adm-clerical Own-child White Female 0 0 15 United-States <=50K
25755 17 ? 47407 11th 7 Never-married ? Own-child White Male 0 0 10 United-States <=50K
df.sort_values('age').loc[0]  # 0번행
age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education_num                13
marital_status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital_gain               2174
capital_loss                  0
hours_per_week               40
native_country    United-States
income                    <=50K
Name: 0, dtype: object
df.sort_values('age').iloc[0]  # 나이순으로 정렬했을 때 0번째 나오는 12318번행
age                          17
workclass               Private
fnlwgt                   127366
education                  11th
education_num                 7
marital_status    Never-married
occupation                Sales
relationship          Own-child
race                      White
sex                      Female
capital_gain                  0
capital_loss                  0
hours_per_week                8
native_country    United-States
income                    <=50K
Name: 12318, dtype: object

행과 열 모두 가리키기

df.loc[0, 'age']  # 0번 행, age 열
39
df.iloc[0, 0]  # 0번째 행, 0번째 열
39

문자열로 된 인덱스

import numpy as np
am = df.groupby('race').agg({'age': np.mean, 'education_num': np.mean})
am
age education_num
race
Amer-Indian-Eskimo 37.173633 9.311897
Asian-Pac-Islander 37.746872 10.960539
Black 37.767926 9.486236
Other 33.457565 8.841328
White 38.769881 10.135246
am.loc['Black', 'age']
37.7679257362356
am.iloc[2, 0]
37.7679257362356

슬라이싱

.loc은 Python의 일반적인 인덱싱과 달리 5:7이라고 하면 5, 6, 7을 모두 포함한다.

df.loc[5:7]
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
5 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
6 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica <=50K
7 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
df.loc[5:7, 'age':'education']
age workclass fnlwgt education
5 37 Private 284582 Masters
6 49 Private 160187 9th
7 52 Self-emp-not-inc 209642 HS-grad

.iloc은 Python의 일반적인 인덱싱과 마찬가지로 5:7이라고 하면 6까지만 포함하고 7은 포함하지 않는다.

df.iloc[5:7]
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
5 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
6 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica <=50K
df.iloc[5:7,0:4]
age workclass fnlwgt education
5 37 Private 284582 Masters
6 49 Private 160187 9th

실습

1) education_num 순으로 정렬을 해보세요

2) 위의 정렬한 데이터에서 첫 100명(0~99)을 뽑아보세요

3) 위에서 뽑은 사람들의 age의 평균을 구해보세요.

df.sort_values('education_num').iloc[0:100].agg({'age': np.mean})
age    43.46
dtype: float64

무작위 고르기

df['age'].mean()
38.58164675532078
df.sample(1000)['age'].mean()
38.384

인덱스 바꾸기

d1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
d1
a b
0 1 4
1 2 5
2 3 6

특정 컬럼을 set_index를 이용해 인덱스로 바꿀 수 있다.

d2 = d1.set_index('b')
d2
a
b
4 1
5 2
6 3
d2.loc[4]
a    1
Name: 4, dtype: int64

reset_index를 하면 자동으로 0, 1, 2, ...으로 인덱스가 붙는다.

d2.reset_index()
b a
0 4 1
1 5 2
2 6 3

수동 지정

d3 = d1.copy()
d3.index = ['x', 'y', 'z']
d3
a b
1 1 4
0 2 5
2 3 6

멀티 인덱스

d4 = d1.copy()
d4.index = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])
d4
a b
x 1 1 4
2 2 5
y 1 3 6

groupby를 할 때 결과가 멀티 인덱스 형태가 되는 경우가 많음

result = df.groupby(['race', 'sex']).agg({'age': [np.mean, np.median]})
result
age
mean median
race sex
Amer-Indian-Eskimo Female 37.117647 36
Male 37.208333 35
Asian-Pac-Islander Female 35.089595 33
... ... ... ...
Other Male 34.654321 32
White Female 36.811618 35
Male 39.652498 38

10 rows × 2 columns

멀티 인덱스는 (a, b, c, ..) 순으로 표시

result.loc[('White', 'Female')]
age  mean      36.811618
     median    35.000000
Name: (White, Female), dtype: float64
result[('age', 'median')]
race                sex   
Amer-Indian-Eskimo  Female    36
                    Male      35
Asian-Pac-Islander  Female    33
                              ..
Other               Male      32
White               Female    35
                    Male      38
Name: (age, median), Length: 10, dtype: int64