판다스: 인덱스
import pandas as pd
df = pd.read_excel('census.xlsx')
df.head()
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 
인덱스
df.index
RangeIndex(start=0, stop=32561, step=1)
.loc과 .iloc
.loc: 판다스 인덱스 기준
.iloc: 순서 기준으로
df.loc[0]  # 인덱스 0번
age 39 workclass State-gov fnlwgt 77516 education Bachelors education_num 13 marital_status Never-married occupation Adm-clerical relationship Not-in-family race White sex Male capital_gain 2174 capital_loss 0 hours_per_week 40 native_country United-States income <=50K Name: 0, dtype: object
df.iloc[0]  # 0번째
age 39 workclass State-gov fnlwgt 77516 education Bachelors education_num 13 marital_status Never-married occupation Adm-clerical relationship Not-in-family race White sex Male capital_gain 2174 capital_loss 0 hours_per_week 40 native_country United-States income <=50K Name: 0, dtype: object
정렬을 하면 12318번 행이 제일 처음(0번째)에 나옴
df.sort_values('age').head()
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12318 | 17 | Private | 127366 | 11th | 7 | Never-married | Sales | Own-child | White | Female | 0 | 0 | 8 | United-States | <=50K | 
| 6312 | 17 | Private | 132755 | 11th | 7 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 15 | United-States | <=50K | 
| 30927 | 17 | Private | 108470 | 11th | 7 | Never-married | Other-service | Own-child | Black | Male | 0 | 0 | 17 | United-States | <=50K | 
| 12787 | 17 | Local-gov | 308901 | 11th | 7 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 15 | United-States | <=50K | 
| 25755 | 17 | ? | 47407 | 11th | 7 | Never-married | ? | Own-child | White | Male | 0 | 0 | 10 | United-States | <=50K | 
df.sort_values('age').loc[0]  # 0번행
age 39 workclass State-gov fnlwgt 77516 education Bachelors education_num 13 marital_status Never-married occupation Adm-clerical relationship Not-in-family race White sex Male capital_gain 2174 capital_loss 0 hours_per_week 40 native_country United-States income <=50K Name: 0, dtype: object
df.sort_values('age').iloc[0]  # 나이순으로 정렬했을 때 0번째 나오는 12318번행
age 17 workclass Private fnlwgt 127366 education 11th education_num 7 marital_status Never-married occupation Sales relationship Own-child race White sex Female capital_gain 0 capital_loss 0 hours_per_week 8 native_country United-States income <=50K Name: 12318, dtype: object
행과 열 모두 가리키기
df.loc[0, 'age']  # 0번 행, age 열
39
df.iloc[0, 0]  # 0번째 행, 0번째 열
39
문자열로 된 인덱스
import numpy as np
am = df.groupby('race').agg({'age': np.mean, 'education_num': np.mean})
am
| age | education_num | |
|---|---|---|
| race | ||
| Amer-Indian-Eskimo | 37.173633 | 9.311897 | 
| Asian-Pac-Islander | 37.746872 | 10.960539 | 
| Black | 37.767926 | 9.486236 | 
| Other | 33.457565 | 8.841328 | 
| White | 38.769881 | 10.135246 | 
am.loc['Black', 'age']
37.7679257362356
am.iloc[2, 0]
37.7679257362356
슬라이싱
.loc은 Python의 일반적인 인덱싱과 달리 5:7이라고 하면 5, 6, 7을 모두 포함한다.
df.loc[5:7]
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | 
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K | 
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K | 
df.loc[5:7, 'age':'education']
| age | workclass | fnlwgt | education | |
|---|---|---|---|---|
| 5 | 37 | Private | 284582 | Masters | 
| 6 | 49 | Private | 160187 | 9th | 
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 
.iloc은 Python의 일반적인 인덱싱과 마찬가지로 5:7이라고 하면 6까지만 포함하고 7은 포함하지 않는다.
df.iloc[5:7]
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | 
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K | 
df.iloc[5:7,0:4]
| age | workclass | fnlwgt | education | |
|---|---|---|---|---|
| 5 | 37 | Private | 284582 | Masters | 
| 6 | 49 | Private | 160187 | 9th | 
실습
1) education_num 순으로 정렬을 해보세요
2) 위의 정렬한 데이터에서 첫 100명(0~99)을 뽑아보세요
3) 위에서 뽑은 사람들의 age의 평균을 구해보세요.
df.sort_values('education_num').iloc[0:100].agg({'age': np.mean})
age 43.46 dtype: float64
무작위 고르기
df['age'].mean()
38.58164675532078
df.sample(1000)['age'].mean()
38.384
인덱스 바꾸기
d1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
d1
| a | b | |
|---|---|---|
| 0 | 1 | 4 | 
| 1 | 2 | 5 | 
| 2 | 3 | 6 | 
특정 컬럼을 set_index를 이용해 인덱스로 바꿀 수 있다.
d2 = d1.set_index('b')
d2
| a | |
|---|---|
| b | |
| 4 | 1 | 
| 5 | 2 | 
| 6 | 3 | 
d2.loc[4]
a 1 Name: 4, dtype: int64
reset_index를 하면 자동으로 0, 1, 2, ...으로 인덱스가 붙는다.
d2.reset_index()
| b | a | |
|---|---|---|
| 0 | 4 | 1 | 
| 1 | 5 | 2 | 
| 2 | 6 | 3 | 
수동 지정
d3 = d1.copy()
d3.index = ['x', 'y', 'z']
d3
| a | b | |
|---|---|---|
| 1 | 1 | 4 | 
| 0 | 2 | 5 | 
| 2 | 3 | 6 | 
멀티 인덱스
d4 = d1.copy()
d4.index = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])
d4
| a | b | ||
|---|---|---|---|
| x | 1 | 1 | 4 | 
| 2 | 2 | 5 | |
| y | 1 | 3 | 6 | 
groupby를 할 때 결과가 멀티 인덱스 형태가 되는 경우가 많음
result = df.groupby(['race', 'sex']).agg({'age': [np.mean, np.median]})
result
| age | |||
|---|---|---|---|
| mean | median | ||
| race | sex | ||
| Amer-Indian-Eskimo | Female | 37.117647 | 36 | 
| Male | 37.208333 | 35 | |
| Asian-Pac-Islander | Female | 35.089595 | 33 | 
| ... | ... | ... | ... | 
| Other | Male | 34.654321 | 32 | 
| White | Female | 36.811618 | 35 | 
| Male | 39.652498 | 38 | |
10 rows × 2 columns
멀티 인덱스는 (a, b, c, ..) 순으로 표시
result.loc[('White', 'Female')]
age  mean      36.811618
     median    35.000000
Name: (White, Female), dtype: float64result[('age', 'median')]
race                sex   
Amer-Indian-Eskimo  Female    36
                    Male      35
Asian-Pac-Islander  Female    33
                              ..
Other               Male      32
White               Female    35
                    Male      38
Name: (age, median), Length: 10, dtype: int64