logo

[python-ds] 판다스 기초

import pandas as pd
 

엑셀 파일 열기

df = pd.read_excel('census.xlsx')
 

첫 부분 보기

df.head()
ageworkclassfnlwgteducationeducation_nummarital_statusoccupationrelationshipracesexcapital_gaincapital_losshours_per_weeknative_countryincome
039State-gov77516Bachelors13Never-marriedAdm-clericalNot-in-familyWhiteMale2174040United-States<=50K
150Self-emp-not-inc83311Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0013United-States<=50K
238Private215646HS-grad9DivorcedHandlers-cleanersNot-in-familyWhiteMale0040United-States<=50K
353Private23472111th7Married-civ-spouseHandlers-cleanersHusbandBlackMale0040United-States<=50K
428Private338409Bachelors13Married-civ-spouseProf-specialtyWifeBlackFemale0040Cuba<=50K

csv로 저장, 엑셀로 저장 csv 열기

 

컬럼 이름

df.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')
 

컬럼 선택

df['age']
0        39
1        50
2        38
         ..
32558    58
32559    22
32560    52
Name: age, Length: 32561, dtype: int64
 

dtypes와 dtype

df.dtypes
age                int64
workclass         object
fnlwgt             int64
                   ...
hours_per_week     int64
native_country    object
income            object
Length: 15, dtype: object
df['age'].dtype
dtype('int64')
 

배열로 변환

df['age'].to_numpy()  # 주의: 여기까지는 import numpy 필요 없음
array([39, 50, 38, ..., 58, 22, 52], dtype=int64)
 

통계 계산

import numpy as np
np.max(df['age'])
90
df['age'].max()
90
2019 - df['age']
0        1980
1        1969
2        1981
         ...
32558    1961
32559    1997
32560    1967
Name: age, Length: 32561, dtype: int64
 

여러 컬럼에 계산

cols = ['age', 'education_num']
df[cols].head()
ageeducation_num
03913
15013
2389
3537
42813
df[cols].max()
age              90
education_num    16
dtype: int64
df[cols].agg(np.max)
age              90
education_num    16
dtype: int64
 

정렬

df.sort_values('age').head()
ageworkclassfnlwgteducationeducation_nummarital_statusoccupationrelationshipracesexcapital_gaincapital_losshours_per_weeknative_countryincome
1231817Private12736611th7Never-marriedSalesOwn-childWhiteFemale008United-States<=50K
631217Private13275511th7Never-marriedSalesOwn-childWhiteMale0015United-States<=50K
3092717Private10847011th7Never-marriedOther-serviceOwn-childBlackMale0017United-States<=50K
1278717Local-gov30890111th7Never-marriedAdm-clericalOwn-childWhiteFemale0015United-States<=50K
2575517?4740711th7Never-married?Own-childWhiteMale0010United-States<=50K
df.sort_values('age', ascending=False).head()  # 내림차순
ageworkclassfnlwgteducationeducation_nummarital_statusoccupationrelationshipracesexcapital_gaincapital_losshours_per_weeknative_countryincome
540690Private51744Masters14Never-marriedExec-managerialNot-in-familyBlackMale0050United-States>50K
662490Private31398611th7Married-civ-spouseCraft-repairHusbandWhiteMale0040United-States<=50K
2061090Private206667Masters14Married-civ-spouseProf-specialtyWifeWhiteFemale0040United-States>50K
104090Private137018HS-grad9Never-marriedOther-serviceNot-in-familyWhiteFemale0040United-States<=50K
193590Private221832Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0045United-States<=50K
 

쿼리

df.query('age < 18 and capital_gain > 1000')
ageworkclassfnlwgteducationeducation_nummarital_statusoccupationrelationshipracesexcapital_gaincapital_losshours_per_weeknative_countryincome
10617?30487310th6Never-married?Own-childWhiteFemale34095032United-States<=50K
27117Private1912609th5Never-marriedOther-serviceOwn-childWhiteMale1055024United-States<=50K
42117Private17502411th7Never-marriedHandlers-cleanersOwn-childWhiteMale2176018United-States<=50K
................................................
169117Private10385111th7Never-marriedAdm-clericalOwn-childWhiteFemale1055020United-States<=50K
360517Private13012510th6Never-marriedOther-serviceOwn-childAmer-Indian-EskimoFemale1055020United-States<=50K
2788917Private5653611th7Never-marriedSalesOwn-childWhiteFemale1055018India<=50K

7 rows × 15 columns

 

그룹

df.groupby('income').agg({'education_num': np.mean})
education_num
income
<=50K9.595065
>50K11.611657
 

상자-수염 그림

import seaborn as sns
sns.boxplot(x="income", y="education_num", data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x218f61f7978>
Previous
파이플롯으로 간단한 그래프 그리기