[Fast Campus] Data Analysis Bootcamp: Week 5 Study Log

데이터쿠 2024. 9. 20. 17:00

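This week included the Chuseok holiday, so it flew by.

I had hoped to use the break to shore up my weak spots, but given my circumstances the only window of time I could devote entirely to myself was the last day of the holiday, which was a pity. After this week we move on to the actual project work.

Before that, I have summarized what we covered over the two days of class.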

Libraries used for data analysis

1. NumPy

1) Key NumPy array attributes

import numpy as np

a = np.arange(12).reshape(3, 4)
print(a)

# >>> [[ 0  1  2  3]
#      [ 4  5  6  7]
#      [ 8  9 10 11]]

.shape : the size of the array along each axis

a.shape

# >>> (3, 4)

.ndim : the number of axes (dimensions)

a.ndim

# >>> 2

.dtype : the type of each element in the array

a.dtype

# >>> dtype('int64')

.itemsize : the size in bytes of each element

a.itemsize

# >>> 8

.size : the total number of elements

a.size

# >>> 12
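
The attributes above can be checked together on one array. This is a minimal sketch, assuming the array is built with `np.arange` and an explicit `dtype=np.int64`; the explicit dtype is an assumption added here because the default integer type can differ by platform and NumPy version.

```python
import numpy as np

# Build the 3x4 example array; dtype is fixed explicitly so the results
# below do not depend on the platform's default integer type.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.shape)     # (3, 4)  -> 3 rows, 4 columns
print(a.ndim)      # 2       -> two axes
print(a.dtype)     # int64
print(a.itemsize)  # 8       -> bytes per element
print(a.size)      # 12      -> 3 * 4 elements
print(a.nbytes)    # 96      -> size * itemsize
```

Note that `.size` is always the product of the values in `.shape`.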

2) Working with NumPy arrays

Creating arrays

np.array(list or tuple)

import numpy as np

a = np.array([1, 2, 3])
print(a)

# >>> [1 2 3]

np.arange(start, stop, step)
- step = the common difference between consecutive values
- generates numbers spaced `step` apart
- the end value (stop) is not included
```
np.arange(10, 30, 5)

# >>> array([10, 15, 20, 25])
```

np.linspace(start, stop, n)

generates n evenly spaced values (that is, n - 1 equal intervals)
both the start and end values are included

np.linspace(0, 10, 2)
# >>> array([0., 10.])

np.linspace(0, 10, 3)
# >>> array([0., 5., 10.])
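
The difference between the two functions is easiest to see side by side: `np.arange` fixes the step, while `np.linspace` fixes the number of points. A small sketch (the variable names are only for illustration):

```python
import numpy as np

# arange: fixed step, stop value excluded
by_step = np.arange(0, 10, 5)

# linspace: fixed number of points, both endpoints included
by_count = np.linspace(0, 10, 3)

# retstep=True also returns the spacing between consecutive points
points, spacing = np.linspace(0, 10, 5, retstep=True)

print(by_step.tolist())   # [0, 5]
print(by_count.tolist())  # [0.0, 5.0, 10.0]
print(points.tolist())    # [0.0, 2.5, 5.0, 7.5, 10.0]
print(spacing)            # 2.5
```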

.reshape()
- rearranges an array into the requested shape (the total number of elements must stay the same)
```
np.arange(6).reshape(2,3)

# >>> array([[0, 1, 2],
#            [3, 4, 5]])
```
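
Two details of `.reshape()` are worth trying out: one dimension can be given as `-1` so that NumPy infers it, and a shape whose element count does not match raises an error. A minimal sketch:

```python
import numpy as np

a = np.arange(6)

# -1 lets NumPy infer that dimension from the total size (6 / 2 = 3)
b = a.reshape(2, -1)
print(b.shape)  # (2, 3)

# The total number of elements must stay the same, so 6 elements
# cannot be arranged as 4 x 2.
try:
    a.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)
```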

3) Aggregation functions

a = np.arange(8).reshape(2, 4)**2
print(a)

# >>> [[ 0  1  4  9]
#      [16 25 36 49]]

.sum() : the sum of all elements

print(a.sum())
# >>> 140

.min() : the minimum of all elements

print(a.min())
# >>> 0

.max() : the maximum of all elements

print(a.max())
# >>> 49

.argmax() : the index of the maximum element (counted in the flattened array)

print(a.argmax())
# >>> 7

.cumsum() : the cumulative sum of all elements (in flattened order)

print(a.cumsum())
# >>> [  0   1   5  14  30  55  91 140]
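
The calls above aggregate over the whole array. The same methods also accept an `axis` argument to aggregate per column or per row, and the flat index from `.argmax()` can be converted back to a (row, column) pair. A sketch using the same array:

```python
import numpy as np

a = np.arange(8).reshape(2, 4) ** 2

col_sums = a.sum(axis=0)  # collapse the rows: one sum per column
row_sums = a.sum(axis=1)  # collapse the columns: one sum per row

# argmax() returns a position in the flattened array;
# unravel_index converts it back to (row, column).
flat_idx = a.argmax()
row, col = np.unravel_index(flat_idx, a.shape)

print(col_sums.tolist())   # [16, 26, 40, 58]
print(row_sums.tolist())   # [14, 126]
print(int(row), int(col))  # 1 3
```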

2. Pandas

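1) Data structures

Pandas provides two structured data types: Series and DataFrame.

Series

A Series is a one-dimensional array of values laid out in order.

Each index label corresponds one-to-one with a value.

A Series can be created from a dictionary or a list.

Dictionary → Series

Each dictionary key becomes the corresponding index label of the Series.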

import pandas as pd

data = {'a': 1, 'b': 2}
pd.Series(data)

dict_data = {'a': 1, 'b': 2, 'c': 3}
series_data = pd.Series(dict_data)
print(series_data)

# >>> a    1
#     b    2
#     c    3
#     dtype: int64
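
Because the keys become index labels, values can be looked up by label as well as by position. A short sketch using the same dictionary:

```python
import pandas as pd

series_data = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(series_data.index.tolist())   # ['a', 'b', 'c']
print(series_data.values.tolist())  # [1, 2, 3]

# Look up by label, or by position with .iloc
print(series_data['b'])     # 2
print(series_data.iloc[0])  # 1
```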

List → Series

If no index is specified, an integer index is assigned automatically.

data = ['a', 'b']
pd.Series(data)

list_data = ['a', 300, 'Banana', False]
series_data = pd.Series(list_data)
print(series_data)

# >>> 0         a
#     1       300
#     2    Banana
#     3     False
#     dtype: object
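
The automatic integer index can be replaced by passing `index=` when creating the Series. A small sketch; the labels `'w'` to `'z'` are made up for illustration:

```python
import pandas as pd

list_data = ['a', 300, 'Banana', False]

default_index = pd.Series(list_data)
custom_index = pd.Series(list_data, index=['w', 'x', 'y', 'z'])  # hypothetical labels

print(default_index.index.tolist())  # [0, 1, 2, 3]
print(custom_index['y'])             # Banana
print(custom_index.dtype)            # object (the values have mixed types)
```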

DataFrame

A DataFrame is a two-dimensional structure made up of rows and columns.

Each column of a DataFrame is a Series object. ( Series ⊂ DataFrame )

Dictionary → DataFrame

Each key becomes a column name.

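# values should be list-like; all-scalar values raise a ValueError unless an index is passed
data = {'a': [1, 2], 'b': [3, 4], ...}
df = pd.DataFrame(data)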

# Create a DataFrame from a dictionary
dict_data = {'c0':[1,2,3],'c1':[4,5,6],'c2':[7,8,9],'c3':[10,11,12],'c4':[13,14,15]}
df = pd.DataFrame(dict_data)

print(df)

# >>>    c0  c1  c2  c3  c4
#     0   1   4   7  10  13
#     1   2   5   8  11  14
#     2   3   6   9  12  15
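
The statement that each column is a Series can be checked directly. A minimal sketch with a shortened version of the dictionary above:

```python
import pandas as pd

dict_data = {'c0': [1, 2, 3], 'c1': [4, 5, 6], 'c2': [7, 8, 9]}
df = pd.DataFrame(dict_data)

col = df['c0']          # single brackets -> Series
sub = df[['c0', 'c1']]  # list of names   -> DataFrame

print(type(col).__name__)  # Series
print(type(sub).__name__)  # DataFrame
print(col.tolist())        # [1, 2, 3]
print(sub.shape)           # (3, 2)
```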

2) Data input/output

Format   Reader         Writer

csv      read_csv()     to_csv()
excel    read_excel()   to_excel()
JSON     read_json()    to_json()
sql      read_sql()     to_sql()
html     read_html()    to_html()
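
A reader and its matching writer can be tried as a round trip. The sketch below uses an in-memory `io.StringIO` buffer instead of a real file, and the sample data is made up for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B'], 'score': [90, 85]})  # made-up sample data

# Write to an in-memory buffer instead of a file on disk;
# index=False leaves the row index out of the CSV.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the same text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())   # ['name', 'score']
print(restored['score'].tolist())  # [90, 85]
```

Without `index=False`, the row index is written as an extra unnamed column, which then shows up again when the file is read back.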

3) Inspecting the data

titanic = pd.read_csv('titanic.csv')

.columns : shows the column names

titanic.columns

# >>> Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
#            'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
#           dtype='object')

.head() : prints the first 5 rows of the data. Pass a number to print a different number of rows.

titanic.head()

.tail() : prints the last 5 rows of the data. Pass a number to print a different number of rows.

titanic.tail()
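
The row-count argument of `.head()` and `.tail()` can be tried without the Titanic file. A small sketch with a made-up 10-row DataFrame standing in for a larger dataset:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})  # stand-in for a larger dataset

print(len(df.head()))            # 5 (the default)
print(df.head(3)['n'].tolist())  # [0, 1, 2]
print(df.tail(2)['n'].tolist())  # [8, 9]
```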

.shape : shows the number of rows and columns.

titanic.shape

# >>> (891, 12)

.info() : prints a summary of the DataFrame (index range, columns, non-null counts, dtypes, memory usage)

titanic.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
"""

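Non-Null Count: the number of non-missing values in each column. Subtracting it from the total number of rows gives the number of missing values per column (for example, Age has 891 - 714 = 177 missing values).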
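
The missing values per column can also be counted directly rather than read off the `.info()` output. A minimal sketch, using a small made-up DataFrame instead of the Titanic file (the column names are borrowed from it, the values are invented):

```python
import numpy as np
import pandas as pd

# Made-up rows with a few missing values
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

missing = df.isna().sum()  # missing values per column
non_null = df.count()      # same numbers as "Non-Null Count"

print(missing.to_dict())   # {'Age': 1, 'Cabin': 3, 'Fare': 0}
print(non_null.to_dict())  # {'Age': 3, 'Cabin': 1, 'Fare': 4}
```

For every column, the two counts add up to the number of rows.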
