Data preprocessing

2020. 8. 11. 16:46

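This post summarizes what the author personally studied through lectures, books, and other material.

All information in it is based on the sources listed at the end of the post.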

Preprocessing

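As the name suggests, preprocessing means processing a given dataset before it is used.

Data preprocessing turns the raw data we have into a form suited to the model we want to train.

The exact procedure differs from source to source, but it is usually described in the following seven steps.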

1. Getting the dataset

2. Importing libraries

3. Importing the dataset to a working directory

4. Handling the missing values

5. Encoding categorical data

6. Splitting dataset into training and test set

7. Feature scaling

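(There is an article that explains the procedure and methods above very well, and it is worth consulting as well.)

Steps 1 to 3 are conceptual or simple file reading, so they are skipped here.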
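
Although steps 2 and 3 are skipped, a minimal sketch of them may help as a reminder. The CSV content below is a made-up stand-in for a real file (it is passed through `io.StringIO` so the snippet runs without any file on disk), and the column layout (features first, label last) is an assumption that matches the zoo example used later in this post.

```python
import io
import numpy as np

# Hypothetical stand-in for a CSV file: 3 rows, 2 feature columns + 1 label column.
csv_text = "1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"

# np.loadtxt accepts a file path or any file-like object.
xy = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.float32)

x = xy[:, :-1]   # every column except the last one: features
y = xy[:, [-1]]  # the last column, kept 2-D: labels

print(xy.shape, x.shape, y.shape)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.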

Handling the missing values

Simply put, this step deals with the entries that are missing from the dataset.

The scikit-learn module makes this very easy.

import numpy as np
from sklearn.impute import SimpleImputer

x = np.array([[1,None,1,5], [2,4,1,2],[None,1,2,3]])
print(x)

'''
[[1 None 1 5]
 [2 4 1 2]
 [None 1 2 3]]
'''

# strategy='mean' replaces each missing entry with the mean of its column
ex_imputer = SimpleImputer(strategy='mean')
new_x = ex_imputer.fit_transform(x)
print(new_x)

'''
[[1.  2.5 1.  5. ]
 [2.  4.  1.  2. ]
 [1.5 1.  2.  3. ]]
'''

For details on the function used here, see the official scikit-learn documentation.
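
To see what mean imputation does under the hood, here is a rough sketch that performs it by hand with NumPy and compares the result with `SimpleImputer`. It uses `np.nan` rather than `None` for the missing entries, which is an assumption made here so that the array has a plain float dtype.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as the example above, with np.nan marking the missing entries.
x = np.array([[1, np.nan, 1, 5],
              [2, 4, 1, 2],
              [np.nan, 1, 2, 3]], dtype=float)

# Mean of each column, ignoring the missing entries.
col_mean = np.nanmean(x, axis=0)

# Keep existing values, fill each missing one with the mean of its column.
filled = np.where(np.isnan(x), col_mean, x)

# The manual result should agree with SimpleImputer(strategy='mean').
same = np.allclose(filled, SimpleImputer(strategy="mean").fit_transform(x))
print(filled)
print(same)
```

Other values of `strategy`, such as `'median'` or `'most_frequent'`, follow the same pattern with a different per-column statistic.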

Encoding categorical data

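Categorical data is data that is divided into categories.

Examples include race, gender, country, and group (e.g. A~F).

During preprocessing, categorical data is usually replaced with integers starting from 0 (0 to N-1 for N categories). This is called Label Encoding.

Label Encoding alone, however, does not fully handle categorical data, because the model may interpret the label values as continuous numbers.

In other words, data representing the groups {A,B,C,D,E,F} would enter training as the numeric values {0.,1.,2.,3.,4.,5.}, which can lead to wrong results.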
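
The problem can be made concrete with a small sketch. The mapping below is built by hand for illustration (it is not the sklearn API): once the groups become integers, a model that reads them as numbers sees group A as five times farther from F than from B, even though the groups have no such order.

```python
import numpy as np

groups = np.array(["A", "B", "C", "D", "E", "F"])

# Label encoding by hand: each category gets its index in the sorted list of categories.
classes = np.unique(groups)
label_of = {c: i for i, c in enumerate(classes.tolist())}

# Read as plain numbers, the labels imply distances between groups.
gap_a_b = abs(label_of["A"] - label_of["B"])
gap_a_f = abs(label_of["A"] - label_of["F"])

print(label_of)
print(gap_a_b, gap_a_f)  # 1 5
```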

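A representative way to feed categorical data into training with its meaning intact is Dummy Encoding, which represents the labeled categorical values as dummy (0/1 indicator) values.

One example is One-Hot Encoding, which was explained earlier in the Softmax Regression post.

sklearn supports this as well. See the code and its output below.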

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

x = np.array([['A'],['B'],['C'],['D'],['E'],['F']])

# labeling (LabelEncoder expects a 1-D array, so flatten the column first)
lb_encoder = LabelEncoder()
label_x = lb_encoder.fit_transform(x.ravel())
print(label_x)
'''
[0 1 2 3 4 5]
'''
# One-Hot Encoding (OneHotEncoder expects a 2-D array, one column per feature)
label_x = label_x.reshape(-1,1)
o_h_encoder = OneHotEncoder()
one_hot_x = o_h_encoder.fit_transform(label_x).toarray()  # sparse matrix -> dense array
print(one_hot_x)
'''
[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]]
'''
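
As a side note, the following sketch assumes a reasonably recent scikit-learn (0.20 or later), where `OneHotEncoder` accepts string categories directly, so the separate labeling step can be left out. It also checks that every pair of one-hot vectors is equally far apart, which is exactly what removes the artificial ordering that integer labels introduce.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([["A"], ["B"], ["C"], ["D"], ["E"], ["F"]])

# Fit directly on the string column; the result is sparse, so convert it to a dense array.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(x).toarray()

# Euclidean distance between the one-hot rows of A and B, and of A and F.
dist_a_b = np.linalg.norm(one_hot[0] - one_hot[1])
dist_a_f = np.linalg.norm(one_hot[0] - one_hot[5])

print(encoder.categories_[0])
print(dist_a_b, dist_a_f)  # both equal sqrt(2)
```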

Splitting dataset into training and test set

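As the name suggests, this step splits the whole dataset into a train_set for training and a test_set for testing. If the entire dataset were used for training, no untrained data would be left for testing, so there would be no way to judge whether the model works properly once training is finished.

Typically 20% of the dataset is held out for testing, but it is better to adjust this ratio to the size of the dataset.

(e.g. if the dataset is very large, a smaller training share is still enough for training, and the remaining data allows more testing.)

Here is the code using sklearn. (The dataset is the same one used in the second example code of the softmax regression post.)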

import numpy as np
from sklearn.model_selection import train_test_split  

xy = np.loadtxt('/gdrive/My Drive/data-04-zoo.csv', delimiter=',', dtype=np.float32)
x = xy[:,:-1]
y = xy[:,[-1]]
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)  

print(xy.shape, x_train.shape, x_test.shape, y_train.shape, y_test.shape)
'''
(101, 17) (80, 16) (21, 16) (80, 1) (21, 1)
'''

For a dataset of 101 rows, splitting with test_size set to 0.2 (=20%) yields a train_set of 80 rows and a test_set of 21 rows.
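
To get a feel for how `test_size` changes the split, the sketch below repeats the split for a few ratios. It uses synthetic arrays with the same shape as the zoo data (101 rows, 16 features) instead of the real file, so the values themselves are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the zoo dataset: 101 rows, 16 features, 1 label.
x = np.arange(101 * 16).reshape(101, 16)
y = np.arange(101).reshape(101, 1)

sizes = {}
for test_size in (0.1, 0.2, 0.3):
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)
    sizes[test_size] = (len(x_train), len(x_test))

print(sizes)
```

The number of test rows is rounded up, which is why 20% of 101 rows gives 21 test rows rather than 20.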

Feature scaling (= Data Normalization)

Feature scaling, the last preprocessing step, simply means bringing the ranges of the variables onto a common scale.

When training, an ML model generally does not take into account that each variable comes with its own range.

For example, if a dataset contains a variable A with values between 1 and 10 and a variable B with values between 10000 and 100000, the model is influenced far more by variable B than by variable A during training.

(This is because many ML models rely on the Euclidean distance between two points $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ : $d = \sqrt{(x_{1} - x_{2})^2 + (y_{1} - y_{2})^2}$)

Therefore, if variables A and B are of similar importance for training, the range of B has to be brought to the same 1~10 as A, via $B_{new} = B / 10000$, for training to give accurate results.

This is the idea behind Feature scaling.
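
A small numeric sketch of this effect follows. The three sample points are invented for illustration: `q` differs from `p` by the full range of A, while `r` differs from `p` by only a small part of the range of B, yet before scaling the distance to `r` is far larger.

```python
import numpy as np

# Hypothetical samples: column 0 is variable A (1~10), column 1 is variable B (10000~100000).
p = np.array([1.0, 50000.0])
q = np.array([10.0, 50000.0])  # differs from p only in A, by A's full range
r = np.array([1.0, 51000.0])   # differs from p only in B, by a small part of B's range

# Raw Euclidean distances: the difference in B dominates.
raw_pq = np.linalg.norm(p - q)
raw_pr = np.linalg.norm(p - r)
print(raw_pq, raw_pr)  # 9.0 1000.0

# Divide B by 10000 so that both variables span 1~10.
scale = np.array([1.0, 10000.0])
scaled_pq = np.linalg.norm(p / scale - q / scale)
scaled_pr = np.linalg.norm(p / scale - r / scale)
print(scaled_pq, scaled_pr)
```

After scaling, the large change in A outweighs the small change in B, which is the behaviour one would expect if both variables matter equally.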

(There is a well-written post on Feature Scaling that is worth consulting.)

To wrap up, let's look at the two most commonly used Feature scaling methods through sklearn.

Min-Max Normalization : $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1,10000], [3,20000], [9, 75000], [5, 40000]])
mnmx = MinMaxScaler()
new_x = mnmx.fit_transform(x)

print(new_x)
'''
[[0.         0.        ]
 [0.25       0.15384615]
 [1.         1.        ]
 [0.5        0.46153846]]
'''
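
As a cross-check, the Min-Max formula can be applied by hand with NumPy. This is only a sketch of the formula itself, computed column by column, and it should reproduce the `MinMaxScaler` output shown above.

```python
import numpy as np

x = np.array([[1, 10000], [3, 20000], [9, 75000], [5, 40000]], dtype=float)

# Per-column minimum and maximum.
x_min = x.min(axis=0)
x_max = x.max(axis=0)

# x_new = (x - x_min) / (x_max - x_min), applied column-wise through broadcasting.
x_new = (x - x_min) / (x_max - x_min)
print(x_new)
```

Every column now runs from exactly 0 to 1, regardless of its original range.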

Standardization : $x_{new} = \frac{x - \mu}{\sigma}$

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1,10000], [3,20000], [9, 75000], [5, 40000]])
stnd = StandardScaler()
new_x = stnd.fit_transform(x)

print(new_x)
'''
[[-1.18321596 -1.05662467]
 [-0.50709255 -0.65410099]
 [ 1.52127766  1.55977928]
 [ 0.16903085  0.15094638]]
'''
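
The standardization formula can be checked by hand in the same way. One detail worth noting in this sketch: `StandardScaler` uses the population standard deviation, which corresponds to NumPy's default `ddof=0`.

```python
import numpy as np

x = np.array([[1, 10000], [3, 20000], [9, 75000], [5, 40000]], dtype=float)

# Per-column mean and population standard deviation (ddof=0, NumPy's default).
mu = x.mean(axis=0)
sigma = x.std(axis=0)

# x_new = (x - mu) / sigma, applied column-wise through broadcasting.
x_new = (x - mu) / sigma
print(x_new)
print(x_new.mean(axis=0), x_new.std(axis=0))
```

After standardization each column has mean 0 and standard deviation 1, up to floating-point error.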

References

1. Sung Kim Youtube channel : https://www.youtube.com/channel/UCML9R2ol-l0Ab9OXoNnr7Lw

2. Andrew Ng Coursera class : https://www.coursera.org/learn/machine-learning

3. Cho, Taeho (2017). 모두의 딥러닝 (Deep Learning for Everyone). Seoul: Gilbut.
