[Kaggle] Predicting Melbourne House Prices - 4 Pipelines

Pipelines!

Until now I thought pipelines only existed in computer architecture, but it turns out machine learning has them too

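The advantages of scikit-learn pipelines are said to be:

1. Cleaner code : writing separate code for every preprocessing step gets messy and makes the flow hard to follow, but a pipeline keeps everything tidy

2. Fewer mistakes : you are less likely to skip a preprocessing step or apply the steps in the wrong order

3. Easier to productionize

4. More options for model validation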
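To make points 1, 2 and 4 a bit more concrete, here is a minimal sketch on made-up data. The toy arrays, the mean imputer and the choice of LinearRegression are my own assumptions for illustration, not something from the Kaggle course. The imputer and the model are bundled so they always run in the same order, and the whole pipeline can be handed to cross_val_score as a single estimator.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (made up): y = 2 * x0 + x1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2 * X_toy[:, 0] + X_toy[:, 1]
X_toy[::10, 0] = np.nan  # knock out every 10th value of the first column

toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', LinearRegression()),
])

# A single fit/predict call runs every step in order
toy_pipeline.fit(X_toy, y_toy)
toy_preds = toy_pipeline.predict(X_toy)

# The same pipeline object plugs straight into cross-validation
cv_mae = -cross_val_score(toy_pipeline, X_toy, y_toy, cv=5,
                          scoring='neg_mean_absolute_error')
print('CV MAE per fold:', cv_mae)
```

Because the imputer is refit inside every fold, the validation rows of a fold never influence the imputed values used for training in that fold.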

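That's enough theory, so let's see how to actually apply it!

Once again I used the Kaggle house price data

To keep things simple while learning, for the non-numeric data I only used columns that don't have too many unique values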


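from sklearn.model_selection import train_test_split

# X (features) and y (target) are assumed to be loaded beforehand
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)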
 
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]
 
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
 
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

80% of the data goes to training and 20% to validation!
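Here's a tiny sketch of what the cardinality filter and the 80/20 split do, using a made-up frame. The column values and the threshold of 5 are assumptions chosen to suit ten rows. I also check for non-numeric columns with is_numeric_dtype instead of comparing the dtype to "object", which is my own substitution so the sketch does not depend on how pandas stores strings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (made up): 'Type' has 3 unique values, 'Address' is unique per row
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3, 2, 5, 3, 4, 2, 3],
    'Type': ['h', 'u', 't', 'h', 'h', 'u', 't', 'h', 'u', 'h'],
    'Address': [f'{i} Example St' for i in range(10)],
})

# nunique() counts the distinct values in each column
print(toy.nunique())

# Keep non-numeric columns with fewer than 5 distinct values
low_card_cols = [c for c in toy.columns
                 if toy[c].nunique() < 5
                 and not pd.api.types.is_numeric_dtype(toy[c])]
print(low_card_cols)  # ['Type']

# An 80/20 split of 10 rows leaves 8 for training and 2 for validation
toy_train, toy_valid = train_test_split(toy, train_size=0.8, test_size=0.2,
                                        random_state=0)
print(len(toy_train), len(toy_valid))  # 8 2
```

A column like 'Address' would turn into one new column per distinct value under one-hot encoding, which is why high-cardinality columns get dropped here.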

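I can't wait to get to cross validation.. I'm short on time so I keep doing this in small pieces, which is frustrating ㅠㅠ I'll work hard once vacation starts

Anyway! After picking out the columns to use like this, I started on the preprocessing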


from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
 
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
 
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
 
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

So there is an imputer for the numerical data, and an imputer plus a OneHotEncoder for the categorical data!

steps lists the stages of the pipeline in the order they are applied
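To see what this preprocessor actually produces, here's a small sketch on a made-up frame. The column names and values are assumptions, and I added sparse_threshold=0 (not in the code above) so the result is always a plain dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (made up) with one missing value per column
toy_train = pd.DataFrame({
    'Rooms': [2.0, np.nan, 4.0, 3.0],
    'Type': pd.Series(['h', 'h', 'u', np.nan], dtype=object),
})
# Toy validation row whose category 't' never appears in training
toy_valid = pd.DataFrame({
    'Rooms': [3.0],
    'Type': pd.Series(['t'], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # strategy='constant' fills missing numbers with 0 by default
        ('num', SimpleImputer(strategy='constant'), ['Rooms']),
        # impute first, then one-hot encode, in the order given by steps
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), ['Type']),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_out = toy_preprocessor.fit_transform(toy_train)
valid_out = toy_preprocessor.transform(toy_valid)
print(train_out)  # columns: Rooms, Type=h, Type=u
print(valid_out)  # the unseen 't' becomes all zeros in the one-hot columns
```

The missing Rooms value becomes 0, the missing Type becomes 'h' (the most frequent value), and handle_unknown='ignore' keeps the unseen category from raising an error at prediction time.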


from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
 
model_1 = LinearRegression()
model_2 = DecisionTreeRegressor(random_state = 0)
model_3 = RandomForestRegressor(n_estimators = 10, random_state = 0)
 

Let's compare three models

I tried linear regression, a decision tree regressor and a random forest regressor

First up, model number one!


from sklearn.metrics import mean_absolute_error
 
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model_1)])
 
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)
 
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
 
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

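With the first model, the MAE came out to 269715.5478161838

For the second and third models I used the same code with only model_1 swapped for model_2 and model_3, which gave MAEs of 221891.95360824742 and 171155.24614629356 respectively

As expected, the random forest had the lowest error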
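Instead of editing model_1 by hand each time, the comparison could be written as a loop. This is only a sketch: score_model is a hypothetical helper name, the data comes from make_regression rather than the Melbourne set, and a bare SimpleImputer stands in for the preprocessor above, so the numbers it prints will not match the MAEs in this post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Melbourne data)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0,
                               random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, train_size=0.8,
                                          test_size=0.2, random_state=0)

def score_model(model):
    """Fit preprocessing + model as one pipeline and return validation MAE."""
    pipe = Pipeline(steps=[('preprocessor', SimpleImputer(strategy='constant')),
                           ('model', model)])
    pipe.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, pipe.predict(X_va))

candidates = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=0),
    'forest': RandomForestRegressor(n_estimators=10, random_state=0),
}

scores = {name: score_model(model) for name, model in candidates.items()}
for name, mae in scores.items():
    print(f'{name}: MAE = {mae:.2f}')
```

Since every model goes through the same pipeline, the only thing that changes between runs is the final step.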

Next I tried adding a step and changing a parameter on this model to bring the error down


from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
 
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('scl', StandardScaler()),
                              ('model', model_3)])
 
# Raise the number of trees in the random forest from 10 to 100
my_pipeline.set_params(model__n_estimators=100)
 
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)
 
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
 
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

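After standardizing with Standard Scaler and raising n_estimators, the error dropped to 161117.62791237113!

To be honest, it turned out better without the Standard Scaler lol

Let's just call it pipeline practice ㅠㅠ

I wonder when Standard Scaler is actually a good idea..

I'll look into it once I've finished the Kaggle course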
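A rough sketch toward that question, under my own assumptions: tree-based models split on one feature at a time using thresholds, so rescaling a feature generally should not change what they learn, while distance-based models are sensitive to feature scale. The data below is made up, and KNeighborsRegressor is a model I picked for contrast; it is not used anywhere else in this post.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 400
informative = rng.normal(size=n)          # small scale, drives the target
noise_feature = rng.normal(size=n) * 100  # large scale, unrelated to the target
X_demo = np.column_stack([informative, noise_feature])
y_demo = 3 * informative + rng.normal(scale=0.1, size=n)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

def demo_mae(model, scale):
    """Validation MAE of the model, with or without a StandardScaler step."""
    steps = [('scl', StandardScaler())] if scale else []
    steps.append(('model', model))
    pipe = Pipeline(steps=steps)
    pipe.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, pipe.predict(X_va))

results = {
    'tree_raw': demo_mae(DecisionTreeRegressor(random_state=0), scale=False),
    'tree_scaled': demo_mae(DecisionTreeRegressor(random_state=0), scale=True),
    'knn_raw': demo_mae(KNeighborsRegressor(n_neighbors=5), scale=False),
    'knn_scaled': demo_mae(KNeighborsRegressor(n_neighbors=5), scale=True),
}
for name, mae in results.items():
    print(f'{name}: MAE = {mae:.3f}')
```

In this setup the tree's error should barely move with scaling, while the nearest-neighbor model should improve a lot once the large noisy feature stops dominating the distances. That would suggest the scaler matters mostly for models that compare feature magnitudes directly.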
