Kaggle | Autogluon

python

kaggle

titanic

autogluon

Author

강신성

Published

2023-10-17

Let's fit the Titanic data using AutoGluon, an automated prediction (AutoML) tool!

1. Library imports

#pip install autogluon

import pandas as pd
import numpy as np

## Import the classes that handle tabular (table-shaped) data.
from autogluon.tabular import TabularDataset, TabularPredictor

C:\Users\hollyriver\anaconda3\envs\py\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

2. Analysis

A. Loading the data

This step is analogous to receiving the exam problems.

tr = TabularDataset('./data/train.csv')  ## training data
tst = TabularDataset('./data/test.csv')

## tr = TabularDataset('/kaggle/input/titanic/train.csv')  ## training data
## tst = TabularDataset('/kaggle/input/titanic/test.csv')

## tr = pd.read_csv('/kaggle/input/titanic/train.csv')
## tst

### B. Creating the Predictor

This step is analogous to creating the student who will solve the problems.

predictr = TabularPredictor('Survived') ## name of the label (target) column; the variable name predictr is deliberately misspelled

No path specified. Models will be saved in: "AutogluonModels\ag-20231017_130536"

So what is predictr?

type(predictr)

autogluon.tabular.predictor.predictor.TabularPredictor

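It appears to be a class defined by AutoGluon.

C. Fitting (fit)

This corresponds to the learning process.

predictr.fit(tr) ## give the student (predictr) the problems (tr) and let it study (predictr.fit(tr))
## AutoGluon trains on whatever it can learn from tr as-is; this differs from sklearn models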

Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_130536"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
Disk Space Avail:   57.71 GB / 255.01 GB (22.6%)
Train Data Rows:    891
Train Data Columns: 11
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [0, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    1930.98 MB
    Train Data (Original)  Memory Usage: 0.31 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['Name']
            CountVectorizer fit with vocabulary size = 8
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])        : 2 | ['Age', 'Fare']
        ('int', [])          : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
        ('object', [])       : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
        ('object', ['text']) : 1 | ['Name']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 3 | ['Ticket', 'Cabin', 'Embarked']
        ('float', [])                       : 2 | ['Age', 'Fare']
        ('int', [])                         : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
        ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
        ('int', ['bool'])                   : 1 | ['Sex']
        ('int', ['text_ngram'])             : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
    0.3s = Fit runtime
    11 features in original data used to generate 28 features in processed data.
    Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
    0.6536   = Validation score   (accuracy)
    1.83s    = Training   runtime
    0.22s    = Validation runtime
Fitting model: KNeighborsDist ...
    0.6536   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    0.8156   = Validation score   (accuracy)
    1.08s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBM ...
    0.8212   = Validation score   (accuracy)
    0.43s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: RandomForestGini ...
    0.8156   = Validation score   (accuracy)
    0.64s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.8156   = Validation score   (accuracy)
    0.54s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: CatBoost ...
    0.8268   = Validation score   (accuracy)
    7.47s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: ExtraTreesGini ...
    0.8156   = Validation score   (accuracy)
    0.52s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.8101   = Validation score   (accuracy)
    0.52s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
    0.8324   = Validation score   (accuracy)
    3.28s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: XGBoost ...
    0.8101   = Validation score   (accuracy)
    1.32s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: NeuralNetTorch ...
    0.8212   = Validation score   (accuracy)
    7.74s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: LightGBMLarge ...
    0.8324   = Validation score   (accuracy)
    0.93s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8324   = Validation score   (accuracy)
    0.67s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 28.23s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_130536")

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20df4525d50>

Training is complete, so let's check the leaderboard (grading the mock exam).

predictr.leaderboard()

                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0         LightGBMLarge   0.832402       0.009000  0.928348                0.009000           0.928348            1       True         13
1       NeuralNetFastAI   0.832402       0.028001  3.279027                0.028001           3.279027            1       True         10
2   WeightedEnsemble_L2   0.832402       0.029001  3.949707                0.001000           0.670681            2       True         14
3              CatBoost   0.826816       0.006002  7.468165                0.006002           7.468165            1       True          7
4              LightGBM   0.821229       0.006004  0.432719                0.006004           0.432719            1       True          4
5        NeuralNetTorch   0.821229       0.031999  7.740229                0.031999           7.740229            1       True         12
6            LightGBMXT   0.815642       0.005002  1.084200                0.005002           1.084200            1       True          3
7        ExtraTreesGini   0.815642       0.061763  0.516426                0.061763           0.516426            1       True          8
8      RandomForestEntr   0.815642       0.064586  0.538568                0.064586           0.538568            1       True          6
9      RandomForestGini   0.815642       0.064718  0.637898                0.064718           0.637898            1       True          5
10              XGBoost   0.810056       0.013002  1.323482                0.013002           1.323482            1       True         11
11       ExtraTreesEntr   0.810056       0.062742  0.519159                0.062742           0.519159            1       True          9
12       KNeighborsDist   0.653631       0.005996  0.012003                0.005996           0.012003            1       True          2
13       KNeighborsUnif   0.653631       0.215770  1.826697                0.215770           1.826697            1       True          1
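The `_marginal` columns can be read against the cumulative ones. The sketch below uses only numbers copied from the leaderboard above. It assumes that `fit_time` and `pred_time_val` of a stacked model include the cost of the base models it depends on. That the ensemble here relies only on NeuralNetFastAI is an inference from the arithmetic, not something the log states.

```python
# Values copied from the leaderboard above (first run).
base = {"fit_time": 3.279027, "pred_time_val": 0.028001}            # NeuralNetFastAI
ensemble_marginal = {"fit_time": 0.670681, "pred_time_val": 0.001000}
ensemble_total = {"fit_time": 3.949707, "pred_time_val": 0.029001}  # WeightedEnsemble_L2

# Assumption: total = base model cost + the ensemble's own (marginal) cost.
reconstructed = {key: base[key] + ensemble_marginal[key] for key in base}

for key, value in reconstructed.items():
    print(f"{key}: reconstructed={value:.6f}, reported={ensemble_total[key]:.6f}")
```

If the assumption holds, the reconstructed and reported values agree up to rounding in the last printed digit.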

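What score_val means

* What did predictr actually learn from?

> Suppose there is a predictor and a train set with 1,000 rows. The predictor does not use all of them for training.
>
> * If it trains on 800 rows, the remaining 200 are not studied and are used only to check its answers.

* Why hold back 200 rows?

> It is worth testing whether the rules learned for finding the answers are correct, and whether they generalize to other data. The held-out rows are used for that check.
>
> This is a self-administered test meant to help the model do well on the real test. The 200 held-out rows are called the validation set.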
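The hold-out idea can be sketched in a few lines. This is an illustration only: the `holdout_split` helper and its rounding rule are assumptions, and AutoGluon's own splitting logic may differ (for example, by stratifying on the label). With 891 rows and `holdout_frac=0.2` it gives the same 712 / 179 sizes reported in the training log.

```python
import math
import random


def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Return (train_idx, val_idx) for a simple random hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Rounding up is an assumption chosen for this sketch.
    n_val = math.ceil(n_rows * holdout_frac)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = holdout_split(891, holdout_frac=0.2)
print(len(train_idx), len(val_idx))
```

The model is fitted on `train_idx` only, and `score_val` is computed on the rows in `val_idx`.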

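	train	val
Student 1	95%	72%
Student 2	80%	80%
…	…	…

The student with the highest score on val (the mock exam) is more meaningful than the one who only kept solving train (the practice problems).

So score_val can be read as the mock-exam score.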
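As a small illustration of picking by validation score rather than training score, the sketch below uses the two hypothetical students from the table above. The dictionary layout is just a convenient choice for this example.

```python
# Hypothetical scores, taken from the student table above.
scores = {
    "Student 1": {"train": 0.95, "val": 0.72},
    "Student 2": {"train": 0.80, "val": 0.80},
}

best_by_train = max(scores, key=lambda name: scores[name]["train"])
best_by_val = max(scores, key=lambda name: scores[name]["val"])

print("best on train:", best_by_train)
print("best on val:  ", best_by_val)
```

The two criteria disagree here. Student 1 looks better on the practice problems, while Student 2 does better on unseen questions.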

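- So let's use the WeightedEnsemble_L2 model, which shares the highest score (0.8324) in this run.^[1]

[1] When I first ran this exercise, it was definitely the single highest-scoring model…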

### D. Prediction (predict)

This is analogous to solving problems after studying.

Previous analyses

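1. Rule that every man dies and every woman survives: 0.7x / 0.76555
2. RandomForestClassifier: 0.8x / 0.77511
3. RandomForestClassifier with tuned hyperparameters: 0.8x / 0.76555 (it scored higher in the train-set analysis, but the actual result was lower)

4. Using the WeightedEnsemble_L2 model (AutoGluon picks it automatically anyway)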
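For reference, the first baseline in the list (predict survival from sex alone) can be sketched as follows. The five rows are the `Sex` and `Survived` values of the first five training rows shown by `tr.head()` in this post. The scores listed above come from the full data, not from this tiny sample.

```python
import pandas as pd

# First five training rows (Sex and Survived only), as shown by tr.head().
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Rule: predict 1 (survived) for women and 0 for men.
pred = (df["Sex"] == "female").astype(int)
accuracy = (pred == df["Survived"]).mean()
print(accuracy)
```

On these five rows the rule happens to get every row right, which is why it is a surprisingly strong baseline.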

First, let's predict on the train set.

type(tr) ## an unfamiliar type, but everything that works on a pandas DataFrame works on it too

autogluon.core.dataset.TabularDataset

tr.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

(tr.Survived == predictr.predict(tr)).mean()

0.8810325476992144

The accuracy on the train set is about 0.881, which looks quite promising.
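The expression `(tr.Survived == predictr.predict(tr)).mean()` is an accuracy: each row contributes 1 when the prediction matches the label and 0 otherwise. A small sketch, using only the row count from the log and the value printed above, recovers how many rows were predicted correctly. Building the `matches` list by hand is purely for illustration.

```python
n_rows = 891                     # "Train Data Rows" from the training log
accuracy = 0.8810325476992144    # value printed above

n_correct = round(accuracy * n_rows)
print(n_correct, "of", n_rows, "rows predicted correctly")

# The same number as a mean of per-row matches (True counts as 1, False as 0).
matches = [True] * n_correct + [False] * (n_rows - n_correct)
mean_of_matches = sum(matches) / len(matches)
print(mean_of_matches)
```

Note that this score is computed on rows the models have largely already seen, so it tends to be more optimistic than `score_val`.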

predictr.predict(tst)

0      0
1      0
2      0
3      0
4      0
      ..
413    0
414    1
415    0
416    0
417    0
Name: Survived, Length: 418, dtype: int64

tst.assign(Survived = predictr.predict(tst)).loc[:, ['PassengerId', 'Survived']]\
.to_csv('autogluon_submission.csv', index = False)
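To see what the submission file looks like, here is a minimal sketch of the same `assign` / column selection / `to_csv` chain on a tiny frame. The IDs, the `Pclass` values, and the predictions are hypothetical, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical IDs and predictions, for illustration only.
tst_demo = pd.DataFrame({"PassengerId": [1001, 1002, 1003], "Pclass": [3, 1, 2]})
pred_demo = pd.Series([0, 1, 0], name="Survived")

# Attach the predictions and keep only the two columns needed for submission.
submission = tst_demo.assign(Survived=pred_demo)[["PassengerId", "Survived"]]

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text has a header line followed by one `PassengerId,Survived` pair per row.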

The submission scored an accuracy of 0.78947, the highest so far.

3. Improvement

Can the result be improved further?

A. Feature engineering with `Fsize`

1) Data

tr = TabularDataset('./data/train.csv')  ## training data
tst = TabularDataset('./data/test.csv')

Loaded data from: ./data/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
Loaded data from: ./data/test.csv | Columns = 11 / 11 | Rows = 418 -> 418

- Feature engineering

tr.assign(Fsize = tr.SibSp + tr.Parch)
tst.assign(Fsize = tst.SibSp + tst.Parch)

#tr.eval('Fsize = SibSp + Parch')
#tst.eval('Fsize = SibSp + Parch')

tr.head()  ## assign does not modify the original data; note that no Fsize column was added

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
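The behaviour seen here, where `assign` returns a new frame and leaves the original untouched, can be reproduced on a tiny example. The `SibSp` values are those of the first five training rows. The `Parch` values (all 0) are inferred from the `Fsize` column shown later in this post, so treat them as an assumption.

```python
import pandas as pd

# SibSp / Parch for the first five training rows (Parch inferred, see note above).
df = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

df.assign(Fsize=df.SibSp + df.Parch)          # result discarded, df is unchanged
print("Fsize" in df.columns)

df_fe = df.assign(Fsize=df.SibSp + df.Parch)  # keep the returned copy instead
print(df_fe.Fsize.tolist())
```

This is why the fit call in this section passes `tr.assign(...)` directly rather than relying on `tr` having been changed.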

2) Creating the Predictor

predictr = TabularPredictor("Survived")

No path specified. Models will be saved in: "AutogluonModels\ag-20231017_132447"

3) Fitting (fit)

predictr.fit(tr.assign(Fsize = tr.SibSp + tr.Parch))  ## train on the data with the new Fsize column added

Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_132447"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
Disk Space Avail:   57.59 GB / 255.01 GB (22.6%)
Train Data Rows:    891
Train Data Columns: 12
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [0, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    1923.41 MB
    Train Data (Original)  Memory Usage: 0.32 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['Name']
            CountVectorizer fit with vocabulary size = 8
        Warning: Due to memory constraints, ngram feature count is being reduced. Allocate more memory to maximize model quality.
        Reducing Vectorizer vocab size from 8 to 4 to avoid OOM error
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])        : 2 | ['Age', 'Fare']
        ('int', [])          : 5 | ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fsize']
        ('object', [])       : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
        ('object', ['text']) : 1 | ['Name']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 3 | ['Ticket', 'Cabin', 'Embarked']
        ('float', [])                       : 2 | ['Age', 'Fare']
        ('int', [])                         : 5 | ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fsize']
        ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
        ('int', ['bool'])                   : 1 | ['Sex']
        ('int', ['text_ngram'])             : 5 | ['__nlp__.miss', '__nlp__.mr', '__nlp__.mrs', '__nlp__.william', '__nlp__._total_']
    0.3s = Fit runtime
    12 features in original data used to generate 25 features in processed data.
    Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.37s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
    0.648    = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist ...
    0.6425   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    0.8268   = Validation score   (accuracy)
    0.43s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    0.8492   = Validation score   (accuracy)
    0.55s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: RandomForestGini ...
    0.7989   = Validation score   (accuracy)
    0.51s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.8156   = Validation score   (accuracy)
    0.5s     = Training   runtime
    0.06s    = Validation runtime
Fitting model: CatBoost ...
    0.8268   = Validation score   (accuracy)
    6.8s     = Training   runtime
    0.01s    = Validation runtime
Fitting model: ExtraTreesGini ...
    0.8045   = Validation score   (accuracy)
    0.45s    = Training   runtime
    0.07s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.8045   = Validation score   (accuracy)
    0.44s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    0.8324   = Validation score   (accuracy)
    2.76s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: XGBoost ...
    0.8212   = Validation score   (accuracy)
    0.68s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: NeuralNetTorch ...
    0.8324   = Validation score   (accuracy)
    9.58s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: LightGBMLarge ...
    0.838    = Validation score   (accuracy)
    0.83s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8492   = Validation score   (accuracy)
    0.65s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 25.25s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_132447")

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20d8e852770>

- Checking the leaderboard

predictr.leaderboard()

                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              LightGBM   0.849162       0.010999  0.554687                0.010999           0.554687            1       True          4
1   WeightedEnsemble_L2   0.849162       0.012000  1.201853                0.001000           0.647166            2       True         14
2         LightGBMLarge   0.837989       0.005001  0.827996                0.005001           0.827996            1       True         13
3       NeuralNetFastAI   0.832402       0.024005  2.761039                0.024005           2.761039            1       True         10
4        NeuralNetTorch   0.832402       0.034000  9.577353                0.034000           9.577353            1       True         12
5            LightGBMXT   0.826816       0.004981  0.426716                0.004981           0.426716            1       True          3
6              CatBoost   0.826816       0.005996  6.798872                0.005996           6.798872            1       True          7
7               XGBoost   0.821229       0.008010  0.680577                0.008010           0.680577            1       True         11
8      RandomForestEntr   0.815642       0.063459  0.504724                0.063459           0.504724            1       True          6
9        ExtraTreesEntr   0.804469       0.062692  0.443597                0.062692           0.443597            1       True          9
10       ExtraTreesGini   0.804469       0.065061  0.448723                0.065061           0.448723            1       True          8
11     RandomForestGini   0.798883       0.064575  0.510397                0.064575           0.510397            1       True          5
12       KNeighborsUnif   0.648045       0.007998  0.008998                0.007998           0.008998            1       True          1
13       KNeighborsDist   0.642458       0.007999  0.011001                0.007999           0.011001            1       True          2

4) Prediction (predict)

(tr.Survived == predictr.predict(tr.assign(Fsize = tr.SibSp + tr.Parch))).mean()

0.9696969696969697

tst.assign(Survived = predictr.predict(tst.assign(Fsize = tst.SibSp + tst.Parch))).loc[:,['PassengerId','Survived']]\
.to_csv("autogluon(Fsize)_submission.csv",index=False)

Submission result: the score actually went down.
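One possible reading, offered as an interpretation and not something the outputs prove, is that the model with `Fsize` fits the training data more closely without generalizing better. The sketch below compares the gap between train accuracy and `score_val` for the two runs, using only numbers printed in this post. The comparison is rough, since the train accuracy is computed on all 891 rows, including the 179 validation rows.

```python
# Numbers copied from the outputs in this post.
runs = {
    "baseline": {"train_acc": 0.8810325476992144, "score_val": 0.832402},
    "with Fsize": {"train_acc": 0.9696969696969697, "score_val": 0.849162},
}

gaps = {name: r["train_acc"] - r["score_val"] for name, r in runs.items()}

for name, gap in gaps.items():
    print(f"{name}: train - val gap = {gap:.4f}")
```

A wider gap is consistent with overfitting, which would match a submission score that goes down even though both train accuracy and `score_val` went up.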

Let's try to improve further.

### B. `Fsize` + `drop`

1) Data

- Feature engineering (the data was already loaded above, so that step is skipped here)

_tr = tr.assign(Fsize = lambda _df : _df.SibSp + _df.Parch).drop(['SibSp','Parch'],axis=1)
_tst = tst.assign(Fsize = tst.SibSp + tst.Parch).drop(['SibSp','Parch'],axis=1)

_tr.head()
## df.drop(columns = [])
## df.drop([], axis = 1): unless columns (or axis = 1) is given, drop removes rows by default

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked	Fsize
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	A/5 21171	7.2500	NaN	S	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	PC 17599	71.2833	C85	C	1
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	STON/O2. 3101282	7.9250	NaN	S	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	113803	53.1000	C123	S	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	373450	8.0500	NaN	S	0
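The two `drop` spellings mentioned in the comments can be checked on a small frame. The values below are a hypothetical two-row sample. The last part shows what happens when neither `columns=` nor `axis=1` is given and the labels are column names.

```python
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 0], "Fare": [7.25, 8.05]})

a = df.drop(columns=["SibSp", "Parch"])
b = df.drop(["SibSp", "Parch"], axis=1)
print(a.equals(b))

# Without columns= or axis=1, drop looks for row labels and fails here.
try:
    df.drop(["SibSp", "Parch"])
    raised = False
except KeyError:
    raised = True
print(raised)
```

Either explicit form works. `columns=` tends to be easier to read because the intent is visible at the call site.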

2) Creating the Predictor

predictr = TabularPredictor('Survived')

No path specified. Models will be saved in: "AutogluonModels\ag-20231017_132627"

3) Fitting (fit)

predictr.fit(_tr)

Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_132627"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
Disk Space Avail:   57.56 GB / 255.01 GB (22.6%)
Train Data Rows:    891
Train Data Columns: 10
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [0, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    1899.15 MB
    Train Data (Original)  Memory Usage: 0.31 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['Name']
            CountVectorizer fit with vocabulary size = 8
        Warning: Due to memory constraints, ngram feature count is being reduced. Allocate more memory to maximize model quality.
        Reducing Vectorizer vocab size from 8 to 4 to avoid OOM error
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])        : 2 | ['Age', 'Fare']
        ('int', [])          : 3 | ['PassengerId', 'Pclass', 'Fsize']
        ('object', [])       : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
        ('object', ['text']) : 1 | ['Name']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 3 | ['Ticket', 'Cabin', 'Embarked']
        ('float', [])                       : 2 | ['Age', 'Fare']
        ('int', [])                         : 3 | ['PassengerId', 'Pclass', 'Fsize']
        ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
        ('int', ['bool'])                   : 1 | ['Sex']
        ('int', ['text_ngram'])             : 5 | ['__nlp__.miss', '__nlp__.mr', '__nlp__.mrs', '__nlp__.william', '__nlp__._total_']
    0.3s = Fit runtime
    10 features in original data used to generate 23 features in processed data.
    Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
    0.6536   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist ...
    0.648    = Validation score   (accuracy)
    0.02s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: LightGBMXT ...
    0.8212   = Validation score   (accuracy)
    0.47s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBM ...
    0.838    = Validation score   (accuracy)
    0.64s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: RandomForestGini ...
    0.8045   = Validation score   (accuracy)
    0.54s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.8101   = Validation score   (accuracy)
    0.53s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: CatBoost ...
    0.8324   = Validation score   (accuracy)
    7.6s     = Training   runtime
    0.01s    = Validation runtime
Fitting model: ExtraTreesGini ...
    0.7989   = Validation score   (accuracy)
    0.53s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.8045   = Validation score   (accuracy)
    0.52s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
    0.8268   = Validation score   (accuracy)
    1.95s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: XGBoost ...
    0.8268   = Validation score   (accuracy)
    0.45s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: NeuralNetTorch ...
    0.8436   = Validation score   (accuracy)
    10.87s   = Training   runtime
    0.03s    = Validation runtime
Fitting model: LightGBMLarge ...
    0.8324   = Validation score   (accuracy)
    0.82s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8492   = Validation score   (accuracy)
    0.65s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 26.66s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_132627")

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20d858cd060>
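The log above lists eight model families but reports "Fitting 13 L1 models". A minimal sketch of where 13 could come from, assuming that a configuration restricted by `problem_types` is skipped when the task is not in that list. The dict is a simplified copy of the logged one (the `ag_args` nesting is flattened), and `applies_to` and `count_models` are hypothetical helpers, not AutoGluon functions.

```python
# Simplified copy of the hyperparameter dict printed in the fit log.
# "problem_types" (when present) limits a configuration to those tasks.
CLASSIFICATION = ["binary", "multiclass"]
REGRESSION = ["regression", "quantile"]

hyperparameters = {
    "NN_TORCH": [{}],
    "GBM": [{"extra_trees": True}, {}, "GBMLarge"],
    "CAT": [{}],
    "XGB": [{}],
    "FASTAI": [{}],
    "RF": [
        {"criterion": "gini", "problem_types": CLASSIFICATION},
        {"criterion": "entropy", "problem_types": CLASSIFICATION},
        {"criterion": "squared_error", "problem_types": REGRESSION},
    ],
    "XT": [
        {"criterion": "gini", "problem_types": CLASSIFICATION},
        {"criterion": "entropy", "problem_types": CLASSIFICATION},
        {"criterion": "squared_error", "problem_types": REGRESSION},
    ],
    "KNN": [{"weights": "uniform"}, {"weights": "distance"}],
}


def applies_to(config, problem_type):
    """Return True if a configuration is usable for the given problem type."""
    if not isinstance(config, dict):
        return True  # a named preset such as "GBMLarge" has no restriction here
    allowed = config.get("problem_types")
    return allowed is None or problem_type in allowed


def count_models(hyperparameters, problem_type):
    """Count the configurations that apply to the given problem type."""
    return sum(
        applies_to(config, problem_type)
        for configs in hyperparameters.values()
        for config in configs
    )


n_binary = count_models(hyperparameters, "binary")
n_regression = count_models(hyperparameters, "regression")
print(n_binary)  # 13
```

Under this assumption the `squared_error` variants of RF and XT drop out for a binary task, which leaves the 13 models named in the log.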

predictr.leaderboard()

                  model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.849162       0.043977  12.163954                0.000984           0.653803            2       True         14
1        NeuralNetTorch   0.843575       0.032010  10.867509                0.032010          10.867509            1       True         12
2              LightGBM   0.837989       0.010982   0.642641                0.010982           0.642641            1       True          4
3         LightGBMLarge   0.832402       0.006009   0.821788                0.006009           0.821788            1       True         13
4              CatBoost   0.832402       0.006051   7.597862                0.006051           7.597862            1       True          7
5               XGBoost   0.826816       0.013022   0.450137                0.013022           0.450137            1       True         11
6       NeuralNetFastAI   0.826816       0.017003   1.949074                0.017003           1.949074            1       True         10
7            LightGBMXT   0.821229       0.006997   0.471555                0.006997           0.471555            1       True          3
8      RandomForestEntr   0.810056       0.063482   0.526611                0.063482           0.526611            1       True          6
9      RandomForestGini   0.804469       0.061717   0.544051                0.061717           0.544051            1       True          5
10       ExtraTreesEntr   0.804469       0.064033   0.519959                0.064033           0.519959            1       True          9
11       ExtraTreesGini   0.798883       0.064803   0.533057                0.064803           0.533057            1       True          8
12       KNeighborsUnif   0.653631       0.006997   0.011003                0.006997           0.011003            1       True          1
13       KNeighborsDist   0.648045       0.031997   0.016009                0.031997           0.016009            1       True          2
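The `score_val` values in the leaderboard are accuracies on the 179 held-out rows, so each one is a whole number of correct predictions divided by 179. A small sketch that recovers those counts from the logged numbers. The rounding-up of the validation size is an assumption that happens to reproduce the logged 712/179 split.

```python
import math

n_rows = 891        # "Train Data Rows" in the log
holdout_frac = 0.2  # "holdout_frac=0.2" in the log

# Assumption: the validation size is rounded up.
n_val = math.ceil(n_rows * holdout_frac)
n_train = n_rows - n_val
print(n_train, n_val)  # 712 179

# A few score_val entries copied from the leaderboard.
scores = {
    "WeightedEnsemble_L2": 0.849162,
    "NeuralNetTorch": 0.843575,
    "LightGBM": 0.837989,
    "KNeighborsDist": 0.648045,
}

# Number of validation rows each model classified correctly.
correct = {name: round(score * n_val) for name, score in scores.items()}
for name, k in correct.items():
    print(f"{name}: {k}/{n_val} = {k / n_val:.6f}")
```

One validation row is worth about 0.0056 accuracy here, so the top three models differ by only one or two passengers.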

4) Prediction (predict)

(_tr.Survived == predictr.predict(_tr)).mean()

0.9472502805836139
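The expression above compares labels with predictions element by element and averages the matches, which is accuracy. A plain-Python sketch of the same computation on hypothetical toy labels (`accuracy` is an illustrative helper, and pandas' `.mean()` on a boolean Series gives the same fraction):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    matches = [t == p for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


# Hypothetical toy labels and predictions, for illustration only.
y_true = [0, 1, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 0]
toy_acc = accuracy(y_true, y_pred)
print(toy_acc)  # 5 of 6 match

# The value printed above corresponds to a whole number of the 891 rows.
n_rows = 891
n_correct = round(0.9472502805836139 * n_rows)
print(n_correct, n_rows - n_correct)  # 844 47
```

Note that this figure is measured on the same table that was passed to `fit`, so it is expected to sit above the 0.8492 validation score reported in the fit log.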

predictr.predict(_tr)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

_tst.assign(Survived = predictr.predict(_tst)).loc[:, ['PassengerId', 'Survived']]\
.to_csv('autogluon(Fsize,Drop)_submission.csv', index = False)
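Before uploading, it can help to check that the file has exactly the two columns written above and only 0/1 labels. A minimal sketch using the standard `csv` module; `check_submission` is a hypothetical helper, and the toy CSV text stands in for the real file.

```python
import csv
import io


def check_submission(text, expected_rows):
    """Validate the header, row count and label values of a submission CSV."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    for row in rows:
        if row["Survived"] not in {"0", "1"}:
            raise ValueError(f"bad label for PassengerId {row['PassengerId']}")
    return len(rows)


# Toy content for illustration; the IDs and labels are made up.
toy_csv = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
n_checked = check_submission(toy_csv, expected_rows=3)
print(n_checked)  # 3
```

For the real file, the text would come from reading the saved CSV, and `expected_rows` would be 418, the row count reported when `test.csv` is loaded in this post.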

This is the highest score so far!

It can be seen as the result of easing the multicollinearity problem… mm-hmm.

No, that's not enough. Improve it further!!!

### C. `best_quality`

1) data

tr = TabularDataset("./data/train.csv")
tst = TabularDataset("./data/test.csv")

Loaded data from: ./data/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
Loaded data from: ./data/test.csv | Columns = 11 / 11 | Rows = 418 -> 418

2) Create the predictor

predictr = TabularPredictor("Survived")

No path specified. Models will be saved in: "AutogluonModels\ag-20231017_132948"

3) Fit

Whatever resources it takes, I'll provide them all, so give me the best quality you can!!

predictr.fit(tr, presets = 'best_quality')

Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_132948"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
Disk Space Avail:   57.53 GB / 255.01 GB (22.6%)
Train Data Rows:    891
Train Data Columns: 11
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [0, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    1996.57 MB
    Train Data (Original)  Memory Usage: 0.31 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['Name']
            CountVectorizer fit with vocabulary size = 8
        Warning: Due to memory constraints, ngram feature count is being reduced. Allocate more memory to maximize model quality.
        Reducing Vectorizer vocab size from 8 to 4 to avoid OOM error
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])        : 2 | ['Age', 'Fare']
        ('int', [])          : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
        ('object', [])       : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
        ('object', ['text']) : 1 | ['Name']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 3 | ['Ticket', 'Cabin', 'Embarked']
        ('float', [])                       : 2 | ['Age', 'Fare']
        ('int', [])                         : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
        ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
        ('int', ['bool'])                   : 1 | ['Sex']
        ('int', ['text_ngram'])             : 5 | ['__nlp__.miss', '__nlp__.mr', '__nlp__.mrs', '__nlp__.william', '__nlp__._total_']
    0.4s = Fit runtime
    11 features in original data used to generate 24 features in processed data.
    Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.39s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ...
    0.6296   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ...
    0.6352   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ...
Will use sequential fold fitting strategy because import of ray failed. Reason: ray is required to train folds in parallel for TabularPredictor or HPO for MultiModalPredictor. A quick tip is to install via `pip install ray==2.6.3`
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
    0.835    = Validation score   (accuracy)
    3.82s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: LightGBM_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
    0.8373   = Validation score   (accuracy)
    5.36s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ...
    0.8339   = Validation score   (accuracy)
    0.55s    = Training   runtime
    0.1s     = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ...
    0.8305   = Validation score   (accuracy)
    0.54s    = Training   runtime
    0.1s     = Validation runtime
Fitting model: CatBoost_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
    0.8552   = Validation score   (accuracy)
    72.17s   = Training   runtime
    0.04s    = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ...
    0.8238   = Validation score   (accuracy)
    0.51s    = Training   runtime
    0.11s    = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ...
    0.8316   = Validation score   (accuracy)
    0.49s    = Training   runtime
    0.1s     = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
No improvement since epoch 7: early stopping
No improvement since epoch 6: early stopping
No improvement since epoch 7: early stopping
    0.853    = Validation score   (accuracy)
    20.42s   = Training   runtime
    0.13s    = Validation runtime
Fitting model: XGBoost_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
    0.8373   = Validation score   (accuracy)
    3.6s     = Training   runtime
    0.06s    = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
    0.8462   = Validation score   (accuracy)
    68.5s    = Training   runtime
    0.19s    = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
    0.8429   = Validation score   (accuracy)
    8.68s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8552   = Validation score   (accuracy)
    0.84s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 188.35s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_132948")

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20d90435000>

In exchange, it takes considerably longer: 188.35s of total runtime, versus 26.66s for the previous fit…
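Much of the extra time comes from bagging: the log shows `num_bag_folds=8` and "Fitting 8 child models (S1F1 - S1F8)" for several model types. A rough sketch of an 8-fold split over the 891 rows, assuming plain contiguous folds for simplicity (the real splitter may shuffle or stratify; `kfold_indices` is a hypothetical helper).

```python
def kfold_indices(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


n_rows, n_folds = 891, 8
folds = kfold_indices(n_rows, n_folds)

val_sizes = [len(fold) for fold in folds]
train_sizes = [n_rows - size for size in val_sizes]
print(val_sizes)    # [112, 112, 112, 111, 111, 111, 111, 111]
print(train_sizes)  # [779, 779, 779, 780, 780, 780, 780, 780]

# Every row is held out exactly once, so out-of-fold scores cover all 891 rows.
held_out = sorted(i for fold in folds for i in fold)
covers_all_rows = held_out == list(range(n_rows))

# The 0.8552 logged for CatBoost_BAG_L1 is then a fraction of 891 rows.
n_correct = round(0.8552 * n_rows)
print(n_correct)  # 762
```

Each child model trains on about seven eighths of the data and is scored on the remaining eighth, so a bagged model costs roughly eight fits instead of one.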

- Check the leaderboard

predictr.leaderboard()

                      model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0           CatBoost_BAG_L1   0.855219       0.036927  72.167391                0.036927          72.167391            1       True          7
1       WeightedEnsemble_L2   0.855219       0.038929  73.009209                0.002002           0.841818            2       True         14
2    NeuralNetFastAI_BAG_L1   0.852974       0.130997  20.415231                0.130997          20.415231            1       True         10
3     NeuralNetTorch_BAG_L1   0.846240       0.194014  68.497755                0.194014          68.497755            1       True         12
4      LightGBMLarge_BAG_L1   0.842873       0.056998   8.680638                0.056998           8.680638            1       True         13
5            XGBoost_BAG_L1   0.837262       0.055978   3.598592                0.055978           3.598592            1       True         11
6           LightGBM_BAG_L1   0.837262       0.061885   5.357185                0.061885           5.357185            1       True          4
7         LightGBMXT_BAG_L1   0.835017       0.049997   3.816595                0.049997           3.816595            1       True          3
8   RandomForestGini_BAG_L1   0.833895       0.096996   0.553528                0.096996           0.553528            1       True          5
9     ExtraTreesEntr_BAG_L1   0.831650       0.095051   0.494969                0.095051           0.494969            1       True          9
10  RandomForestEntr_BAG_L1   0.830527       0.101071   0.535026                0.101071           0.535026            1       True          6
11    ExtraTreesGini_BAG_L1   0.823793       0.111044   0.513969                0.111044           0.513969            1       True          8
12    KNeighborsDist_BAG_L1   0.635241       0.004996   0.006006                0.004996           0.006006            1       True          2
13    KNeighborsUnif_BAG_L1   0.629630       0.015998   0.005992                0.015998           0.005992            1       True          1
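The leaderboard has two timing columns per model: `fit_time` and `fit_time_marginal` (and likewise for `pred_time_val`). A small arithmetic sketch, using values copied from the table, of how the two relate for the ensemble. Reading the difference as "time spent in the base models the ensemble depends on" is an interpretation, not something the log states.

```python
# Values copied from the leaderboard above.
rows = {
    "CatBoost_BAG_L1": {
        "fit_time": 72.167391,
        "fit_time_marginal": 72.167391,
        "pred_time_val": 0.036927,
        "pred_time_val_marginal": 0.036927,
    },
    "WeightedEnsemble_L2": {
        "fit_time": 73.009209,
        "fit_time_marginal": 0.841818,
        "pred_time_val": 0.038929,
        "pred_time_val_marginal": 0.002002,
    },
}

ensemble = rows["WeightedEnsemble_L2"]
catboost = rows["CatBoost_BAG_L1"]

# Total time minus the ensemble's own (marginal) time.
fit_gap = ensemble["fit_time"] - ensemble["fit_time_marginal"]
pred_gap = ensemble["pred_time_val"] - ensemble["pred_time_val_marginal"]

print(round(fit_gap, 6), catboost["fit_time"])
print(round(pred_gap, 6), catboost["pred_time_val"])
```

Both gaps match the CatBoost row, which is consistent with the ensemble leaning on `CatBoost_BAG_L1` alone and would explain why the two share the same `score_val` of 0.855219.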

4) Prediction (predict)

(tr.Survived == predictr.predict(tr)).mean()

0.9158249158249159

tst[['PassengerId']].assign(Survived = predictr.predict(tst))\
.to_csv("autogluon(best_quality)_submission.csv",index=False)

But the result is unmistakable: a score of no less than 0.813…


[1] When I first ran this exercise, this one definitely had the highest score…
