Plotly : pandas backend

plotly

Author

강신성

Published

2023-11-13

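Let's use yfinance and plotly to download data and visualize it!

This post is adapted from lecture material by Professor 최규빈 of the Department of Statistics at Jeonbuk National University.

1. Library imports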

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
import plotly.io as pio

pd.options.plotting.backend = 'plotly'
pio.templates.default = 'plotly_white'
print(pio.templates)

Templates configuration
-----------------------
    Default template: 'plotly_white'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']

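These lines change the default output options. The default plotting backend for pandas is matplotlib, so we switch it to plotly and set the template to the white theme.

- With this in place, there is no need to pass backend = 'plotly' on every plot call.

2. Visualizing stock data with `yfinance`

A. Downloading + tidying the data

Fetching data by ticker symbol (look the symbols up on Yahoo Finance) :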

Apple : AAPL

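Samsung Electronics : 005930.KS

How can we use these ticker symbols to download the data we are interested in?

symbols = ['AMZN','AAPL','GOOG','MSFT','NFLX','NVDA','TSLA']    ## tickers of interest
start = '2020-01-01'    ## start of the date range
end = '2023-11-06'    ## end of the date range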
df = yf.download(symbols,start,end)

[*********************100%%**********************]  7 of 7 completed

df

	Adj Close							Close			...	Open			Volume
	AAPL	AMZN	GOOG	MSFT	NFLX	NVDA	TSLA	AAPL	AMZN	GOOG	...	NFLX	NVDA	TSLA	AAPL	AMZN	GOOG	MSFT	NFLX	NVDA	TSLA
Date
2020-01-02	73.152657	94.900497	68.368500	155.093674	329.809998	59.749290	28.684000	75.087502	94.900497	68.368500	...	326.100006	59.687500	28.299999	135480400	80580000	28132000	22622100	4485800	23753600	142981500
2020-01-03	72.441460	93.748497	68.032997	153.162476	325.899994	58.792957	29.534000	74.357498	93.748497	68.032997	...	326.779999	58.775002	29.366667	146322800	75288000	23728000	21116200	3806900	20538400	266677500
2020-01-06	73.018692	95.143997	69.710503	153.558395	335.829987	59.039509	30.102667	74.949997	95.143997	69.710503	...	323.119995	58.080002	29.364668	118387200	81236000	34646000	20813700	5663100	26263600	151995000
2020-01-07	72.675285	95.343002	69.667000	152.158249	330.750000	59.754276	31.270666	74.597504	95.343002	69.667000	...	336.470001	59.549999	30.760000	108872000	80898000	30054000	21634100	4703200	31485600	268231500
2020-01-08	73.844353	94.598503	70.216003	154.581940	339.260010	59.866344	32.809334	75.797501	94.598503	70.216003	...	331.489990	59.939999	31.580000	132079200	70160000	30560000	27746500	7104500	27710800	467164500
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2023-10-30	170.065933	132.710007	125.750000	337.309998	410.079987	411.609985	197.360001	170.289993	132.710007	125.750000	...	402.350006	410.869995	209.279999	51131000	72485500	24165600	22828100	5317100	38802800	136448200
2023-10-31	170.545319	133.089996	125.300003	338.109985	411.690002	407.799988	200.839996	170.770004	133.089996	125.300003	...	409.239990	404.500000	196.119995	44846000	51589400	21123400	20265300	3877600	51796900	118068300
2023-11-01	173.741104	137.000000	127.570000	346.070007	420.190002	423.250000	205.660004	173.970001	137.000000	127.570000	...	414.769989	408.839996	204.039993	56934900	61529400	26536600	28158800	4806100	43759300	121661700
2023-11-02	177.336380	138.070007	128.580002	348.320007	424.709991	435.059998	218.509995	177.570007	138.070007	128.580002	...	421.170013	433.279999	212.970001	77334800	52236700	24091700	24348100	4476000	40917200	125987600
2023-11-03	176.417572	138.600006	130.369995	352.799988	432.359985	450.049988	219.960007	176.649994	138.600006	130.369995	...	428.760010	440.200012	221.149994	79763700	44007200	19517900	23624000	3664800	42385500	119281000

968 rows × 42 columns

The result comes back as wide data.

- So let's pull out just the information we need (the adjusted closing price).

df.stack().loc[:, 'Adj Close'].reset_index()

	Date	level_1	Adj Close
0	2020-01-02	AAPL	73.152657
1	2020-01-02	AMZN	94.900497
2	2020-01-02	GOOG	68.368500
3	2020-01-02	MSFT	155.093674
4	2020-01-02	NFLX	329.809998
...	...	...	...
6771	2023-11-03	GOOG	130.369995
6772	2023-11-03	MSFT	352.799988
6773	2023-11-03	NFLX	432.359985
6774	2023-11-03	NVDA	450.049988
6775	2023-11-03	TSLA	219.960007

6776 rows × 3 columns

- Another approach (using the MultiIndex)

df.columns

MultiIndex([('Adj Close', 'AAPL'),
            ('Adj Close', 'AMZN'),
            ('Adj Close', 'GOOG'),
            ('Adj Close', 'MSFT'),
            ('Adj Close', 'NFLX'),
            ('Adj Close', 'NVDA'),
            ('Adj Close', 'TSLA'),
            (    'Close', 'AAPL'),
            (    'Close', 'AMZN'),
            (    'Close', 'GOOG'),
            (    'Close', 'MSFT'),
            (    'Close', 'NFLX'),
            (    'Close', 'NVDA'),
            (    'Close', 'TSLA'),
            (     'High', 'AAPL'),
            (     'High', 'AMZN'),
            (     'High', 'GOOG'),
            (     'High', 'MSFT'),
            (     'High', 'NFLX'),
            (     'High', 'NVDA'),
            (     'High', 'TSLA'),
            (      'Low', 'AAPL'),
            (      'Low', 'AMZN'),
            (      'Low', 'GOOG'),
            (      'Low', 'MSFT'),
            (      'Low', 'NFLX'),
            (      'Low', 'NVDA'),
            (      'Low', 'TSLA'),
            (     'Open', 'AAPL'),
            (     'Open', 'AMZN'),
            (     'Open', 'GOOG'),
            (     'Open', 'MSFT'),
            (     'Open', 'NFLX'),
            (     'Open', 'NVDA'),
            (     'Open', 'TSLA'),
            (   'Volume', 'AAPL'),
            (   'Volume', 'AMZN'),
            (   'Volume', 'GOOG'),
            (   'Volume', 'MSFT'),
            (   'Volume', 'NFLX'),
            (   'Volume', 'NVDA'),
            (   'Volume', 'TSLA')],
           )

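When the columns form a MultiIndex like this, a label from the first (outer) level alone is enough to select columns. For example…

df.loc[:, 'Adj Close']  ## this works
#df.loc[:, 'AAPL']  ## this does not work ('AAPL' is a second-level label, so it raises a KeyError)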

	AAPL	AMZN	GOOG	MSFT	NFLX	NVDA	TSLA
Date
2020-01-02	73.152657	94.900497	68.368500	155.093674	329.809998	59.749290	28.684000
2020-01-03	72.441460	93.748497	68.032997	153.162476	325.899994	58.792957	29.534000
2020-01-06	73.018692	95.143997	69.710503	153.558395	335.829987	59.039509	30.102667
2020-01-07	72.675285	95.343002	69.667000	152.158249	330.750000	59.754276	31.270666
2020-01-08	73.844353	94.598503	70.216003	154.581940	339.260010	59.866344	32.809334
...	...	...	...	...	...	...	...
2023-10-30	170.065933	132.710007	125.750000	337.309998	410.079987	411.609985	197.360001
2023-10-31	170.545319	133.089996	125.300003	338.109985	411.690002	407.799988	200.839996
2023-11-01	173.741104	137.000000	127.570000	346.070007	420.190002	423.250000	205.660004
2023-11-02	177.336380	138.070007	128.580002	348.320007	424.709991	435.059998	218.509995
2023-11-03	176.417572	138.600006	130.369995	352.799988	432.359985	450.049988	219.960007

968 rows × 7 columns

- If you really insist on extracting only AAPL in the same way…

df.stack().stack().swaplevel(i = 1, j = 2)  ## swaps the order of the MultiIndex levels

Date                       
2020-01-02  Adj Close  AAPL    7.315266e+01
            Close      AAPL    7.508750e+01
            High       AAPL    7.515000e+01
            Low        AAPL    7.379750e+01
            Open       AAPL    7.406000e+01
                                   ...     
2023-11-03  Close      TSLA    2.199600e+02
            High       TSLA    2.263700e+02
            Low        TSLA    2.184000e+02
            Open       TSLA    2.211500e+02
            Volume     TSLA    1.192810e+08
Length: 40656, dtype: float64

df.stack().stack().swaplevel(i = 1, j = 2).unstack().unstack().loc[:, 'AAPL']

	Adj Close	Close	High	Low	Open	Volume
Date
2020-01-02	73.152657	75.087502	75.150002	73.797501	74.059998	135480400.0
2020-01-03	72.441460	74.357498	75.144997	74.125000	74.287498	146322800.0
2020-01-06	73.018692	74.949997	74.989998	73.187500	73.447502	118387200.0
2020-01-07	72.675285	74.597504	75.224998	74.370003	74.959999	108872000.0
2020-01-08	73.844353	75.797501	76.110001	74.290001	74.290001	132079200.0
...	...	...	...	...	...	...
2023-10-30	170.065933	170.289993	171.169998	168.869995	169.020004	51131000.0
2023-10-31	170.545319	170.770004	170.899994	167.899994	169.350006	44846000.0
2023-11-01	173.741104	173.970001	174.229996	170.119995	171.000000	56934900.0
2023-11-02	177.336380	177.570007	177.779999	175.460007	175.520004	77334800.0
2023-11-03	176.417572	176.649994	176.820007	173.350006	174.240005	79763700.0

968 rows × 6 columns

This gets the job done.

df.swaplevel(i = 0, j = 1, axis = 1).loc[:, 'AAPL']

	Adj Close	Close	High	Low	Open	Volume
Date
2020-01-02	73.152657	75.087502	75.150002	73.797501	74.059998	135480400
2020-01-03	72.441460	74.357498	75.144997	74.125000	74.287498	146322800
2020-01-06	73.018692	74.949997	74.989998	73.187500	73.447502	118387200
2020-01-07	72.675285	74.597504	75.224998	74.370003	74.959999	108872000
2020-01-08	73.844353	75.797501	76.110001	74.290001	74.290001	132079200
...	...	...	...	...	...	...
2023-10-30	170.065933	170.289993	171.169998	168.869995	169.020004	51131000
2023-10-31	170.545319	170.770004	170.899994	167.899994	169.350006	44846000
2023-11-01	173.741104	173.970001	174.229996	170.119995	171.000000	56934900
2023-11-02	177.336380	177.570007	177.779999	175.460007	175.520004	77334800
2023-11-03	176.417572	176.649994	176.820007	173.350006	174.240005	79763700

968 rows × 6 columns

This gives the same result. (The column index is no longer sorted, but the selected values are identical.)
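As a side note, pandas also has `DataFrame.xs`, which selects on an inner column level without any reshaping. Below is a minimal sketch on a small made-up frame laid out like the `yf.download` result; the name `toy` and all of the numbers are illustrative assumptions, not downloaded data.

```python
import pandas as pd

# Toy frame with the same (field, ticker) column layout as the downloaded data.
# The numbers are made up for illustration.
columns = pd.MultiIndex.from_product([["Adj Close", "Volume"], ["AAPL", "TSLA"]])
toy = pd.DataFrame(
    [[73.1, 28.6, 100, 200], [72.4, 29.5, 110, 210]],
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]).rename("Date"),
    columns=columns,
)

# Cross-section on the second column level (level=1): every field for one ticker.
aapl = toy.xs("AAPL", axis=1, level=1)
print(aapl)
```

The selected level is dropped from the result, so `aapl` has plain columns (`Adj Close`, `Volume`), much like the `swaplevel` version above.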

- Honestly, the approach below is the most comfortable.

df.stack().reset_index().rename({'level_1' : 'Subject'}, axis = 1)  ## straight to tidy data

	Date	Subject	Adj Close	Close	High	Low	Open	Volume
0	2020-01-02	AAPL	73.152657	75.087502	75.150002	73.797501	74.059998	135480400
1	2020-01-02	AMZN	94.900497	94.900497	94.900497	93.207497	93.750000	80580000
2	2020-01-02	GOOG	68.368500	68.368500	68.406998	67.077499	67.077499	28132000
3	2020-01-02	MSFT	155.093674	160.619995	160.729996	158.330002	158.779999	22622100
4	2020-01-02	NFLX	329.809998	329.809998	329.980011	324.779999	326.100006	4485800
...	...	...	...	...	...	...	...	...
6771	2023-11-03	GOOG	130.369995	130.369995	130.729996	129.009995	129.089996	19517900
6772	2023-11-03	MSFT	352.799988	352.799988	354.390015	347.329987	349.630005	23624000
6773	2023-11-03	NFLX	432.359985	432.359985	434.820007	425.529999	428.760010	3664800
6774	2023-11-03	NVDA	450.049988	450.049988	453.089996	437.230011	440.200012	42385500
6775	2023-11-03	TSLA	219.960007	219.960007	226.369995	218.399994	221.149994	119281000

6776 rows × 8 columns

B. Visualization

- Build the tidy data

df.loc[:, 'Adj Close'].reset_index().set_index('Date').stack().reset_index().rename({'level_1' : 'Company', 0 : 'Price'}, axis = 1)

	Date	Company	Price
0	2020-01-02	AAPL	73.152657
1	2020-01-02	AMZN	94.900497
2	2020-01-02	GOOG	68.368500
3	2020-01-02	MSFT	155.093674
4	2020-01-02	NFLX	329.809998
...	...	...	...
6771	2023-11-03	GOOG	130.369995
6772	2023-11-03	MSFT	352.799988
6773	2023-11-03	NFLX	432.359985
6774	2023-11-03	NVDA	450.049988
6775	2023-11-03	TSLA	219.960007

6776 rows × 3 columns
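The same long format can also be reached with `melt`, which names the new columns directly. This is only a sketch on a made-up wide table (the frame `wide` and its prices are assumptions for illustration); note that `melt` orders rows by company rather than by date.

```python
import pandas as pd

# Made-up wide table: one row per date, one column per company.
wide = pd.DataFrame(
    {"AAPL": [73.1, 72.4], "TSLA": [28.6, 29.5]},
    index=pd.Index(["2020-01-02", "2020-01-03"], name="Date"),
)

# melt names the new columns directly, so no rename of 'level_1' or 0 is needed.
tidy = wide.reset_index().melt(id_vars="Date", var_name="Company", value_name="Price")
print(tidy)
```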

- Visualization : using the DataFrame's own plot method

df.loc[:, 'Adj Close'].reset_index().set_index('Date').stack().reset_index().rename({'level_1' : 'Company', 0 : 'Price'}, axis = 1)\
.plot.line(x = 'Date', y = 'Price', color = 'Company', backend = 'plotly')  ## line is the default kind for plot() anyway

Pretty slick, right?

It is quite a powerful tool, with a lot of built-in support on the visual side.

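3. Visualizing birth rate data

A. Scraping + tidying the data

- South Korea's low birth rate problem

ref : https://ko.wikipedia.org/wiki/대한민국의_저출산

- How can we read only the 3rd and 5th tables from the URL above?

3rd table : total fertility rate by region
5th table : number of births by region

1 Scrape the data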

df_lst = pd.read_html('https://ko.wikipedia.org/wiki/%EB%8C%80%ED%95%9C%EB%AF%BC%EA%B5%AD%EC%9D%98_%EC%A0%80%EC%B6%9C%EC%82%B0')

Yes, it really is that simple. pandas can pull the tables out of an HTML page on its own.

len(df_lst)  ## 22 tables in total

df_lst[2]  ## total fertility rate by region

	지역/연도[6]	2005	2006[7]	2007	2008[8]	2009[9]	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
0	서울	0.92	0.97	1.06	1.01	0.96	1.02	1.01	1.06	0.97	0.98	1.00	0.94	0.84	0.76	0.72	0.64	0.63
1	부산	0.88	0.91	1.02	0.98	0.94	1.05	1.08	1.14	1.05	1.09	1.14	1.10	0.98	0.90	0.83	0.75	0.73
2	대구	0.99	1.00	1.13	1.07	1.03	1.11	1.15	1.22	1.13	1.17	1.22	1.19	1.07	0.99	0.93	0.81	0.78
3	인천	1.07	1.11	1.25	1.19	1.14	1.21	1.23	1.30	1.20	1.21	1.22	1.14	1.01	1.01	0.94	0.83	0.78
4	광주	1.10	1.14	1.26	1.20	1.14	1.22	1.23	1.30	1.17	1.20	1.21	1.17	1.05	0.97	0.91	0.81	0.90
5	대전	1.10	1.15	1.27	1.22	1.16	1.21	1.26	1.32	1.23	1.25	1.28	1.19	1.08	0.95	0.88	0.81	0.81
6	울산	1.18	1.24	1.40	1.34	1.31	1.37	1.39	1.48	1.39	1.44	1.49	1.42	1.26	1.13	1.08	0.99	0.94
7	세종	-	-	-	-	-	-	-	1.60	1.44	1.35	1.89	1.82	1.67	1.57	1.47	1.28	1.28
8	경기	1.17	1.23	1.35	1.29	1.23	1.31	1.31	1.36	1.23	1.24	1.27	1.19	1.07	1.00	0.94	0.88	0.85
9	강원	1.18	1.19	1.35	1.25	1.25	1.31	1.34	1.37	1.25	1.25	1.31	1.24	1.12	1.07	1.08	1.04	0.98
10	충북	1.19	1.22	1.39	1.32	1.32	1.40	1.43	1.49	1.37	1.36	1.41	1.36	1.24	1.17	1.05	0.98	0.95
11	충남	1.26	1.35	1.50	1.44	1.41	1.48	1.50	1.57	1.44	1.42	1.48	1.40	1.28	1.19	1.11	1.03	0.96
12	전북	1.17	1.20	1.37	1.31	1.28	1.37	1.41	1.44	1.32	1.33	1.35	1.25	1.15	1.04	0.97	0.91	0.85
13	전남	1.28	1.33	1.53	1.45	1.45	1.54	1.57	1.64	1.52	1.50	1.55	1.47	1.33	1.24	1.23	1.15	1.02
14	경북	1.17	1.20	1.36	1.31	1.27	1.38	1.43	1.49	1.38	1.41	1.46	1.40	1.26	1.17	1.09	1.00	0.97
15	경남	1.18	1.25	1.43	1.37	1.32	1.41	1.45	1.50	1.37	1.41	1.44	1.36	1.23	1.12	1.05	0.95	0.90
16	제주	1.30	1.36	1.48	1.39	1.38	1.46	1.49	1.60	1.43	1.48	1.48	1.43	1.31	1.22	1.15	1.02	0.95
17	전국	1.08	1.13	1.25	1.19	1.15	1.23	1.24	1.30	1.19	1.21	1.24	1.17	1.05	0.98	0.92	0.84	0.81

df_lst[4]  ## number of births by region

	지역/연도[6]	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
0	서울	93266	91526	93914.000	84066.000	83711.000	83005	75.536	65389	58074	53.673	47400	45531
1	부산	27415	27759	28673.000	25831.000	26190.000	26645	24906.000	21480	19152	17049.000	15100	14446
2	대구	20557	20758	21472.000	19340.000	19361.000	19438	18298.000	15946	14400	13233.000	11200	10661
3	인천	25752	20758	21472.000	25560.000	25786.000	25491	23609.000	20445	20087	18522.000	16000	14947
4	광주	13979	13916	14392.000	12729.000	12729.000	12441	11580.000	10120	9105	8364.000	7300	7956
5	대전	14314	14808	15279.000	14099.000	13962.000	13774	12436.000	10851	9337	8410.000	7500	7414
6	울산	11432	11542	12160.000	11330.000	11556.000	11732	10910.000	9381	8149	7539.000	6600	6127
7	세종	-	-	1054.000	1111.000	1344.000	2708	3297.000	3504	3703	3819.000	3500	3570
8	경기	121753	122027	124746.000	112129.000	112.169	113495	105643.000	94088	83198	83.198	77800	76139
9	강원	12477	12408	12426.000	10980.000	10662.000	10929	10058.000	9958	8351	8283.000	7800	7357
10	충북	14670	14804	15139.000	13658.000	13366.000	13563	12742.000	11394	10586	9333.000	8600	8190
11	충남	20.242	20.398	20.448	18.628	18200.000	18604	17302.000	15670	14380	13228.000	11900	10984
12	전북	16100	16175	16238.000	14555.000	14231.000	14087	12698.000	11348	10001	8971.000	8200	7745
13	전남	16654	16612	16990.000	15401.000	14817.000	15061	13980.000	12354	11238	10832.000	9700	8430
14	경북	23700	24250	24635.000	22206.000	22062.000	22310	20616.000	17957	16079	14472.000	12900	12045
15	경남	32203	32536	33211.000	29504.000	29763.000	29537	27138.000	23849	21224	19250.000	16800	15562
16	제주	5657	5628	5992.000	5328.000	5526.000	5600	5494.000	5037	4781	4500.000	4000	3728
17	전국	470171	471265	484550.000	436455.000	435435.000	438420	406243.000	357771	326822	302676.000	272400	260562

Sejong (세종) was added as a region partway through, so its missing values are recorded as -.

2 Clean the data

df = df_lst[4]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   지역/연도[6]  18 non-null     object 
 1   2010      18 non-null     object 
 2   2011      18 non-null     object 
 3   2012      18 non-null     float64
 4   2013      18 non-null     float64
 5   2014      18 non-null     float64
 6   2015      18 non-null     int64  
 7   2016      18 non-null     float64
 8   2017      18 non-null     int64  
 9   2018      18 non-null     int64  
 10  2019      18 non-null     float64
 11  2020      18 non-null     int64  
 12  2021      18 non-null     int64  
dtypes: float64(5), int64(5), object(3)
memory usage: 2.0+ KB

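These are birth counts, so every column except the region should be int64, but columns 1-2 (2010 and 2011) are object and several others are float64.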

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').applymap(lambda x : float(x) if x != '-' else 0).reset_index()  ## huh? we're not renaming the columns to English???
## applymap has been renamed to map... but let's just keep using it here for now.

C:\Users\hollyriver\AppData\Local\Temp\ipykernel_144\615695314.py:1: FutureWarning:

DataFrame.applymap has been deprecated. Use DataFrame.map instead.

	지역	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
0	서울	93266.000	91526.000	93914.000	84066.000	83711.000	83005.0	75.536	65389.0	58074.0	53.673	47400.0	45531.0
1	부산	27415.000	27759.000	28673.000	25831.000	26190.000	26645.0	24906.000	21480.0	19152.0	17049.000	15100.0	14446.0
2	대구	20557.000	20758.000	21472.000	19340.000	19361.000	19438.0	18298.000	15946.0	14400.0	13233.000	11200.0	10661.0
3	인천	25752.000	20758.000	21472.000	25560.000	25786.000	25491.0	23609.000	20445.0	20087.0	18522.000	16000.0	14947.0
4	광주	13979.000	13916.000	14392.000	12729.000	12729.000	12441.0	11580.000	10120.0	9105.0	8364.000	7300.0	7956.0
5	대전	14314.000	14808.000	15279.000	14099.000	13962.000	13774.0	12436.000	10851.0	9337.0	8410.000	7500.0	7414.0
6	울산	11432.000	11542.000	12160.000	11330.000	11556.000	11732.0	10910.000	9381.0	8149.0	7539.000	6600.0	6127.0
7	세종	0.000	0.000	1054.000	1111.000	1344.000	2708.0	3297.000	3504.0	3703.0	3819.000	3500.0	3570.0
8	경기	121753.000	122027.000	124746.000	112129.000	112.169	113495.0	105643.000	94088.0	83198.0	83.198	77800.0	76139.0
9	강원	12477.000	12408.000	12426.000	10980.000	10662.000	10929.0	10058.000	9958.0	8351.0	8283.000	7800.0	7357.0
10	충북	14670.000	14804.000	15139.000	13658.000	13366.000	13563.0	12742.000	11394.0	10586.0	9333.000	8600.0	8190.0
11	충남	20.242	20.398	20.448	18.628	18200.000	18604.0	17302.000	15670.0	14380.0	13228.000	11900.0	10984.0
12	전북	16100.000	16175.000	16238.000	14555.000	14231.000	14087.0	12698.000	11348.0	10001.0	8971.000	8200.0	7745.0
13	전남	16654.000	16612.000	16990.000	15401.000	14817.000	15061.0	13980.000	12354.0	11238.0	10832.000	9700.0	8430.0
14	경북	23700.000	24250.000	24635.000	22206.000	22062.000	22310.0	20616.000	17957.0	16079.0	14472.000	12900.0	12045.0
15	경남	32203.000	32536.000	33211.000	29504.000	29763.000	29537.0	27138.000	23849.0	21224.0	19250.000	16800.0	15562.0
16	제주	5657.000	5628.000	5992.000	5328.000	5526.000	5600.0	5494.000	5037.0	4781.0	4500.000	4000.0	3728.0
17	전국	470171.000	471265.000	484550.000	436455.000	435435.000	438420.0	406243.000	357771.0	326822.0	302676.000	272400.0	260562.0

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').applymap(lambda x : float(x) if x != '-' else 0).reset_index().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   지역      18 non-null     object 
 1   2010    18 non-null     float64
 2   2011    18 non-null     float64
 3   2012    18 non-null     float64
 4   2013    18 non-null     float64
 5   2014    18 non-null     float64
 6   2015    18 non-null     float64
 7   2016    18 non-null     float64
 8   2017    18 non-null     float64
 9   2018    18 non-null     float64
 10  2019    18 non-null     float64
 11  2020    18 non-null     float64
 12  2021    18 non-null     float64
dtypes: float64(12), object(1)
memory usage: 2.0+ KB

Only 지역 is left as object, and everything else has been converted to float (float rather than int, because the lambda calls float()).
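An alternative worth considering is to turn the `-` placeholders into NaN instead of 0, so that Sejong's missing years are not read as "zero births". A minimal sketch with `pd.to_numeric`, on a small made-up excerpt (the frame `raw` is an assumption that only mimics the quirks of the scraped table):

```python
import pandas as pd

# Small made-up excerpt with the same quirks: '-' placeholders and numbers stored as strings.
raw = pd.DataFrame({
    "지역": ["서울", "세종"],
    "2010": ["93266", "-"],
    "2012": [93914.0, 1054.0],
})

# errors='coerce' turns anything unparseable (here '-') into NaN rather than raising.
clean = raw.set_index("지역").apply(pd.to_numeric, errors="coerce")
print(clean.dtypes)
print(clean)
```

Whether 0 or NaN is the better choice depends on the plot: sums treat both the same by default, but a line chart would draw a point at zero for 0 and leave a gap for NaN.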

B. Visualization 1 : total number of births nationwide

- The table already contains a nationwide (전국) total row, but for practice let's drop it and compute the total ourselves.

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').applymap(lambda x : float(x) if x != '-' else 0).reset_index()\
.drop(17, axis = 0)\
.set_index('지역').stack().reset_index().rename({'level_1' : '연도', 0 : '출생아 수'}, axis = 1)

	지역	연도	출생아 수
0	서울	2010	93266.0
1	서울	2011	91526.0
2	서울	2012	93914.0
3	서울	2013	84066.0
4	서울	2014	83711.0
...	...	...	...
199	제주	2017	5037.0
200	제주	2018	4781.0
201	제주	2019	4500.0
202	제주	2020	4000.0
203	제주	2021	3728.0

204 rows × 3 columns

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').applymap(lambda x : float(x) if x != '-' else 0).reset_index()\
.drop(17, axis = 0)\
.set_index('지역').stack().reset_index().rename({'level_1' : '연도', 0 : '출생아 수'}, axis = 1)\
.groupby(by = '연도').aggregate({'출생아 수' : 'sum'}).reset_index()

	연도	출생아 수
0	2010	449949.242
1	2011	445527.398
2	2012	457813.448
3	2013	417845.628
4	2014	323378.169
5	2015	438420.000
6	2016	330782.536
7	2017	358771.000
8	2018	321845.000
9	2019	165941.871
10	2020	272300.000
11	2021	260832.000

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').applymap(lambda x : float(x) if x != '-' else 0).reset_index()\
.drop(17, axis = 0)\
.set_index('지역').stack().reset_index().rename({'level_1' : '연도', 0 : '출생아 수'}, axis = 1)\
.groupby(by = '연도').aggregate({'출생아 수' : 'sum'}).reset_index()\
.plot.line(x = '연도', y = '출생아 수', backend = 'plotly')

## Korean text renders perfectly well in the chart labels, so feel free to use Korean column names!

There are years where the number of births jumps around sharply…
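One way to investigate jumps like these is to compare the sum over regions with the reported nationwide row, year by year. The sketch below uses a tiny made-up table (two regions plus a total, with a deliberate typo in 2019); the frame `toy` and the threshold of 1 are assumptions for illustration only.

```python
import pandas as pd

# Made-up counts: two regions plus a nationwide total row; 2019 contains a typo (53.673).
toy = pd.DataFrame(
    {"2018": [58074.0, 83198.0, 141272.0], "2019": [53.673, 77800.0, 131473.0]},
    index=pd.Index(["서울", "경기", "전국"], name="지역"),
)

regional_sum = toy.drop("전국").sum()
reported = toy.loc["전국"]

# Years where the regional sum disagrees with the reported total deserve a closer look.
gap = (reported - regional_sum).abs()
suspicious_years = gap[gap > 1].index.tolist()
print(suspicious_years)
```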

C. Visualization 2 : number of births by region (line)

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').map(lambda x : float(x) if x != '-' else 0).reset_index()\
.drop(17, axis = 0)\
.set_index('지역').stack().reset_index().rename({'level_1' : '연도', 0 : '출생아 수'}, axis = 1)\
.pivot_table(index = ['지역', '연도'], values = '출생아 수', aggfunc = 'sum').reset_index()  ## the previous line alone is already enough...

	지역	연도	출생아 수
0	강원	2010	12477.0
1	강원	2011	12408.0
2	강원	2012	12426.0
3	강원	2013	10980.0
4	강원	2014	10662.0
...	...	...	...
199	충북	2017	11394.0
200	충북	2018	10586.0
201	충북	2019	9333.0
202	충북	2020	8600.0
203	충북	2021	8190.0

204 rows × 3 columns

df.rename({'지역/연도[6]' : '지역'}, axis = 1).set_index('지역').map(lambda x : float(x) if x != '-' else 0).reset_index()\
.drop(17, axis = 0)\
.set_index('지역').stack().reset_index().rename({'level_1' : '연도', 0 : '출생아 수'}, axis = 1)\
.plot.line(x = '연도', y = '출생아 수', color = '지역', backend = 'plotly')

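Gyeonggi (경기) and Seoul (서울) drop sharply in particular years.

- The two line plots (total and by region) each have their own strengths and weaknesses. Is there a way to see both at once?

D. Visualization 3 : number of births by region (area)

plot.area()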

df.rename({'지역/연도[6]':'지역'},axis=1)\
.set_index(['지역']).map(lambda x: float(x) if not x=='-' else 0)\
.drop('전국',axis=0)\
.stack().reset_index().set_axis(['지역','연도','출생아수'],axis=1)\
.plot.area(x='연도',y='출생아수',color='지역',backend='plotly')   ## plot.area()

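This shows both how each region changed and how the stacked total changed.

- Which brings us back to what keeps nagging: Gyeonggi in 2014, Seoul in 2016, and both Seoul and Gyeonggi in 2019 appear to have almost no births. Why?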

df

	지역/연도[6]	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
0	서울	93266	91526	93914.000	84066.000	83711.000	83005	75.536	65389	58074	53.673	47400	45531
1	부산	27415	27759	28673.000	25831.000	26190.000	26645	24906.000	21480	19152	17049.000	15100	14446
2	대구	20557	20758	21472.000	19340.000	19361.000	19438	18298.000	15946	14400	13233.000	11200	10661
3	인천	25752	20758	21472.000	25560.000	25786.000	25491	23609.000	20445	20087	18522.000	16000	14947
4	광주	13979	13916	14392.000	12729.000	12729.000	12441	11580.000	10120	9105	8364.000	7300	7956
5	대전	14314	14808	15279.000	14099.000	13962.000	13774	12436.000	10851	9337	8410.000	7500	7414
6	울산	11432	11542	12160.000	11330.000	11556.000	11732	10910.000	9381	8149	7539.000	6600	6127
7	세종	-	-	1054.000	1111.000	1344.000	2708	3297.000	3504	3703	3819.000	3500	3570
8	경기	121753	122027	124746.000	112129.000	112.169	113495	105643.000	94088	83198	83.198	77800	76139
9	강원	12477	12408	12426.000	10980.000	10662.000	10929	10058.000	9958	8351	8283.000	7800	7357
10	충북	14670	14804	15139.000	13658.000	13366.000	13563	12742.000	11394	10586	9333.000	8600	8190
11	충남	20.242	20.398	20.448	18.628	18200.000	18604	17302.000	15670	14380	13228.000	11900	10984
12	전북	16100	16175	16238.000	14555.000	14231.000	14087	12698.000	11348	10001	8971.000	8200	7745
13	전남	16654	16612	16990.000	15401.000	14817.000	15061	13980.000	12354	11238	10832.000	9700	8430
14	경북	23700	24250	24635.000	22206.000	22062.000	22310	20616.000	17957	16079	14472.000	12900	12045
15	경남	32203	32536	33211.000	29504.000	29763.000	29537	27138.000	23849	21224	19250.000	16800	15562
16	제주	5657	5628	5992.000	5328.000	5526.000	5600	5494.000	5037	4781	4500.000	4000	3728
17	전국	470171	471265	484550.000	436455.000	435435.000	438420	406243.000	357771	326822	302676.000	272400	260562

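Gyeonggi 2014 is a head count, yet it has a decimal point… (Solomon's judgment???), and the same goes for Seoul 2016 and so on. The table was typed by hand, so it contains typos: a thousands separator was apparently entered as a decimal point (112.169 instead of 112169).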
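Rather than spotting these by eye, the suspicious cells can be flagged programmatically: a birth count should be a whole number, so anything with a fractional part stands out. A minimal sketch on an excerpt of the values shown above (the frame name `excerpt` is just for illustration):

```python
import pandas as pd

# Excerpt of values shown in the table above (서울 and 경기, years 2014, 2016, 2019).
excerpt = pd.DataFrame(
    {"2014": [83711.0, 112.169], "2016": [75.536, 105643.0], "2019": [53.673, 83.198]},
    index=pd.Index(["서울", "경기"], name="지역"),
)

# A birth count should be a whole number, so any fractional value is suspect.
cells = excerpt.stack()
suspect = cells[cells % 1 != 0]
print(suspect)
```

Run against the full table, the same check should also pick up 충남 for 2010 through 2013 (20.242, 20.398, 20.448, 18.628), which are easy to miss in the plots.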

E. Fixing the visualizations above

df.rename({'지역/연도[6]':'지역'},axis=1)\
.set_index(['지역'])\
.map(lambda x: float(x) if not x=='-' else 0)\
.map(lambda x : x if x%1 == 0 else x*1000)  ## whole numbers are kept as they are; values with a fractional part are multiplied by 1000

	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
지역
서울	93266.0	91526.0	93914.0	84066.0	83711.0	83005.0	75536.0	65389.0	58074.0	53673.0	47400.0	45531.0
부산	27415.0	27759.0	28673.0	25831.0	26190.0	26645.0	24906.0	21480.0	19152.0	17049.0	15100.0	14446.0
대구	20557.0	20758.0	21472.0	19340.0	19361.0	19438.0	18298.0	15946.0	14400.0	13233.0	11200.0	10661.0
인천	25752.0	20758.0	21472.0	25560.0	25786.0	25491.0	23609.0	20445.0	20087.0	18522.0	16000.0	14947.0
광주	13979.0	13916.0	14392.0	12729.0	12729.0	12441.0	11580.0	10120.0	9105.0	8364.0	7300.0	7956.0
대전	14314.0	14808.0	15279.0	14099.0	13962.0	13774.0	12436.0	10851.0	9337.0	8410.0	7500.0	7414.0
울산	11432.0	11542.0	12160.0	11330.0	11556.0	11732.0	10910.0	9381.0	8149.0	7539.0	6600.0	6127.0
세종	0.0	0.0	1054.0	1111.0	1344.0	2708.0	3297.0	3504.0	3703.0	3819.0	3500.0	3570.0
경기	121753.0	122027.0	124746.0	112129.0	112169.0	113495.0	105643.0	94088.0	83198.0	83198.0	77800.0	76139.0
강원	12477.0	12408.0	12426.0	10980.0	10662.0	10929.0	10058.0	9958.0	8351.0	8283.0	7800.0	7357.0
충북	14670.0	14804.0	15139.0	13658.0	13366.0	13563.0	12742.0	11394.0	10586.0	9333.0	8600.0	8190.0
충남	20242.0	20398.0	20448.0	18628.0	18200.0	18604.0	17302.0	15670.0	14380.0	13228.0	11900.0	10984.0
전북	16100.0	16175.0	16238.0	14555.0	14231.0	14087.0	12698.0	11348.0	10001.0	8971.0	8200.0	7745.0
전남	16654.0	16612.0	16990.0	15401.0	14817.0	15061.0	13980.0	12354.0	11238.0	10832.0	9700.0	8430.0
경북	23700.0	24250.0	24635.0	22206.0	22062.0	22310.0	20616.0	17957.0	16079.0	14472.0	12900.0	12045.0
경남	32203.0	32536.0	33211.0	29504.0	29763.0	29537.0	27138.0	23849.0	21224.0	19250.0	16800.0	15562.0
제주	5657.0	5628.0	5992.0	5328.0	5526.0	5600.0	5494.0	5037.0	4781.0	4500.0	4000.0	3728.0
전국	470171.0	471265.0	484550.0	436455.0	435435.0	438420.0	406243.0	357771.0	326822.0	302676.0	272400.0	260562.0
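One caveat with `x*1000`: floating-point multiplication is not guaranteed to land exactly on a whole number, so rounding after scaling is a cheap safeguard, and it also allows a clean conversion to integers. A small sketch using three values from the table (the Series name `births` is an assumption for illustration):

```python
import pandas as pd

# Values taken from the table above: two typos (75.536, 20.242) and one correct count.
births = pd.Series([75.536, 20.242, 93266.0])

# Scale only the fractional entries, round so the result is an exact whole number,
# then store the counts as integers.
fixed = births.where(births % 1 == 0, (births * 1000).round()).astype("int64")
print(fixed.tolist())
```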

## 전국 (the nationwide total row) is dropped in each plot so it is not counted on top of the regions
df.rename({'지역/연도[6]':'지역'},axis=1)\
.set_index(['지역'])\
.map(lambda x: float(x) if not x=='-' else 0)\
.map(lambda x : x if x%1 == 0 else x*1000)\
.drop('전국', axis = 0)\
.stack().reset_index().rename({'level_1' : '연도', 0 : '출생아 수'}, axis = 1)\
.groupby('연도').aggregate({'출생아 수' : 'sum'}).reset_index()\
.plot.line(x = '연도', y = '출생아 수', backend = 'plotly')

So there were no spikes after all!!! (Though the downward trend is real…)

df.rename({'지역/연도[6]':'지역'},axis=1)\
.set_index(['지역'])\
.map(lambda x: float(x) if not x=='-' else 0)\
.map(lambda x : x if x%1 == 0 else x*1000)\
.drop('전국', axis = 0)\
.stack().reset_index().set_axis(['지역','연도','출생아수'],axis=1)\
.plot.line(x = '연도', y = '출생아수', color = '지역', backend = 'plotly')

df.rename({'지역/연도[6]':'지역'},axis=1)\
.set_index(['지역'])\
.map(lambda x: float(x) if not x=='-' else 0)\
.map(lambda x : x if x%1 == 0 else x*1000)\
.drop('전국', axis = 0)\
.stack().reset_index().set_axis(['지역','연도','출생아수'],axis=1)\
.plot.area(x = '연도', y = '출생아수', color = '지역', backend = 'plotly')

Now the plots make sense.

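4. Various plots

The plotly backend for pandas supports plot kinds such as bar, line, scatter, hist, box, and area. (pie and the like are not available through this backend, although plotly itself can draw them, e.g. with plotly.express.pie.)

A. `.plot.bar()`

- Example 1 : pass rate by gender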

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1]).reset_index().melt(id_vars='index').set_axis(['department','gender','result','count'],axis=1)
df  ## the file has an index column, and its first two rows form a two-level column header (header=[0,1])

	department	gender	result	count
0	A	male	fail	314
1	B	male	fail	208
2	C	male	fail	204
3	D	male	fail	279
4	E	male	fail	137
5	F	male	fail	149
6	A	male	pass	511
7	B	male	pass	352
8	C	male	pass	121
9	D	male	pass	138
10	E	male	pass	54
11	F	male	pass	224
12	A	female	fail	19
13	B	female	fail	7
14	C	female	fail	391
15	D	female	fail	244
16	E	female	fail	299
17	F	female	fail	103
18	A	female	pass	89
19	B	female	pass	18
20	C	female	pass	202
21	D	female	pass	131
22	E	female	pass	94
23	F	female	pass	238

A familiar dataset. We are comparing groups, so a bar chart is a good fit.

df.pivot_table(index = 'gender', columns = 'result', values = 'count', aggfunc = 'sum')\
.assign(rate = lambda _df : _df['pass']/(_df.fail + _df['pass'])).reset_index()\
.assign(rate = lambda _df : _df.rate.apply(lambda x : round(x, 3)))\
.plot.bar(x = 'gender', y = 'rate', color = 'gender', text = 'rate', width = 600)

- Example 2 : pass rate by (gender, department)

df.pivot_table(index = ['gender', 'department'], columns = 'result', values = 'count', aggfunc = 'sum')\
.assign(rate = lambda _df : _df['pass']/(_df.fail + _df['pass'])).reset_index()\
.assign(rate = lambda _df : _df.rate.apply(lambda x : round(x, 2)))\
.plot.bar(x = 'gender', y = 'rate', color = 'gender', facet_col = 'department', text = 'rate', width = 800)
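The two bar charts tell different stories, which is the point of this dataset (the file is named Simpson.csv, after Simpson's paradox). The numbers behind the charts can be checked directly; the sketch below rebuilds the table from the counts printed above, and the helper name `pass_rate` is hypothetical, introduced only for this example.

```python
import pandas as pd

# Counts copied from the table above, in department order A..F.
departments = list("ABCDEF")
counts = {
    ("male", "fail"): [314, 208, 204, 279, 137, 149],
    ("male", "pass"): [511, 352, 121, 138, 54, 224],
    ("female", "fail"): [19, 7, 391, 244, 299, 103],
    ("female", "pass"): [89, 18, 202, 131, 94, 238],
}
rows = [
    {"department": d, "gender": g, "result": r, "count": c}
    for (g, r), values in counts.items()
    for d, c in zip(departments, values)
]
df = pd.DataFrame(rows)

def pass_rate(frame, keys):
    """Pass rate for each group defined by `keys` (hypothetical helper)."""
    table = frame.pivot_table(index=keys, columns="result", values="count", aggfunc="sum")
    return table["pass"] / (table["pass"] + table["fail"])

overall = pass_rate(df, "gender")
by_dept = pass_rate(df, ["department", "gender"]).unstack("gender")
print(overall.round(3))
print(by_dept.round(2))
```

Overall, men pass at a higher rate than women, yet within department A the female rate is clearly higher. Aggregating over departments hides the fact that the two groups applied to departments with very different pass rates.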

B. `.plot.line()`

- Example 1 : mobile phone sales

df = pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df

	Date	Samsung	Apple	Huawei	Xiaomi	Oppo	Mobicel	Motorola	LG	Others	Realme	Google	Nokia	Lenovo	OnePlus	Sony	Asus
0	2019-10	461	324	136	109	76	81	43	37	135	28	39	14	22	17	20	17
1	2019-11	461	358	167	141	86	61	29	36	141	27	29	20	23	10	19	27
2	2019-12	426	383	143	105	53	45	51	48	129	30	20	26	28	18	18	19
3	2020-01	677	494	212	187	110	79	65	49	158	23	13	19	19	22	27	22
4	2020-02	593	520	217	195	112	67	62	71	157	25	18	16	24	18	23	20
5	2020-03	637	537	246	187	92	66	59	67	145	21	16	24	18	31	22	14
6	2020-04	647	583	222	154	98	59	48	64	113	20	23	25	19	19	23	21
7	2020-05	629	518	192	176	91	87	50	66	150	43	27	15	18	19	19	13
8	2020-06	663	552	209	185	93	69	54	60	140	39	16	16	17	29	25	16
9	2020-07	599	471	214	193	89	78	65	59	130	40	27	25	21	18	18	12
10	2020-08	615	567	204	182	105	82	62	42	129	47	16	23	21	27	23	20
11	2020-09	621	481	230	220	102	88	56	49	143	54	14	15	17	15	19	15
12	2020-10	637	555	232	203	90	52	63	49	140	33	17	20	22	9	22	21

This is time series data, so a line chart is a good fit.

df.melt(id_vars = ['Date']).set_axis(['날짜', '회사', '판매량'], axis = 1)\
.plot.line(x = '날짜', y = '판매량', color = '회사')  ## nice, the Date strings were picked up as dates without any conversion
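Here the `'YYYY-MM'` strings happened to be handled gracefully, but it can be safer not to rely on that. A minimal sketch that parses the dates explicitly before melting, using the first three rows of two columns from the table above (the frame name `phone` is an assumption for illustration):

```python
import pandas as pd

# First three rows of the Samsung and Apple columns from the table above.
phone = pd.DataFrame({
    "Date": ["2019-10", "2019-11", "2019-12"],
    "Samsung": [461, 461, 426],
    "Apple": [324, 358, 383],
})

# Parse the month strings into real timestamps, then reshape to long format.
tidy = (
    phone.assign(Date=lambda d: pd.to_datetime(d["Date"], format="%Y-%m"))
    .melt(id_vars="Date", var_name="회사", value_name="판매량")
    .rename(columns={"Date": "날짜"})
)
print(tidy.dtypes)
```

With a real datetime column, each month is placed on the first day of that month, and sorting or resampling by date works as expected.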

C. `.plot.scatter()`

position_dict = {
    'GOALKEEPER':{'GK'},
    'DEFENDER':{'CB','RCB','LCB','RB','LB','RWB','LWB'},
    'MIDFIELDER':{'CM','RCM','LCM','CDM','RDM','LDM','CAM','RAM','LAM','RM','LM'},
    'FORWARD':{'ST','CF','RF','LF','RW','LW','RS','LS'},
    'SUB':{'SUB'},
    'RES':{'RES'}
}
df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2021/master/_notebooks/2021-10-25-FIFA22_official_data.csv')\
.loc[:,lambda df: df.isna().mean()<0.5].dropna()\
.assign(Position = lambda df: df.Position.str.split(">").str[-1].apply(lambda x: [k for k,v in position_dict.items() if x in v].pop()))\
.assign(Wage = lambda df: df.Wage.str[1:].str.replace('K','000').astype(int))
df

	ID	Name	Age	Photo	Nationality	Flag	Overall	Potential	Club	Club Logo	...	SlidingTackle	GKDiving	GKHandling	GKKicking	GKPositioning	GKReflexes	Best Position	Best Overall Rating	Release Clause	DefensiveAwareness
0	212198	Bruno Fernandes	26	https://cdn.sofifa.com/players/212/198/22_60.png	Portugal	https://cdn.sofifa.com/flags/pt.png	88	89	Manchester United	https://cdn.sofifa.com/teams/11/30.png	...	65.0	12.0	14.0	15.0	8.0	14.0	CAM	88.0	€206.9M	72.0
1	209658	L. Goretzka	26	https://cdn.sofifa.com/players/209/658/22_60.png	Germany	https://cdn.sofifa.com/flags/de.png	87	88	FC Bayern München	https://cdn.sofifa.com/teams/21/30.png	...	77.0	13.0	8.0	15.0	11.0	9.0	CM	87.0	€160.4M	74.0
2	176580	L. Suárez	34	https://cdn.sofifa.com/players/176/580/22_60.png	Uruguay	https://cdn.sofifa.com/flags/uy.png	88	88	Atlético de Madrid	https://cdn.sofifa.com/teams/240/30.png	...	38.0	27.0	25.0	31.0	33.0	37.0	ST	88.0	€91.2M	42.0
3	192985	K. De Bruyne	30	https://cdn.sofifa.com/players/192/985/22_60.png	Belgium	https://cdn.sofifa.com/flags/be.png	91	91	Manchester City	https://cdn.sofifa.com/teams/10/30.png	...	53.0	15.0	13.0	5.0	10.0	13.0	CM	91.0	€232.2M	68.0
4	224334	M. Acuña	29	https://cdn.sofifa.com/players/224/334/22_60.png	Argentina	https://cdn.sofifa.com/flags/ar.png	84	84	Sevilla FC	https://cdn.sofifa.com/teams/481/30.png	...	82.0	8.0	14.0	13.0	13.0	14.0	LB	84.0	€77.7M	80.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
16703	259718	F. Gebhardt	19	https://cdn.sofifa.com/players/259/718/22_60.png	Germany	https://cdn.sofifa.com/flags/de.png	52	66	FC Basel 1893	https://cdn.sofifa.com/teams/896/30.png	...	10.0	53.0	45.0	47.0	52.0	57.0	GK	52.0	€361K	6.0
16704	251433	B. Voll	20	https://cdn.sofifa.com/players/251/433/22_60.png	Germany	https://cdn.sofifa.com/flags/de.png	58	69	F.C. Hansa Rostock	https://cdn.sofifa.com/teams/27/30.png	...	10.0	59.0	60.0	56.0	55.0	61.0	GK	58.0	€656K	5.0
16706	262846	�. Dobre	20	https://cdn.sofifa.com/players/262/846/22_60.png	Romania	https://cdn.sofifa.com/flags/ro.png	53	63	FC Academica Clinceni	https://cdn.sofifa.com/teams/113391/30.png	...	12.0	57.0	52.0	53.0	48.0	58.0	GK	53.0	€279K	5.0
16707	241317	21 Xue Qinghao	19	https://cdn.sofifa.com/players/241/317/21_60.png	China PR	https://cdn.sofifa.com/flags/cn.png	47	60	Shanghai Shenhua FC	https://cdn.sofifa.com/teams/110955/30.png	...	9.0	49.0	48.0	45.0	38.0	52.0	GK	47.0	€223K	21.0
16708	259646	A. Shaikh	18	https://cdn.sofifa.com/players/259/646/22_60.png	India	https://cdn.sofifa.com/flags/in.png	47	67	ATK Mohun Bagan FC	https://cdn.sofifa.com/teams/113146/30.png	...	13.0	49.0	41.0	39.0	45.0	49.0	GK	47.0	€259K	7.0

14398 rows × 63 columns
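The two `assign` steps in the loading code are dense, so here is a plain-Python sketch of what they appear to do to a single value. The sample strings (`"<span>CAM"`, `"€250K"`) are assumed formats inferred from the parsing code, not copied from the CSV, and the helper names are hypothetical:

```python
# Shortened copy of the mapping, for illustration only
position_dict = {
    "GOALKEEPER": {"GK"},
    "DEFENDER": {"CB", "RB", "LB"},
    "MIDFIELDER": {"CM", "CAM"},
    "FORWARD": {"ST", "RW"},
}

# Invert {group: {codes}} into {code: group} so each lookup is a single dict access
code_to_group = {
    code: group
    for group, codes in position_dict.items()
    for code in codes
}

def parse_position(raw):
    """Keep the text after the last '>' and map it to its position group."""
    return code_to_group[raw.split(">")[-1]]

def parse_wage(raw):
    """Drop the leading currency symbol and expand a trailing 'K' to thousands."""
    return int(raw[1:].replace("K", "000"))

print(parse_position("<span>CAM"))  # assumed raw format
print(parse_wage("€250K"))          # assumed raw format
```

The inverted dictionary gives the same answer as the list comprehension with `.pop()` in the original, as long as every code belongs to exactly one group.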

A painful dataset to look at. We want to see the relationship between continuous variables, so a scatter plot is appropriate.

df.query('Position == "DEFENDER" or Position == "FORWARD"')\
.plot.scatter(x = 'ShotPower', y = 'StandingTackle',
             color = 'Position', size = 'Wage', hover_data = ['Name', 'Age'],
             opacity = 0.5, width = 800)  ## hover_data sets the fields shown on mouse hover; transparency is controlled with opacity instead of alpha

D. `.plot.box()`

# Example 1 : Let's go! Jeonbuk High School!

y1=[75,75,76,76,77,77,78,79,79,98] # scores of students who learned statistics from teacher A
y2=[76,76,77,77,78,78,79,80,80,81] # scores of students who learned statistics from teacher B

df = pd.DataFrame({
    'Class' : ['A']*len(y1) + ['B']*len(y2),
    'Score' : y1+y2
})
df

	Class	Score
0	A	75
1	A	75
2	A	76
3	A	76
4	A	77
5	A	77
6	A	78
7	A	79
8	A	79
9	A	98
10	B	76
11	B	76
12	B	77
13	B	77
14	B	78
15	B	78
16	B	79
17	B	80
18	B	80
19	B	81

df.plot.box(x = 'Class', y = 'Score', color = 'Class',
            points = 'all', width = 500, backend = 'plotly')  ## points = 'all' draws every observation; setting it to 'outliers' draws only the outlier points
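To see why the score of 98 in class A stands apart, the usual 1.5 × IQR rule can be checked by hand. This is a rough sketch using the standard library's `statistics.quantiles` with `method="inclusive"`; plotly's own quartile computation may differ slightly, so the fences here are an approximation of what the box plot draws. The helper name `iqr_outliers` is hypothetical:

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

y1 = [75, 75, 76, 76, 77, 77, 78, 79, 79, 98]  # class A
y2 = [76, 76, 77, 77, 78, 78, 79, 80, 80, 81]  # class B

print(iqr_outliers(y1))  # [98]
print(iqr_outliers(y2))  # []
```

Class A has the higher mean only because of that single score, which is exactly what the box plot makes visible.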

# Example 2 : electricity usage by (year, province)

url = 'https://raw.githubusercontent.com/guebin/DV2022/main/posts/Energy/{}.csv'
prov = ['Seoul', 'Busan', 'Daegu', 'Incheon',
        'Gwangju', 'Daejeon', 'Ulsan', 'Sejongsi',
        'Gyeonggi-do', 'Gangwon-do', 'Chungcheongbuk-do',
        'Chungcheongnam-do', 'Jeollabuk-do', 'Jeollanam-do',
        'Gyeongsangbuk-do', 'Gyeongsangnam-do', 'Jeju-do']
df = pd.concat([pd.read_csv(url.format(p+y)).assign(년도=y, 시도=p) for p in prov for y in ['2018', '2019', '2020', '2021']]).reset_index(drop=True)\
.assign(년도 = lambda df: df.년도.astype(int))\
.set_index(['년도','시도','지역']).applymap(lambda x: int(str(x).replace(',','')))\
.reset_index()
df.head()

C:\Users\hollyriver\AppData\Local\Temp\ipykernel_12860\2652719362.py:9: FutureWarning:

DataFrame.applymap has been deprecated. Use DataFrame.map instead.

	년도	시도	지역	건물동수	연면적	에너지사용량(TOE)/전기	에너지사용량(TOE)/도시가스	에너지사용량(TOE)/지역난방
0	2018	Seoul	종로구	17929	9141777	64818	82015	111
1	2018	Seoul	중구	10598	10056233	81672	75260	563
2	2018	Seoul	용산구	17201	10639652	52659	85220	12043
3	2018	Seoul	성동구	14180	11631770	60559	107416	0
4	2018	Seoul	광진구	21520	12054796	70609	130308	0
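The FutureWarning above comes from `applymap`. One way to avoid the elementwise call altogether is to strip the commas column by column with the `.str` accessor. A minimal sketch, assuming the raw CSV stores numbers as comma-separated strings (which is what the original `replace(',','')` suggests); the frame `raw` is a hand-made stand-in, not the downloaded data:

```python
import pandas as pd

# Stand-in for two raw columns, with thousands separators as strings (assumed format)
raw = pd.DataFrame({
    "건물동수": ["17,929", "10,598"],
    "연면적": ["9,141,777", "10,056,233"],
})

# Column-wise cleanup: cast to str, remove commas, convert to integers.
# DataFrame.apply passes one column (a Series) at a time.
clean = raw.apply(
    lambda col: col.astype(str).str.replace(",", "", regex=False).astype(int)
)
print(clean.dtypes)
```

In the original pipeline this would take the place of the `.applymap(...)` call after `set_index`; on recent pandas versions, `DataFrame.map` is the direct replacement named in the warning.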

df.plot.box(x = '시도', y = '에너지사용량(TOE)/전기', color = '시도', facet_row = '년도', width = 800, height = 1000, hover_data = ['지역', '연면적'], points = 'outliers')  ## 'outliers' is the default

E. `.plot.hist()`

# Example 1 : Titanic survivors by (age, sex)

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	logFare
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	1.981001
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	4.266662
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	2.070022
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	3.972177
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	2.085672
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	2.564949
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	3.401197
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	3.154870
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	3.401197
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	2.047693

891 rows × 13 columns

df.Age.hist()

A histogram can also be drawn directly from a Series, without passing any arguments.

df.plot.hist(x = 'Age', color = 'Sex',
            facet_row = 'Sex', facet_col = 'Survived')

The effect of sex appears to weaken at younger ages. (Visualizing the survival rate instead would probably also work well.)

df.plot.hist(x = 'Age', color = 'Survived', facet_col = 'Sex')

df.loc[:, ['Age', 'Sex', 'Survived']].assign(Age_cut = lambda _df : pd.qcut(_df.Age, q = 10))\
.pivot_table(index = 'Age_cut', columns = 'Sex', values = 'Survived', aggfunc = 'mean')\
.stack().reset_index().rename({0 : 'Rate'}, axis = 1)\
.plot.bar(x = 'Sex', y = 'Rate', color = 'Sex', facet_col = 'Age_cut', width = 800)

C:\Users\hollyriver\anaconda3\envs\py\lib\site-packages\plotly\express\_core.py:2044: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

Fine-tuning is left to the reader…
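For reference, the survival rate per (age bin, sex) can also be computed with `groupby` instead of `pivot_table` plus `stack`. A minimal sketch on a made-up toy frame (the values are invented, not taken from the Titanic data), with two age bins instead of ten; `observed=True` is passed explicitly so the handling of the categorical bins does not depend on the pandas default:

```python
import pandas as pd

# Invented toy data: four children and four adults
toy = pd.DataFrame({
    "Age": [5, 8, 30, 40, 6, 9, 35, 45],
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [1, 0, 0, 0, 1, 1, 1, 0],
})

rates = (
    toy.assign(Age_cut=lambda d: pd.qcut(d.Age, q=2))      # two equal-sized age bins
       .groupby(["Age_cut", "Sex"], observed=True)["Survived"]
       .mean()                                             # mean of 0/1 = survival rate
       .reset_index(name="Rate")
)
print(rates)
```

The resulting long table has one row per (age bin, sex) pair, which is the shape the bar chart with `facet_col = 'Age_cut'` expects.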

F. `.plot.area()`

# Example 1 : phone sales

df = pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df

	Date	Samsung	Apple	Huawei	Xiaomi	Oppo	Mobicel	Motorola	LG	Others	Realme	Google	Nokia	Lenovo	OnePlus	Sony	Asus
0	2019-10	461	324	136	109	76	81	43	37	135	28	39	14	22	17	20	17
1	2019-11	461	358	167	141	86	61	29	36	141	27	29	20	23	10	19	27
2	2019-12	426	383	143	105	53	45	51	48	129	30	20	26	28	18	18	19
3	2020-01	677	494	212	187	110	79	65	49	158	23	13	19	19	22	27	22
4	2020-02	593	520	217	195	112	67	62	71	157	25	18	16	24	18	23	20
5	2020-03	637	537	246	187	92	66	59	67	145	21	16	24	18	31	22	14
6	2020-04	647	583	222	154	98	59	48	64	113	20	23	25	19	19	23	21
7	2020-05	629	518	192	176	91	87	50	66	150	43	27	15	18	19	19	13
8	2020-06	663	552	209	185	93	69	54	60	140	39	16	16	17	29	25	16
9	2020-07	599	471	214	193	89	78	65	59	130	40	27	25	21	18	18	12
10	2020-08	615	567	204	182	105	82	62	42	129	47	16	23	21	27	23	20
11	2020-09	621	481	230	220	102	88	56	49	143	54	14	15	17	15	19	15
12	2020-10	637	555	232	203	90	52	63	49	140	33	17	20	22	9	22	21

df.melt(id_vars = 'Date').set_axis(['날짜', '회사', '판매량'], axis = 1)\
.plot.area(x = '날짜', y = '판매량', color = '회사')

This shows both the overall sales volume and each company's individual volume. (Two plots in one!)
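The reason an area plot shows both at once is that the bands are stacked: each band's upper edge is a running sum, so the topmost edge equals the total. A small sketch with three companies and the first two months from the table above (the subset is chosen only to keep the example short):

```python
import pandas as pd

# First two months, three companies, taken from the table above
wide = pd.DataFrame(
    {"Samsung": [461, 461], "Apple": [324, 358], "Huawei": [136, 167]},
    index=["2019-10", "2019-11"],
)

stacked = wide.cumsum(axis=1)  # upper edge of each band in the stacked chart
total = wide.sum(axis=1)       # overall sales per month

print(stacked)
print(total)
```

The last column of `stacked` matches `total`, while the thickness of each band is that company's own sales.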

# Example 2 : energy usage

url = 'https://raw.githubusercontent.com/guebin/DV2022/main/posts/Energy/{}.csv'
prov = ['Seoul', 'Busan', 'Daegu', 'Incheon',
        'Gwangju', 'Daejeon', 'Ulsan', 'Sejongsi',
        'Gyeonggi-do', 'Gangwon-do', 'Chungcheongbuk-do',
        'Chungcheongnam-do', 'Jeollabuk-do', 'Jeollanam-do',
        'Gyeongsangbuk-do', 'Gyeongsangnam-do', 'Jeju-do']
df = pd.concat([pd.read_csv(url.format(p+y)).assign(년도=y, 시도=p) for p in prov for y in ['2018', '2019', '2020', '2021']]).reset_index(drop=True)\
.assign(년도 = lambda df: df.년도.astype(int))\
.set_index(['년도','시도','지역']).applymap(lambda x: int(str(x).replace(',','')))\
.reset_index()
df.head()

C:\Users\hollyriver\AppData\Local\Temp\ipykernel_12860\2652719362.py:9: FutureWarning:

DataFrame.applymap has been deprecated. Use DataFrame.map instead.

	년도	시도	지역	건물동수	연면적	에너지사용량(TOE)/전기	에너지사용량(TOE)/도시가스	에너지사용량(TOE)/지역난방
0	2018	Seoul	종로구	17929	9141777	64818	82015	111
1	2018	Seoul	중구	10598	10056233	81672	75260	563
2	2018	Seoul	용산구	17201	10639652	52659	85220	12043
3	2018	Seoul	성동구	14180	11631770	60559	107416	0
4	2018	Seoul	광진구	21520	12054796	70609	130308	0

df.set_index(['년도','시도','지역','건물동수','연면적']).stack().reset_index().rename({'level_5' : '에너지종류', 0 : '에너지사용량'}, axis = 1)\
.assign(에너지종류 = lambda _df : _df.에너지종류.str.split('/').apply(lambda x : x[-1]))\
.pivot_table(index = ['년도', '시도', '에너지종류'], values = '에너지사용량', aggfunc = 'sum')\
.reset_index().plot.area(x = '년도', y = '에너지사용량', facet_col = '에너지종류', color = '시도', width = 600)

## store the plot as a figure
fig = df.set_index(['년도','시도','지역','건물동수','연면적']).stack().reset_index().rename({'level_5' : '에너지종류', 0 : '에너지사용량'}, axis = 1)\
.assign(에너지종류 = lambda _df : _df.에너지종류.str.split('/').str[-1])\
.pivot_table(index = ['에너지종류', '시도', '년도'], values = '에너지사용량', aggfunc = 'sum').reset_index()\
.plot.area(x = '년도', y = '에너지사용량', facet_col = '에너지종류', color = '시도', width = 600)

## Presumably this works because restricting each xaxis domain also shrinks the range over which its x labels are laid out.
fig.update_layout(
    xaxis_domain = [0.0, 0.25],
    xaxis2_domain = [0.35, 0.60],
    xaxis3_domain = [0.70, 0.95]
)

When the panels overlap, this is one way to fix it… though since the panels became narrower, there is probably a better way to improve this…
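Rather than typing the three domains by hand, they can be generated from a panel width and a gap. A minimal sketch in plain Python; `facet_domains` and `layout_kwargs` are hypothetical helpers, not part of plotly, and they only build the keyword arguments that `fig.update_layout` received above:

```python
def facet_domains(n_cols, width, gap):
    """Return [start, end] x-domains for n_cols panels laid out left to right."""
    if n_cols * width + (n_cols - 1) * gap > 1:
        raise ValueError("panels and gaps do not fit into the [0, 1] range")
    domains = []
    for i in range(n_cols):
        start = round(i * (width + gap), 4)
        domains.append([start, round(start + width, 4)])
    return domains

def layout_kwargs(domains):
    """Map domains to xaxis_domain, xaxis2_domain, ... keyword arguments."""
    kwargs = {}
    for i, domain in enumerate(domains, start=1):
        axis = "xaxis" if i == 1 else f"xaxis{i}"
        kwargs[f"{axis}_domain"] = domain
    return kwargs

kwargs = layout_kwargs(facet_domains(n_cols=3, width=0.25, gap=0.10))
print(kwargs)
# fig.update_layout(**kwargs)  # would apply the same domains as the manual call above
```

With `width=0.25` and `gap=0.10` this reproduces the values used above. Plotly Express also has a `facet_col_spacing` argument, which may be a simpler route if it can be passed through `.plot.area()`; that has not been checked here.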

