From 7573bd6b897e2024099832a732a265195fecc758 Mon Sep 17 00:00:00 2001 From: Yeseo Jang Date: Mon, 28 Feb 2022 21:13:04 +0900 Subject: [PATCH 1/3] updata curriculum --- README.md | 31 +++++++++++++++++++++++++++++-- 1 file changed, 29 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 7664941..0fa0641 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,29 @@ -# 2022-1-Euron-Study-Assignments -Preview and review assignment submissions for the Euron 2nd-cohort study teams +# Euron 2nd Cohort: Preview and Review Assignment Submissions + +### ▶ [Computer Vision](https://github.com/Ewha-Euron/2022-1-Euron-CV) +### ▶ [Natural Language Processing](https://github.com/Ewha-Euron/2022-1-Euron-NLP) +### ▶ [Data Analysis](https://github.com/Ewha-Euron/2022-1-Euron-DA) + +## Curriculum + +| Week | Date | CV | NLP | DA | +|---|---|---|---|---| +|1|22/03/08|cs231n week 1|cs224n lecture 1|Python Machine Learning Complete Guide ch. 1-3| +|2|22/03/15|cs231n week 2|cs224n lecture 2|Python Machine Learning Complete Guide ch. 4 (1)| +|3|22/03/22|cs231n week 3|cs224n lecture 3|Python Machine Learning Complete Guide ch. 4 (2)| +|4|22/03/29|cs231n week 4|cs224n lecture 4|Ch. 4 code transcription| +|5|22/04/05|cs231n week 5|cs224n lecture 5|Python Machine Learning Complete Guide ch. 5| +|6|22/04/12|cs231n week 6|cs224n lecture 6|Ch. 5 code transcription| +|7|22/04/19|cs231n week 7|cs224n lecture 7|Python Machine Learning Complete Guide ch. 6| +|8|22/05/03|cs231n week 8|cs224n lecture 8|Ch. 6 code transcription| +|9|22/05/10|cs231n week 9|cs224n lecture 9|Python Machine Learning Complete Guide ch. 7| +|10|22/05/17|cs231n week 10|cs224n lecture 10|Ch. 7 code transcription| +|11|22/05/24|cs231n week 11|cs224n lecture 11|Python Machine Learning Complete Guide ch. 9| +|12|22/05/31|cs231n week 12|cs224n lecture 12|Ch. 9 code transcription| +|13|22/06/07|cs231n week 13|cs224n lecture 13|Kaggle notebook transcription 1| +|14|22/06/21|cs231n week 14|cs224n lecture 14|Kaggle notebook transcription 2| +|15|22/06/28|Paper study 1|cs224n lecture 15|Kaggle notebook transcription 2| +|16|22/07/05|Paper study 2|cs224n lecture 18|| +|17|22/07/12|Paper study 3|cs224n lecture 20|| +|18|22/07/19||cs224n lecture 21|| +|19|22/07/26||cs224n lecture 22|| From 18ac8eb26c0965bf93658cac522f3d099a8890e9 Mon Sep 17 00:00:00 2001 From: kimsook <40443049+kimsook@users.noreply.github.com> Date: Thu, 7 Jul 2022 23:43:21 +0900 Subject: [PATCH 2/3] =?UTF-8?q?18=EC=A3=BC=EC=B0=A8=20=EC=98=88=EC=8A=B5?= =?UTF-8?q?=EA=B3=BC=EC=A0=9C?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 
--- ...354\212\265\352\263\274\354\240\234.ipynb" | 3222 +++++++++++++++++ 1 file changed, 3222 insertions(+) create mode 100644 "week18_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" diff --git "a/week18_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" "b/week18_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" new file mode 100644 index 0000000..28d1a67 --- /dev/null +++ "b/week18_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" @@ -0,0 +1,3222 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "D4TDEvhv7_od" + }, + "source": [ + "# Autoencoder based Anomaly Detection\n", + "https://dacon.io/competitions/official/235757/codeshare/4641?page=1&dtype=recent\n", + "\n", + "This code uses an autoencoder-based anomaly detection model.\n", + "\n", + "It implements a Conv1D-LSTM autoencoder model, built with keras from tensorflow." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9vGhX84C7_og" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn import metrics\n", + "\n", + "from tqdm.notebook import trange\n", + "from TaPR_pkg import etapr\n", + "from pathlib import Path\n", + "import time\n", + "\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "from tensorflow.keras.models import load_model\n", + "from tensorflow.keras.callbacks import EarlyStopping" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ywtp-GXC7_oi" + }, + "source": [ + "The libraries above are used." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "dATg00ye7_oi", + "outputId": "c31eaaf3-552d-487f-f365-96349892c4de" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPython 3.6.11\n", + "IPython 7.16.1\n", + "\n", + "numpy 1.18.5\n", + "matplotlib 3.3.1\n", + "pandas 1.1.2\n", + "sklearn 0.0\n", + "tqdm 4.48.2\n", + "TaPR_pkg unknown\n", + "pathlib 1.0.1\n", + "tensorflow 2.3.0\n" + ] + } + ], + "source": [ + "%reload_ext watermark\n", + "%watermark -v -p numpy,matplotlib,pandas,sklearn,tqdm,TaPR_pkg,pathlib,tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bPB_oE0f7_oj" + }, + "source": [ + "As shown above, a tensorflow version of 2.0 or later is used.\n", + "\n", + "For the TaPR package, please refer to the baseline.\n", + "\n", + "In this code, the \"eTaPR-1.12-py3-none-any.whl\" file was installed directly.\n", + "\n", + "From the directory that contains \"eTaPR-1.12-py3-none-any.whl\",\n", + "it can be installed easily by running a command such as python -m pip install \"eTaPR-1.12-py3-none-any.whl\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-EnKiv7c7_ok" + }, + "source": [ + "## Data preprocessing\n", + "\n", + "The training and test data are provided as CSV files.\n", + "Because HAI 2.0 is provided as multiple files rather than a single file, every CSV in the directory is read.\n", + "\n", + "Most of the preprocessing and data loading follows the baseline code." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZF1az35n7_ok" + }, + "source": [ + "# 1) Loading the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "go247fGg7_ol" + }, + "outputs": [], + "source": [ + "TRAIN_DATASET = sorted([x for x in Path(\"D:\\\\data\\\\HAI 2.0\\\\training\\\\\").glob(\"*.csv\")])\n", + "TRAIN_DATASET" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "isWg5xJD7_om" + }, + "outputs": [], + "source": [ + "TEST_DATASET = sorted([x for x in Path(\"D:\\\\data\\\\HAI 2.0\\\\testing\").glob(\"*.csv\")])\n", + "TEST_DATASET" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AfgFH9Iu7_on" + }, + "outputs": [], + "source": [ + "VALIDATION_DATASET = sorted([x for x in Path(\"D:\\\\data\\\\HAI 2.0\\\\validation\").glob(\"*.csv\")])\n", + "VALIDATION_DATASET" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "O9bmFhuB7_op" + }, + "outputs": [], + "source": [ + "def dataframe_from_csv(target):\n", + " return pd.read_csv(target, engine='python').rename(columns=lambda x: x.strip())\n", + "\n", + "def dataframe_from_csvs(targets):\n", + " return pd.concat([dataframe_from_csv(x) for x in targets])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": false, + "id": "-_VLDr4E7_oq", + "outputId": "8dde8777-2266-4080-aabf-514d59e43ba7" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " time C01 C02 C03 C04 C05 C06 \\\n", + "0 2020-07-11 00:00:00 395.19528 12 10 52.80456 -1.2648 -1.87531 \n", + "1 2020-07-11 00:00:01 395.14420 12 10 52.78931 -1.3147 -1.88294 \n", + "2 2020-07-11 00:00:02 395.14420 12 10 52.79694 -1.4032 -1.88294 \n", + "3 2020-07-11 00:00:03 395.19528 12 10 52.79694 -1.6074 -1.88294 \n", + "4 2020-07-11 00:00:04 395.34866 12 10 52.79694 -1.7811 -1.88294 \n", + "... ... ... ... ... ... ... ... \n", + "478796 2020-08-10 10:59:56 387.27219 12 10 66.72057 -0.9331 -1.84479 \n", + "478797 2020-08-10 10:59:57 387.52774 12 10 66.72057 -0.9996 -1.84479 \n", + "478798 2020-08-10 10:59:58 387.47665 12 10 66.72057 -1.2560 -1.84479 \n", + "478799 2020-08-10 10:59:59 387.73221 12 10 66.72057 -1.4912 -1.84479 \n", + "478800 2020-08-10 11:00:00 387.52774 12 10 66.72057 -1.5727 -1.84479 \n", + "\n", + " C07 C08 C09 ... C70 C71 C72 C73 \\\n", + "0 779.59595 28.02645 10832.0 ... 808.29620 0.0 1.36810 8.79882 \n", + "1 780.67328 28.02473 10984.0 ... 819.16809 0.0 1.36810 8.78811 \n", + "2 780.06574 28.02817 11120.0 ... 823.51697 0.0 1.36734 8.81787 \n", + "3 780.15265 28.02301 11256.0 ... 823.95172 0.0 1.36734 8.87493 \n", + "4 781.83160 28.03595 11384.0 ... 827.86560 0.0 1.36810 8.83838 \n", + "... ... ... ... ... ... ... ... ... \n", + "478796 781.87915 28.02389 880.0 ... 944.84705 0.0 1.32843 15.17817 \n", + "478797 787.65070 28.02385 840.0 ... 940.49835 0.0 1.32843 15.17344 \n", + "478798 788.50256 28.03085 792.0 ... 935.71472 0.0 1.32919 15.16443 \n", + "478799 785.80316 28.02649 752.0 ... 944.84705 0.0 1.32843 15.09001 \n", + "478800 780.21381 28.02476 720.0 ... 
951.80505 0.0 1.32919 15.08672 \n", + "\n", + " C74 C75 C76 C77 C78 C79 \n", + "0 35.43700 12.01782 305.03113 301.35992 33.6555 6.0951 \n", + "1 35.45227 12.01782 304.27161 297.43567 33.6555 5.9262 \n", + "2 35.45227 12.01782 303.89179 298.66534 33.6555 5.8101 \n", + "3 35.43700 12.01782 303.67474 298.06860 33.6555 5.7509 \n", + "4 35.45227 12.01782 303.22266 296.53137 33.6555 5.8547 \n", + "... ... ... ... ... ... ... \n", + "478796 35.14710 11.79657 316.89453 296.54950 32.0000 6.6026 \n", + "478797 35.13183 11.79657 315.59247 296.15161 32.0000 6.3894 \n", + "478798 35.13183 11.79657 313.92865 293.40277 32.0000 6.2584 \n", + "478799 35.14710 11.79657 315.61054 302.58972 32.0000 6.4150 \n", + "478800 35.14710 11.79657 317.23816 309.00964 32.0000 6.6288 \n", + "\n", + "[921603 rows x 80 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "TRAIN_DF_RAW = dataframe_from_csvs(TRAIN_DATASET)\n", + "TRAIN_DF_RAW" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lF5jv6VF7_os" + }, + "source": [ + "# 2) Setting up variables" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UxRPRM627_ou" + }, + "source": [ + "Next, the field names are used in the same way as in the baseline.\n", + "\n", + "\n", + "\n", + "The training dataset contains normal-operation data recorded with no attacks. It has a time field holding the timestamp, and all remaining fields are de-identified sensor/actuator values. Normalization should be applied only to the sensor/actuator fields.\n", + "\n", + "Because this document detects anomalies over the data as a whole, only the \"attack\" field is used.\n", + "\n", + "VALID_COLUMNS_IN_TRAIN_DATASET holds every sensor/actuator field in the training dataset. Occasionally the test dataset contains fields that do not exist in the training dataset. Since fields never seen during training cannot be tested, the field names are taken from the training dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "LgsKguip7_ou", + "outputId": "7594ca7f-68e3-4489-a544-09e6574f4248" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['C01', 'C02', 'C03', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10',\n", + " 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20',\n", + " 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30',\n", + " 'C31', 'C32', 'C33', 'C34', 'C35', 'C36', 'C37', 'C38', 'C39', 'C40',\n", + " 'C41', 'C42', 'C43', 'C44', 'C45', 'C46', 'C47', 'C48', 'C49', 'C50',\n", + " 'C51', 'C52', 'C53', 'C54', 'C55', 'C56', 'C57', 'C58', 'C59', 'C60',\n", + " 'C61', 'C62', 'C63', 'C64', 'C65', 'C66', 'C67', 'C68', 'C69', 'C70',\n", + " 'C71', 'C72', 'C73', 'C74', 'C75', 'C76', 'C77', 'C78', 'C79'],\n", + " dtype='object')" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "TIMESTAMP_FIELD = \"time\"\n", + "IDSTAMP_FIELD = 'id'\n", + "ATTACK_FIELD = \"attack\"\n", + "VALID_COLUMNS_IN_TRAIN_DATASET = TRAIN_DF_RAW.columns.drop([TIMESTAMP_FIELD])\n", + "VALID_COLUMNS_IN_TRAIN_DATASET" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rDO22kXQ7_ov" + }, + "source": [ + "# 3) Normalizing the data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E36pbLyB7_ow" + }, + "source": [ + "This study also normalizes the data with the normalize function.\n", + "The method is min-max normalization, which uses each field's maximum and minimum to map values into the range 0 to 1." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z7R_JKS87_ow" + }, + "outputs": [], + "source": [ + "TAG_MIN = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].min()\n", + "TAG_MAX = TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wdI3MHAx7_ox" + }, + "outputs": [], + "source": [ + "def normalize(df):\n", + " ndf = df.copy()\n", + " for c in df.columns:\n", + " if TAG_MIN[c] == TAG_MAX[c]:\n", + " ndf[c] = df[c] - TAG_MIN[c]\n", + " else:\n", + " ndf[c] = (df[c] - TAG_MIN[c]) / (TAG_MAX[c] - TAG_MIN[c])\n", + " return ndf" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oyFm15Rz7_oy" + }, + "source": [ + "First the training dataset is normalized, and then the boundary_check function verifies that the normalization worked as expected." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OBSa9E2n7_oy" + }, + "outputs": [], + "source": [ + "TRAIN_DF = normalize(TRAIN_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YqFEfHFH7_oy" + }, + "outputs": [], + "source": [ + "def boundary_check(df):\n", + " x = np.array(df, dtype=np.float32)\n", + " print(x)\n", + " return np.any(x > 1.0), np.any(x < 0), np.any(np.isnan(x))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": false, + "id": "3ZBe9-I67_oz", + "outputId": "8235eade-6552-455f-ca88-9e4440cc4932" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[0.37895253 0. 0. ... 0.2650172 1. 0.5672542 ]\n", + " [0.37845883 0. 0. ... 0.2504694 1. 0.5066231 ]\n", + " [0.37845883 0. 0. ... 0.25502798 1. 0.46494597]\n", + " ...\n", + " [0.30434805 0. 0. ... 0.23551878 0.26161984 0.625875 ]\n", + " [0.3068182 0. 0. ... 0.26957628 0.26161984 0.6820907 ]\n", + " [0.30484188 0. 0. ... 
0.29337597 0.26161984 0.7588398 ]]\n" + ] + }, + { + "data": { + "text/plain": [ + "(False, False, False)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "boundary_check(TRAIN_DF)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P8ou_fGT7_oz" + }, + "source": [ + "# Setting up the training model\n", + "\n", + "The model is implemented with keras.\n", + "\n", + "The core of this challenge is to learn only from normal-operation data and then detect attacks and abnormal situations.\n", + "\n", + "Autoencoders are commonly used for image generation and restoration. After a model is trained on normal images, feeding it an abnormal image and decoding it yields a reconstruction error, the difference between the features of normal images and the decoded image. Regions with low reconstruction error can be judged normal, and regions with high reconstruction error abnormal.\n", + "\n", + "This study applies this anomaly detection approach to time-series data instead of images.\n", + "\n", + "The autoencoder is built from LSTM layers so that it can learn sequences.\n", + "In addition, a 1D-Convolution layer is applied so that training proceeds by sliding finely over the timestamp and feature information." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CgoXMTJe7_o0" + }, + "outputs": [], + "source": [ + "def temporalize(X, y, timesteps):\n", + " output_X = []\n", + " output_y = []\n", + " for i in range(len(X) - timesteps - 1):\n", + " t = []\n", + " for j in range(1, timesteps + 1):\n", + " t.append(X[[(i + j + 1)], :])\n", + " output_X.append(t)\n", + " output_y.append(y[i + timesteps + 1])\n", + " return np.squeeze(np.array(output_X)), np.array(output_y)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "31v922Pg7_o0" + }, + "source": [ + "The function above can split the dataset into windows of timesteps for training, but\n", + "because a Conv1D layer is used, the number of timesteps is set to 1 and the data is reshaped into a 3-dimensional shape." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ztVrNPOs7_o0", + "outputId": "4a266c56-c3dd-4df8-95dd-e9f147262705" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(921603, 1, 79)" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train = np.array(TRAIN_DF)\n", + "x_train = train.reshape(train.shape[0], 1, train.shape[1])\n", + "x_train.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T4anF-bV7_o1" + }, + "source": [ + "# Structure of the training model\n", + "\n", + "Parameter descriptions\n", + "\n", + "Conv1D\n", + "- filters : number of output channels produced by the convolution\n", + "- kernel_size : how many timestamps the kernel covers (= window_size)\n", + "- padding : how the input is padded at its edges ('same' is used in the model below)\n", + "- dilation_rate : spacing between the elements of the kernel\n", + "- strides : default = 1, step size by which the convolution window moves\n", + "\n", + "LSTM\n", + "- units : only the output dimensionality is set" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YDhbDsQu7_o1" + }, + "source": [ + "The model is structured as Conv1D - Dense - LSTM - Dense, designed so that the encoder and decoder are symmetric.\n", + "Experiments mainly varied filters, kernel_size, and the units of the Dense and LSTM layers.\n", + "Many parameter settings were tried, and the model below gave the best results.\n", + "\n", + "Additional Conv1D layers can be added, or techniques from conventional CNN models such as max pooling can be applied.\n", + "\n", + "In my experiments, omitting pooling gave better results, but anyone testing the model may want to try \n", + "a variety of layer and parameter settings." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uBcEfGAV7_o2" + }, + "outputs": [], + "source": [ + "def conv_auto_model(x):\n", + " n_steps = x.shape[1]\n", + " n_features = x.shape[2]\n", + "\n", + " keras.backend.clear_session()\n", + "\n", + " model = keras.Sequential(\n", + " [\n", + " layers.Input(shape=(n_steps, n_features)),\n", + " layers.Conv1D(filters=512, kernel_size=64, padding='same', data_format='channels_last',\n", + " dilation_rate=1, activation=\"linear\"),\n", + " layers.Dense(128),\n", + " layers.LSTM(\n", + " units=64, activation=\"relu\", name=\"lstm_1\", return_sequences=False\n", + " ),\n", + " layers.Dense(64),\n", + " layers.RepeatVector(n_steps),\n", + " layers.Dense(64),\n", + " layers.LSTM(\n", + " units=64, activation=\"relu\", name=\"lstm_2\", return_sequences=True\n", + " ),\n", + " layers.Dense(128),\n", + " layers.Conv1D(filters=512, kernel_size=64, padding='same', data_format='channels_last',\n", + " dilation_rate=1, activation=\"linear\"),\n", + " layers.TimeDistributed(layers.Dense(x.shape[2], activation='linear'))\n", + " ]\n", + " )\n", + " return model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e7c19iQx7_o3" + }, + "source": [ + "# Checking the model structure" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ab15qFLg7_o3", + "outputId": "a00193d2-96aa-4e73-a592-53dbd6ecacd9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"sequential\"\n", + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "conv1d (Conv1D) (None, 1, 512) 2589184 \n", + "_________________________________________________________________\n", + "dense (Dense) (None, 1, 128) 65664 \n", + "_________________________________________________________________\n", + "lstm_1 (LSTM) (None, 64) 49408 \n", + 
"_________________________________________________________________\n", + "dense_1 (Dense) (None, 64) 4160 \n", + "_________________________________________________________________\n", + "repeat_vector (RepeatVector) (None, 1, 64) 0 \n", + "_________________________________________________________________\n", + "dense_2 (Dense) (None, 1, 64) 4160 \n", + "_________________________________________________________________\n", + "lstm_2 (LSTM) (None, 1, 64) 33024 \n", + "_________________________________________________________________\n", + "dense_3 (Dense) (None, 1, 128) 8320 \n", + "_________________________________________________________________\n", + "conv1d_1 (Conv1D) (None, 1, 512) 4194816 \n", + "_________________________________________________________________\n", + "time_distributed (TimeDistri (None, 1, 79) 40527 \n", + "=================================================================\n", + "Total params: 6,989,263\n", + "Trainable params: 6,989,263\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "model = conv_auto_model(x_train)\n", + "model.compile(optimizer='adam', loss='mse')\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q4gUhyhA7_o4" + }, + "source": [ + "# Training the model\n", + "Training uses 50 epochs with early stopping.\n", + "In the submitted code, only 3 epochs are run as an example." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vhN7gMAx7_o4", + "outputId": "8821d670-5b64-4331-b01b-200c5fd5360f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train on 737282 samples, validate on 184321 samples\n", + "Epoch 1/3\n", + "737282/737282 [==============================] - 401s 544us/sample - loss: 0.0020 - val_loss: 3.2852e-04\n", + "Epoch 2/3\n", + "737282/737282 [==============================] - 401s 544us/sample - loss: 1.4245e-04 - val_loss: 3.2397e-04\n", + "Epoch 3/3\n", + "737282/737282 [==============================] - 390s 529us/sample - loss: 8.5787e-05 - val_loss: 2.5324e-04\n" + ] + } + ], + "source": [ + "early_stopping = EarlyStopping(monitor='val_loss', patience=5)\n", + "\n", + "epochs = 3\n", + "batch = 64\n", + "\n", + "# fit\n", + "history = model.fit(x_train, x_train,\n", + " epochs=epochs, batch_size=batch,\n", + " validation_split=0.2, callbacks=[early_stopping]).history\n", + "\n", + "model.save('model.h5')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "HTTxx_SR7_o5", + "outputId": "df3e8d42-72dc-4b89-e21e-1de301aab6fb" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(history['loss'], label='train loss')\n", + "plt.plot(history['val_loss'], label='valid loss')\n", + "plt.legend()\n", + "plt.xlabel('Epoch'); plt.ylabel('loss')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9xP_ovAk7_o5" + }, + "source": [ + "# Loading the trained model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JZ-gmHGw7_o6" + }, + "source": [ + "Of the previously trained models, the one with the best results is loaded here to check its results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "G6TeZ0l37_o6" + }, + "outputs": [], + "source": [ + "model = load_model('best_model.h5')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e_NxtYZV7_o6" + }, + "source": [ + "# Detecting anomalies by applying the trained model to the validation dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "1Ppn5xIp7_o6", + "outputId": "080003ce-118b-4b5f-c16c-15d21003c6aa" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " time C01 C02 C03 C04 C05 C06 \\\n", + "0 2020-07-07 15:00:00 402.70947 12.0 10 51.95007 -1.0189 -1.86768 \n", + "1 2020-07-07 15:00:01 402.81174 12.0 10 51.96533 -1.2637 -1.86768 \n", + "2 2020-07-07 15:00:02 402.76062 12.0 10 51.96533 -1.5398 -1.86768 \n", + "3 2020-07-07 15:00:03 402.81174 12.0 10 51.98822 -1.6212 -1.86768 \n", + "4 2020-07-07 15:00:04 402.91394 12.0 10 51.90429 -1.5631 -1.86768 \n", + "... ... ... ... ... ... ... ... \n", + "43196 2020-07-08 02:59:56 397.08661 12.0 10 66.58325 -1.2052 -1.83716 \n", + "43197 2020-07-08 02:59:57 397.18887 12.0 10 66.58325 -0.9256 -1.83716 \n", + "43198 2020-07-08 02:59:58 397.13776 12.0 10 66.58325 -0.7843 -1.83716 \n", + "43199 2020-07-08 02:59:59 397.34222 12.0 10 66.58325 -0.7646 -1.83716 \n", + "43200 2020-07-08 03:00:00 397.49557 12.0 10 66.58325 -0.9083 -1.83716 \n", + "\n", + " C07 C08 C09 ... C71 C72 C73 C74 \\\n", + "0 789.76508 28.03162 688 ... 0.0 1.34293 10.89290 34.88770 \n", + "1 789.13147 28.02301 648 ... 0.0 1.34216 10.80512 34.88770 \n", + "2 785.81653 28.02993 616 ... 0.0 1.34369 10.80029 34.88770 \n", + "3 785.42438 28.02993 584 ... 0.0 1.34445 10.80579 34.88770 \n", + "4 782.99249 28.02990 552 ... 0.0 1.34293 10.81415 34.90295 \n", + "... ... ... ... ... ... ... ... ... \n", + "43196 786.93738 28.03250 0 ... 0.0 1.35971 16.19496 35.22338 \n", + "43197 783.44989 28.02304 0 ... 0.0 1.35971 16.23927 35.23864 \n", + "43198 784.86780 28.02814 0 ... 0.0 1.35818 16.20675 35.23864 \n", + "43199 785.51416 28.02294 0 ... 0.0 1.35818 16.17168 35.25391 \n", + "43200 786.98297 28.02990 0 ... 0.0 1.35895 16.10412 35.22338 \n", + "\n", + " C75 C76 C77 C78 C79 attack \n", + "0 12.26196 380.31683 386.26666 32.59527 5.6330 0 \n", + "1 12.26196 380.02747 386.30286 32.59527 5.4158 0 \n", + "2 12.26196 381.52850 389.73883 32.59527 5.5532 0 \n", + "3 12.26196 382.08911 388.94311 32.59527 5.7833 0 \n", + "4 12.26196 383.44543 389.72082 32.59527 6.0309 0 \n", + "... ... ... ... ... ... ... 
\n", + "43196 12.01019 390.13672 394.91107 31.81634 5.2977 0 \n", + "43197 12.01019 390.24518 397.35248 31.81634 5.3188 0 \n", + "43198 12.01019 390.46222 396.70142 31.81634 5.1800 0 \n", + "43199 12.01019 391.78241 397.73218 31.81634 4.8763 0 \n", + "43200 12.01019 391.31219 397.24390 31.81634 4.5790 0 \n", + "\n", + "[43201 rows x 81 columns]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "VALIDATION_DF_RAW = dataframe_from_csvs(VALIDATION_DATASET)\n", + "VALIDATION_DF_RAW.to_csv('VALIDATION_DF_RAW.csv')\n", + "VALIDATION_DF_RAW" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SqcW-PoT7_o7" + }, + "source": [ + "The validation dataset is normalized using the normal data as the reference.\n", + "We then check whether any values fall below the minimum or above the maximum." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zfv2ftBx7_o7" + }, + "outputs": [], + "source": [ + "VALIDATION_DF = normalize(VALIDATION_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KLvyCZY67_o8", + "outputId": "07089e22-3ab7-4c56-b7a6-86c4e5cb86a5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[0.45158097 0. 0. ... 0.5797802 0.52712005 0.4013713 ]\n", + " [0.45256948 0. 0. ... 0.5799144 0.52712005 0.32340166]\n", + " [0.45207536 0. 0. ... 0.5926521 0.52712005 0.37272498]\n", + " ...\n", + " [0.39772758 0. 0. ... 0.6184636 0.17970447 0.23875508]\n", + " [0.3997038 0. 0. ... 0.62228477 0.17970447 0.129734 ]\n", + " [0.401186 0. 0. ... 0.62047464 0.17970447 0.02301037]]\n" + ] + }, + { + "data": { + "text/plain": [ + "(True, True, False)" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "boundary_check(VALIDATION_DF)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xjZ24UIc7_o8" + }, + "source": [ + "Plotting the data also shows that some intervals fall outside the 0 to 1 range." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "VC9tdNq37_o8", + "outputId": "c566e3c6-0c27-4896-db75-30f99e83fbd9" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "VALIDATION_DF.plot()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "stSsrPvu7_o9", + "outputId": "a4d35d55-eac0-4165-be79-abb883e16443" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "VALIDATION_DF['C75'].plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jaTjGG5l7_o9" + }, + "source": [ + "# Data cleaning\n", + "While experimenting with the model, the final results showed that reconstruction worked well overall,\n", + "but the normal interval at the beginning had a higher Reconstruction_error than the other normal intervals.\n", + "\n", + "On inspection, one variable took values above 1 within that normal interval.\n", + "\n", + "To tune the threshold and check the results more precisely on the validation set, the values of that variable were manually adjusted to fit the normal range.\n", + "\n", + "This is a special case, though: the adjustment was made only because the ground-truth labels are known for the validation set and the values, although labeled normal, stayed at a constant level outside the normal range." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "snikMmBJ7_o-" + }, + "outputs": [], + "source": [ + "# The validation plot shows one variable spiking in the normal interval at the start, so adjust it.\n", + "# Assign through .iloc rather than chained indexing so the change is written to VALIDATION_DF itself.\n", + "VALIDATION_DF.iloc[:2110, VALIDATION_DF.columns.get_loc('C75')] = 0.95" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2i8ccuWR7_o-", + "outputId": "eaaffc1c-912e-410c-c638-fe584a4c348d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(43201, 1, 79)" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "val = np.array(VALIDATION_DF)\n", + "x_val = val.reshape(val.shape[0], 1, val.shape[1])\n", + "x_val.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-wvyAtAB7_o-" + }, + "source": [ + "The model output is three-dimensional, so it has to be converted back to two dimensions before the difference from the reconstructed result can be computed.\n", + "\n", + "A flatten function was implemented for this purpose." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TAywnGrt7_o_" + }, + "outputs": [], + "source": [ + "def flatten(X):\n", + "    # Keep the last time step of each window: (samples, timesteps, features) -> (samples, features)\n", + "    flattened_X = np.empty((X.shape[0], X.shape[2])) # sample x features array.\n", + "    for i in range(X.shape[0]):\n", + "        flattened_X[i] = X[i, (X.shape[1]-1), :]\n", + "    return(flattened_X)\n", + "\n", + "def scale(X, scaler):\n", + "    for i in range(X.shape[0]):\n", + "        X[i, :, :] = scaler.transform(X[i, :, :])\n", + "    \n", + "    return X" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N4K2y-mi7_o_" + }, + "source": [ + "The reconstruction error is computed as the difference between the values reconstructed by the model and the actual values.\n", + "\n", + "For normal data, the model has learned the patterns well and reconstructs them accurately, so the reconstruction error should be small;\n", + "for attacks, the normalized values fall outside the 0 to 1 range, so the reconstruction error should be large." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KAMK305o7_pA", + "outputId": "cbf660ff-0caa-4cab-eda8-552911a88506" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(43201, 1, 79)\n", + "(43201, 79)\n", + "(43201,)\n", + "7.681452989578247\n" + ] + } + ], + "source": [ + "start = time.time()\n", + "valid_x_predictions = model.predict(x_val)\n", + "print(valid_x_predictions.shape)\n", + "\n", + "error = flatten(x_val) - flatten(valid_x_predictions)\n", + "print((flatten(x_val) - flatten(valid_x_predictions)).shape)\n", + "\n", + "valid_mse = np.mean(np.power(flatten(x_val) - flatten(valid_x_predictions), 2), axis=1)\n", + "print(valid_mse.shape)\n", + "print(time.time()-start)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OEDb6tf57_pA" + }, + "source": [ + "# Precision Recall Curve\n", + "\n", + "The threshold was chosen by starting from the point where the Recall and Precision curves cross and then adjusting it in small steps while checking the results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M73o51SE7_pB", + "outputId": "d0b36fd9-a9ed-41f3-e8fa-f04b1f15ce69" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "error_df = pd.DataFrame({'Reconstruction_error': valid_mse, \n", + "                         'True_class':list(VALIDATION_DF_RAW['attack'])})\n", + "precision_rt, recall_rt, threshold_rt = metrics.precision_recall_curve(error_df['True_class'], error_df['Reconstruction_error'])\n", + "\n", + "# precision_rt and recall_rt have one more element than threshold_rt;\n", + "# the final element has no matching threshold, so drop it to align the arrays.\n", + "plt.figure(figsize=(8,5))\n", + "plt.plot(threshold_rt, precision_rt[:-1], label='Precision')\n", + "plt.plot(threshold_rt, recall_rt[:-1], label='Recall')\n", + "plt.xlabel('Threshold'); plt.ylabel('Precision/Recall')\n", + "plt.legend()\n", + "#plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ho-wudP_7_pB", + "outputId": "4bed0605-2f29-41a9-c0ee-abce989e77a1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "precision: 0.9411764705882353 , recall: 0.9411764705882353\n", + "threshold: 0.00036586847637974846\n" + ] + } + ], + "source": [ + "index_cnt = [cnt for cnt, (p, r) in enumerate(zip(precision_rt, recall_rt)) if p==r][0]\n", + "print('precision: ',precision_rt[index_cnt],', recall: ',recall_rt[index_cnt])\n", + "\n", + "# fixed Threshold\n", + "threshold_fixed = threshold_rt[index_cnt]\n", + "print('threshold: ',threshold_fixed)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nWB-Tu-Z7_pC" + }, + "source": [ + "With the data cleaning above applied, both precision and recall came out high." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OXmK5r4N7_pC" + }, + "source": [ + "# Predict Validation data set\n", + "First, let's visualize the results using the threshold obtained above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1bvhKS_m7_pE", + "outputId": "5ed621b2-7ab1-4982-8764-16ef16cea6e4" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0.5, 0, 'Data point index')" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "error_df = pd.DataFrame({'Reconstruction_error': valid_mse ,\n", + "                         'True_class': list(VALIDATION_DF_RAW['attack'])})\n", + "groups = error_df.groupby('True_class')\n", + "fig, ax = plt.subplots(figsize=(20,20))\n", + "\n", + "for name, group in groups:\n", + "    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',\n", + "            label= \"Break\" if name == 1 else \"Normal\")\n", + "    \n", + "ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors=\"r\", zorder=100, label='Threshold')\n", + "ax.legend()\n", + "\n", + "plt.title(\"Reconstruction error for different classes\")\n", + "plt.ylabel(\"Reconstruction error\")\n", + "plt.xlabel(\"Data point index\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E0oMgM5w7_pF" + }, + "source": [ + "The results show that reconstruction works well in the normal intervals, giving small reconstruction errors, while the abnormal intervals produce clearly higher reconstruction errors." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Px76I8Gx7_pF" + }, + "source": [ + "# Moving average" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RCVt5bHd7_pF" + }, + "source": [ + "Next, a moving average was used to improve the results further.\n", + "\n", + "Using the moving average of the reconstruction error separates the normal and abnormal intervals more clearly.\n", + "\n", + "When classifying by the raw threshold alone, it was hard to locate the start and end points of each interval.\n", + "The moving average therefore pulls the normal intervals lower on average and pushes the abnormal intervals higher on average." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZOUaDD-t7_pF", + "outputId": "90f9cc02-eed5-4521-d607-da784aa4d086" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.000000\n", + "1 0.000000\n", + "2 0.000000\n", + "3 0.000000\n", + "4 0.000000\n", + " ... 
\n", + "43196 0.000138\n", + "43197 0.000138\n", + "43198 0.000137\n", + "43199 0.000135\n", + "43200 0.000133\n", + "Name: Reconstruction_error, Length: 43201, dtype: float64" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Moving average over a 50-sample window; the first 49 values are NaN and are filled with 0\n", + "mean_window = error_df['Reconstruction_error'].rolling(50).mean()\n", + "window_error = mean_window.fillna(0)\n", + "window_error" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": false, + "id": "MoO4Jmpv7_pG", + "outputId": "75299769-9d7c-4d41-ccf5-535744102806" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0.5, 0, 'Data point index')" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "window_error_df = pd.DataFrame({'Reconstruction_error': window_error ,\n", + "                                'True_class': list(VALIDATION_DF_RAW['attack'])})\n", + "groups = window_error_df.groupby('True_class')\n", + "fig, ax = plt.subplots(figsize=(20,20))\n", + "\n", + "for name, group in groups:\n", + "    ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',\n", + "            label= \"Break\" if name == 1 else \"Normal\")\n", + "    \n", + "ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors=\"r\", zorder=100, label='Threshold')\n", + "ax.legend()\n", + "\n", + "plt.title(\"Reconstruction error for different classes\")\n", + "plt.ylabel(\"Reconstruction error\")\n", + "plt.xlabel(\"Data point index\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WkXQWXHL7_pH" + }, + "source": [ + "As a result, the attack intervals are detected reliably, even though the detected range extends slightly beyond the actual attack range." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P6VJYZiU7_pH" + }, + "source": [ + "Let's check the results on the validation set using the threshold obtained above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oaTxNvmb7_pH", + "outputId": "d6c450e0-42be-441d-c88c-6fc9f3a28650" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(43201,)" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pred_y = [1 if e > threshold_fixed else 0 for e in window_error_df['Reconstruction_error'].values]\n", + "pred_y = np.array(pred_y)\n", + "pred_y.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p8Of3TVZ7_pH" + }, + "source": [ + "# Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hie7zr-O7_pI", + "outputId": "cd09ff24-80a4-4121-979c-5b251b1672dc" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ATTACK_LABELS = np.array(VALIDATION_DF_RAW[ATTACK_FIELD])\n", + "FINAL_LABELS = np.array(pred_y)\n", + "\n", + "ATTACK_LABELS.shape[0] == FINAL_LABELS.shape[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "Q8XEqngy7_pI", + "outputId": "4a3b6526-058d-4fd2-c148-e2c7414ce63e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "F1: 0.998 (TaP: 0.998, TaR: 0.997)\n", + "# of detected anomalies: 5\n", + "Detected anomalies: ['1', '2', '3', '4', '5']\n" + ] + } + ], + "source": [ + "TaPR = etapr.evaluate(anomalies=ATTACK_LABELS, predictions=FINAL_LABELS)\n", + "print(f\"F1: {TaPR['f1']:.3f} (TaP: {TaPR['TaP']:.3f}, TaR: {TaPR['TaR']:.3f})\")\n", + "print(f\"# of detected anomalies: {len(TaPR['Detected_Anomalies'])}\")\n", + "print(f\"Detected anomalies: {TaPR['Detected_Anomalies']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cc3-ppr37_pJ" + }, + "source": [ + "After adjusting the values of the variable that behaved anomalously and applying the moving average, the validation set reached a high TaPR score of 99.8 (F1 0.998)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NXobwDyy7_pJ" + }, + "source": [ + "# Predict Test data set" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yKIiGu6Z7_pJ" + }, + "source": [ + "The same procedure as for the validation data set was followed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SuLjQjso7_pK", + "outputId": "1bfe326c-8642-46f4-c28c-2772abf22e41" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " time C01 C02 C03 C04 C05 C06 \\\n", + "0 2020-07-09 15:00:00 384.30737 12.0 10 70.35980 -1.6171 -1.79901 \n", + "1 2020-07-09 15:00:01 384.30737 12.0 10 70.35980 -1.7606 -1.79901 \n", + "2 2020-07-09 15:00:02 384.20517 12.0 10 70.35980 -1.7606 -1.80664 \n", + "3 2020-07-09 15:00:03 384.25626 12.0 10 70.35980 -1.7814 -1.79901 \n", + "4 2020-07-09 15:00:04 384.20517 12.0 10 70.35980 -1.7370 -1.79901 \n", + "... ... ... ... ... ... ... ... \n", + "92396 2020-07-31 12:29:56 420.08923 12.0 10 48.31848 -0.8706 97.19238 \n", + "92397 2020-07-31 12:29:57 420.08923 12.0 10 48.31848 -0.7498 97.19238 \n", + "92398 2020-07-31 12:29:58 420.24258 12.0 10 48.31848 -0.6076 97.19238 \n", + "92399 2020-07-31 12:29:59 420.24258 12.0 10 48.31848 -0.4618 97.19238 \n", + "92400 2020-07-31 12:30:00 420.29373 12.0 10 48.31848 -0.3661 97.20001 \n", + "\n", + " C07 C08 C09 ... C70 C71 C72 C73 \\\n", + "0 774.20752 28.02385 136 ... 936.58447 0.0 1.35437 13.97231 \n", + "1 772.58758 28.02730 136 ... 940.93317 0.0 1.35437 13.93358 \n", + "2 772.58758 28.02730 136 ... 936.58447 0.0 1.35513 13.95248 \n", + "3 777.48810 28.02905 136 ... 933.54034 0.0 1.35513 13.89971 \n", + "4 778.42212 28.03169 136 ... 944.41223 0.0 1.35437 13.94603 \n", + "... ... ... ... ... ... ... ... ... \n", + "92396 786.54382 28.03253 232 ... 824.82147 100.0 1.35666 9.62203 \n", + "92397 784.07184 28.03598 224 ... 823.51697 100.0 1.35513 9.48747 \n", + "92398 786.83881 28.02642 208 ... 824.82147 100.0 1.35666 9.57787 \n", + "92399 786.66138 28.03341 200 ... 833.51904 100.0 1.35513 9.56291 \n", + "92400 785.72290 28.03247 192 ... 
827.43085 100.0 1.35590 9.53689 \n", + "\n", + " C74 C75 C76 C77 C78 C79 \n", + "0 35.22338 12.02545 293.51129 283.92651 32.0 6.5059 \n", + "1 35.20813 12.02545 292.67938 283.36591 32.0 6.3079 \n", + "2 35.20813 12.02545 291.90179 282.93189 32.0 6.3079 \n", + "3 35.20813 12.02545 291.59430 282.06378 32.0 6.1203 \n", + "4 35.20813 12.02545 289.87628 283.67334 32.0 5.9543 \n", + "... ... ... ... ... ... ... \n", + "92396 36.47460 11.78894 357.27722 361.14728 32.0 6.2809 \n", + "92397 36.47460 11.78894 357.29529 359.84521 32.0 6.3602 \n", + "92398 36.48986 11.78894 357.27722 360.60474 32.0 6.3742 \n", + "92399 36.48986 11.78894 357.80170 357.42188 32.0 6.2864 \n", + "92400 36.47460 11.78894 357.07825 358.65161 32.0 6.0371 \n", + "\n", + "[358804 rows x 80 columns]" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "TEST_DF_RAW = dataframe_from_csvs(TEST_DATASET)\n", + "TEST_DF_RAW" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": false, + "id": "i6Ijapc_7_pK", + "outputId": "0631110c-7d97-41ac-cd59-1a4475447e6f" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " C01 C02 C03 C04 C05 C06 C07 C08 \\\n", + "0 0.273715 0.0 0.0 0.828325 0.225882 0.000999 0.318761 0.333825 \n", + "1 0.273715 0.0 0.0 0.828325 0.171634 0.000999 0.300187 0.426398 \n", + "2 0.272825 0.0 0.0 0.828325 0.166747 0.000930 0.298514 0.434738 \n", + "3 0.273182 0.0 0.0 0.828325 0.158478 0.000992 0.353980 0.482056 \n", + "4 0.272773 0.0 0.0 0.828325 0.174269 0.000998 0.370123 0.556914 \n", + "... ... ... ... ... ... ... ... ... \n", + "92396 0.619520 0.0 0.0 0.073426 0.532147 0.998079 0.476720 0.581066 \n", + "92397 0.619561 0.0 0.0 0.073426 0.581096 0.998079 0.446527 0.680775 \n", + "92398 0.620899 0.0 0.0 0.073426 0.639210 0.998079 0.474916 0.436791 \n", + "92399 0.621033 0.0 0.0 0.073426 0.699587 0.998079 0.475741 0.598077 \n", + "92400 0.621491 0.0 0.0 0.073426 0.741441 0.998148 0.465171 0.589235 \n", + "\n", + " C09 C10 ... C70 C71 C72 C73 C74 \\\n", + "0 0.010417 0.139867 ... 0.677723 0.0 0.317842 0.452044 0.236483 \n", + "1 0.010417 0.137445 ... 0.680584 0.0 0.317842 0.450701 0.230344 \n", + "2 0.010417 0.137227 ... 0.678007 0.0 0.318290 0.451229 0.229791 \n", + "3 0.010417 0.190063 ... 0.675769 0.0 0.318335 0.449470 0.229736 \n", + "4 0.010417 0.266280 ... 0.682626 0.0 0.317891 0.450884 0.229731 \n", + "... ... ... ... ... ... ... ... ... \n", + "92396 0.017489 0.358745 ... 0.596406 1.0 0.319337 0.285888 0.790598 \n", + "92397 0.016853 0.386044 ... 0.595956 1.0 0.318439 0.281450 0.790543 \n", + "92398 0.015748 0.453718 ... 0.596761 1.0 0.319252 0.284110 0.796619 \n", + "92399 0.015116 0.563381 ... 0.602505 1.0 0.318431 0.283862 0.797227 \n", + "92400 0.014532 0.705708 ... 
0.599115 1.0 0.318803 0.282944 0.791206 \n", + "\n", + " C75 C76 C77 C78 C79 \n", + "0 0.944440 0.225971 0.200389 0.26162 0.714722 \n", + "1 0.944440 0.223121 0.198499 0.26162 0.650106 \n", + "2 0.944440 0.220223 0.196880 0.26162 0.644285 \n", + "3 0.944440 0.218893 0.193823 0.26162 0.583093 \n", + "4 0.944440 0.212932 0.198888 0.26162 0.523348 \n", + "... ... ... ... ... ... \n", + "92396 0.083339 0.466152 0.485809 0.26162 0.622130 \n", + "92397 0.083339 0.466370 0.482229 0.26162 0.658390 \n", + "92398 0.083339 0.466330 0.484406 0.26162 0.666539 \n", + "92399 0.083339 0.468106 0.474004 0.26162 0.638988 \n", + "92400 0.083339 0.465829 0.477066 0.26162 0.555689 \n", + "\n", + "[358804 rows x 79 columns]" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "TEST_DF = normalize(TEST_DF_RAW[VALID_COLUMNS_IN_TRAIN_DATASET]).ewm(alpha=0.9).mean()\n", + "TEST_DF" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PsuVGylM7_pK" + }, + "source": [ + "Unlike the validation dataset, the test dataset showed no unusual case, so the data were only normalized (plus the `ewm(alpha=0.9)` smoothing in the cell above) and then used as is." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "DQhE5-W07_pL", + "outputId": "fcab3236-6f1c-4b08-ca62-44a78610af99" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "TEST_DF.plot()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xs3N-bBv7_pL", + "outputId": "4f58dc7e-182c-4bbd-ceda-c252a3bf338d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[0.27371535 0. 0. ... 0.20038876 0.26161984 0.7147216 ]\n", + " [0.27371535 0. 0. ... 0.19849946 0.26161984 0.6501059 ]\n", + " [0.27282545 0. 0. ... 0.19687971 0.26161984 0.64428467]\n", + " ...\n", + " [0.6208987 0. 0. ... 0.48440555 0.26161984 0.666539 ]\n", + " [0.62103254 0. 0. ... 0.4740037 0.26161984 0.6389876 ]\n", + " [0.62149084 0. 0. ... 0.47706646 0.26161984 0.555689 ]]\n" + ] + }, + { + "data": { + "text/plain": [ + "(True, True, False)" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "boundary_check(TEST_DF)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "l0urBWpl7_pM", + "outputId": "1313f6a8-b86b-42e4-b7d8-c123b1b5e3d0" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(358804, 1, 79)" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test = np.array(TEST_DF)\n", + "x_test = test.reshape(test.shape[0], 1, test.shape[1])\n", + "x_test.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d14bj9Wq7_pM", + "outputId": "304bb27e-c64d-45b9-cec4-28953f35a228" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(358804, 1, 79)\n", + "(358804,)\n", + "57.45735478401184\n" + ] + } + ], + "source": [ + "start = time.time()\n", + "test_x_predictions = model.predict(x_test)\n", + "#print(test_x_predictions)\n", + "print(test_x_predictions.shape)\n", + "#print((flatten(x_test) - flatten(test_x_predictions)).shape)\n", + "test_mse = 
np.mean(np.power(flatten(x_test) - flatten(test_x_predictions), 2), axis=1)\n", + "print(test_mse.shape)\n", + "print(time.time()-start)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fUtdlMMJ7_pM" + }, + "outputs": [], + "source": [ + "test_error = pd.DataFrame({'Reconstruction_error': test_mse})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NrEsuo-y7_pN" + }, + "source": [ + "Because the test set has no labels, \n", + "I tuned the moving-average window and the threshold in small steps, submitting each time and adjusting based on the resulting score." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bXotIxtm7_pN", + "outputId": "06a1b9a8-7a8a-4202-f44a-31b9b5b23186" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.000000\n", + "1 0.000000\n", + "2 0.000000\n", + "3 0.000000\n", + "4 0.000000\n", + " ... \n", + "358799 0.000335\n", + "358800 0.000333\n", + "358801 0.000332\n", + "358802 0.000330\n", + "358803 0.000328\n", + "Name: Reconstruction_error, Length: 358804, dtype: float64" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_move = test_error['Reconstruction_error'].rolling(71).mean()\n", + "\n", + "test_d = test_move.fillna(0)\n", + "test_d" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DxWXlOPz7_pN" + }, + "outputs": [], + "source": [ + "movemean_test = pd.DataFrame({'Reconstruction_error': test_d})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cNlOInkR7_pO", + "outputId": "1833aa61-ab6f-4a1c-f889-8e90ed2d698d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(358804,)" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pred_y_test = [1 if e > 0.000425 else 0 for e in movemean_test['Reconstruction_error'].values]\n", + "pred_y_test = np.array(pred_y_test)\n", + "pred_y_test.shape" + ] + }, + { +
"cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ByvHHUsm7_pO" + }, + "outputs": [], + "source": [ + "submission = pd.read_csv('D:\\\\python_project\\\\hyunmin_project\\\\ot보안\\\\data\\\\HAI 2.0\\\\sample_submission.csv')\n", + "submission.index = submission['time']\n", + "submission['attack'] = pred_y_test" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "dwtdBTVB7_pO", + "outputId": "8b8ae166-e3e0-41a7-9021-18b9070176aa" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 349649\n", + "1 9155\n", + "Name: attack, dtype: int64" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "submission['attack'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "id": "Dd8bcnJP7_pP", + "outputId": "fcccbfe2-7141-4778-95b2-b8ce63a44494" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0.5, 0, 'Data point index')" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "test_error_df = pd.DataFrame({'Reconstruction_error': test_d,\n", + " 'True_class': list(submission['attack'])})\n", + "groups = test_error_df.groupby('True_class')\n", + "fig, ax = plt.subplots(figsize=(20,20))\n", + "\n", + "for name, group in groups:\n", + " ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',\n", + " label= \"Break\" if name == 1 else \"Normal\")\n", + " \n", + "ax.hlines(0.000425, ax.get_xlim()[0], ax.get_xlim()[1], colors=\"r\", zorder=100, label='Threshold')\n", + "ax.legend()\n", + "\n", + "plt.title(\"Reconstruction error for different classes\")\n", + "plt.ylabel(\"Reconstruction error\")\n", + "plt.xlabel(\"Data point index\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3mATnS167_pP" + }, + "source": [ + "I then inspected the final predicted labels." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3A4ebO2a7_pP" + }, + "source": [ + "Finally, save the results in the submission format."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b8eabi897_pQ" + }, + "outputs": [], + "source": [ + "#submission.to_csv('predict.csv', index=False)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.10" + }, + "colab": { + "name": "week18_김희숙_예습과제.ipynb", + "provenance": [] + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file From e6bd33e1b9c5e08dcc20b4faf3c8437847a14f0e Mon Sep 17 00:00:00 2001 From: kimsook <40443049+kimsook@users.noreply.github.com> Date: Thu, 14 Jul 2022 23:40:26 +0900 Subject: [PATCH 3/3] =?UTF-8?q?19=EC=A3=BC=EC=B0=A8=20=EC=98=88=EC=8A=B5?= =?UTF-8?q?=EA=B3=BC=EC=A0=9C?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...354\212\265\352\263\274\354\240\234.ipynb" | 2059 +++++++++++++++++ 1 file changed, 2059 insertions(+) create mode 100644 "week19_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" diff --git "a/week19_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" "b/week19_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" new file mode 100644 index 0000000..7e76aa3 --- /dev/null +++ "b/week19_\352\271\200\355\235\254\354\210\231_\354\230\210\354\212\265\352\263\274\354\240\234.ipynb" @@ -0,0 +1,2059 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "VB2w6aqFG1Mh" + }, + "source": [ + "# **1. 
Loading Data and Libraries**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VYP2ypq6G1Mr" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import re\n", + "import json\n", + "import os\n", + "import tqdm\n", + "\n", + "from konlpy.tag import Okt\n", + "\n", + "import sklearn\n", + "from sklearn.preprocessing import LabelEncoder\n", + "\n", + "from sklearn.metrics import log_loss, accuracy_score,f1_score\n", + "import tensorflow as tf\n", + "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", + "from tensorflow.keras.preprocessing.text import Tokenizer\n", + "from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint\n", + "from transformers import *" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NuRT-ijiG1Mv" + }, + "outputs": [], + "source": [ + "train=pd.read_csv('train.csv')\n", + "test=pd.read_csv('test.csv')\n", + "sample_submission=pd.read_csv('sample_submission.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oD1EcmQOG1Mw" + }, + "source": [ + "# **2. Exploratory Data Analysis (EDA)**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sNO4CkCWG1Mx", + "outputId": "c0ed4d11-f90f-4ec2-e6ad-5e6536581749" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " index 제출년도 사업명 사업_부처명 계속과제여부 내역사업명 \\\n", + "0 0 2016 농업기초기반연구 농촌진흥청 신규 농산물안전성연구 \n", + "1 1 2019 이공학학술연구기반구축(R&D) 교육부 신규 지역대학우수과학자지원사업(1년~5년) \n", + "\n", + " 과제명 \\\n", + "0 유전정보를 활용한 새로운 해충 분류군 동정기술 개발 \n", + "1 대장암의 TRAIL 내성 표적 인자 발굴 및 TRAIL 반응 예측 유전자 지도 구축... \n", + "\n", + " 요약문_연구목표 \\\n", + "0 ○ 새로운 해충분류군의 동정기술 개발 및 유입확산 추적 \n", + "1 최종목표: TRAIL 감수성 표적 유전자를 발굴하고 내성제어 기전을 연구. 발굴된... \n", + "\n", + " 요약문_연구내용 \\\n", + "0 (가) 외래 및 돌발해충의 발생조사 및 종 동정\\n\\n\\n ○ 대상해충 : 최... \n", + "1 1차년도\\n1) Microarray를 통한 선천적 TRAIL 내성 표적 후보 유전자... \n", + "\n", + " 요약문_기대효과 \\\n", + "0 ○ 새로운 돌발 및 외래해충의 신속, 정확한 동정법 향상\\n\\n\\n○ 돌발 및 외래... \n", + "1 1) TRAIL 내성 특이적 표적분자를 발굴하고, 이를 이용한 TRAIL 효과 증진... \n", + "\n", + " 요약문_한글키워드 \\\n", + "0 뉴클레오티드 염기서열, 분자마커, 종 동정, 침샘, 전사체 \n", + "1 대장암,항암제 내성,세포사멸,유전자발굴 \n", + "\n", + " 요약문_영문키워드 label \n", + "0 nucleotide sequence, molecular marker, species... 24 \n", + "1 TRAIL,Colorectal cancer,TRAIL resistance,Apopt... 0 " + ] + }, + "execution_count": 109, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PPTlScQcG1Mz", + "outputId": "9d3fd575-2b8c-4cef-9168-ba492880a69c" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " index 제출년도 사업명 사업_부처명 계속과제여부 내역사업명 \\\n", + "0 174304 2016 경제협력권산업육성 산업통상자원부 신규 자동차융합부품 \n", + "1 174305 2018 개인기초연구(과기정통부)(R&D) 과학기술정보통신부 계속 신진연구(총연구비5천이상~1.5억이하) \n", + "\n", + " 과제명 \\\n", + "0 R-FSSW 기술 적용 경량 차체 부품 개발 및 품질 평가를 위한 64채널 C-SC... \n", + "1 다입자계를 묘사하는 편미분방정식에 대한 연구 \n", + "\n", + " 요약문_연구목표 \\\n", + "0 ○ 차체 점용접부의 품질 검사를 위한 64채널 무선 기반 C-Scan 탐촉자 개발\\... \n", + "1 자연계에는 입자의 개수가 아주 큰 다양한 다입자계가 존재한다. 이런 다입자계의 효... \n", + "\n", + " 요약문_연구내용 \\\n", + "0 ○ 1차년도\\n\\n . 개발 탐촉 시스템의 성능 평가 위한 표준 시편 제작 시... \n", + "1 연구과제1. 무한입자계의 동역학 / 작용소(operator) 방정식에 대한 연구\\n... \n", + "\n", + " 요약문_기대효과 \\\n", + "0 ○ 기술적 파급효과\\n\\n - 본 연구에서 개발된 R-FSSW 접합 기술은 기존 ... \n", + "1 본 연구는 물리학에서 중요한 대상인 다입자계를 묘사하는 모델방정식의 정당성을 보장하... \n", + "\n", + " 요약문_한글키워드 \\\n", + "0 마찰교반점용접, 비파괴 검사, 초음파 탐상, 씨 스캔, 용접 품질 평가 \n", + "1 다체계 방정식,동역학의 안정성,양자역학,고전역학,평균장 극한,고전극한,비상대론적 극한 \n", + "\n", + " 요약문_영문키워드 \n", + "0 Friction Stir Spot Welding, Non-destructive ev... \n", + "1 many particle system,stability of dynamics,qua... " + ] + }, + "execution_count": 110, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "A6YGuk5WG1M1", + "outputId": "1c182dc8-faf6-415f-e0b0-84b5ad12a9c7" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " index label\n", + "0 174304 0\n", + "1 174305 0\n", + "2 174306 0\n", + "3 174307 0\n", + "4 174308 0\n", + "5 174309 0" + ] + }, + "execution_count": 117, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_submission.head(6)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u5X2HhFFG1M2", + "outputId": "5d907a78-c14e-4764-c746-940e1e6b6d12" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['index', '제출년도', '사업명', '사업_부처명', '계속과제여부', '내역사업명', '과제명', '요약문_연구목표',\n", + " '요약문_연구내용', '요약문_기대효과', '요약문_한글키워드', '요약문_영문키워드', 'label'],\n", + " dtype='object')" + ] + }, + "execution_count": 112, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "La4ho1nhG1M3", + "outputId": "ff5fffcd-8422-431c-9d4b-2f804dc56c05" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['index', '제출년도', '사업명', '사업_부처명', '계속과제여부', '내역사업명', '과제명', '요약문_연구목표',\n", + " '요약문_연구내용', '요약문_기대효과', '요약문_한글키워드', '요약문_영문키워드'],\n", + " dtype='object')" + ] + }, + "execution_count": 113, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MIawLOeiG1M5", + "outputId": "f6df0485-dfb0-4a4c-93ac-9844d64b67f3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(174304, 13)\n", + "(43576, 12)\n", + "(43576, 2)\n" + ] + } + ], + "source": [ + "# Inspect the shape of each dataset\n", + "print(train.shape)\n", + "print(test.shape)\n", + "print(sample_submission.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZVZ4A5L4G1M7", + "outputId": "a6b7b0b3-30b4-4cd6-8d07-63451d451ebc" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.817945\n", + "1 0.007234\n", + "2 0.001578\n", +
"3 0.000820\n", + "4 0.000327\n", + "5 0.009742\n", + "6 0.000447\n", + "7 0.000648\n", + "8 0.001945\n", + "9 0.000608\n", + "10 0.003775\n", + "11 0.001147\n", + "12 0.001538\n", + "13 0.003299\n", + "14 0.009592\n", + "15 0.000947\n", + "16 0.002903\n", + "17 0.000884\n", + "18 0.008893\n", + "19 0.028330\n", + "20 0.006076\n", + "21 0.002846\n", + "22 0.000849\n", + "23 0.010556\n", + "24 0.020195\n", + "25 0.004647\n", + "26 0.001813\n", + "27 0.003557\n", + "28 0.002576\n", + "29 0.005898\n", + "30 0.001342\n", + "31 0.005290\n", + "32 0.001492\n", + "33 0.003058\n", + "34 0.003001\n", + "35 0.001669\n", + "36 0.006081\n", + "37 0.001526\n", + "38 0.001503\n", + "39 0.001159\n", + "40 0.002530\n", + "41 0.000384\n", + "42 0.000293\n", + "43 0.002014\n", + "44 0.000522\n", + "45 0.006523\n", + "Name: label, dtype: float64" + ] + }, + "execution_count": 115, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The label distribution is severely imbalanced.\n", + "train.label.value_counts(sort=False)/len(train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eqYZQAZnG1M7", + "outputId": "3e0ffe9a-356d-45e4-ae8e-f3b7c78cd1ad" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "과제명 길이 최댓값: 229\n", + "과제명 길이 최솟값: 2\n", + "과제명 길이 평균값: 35.84252225995961\n", + "과제명 길이 중간값: 34.0\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "length=train['과제명'].astype(str).apply(len)\n", + "plt.hist(length, bins=50, alpha=0.5, color='r', label='word')\n", + "plt.title('histogram of length of task_name')\n", + "plt.figure(figsize=(12, 5))\n", + "plt.boxplot(length, labels=['counts'], showmeans=True)\n", + "print('과제명 길이 최댓값: {}'.format(np.max(length)))\n", + "print('과제명 길이 최솟값: {}'.format(np.min(length)))\n", + "print('과제명 길이 평균값: {}'.format(np.mean(length)))\n", + "print('과제명 길이 중간값: {}'.format(np.median(length)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Pv-MWn4vG1M8", + "outputId": "52d4b890-f122-4be6-a8f9-cd84cb248bd1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "요약문_연구목표 길이 최댓값: 3951\n", + "요약문_연구목표 길이 최솟값: 1\n", + "요약문_연구목표 길이 평균값: 318.1008066366807\n", + "요약문_연구목표 길이 중간값: 249.0\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "length=train['요약문_연구목표'].astype(str).apply(len)\n", + "plt.hist(length, bins=50, alpha=0.5, color='r', label='word')\n", + "plt.title('histogram of length of summary_object')\n", + "plt.figure(figsize=(12, 5))\n", + "plt.boxplot(length, labels=['counts'], showmeans=True)\n", + "print('요약문_연구목표 길이 최댓값: {}'.format(np.max(length)))\n", + "print('요약문_연구목표 길이 최솟값: {}'.format(np.min(length)))\n", + "print('요약문_연구목표 길이 평균값: {}'.format(np.mean(length)))\n", + "print('요약문_연구목표 길이 중간값: {}'.format(np.median(length)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "o1kWtNfSG1M9", + "outputId": "0d753fdd-38a6-48bd-bfe9-1f38d1bc7db3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "요약문_연구내용 길이 최댓값: 3999\n", + "요약문_연구내용 길이 최솟값: 1\n", + "요약문_연구내용 길이 평균값: 699.2930282724435\n", + "요약문_연구내용 길이 중간값: 597.0\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "length=train['요약문_연구내용'].astype(str).apply(len)\n", + "plt.hist(length, bins=50, alpha=0.5, color='r', label='word')\n", + "plt.title('histogram of length of summary_content')\n", + "plt.figure(figsize=(12, 5))\n", + "plt.boxplot(length, labels=['counts'], showmeans=True)\n", + "print('요약문_연구내용 길이 최댓값: {}'.format(np.max(length)))\n", + "print('요약문_연구내용 길이 최솟값: {}'.format(np.min(length)))\n", + "print('요약문_연구내용 길이 평균값: {}'.format(np.mean(length)))\n", + "print('요약문_연구내용 길이 중간값: {}'.format(np.median(length)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PcOSA2UrG1M-", + "outputId": "facd11fa-2a9e-45d4-8c0e-a5169bd3aee1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "요약문_기대효과 길이 최댓값: 3649\n", + "요약문_기대효과 길이 최솟값: 1\n", + "요약문_기대효과 길이 평균값: 400.4864374885258\n", + "요약문_기대효과 길이 중간값: 329.0\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "length=train['요약문_기대효과'].astype(str).apply(len)\n", + "plt.hist(length, bins=50, alpha=0.5, color='r', label='word')\n", + "plt.title('histogram of length of summary_effect')\n", + "plt.figure(figsize=(12, 5))\n", + "plt.boxplot(length, labels=['counts'], showmeans=True)\n", + "print('요약문_기대효과 길이 최댓값: {}'.format(np.max(length)))\n", + "print('요약문_기대효과 길이 최솟값: {}'.format(np.min(length)))\n", + "print('요약문_기대효과 길이 평균값: {}'.format(np.mean(length)))\n", + "print('요약문_기대효과 길이 중간값: {}'.format(np.median(length)))" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# **LSTM**" + ], + "metadata": { + "id": "ucpfx1v_HSjB" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2IrIyNptG1M-" + }, + "source": [ + "### 데이터 전처리" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z6iLCH-AG1M_" + }, + "outputs": [], + "source": [ + "#해당 baseline 에서는 과제명 columns만 활용했습니다.\n", + "#다채로운 변수 활용법으로 성능을 높여주세요!\n", + "train=train[['과제명','label']]\n", + "test=test[['과제명']]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4Npx79nOG1M_", + "outputId": "37cd562c-68e6-44ae-9e90-022088e77278" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 과제명 label\n", + "0 유전정보를 활용한 새로운 해충 분류군 동정기술 개발 24\n", + "1 대장암의 TRAIL 내성 표적 인자 발굴 및 TRAIL 반응 예측 유전자 지도 구축... 0" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EQ2nhDZdG1NA", + "outputId": "964c1a54-38f8-4f64-c421-8e83b1f996f1" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 과제명\n", + "0 R-FSSW 기술 적용 경량 차체 부품 개발 및 품질 평가를 위한 64채널 C-SC...\n", + "1 다입자계를 묘사하는 편미분방정식에 대한 연구" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oQ5sdNpQG1NB" + }, + "outputs": [], + "source": [ + "#1. re.sub: remove every non-Hangul character (the pattern has no whitespace class, so spaces are removed too)\n", + "#2. Split the text into morphemes with the okt object\n", + "#3. Remove stopwords when remove_stopwords is True\n", + "def preprocessing(text, okt, remove_stopwords=False, stop_words=[]):\n", + "    text=re.sub(\"[^가-힣ㄱ-ㅎㅏ-ㅣ]\",\"\", text)\n", + "    word_text=okt.morphs(text, stem=True)\n", + "    if remove_stopwords:\n", + "        word_text=[token for token in word_text if not token in stop_words]\n", + "    return word_text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5tCWzJBuG1NB" + }, + "outputs": [], + "source": [ + "stop_words=['은','는','이','가', '하','아','것','들','의','있','되','수','보','주','등','한']\n", + "okt=Okt()\n", + "clean_train_text=[]\n", + "clean_test_text=[]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cFLVPiRsG1NC", + "outputId": "cc0b1362-5efa-4301-eddb-92223f4e5f18" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 174304/174304 [57:37<00:00, 50.41it/s] \n" + ] + } + ], + "source": [ + "#This step takes a long time.\n", + "for text in tqdm.tqdm(train['과제명']):\n", + "    try:\n", + "        clean_train_text.append(preprocessing(text, okt, remove_stopwords=True, stop_words=stop_words))\n", + "    except Exception:\n", + "        #non-string titles (e.g. NaN) become an empty token list\n", + "        clean_train_text.append([])\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w_yEivdjG1ND", + "outputId": "ea9f9eac-055e-45b9-ac4a-b6ab8515e43e" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 43576/43576 [15:29<00:00, 46.89it/s]\n" + ] + } + ], + "source": [ + "for text 
in tqdm.tqdm(test['과제명']):\n", + "    if type(text) == str:\n", + "        clean_test_text.append(preprocessing(text, okt, remove_stopwords=True, stop_words=stop_words))\n", + "    else:\n", + "        clean_test_text.append([])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "84acpVlTG1NE", + "outputId": "cc1de942-2469-40bf-f0c6-67d027560c0b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "174304\n", + "43576\n" + ] + } + ], + "source": [ + "print(len(clean_train_text))\n", + "print(len(clean_test_text))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NE1hBG3cG1NF" + }, + "outputs": [], + "source": [ + "#Build a tokenizer with the TensorFlow preprocessing module, then convert the texts to index vectors\n", + "tokenizer=Tokenizer()\n", + "tokenizer.fit_on_texts(clean_train_text)\n", + "\n", + "train_sequences=tokenizer.texts_to_sequences(clean_train_text)\n", + "test_sequences=tokenizer.texts_to_sequences(clean_test_text)\n", + "word_vocab=tokenizer.word_index\n", + "\n", + "#Padding\n", + "train_inputs=pad_sequences(train_sequences, maxlen=40, padding='post')\n", + "test_inputs=pad_sequences(test_sequences, maxlen=40, padding='post')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kYOnEW9-G1NF", + "outputId": "30b67abc-6d1d-4c63-dde0-a4743648af9f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(174304, 40)\n", + "(43576, 40)\n" + ] + } + ], + "source": [ + "print(train_inputs.shape)\n", + "print(test_inputs.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bhBj7XNUG1NF", + "outputId": "0eb18922-9be8-4149-8772-d60db2dd5ada" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "46" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "labels=np.array(train['label'])\n", + "len(set(labels))" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "id": "3urpeS40G1NG" + }, + "outputs": [], + "source": [ + "#Save as .npy so the arrays can be reused later\n", + "DATA_IN_PATH='./data_in/'\n", + "TRAIN_INPUT_DATA = 'train_input.npy'\n", + "TEST_INPUT_DATA = 'test_input.npy'\n", + "\n", + "import os\n", + "if not os.path.exists(DATA_IN_PATH):\n", + "    os.makedirs(DATA_IN_PATH)\n", + "\n", + "np.save(DATA_IN_PATH+TRAIN_INPUT_DATA, train_inputs)\n", + "np.save(DATA_IN_PATH+TEST_INPUT_DATA, test_inputs)\n", + "\n", + "data_configs={}\n", + "data_configs['vocab']=word_vocab\n", + "data_configs['vocab_size'] = len(word_vocab)+1\n", + "with open(DATA_IN_PATH+'data_configs.json', 'w', encoding='utf-8') as f:\n", + "    json.dump(data_configs, f, ensure_ascii=False)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Modeling" + ], + "metadata": { + "id": "utV-fRXBHpaA" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "K2DAGLmCG1NH" + }, + "outputs": [], + "source": [ + "#Parameter settings\n", + "vocab_size =data_configs['vocab_size']\n", + "embedding_dim = 32\n", + "max_length = 40\n", + "oov_tok = \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xQlBPyhTG1NI", + "outputId": "f5c87b32-bbe5-41a4-a949-dffc68f250e7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"sequential_1\"\n", + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "embedding_1 (Embedding) (None, 40, 32) 973376 \n", + "_________________________________________________________________\n", + "global_average_pooling1d_1 ( (None, 32) 0 \n", + "_________________________________________________________________\n", + "dense_2 (Dense) (None, 128) 4224 \n", + "_________________________________________________________________\n", + "dense_3 (Dense) (None, 46) 5934 \n", + 
"=================================================================\n", + "Total params: 983,534\n", + "Trainable params: 983,534\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n", + "None\n" + ] + } + ], + "source": [ + "#가벼운 NLP모델 생성\n", + "model = tf.keras.Sequential([\n", + " tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n", + " tf.keras.layers.GlobalAveragePooling1D(),\n", + " tf.keras.layers.Dense(128, activation='relu'),\n", + " tf.keras.layers.Dense(46, activation='softmax')\n", + "])\n", + "\n", + "# compile model\n", + "model.compile(loss='sparse_categorical_crossentropy',\n", + " optimizer='adam',\n", + " metrics=['accuracy'])\n", + "\n", + "# model summary\n", + "print(model.summary())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sWMYG1l2G1NJ", + "outputId": "d7b06c79-2ad2-4f76-819a-ef2db4b8cefa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 1/30\n", + "4358/4358 - 28s - loss: 0.9296 - accuracy: 0.8202 - val_loss: 0.8065 - val_accuracy: 0.8248\n", + "Epoch 2/30\n", + "4358/4358 - 24s - loss: 0.6618 - accuracy: 0.8400 - val_loss: 0.6244 - val_accuracy: 0.8463\n", + "Epoch 3/30\n", + "4358/4358 - 25s - loss: 0.5196 - accuracy: 0.8643 - val_loss: 0.5572 - val_accuracy: 0.8582\n", + "Epoch 4/30\n", + "4358/4358 - 24s - loss: 0.4356 - accuracy: 0.8810 - val_loss: 0.5206 - val_accuracy: 0.8663\n", + "Epoch 5/30\n", + "4358/4358 - 25s - loss: 0.3740 - accuracy: 0.8946 - val_loss: 0.4925 - val_accuracy: 0.8742\n", + "Epoch 6/30\n", + "4358/4358 - 25s - loss: 0.3273 - accuracy: 0.9061 - val_loss: 0.4862 - val_accuracy: 0.8788\n", + "Epoch 7/30\n", + "4358/4358 - 24s - loss: 0.2925 - accuracy: 0.9148 - val_loss: 0.4805 - val_accuracy: 0.8818\n", + "Epoch 8/30\n", + "4358/4358 - 25s - loss: 0.2648 - accuracy: 0.9218 - val_loss: 0.4684 - val_accuracy: 0.8866\n", + "Epoch 9/30\n", + 
"4358/4358 - 24s - loss: 0.2407 - accuracy: 0.9283 - val_loss: 0.4734 - val_accuracy: 0.8909\n", + "Epoch 10/30\n", + "4358/4358 - 25s - loss: 0.2210 - accuracy: 0.9341 - val_loss: 0.4924 - val_accuracy: 0.8855\n", + "Epoch 11/30\n", + "4358/4358 - 24s - loss: 0.2043 - accuracy: 0.9384 - val_loss: 0.4894 - val_accuracy: 0.8921\n", + "Epoch 12/30\n", + "4358/4358 - 24s - loss: 0.1898 - accuracy: 0.9424 - val_loss: 0.4938 - val_accuracy: 0.8921\n", + "Epoch 13/30\n", + "4358/4358 - 24s - loss: 0.1765 - accuracy: 0.9464 - val_loss: 0.5126 - val_accuracy: 0.8935\n", + "Epoch 14/30\n", + "4358/4358 - 25s - loss: 0.1652 - accuracy: 0.9491 - val_loss: 0.5247 - val_accuracy: 0.8926\n", + "Epoch 15/30\n", + "4358/4358 - 25s - loss: 0.1552 - accuracy: 0.9523 - val_loss: 0.5314 - val_accuracy: 0.8932\n", + "Epoch 16/30\n", + "4358/4358 - 25s - loss: 0.1459 - accuracy: 0.9552 - val_loss: 0.5397 - val_accuracy: 0.8960\n", + "Epoch 17/30\n", + "4358/4358 - 24s - loss: 0.1376 - accuracy: 0.9573 - val_loss: 0.5556 - val_accuracy: 0.8939\n", + "Epoch 18/30\n", + "4358/4358 - 24s - loss: 0.1295 - accuracy: 0.9592 - val_loss: 0.5693 - val_accuracy: 0.8932\n", + "Epoch 19/30\n", + "4358/4358 - 24s - loss: 0.1232 - accuracy: 0.9612 - val_loss: 0.5777 - val_accuracy: 0.8952\n", + "Epoch 20/30\n", + "4358/4358 - 25s - loss: 0.1162 - accuracy: 0.9629 - val_loss: 0.5951 - val_accuracy: 0.8956\n", + "Epoch 21/30\n", + "4358/4358 - 24s - loss: 0.1105 - accuracy: 0.9647 - val_loss: 0.6128 - val_accuracy: 0.8970\n", + "Epoch 22/30\n", + "4358/4358 - 25s - loss: 0.1052 - accuracy: 0.9666 - val_loss: 0.6324 - val_accuracy: 0.8956\n", + "Epoch 23/30\n", + "4358/4358 - 24s - loss: 0.1002 - accuracy: 0.9681 - val_loss: 0.6410 - val_accuracy: 0.8933\n", + "Epoch 24/30\n", + "4358/4358 - 25s - loss: 0.0957 - accuracy: 0.9698 - val_loss: 0.6635 - val_accuracy: 0.8958\n", + "Epoch 25/30\n", + "4358/4358 - 24s - loss: 0.0914 - accuracy: 0.9705 - val_loss: 0.6841 - val_accuracy: 0.8991\n", + "Epoch 
26/30\n", + "4358/4358 - 24s - loss: 0.0874 - accuracy: 0.9719 - val_loss: 0.6902 - val_accuracy: 0.8952\n", + "Epoch 27/30\n", + "4358/4358 - 24s - loss: 0.0838 - accuracy: 0.9732 - val_loss: 0.7118 - val_accuracy: 0.8938\n", + "Epoch 28/30\n", + "4358/4358 - 24s - loss: 0.0805 - accuracy: 0.9742 - val_loss: 0.7211 - val_accuracy: 0.8943\n", + "Epoch 29/30\n", + "4358/4358 - 24s - loss: 0.0771 - accuracy: 0.9748 - val_loss: 0.7344 - val_accuracy: 0.8949\n", + "Epoch 30/30\n", + "4358/4358 - 24s - loss: 0.0739 - accuracy: 0.9759 - val_loss: 0.7569 - val_accuracy: 0.8958\n" + ] + } + ], + "source": [ + "# fit model\n", + "num_epochs = 30\n", + "history = model.fit(train_inputs, labels, \n", + " epochs=num_epochs, verbose=2, \n", + " validation_split=0.2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ghEALrTSG1NM" + }, + "outputs": [], + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# **Random Forest**" + ], + "metadata": { + "id": "XOonBmHOHVHn" + } + }, + { + "cell_type": "markdown", + "source": [ + "### 데이터 전처리" + ], + "metadata": { + "id": "Vi0yyjMWHX3u" + } + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "5mCIk7axH6Yj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "A8alJB_FIGTr" + }, + "outputs": [], + "source": [ + "#해당 baseline 에서는 과제명 columns만 활용했습니다.\n", + "#다채로운 변수 활용법으로 성능을 높여주세요!\n", + "train=train[['과제명','label']]\n", + "test=test[['과제명']]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gW0S2UvlIGTs", + "outputId": "064133e1-7d09-4d42-f46c-6ff8f0c62771" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 과제명 label\n", + "0 유전정보를 활용한 새로운 해충 분류군 동정기술 개발 24\n", + "1 대장암의 TRAIL 내성 표적 인자 발굴 및 TRAIL 반응 예측 유전자 지도 구축... 0" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CBnQx5cxIGTs", + "outputId": "486c9a8f-cba5-42e0-a9c6-50f27152dd1f" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 과제명\n", + "0 R-FSSW 기술 적용 경량 차체 부품 개발 및 품질 평가를 위한 64채널 C-SC...\n", + "1 다입자계를 묘사하는 편미분방정식에 대한 연구" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RWSlxo0JIGTt" + }, + "outputs": [], + "source": [ + "#1. re.sub: remove every non-Hangul character (the pattern has no whitespace class, so spaces are removed too)\n", + "#2. Split the text into morphemes with the okt object\n", + "#3. Remove stopwords when remove_stopwords is True\n", + "def preprocessing(text, okt, remove_stopwords=False, stop_words=[]):\n", + "    text=re.sub(\"[^가-힣ㄱ-ㅎㅏ-ㅣ]\",\"\", text)\n", + "    word_text=okt.morphs(text, stem=True)\n", + "    if remove_stopwords:\n", + "        word_text=[token for token in word_text if not token in stop_words]\n", + "    return word_text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9cDCnAoHIGTt" + }, + "outputs": [], + "source": [ + "stop_words=['은','는','이','가', '하','아','것','들','의','있','되','수','보','주','등','한']\n", + "okt=Okt()\n", + "clean_train_text=[]\n", + "clean_test_text=[]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w6U0yg-DIGTt", + "outputId": "1bb37197-d9ea-414c-ef12-6912c4f265b3" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 174304/174304 [44:25<00:00, 65.39it/s]\n" + ] + } + ], + "source": [ + "#This step takes a long time.\n", + "for text in tqdm.tqdm(train['과제명']):\n", + "    try:\n", + "        clean_train_text.append(preprocessing(text, okt, remove_stopwords=True, stop_words=stop_words))\n", + "    except Exception:\n", + "        #non-string titles (e.g. NaN) become an empty token list\n", + "        clean_train_text.append([])\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "S4lSvNu8IGTu", + "outputId": "34f1f138-d366-4f18-90c8-39daa298b881" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 43576/43576 [12:25<00:00, 58.44it/s]\n" + ] + } + ], + "source": [ + "for text in 
tqdm.tqdm(test['과제명']):\n", + " if type(text) == str:\n", + " clean_test_text.append(preprocessing(text, okt, remove_stopwords=True, stop_words=stop_words))\n", + " else:\n", + " clean_test_text.append([])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3hPs3U5fIGTu", + "outputId": "7eb3caea-c7e3-4e71-9500-5770aa45bf18" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "174304" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(clean_train_text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SIdIkfDUIGTv", + "outputId": "f1fdff7e-80d4-448f-b029-7a8cff8c87a1" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "43576" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(clean_test_text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fa7oyP3sIGTv" + }, + "outputs": [], + "source": [ + "from sklearn.feature_extraction.text import CountVectorizer\n", + "\n", + "#tokenizer 인자에는 list를 받아서 그대로 내보내는 함수를 넣어줍니다. 
또한 소문자화를 하지 않도록 설정해야 에러가 나지 않습니다.\n", + "vectorizer = CountVectorizer(tokenizer = lambda x: x, lowercase=False)\n", + "train_features=vectorizer.fit_transform(clean_train_text)\n", + "test_features=vectorizer.transform(clean_test_text)\n", + "#Calling fit_transform on the test data would be data leakage, so only transform is applied to it" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "C5ggldyZIGTv", + "outputId": "82ad1ae9-7655-408c-b7fe-40c21ec5338b" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "<174304x30402 sparse matrix of type ''\n", + "\twith 2078154 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_features" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jTjDVp34IGTx", + "outputId": "420b3690-bdb0-4597-ba6f-085fdaf56f51" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "<43576x30402 sparse matrix of type ''\n", + "\twith 518549 stored elements in Compressed Sparse Row format>" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_features" + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Modeling" + ], + "metadata": { + "id": "f863vntlH5Dj" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kvD_8pQxIGTx" + }, + "outputs": [], + "source": [ + "#Split into a training set and a validation set\n", + "TEST_SIZE=0.2\n", + "RANDOM_SEED=42\n", + "\n", + "train_x, eval_x, train_y, eval_y=train_test_split(train_features, train['label'], test_size=TEST_SIZE, random_state=RANDOM_SEED)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NpkZXZYjIGTy", + "outputId": "07e32f89-c40a-4f2f-bccf-531efd343ce7" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "RandomForestClassifier()" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "#Model with a random forest\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "forest=RandomForestClassifier(n_estimators=100)\n", + "\n", + "forest.fit(train_x, train_y)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "e0IGwIxkIGTy", + "outputId": "9f1f6ee9-f857-4f87-85f5-8a9bdf0ea491" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9208571182696996" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Validate the model (mean accuracy on the held-out split)\n", + "forest.score(eval_x, eval_y)" + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "pJ2IqM8NH3R1" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# **BERT**" + ], + "metadata": { + "id": "x8d1D0nWIQJT" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Data preprocessing" + ], + "metadata": { + "id": "DxxAGGA4IT4N" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"" + ], + "metadata": { + "id": "rd3cngTxIVIY" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "#In this baseline the model is trained on 요약문_연구내용 as well as 과제명.\n", + "train=train[['과제명', '요약문_연구내용','label']].copy()\n", + "test=test[['과제명', '요약문_연구내용']].copy()\n", + "train['요약문_연구내용']=train['요약문_연구내용'].fillna('NAN')\n", + "test['요약문_연구내용']=test['요약문_연구내용'].fillna('NAN')" + ], + "metadata": { + "id": "k2agNPbaIdHG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "train['data']=train['과제명']+train['요약문_연구내용']\n", + "test['data']=test['과제명']+test['요약문_연구내용']" + ], + "metadata": { + "id": "NMeLNm9lIdEu" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print(train.shape)\n", + "print(test.shape)" + ], + "metadata": { + "id": "vE4wzCvXIdCG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + 
"train.head(2)" + ], + "metadata": { + "id": "Ywue1y5dIh3W" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "test.head(2)" + ], + "metadata": { + "id": "oY3pTBoNIh1N" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### 모델링" + ], + "metadata": { + "id": "n95Q74jaIVol" + } + }, + { + "cell_type": "code", + "source": [ + "#random seed 고정\n", + "tf.random.set_seed(1234)\n", + "np.random.seed(1234)\n", + "BATCH_SIZE = 32\n", + "NUM_EPOCHS = 3\n", + "VALID_SPLIT = 0.2\n", + "MAX_LEN=200" + ], + "metadata": { + "id": "1xx-icJUIWXp" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from transformers import *\n", + "tokenizer=BertTokenizer.from_pretrained('bert-base-multilingual-cased', cache_dir='bert_ckpt', do_lower_case=False)\n", + "\n", + "def bert_tokenizer(sent, MAX_LEN):\n", + " \n", + " encoded_dict=tokenizer.encode_plus(\n", + " text = sent, \n", + " add_special_tokens=True, \n", + " max_length=MAX_LEN, \n", + " pad_to_max_length=True, \n", + " return_attention_mask=True,\n", + " truncation = True)\n", + " \n", + " input_id=encoded_dict['input_ids']\n", + " attention_mask=encoded_dict['attention_mask']\n", + " token_type_id = encoded_dict['token_type_ids']\n", + " \n", + " return input_id, attention_mask, token_type_id\n", + "\n", + "input_ids =[]\n", + "attention_masks =[]\n", + "token_type_ids =[]\n", + "train_data_labels = []\n", + "\n", + "def clean_text(sent):\n", + " sent_clean=re.sub(\"[^가-힣ㄱ-하-ㅣ]\", \" \", sent)\n", + " return sent_clean\n", + "\n", + "for train_sent, train_label in zip(train['data'], train['label']):\n", + " try:\n", + " input_id, attention_mask, token_type_id = bert_tokenizer(clean_text(train_sent), MAX_LEN=MAX_LEN)\n", + " \n", + " input_ids.append(input_id)\n", + " attention_masks.append(attention_mask)\n", + " token_type_ids.append(token_type_id)\n", + " 
#########################################\n", + " train_data_labels.append(train_label)\n", + " \n", + " except Exception as e:\n", + " print(e)\n", + " print(train_sent)\n", + " pass\n", + "\n", + "train_input_ids=np.array(input_ids, dtype=int)\n", + "train_attention_masks=np.array(attention_masks, dtype=int)\n", + "train_token_type_ids=np.array(token_type_ids, dtype=int)\n", + "###########################################################\n", + "train_inputs=(train_input_ids, train_attention_masks, train_token_type_ids)\n", + "train_labels=np.asarray(train_data_labels, dtype=np.int32)" + ], + "metadata": { + "id": "rLNzrvewIoS2" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print(train_input_ids[1])\n", + "print(train_attention_masks[1])\n", + "print(train_token_type_ids[1])\n", + "print(tokenizer.decode(train_input_ids[1]))" + ], + "metadata": { + "id": "9GHbUZ0cIoQV" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "class TFBertClassifier(tf.keras.Model):\n", + " def __init__(self, model_name, dir_path, num_class):\n", + " super(TFBertClassifier, self).__init__()\n", + "\n", + " self.bert = TFBertModel.from_pretrained(model_name, cache_dir=dir_path)\n", + " self.dropout = tf.keras.layers.Dropout(self.bert.config.hidden_dropout_prob)\n", + " self.classifier = tf.keras.layers.Dense(num_class, \n", + " kernel_initializer=tf.keras.initializers.TruncatedNormal(self.bert.config.initializer_range), \n", + " name=\"classifier\")\n", + " \n", + " def call(self, inputs, attention_mask=None, token_type_ids=None, training=False):\n", + " \n", + " #outputs 값: # sequence_output, pooled_output, (hidden_states), (attentions)\n", + " outputs = self.bert(inputs, attention_mask=attention_mask, token_type_ids=token_type_ids)\n", + " pooled_output = outputs[1] \n", + " pooled_output = self.dropout(pooled_output, training=training)\n", + " logits = self.classifier(pooled_output)\n", + 
"\n", + " return logits\n", + "\n", + "cls_model = TFBertClassifier(model_name='bert-base-multilingual-cased',\n", + " dir_path='bert_ckpt',\n", + " num_class=46)\n", + "\n", + "# 학습 준비하기\n", + "optimizer = tf.keras.optimizers.Adam(3e-5)\n", + "loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n", + "metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')\n", + "cls_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])\n", + "\n", + "model_name = \"tf2_bert_classifier\"\n", + "\n", + "# overfitting을 막기 위한 ealrystop 추가\n", + "earlystop_callback = EarlyStopping(monitor='val_accuracy', min_delta=0.0001,patience=5)\n", + "# min_delta: the threshold that triggers the termination (acc should at least improve 0.0001)\n", + "# patience: no improvment epochs (patience = 1, 1번 이상 상승이 없으면 종료)\\\n", + "\n", + "checkpoint_path = os.path.join(model_name, 'weights.h5')\n", + "checkpoint_dir = os.path.dirname(checkpoint_path)\n", + "\n", + "# Create path if exists\n", + "if os.path.exists(checkpoint_dir):\n", + " print(\"{} -- Folder already exists \\n\".format(checkpoint_dir))\n", + "else:\n", + " os.makedirs(checkpoint_dir, exist_ok=True)\n", + " print(\"{} -- Folder create complete \\n\".format(checkpoint_dir))\n", + " \n", + "cp_callback = ModelCheckpoint(\n", + " checkpoint_path, monitor='val_accuracy', verbose=1, save_best_only=True, save_weights_only=True)\n", + "\n", + "# 학습과 eval 시작\n", + "history = cls_model.fit(train_inputs, train_labels, epochs=30, batch_size=32,\n", + " validation_split = VALID_SPLIT, callbacks=[earlystop_callback, cp_callback])" + ], + "metadata": { + "id": "9n9ltFsAIoN8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "input_ids =[]\n", + "attention_masks =[]\n", + "token_type_ids =[]\n", + "train_data_labels = []\n", + "\n", + "def clean_text(sent):\n", + " sent_clean=re.sub(\"[^가-힣ㄱ-하-ㅣ]\", \" \", sent)\n", + " return sent_clean\n", + "\n", + "for test_sent in 
test['data']:\n", + " try:\n", + " input_id, attention_mask, token_type_id = bert_tokenizer(clean_text(test_sent), MAX_LEN=40)\n", + " \n", + " input_ids.append(input_id)\n", + " attention_masks.append(attention_mask)\n", + " token_type_ids.append(token_type_id)\n", + " #########################################\n", + " \n", + " except Exception as e:\n", + " print(e)\n", + " print(test_sent)\n", + " pass\n", + " \n", + "test_input_ids=np.array(input_ids, dtype=int)\n", + "test_attention_masks=np.array(attention_masks, dtype=int)\n", + "test_token_type_ids=np.array(token_type_ids, dtype=int)\n", + "###########################################################\n", + "test_inputs=(test_input_ids, test_attention_masks, test_token_type_ids)" + ], + "metadata": { + "id": "_zpCG4FgIoLi" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "results = cls_model.predict(test_inputs)\n", + "results=tf.argmax(results, axis=1)" + ], + "metadata": { + "id": "0pVNINyQIoJA" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "isBxHC64IxK8" + }, + "execution_count": null, + "outputs": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "junyoung", + "language": "python", + "name": "junyoung" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + }, + "colab": { + "name": "week19_김희숙_예습과제.ipynb", + "provenance": [] + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file