  • IMDB Movie Review Sentiment Analysis: Project Summary and Interview Preparation
    AI/NLP 2024. 12. 19. 21:37
    As I wrap up the second semester of my first year, I want to summarize what I worked on.
    This post explains the individual analysis report I wrote for my Statistics for AI class and prepares answers to the interview questions I expect to get about it.
    (The README on GitHub is written in English.)

    Project GitHub link: https://github.com/Yerin99/IMDB-Movie-Review-Sentiment-Analysis

     


     

    Project Overview

    I built a text sentiment analysis model on the IMDB movie review dataset and compared three machine learning methods: Logistic Regression, Random Forest, and Naive Bayes. Logistic Regression gave the best results in both interpretability and accuracy, and its model weights let me identify the words that most strongly drive positive and negative sentiment.


    Why Logistic Regression?

    1. Interpretability

    Logistic Regression is a linear model that provides a weight (coefficient) for each feature.

    • The weights give a quantitative reading of how strongly a given word pushes a review toward positive or negative sentiment
    • For example, "excellent" and "perfect" carry strong positive weight, while "worst" and "awful" carry strong negative weight
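
    The idea of reading sentiment words off the coefficients can be sketched in plain Python. This is a minimal illustration, not the project's code: the six toy reviews, the binary bag-of-words features, and the hyperparameters (`epochs`, `lr`, `l2`) are all assumptions chosen for the example.

```python
import math

# Hypothetical toy reviews (1 = positive, 0 = negative); not the IMDB data.
REVIEWS = [
    ("excellent film and perfect cast", 1),
    ("excellent story with perfect pacing", 1),
    ("perfect ending and excellent music", 1),
    ("worst film and awful cast", 0),
    ("worst story with awful pacing", 0),
    ("awful ending and worst music", 0),
]

def train_logistic_regression(reviews, epochs=500, lr=0.5, l2=0.01):
    """Fit one weight per word with batch gradient descent on binary bag-of-words features."""
    vocab = sorted({w for text, _ in reviews for w in text.split()})
    weights = {w: 0.0 for w in vocab}
    bias = 0.0
    n = len(reviews)
    for _ in range(epochs):
        grad_w = {w: 0.0 for w in vocab}
        grad_b = 0.0
        for text, label in reviews:
            words = set(text.split())
            z = bias + sum(weights[w] for w in words)
            error = 1.0 / (1.0 + math.exp(-z)) - label  # predicted probability minus label
            for w in words:
                grad_w[w] += error
            grad_b += error
        for w in vocab:
            weights[w] -= lr * (grad_w[w] / n + l2 * weights[w])
        bias -= lr * grad_b / n
    return weights, bias

weights, bias = train_logistic_regression(REVIEWS)
ranked = sorted(weights.items(), key=lambda kv: kv[1])
print("most negative:", ranked[:2])
print("most positive:", ranked[-2:])
```

    Words that appear equally often in both classes ("film", "story", and so on) end up with weights near zero, so the ranking isolates the words that actually separate the classes.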

    2. Fit with text data

    • TF-IDF vectorization of text generally produces a sparse matrix
    • Logistic Regression works efficiently on sparse data and has a relatively low risk of overfitting
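
    To make the sparsity point concrete, here is a small sketch that computes TF-IDF weights by hand and counts how many entries of the document-term matrix are non-zero. The four documents and the smoothed IDF formula are assumptions for illustration; the project itself may use a different vectorizer configuration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real project vectorizes IMDB reviews.
DOCS = [
    "excellent film with a perfect cast",
    "the worst film I have seen",
    "awful pacing and a weak story",
    "a perfect ending to the story",
]

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is a dict holding only the non-zero TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency times a smoothed inverse document frequency.
        rows.append({
            tok: (cnt / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return vocab, rows

vocab, rows = tfidf_rows(DOCS)
nonzero = sum(len(row) for row in rows)
total = len(rows) * len(vocab)
sparsity = 1 - nonzero / total
print(f"{len(vocab)} features, {nonzero}/{total} non-zero entries, sparsity = {sparsity:.2f}")
```

    Even with four short documents most entries are zero, and the fraction of zeros grows as the vocabulary grows, which is why sparse-friendly linear models are a natural match.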

    3. Performance comparison

    Model                 Accuracy
    Logistic Regression   89.7%
    Random Forest         85.3%
    Naive Bayes           85.8%

    • Logistic Regression had the highest accuracy of the three models
    • It also performed consistently well on Recall and F1-Score
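
    For reference, the metrics named above can be computed from true and predicted labels as in the sketch below. The label lists are made-up placeholders, not the project's predictions, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Placeholder labels for illustration only (1 = positive review, 0 = negative review).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

    Recall answers "how many of the truly positive reviews were found", while F1 balances recall against precision, so reporting them alongside accuracy guards against a model that favors one class.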

    Why did Random Forest and Naive Bayes underperform Logistic Regression?

    1. Random Forest

    • Characteristics: Random Forest is a nonlinear model and is not well suited to high-dimensional sparse matrices such as text data
    • Problems:
      • Tree models become harder to train as the number of text features grows (10,000 or more); this did not apply to this project
      • It is likely to miss interactions between TF-IDF features
    • Conclusion: Random Forest is strongest on structured (tabular) data, and its performance can degrade on sparse text data

    2. Naive Bayes

    • Characteristics: Naive Bayes relies on the conditional independence assumption
    • Problems:
      • It ignores correlations between words in text, so it cannot learn complex patterns
      • e.g., in a phrase like "not bad", it cannot learn the combination of "not" and "bad" and is likely to treat "bad" as purely negative
    • Conclusion: Naive Bayes is simple and fast, but it does not adequately capture dependencies between words
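
    The "not bad" problem can be reproduced with a tiny multinomial Naive Bayes written from scratch. This is a sketch under assumptions: the seven training phrases are invented, and the model uses unigram counts with add-one (Laplace) smoothing. Note that "not bad" is even included as a positive training example.

```python
import math
from collections import Counter

# Hypothetical training phrases; "not bad" is deliberately labeled positive.
TRAIN = [
    ("good movie", "pos"),
    ("great acting", "pos"),
    ("not bad", "pos"),
    ("bad movie", "neg"),
    ("bad acting", "neg"),
    ("bad story", "neg"),
    ("not good", "neg"),
]

def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def log_scores(text, model):
    """Log prior plus a sum of independent per-word log likelihoods (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores

model = train_naive_bayes(TRAIN)
scores = log_scores("not bad", model)
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

    Because each word contributes its own independent term, the three negative occurrences of "bad" outweigh the single positive "not bad" example, and the phrase is scored as negative.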

    Interview questions and answers

    Q1. Why did you choose Logistic Regression?

    Logistic Regression is a linear model, which makes it well suited to interpreting feature importance when text is vectorized with TF-IDF. Its weights (coefficients) let me quantify the influence of positive and negative words, and it works reliably on sparse data. It also achieved higher accuracy and recall than the other models, so I chose it as the final model.


    Q2. Explain the Central Limit Theorem.

    The Central Limit Theorem (CLT) is the statistical principle that, when the sample size is sufficiently large, the distribution of the sample mean approaches a normal distribution regardless of the shape of the population distribution, provided the observations are independent and the population variance is finite.
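
    Stated a little more formally, as a standard textbook form rather than anything specific to this project: for independent, identically distributed observations $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$,

```latex
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
\]
```

    In practice this means $\bar{X}_n$ is approximately $\mathcal{N}(\mu, \sigma^2 / n)$ for large $n$, so the spread of the sample mean shrinks like $\sigma / \sqrt{n}$.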


    Q3. What was the purpose of the CLT experiment?

    This answer feels incomplete to me, but I wanted to check whether the Central Limit Theorem holds not only in theory but also on real data, in this case movie reviews. I wanted to see visually whether the distribution of sample means converges to a normal distribution as the sample size grows. In this project I ran the CLT experiment on review length, the number of negation words, and the number of interjections. With small samples the distribution of sample means had a large variance, but as the sample size grew it moved closer to a normal distribution.
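
    An experiment of this kind can be sketched with the standard library alone. Everything here is an assumption for illustration: the population is a synthetic right-skewed stand-in for review lengths (log-normal draws), not the IMDB data, and the sample sizes and number of repetitions are arbitrary. Skewness is used as a rough numerical proxy for "how far from normal" in place of a histogram.

```python
import random
import statistics

def skewness(values):
    """Population skewness: near 0 for a symmetric, normal-like distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_mean_distribution(population, sample_size, n_samples, rng):
    """Draw n_samples samples (with replacement) and return the mean of each."""
    return [statistics.fmean(rng.choices(population, k=sample_size)) for _ in range(n_samples)]

rng = random.Random(42)
# Hypothetical right-skewed stand-in for review lengths in words; not the IMDB data.
population = [int(rng.lognormvariate(5.0, 0.6)) + 1 for _ in range(20000)]

results = {}
for n in (2, 30, 500):
    means = sample_mean_distribution(population, n, 2000, rng)
    results[n] = (statistics.pstdev(means), skewness(means))
    print(f"n={n:>3}  std of sample means={results[n][0]:8.2f}  skewness={results[n][1]:6.2f}")
```

    As the sample size grows, both the spread and the skewness of the sample means shrink, which is the numerical counterpart of the histograms becoming narrower and more bell-shaped.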

     


    Q4. What are the limitations of Logistic Regression?

    Because Logistic Regression is a linear model, it cannot fully learn nonlinear relationships in the data. In addition, when features are strongly correlated with one another (multicollinearity: two or more independent variables with a strong correlation), the stability and interpretability of linear models such as Logistic Regression can suffer. Multicollinearity can be addressed with feature selection (keeping only one of a set of highly correlated variables), regularization (adding an L2 penalty, as in Ridge Regression, to control the model's coefficients), or PCA (compressing highly correlated variables into a principal component).
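
    The effect of multicollinearity on interpretability, and how an L2 penalty helps, can be shown with an extreme toy case in which one feature is an exact copy of the other. This is a sketch under assumptions: the six data points, the gradient-descent settings, and the helper name `fit_logistic` are invented for the example and are not the project's training code.

```python
import math

# Hypothetical toy data: feature 2 is an exact copy of feature 1 (perfect collinearity).
X = [(-2.0, -2.0), (-1.0, -1.0), (-0.5, -0.5), (0.5, 0.5), (1.0, 1.0), (2.0, 2.0)]
y = [0, 0, 1, 0, 1, 1]  # the classes overlap, so the optimum is finite

def fit_logistic(X, y, init, l2, epochs=5000, lr=0.1):
    """Batch gradient descent for a two-weight logistic regression (no intercept), optional L2 penalty."""
    w1, w2 = init
    n = len(X)
    for _ in range(epochs):
        g1 = g2 = 0.0
        for (x1, x2), label in zip(X, y):
            error = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2))) - label
            g1 += error * x1
            g2 += error * x2
        w1 -= lr * (g1 / n + l2 * w1)
        w2 -= lr * (g2 / n + l2 * w2)
    return w1, w2

starts = [(2.0, -1.0), (-1.0, 2.0)]  # two different starting points
plain = [fit_logistic(X, y, init, l2=0.0) for init in starts]
ridge = [fit_logistic(X, y, init, l2=0.1) for init in starts]
print("no penalty:", plain)
print("L2 penalty:", ridge)
```

    Without the penalty only the sum of the two weights is determined by the data, so the individual coefficients depend on where training started and cannot be interpreted on their own. With the L2 penalty both runs converge to the same solution, with the weight shared equally between the duplicated features.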

     
