Backpropagation์˜ ์ •์ฒด๋ฅผ ์ œ๋Œ€๋กœ ์•Œ๊ฒŒ๋œ ๊ฑด ํ•™๋ถ€ 4ํ•™๋…„๋•Œ deep learning ์ˆ˜์—… ๋•Œ๋ฌธ์ด์—ˆ๋‹ค. ๋‹น์‹œ ์•ŒํŒŒ๊ณ  ์‚ฌ๊ฑด ์ดํ›„๋กœ ๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ๋”ฅ๋Ÿฌ๋‹์„ ํ†ตํ•ด AI์— ๋Œ€ํ•œ ๊ด€์‹ฌ์ด ๋Œ€์ค‘์œผ๋กœ๋ถ€ํ„ฐ๋„ ์ปค์งˆ ๋•Œ ์˜€๊ฑด๋งŒ, ์ง€๊ธˆ ์ƒ๊ฐํ•ด๋ณด๋ฉด ๊ฐ•๊ฑด๋„ˆ ๋ถˆ๊ตฌ๊ฒฝํ•˜๋“ฏ ๊ทธ์ € ๋ฐ”๋ผ๋งŒ ๋ณธ ์ฑ„ ์‹œ๊ฐ„๋งŒ ํ˜๋ ค ๋ณด๋‚ด๊ณ  ์žˆ์—ˆ๋˜ ๊ฒƒ ๊ฐ™๋‹ค. 3ํ•™๋…„๋•Œ ํผ์…‰ํŠธ๋ก ์˜ ๊ฐœ๋…์„ ๊ณต๋ถ€ํ•  ๋•Œ๋„ ๋จธ์‹ ๋Ÿฌ๋‹์ด ์‹ ๊ธฐํ•˜๋‹ค๋Š” ์ƒ๊ฐ์ด ๋ง‰์—ฐํ•˜๊ฒŒ ์ž๋ฆฌํ•˜๊ณ  ์žˆ์—ˆ์ง€๋งŒ, backpropagation ์ฒ˜๋Ÿผ ์ด๋ ‡๊ฒŒ ๊ตฌ์ฒด์ ์ธ ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„๋˜๋Š” ์ค„์€ ๋ชฐ๋ž๋‹ค. ์–ด์ฐŒ๋˜์—ˆ๋“  ์ง์ ‘ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์ด ํ•™์Šต์ด ๋˜๋Š”๊ฑธ ๊ณ„์‚ฐํ•˜๊ณ  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธ ํ•˜๋Š” ์ˆœ๊ฐ„ ๋ฐ›์•˜๋˜ ์‹ ์„ ํ•œ ์ถฉ๊ฒฉ์€ ์•„์ง๋„ ์ƒ์ƒํ•˜๋‹ค. ์˜ค๋Š˜์€ ์—ญ์ „ํŒŒ์— ๋Œ€ํ•ด์„œ ๊ธ€์„ ์ž‘์„ฑํ•ด๋ณด์ž.

์—ญ์ „ํŒŒ(Backpropagation)์˜ ๊ธฐ๋ณธ ๊ฐœ๋…๊ณผ ์ง๊ด€์  ์ดํ•ด

์—ญ์ „ํŒŒ๋ž€ ๋ฌด์—‡์ธ๊ฐ€?

Backpropagation is the core algorithm for training artificial neural networks: an efficient method that computes the gradient of the loss function by propagating it "backwards", from the output layer toward the input layer. The algorithm was popularized by the 1986 paper of Rumelhart, Hinton, and Williams, and it underpins all of modern deep learning. So Hinton was already building landmark algorithms back then, haha.

๊ฐ„๋‹จํžˆ ๋งํ•˜์ž๋ฉด, ์—ญ์ „ํŒŒ๋Š” ์‹ ๊ฒฝ๋ง์˜ ์˜ˆ์ธก ์˜ค๋ฅ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ๊ฐ€์ค‘์น˜๊ฐ€ ์ตœ์ข… ์˜ค๋ฅ˜์— ์–ผ๋งˆ๋‚˜ ๊ธฐ์—ฌํ–ˆ๋Š”์ง€ ๊ณ„์‚ฐํ•˜๊ณ , ์ด ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ฐ€์ค‘์น˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๊ณผ์ •์ด๋‹ค.

์—ญ์ „ํŒŒ์˜ ์ง๊ด€์  ์ดํ•ด

์ธ๊ณต์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต ๊ณผ์ •์€ ํฌ๊ฒŒ ๋‘ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค:

  1. Forward propagation: input data passes through the network to produce a prediction
  2. Backpropagation: the difference (error) between the prediction and the true value is computed, then propagated backwards to determine the direction and magnitude of each weight update

์—ญ์ „ํŒŒ๊ฐ€ ์ž‘๋™ํ•˜๋Š” ์›๋ฆฌ๋ฅผ ์ง๊ด€์ ์œผ๋กœ ์ดํ•ดํ•ด ๋ณด์ž:

  • ๋„คํŠธ์›Œํฌ๋Š” ๋งˆ์น˜ ๋ณต์žกํ•œ ํ•จ์ˆ˜์™€ ๊ฐ™๋‹ค: ์ž…๋ ฅ โ†’ [๋ธ”๋ž™๋ฐ•์Šค] โ†’ ์ถœ๋ ฅ
  • ์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” ์›ํ•˜๋Š” ์ถœ๋ ฅ์ด ๋‚˜์˜ค๋„๋ก ์ด ๋ธ”๋ž™๋ฐ•์Šค์˜ ๋‚ด๋ถ€ ์„ค์ •(๊ฐ€์ค‘์น˜)์„ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ
  • ์ถœ๋ ฅ์—์„œ ๋ฐœ์ƒํ•œ ์˜ค์ฐจ๋ฅผ ๋‚ด๋ถ€ ์„ค์ •์˜ ์กฐ์ • ๋ฐฉํ–ฅ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ ์—ญ์ „ํŒŒ

Take a credit card fraud detection system as an example:

  1. ์‹œ์Šคํ…œ์ด ์ •์ƒ ๊ฑฐ๋ž˜๋ฅผ ๋ถ€์ • ๊ฑฐ๋ž˜๋กœ ์ž˜๋ชป ๋ถ„๋ฅ˜ํ•จ (์˜ค๋ฅ˜ ๋ฐœ์ƒ)
  2. ์—ญ์ „ํŒŒ๋ฅผ ํ†ตํ•ด “์–ด๋–ค ๋‚ด๋ถ€ ์—ฐ๊ฒฐ(๊ฐ€์ค‘์น˜)์ด ์ด ์˜ค๋ถ„๋ฅ˜์— ๊ฐ€์žฅ ํฐ ์˜ํ–ฅ์„ ๋ฏธ์ณค๋Š”๊ฐ€?“๋ฅผ ์ฐพ์•„๋ƒ„
  3. ํ•ด๋‹น ์—ฐ๊ฒฐ์„ ์ ์ ˆํžˆ ์กฐ์ •ํ•˜์—ฌ ๋‹ค์Œ์—๋Š” ์œ ์‚ฌํ•œ ์ •์ƒ ๊ฑฐ๋ž˜๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋ถ„๋ฅ˜ํ•˜๋„๋ก ํ•จ

์—ญ์ „ํŒŒ์˜ ์ˆ˜ํ•™์  ์ •์˜์™€ ์ž‘๋™ ์›๋ฆฌ

๊ธฐ๋ณธ ์ˆ˜์‹๊ณผ ํ‘œ๊ธฐ๋ฒ•

์ธ๊ณต ์‹ ๊ฒฝ๋ง์—์„œ ์—ญ์ „ํŒŒ๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ณธ ์ˆ˜์‹์„ ์•Œ์•„๋ณด์ž. $L$๊ฐœ์˜ ์ธต์œผ๋กœ ๊ตฌ์„ฑ๋œ ์‹ ๊ฒฝ๋ง์„ ๊ฐ€์ •ํ•  ๋•Œ:

  • $W^l$: weight matrix of layer $l$
  • $b^l$: bias vector of layer $l$
  • $z^l = W^l a^{l-1} + b^l$: weighted sum of layer $l$
  • $a^l = \sigma(z^l)$: activation output of layer $l$
  • $\sigma$: activation function (ReLU, sigmoid, etc.)
  • $L$: total number of layers
  • $C$: cost function (e.g., mean squared error)

์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋‹จ๊ณ„๋ณ„ ์„ค๋ช…

์—ญ์ „ํŒŒ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹จ๊ณ„๋กœ ์ง„ํ–‰๋œ๋‹ค:

  1. Forward pass: starting from the input $x = a^0$, propagate through every layer to compute the final output $a^L$. $$a^l = \sigma(W^l a^{l-1} + b^l) \quad \text{for } l = 1, 2, \ldots, L$$

  2. Output-layer error: compute the error between the prediction $a^L$ and the target $y$, expressed through the gradient of the cost function $C$. $$\delta^L = \nabla_a C \odot \sigma'(z^L)$$ Breaking this formula down:
    $\nabla_a C$ : the partial derivative of the cost function $C$ with respect to the output activations $a^L$; it tells us how the error changes as the output changes.
    $\sigma'(z^L)$ : the derivative of the output layer's activation function, i.e., the rate of change of the activation with respect to its input $z^L$.
    $\odot$ : element-wise multiplication (also called the Hadamard product), which multiplies corresponding entries of two vectors; note that it is different from the dot product and from matrix multiplication.

  3. Error backpropagation: propagate the error computed at the output layer backwards to the earlier layers. $$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \quad \text{for } l = L-1, L-2, \ldots, 1$$

  4. Gradient computation: compute the gradient of the cost function with respect to each layer's weights and biases. $$\nabla_{W^l} C = \delta^l (a^{l-1})^T$$ $$\nabla_{b^l} C = \delta^l$$

  5. ๊ฐ€์ค‘์น˜ ๋ฐ ํŽธํ–ฅ ์—…๋ฐ์ดํŠธ: ๊ณ„์‚ฐ๋œ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€์ค‘์น˜์™€ ํŽธํ–ฅ์„ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. $$W^l \leftarrow W^l - \eta \nabla_{W^l} C$$ $$b^l \leftarrow b^l - \eta \nabla_{b^l} C$$ ์—ฌ๊ธฐ์„œ $\eta$๋Š” ํ•™์Šต๋ฅ ์ด๋‹ค.

์ฒด์ธ ๋ฃฐ(Chain Rule)์˜ ์ค‘์š”์„ฑ

์—ญ์ „ํŒŒ์˜ ํ•ต์‹ฌ์€ ๋ฏธ์ ๋ถ„ํ•™์˜ ์ฒด์ธ ๋ฃฐ(chain rule)์„ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋ณต์žกํ•œ ํ•ฉ์„ฑ ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„์„ ๊ฐ ๊ตฌ์„ฑ ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„์˜ ๊ณฑ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์›๋ฆฌ๋ฅผ ์ด์šฉํ•œ๋‹ค.

For example, if $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$.

์‹ ๊ฒฝ๋ง์—์„œ๋Š” ์ž…๋ ฅ๋ถ€ํ„ฐ ์ถœ๋ ฅ๊นŒ์ง€ ์—ฌ๋Ÿฌ ์ธต์˜ ์—ฐ์‚ฐ์ด ์ค‘์ฒฉ๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ, ์ตœ์ข… ์˜ค์ฐจ๊ฐ€ ๊ฐ ๊ฐ€์ค‘์น˜์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์ฒด์ธ ๋ฃฐ์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ ์šฉํ•œ๋‹ค.

| Backpropagation step | Formula | Intuitive meaning |
| --- | --- | --- |
| Output-layer error | $\delta^L = \nabla_a C \odot \sigma'(z^L)$ | "How large is the final error?" |
| Error backpropagation | $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ | "How much did each neuron in the previous layer contribute to the final error?" |
| Gradient computation | $\nabla_{W^l} C = \delta^l (a^{l-1})^T$ | "How much did each weight contribute to the final error?" |
| Weight update | $W^l \leftarrow W^l - \eta \nabla_{W^l} C$ | "How should each weight be adjusted to reduce the final error?" |

์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์‹ค์ œ ๊ตฌํ˜„

๊ฐ„๋‹จํ•œ ์‹ ๊ฒฝ๋ง์—์„œ์˜ ์—ญ์ „ํŒŒ ์˜ˆ์‹œ

2๊ฐœ์˜ ์ž…๋ ฅ ๋‰ด๋Ÿฐ, 2๊ฐœ์˜ ์€๋‹‰์ธต ๋‰ด๋Ÿฐ, 1๊ฐœ์˜ ์ถœ๋ ฅ ๋‰ด๋Ÿฐ์œผ๋กœ ๊ตฌ์„ฑ๋œ ๊ฐ„๋‹จํ•œ ์‹ ๊ฒฝ๋ง์„ ์˜ˆ๋กœ ๋“ค์–ด ์—ญ์ „ํŒŒ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด์ž.

Input: $x = [x_1, x_2]^T$, target output: $y$

Forward pass:

  1. Hidden-layer input: $z^1 = W^1 x + b^1$
  2. Hidden-layer output: $a^1 = \sigma(z^1)$
  3. Output-layer input: $z^2 = W^2 a^1 + b^2$
  4. Output-layer output: $a^2 = \sigma(z^2)$
  5. Error computation: $C = \frac{1}{2}(a^2 - y)^2$

Backward pass:

  1. Output-layer error: $\delta^2 = (a^2 - y) \cdot \sigma'(z^2)$
  2. Output-layer weight gradient: $\nabla_{W^2} C = \delta^2 \cdot (a^1)^T$
  3. Output-layer bias gradient: $\nabla_{b^2} C = \delta^2$
  4. Hidden-layer error: $\delta^1 = (W^2)^T \delta^2 \odot \sigma'(z^1)$
  5. Hidden-layer weight gradient: $\nabla_{W^1} C = \delta^1 \cdot x^T$
  6. Hidden-layer bias gradient: $\nabla_{b^1} C = \delta^1$

๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ:

  1. $W^2 \leftarrow W^2 - \eta \nabla_{W^2} C$
  2. $b^2 \leftarrow b^2 - \eta \nabla_{b^2} C$
  3. $W^1 \leftarrow W^1 - \eta \nabla_{W^1} C$
  4. $b^1 \leftarrow b^1 - \eta \nabla_{b^1} C$
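
Here is a runnable sketch of exactly this 2-2-1 network performing one training step; the specific numbers for $x$, $y$, and the initial parameters are arbitrary example values of my own, and sigmoid is assumed in both layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Arbitrary example values (not from the text above)
x = np.array([0.5, -0.2])                   # input [x1, x2]
y = np.array([1.0])                         # target output
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])    # 2x2 hidden-layer weights
b1 = np.zeros(2)
W2 = np.array([[0.6, -0.1]])                # 1x2 output-layer weights
b2 = np.zeros(1)
eta = 0.5                                   # learning rate

# Forward pass (steps 1-5)
z1 = W1 @ x + b1;   a1 = sigmoid(z1)
z2 = W2 @ a1 + b2;  a2 = sigmoid(z2)
C = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass (steps 1-6)
delta2 = (a2 - y) * sigmoid_prime(z2)          # output-layer error
gW2 = np.outer(delta2, a1);  gb2 = delta2
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)   # hidden-layer error
gW1 = np.outer(delta1, x);   gb1 = delta1

# Weight updates (steps 1-4)
W2 -= eta * gW2;  b2 -= eta * gb2
W1 -= eta * gW1;  b1 -= eta * gb1
print(f"cost before update: {C:.4f}")
```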

Activation Functions and Their Derivatives

์—ญ์ „ํŒŒ ๊ณผ์ •์—์„œ๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„ ๊ฐ’์ด ํ•„์š”ํ•˜๋‹ค. ์ฃผ์š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜์™€ ๊ทธ ๋ฏธ๋ถ„์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

ํ™œ์„ฑํ™” ํ•จ์ˆ˜ ์ •์˜ ๋ฏธ๋ถ„
Sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ $\sigma’(x) = \sigma(x)(1 - \sigma(x))$
ReLU $\text{ReLU}(x) = \max(0, x)$ $\text{ReLU}’(x) = \begin{cases} 1 & \text{if } x > 0 \ 0 & \text{if } x \leq 0 \end{cases}$
tanh $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ $\tanh’(x) = 1 - \tanh^2(x)$
Leaky ReLU $\text{LReLU}(x) = \begin{cases} x & \text{if } x > 0 \ \alpha x & \text{if } x \leq 0 \end{cases}$ $\text{LReLU}’(x) = \begin{cases} 1 & \text{if } x > 0 \ \alpha & \text{if } x \leq 0 \end{cases}$
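
The same four functions transcribe directly into NumPy; `alpha=0.01` below is just a common default for Leaky ReLU, not a value fixed by the table.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2   # np.tanh itself is the forward function

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_prime(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)
```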

Practical Implementation Considerations

1. ์ˆ˜์น˜์  ์•ˆ์ •์„ฑ

  • Prevent overflow/underflow: be careful when using exponential or logarithmic functions
  • Use techniques such as the log-sum-exp trick (a short sketch follows this list)
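
As an illustration, a minimal sketch of the log-sum-exp trick: factoring out the maximum keeps the exponentials in a safe range. This is a generic example, not tied to a specific loss above.

```python
import numpy as np

def log_sum_exp(z):
    """Compute log(sum(exp(z))) stably by factoring out max(z)."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 1001.0, 1002.0])
# The naive np.log(np.sum(np.exp(z))) overflows to inf here;
print(log_sum_exp(z))  # the stable version prints ~1002.41
```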

2. Batch processing

  • Use mini-batch gradient descent: update the weights with the average gradient over several samples
  • Apply batch normalization to reduce internal covariate shift

3. ๊ทธ๋ž˜๋””์–ธํŠธ ์†Œ์‹ค/ํญ๋ฐœ ๋ฌธ์ œ

  • ๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘(gradient clipping) ์ ์šฉ
  • ์ ์ ˆํ•œ ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™” ๋ฐฉ๋ฒ• ์„ ํƒ(Xavier, He ์ดˆ๊ธฐํ™” ๋“ฑ)
  • ์ž”์ฐจ ์—ฐ๊ฒฐ(residual connections) ์‚ฌ์šฉ
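
As an example of the first point, here is a sketch of gradient clipping by global norm, one common variant; the threshold 5.0 is an arbitrary choice of mine.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together when their combined L2 norm is too large."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```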

4. Computational efficiency

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ ์ตœ์ ํ™”
  • GPU ๊ฐ€์† ํ™œ์šฉ
  • ์ž๋™ ๋ฏธ๋ถ„(Automatic Differentiation) ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉ

์—ญ์ „ํŒŒ์˜ ํ•œ๊ณ„์™€ ์ตœ์‹  ๊ฐœ์„  ๊ธฐ๋ฒ•

๊ธฐ์กด ์—ญ์ „ํŒŒ์˜ ํ•œ๊ณ„์ 

1. ๊นŠ์€ ์‹ ๊ฒฝ๋ง์—์„œ์˜ ๊ทธ๋ž˜๋””์–ธํŠธ ์†Œ์‹ค/ํญ๋ฐœ

  • ์ธต์ด ๊นŠ์–ด์งˆ์ˆ˜๋ก ๊ทธ๋ž˜๋””์–ธํŠธ๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์ง€๊ฑฐ๋‚˜(์†Œ์‹ค) ๋งค์šฐ ์ปค์ง€๋Š”(ํญ๋ฐœ) ๋ฌธ์ œ ๋ฐœ์ƒ
  • ๊ฒฐ๊ณผ์ ์œผ๋กœ ๊นŠ์€ ์ธต์€ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šต๋˜์ง€ ์•Š์Œ

2. ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ

  • ์ˆœ์ „ํŒŒ ๊ณผ์ •์˜ ๋ชจ๋“  ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•ด์•ผ ํ•จ
  • ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ๋Ÿ‰์ด ๋„คํŠธ์›Œํฌ ๊นŠ์ด์— ๋น„๋ก€ํ•˜์—ฌ ์ฆ๊ฐ€

3. ์ˆœ์ฐจ์  ๊ณ„์‚ฐ์˜ ํ•œ๊ณ„

  • ๋ณธ์งˆ์ ์œผ๋กœ ์ˆœ์ฐจ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ฏ€๋กœ ๋ณ‘๋ ฌํ™”์— ์ œํ•œ์ด ์žˆ์Œ
  • ํŠนํžˆ ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง(RNN)์—์„œ ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ ๊ธธ ๊ฒฝ์šฐ ๋น„ํšจ์œจ์ 

Recent Improvements

1. ๊ตฌ์กฐ์  ๊ฐœ์„ 

  • Residual connections: provide shortcuts that let gradients flow more directly to earlier layers
  • Highway networks: introduce gating mechanisms that control information flow
  • Dense connections: every layer connects directly to all preceding layers

2. ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ์„ 

  • Adam, RMSprop: improve convergence speed with adaptive learning rates
  • Lookahead: lets a set of "fast" weights explore ahead while "slow" weights follow, stabilizing training

3. Computational efficiency improvements

  • ๊ทธ๋ž˜๋””์–ธํŠธ ์ฒดํฌํฌ์ธํŒ…(Gradient Checkpointing): ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ์ผ๋ถ€ ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋งŒ ์ €์žฅ
  • ์—ญ์ „ํŒŒ ์—†๋Š” ํ•™์Šต(Training without Backpropagation): ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง์—์„œ ์—ญ์ „ํŒŒ ๋Œ€์‹  ์ง€์—ญ์  ํ•™์Šต ๊ทœ์น™ ์‚ฌ์šฉ

4. ๋ณ‘๋ ฌํ™” ๋ฐ ๋ถ„์‚ฐ ํ•™์Šต

  • ๋ชจ๋ธ ๋ณ‘๋ ฌํ™”: ๋ชจ๋ธ์„ ์—ฌ๋Ÿฌ ๋””๋ฐ”์ด์Šค์— ๋ถ„์‚ฐ
  • ํŒŒ์ดํ”„๋ผ์ธ ๋ณ‘๋ ฌํ™”: ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋ฅผ ์—ฌ๋Ÿฌ ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋ˆ„์–ด ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ

๊ฒฐ๋ก  ๋ฐ ์‹ค์šฉ์  ํŒ

์—ญ์ „ํŒŒ์˜ ์ค‘์š”์„ฑ ์š”์•ฝ

์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ˜„๋Œ€ ๋”ฅ๋Ÿฌ๋‹์˜ ํ•ต์‹ฌ ๊ธฐ์ˆ ๋กœ, ๋ณต์žกํ•œ ์‹ ๊ฒฝ๋ง์˜ ํšจ์œจ์ ์ธ ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๋‹ค. ์ฒด์ธ ๋ฃฐ์„ ํ™œ์šฉํ•˜์—ฌ ์ถœ๋ ฅ์ธต์˜ ์˜ค์ฐจ๋ฅผ ์—ญ์œผ๋กœ ์ „ํŒŒํ•จ์œผ๋กœ์จ ๊ฐ ๊ฐ€์ค‘์น˜๊ฐ€ ์ตœ์ข… ์˜ค์ฐจ์— ๊ธฐ์—ฌํ•˜๋Š” ์ •๋„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ฐ€์ค‘์น˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•œ๋‹ค.

์‹ค์šฉ์  ํŒ

1. Debugging strategies

  • Verify your backpropagation implementation with numerical differentiation (see the sketch below)
  • Monitor gradient norms to track training health
  • Start with a small network and increase complexity gradually

2. ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

  • ์ ์ ˆํ•œ ํ•™์Šต๋ฅ  ์„ ํƒ: ๋„ˆ๋ฌด ํฌ๋ฉด ๋ฐœ์‚ฐ, ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด ๋А๋ฆฐ ์ˆ˜๋ ด
  • ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํฌ๊ธฐ ์กฐ์ •: ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ๊ท ํ˜•
  • ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™” ๋ฐฉ๋ฒ• ์„ ํƒ: ํ™œ์„ฑํ™” ํ•จ์ˆ˜์— ๋งž๋Š” ์ดˆ๊ธฐํ™” ๊ธฐ๋ฒ• ํ™œ์šฉ

3. ๋ชจ๋‹ˆํ„ฐ๋ง ์ง€ํ‘œ

  • ํ›ˆ๋ จ/๊ฒ€์ฆ ์†์‹ค ์ถ”์ด ๊ด€์ฐฐ
  • ๊ฐ ์ธต์˜ ํ™œ์„ฑํ™” ๋ถ„ํฌ ๋ฐ ๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ„ํฌ ํ™•์ธ
  • ๊ฐ€์ค‘์น˜ ๋ฐ ํŽธํ–ฅ์˜ ๋ณ€ํ™”๋Ÿ‰ ๋ชจ๋‹ˆํ„ฐ๋ง

References

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
  2. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
  3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.