3.3.2 Types of Neural Networks
Neural networks come in many architectures tailored to different tasks. The simplest is the Multilayer Perceptron (MLP), a fully connected feedforward network (no loops) that transforms inputs through layers of weighted sums and nonlinear activations. In contrast, Convolutional Neural Networks (CNNs) introduce localized weight sharing (convolutions) and pooling to process grid-like data (e.g. images). Recurrent Neural Networks (RNNs) incorporate feedback loops to handle sequences – their hidden state at time t depends on the input at t and the previous state at t–1. Variants like LSTMs and GRUs add gating mechanisms to RNNs for better long-term memory. More recently, Transformer architectures abandon recurrence altogether in favor of self-attention, enabling highly parallel sequence modeling. Separately, Autoencoders are unsupervised nets that learn to compress and reconstruct data, while Generative Adversarial Networks (GANs) pit two networks (generator vs. discriminator) against each other to synthesize realistic data. We examine each type in turn, covering theory, math, implementation, and use cases.
Multi-Layer Perceptron (MLP)
Figure: A simple MLP with one hidden layer.
An MLP is a fully connected feedforward network with an input layer, one or more hidden layers, and an output layer. Each neuron performs a weighted sum followed by a nonlinear activation (e.g. ReLU or sigmoid). For example, a single hidden layer MLP computes

$$\mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \qquad \hat{\mathbf{y}} = g(W_2 \mathbf{h} + \mathbf{b}_2),$$

where $\sigma$ is the hidden activation and $g$ the output activation (e.g. softmax for classification).
Training uses backpropagation to minimize a loss (e.g. cross-entropy). In code, an MLP in PyTorch might be:
import torch.nn as nn

# input_dim, hidden_dim, output_dim are set by the task (e.g. 784, 256, 10 for MNIST)
model = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),   # weighted sum W1 x + b1
    nn.ReLU(),                          # nonlinear activation
    nn.Linear(hidden_dim, output_dim)   # output logits
)
MLPs are general function approximators used in classification and regression. They work on vector inputs (tabular data, flattened images) but scale poorly with input size (many parameters) and are not translation-invariant. Typical applications include regression on structured data or simple image tasks (after flattening); for example, an MLP can classify MNIST digits once the 28×28 pixels are flattened into a 784-dimensional vector. To avoid overfitting, one often adds regularization (dropout, weight decay) or normalization layers (batch normalization) when implementing MLPs.
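A minimal sketch of such a regularized MLP (the layer sizes here are placeholders, and weight decay is applied through the optimizer rather than the model):

import torch
import torch.nn as nn

# Regularized MLP sketch: batch norm + dropout in the model, weight decay in the optimizer
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalize hidden activations
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zero hidden units during training
    nn.Linear(256, 10)
)
# weight_decay adds an L2 penalty on the weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)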
Convolutional Neural Network (CNN)
Figure: CNN architecture with convolution, pooling, and dense layers.
CNNs extend MLPs by replacing dense layers with convolutional layers that scan for local patterns. A convolution layer slides learnable kernels (filters) across the input, producing feature maps. This weight sharing drastically reduces parameters and encodes spatial locality. CNNs also use pooling layers (e.g. max pooling) to reduce spatial size and add invariance. Mathematically, a convolutional layer produces an output feature map $Y$ from a kernel $K$ and a (single-channel) input $X$ as

$$Y(i, j) = \sum_{m}\sum_{n} X(i + m,\, j + n)\, K(m, n),$$

followed by an activation like ReLU. A toy CNN in code (PyTorch) might be:
import torch.nn as nn

# H, W are the input image height and width; MaxPool2d(2) halves each spatial dimension
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                                    # downsample by 2
    nn.Flatten(),
    nn.Linear(16 * (H // 2) * (W // 2), num_classes)    # flattened feature maps → class logits
)
CNNs excel at grid-like data (images, videos). Their strength is capturing local features and hierarchies (edges→shapes→objects) with far fewer weights than a dense net. They achieve state-of-the-art in image classification, object detection, and segmentation. For instance, autonomous vehicles rely heavily on CNNs for visual perception. Tesla’s HydraNet and Waymo’s ChauffeurNet use CNN backbones (often ResNet variants) to process camera images for driving decisions. In practice, CNNs are used in computer vision tasks (face recognition, medical imaging analysis, autonomous driving) and even audio or signal processing (1D convolutions). CNN implementation tips include tuning kernel sizes, depths, and using techniques like batch normalization to improve convergence.
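As an illustration of those tips, here is a small sketch of a reusable convolution block with batch normalization (the channel counts are arbitrary placeholders, not a prescribed architecture):

import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One convolution block: conv → batch norm → ReLU → 2x downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),   # stabilizes and speeds up convergence
        nn.ReLU(),
        nn.MaxPool2d(2)
    )

# Stacking blocks builds the edges → shapes → objects hierarchy
features = nn.Sequential(conv_block(3, 16), conv_block(16, 32), conv_block(32, 64))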
Recurrent Neural Network (RNN)
Figure: RNN unrolled over time showing feedback loops.
RNNs introduce temporal memory via a hidden state that is fed back at each step. At time step $t$, an RNN updates its hidden state as

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

and produces an output

$$y_t = g(W_{hy} h_t + b_y).$$

Here $f$ is usually $\tanh$ or ReLU, and $g$ an output nonlinearity. This recurrence lets RNNs process sequences of arbitrary length. In practice one implements RNNs as layers in frameworks; e.g., in PyTorch:
rnn = nn.RNN(input_size, hidden_size, num_layers=1)
# input_seq: (seq_len, batch, input_size); h_0: (num_layers, batch, hidden_size)
output, h_n = rnn(input_seq, h_0)
RNNs are suited to sequence modeling – language, speech, time-series. However, vanilla RNNs suffer from vanishing/exploding gradients, making them hard to train on long dependencies; typical tricks include gradient clipping and using gated variants (below). Applications: sequence labeling (POS tagging, named-entity recognition), sequence classification, time series forecasting, and signal prediction. For example, RNNs can predict the next value in a stock price series by learning from prior values, and character-level text generation is another common RNN task.
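To illustrate the gradient-clipping trick, here is a minimal training-step sketch in PyTorch (the layer sizes, output head, and loss are illustrative placeholders, not a prescribed setup):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x, y):
    # x: (batch, seq_len, 10); y: (batch, 1)
    output, h_n = rnn(x)
    pred = head(output[:, -1, :])        # predict from the last hidden state
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    # clip the gradient norm to avoid exploding gradients
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
    return loss.item()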
Long Short-Term Memory (LSTM)
LSTMs are a gated RNN variant invented to retain long-term context. An LSTM cell has an internal cell state $c_t$ and three gates: input $i_t$, forget $f_t$, and output $o_t$. At time $t$, the LSTM equations are:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid function and $\odot$ denotes elementwise multiplication.
These gates control what information is kept or forgotten, allowing the network to learn long-range dependencies without vanishing gradients. In code (TensorFlow/Keras) an LSTM layer is straightforward:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(128, input_shape=(timesteps, features)),   # 128 hidden units
    Dense(output_dim, activation='softmax')          # class probabilities
])
LSTMs power many sequential tasks in NLP and time-series forecasting. For example, financial institutions use LSTM models to predict stock prices and other economic indicators, with some reports of 20–30% lower error than classical ARIMA baselines. In natural language processing, LSTMs underlie early language models (e.g. Google’s Smart Reply system used LSTMs to predict email responses). Tips: ensure the data is properly shaped (batch × timesteps × features) and use techniques like dropout on the recurrent connections to regularize.
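As a sketch of those tips (the timestep and feature counts are placeholders), Keras lets you apply dropout to both the input and the recurrent connections directly on the LSTM layer:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    # input shape: (timesteps, features); the batch dimension is implicit
    LSTM(128, input_shape=(30, 8),
         dropout=0.2,             # dropout on the input connections
         recurrent_dropout=0.2),  # dropout on the recurrent connections
    Dense(1)                      # e.g. next-step regression target
])
model.compile(optimizer='adam', loss='mse')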
Gated Recurrent Unit (GRU)
Figure: A simplified GRU block unrolled over time.
GRUs are a simpler gated RNN variant (introduced in 2014) that merges the LSTM’s forget and input gates into a single update gate $z_t$ and uses a reset gate $r_t$. The GRU updates are:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
These simpler gates make GRUs faster to train while still mitigating vanishing gradients. In PyTorch one can use nn.GRU similarly:
gru = nn.GRU(input_size, hidden_size, batch_first=True)
# with batch_first=True, input_seq has shape (batch, seq_len, input_size)
output, h_n = gru(input_seq, h_0)
GRUs work well in the same domains as LSTMs (speech, translation, time-series). For example, sentiment classification or stock prediction tasks often reach similar accuracy with GRUs as with LSTMs, with less training time. When tuning, it is worth comparing GRU against LSTM, since GRUs have fewer parameters, as the quick comparison below illustrates.
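A quick way to see the parameter difference, using arbitrary illustrative sizes:

import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

# An LSTM has 4 gate weight matrices per layer, a GRU has 3, so the GRU is roughly 25% smaller
print("LSTM params:", count_params(lstm))
print("GRU params: ", count_params(gru))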
Transformer Networks
Transformers use self-attention instead of recurrence to model sequences in parallel. A basic transformer encoder layer computes “scaled dot-product” attention: given queries $Q$, keys $K$, and values $V$,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

where $d_k$ is the key dimension.
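To make the formula concrete, here is a minimal sketch of scaled dot-product attention (the tensor shapes are illustrative; production implementations add masking, dropout, and multiple heads):

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ V                                   # weighted sum of values

Q = K = V = torch.randn(2, 5, 64)   # self-attention: Q, K, V come from the same sequence
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 5, 64)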
Multi-head attention runs this in parallel with different linear projections. Each transformer block also has feedforward sublayers and layer normalization. In practice, frameworks like PyTorch provide transformer modules:
import torch.nn as nn
transformer = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6)
Or one can use libraries such as HuggingFace’s Transformers:
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
Transformers achieve state-of-the-art in NLP (machine translation, QA, summarization) and increasingly in vision (Vision Transformers). They handle long-range dependencies better than RNNs. In healthcare, transformer-based language models are being used for tasks like disease prediction and medical report analysis. For example, BioBERT (a BERT model pretrained on biomedical text) improves accuracy in clinical NLP tasks. Transformers do require large datasets and compute, but libraries and pretrained models make them more accessible to practitioners.
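As a brief usage sketch of the pretrained model loaded above (the example sentence is arbitrary), one can extract contextual token embeddings as follows:

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The patient reports mild chest pain.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state   # (batch, tokens, 768) contextual embeddings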
Generative Adversarial Network (GAN)
Figure: GAN training loop – a generator tries to fool a discriminator.
A GAN consists of two neural nets: a generator $G$ that produces fake data $G(z)$ from random noise $z$, and a discriminator $D$ that classifies data as real or fake. They play a minimax game with value function

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].$$

Both nets are trained jointly: $D$ learns to distinguish real data from $G$’s fakes, and $G$ learns to produce more realistic data. A typical PyTorch training loop alternates between the two updates:
# criterion = nn.BCELoss(); `ones`/`zeros` are label tensors of 1s (real) and 0s (fake)

# Train Discriminator
d_real = D(real_data)
d_fake = D(G(z).detach())          # detach so no gradient flows into G here
loss_D = criterion(d_real, ones) + criterion(d_fake, zeros)

# Train Generator
d_fake = D(G(z))
loss_G = criterion(d_fake, ones)   # G wants D to predict "real"
GANs are widely used for image synthesis (e.g. StyleGAN for faces), data augmentation, and even music generation. The adversarial setup can produce highly realistic samples, but training can be unstable (mode collapse). In industry, GANs are used for creating high-resolution imagery (games, films) and for generating synthetic training data. For example, NVIDIA uses GANs for graphics rendering and to simulate photorealistic environments.
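To ground the training loop above, here is a minimal sketch of the two networks for flattened 28×28 images (the layer widths and noise dimension are arbitrary choices, not a prescribed architecture):

import torch.nn as nn

noise_dim, img_dim = 100, 28 * 28   # illustrative sizes

# Generator: noise vector → fake image
G = nn.Sequential(
    nn.Linear(noise_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh()        # outputs scaled to [-1, 1]
)

# Discriminator: image → probability it is real
D = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid()
)
criterion = nn.BCELoss()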
Autoencoder
Figure: An autoencoder compresses inputs to a latent “bottleneck” and reconstructs them.
An autoencoder is a neural network that learns to reconstruct its input. It has two parts: an encoder that maps an input $x$ to a latent code $z = f_{\text{enc}}(x)$, and a decoder that maps $z$ back to a reconstruction $\hat{x} = f_{\text{dec}}(z)$. Training minimizes the reconstruction error, e.g. $\mathcal{L}(x, \hat{x}) = \lVert x - \hat{x} \rVert^2$. For example, a simple autoencoder in Keras:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(128, activation='relu', input_dim=input_dim),  # encoder
    Dense(64, activation='relu'),                        # latent bottleneck
    Dense(128, activation='relu'),                       # decoder
    Dense(input_dim, activation='sigmoid')               # reconstruction of the input
])
model.compile(optimizer='adam', loss='mse')              # minimize reconstruction error
By forcing a bottleneck (latent size < input size), autoencoders learn compact features. They are used for dimensionality reduction, denoising, and anomaly detection: anomalies often reconstruct poorly. Indeed, autoencoders are used in practice to detect outliers or novel events, e.g. in fraud detection or equipment fault monitoring, by flagging high reconstruction error instances. They also serve as pretraining for deep networks (learning features). Typical tips: ensure the bottleneck is small enough to enforce learning meaningful patterns; sometimes add constraints (sparsity, regularization) to avoid trivial identity mapping.
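As a sketch of the anomaly-detection idea, assuming the trained model above and a NumPy array X of samples (the 99th-percentile threshold is an illustrative choice):

import numpy as np

reconstructions = model.predict(X)                     # X: (num_samples, input_dim)
errors = np.mean((X - reconstructions) ** 2, axis=1)   # per-sample reconstruction error

threshold = np.percentile(errors, 99)                  # e.g. flag the worst 1% of samples
anomalies = np.where(errors > threshold)[0]            # indices of suspected outliers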
Comparative Summary
| Network Type | Key Characteristics | Strengths | Typical Use Cases |
|---|---|---|---|
| MLP | Fully connected feedforward layers (dense). | Universal approximator; simple to understand and implement. | Tabular data, basic classification/regression when inputs are vectors. |
| CNN | Convolution + pooling layers with local receptive fields. | Captures spatial hierarchies; translation-invariant; fewer parameters than dense nets for images. | Image/video analysis, object detection (e.g. autonomous vehicles), computer vision. |
| RNN | Recurrent connections; maintains hidden state over time. | Models sequential/temporal dependencies; processes variable-length sequences. | Sequential data: language modeling, speech recognition, time-series prediction. |
| LSTM | Gated RNN with cell state to remember long-term info. | Remembers long dependencies; mitigates vanishing gradients. | Time-series forecasting (e.g. finance), machine translation, sequential prediction. |
| GRU | Simplified gated RNN (update/reset gates). | Similar performance to LSTM with fewer parameters; faster training. | Same domains as LSTM (NLP, speech, etc.) when compute is limited. |
| Transformer | Self-attention based; no recurrence. | Global context at each layer; fully parallelizable; excels on long sequences. | NLP tasks (translation, QA, summarization); also vision (ViT); healthcare/biomedical language (disease prediction). |
| GAN | Two-network adversarial setup (generator vs. discriminator). | Generates highly realistic synthetic data; unsupervised learning of the data distribution. | Data generation: image synthesis, super-resolution, art generation. |
| Autoencoder | Encoder–bottleneck–decoder architecture. | Learns compact representations; good for anomaly detection (high reconstruction error). | Denoising, dimensionality reduction (like PCA), anomaly/outlier detection. |
Each architecture has trade-offs: for example, MLPs are simple but can overfit large images, while CNNs exploit locality in images. RNNs are natural for sequences but can be slow; transformers handle sequences faster but require more data and memory. GANs can produce stunning results but may be unstable to train, whereas autoencoders provide straightforward unsupervised feature learning. When designing models, practitioners choose based on data structure and task requirements, often using pretrained components or fine-tuning (e.g. using a pretrained CNN backbone for image tasks or a pretrained Transformer for text) to leverage transfer learning in industry applications.
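As a brief sketch of that transfer-learning pattern (the class count is a placeholder, and the weights= argument assumes torchvision 0.13+):

import torch
import torch.nn as nn
from torchvision import models

num_classes = 5   # illustrative number of target classes

# Load a CNN backbone pretrained on ImageNet
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained features and swap in a new classification head
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head is trained during fine-tuning
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)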