1. Introduction to Deep Learning
What is Deep Learning?
Deep Learning is a subset of machine learning inspired by the structure and function of the
human brain, using artificial neural networks (ANNs) with multiple layers (hence “deep”).
These models automatically learn hierarchical representations of data, making them
extremely effective for complex tasks like image recognition, natural language processing,
and speech recognition.
Why Deep Learning?
● Automatically extracts features from raw data, reducing the need for manual feature
engineering.
● Scales well with large datasets and complex data.
● State-of-the-art performance in many domains.
2. Basics of Neural Networks
2.1 The Neuron
The fundamental building block of a neural network is the neuron, or node. It receives
multiple inputs, multiplies each by a weight, adds a bias, and passes the resulting weighted
sum through an activation function.
Mathematically:
z = \sum_{i=1}^{n} w_i x_i + b
where x_i are the inputs, w_i are the weights, and b is the bias.
The output is:
a = \sigma(z)
where \sigma is an activation function.
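As a quick illustration of these two formulas, here is a minimal NumPy sketch of a single neuron; the particular inputs, weights, and bias are made-up example values, not anything prescribed.
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# Example inputs, weights, and bias (illustrative values only)
x = np.array([0.5, -1.2, 3.0])   # inputs x_i
w = np.array([0.4, 0.7, -0.2])   # weights w_i
b = 0.1                          # bias b

z = np.dot(w, x) + b             # weighted sum: z = sum(w_i * x_i) + b
a = sigmoid(z)                   # neuron output: a = sigma(z)
print(z, a)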
2.2 Activation Functions
Activation functions introduce non-linearity, allowing neural networks to learn complex
patterns.
Common functions (a NumPy sketch follows this list):
● Sigmoid:
\sigma(z) = \frac{1}{1 + e^{-z}}
Output range: (0, 1); useful for binary classification.
● Tanh:
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
Output range: (-1, 1)
● ReLU (Rectified Linear Unit):
\text{ReLU}(z) = \max(0, z)
Most popular; alleviates the vanishing gradient problem.
● Leaky ReLU:
Allows small gradients when z < 0, defined as:
\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ 0.01z & z \leq 0 \end{cases}
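A minimal NumPy sketch of these four functions; the 0.01 slope in Leaky ReLU is the conventional default used in the definition above, not a requirement.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))           # output in (0, 1)

def tanh(z):
    return np.tanh(z)                     # output in (-1, 1)

def relu(z):
    return np.maximum(0, z)               # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope for negative inputs

z = np.linspace(-2, 2, 5)
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z))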
2.3 Layers and Architecture
Neurons are organized in layers:
● Input layer: Takes raw features.
● Hidden layers: Perform transformations; can be multiple.
● Output layer: Produces final prediction.
3. Forward and Backward Propagation
3.1 Forward Propagation
Input data passes through the network layer by layer, applying weights, biases, and
activation functions to generate predictions.
3.2 Loss Function
A loss function measures how far the model's predictions are from the true labels. Common examples (sketched in code after this list):
● Mean Squared Error (MSE) for regression.
● Cross-Entropy Loss for classification.
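As a rough sketch, both losses can be written in a few lines of NumPy; the small epsilon below guards against log(0) and is an implementation detail, not part of the definition.
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences (regression)
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-Entropy for binary classification; clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))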
3.3 Backward Propagation
Uses the chain rule to compute gradients of the loss with respect to weights. This allows the
network to update weights to minimize loss.
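Concretely, for a single weight w feeding a neuron with pre-activation z = wx + b, activation a = \sigma(z), and loss L, the chain rule gives:
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = \frac{\partial L}{\partial a} \cdot \sigma'(z) \cdot x
Gradients for earlier layers are built from the gradients of later layers in the same way, which is what the layer2_delta and layer1_delta terms do in the NumPy example in Section 4.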
3.4 Optimization
Gradient descent algorithms adjust weights using gradients:
● Batch Gradient Descent: Uses all data at once.
● Stochastic Gradient Descent (SGD): Uses one data point at a time.
● Mini-batch Gradient Descent: Uses small batches (commonly used).
Advanced optimizers:
● Adam: Combines momentum and adaptive learning rates (see the update-rule sketch after this list).
● RMSProp, Adagrad, etc.
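A hedged NumPy sketch of a single parameter update under plain SGD and under Adam; the learning rates and the beta/epsilon constants below are the commonly cited defaults, used here only for illustration.
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain gradient descent: move against the gradient
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum (m) plus per-parameter adaptive scaling (v)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
grad = np.array([0.1, -0.2])
m = np.zeros_like(w)
v = np.zeros_like(w)
print(sgd_step(w, grad))
print(adam_step(w, grad, m, v, t=1)[0])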
4. Implementing a Simple Neural Network in Python (Using NumPy)
Here’s a barebones example of a simple neural network for binary classification.
python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expects the sigmoid *output* (the activation), not the raw input
    return x * (1 - x)

# Sample data (AND logic gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [0], [0], [1]])

np.random.seed(1)

# Initialize weights randomly with mean 0
weights0 = 2 * np.random.random((2, 3)) - 1  # input to hidden
weights1 = 2 * np.random.random((3, 1)) - 1  # hidden to output

learning_rate = 0.1

for epoch in range(10000):
    # Forward propagation
    layer0 = X
    layer1 = sigmoid(np.dot(layer0, weights0))
    layer2 = sigmoid(np.dot(layer1, weights1))

    # Calculate error
    layer2_error = y - layer2
    if epoch % 1000 == 0:
        print(f"Error at epoch {epoch}: {np.mean(np.abs(layer2_error))}")

    # Backpropagation
    layer2_delta = layer2_error * sigmoid_derivative(layer2)
    layer1_error = layer2_delta.dot(weights1.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # Update weights
    weights1 += learning_rate * layer1.T.dot(layer2_delta)
    weights0 += learning_rate * layer0.T.dot(layer1_delta)

# Predictions
print("Output after training:")
print(layer2)
5. Convolutional Neural Networks (CNNs)
5.1 What are CNNs?
CNNs are specialized neural networks designed to process data with grid-like topology —
mainly images. They automatically learn spatial hierarchies of features through convolutional
layers, reducing the need for manual feature extraction.
5.2 Key Components
● Convolutional Layers: Apply filters (kernels) that slide over the input to extract
features such as edges and textures.
● Pooling Layers: Downsample feature maps to reduce dimensionality and
computation.
● Fully Connected Layers: After convolution/pooling, these layers perform
classification.
5.3 How Convolution Works
Convolution uses small matrices (filters) to scan the input and produce feature maps.
For example, a 3x3 filter moves over the image, computing the dot product between the filter
weights and each 3x3 input patch; each dot product becomes one value in the feature map.
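A minimal NumPy sketch of this sliding-window computation, assuming a plain "valid" convolution with stride 1 and no padding; this is a simplification of what a real convolutional layer does (no channels, no learned filters).
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and take dot products ("valid" mode, stride 1)
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])          # simple vertical-edge filter
print(conv2d(image, edge_kernel))                   # 3x3 feature map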
5.4 Example: Simple CNN with Keras
python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load data
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define model
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Compile model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train model
model.fit(train_images, train_labels, epochs=5, batch_size=64)

# Evaluate model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc}")
5.5 Why CNNs Work Well for Images
● Local connectivity: Filters focus on small regions.
● Parameter sharing: Same filter applies across the image, reducing parameters.
● Translation invariance: CNNs detect features regardless of location.
6. Recurrent Neural Networks (RNNs)
6.1 What Are RNNs?
RNNs are specialized for sequential data, such as time series or text. Unlike feedforward
networks, RNNs have recurrent connections (loops) that allow information to persist across time steps.
6.2 Basic Structure
At each time step t, the RNN takes the input x_t and the previous hidden state h_{t-1},
producing the new hidden state h_t:
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
The output y_t can be computed from h_t.
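A minimal NumPy sketch of this recurrence, rolled out over a short sequence; the sizes and random weights are placeholders for the example, and the output layer that would map h_t to y_t is omitted.
import numpy as np

np.random.seed(0)
input_size, hidden_size, seq_len = 4, 3, 5

# Randomly initialized parameters (illustrative only)
W_xh = np.random.randn(hidden_size, input_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)

xs = np.random.randn(seq_len, input_size)  # a toy input sequence
h = np.zeros(hidden_size)                  # initial hidden state

for t in range(seq_len):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)
    print(f"step {t}: hidden state {h}")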
6.3 Problems with Basic RNNs
● Vanishing and exploding gradients make training hard for long sequences.
● Solutions: LSTM and GRU architectures.
7. LSTM and GRU
7.1 Long Short-Term Memory (LSTM)
LSTM adds gating mechanisms to control information flow:
● Forget gate: Decides what to discard.
● Input gate: Decides what to store.
● Output gate: Decides what to output.
This helps retain long-term dependencies.
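In one common formulation (with [h_{t-1}, x_t] denoting concatenation of the previous hidden state and the current input, \sigma the sigmoid, and \odot element-wise multiplication), the gates act on a cell state c_t:
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)
Because c_t is updated only through element-wise operations, gradients can flow across many time steps without vanishing as quickly as in a basic RNN.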
7.2 Gated Recurrent Unit (GRU)
A simplified version of the LSTM that merges the gating into an update gate and a reset gate, with no separate cell state; it has fewer parameters and is often faster to train.
8. Example: Simple LSTM for Text Classification (Keras)
python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data
texts = ['I love machine learning', 'Deep learning is amazing', 'I dislike spam emails']
labels = np.array([1, 1, 0])

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=5)

# Build model
model = Sequential([
    Embedding(100, 8, input_length=5),
    LSTM(16),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded, labels, epochs=10)

# Predict
print(model.predict(padded))
9. Attention Mechanisms and Transformers
9.1 What is Attention?
Attention allows a model to focus on the most relevant parts of the input when producing each
part of the output, improving performance on long sequences.
9.2 Transformers
Introduced by Vaswani et al. (2017), transformers use self-attention layers and have
revolutionized NLP, powering models like BERT and GPT.
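As a sketch of the core operation, here is the scaled dot-product attention from Vaswani et al. (2017) in NumPy, applied to random toy matrices. In self-attention Q, K, and V are all learned projections of the same sequence; those projection weights are omitted here for brevity.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of each query to each key
    weights = softmax(scores, axis=-1)        # attention weights sum to 1 per query
    return weights @ V, weights

np.random.seed(0)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # each row shows where one position "attends"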
10. Summary
● Deep learning uses multi-layer neural networks.
● CNNs are great for images.
● RNNs (especially LSTM/GRU) handle sequential data.
● Attention and transformers have advanced NLP massively.
● Python frameworks like TensorFlow and PyTorch enable practical implementations.