Every deep learning breakthrough, from AlphaGo to ChatGPT, is built on the same foundation: the artificial neural network. Understanding how neurons, weights, biases, and layers work together is essential before diving into advanced architectures. This guide takes you through the fundamentals step by step.

The Artificial Neuron

The building block of every neural network is the artificial neuron (also called a unit or node). Inspired loosely by biological neurons, it performs a simple three-step computation:

  1. Weighted sum: Multiply each input by its corresponding weight and sum the results: z = w1*x1 + w2*x2 + ... + wn*xn
  2. Add bias: Add a bias term: z = z + b
  3. Apply activation: Pass z through an activation function: output = f(z)

The weights determine how much influence each input has. The bias allows the neuron to activate even when all inputs are zero. The activation function introduces nonlinearity, which is what allows neural networks to learn complex patterns.
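The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation; the sigmoid is just one possible choice of activation function.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum, plus bias, through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs))  # step 1: weighted sum
    z += bias                                        # step 2: add bias
    return 1 / (1 + math.exp(-z))                    # step 3: sigmoid activation

# With all-zero weights and zero bias, the sigmoid outputs exactly 0.5.
out = neuron([1.0, 2.0], [0.0, 0.0], 0.0)
```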

"A single neuron computes a weighted vote of its inputs. A network of billions of such neurons, each voting on different aspects of the data, produces intelligence."

Weights: The Knowledge Store

Weights are the learned parameters that encode what the network knows. Before training, weights are initialized randomly (or using careful initialization schemes like Xavier or He initialization). During training, backpropagation adjusts each weight to reduce the prediction error.

  • A large positive weight means the input strongly activates the neuron.
  • A large negative weight means the input strongly inhibits the neuron.
  • A weight near zero means the input has little effect on the neuron's output.

A neural network with millions of neurons can have billions of weights. GPT-3, for example, has 175 billion parameters, nearly all of which are weights.

Biases: Shifting the Decision Boundary

The bias is an additional parameter in each neuron that shifts the activation function left or right. Without a bias, the neuron's pre-activation is zero whenever all inputs are zero, and its decision boundary is forced to pass through the origin. The bias gives the neuron the flexibility to activate at the appropriate threshold.

Think of it this way: weights determine the slope of a linear boundary, and biases determine its position. Together, they define a decision surface in the input space.
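A tiny sketch makes the slope/position picture concrete. For a single input, the pre-activation is a line; the weight sets its slope and the bias slides the point where it crosses zero (the decision threshold):

```python
def pre_activation(x, w, b):
    """Pre-activation of a one-input neuron: a line with slope w, offset b."""
    return w * x + b

# With w = 1 and b = 0, the boundary (where the sign flips) sits at x = 0.
# With w = 1 and b = -5, the boundary shifts: the neuron only "fires" for x > 5.
fires_at_6 = pre_activation(6.0, 1.0, -5.0) > 0   # True
fires_at_4 = pre_activation(4.0, 1.0, -5.0) > 0   # False
```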

Key Takeaway

Weights and biases are the "memory" of a neural network. Everything the network learns during training is encoded in these numbers. The architecture defines the capacity to learn; the weights and biases store what was actually learned.

Layers: Building Depth

Neurons are organized into layers, and layers are stacked to form a network. The arrangement of layers defines the network's architecture.

Input Layer

The input layer simply passes the raw data to the first hidden layer. For an image with 784 pixels (28x28), the input layer has 784 neurons, one per pixel. It performs no computation.

Hidden Layers

Hidden layers are where the learning happens. Each layer transforms the data from the previous layer into a new representation. In a well-trained network:

  • Early layers learn low-level features: edges, textures, basic patterns.
  • Middle layers learn mid-level features: shapes, parts, combinations of basic patterns.
  • Later layers learn high-level features: objects, concepts, abstract patterns.

This hierarchical feature learning is the key advantage of deep networks over shallow ones.

Output Layer

The output layer produces the final prediction. Its design depends on the task:

  • Binary classification: One neuron with sigmoid activation, outputting a probability between 0 and 1.
  • Multi-class classification: One neuron per class, with softmax applied across all of them to produce a probability distribution that sums to 1.
  • Regression: One neuron with no activation (linear), outputting a continuous value.

Common Layer Types

Dense (Fully Connected) Layers

In a dense layer, every neuron is connected to every neuron in the previous layer. This is the most general type of layer but can be parameter-heavy. A dense layer with 1000 inputs and 1000 outputs has one million weights plus 1000 biases.
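The parameter count above follows directly from the layer's shape, and is easy to check:

```python
def dense_params(n_in, n_out):
    """Parameter count of a dense layer: one weight per (input, output)
    pair, plus one bias per output neuron."""
    return n_in * n_out + n_out

count = dense_params(1000, 1000)   # one million weights plus 1000 biases
```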

Convolutional Layers

Used primarily in CNNs, convolutional layers use small filters that slide across the input, sharing weights across spatial locations. This dramatically reduces parameters while exploiting the spatial structure of images.
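The parameter savings come from the fact that a filter's size is independent of the image size. As a rough comparison (assuming square filters and ignoring layer variants like grouped or depthwise convolutions):

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Each filter spans kernel_h x kernel_w x in_channels weights,
    plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

# 3x3 filters mapping 64 channels to 64 channels cost the same
# whether the image is 32x32 or 1024x1024:
conv = conv_params(3, 3, 64, 64)
```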

Recurrent Layers

Used in RNNs, recurrent layers maintain a hidden state that is updated at each time step, allowing them to process sequential data of variable length.
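The core idea fits in one function: the new hidden state depends on both the current input and the previous hidden state. This sketch uses a single-unit RNN with scalar weights (real layers use weight matrices, but the recurrence is the same):

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One recurrent step: combine the current input x with the
    previous hidden state h, then squash with tanh."""
    return math.tanh(w_x * x + w_h * h + b)

# Process a variable-length sequence by folding rnn_step over it:
h = 0.0
for x in [1.0, -0.5, 2.0]:
    h = rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.0)
```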

Normalization Layers

Batch normalization and layer normalization standardize the inputs to each layer, stabilizing training and enabling higher learning rates.
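The standardization step both methods share can be sketched as follows (omitting the learned scale and shift parameters that real normalization layers apply afterward):

```python
import math

def normalize(values, eps=1e-5):
    """Standardize to zero mean and unit variance; eps guards
    against division by zero when the variance is tiny."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normalized = normalize([1.0, 2.0, 3.0, 4.0])
```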

Dropout Layers

Dropout randomly deactivates neurons during training, preventing overfitting and improving generalization.
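A minimal sketch of "inverted" dropout, the variant most frameworks use: survivors are scaled up by 1/(1-p) during training so that the expected activation is unchanged, which lets inference skip dropout entirely:

```python
import random

def dropout(values, p, training=True, rng=random):
    """Zero each value with probability p during training; scale the
    survivors by 1/(1-p) so the expected value stays the same."""
    if not training:
        return list(values)          # at inference, dropout is a no-op
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```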

The Forward Pass

When data flows through a network from input to output, this is called the forward pass. At each layer, the computation is:

output = activation(weights @ input + bias)

The forward pass transforms raw input into a prediction. The network's prediction quality depends entirely on the values of its weights and biases.
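The per-layer formula chains naturally: each layer's output becomes the next layer's input. A minimal sketch using ReLU as the activation (any activation could be substituted):

```python
def layer_forward(inputs, weights, biases):
    """One dense layer: activation(weights @ input + bias), with ReLU."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(max(0.0, z))  # ReLU activation
    return outputs

def forward_pass(x, layers):
    """Feed the input through each (weights, biases) pair in turn."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

# A one-layer network with identity weights just ReLUs its input:
result = forward_pass([3.0, -1.0], [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])])
```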

Loss Functions

A loss function measures how wrong the network's predictions are. Common choices include:

  • Cross-entropy loss: For classification tasks. Penalizes confident wrong predictions heavily.
  • Mean squared error: For regression tasks. The squaring penalizes large errors far more heavily than small ones.
  • Binary cross-entropy: For binary classification tasks.

The loss function guides learning. By computing the gradient of the loss with respect to every weight, backpropagation tells each weight how to change to reduce the error.
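Two of the losses above can be sketched directly from their definitions. Note how cross-entropy blows up as the probability assigned to the true class approaches zero, which is exactly the heavy penalty on confident wrong predictions:

```python
import math

def cross_entropy(probs, target_index):
    """Multi-class cross-entropy: -log of the probability of the true class."""
    return -math.log(probs[target_index])

def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# A confident wrong prediction is penalized far more than a mild one:
confident_wrong = cross_entropy([0.01, 0.99], 0)   # true class got 1%
mildly_wrong = cross_entropy([0.40, 0.60], 0)      # true class got 40%
```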

Key Takeaway

The forward pass computes predictions; the loss function measures how wrong they are; backpropagation computes how to fix them; and the optimizer updates the weights. This cycle repeats millions of times during training.
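The whole cycle can be seen in miniature by fitting a single weight with gradient descent. Here the gradient is derived by hand for a one-parameter model; backpropagation automates exactly this computation for networks with millions of parameters:

```python
# Fit y = w * x to data by gradient descent on mean squared error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # true relationship: y = 2x
w = 0.0
learning_rate = 0.05

for step in range(100):
    # forward pass + loss gradient: d(MSE)/dw = mean of 2 * (w*x - y) * x
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad                  # optimizer step

# After 100 iterations, w has converged very close to 2.0.
```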

Weight Initialization

How weights are initialized matters more than you might expect. Poor initialization can cause:

  • Vanishing gradients: If weights are too small, signals shrink as they pass through layers, and learning stalls.
  • Exploding gradients: If weights are too large, signals grow exponentially, causing numerical instability.

Modern initialization methods like Xavier initialization (for sigmoid/tanh) and He initialization (for ReLU) set weights to values that maintain signal magnitude across layers, enabling stable training of very deep networks.
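Both schemes amount to choosing the standard deviation of the initial weight distribution from the layer's size. A sketch of the normal-distribution variants (both methods also have uniform-distribution forms):

```python
import math
import random

def he_init(n_in, n_out, rng=random):
    """He initialization (for ReLU): normal with std = sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def xavier_init(n_in, n_out, rng=random):
    """Xavier/Glorot initialization (for sigmoid/tanh):
    normal with std = sqrt(2 / (n_in + n_out))."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

weights = he_init(512, 256)   # 256 neurons, each with 512 incoming weights
```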

Putting It All Together

A neural network is a mathematical function defined by its architecture (the arrangement of layers) and its parameters (weights and biases). Training is the process of finding the parameter values that minimize the loss function on the training data. The hope, validated by decades of results, is that a network that performs well on training data will also perform well on new, unseen data, especially when combined with regularization techniques like dropout and early stopping.

Understanding these fundamentals is the key to everything else in deep learning. Every advanced architecture, from ResNets to Transformers, is built from these same basic building blocks, just arranged in clever ways to solve increasingly complex problems.