Deep Dive: 深度学习基础 — 神经网络与 Transformer 革命

Topic: Deep Learning & Transformer (Level 2) Category: AI Fundamentals Level: 入门 Series: AI 小白学习系列 Last Updated: 2026-03-09

中文版

1. 概述

深度学习（Deep Learning）是机器学习中最强大的分支，也是当今 AI 革命的核心引擎。ChatGPT、Claude、自动驾驶、人脸识别 —— 几乎所有让你惊叹的 AI 应用，底层都是深度学习。

深度学习的核心工具是神经网络（Neural Network）—— 一种模仿人脑结构的数学模型。而 2017 年诞生的 Transformer 架构，更是彻底改变了 AI 格局，是 GPT、Claude、Gemini 所有大语言模型的基石。

2. 核心概念

2.1 神经网络：模仿大脑的数学模型

人脑如何工作：你的大脑有约 860 亿个神经元，每个神经元通过突触与其他神经元相连。信号从一个神经元传到下一个，经过层层处理后产生思想和决策。

人工神经网络就是对此的数学模仿：

   输入层          隐藏层(们)         输出层
   Input          Hidden             Output

   ○ ─────┐    ┌── ● ──┐
          ├──▶ │       ├──▶ ● ──┐
   ○ ─────┤    ├── ● ──┤       ├──▶ ◎  预测结果
          ├──▶ │       ├──▶ ● ──┘
   ○ ─────┘    └── ● ──┘

   特征输入      层层计算提取规律     最终判断
  (图片像素、    (每一层发现更       (是猫/是狗？
   文字编码)     高级的特征)         价格是多少？)

概念	类比	说明
神经元 (Neuron)	大脑中的一个脑细胞	接收输入、计算、产生输出的基本单元
权重 (Weight)	突触的强度	决定每个输入有多重要的数值
偏置 (Bias)	神经元的敏感度	调节神经元何时被「激活」
激活函数	阈值开关	决定神经元是否”点亮”（非线性变换）
层 (Layer)	大脑皮层的不同区域	浅层识别简单特征，深层识别复杂特征
“深度”	层数多	“深度学习”就是因为网络层数很多（几十到几百层）

💡 为什么”深”很重要？ 浅层网络只能识别简单模式（如线条、颜色）。层数越多，网络能学到的特征越抽象越高级（如从「边缘」→「纹理」→「五官」→「人脸」）。

2.2 CNN：让 AI 看懂图片

卷积神经网络（Convolutional Neural Network） 专门为处理图像而设计。

核心思想：用一个小窗口（卷积核/滤波器）在图片上滑动扫描，逐步提取越来越高级的特征。

原始图片 → [边缘检测] → [纹理识别] → [部件识别] → [整体识别] → "这是一只猫"
           第1层        第2层        第3层        第4层       输出
           发现线条     发现毛发纹理  发现耳朵眼睛  识别完整的猫

应用场景：人脸识别、医学影像诊断、自动驾驶（识别路牌、行人）、OCR 文字识别

2.3 RNN/LSTM：让 AI 理解顺序

循环神经网络（Recurrent Neural Network） 专门处理有先后顺序的数据（文字、语音、时间序列）。

核心思想：网络有”记忆”，处理当前输入时会参考之前的信息。

"我 爱 北京 天安门"

 我 ──▶ [RNN] ──▶ 爱 ──▶ [RNN] ──▶ 北京 ──▶ [RNN] ──▶ 天安门 ──▶ [RNN]
         │  ▲           │  ▲             │  ▲               │
         └──┘           └──┘             └──┘               │
        记忆传递        记忆传递          记忆传递            最终输出

LSTM（Long Short-Term Memory） 是 RNN 的改进版，解决了 RNN “记性太差”（长序列中忘记早期信息）的问题。

缺陷：RNN 必须按顺序处理，速度慢，无法并行。这就是为什么它被 Transformer 替代了。

2.4 ⭐ Transformer：改变世界的架构

2017 年，Google 发表论文《Attention Is All You Need》，提出 Transformer 架构。这是现代 AI 最重要的发明。

Transformer 解决了什么问题？

RNN 的问题	Transformer 的解决方案
必须按顺序处理，无法并行 → 慢	可以并行处理整个序列 → 快
长文本中容易忘记开头的内容	通过 Attention 直接关注任何位置
难以扩展到超大规模	可以扩展到万亿参数的超大模型

Transformer 的核心结构：

┌─────────────────────────────────────────────┐
│              Transformer                     │
│                                             │
│  ┌──────────────┐    ┌──────────────────┐   │
│  │   Encoder    │    │    Decoder       │   │
│  │  (理解输入)   │──▶ │  (生成输出)       │   │
│  │              │    │                  │   │
│  │ Self-        │    │ Self-Attention   │   │
│  │ Attention    │    │ + Cross-         │   │
│  │ ↕           │    │   Attention      │   │
│  │ Feed-Forward │    │ ↕               │   │
│  │ Network     │    │ Feed-Forward     │   │
│  └──────────────┘    └──────────────────┘   │
└─────────────────────────────────────────────┘

• GPT / Claude = 只用 Decoder（生成式）
• BERT = 只用 Encoder（理解式）
• 原始 Transformer = Encoder + Decoder（翻译等）

2.5 ⭐ 注意力机制（Attention）

注意力机制是 Transformer 的灵魂。它解决的是：当处理一个词时，如何知道句子中其他哪些词更重要？

直觉理解：

当你读”小明把他的书给了小红”这句话时：

读到”他”时，你的大脑自动知道”他”指的是”小明” → 这就是 Attention
你不需要按顺序从头到尾回忆，而是直接”跳着”关注相关的词

"The cat sat on the mat because it was tired"

  处理 "it" 时的 Attention 权重：

  The   cat   sat   on   the   mat   because   it   was   tired
  0.05  0.60  0.05  0.02  0.02  0.08  0.03    1.0  0.05  0.10
        ^^^^                                         
     "it" 主要关注 "cat" → AI 理解了 "it" 指代 "cat"

Self-Attention（自注意力）：让序列中的每个位置都能关注到其他所有位置，计算出相互之间的关联强度。这使 Transformer 能够捕捉到长距离的依赖关系。

2.6 预训练与迁移学习

现代大模型的训练分为两个阶段：

Phase 1: 预训练 (Pre-training)              Phase 2: 微调 (Fine-tuning)
┌─────────────────────────────┐           ┌──────────────────────────┐
│ 用海量通用数据训练            │           │ 用少量特定领域数据训练      │
│ (整个互联网的文本)            │ ────────▶ │ (如医学文献、法律文档)     │
│                             │           │                          │
│ 学到：语法、常识、世界知识     │           │ 学到：领域专业知识          │
│ 成本：数百万美元              │           │ 成本：几百到几万美元        │
└─────────────────────────────┘           └──────────────────────────┘

类比：先读完大学通识教育 ──────────────────▶ 再读研究生专攻某个方向

3. 为什么 Transformer 改变了世界？

之前（RNN 时代）	之后（Transformer 时代）
模型小（几百万参数）	模型巨大（千亿到万亿参数）
训练慢（无法并行）	训练快（GPU 并行加速）
只能处理短文本	能处理整本书的长文本
每个任务需要单独的模型	一个大模型搞定几乎所有 NLP 任务
AI 只是玩具	AI 开始改变世界

4. 小结 & 下一步

🎉 Level 2 完成！ 你现在理解了：

✅ 神经网络是如何模仿大脑的
✅ CNN 处理图像、RNN 处理序列
✅ Transformer 为什么是革命性的
✅ 注意力机制如何让 AI “聚焦”
✅ 预训练 + 微调的两阶段训练模式

📚 下一课：[Level 3] 大语言模型 (LLM) — ChatGPT 和 Claude 到底是怎么工作的？

5. 参考资料

Microsoft AI Fundamentals — AI 基础概念
How Generative AI and LLMs Work — 生成式 AI 原理

English Version

1. Overview

Deep Learning is the most powerful branch of machine learning and the core engine of today’s AI revolution. ChatGPT, Claude, self-driving cars, facial recognition — virtually every impressive AI application is powered by deep learning.

The core tool is the Neural Network — a mathematical model inspired by the human brain. The Transformer architecture (2017) completely changed the AI landscape and is the foundation of all modern LLMs.

2. Core Concepts

Neural Networks

Artificial neural networks mimic brain structure: neurons connected in layers, each layer extracting progressively higher-level features from data.

Concept	Analogy	Description
Neuron	Brain cell	Basic unit that receives input, computes, produces output
Weight	Synapse strength	Determines how important each input is
Layer	Brain cortex regions	Shallow layers find simple patterns; deep layers find complex ones
“Deep”	Many layers	“Deep learning” = networks with many layers (tens to hundreds)

Key Architectures

Architecture	Specialization	Core Idea	Limitation
CNN	Images	Sliding filter scans image, extracting features layer by layer	Designed mainly for grid-like data
RNN/LSTM	Sequences (text, audio)	Has “memory” — uses previous information when processing current input	Sequential processing = slow, can’t parallelize
⭐ Transformer	Everything (text, images, audio)	Self-Attention allows every position to attend to every other position	High memory requirements for very long sequences

⭐ Transformer: The Architecture That Changed the World

The Transformer solves RNN’s fundamental problems:

✅ Parallel processing (not sequential) → much faster
✅ Attention mechanism → directly focuses on any relevant position regardless of distance
✅ Scalable → can grow to trillions of parameters

Key variants:

GPT / Claude = Decoder-only (generative)
BERT = Encoder-only (understanding)
Original Transformer = Encoder + Decoder (translation)

⭐ Attention Mechanism

The soul of Transformer. When processing a word, Attention computes how much to “focus on” every other word in the sentence.

Example: In “The cat sat on the mat because it was tired” — Attention helps AI understand “it” refers to “cat” by assigning high attention weight to “cat” when processing “it”.

3. Summary & What’s Next

📚 Next: [Level 3] Large Language Models (LLM) — How ChatGPT and Claude actually work.

4. References

Microsoft AI Fundamentals — AI fundamentals
How Generative AI and LLMs Work — Generative AI principles

📚 Knowledge Base

AI 学习 Level 2: 深度学习基础 — 神经网络与 Transformer 革命