📊 Parallel Strategies
🖥️ Multi-GPU Deployment (Tensor Parallelism)
# Deploy a 70B model across 4 GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9
# Deploy across 8 GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 8
🌐 Multi-Node Deployment

With --tensor-parallel-size 4 and --data-parallel-size 4, the deployment spans 16 GPUs in total; --data-parallel-size-local 2 places 2 of the 4 data-parallel replicas on each node, so each node needs 8 GPUs (2 replicas × TP 4).

Node 1 (Master):
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 0 \
    --data-parallel-address <master-ip> \
    --data-parallel-rpc-port 13345

Node 2 (Worker):
# --headless: worker nodes run engine replicas only; the API is served by the master
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --headless \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-address <master-ip> \
    --data-parallel-rpc-port 13345
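The only flags that differ between nodes are --data-parallel-start-rank, which equals node index × --data-parallel-size-local, and --headless on workers. A minimal launcher sketch under that assumption (NODE_INDEX is our own variable, not a vLLM flag):

#!/usr/bin/env bash
# Run with NODE_INDEX=0 on the master, NODE_INDEX=1 on the worker.
NODE_INDEX=${NODE_INDEX:?set to 0 (master) or 1 (worker)}
DP_LOCAL=2   # data-parallel replicas per node

EXTRA_ARGS=""
if [ "$NODE_INDEX" -ne 0 ]; then
    EXTRA_ARGS="--headless"   # worker nodes don't serve the API
fi

vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --data-parallel-size-local $DP_LOCAL \
    --data-parallel-start-rank $((NODE_INDEX * DP_LOCAL)) \
    --data-parallel-address <master-ip> \
    --data-parallel-rpc-port 13345 \
    $EXTRA_ARGS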
🔄 Load Balancing

Recommended production architecture:
           ┌─────────────┐
           │    Nginx    │
           │ Load Balance│
           └──────┬──────┘
     ┌────────────┼────────────┐
     ↓            ↓            ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ vLLM:8001│ │ vLLM:8002│ │ vLLM:8003│
└──────────┘ └──────────┘ └──────────┘
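A minimal Nginx sketch of this topology, assuming the three instances were started with --port 8001/8002/8003 (the upstream name and timeout values are illustrative, not prescribed):

upstream vllm_backends {
    least_conn;                  # route to the instance with the fewest open requests
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_http_version 1.1;
        proxy_buffering off;         # required for streamed (SSE) token responses
        proxy_read_timeout 300s;     # long generations can exceed default timeouts
    }
}

least_conn suits LLM serving better than round-robin because request costs vary widely with output length, so connection count is a better proxy for load.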