Auxiliary-Loss-Free Load Balancing

Implementation guide. Preliminary: from the original DeepSeek-V3 paper on auxiliary-loss-free load balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on an auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. Specifically, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i, t}$ to determine the top-K routing: ...
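The mechanism quoted above can be sketched in a few lines of plain Python: the bias $b_i$ is added to each expert's affinity score only when selecting the top-K experts, and the bias is nudged after each step against the observed load. This is a simplified illustration, not the paper's exact implementation; function names and the `gamma` update speed are assumptions for the sketch.

```python
def topk_with_bias(scores, bias, k):
    """Pick the k experts with the largest *biased* affinity scores.
    Note: only selection uses the bias; the gate value would still be
    computed from the original, unbiased score."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, load, target_load, gamma=0.001):
    """After each training step, decrease the bias of overloaded experts
    and increase it for underloaded ones (gamma = bias update speed)."""
    return [b - gamma if l > target_load else b + gamma
            for b, l in zip(bias, load)]
```

Because the bias only reshuffles which experts win the top-K race, the routing can be rebalanced without adding any gradient-carrying auxiliary loss term.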

June 9, 2025 · 12 min · 2473 words · xxraincandyxx

Normalization

Q: I am building a ViT-like transformer-based image classification model. Where do you recommend placing the normalization layer, and which type do you suggest would be the most efficient? A: Excellent question! This is a critical design choice that directly impacts training stability, convergence speed, and final model performance. Let’s break down the placement and the type of normalization layer for your ViT-like model. Summary Recommendation For the best combination of training stability and performance, the standard and highly recommended approach is: ...
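The standard choice the answer points toward is Pre-LN: LayerNorm applied before the attention and MLP sub-layers, with the residual adding the unnormalized input. A minimal PyTorch sketch (dimensions and class name are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-LN transformer block: LayerNorm comes *before* each sub-layer,
    and the skip connection carries the unnormalized input."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)                                   # normalize first
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual add
        x = x + self.mlp(self.norm2(x))
        return x
```

Compared with the original Post-LN placement, Pre-LN tends to train stably without a learning-rate warmup schedule, which is why most modern ViT implementations default to it.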

June 7, 2025 · 6 min · 1189 words · xxraincandyxx

Skip-Connection Instruction

A Practical Guide to Using Skip Connections Part 1: The “Why” - What Problems Do They Solve? Before adding them, it’s crucial to understand why they are so effective. Solving the Vanishing Gradient Problem: Problem: In very deep networks, the gradient (the signal used for learning) must be backpropagated from the final layer to the initial layers. With each step backward through a layer, the gradient is multiplied by the layer’s weights. If these weights are small (less than 1), the gradient can shrink exponentially, becoming so tiny that the early layers learn extremely slowly or not at all. This is the vanishing gradient problem. Solution: A skip connection creates a direct path for the gradient to flow. It’s like an “information highway” that bypasses several layers. The gradient is passed back through the addition/concatenation operation, providing a direct, uninterrupted path to the earlier layers, keeping the signal strong. Solving the Degradation Problem: ...
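The "information highway" described above is just an elementwise addition around a sub-network. A minimal residual block (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity path gives gradients a direct,
    multiplication-free route back to earlier layers."""
    def __init__(self, dim=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)   # skip connection via addition
```

During backpropagation the Jacobian of `y = x + F(x)` is `I + dF/dx`, so even if the gradient through `body` shrinks toward zero, the identity term keeps the signal intact.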

June 7, 2025 · 7 min · 1358 words · xxraincandyxx

Reinforcement Learning: Preliminary

Okay, here’s a general guide for modeling and training Reinforcement Learning (RL) agents using PyTorch. This guide will cover the core components and steps, assuming you have a basic understanding of RL concepts (agent, environment, state, action, reward). Core RL Components in PyTorch Environment: Typically, you’ll use a library like gymnasium (the maintained fork of OpenAI Gym). Key methods: env.reset(), env.step(action), env.render(), env.close(). Key attributes: env.observation_space, env.action_space. Agent: The learning entity. It usually consists of: ...
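The interaction loop built from those key methods can be sketched environment-agnostically. `DummyEnv` below is a hypothetical stand-in following the gymnasium-style `reset()`/`step()` interface; a real setup would use something like `gymnasium.make("CartPole-v1")`:

```python
class DummyEnv:
    """Illustrative stand-in with a gymnasium-style interface:
    reset() -> (obs, info); step(a) -> (obs, reward, terminated, truncated, info)."""
    def __init__(self):
        self._t = 0

    def reset(self, seed=None):
        self._t = 0
        return 0.0, {}

    def step(self, action):
        self._t += 1
        terminated = self._t >= 5          # episode ends after 5 steps
        return float(self._t), 1.0, terminated, False, {}

def run_episode(env, policy):
    """Run one episode, choosing actions with `policy`, return total reward."""
    obs, _ = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total
```

The same loop works unchanged with any gymnasium environment; the agent's learning update would slot in after each `step` call.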

June 2, 2025 · 8 min · 1684 words · xxraincandyxx

DeepLearning Dataclass Guide

Okay, let’s craft a general and uniform guide for building a dataset class for image processing large models, focusing on PyTorch and PyTorch Lightning. This structure is highly adaptable. Core Principles for Your Dataset Class: Uniformity: The interface (__init__, __len__, __getitem__) should be consistent. Flexibility: Easily accommodate different data sources, label types, and transformations. Efficiency: Load data on-the-fly, leverage multi-processing in DataLoader, and handle large datasets without excessive memory usage. Clarity: Code should be well-commented and easy to understand. Reproducibility: Ensure that given the same settings, the dataset behaves identically (especially important for train/val/test splits). We’ll structure this around: ...
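The uniform `__init__` / `__len__` / `__getitem__` interface described above can be sketched as a minimal skeleton; `samples` and the class name are illustrative assumptions, and `transform` can be any callable (e.g. a torchvision transform pipeline):

```python
from torch.utils.data import Dataset

class ImageListDataset(Dataset):
    """Minimal on-the-fly dataset: stores lightweight (sample, label)
    references and applies the transform lazily in __getitem__."""
    def __init__(self, samples, transform=None):
        self.samples = samples          # list of (image_or_path, label)
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, label = self.samples[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```

Keeping `__getitem__` cheap and side-effect-free is what lets a `DataLoader` parallelize loading across workers without blowing up memory.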

May 29, 2025 · 16 min · 3225 words · xxraincandyxx