Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling [PRETRAIN] 01-09-2025
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization [ATTENTION] 14-08-2025
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [RL] 01-08-2025
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms [DISTRIBUTED] 01-07-2025
Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures [PAPER] 01-06-2025
The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm [OPTIMIZERS] 01-05-2025
Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [PRETRAIN] 01-03-2025
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training [DISTRIBUTED] 01-10-2024
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models [FINETUNING] 01-07-2024
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs [QUANTIZATION] 01-09-2023
PanGu-alpha: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [MODEL RELEASE] 01-04-2021