Yau, Chung Yiu

I am currently a postdoc at the University of Minnesota, working with Prof. Mingyi Hong. I received my Ph.D. (2025, supervised by Prof. Hoi-To Wai) and B.Sc. (2021) from The Chinese University of Hong Kong.

Highlights

EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization
 Code
tldr: This paper stabilizes Nesterov's lookahead acceleration for deep learning by replacing the standard one-step lookahead direction with an exponential moving average of parameter updates. The resulting EMA-Nesterov method preserves accelerated convergence guarantees in convex settings and improves performance across language model pre-training benchmarks compared with prior lookahead methods.
EMA-Nesterov accelerates Muon by 6% in the NanoGPT Track 3 speed run.

Publications

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates
tldr: This paper revisits why SGD underperforms Adam in LLM pre-training and identifies the main issue as SGD's difficulty in sustaining the large effective learning rates that Adam naturally achieves. It shows that large weight-to-gradient ratios and occasional severe output-layer gradient spikes constrain SGD, while simple clipping mechanisms stabilize large-learning-rate SGD enough to close most of the performance gap to Adam.
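For readers unfamiliar with clipping, here is a minimal sketch of one common variant, global-norm gradient clipping applied before a plain SGD step. It is a generic illustration and not necessarily the clipping mechanism studied in the paper; the function names `clip_by_global_norm` and `sgd_step` are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # Scale is 1 when the norm is already within the budget.
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def sgd_step(params, grads, lr, max_norm):
    """One SGD step using clipped gradients (illustrative, no momentum)."""
    clipped, _ = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]
```

With clipping in place, a single gradient spike can move the parameters by at most `lr * max_norm` in L2 norm, which is what makes a large learning rate tolerable in this sketch.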
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
 Code Slides
tldr: This paper re-emphasizes the role of incoherence processing in Quantization-Aware Training by identifying rotated quantization as a bias-reduction technique for training with straight-through estimated gradients. Our algorithm produces state-of-the-art 4-bit weight-activation quantized fine-tuned LLMs from full-precision pre-trained models such as Llama and Pythia.
RoSTE surpasses the performance of SOTA quantization methods on fine-tuning benchmarks. The horizontal axis represents the total number of hours needed to fine-tune pre-trained LLMs on a server with 8 \(\times\) NVIDIA A100 GPUs.
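As a rough illustration of rotated quantization, the sketch below rotates a vector with a normalized Hadamard matrix, applies symmetric uniform 4-bit quantization, and rotates back. This is a generic incoherence-processing example under assumed choices (Hadamard rotation, max-abs scaling), not the RoSTE algorithm itself, and the straight-through gradient is not shown. All function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_4bit(x):
    """Symmetric uniform 4-bit quantization: integer codes in [-8, 7] and a scale."""
    scale = max(float(np.max(np.abs(x))), 1e-12) / 7.0
    codes = np.clip(np.round(x / scale), -8, 7)
    return codes, scale

def rotated_quantize(x, H):
    """Quantize in the rotated basis, then map back to the original basis."""
    codes, scale = quantize_4bit(H @ x)
    return H.T @ (codes * scale)
```

Because the rotation is orthogonal, the error measured in the original basis has the same norm as the rounding error in the rotated basis, where a single large outlier is spread across all coordinates.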
tldr: We propose the first asynchronous decentralized optimization algorithm that uses the primal-dual framework on random graphs with randomly sparsified communication. Our algorithms operate in practical scenarios such as decentralized systems with unstable pairwise communication and asynchronous gradient computation.
This figure demonstrates how FSPDA-SA tolerates different levels of sparse communication on sparse random graphs while converging to the same magnitude of stationarity, owing to the transient effect of the sparsity error. Only the consensus error is dominantly affected by the sparsity error.
tldr: This paper extends the random graph primal-dual decentralized optimization algorithm (FSPDA) with non-linear noisy communication compression, such as quantization with noise. It is the first algorithm to combine non-linear compressed communication with random graph gossip.
EMC\(^2\): Efficient MCMC Negative Sampling for Contrastive Learning with Global Convergence
 Code Poster  Video Slides
tldr: We apply MCMC sampling to draw negative samples for optimizing the global contrastive loss, an upper bound of InfoNCE. Our algorithm EMC\(^2\) improves upon the baselines on small batch size training.
EMC\(^2\) shows fast convergence on the global contrastive loss using a batch size of 4 samples per step, training ResNet-18 on a subset of STL-10 with SGD.
tldr: We propose a compressed decentralized optimization algorithm that uses a contractive compressor and the primal-dual framework, and analyze its convergence when using exact gradients on nonconvex objective functions.
tldr: A decentralized optimization algorithm that supports time-varying graphs, communication compression and asynchronous local updates. The algorithm is constructed from a primal-dual framework and is closely connected to the class of gradient tracking algorithms.
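A standard example of a contractive compressor is top-k sparsification, which satisfies \(\|C(x) - x\|^2 \le (1 - k/d)\|x\|^2\) for a \(d\)-dimensional vector \(x\). The sketch below is a generic illustration of that property, not the compressor used in the paper; the name `top_k` is hypothetical.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of a 1-D array x and zero out the rest."""
    out = np.zeros_like(x)
    # argpartition places the indices of the k largest |x_i| in the last k slots.
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
cx = top_k(x, 10)

# Contraction check: the compression error is at most (1 - k/d) of the signal energy.
err = float(np.sum((cx - x) ** 2))
bound = (1 - 10 / 100) * float(np.sum(x ** 2))
```

Only the k retained entries and their indices need to be transmitted, which is where the communication saving comes from.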
Network Effects in Performative Prediction Games
tldr: We study the existence of equilibrium solutions in a networked performative prediction game, with performative distribution shift on one graph and cooperative aggregation on another graph.
tldr: DoCoM is a decentralized optimization algorithm based on gradient tracking that supports communication compression such as sparsification and quantization. DoCoM incorporates a variance-reduced gradient tracker and speeds up the convergence of non-convex stochastic optimization to the rate of \(\mathcal{O}(T^{-2/3})\).
tldr: An analysis of the lower bound on the communication cost of distributed optimization for overparameterized problems, together with a class of algorithms that achieve a matching upper bound in terms of model dimension.
tldr: We study the performative prediction problem in a networked setting, where each learner's data distribution undergoes a local distribution shift while multiple learners on the network seek a consensus solution.
DAP-BERT: Differentiable Architecture Pruning of BERT
 Code
tldr: A model pruning method for BERT that optimizes a knowledge distillation objective.