Yau, Chung Yiu

I am a 4th-year PhD student at CUHK, Department of Systems Engineering and Engineering Management (SEEM), supervised by Prof. Hoi-To Wai. I graduated with a BSc in Computer Science from CUHK in 2021. My PhD research focuses on distributed optimization algorithms for machine learning, and my research experience also spans other deep learning problems such as LLM quantization and contrastive learning.

Highlights

RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
 Code
tldr: This paper re-emphasizes the role of incoherence processing in Quantization-Aware Training by identifying rotated quantization as a bias-reduction technique for training with straight-through estimated gradients. Our algorithm achieves state-of-the-art 4-bit weight-activation quantized fine-tuned LLMs from full-precision pre-trained models such as Llama and Pythia.
RoSTE surpasses the performance of SOTA quantization methods on fine-tuning benchmarks. The horizontal axis represents the total number of hours needed to fine-tune pre-trained LLMs on a server of 8 \(\times\) A100 NVIDIA GPUs.
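Below is a minimal PyTorch-style sketch of the rotated straight-through quantization idea described above; the function names and the choice of rotation matrix are illustrative, and RoSTE's own rotation selection and fine-tuning recipe are described in the paper.

```python
import torch

def quantize_ste(w, bits=4):
    # Symmetric uniform quantizer with a straight-through estimator (STE):
    # the forward pass rounds to the quantization grid, while the backward
    # pass treats rounding as the identity map.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # gradients flow through w unchanged

def rotated_quantize_ste(w, rotation, bits=4):
    # Rotating the weights with an orthogonal matrix before quantization spreads
    # out outliers (incoherence processing), which reduces the bias of the STE
    # gradient; the inverse rotation restores the original coordinate system.
    return quantize_ste(w @ rotation, bits) @ rotation.T
```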
tldr: We propose the first asynchronous decentralized optimization algorithm that utilizes the primal-dual framework on random graphs with randomly sparsified communication. The algorithm operates in practical scenarios such as decentralized systems with unstable pairwise communication and asynchronous gradient computation.
This figure demonstrates how FSPDA tolerates different levels of sparse communication on a sparse random graph while converging to the same magnitude of stationarity, due to the transient effect of the sparsity error. Only the consensus error is dominantly affected by the sparsity error.
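To illustrate the communication model (not FSPDA itself, which additionally maintains dual variables that correct for the sparsity error), here is a hypothetical sketch of one randomly sparsified pairwise exchange between two neighboring nodes:

```python
import numpy as np

def sparsified_gossip_step(x_i, x_j, keep_prob=0.1, rng=None):
    # One asynchronous pairwise exchange: only a random subset of coordinates
    # is transmitted and averaged; the remaining coordinates stay untouched.
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x_i.shape) < keep_prob
    avg = 0.5 * (x_i + x_j)
    return np.where(mask, avg, x_i), np.where(mask, avg, x_j)
```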

Publications

tldr: This paper extends the random graph primal-dual decentralized optimization algorithm with non-linear noisy communication compression, such as quantization with noise. It is the first algorithm that supports both non-linear compressed communication and random graph gossip.
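As an example of the kind of non-linear, noisy compressor covered by this setting (a generic randomized quantizer, not necessarily the one analyzed in the paper):

```python
import numpy as np

def stochastic_quantize(x, levels=16, rng=None):
    # Randomized uniform quantizer: round up or down with probability proportional
    # to the distance to each grid point, so the output is unbiased in expectation
    # even though the map itself is non-linear.
    if rng is None:
        rng = np.random.default_rng()
    scale = np.max(np.abs(x)) + 1e-12
    y = np.abs(x) / scale * (levels - 1)
    low = np.floor(y)
    q = low + (rng.random(x.shape) < (y - low))
    return np.sign(x) * q / (levels - 1) * scale
```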
EMC\(^2\): Efficient MCMC Negative Sampling for Contrastive Learning with Global Convergence
 Code Poster Video Slides
tldr: We apply MCMC sampling to draw negative samples for optimizing the global contrastive loss, an upper bound of the InfoNCE loss. Our algorithm EMC\(^2\) improves upon the baselines in small batch size training.
EMC\(^2\) shows fast convergence on the global contrastive loss using a batch size of 4 samples per step, training ResNet-18 on an STL-10 subset with SGD.
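The core sampling step can be pictured as a short Metropolis-Hastings chain over candidate negatives whose stationary distribution weights harder negatives more heavily; the sketch below is illustrative (EMC\(^2\) keeps such chains alive across training iterations rather than restarting them):

```python
import numpy as np

def mh_negative_sample(sim_scores, current, temperature=0.5, n_steps=5, rng=None):
    # Metropolis-Hastings over candidate negatives with stationary distribution
    # softmax(sim_scores / temperature): propose a uniformly random candidate
    # and accept it with the standard MH acceptance probability.
    if rng is None:
        rng = np.random.default_rng()
    n = len(sim_scores)
    for _ in range(n_steps):
        proposal = rng.integers(n)
        log_accept = (sim_scores[proposal] - sim_scores[current]) / temperature
        if np.log(rng.random()) < log_accept:
            current = proposal
    return current
```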
tldr: We propose a compressed decentralized optimization algorithm that utilizes contractive compressors and the primal-dual framework, and analyze its convergence when using exact gradients on nonconvex objective functions.
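A standard example of a contractive compressor is top-\(k\) sparsification, which satisfies \(\|C(x) - x\|^2 \le (1 - k/d)\|x\|^2\); the sketch below shows this one instance of the compressor class assumed in the analysis:

```python
import numpy as np

def topk_compressor(x, k):
    # Keep only the k largest-magnitude entries and zero out the rest;
    # a classical contractive compressor used in compressed decentralized methods.
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out
```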
tldr: A decentralized optimization algorithm that supports time-varying graphs, communication compression, and asynchronous local updates. The algorithm is constructed from a primal-dual framework and is closely connected to the class of gradient tracking algorithms.
Network Effects in Performative Prediction Games
tldr: We study the existence of equilibrium solutions in a networked performative prediction game, with performative distribution shift on one graph and cooperative aggregation on another graph.
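As a rough illustration of the equilibrium notion (the notation here is mine and omits the cooperative aggregation graph), each agent \(i\) is required to be optimal against the data distribution induced by the deployed decisions of itself and its neighbors on the shift graph:

\[
\theta_i^\star \in \arg\min_{\theta}\; \mathbb{E}_{z \sim \mathcal{D}_i\left(\theta,\, \{\theta_j^\star\}_{j \in \mathcal{N}_i}\right)} \big[\ell(\theta; z)\big] \quad \text{for all agents } i.
\]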
tldr: DoCoM is a decentralized optimization algorithm based on gradient tracking that supports communication compression such as sparsification and quantization. DoCoM incorporates a variance-reduced gradient tracker and speeds up non-convex stochastic optimization, achieving a convergence rate of \(\mathcal{O}(T^{-2/3})\).
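For context, the uncompressed gradient tracking recursion that such methods build on can be written as follows (a standard form; DoCoM additionally compresses the exchanged variables and replaces the local gradient with a variance-reduced estimator):

\[
x_i^{t+1} = \sum_{j} W_{ij} x_j^{t} - \eta\, g_i^{t}, \qquad
g_i^{t+1} = \sum_{j} W_{ij} g_j^{t} + \nabla f_i(x_i^{t+1}) - \nabla f_i(x_i^{t}),
\]

where \(W\) is a doubly stochastic mixing matrix and \(g_i^t\) tracks the network-average gradient.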
tldr: An analysis of the lower bound on the communication cost of distributed optimization for overparameterized problems, together with a class of algorithms whose upper bound matches it in terms of the model dimension.
tldr: We study the performative prediction problem in a networked setting, where each learner's data distribution exhibits a local distribution shift while multiple learners on the network seek a consensus solution.
DAP-BERT: Differentiable Architecture Pruning of BERT
 Code
tldr: A model pruning method for BERT that optimizes a knowledge distillation objective over differentiable architecture parameters.
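A hypothetical sketch of what such a distillation-plus-pruning objective can look like; the names, the sigmoid gating of the architecture parameters, and the sparsity regularizer are illustrative rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def distillation_pruning_loss(student_logits, teacher_logits, arch_params,
                              temperature=2.0, sparsity_weight=0.01):
    # Knowledge distillation: match the pruned (student) model's softened output
    # distribution to the full (teacher) model's.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Encourage the differentiable architecture (pruning) gates toward sparsity.
    expected_keep = torch.sigmoid(arch_params).mean()
    return kd + sparsity_weight * expected_keep
```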