Yau, Chung Yiu

I am a 4th-year PhD student at CUHK, Department of Systems Engineering and Engineering Management (SEEM), supervised by Prof. Hoi-To Wai. I graduated with a BSc in Computer Science from CUHK in 2021. My PhD research focuses on distributed optimization algorithms for machine learning, and my research experience also spans other deep learning problems such as LLM quantization and contrastive learning.

Highlights

RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
 Code
tldr: This paper re-emphasizes the role of incoherence processing in Quantization-Aware Training by identifying rotated quantization as a bias-reduction technique for training with straight-through estimated gradients. Our algorithm achieves state-of-the-art 4-bit weight-activation quantized fine-tuned LLMs from full-precision pre-trained models such as Llama and Pythia.
RoSTE surpasses the performance of SOTA quantization methods on fine-tuning benchmarks. The horizontal axis represents the total number of hours needed to fine-tune pre-trained LLMs on a server of 8 \(\times\) A100 NVIDIA GPUs.
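Below is a minimal PyTorch-style sketch of the rotated straight-through quantization idea described above; the function names and the choice of rotation matrix are illustrative, and RoSTE's own rotation selection and fine-tuning recipe are described in the paper.

```python
import torch

def quantize_ste(w, bits=4):
    # Symmetric uniform quantizer with a straight-through estimator (STE):
    # the forward pass rounds to the quantization grid, while the backward
    # pass treats rounding as the identity map.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # gradients flow through w unchanged

def rotated_quantize_ste(w, rotation, bits=4):
    # Rotating the weights with an orthogonal matrix before quantization spreads
    # out outliers (incoherence processing), which reduces the bias of the STE
    # gradient; the inverse rotation restores the original coordinate system.
    return quantize_ste(w @ rotation, bits) @ rotation.T
```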
tldr: We propose the first asynchronous decentralized optimization algorithm that utilizes the primal-dual framework on random graphs with randomly sparsified communication. The algorithm operates in practical scenarios such as decentralized systems with unstable pairwise communication and asynchronous gradient computation.
This figure demonstrates how FSPDA tolerates different levels of sparse communication on a sparse random graph while converging to the same magnitude of stationarity, due to the transient effect of the sparsity error. Only the consensus error is dominantly affected by the sparsity error.
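To illustrate the communication model (not FSPDA itself, which additionally maintains dual variables that correct for the sparsity error), here is a hypothetical sketch of one randomly sparsified pairwise exchange between two neighboring nodes:

```python
import numpy as np

def sparsified_gossip_step(x_i, x_j, keep_prob=0.1, rng=None):
    # One asynchronous pairwise exchange: only a random subset of coordinates
    # is transmitted and averaged; the remaining coordinates stay untouched.
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x_i.shape) < keep_prob
    avg = 0.5 * (x_i + x_j)
    return np.where(mask, avg, x_i), np.where(mask, avg, x_j)
```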

Publications

tldr: This paper extends the random graph primal-dual decentralized optimization algorithm with non-linear noisy communication compression, such as quantization with noise. It is the first algorithm that supports both non-linear compressed communication and random graph gossip.
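As an example of the kind of non-linear, noisy compressor covered by this setting (a generic randomized quantizer, not necessarily the one analyzed in the paper):

```python
import numpy as np

def stochastic_quantize(x, levels=16, rng=None):
    # Randomized uniform quantizer: round up or down with probability proportional
    # to the distance to each grid point, so the output is unbiased in expectation
    # even though the map itself is non-linear.
    if rng is None:
        rng = np.random.default_rng()
    scale = np.max(np.abs(x)) + 1e-12
    y = np.abs(x) / scale * (levels - 1)
    low = np.floor(y)
    q = low + (rng.random(x.shape) < (y - low))
    return np.sign(x) * q / (levels - 1) * scale
```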
EMC\(^2\): Efficient MCMC Negative Sampling for Contrastive Learning with Global Convergence
 Code Poster Video Slides
tldr: We apply MCMC sampling to draw negative samples for optimizing the global contrastive loss, an upper bound of the InfoNCE loss. Our algorithm EMC\(^2\) improves upon the baselines in small batch size training.
EMC\(^2\) shows fast convergence on the global contrastive loss using a batch size of 4 samples per step, training ResNet-18 on an STL-10 subset with SGD.
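The core sampling step can be pictured as a short Metropolis-Hastings chain over candidate negatives whose stationary distribution weights harder negatives more heavily; the sketch below is illustrative (EMC\(^2\) keeps such chains alive across training iterations rather than restarting them):

```python
import numpy as np

def mh_negative_sample(sim_scores, current, temperature=0.5, n_steps=5, rng=None):
    # Metropolis-Hastings over candidate negatives with stationary distribution
    # softmax(sim_scores / temperature): propose a uniformly random candidate
    # and accept it with the standard MH acceptance probability.
    if rng is None:
        rng = np.random.default_rng()
    n = len(sim_scores)
    for _ in range(n_steps):
        proposal = rng.integers(n)
        log_accept = (sim_scores[proposal] - sim_scores[current]) / temperature
        if np.log(rng.random()) < log_accept:
            current = proposal
    return current
```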
tldr: We propose a compressed decentralized optimization algorithm that utilizes contractive compressors and the primal-dual framework, and analyze its convergence when using exact gradients on nonconvex objective functions.
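A standard example of a contractive compressor is top-\(k\) sparsification, which satisfies \(\|C(x) - x\|^2 \le (1 - k/d)\|x\|^2\); the sketch below shows this one instance of the compressor class assumed in the analysis:

```python
import numpy as np

def topk_compressor(x, k):
    # Keep only the k largest-magnitude entries and zero out the rest;
    # a classical contractive compressor used in compressed decentralized methods.
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out
```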
tldr: A decentralized optimization algorithm that supports time-varying graphs, communication compression, and asynchronous local updates. The algorithm is constructed from a primal-dual framework and is closely connected to the class of gradient tracking algorithms.
Network Effects in Performative Prediction Games
tldr: We study the existence of equilibrium solutions in a networked performative prediction game, with performative distribution shift on one graph and cooperative aggregation on another graph.
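As a rough illustration of the equilibrium notion (the notation here is mine and omits the cooperative aggregation graph), each agent \(i\) is required to be optimal against the data distribution induced by the deployed decisions of itself and its neighbors on the shift graph:

\[
\theta_i^\star \in \arg\min_{\theta}\; \mathbb{E}_{z \sim \mathcal{D}_i\left(\theta,\, \{\theta_j^\star\}_{j \in \mathcal{N}_i}\right)} \big[\ell(\theta; z)\big] \quad \text{for all agents } i.
\]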
tldr: DoCoM is a decentralized optimization algorithm based on gradient tracking that supports communication compression such as sparsification and quantization. DoCoM incorporates a variance-reduced gradient tracker and speeds up non-convex stochastic optimization, achieving a convergence rate of \(\mathcal{O}(T^{-2/3})\).
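For context, the uncompressed gradient tracking recursion that such methods build on can be written as follows (a standard form; DoCoM additionally compresses the exchanged variables and replaces the local gradient with a variance-reduced estimator):

\[
x_i^{t+1} = \sum_{j} W_{ij} x_j^{t} - \eta\, g_i^{t}, \qquad
g_i^{t+1} = \sum_{j} W_{ij} g_j^{t} + \nabla f_i(x_i^{t+1}) - \nabla f_i(x_i^{t}),
\]

where \(W\) is a doubly stochastic mixing matrix and \(g_i^t\) tracks the network-average gradient.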
tldr: An analysis of the lower bound on the communication cost of distributed optimization for overparameterized problems, together with a class of algorithms whose upper bound matches it in terms of the model dimension.
tldr: We study the performative prediction problem in a networked setting, where each learner's data distribution exhibits a local distribution shift while multiple learners on the network seek a consensus solution.
DAP-BERT: Differentiable Architecture Pruning of BERT
 Code
tldr: A model pruning method for BERT that optimizes a knowledge distillation objective over differentiable architecture parameters.
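A hypothetical sketch of what such a distillation-plus-pruning objective can look like; the names, the sigmoid gating of the architecture parameters, and the sparsity regularizer are illustrative rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def distillation_pruning_loss(student_logits, teacher_logits, arch_params,
                              temperature=2.0, sparsity_weight=0.01):
    # Knowledge distillation: match the pruned (student) model's softened output
    # distribution to the full (teacher) model's.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Encourage the differentiable architecture (pruning) gates toward sparsity.
    expected_keep = torch.sigmoid(arch_params).mean()
    return kd + sparsity_weight * expected_keep
```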