My research interests lie in several aspects of mathematical optimization and machine learning, often motivated by practical needs and observations. Below are the currently ongoing and some past research projects of my group.


LLMs and Optimization for LLMs

Large language models (LLMs) are at the core of the ChatGPT moment, perhaps one of the most innovative achievements in artificial intelligence. LLMs are mostly based on the Transformer architecture, whose attention modules extract the complicated relationships between the tokens of a sentence. The model then autoregressively predicts the next token via a logistic-regression-style linear classifier over its vocabulary. LLMs have billions of parameters and are trained on trillions of tokens, so training an LLM is essentially an extremely large-scale optimization problem.
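As a toy illustration of this next-token prediction pipeline, the PyTorch sketch below embeds a batch of token ids, applies causally masked self-attention, and scores the next token with a linear classifier over the vocabulary. The sizes (vocabulary, model width, number of heads) are illustrative assumptions only; a real LLM stacks many such blocks together with feed-forward layers and normalization.

```python
# Minimal sketch of autoregressive next-token prediction with attention.
# Toy sizes (vocab_size, d_model, n_heads) are illustrative assumptions,
# not the configuration of any real LLM.
import torch
import torch.nn as nn

vocab_size, d_model, n_heads = 1000, 64, 4

class TinyDecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)  # classifier over the vocabulary

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        x = self.embed(tokens)
        T = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x, _ = self.attn(x, x, x, attn_mask=mask)
        return self.lm_head(x)                          # logits for the next token

model = TinyDecoderBlock()
tokens = torch.randint(0, vocab_size, (2, 16))          # a batch of token ids
logits = model(tokens)
# Cross-entropy between the logits at position t and the token at position t+1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
```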

My research group is deeply interested in LLM research, including supervised fine-tuning (SFT), preference optimization, memory-efficient optimization methods, and reasoning models.


Design and Analysis of Stochastic Optimization Methods

The stochastic gradient method (SGD) plays an important role in modern machine learning due to the large scale of the problems that arise. Most supervised learning problems amount to empirical risk minimization, i.e., minimizing a finite-sum objective \(f(w) = \frac{1}{n}\sum_{i = 1}^n f_i(w)\). There are several variants of SGD for solving this type of problem. On the theory side, existing analyses mainly apply to the following i.i.d. sampling version of SGD:

\[w_{t+1} = w_t - \alpha_t \nabla f_{i_t} (w_t).\]

Here, \(i_t\) is sampled from \(\{1,\ldots, n\}\) uniformly at random, and each draw is independent of the previous ones. Though such an SGD scheme is well known and widely analyzed, a quick criticism is that it may not visit all the data points in one epoch, which can be inefficient in practice. What is instead often implemented in practice (e.g., in PyTorch) is the so-called random reshuffling (RR), which in each epoch draws a random permutation of \(\{1,\ldots, n\}\) (denoted by \(\sigma_t\)) and updates by sweeping through the elements of \(\sigma_t\) sequentially; that is, \(i_t\) takes the values in \(\sigma_t\) one after another. Such a method has a much smaller sampling variance and uses every data point in each epoch. In practice, one also uses adaptive step sizes (learning rates) and momentum to stabilize and accelerate training, leading to the well-known Adam method. These stochastic optimization methods and their variants serve as the fundamental solvers of modern machine learning problems.
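A minimal sketch contrasting the i.i.d. sampling scheme with random reshuffling is given below, using a toy least-squares instance. The quadratic \(f_i\), the problem sizes, and the step size are illustrative assumptions only.

```python
# Sketch contrasting i.i.d.-sampling SGD with random reshuffling (RR)
# on the finite-sum objective f(w) = (1/n) sum_i f_i(w). The quadratic
# f_i(w) = 0.5 * (a_i^T w - b_i)^2 below is a placeholder example.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(w, i):                       # gradient of f_i at w
    return (A[i] @ w - b[i]) * A[i]

def sgd_iid(w, alpha, epochs):
    for _ in range(epochs):
        for _ in range(n):               # n independent uniform draws per epoch;
            i = rng.integers(n)          # some data points may never be visited
            w = w - alpha * grad_fi(w, i)
    return w

def sgd_rr(w, alpha, epochs):
    for _ in range(epochs):
        sigma = rng.permutation(n)       # fresh permutation each epoch
        for i in sigma:                  # every data point is used exactly once
            w = w - alpha * grad_fi(w, i)
    return w

w0 = np.zeros(d)
print(sgd_iid(w0.copy(), 1e-2, 20))
print(sgd_rr(w0.copy(), 1e-2, 20))
```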

Understanding the properties of these methods is of practical importance and is a major direction of this project. Another important direction is to design new stochastic optimization methods with better performance.


Nonsmooth and/or Nonconvex Optimization, and Algorithm Design

In general, attaining global optimality in nonconvex optimization is intractable. However, a number of nonconvex problems in machine learning and signal processing exhibit benign properties. For instance, a fundamental problem is to recover \(w^*\) from \(y = Xw^* + e\), given \(y\) and \(X\); we omit the dimensions here for simplicity. Note that the lowercase \(y\), \(w^*\), and \(e\) are not restricted to be vectors; they can also be matrices. Such a problem goes by various names in different communities, depending on the structure imposed on \(w^*\). A family of good examples are variants of (robust) PCA, where \(w^*\) is assumed to be a low-rank matrix. In this project, we consider both the analysis of the landscape of nonconvex optimization problems and the design of efficient algorithms.
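As a concrete illustration of such benign nonconvexity, the sketch below recovers a low-rank matrix from random Gaussian linear measurements by running plain gradient descent on a factorized (Burer-Monteiro style) least-squares objective. The measurement model, problem sizes, initialization scale, and step size are illustrative assumptions, not the specific setting studied in this project.

```python
# Sketch of recovering a low-rank matrix W* from noisy linear measurements
# y_i = <X_i, W*> + e_i via gradient descent on the nonconvex factorized
# objective 0.5/m * sum_i (<X_i, U V^T> - y_i)^2.
# All sizes, the noise level, and the step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 20, 2, 1000                    # matrix size, rank, number of measurements

W_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-r ground truth
Xs = rng.standard_normal((m, d, d))                                  # Gaussian sensing matrices
y = np.einsum('mij,ij->m', Xs, W_star) + 0.01 * rng.standard_normal(m)

def grads(U, V):
    """Gradients of the factorized least-squares loss with respect to U and V."""
    R = np.einsum('mij,ij->m', Xs, U @ V.T) - y   # residuals <X_i, U V^T> - y_i
    G = np.einsum('m,mij->ij', R, Xs) / m         # gradient with respect to W = U V^T
    return G @ V, G.T @ U

U = 0.1 * rng.standard_normal((d, r))             # small random initialization
V = 0.1 * rng.standard_normal((d, r))
eta = 0.2 / np.linalg.norm(W_star, 2)             # conservative step size (synthetic demo)
for _ in range(1000):
    gU, gV = grads(U, V)
    U, V = U - eta * gU, V - eta * gV

print(np.linalg.norm(U @ V.T - W_star) / np.linalg.norm(W_star))    # relative recovery error
```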