Date / time
Date(s) - 29.10.
15:00 - 16:30
Zijie Zhou – Efficient and Robust Large Language Model (LLM) Inference Scheduling Optimization
Abstract:
We study the problem of scheduling Large Language Model (LLM) inference to minimize total completion time. LLM inference is an online, multi-task, and energy-intensive service process in which a pre-trained LLM processes input requests and generates output tokens sequentially. It is therefore vital to improve scheduling efficiency and reduce power consumption when prompt requests arrive in large volumes. There are two key challenges:

(i) Each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache; the decode length is the number of output tokens generated sequentially, each of which increases the KV cache memory usage by one unit. We show that minimizing total completion time is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze scheduling strategies commonly used in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios are unbounded. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time, and we prove that it achieves a constant competitive ratio. (A sketch of the memory model follows the abstract.)

(ii) The output length, which critically impacts memory usage and processing time, is unknown in advance. We first design a conservative algorithm, Amax, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to overestimation. To overcome this limitation, we propose Amin, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that Amin achieves a log-scale competitive ratio. (A sketch of the adaptive refinement follows the abstract.)
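The memory model in challenge (i) is concrete enough to sketch. Below is a minimal Python sketch under our own naming, not the paper's notation: a request's KV-cache footprint starts at its prefill length and grows by one unit per generated token, and a batch is feasible only while the summed footprints fit within a capacity budget. The names `Request`, `kv_usage`, `batch_fits`, and the `capacity` parameter are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prefill: int   # input prompt length: initial KV-cache usage
    decode: int    # total output tokens to generate

def kv_usage(req: Request, generated: int) -> int:
    # Memory grows linearly: prefill tokens are resident from the start,
    # and each generated output token adds one unit.
    return req.prefill + generated

def batch_fits(batch: list[tuple[Request, int]], capacity: int) -> bool:
    # A batch is feasible at the current step only if the combined
    # KV-cache footprint of all its running requests fits in memory.
    return sum(kv_usage(r, g) for r, g in batch) <= capacity

# Example: two requests, one 10 tokens into decoding, one 3 tokens in.
ok = batch_fits([(Request(100, 50), 10), (Request(20, 5), 3)], capacity=4096)
```

Because footprints grow every decode step, a batch that fits now can overflow later; this is the coupling between batching decisions and linearly increasing memory that the NP-hardness result refers to.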
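Amin is described only at a high level in the abstract. The sketch below illustrates the adaptive idea under our own assumptions: each request arrives with a predicted interval [lo, hi] on its output length, Amin provisionally budgets memory using lo, and revises the budget upward whenever generation outlives the current estimate. The geometric (doubling) revision rule is our illustrative choice, consistent with a logarithmic number of revisions but not confirmed as the talk's exact rule; `refine_estimate`, `lo`, and `hi` are hypothetical names.

```python
def refine_estimate(est: int, generated: int, hi: int) -> int:
    # Illustrative refinement rule: start from the predicted lower
    # bound; whenever the request generates past its current estimate,
    # double the estimate, capped at the predicted upper bound.
    if generated < est:
        return est                    # current estimate still valid
    return min(2 * max(est, 1), hi)   # outlived: grow geometrically

# Example: a request predicted to emit between 8 and 1000 tokens is
# budgeted for 8; after 8 tokens the budget becomes 16, then 32, ...
est = 8
for generated in range(40):
    est = refine_estimate(est, generated, hi=1000)
```

The design intuition is that underestimating and revising wastes at most a geometric sequence of corrections, whereas Amax's upfront worst-case budget blocks batching whenever the upper bound is loose.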
[Presenter] Zijie Zhou (IEDA, HKUST)