Date / time
Date(s) - 29.10.
15:00 - 16:30
Zijie Zhou – Efficient and Robust Large Language Model (LLM) Inference Scheduling Optimization
Abstract:
We study the problem of scheduling Large Language Model (LLM) inference to minimize total completion time. LLM inference is an online, multi-task, and energy-intensive service process in which a pre-trained LLM processes input requests and generates output tokens sequentially. It is therefore vital to improve scheduling efficiency and reduce power consumption when prompt requests arrive in large volumes. There are two key challenges:

(i) Each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache; the decode length is the number of output tokens generated sequentially, each of which increases the KV cache memory usage by one unit. We show that minimizing total completion time is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze scheduling strategies commonly used in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios are unbounded. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time, and we prove that it achieves a constant competitive ratio. (A sketch of the memory model follows the abstract.)

(ii) The output length, which critically impacts memory usage and processing time, is unknown in advance. We first design a conservative algorithm, Amax, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to overestimation. To overcome this limitation, we propose Amin, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that Amin achieves a log-scale competitive ratio. (A sketch of the adaptive refinement follows the abstract.)
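The memory model in challenge (i) is concrete enough to sketch. Below is a minimal Python sketch under our own naming, not the paper's notation: a request's KV-cache footprint starts at its prefill length and grows by one unit per generated token, and a batch is feasible only while the summed footprints fit within a capacity budget. The names `Request`, `kv_usage`, `batch_fits`, and the `capacity` parameter are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prefill: int   # input prompt length: initial KV-cache usage
    decode: int    # total output tokens to generate

def kv_usage(req: Request, generated: int) -> int:
    # Memory grows linearly: prefill tokens are resident from the start,
    # and each generated output token adds one unit.
    return req.prefill + generated

def batch_fits(batch: list[tuple[Request, int]], capacity: int) -> bool:
    # A batch is feasible at the current step only if the combined
    # KV-cache footprint of all its running requests fits in memory.
    return sum(kv_usage(r, g) for r, g in batch) <= capacity

# Example: two requests, one 10 tokens into decoding, one 3 tokens in.
ok = batch_fits([(Request(100, 50), 10), (Request(20, 5), 3)], capacity=4096)
```

Because footprints grow every decode step, a batch that fits now can overflow later; this is the coupling between batching decisions and linearly increasing memory that the NP-hardness result refers to.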
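Amin is described only at a high level in the abstract. The sketch below illustrates the adaptive idea under our own assumptions: each request arrives with a predicted interval [lo, hi] on its output length, Amin provisionally budgets memory using lo, and revises the budget upward whenever generation outlives the current estimate. The geometric (doubling) revision rule is our illustrative choice, consistent with a logarithmic number of revisions but not confirmed as the talk's exact rule; `refine_estimate`, `lo`, and `hi` are hypothetical names.

```python
def refine_estimate(est: int, generated: int, hi: int) -> int:
    # Illustrative refinement rule: start from the predicted lower
    # bound; whenever the request generates past its current estimate,
    # double the estimate, capped at the predicted upper bound.
    if generated < est:
        return est                    # current estimate still valid
    return min(2 * max(est, 1), hi)   # outlived: grow geometrically

# Example: a request predicted to emit between 8 and 1000 tokens is
# budgeted for 8; after 8 tokens the budget becomes 16, then 32, ...
est = 8
for generated in range(40):
    est = refine_estimate(est, generated, hi=1000)
```

The design intuition is that underestimating and revising wastes at most a geometric sequence of corrections, whereas Amax's upfront worst-case budget blocks batching whenever the upper bound is loose.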
[Presenter] Zijie Zhou (IEDA, HKUST)