leonardlin's Collections: speed (updated)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Paper • 2312.11514 • Published • 260

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Paper • 2312.12456 • Published • 45

Accelerating LLM Inference with Staged Speculative Decoding
Paper • 2308.04623 • Published • 26

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Paper • 2208.07339 • Published • 5

Efficient Memory Management for Large Language Model Serving with PagedAttention
Paper • 2309.06180 • Published • 38

Efficient LLM inference solution on Intel GPU
Paper • 2401.05391 • Published • 11

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Paper • 2401.08671 • Published • 15

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 59

Zero Bubble Pipeline Parallelism
Paper • 2401.10241 • Published • 25

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Paper • 2312.12728 • Published

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Paper • 2401.15077 • Published • 20

SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Paper • 2401.15024 • Published • 73

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Paper • 2311.03687 • Published

Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 20

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Paper • 2402.02750 • Published • 4

Paper • 2402.04925 • Published • 4

FAST: Factorizable Attention for Speeding up Transformers
Paper • 2402.07901 • Published • 3

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Paper • 2401.18079 • Published • 8

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
Paper • 2404.08763 • Published • 2

Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Paper • 2405.10637 • Published • 22

Distributed Speculative Inference of Large Language Models
Paper • 2405.14105 • Published • 18

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Paper • 2406.16858 • Published • 1

PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Paper • 2406.06282 • Published • 39