Linear attention-based models are gaining attention for their faster processing speed and performance comparable to softmax transformers. However, large language models (LLMs), owing to their large size and long sequence lengths, put significant strain on contemporary GPU hardware, because a single GPU's memory limits a model's maximum sequence length.
Sequence Parallelism (SP) methods are typically applied to divide a long sequence into multiple sub-sequences and train them separately on multiple GPUs. However, existing SP methods do not take advantage of linear attention's features, resulting in inefficient parallelism and usability issues.
Researchers from Shanghai AI Laboratory and TapTap present the linear attention sequence parallel (LASP) technique, which optimizes sequence parallelism for linear transformers. It employs point-to-point (P2P) communication for efficient state exchange among GPUs within or across nodes, and it maximizes the use of right-product kernel tricks in linear attention. Importantly, it does not rely on attention-head partitioning, making it adaptable to multi-head, multi-query, and grouped-query attention.
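The key to the P2P design is that the only tensor crossing GPU boundaries is the fixed-size linear-attention state. The snippet below is a hypothetical illustration (not the authors' code) of such an exchange with `torch.distributed` send/recv between adjacent ranks; the helper name `exchange_kv_state` and the assumption that each rank holds exactly one chunk are ours.

```python
import torch
import torch.distributed as dist

def exchange_kv_state(local_state: torch.Tensor) -> torch.Tensor:
    """Hypothetical P2P hand-off of the accumulated KV state.

    Each rank owns one sub-sequence chunk. Rank r receives the running
    (d, d) state from rank r-1 and forwards it, with its own chunk's
    contribution added, to rank r+1. The tensor size depends only on
    the head dimension, never on the sequence length.
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    prev_state = torch.zeros_like(local_state)

    if rank > 0:
        # Receive the accumulated state of all earlier chunks.
        dist.recv(prev_state, src=rank - 1)
    if rank < world_size - 1:
        # Pass the state, now including this chunk's contribution,
        # to the rank that owns the next chunk.
        dist.send(prev_state + local_state, dst=rank + 1)
    return prev_state
```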
LASP employs a tiling approach to partition input sequences into sub-sequence chunks distributed across GPUs. It splits the attention computation into intra-chunk and inter-chunk parts to exploit linear attention's right-product advantage: intra-chunks use conventional attention computation, while inter-chunks exploit kernel tricks. The method also includes data-distribution, forward-pass, and backward-pass mechanisms to improve parallel processing efficiency.
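As a rough single-device sketch of that split (assuming q and k have already passed through a non-negative feature map, and omitting the normalizer for brevity; the function name and chunking loop are illustrative, not the authors' kernels), the loop below computes the intra-chunk term with a masked left-product and the inter-chunk term by multiplying the queries against a running `(d, d)` KV state:

```python
import torch

def chunked_linear_attention(q, k, v, chunk_size):
    """Sketch of chunked causal linear attention on one device.

    q, k, v: (seq_len, d) tensors. The running (d, d) KV state stands in
    for the state LASP exchanges between GPUs; its size is independent
    of sequence length.
    """
    seq_len, d = q.shape
    out = torch.empty_like(v)
    kv_state = torch.zeros(d, d, dtype=q.dtype, device=q.device)

    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        qi, ki, vi = q[start:end], k[start:end], v[start:end]

        # Inter-chunk part: right-product against the accumulated state
        # of all previous chunks, O(d^2) work per token.
        inter = qi @ kv_state

        # Intra-chunk part: conventional (left-product) causal attention
        # restricted to the current chunk.
        scores = torch.tril(qi @ ki.T)
        intra = scores @ vi

        out[start:end] = inter + intra

        # Update the running state with this chunk's contribution.
        kv_state = kv_state + ki.T @ vi

    return out
```

Because the inter-chunk term only touches the running state, distributing the loop body across GPUs leaves each device with the same arithmetic plus a single fixed-size state hand-off, which is the structure LASP parallelizes.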
LASP achieves significant throughput gains for linear attention through its efficient communication design, surpassing DeepSpeed-Ulysses by 38% and Megatron by 136% in throughput at a 256K sequence length on a 1B-parameter model. Moreover, with system optimizations such as kernel fusion and KV state caching, LASP supports longer sequence lengths on the same cluster, reaching 2048K for the 1B model and 512K for the 7B model.
Key contributions of this research are as follows:
- A new SP method tailored to linear attention: enables linear attention-based models to scale to long sequences without being limited by a single GPU's memory.
- Sequence length-independent communication overhead: an elegant communication mechanism harnesses the right-product kernel trick of linear attention so that the exchange of linear attention intermediate states is independent of sequence length.
- GPU-friendly implementation: LASP's execution on GPUs is optimized through meticulous system engineering, including kernel fusion and KV state caching.
- Data-parallel compatibility: LASP is compatible with all batch-level DDP methods, such as PyTorch/Legacy DDP, FSDP, and ZeRO-series optimizers (see the sketch after this list).
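One way to picture that compatibility (a hedged sketch; the group layout, the `build_groups` helper, and `sp_size` are illustrative assumptions, not the paper's code) is to place each GPU in both a sequence-parallel group and a data-parallel group, handing only the latter to DDP:

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_groups(sp_size: int):
    """Split the world into sequence-parallel (SP) and data-parallel (DP) groups.

    Example with 8 GPUs and sp_size=4:
      SP groups: [0,1,2,3], [4,5,6,7]   (each group shares one long sequence)
      DP groups: [0,4], [1,5], [2,6], [3,7]  (each group replicates the model)
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    dp_size = world_size // sp_size

    sp_group = dp_group = None
    for i in range(dp_size):
        ranks = list(range(i * sp_size, (i + 1) * sp_size))
        group = dist.new_group(ranks)  # must be called on every rank
        if rank in ranks:
            sp_group = group
    for j in range(sp_size):
        ranks = list(range(j, world_size, sp_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return sp_group, dp_group

# Usage: gradients are averaged only across the DP group, while the
# LASP-style state exchange would use the SP group.
# sp_group, dp_group = build_groups(sp_size=4)
# model = DDP(model, process_group=dp_group)
```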
In conclusion, LASP is introduced to overcome the limitations of existing SP methods on linear transformers by leveraging linear attention's features to improve parallelism efficiency and usability. Its P2P communication, kernel fusion, and KV state caching reduce communication traffic and improve GPU cluster utilization. Compatibility with batch-level DDP methods makes it practical for large-scale distributed training. Experiments highlight LASP's advantages in scalability, speed, memory usage, and convergence performance compared with existing SP methods.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.