Teaching LLMs to Parallelize Their Own Generations

Researchers from the Massachusetts Institute of Technology and Google have developed Pasta, a novel technique that enables large language models to parallelize their own text generation by producing semantically independent parts of a response simultaneously. The system teaches models to identify and annotate opportunities for asynchronous generation, enabling a purpose-built inference system to decode multiple sub-parts […]
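The annotate-then-decode-in-parallel idea described above can be sketched in miniature. Everything here is illustrative: the `<async>` tag format, the `decode_span` stand-in, and the reassembly logic are assumptions for the sketch, not Pasta's actual syntax or inference system.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical annotated model output: semantically independent spans are
# wrapped in <async>...</async> tags (an invented format for illustration).
annotated = (
    "Here are three fruits: "
    "<async>1. Apples are crisp.</async>"
    "<async>2. Bananas are sweet.</async>"
    "<async>3. Cherries are tart.</async>"
)

def decode_span(span: str) -> str:
    # Stand-in for decoding one independent sub-part; a real inference
    # system would run the model's decoder on each span concurrently.
    return span.upper()

def parallel_decode(text: str) -> str:
    # Separate the serial prefix from the annotated independent spans.
    prefix = re.split(r"<async>", text, maxsplit=1)[0]
    spans = re.findall(r"<async>(.*?)</async>", text)
    # Decode all independent spans simultaneously, then stitch the
    # results back together in their original order.
    with ThreadPoolExecutor() as pool:
        decoded = list(pool.map(decode_span, spans))
    return prefix + " ".join(decoded)

print(parallel_decode(annotated))
# → Here are three fruits: 1. APPLES ARE CRISP. 2. BANANAS ARE SWEET. 3. CHERRIES ARE TART.
```

The key property being simulated is that each annotated span is decoded without waiting on its siblings, and ordering is restored only at assembly time.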
