BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To enable fair model comparisons, we therefore propose a new benchmark, called BUFFET, which unifies 16 diverse tasks across 57 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer across a broad range of tasks and languages. Using BUFFET, we perform thorough evaluations of state-of-the-art multilingual large language models with different learning methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. In particular, ChatGPT with in-context learning often performs worse than much smaller mT5-base models fine-tuned on English task data and few-shot in-language examples. Our analysis suggests various avenues for future research in few-shot cross-lingual transfer, such as improved training, in-context learning, and future evaluations.
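To make the evaluation setup concrete, the sketch below shows one plausible way to assemble a fixed-demonstration in-context learning prompt for a task cast in sequence-to-sequence (text-in/text-out) format, as the abstract describes. This is a minimal illustration, not BUFFET's actual prompt template: the function names, the `Input:`/`Output:` layout, and the Swahili sentiment example are all assumptions for demonstration purposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FewShotExample:
    source: str  # input text in the target language
    target: str  # expected output string


def build_icl_prompt(instruction: str,
                     demonstrations: List[FewShotExample],
                     test_input: str) -> str:
    """Assemble an in-context learning prompt from a fixed set of
    few-shot demonstrations.

    Because every task is cast as text-in/text-out, the same template
    can serve classification, QA, and generation tasks alike.
    """
    parts = [instruction]
    for ex in demonstrations:
        parts.append(f"Input: {ex.source}\nOutput: {ex.target}")
    # Leave the final output slot empty for the model to complete.
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical usage: Swahili sentiment classification with two
# fixed in-language demonstrations.
demos = [
    FewShotExample("Filamu hii ilikuwa nzuri sana!", "positive"),
    FewShotExample("Huduma ilikuwa mbaya kabisa.", "negative"),
]
prompt = build_icl_prompt(
    "Classify the sentiment of the text as 'positive' or 'negative'.",
    demos,
    "Chakula kilikuwa kitamu na bei nafuu.",
)
print(prompt)
```

Fixing the demonstration set per task and language, rather than resampling it per run, is what allows different models and learning methods to be compared on identical inputs.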