Natural language processing (NLP) is a central subject of artificial intelligence research, on par with computer vision and scene understanding. Knowledge is encoded and transferred among individuals through text, following a formally defined structure. Nonetheless, distinct languages, contexts, and the subtle particularities of different communication channels are complex challenges that researchers must cope with. Hence, the task of general language modeling and understanding has been divided into multiple subtasks, such as question answering, image captioning, text summarization, machine translation, and natural language generation. Recently, attention mechanisms have become ubiquitous among state-of-the-art approaches, allowing models to selectively attend to different words or sentences in order of relevance. The goal of this review is to gather and analyze current research efforts that improve on, or provide alternatives to, attention mechanisms, categorize trends, and extrapolate possible research paths for future work.
Recurrent Neural Networks (RNNs) were broadly adopted by the NLP community and achieved important milestones.37 However, encoding long-term relations among words in a sentence, or among sentences in a document, is computationally expensive with RNNs. In tasks such as text generation, encoding these dependencies is fundamental, and inference may become prohibitively slow. Pursuing a solution to these limitations, the seminal work of Vaswani et al.33 engineered the Transformer architecture. The core idea was to put attention mechanisms in evidence, discarding recurrence. Soon after, the Transformer became the de facto backbone model for most NLP tasks. From the original December 2017 publication to date, there have been over 4,000 citations and remarkable work building on the concept. Due to its early success, the community focus shifted toward developing larger models, containing scaled-up Transformers and longer context windows, leading to a move from task-specific training procedures to more general language modeling. On the other hand, although current models are capable of generating text with unprecedentedly low perplexity, they rely mostly on the statistical knowledge contained in their text corpora, without encoding the actual meaning behind words. Hence, the produced text alternates surprisingly fluent segments with meaningless word sequences. This lack of meaning cannot be trivially solved through attention alone.
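As a concrete reference point for the attention operation discussed above, the following is a minimal NumPy sketch of scaled dot-product attention, the building block that the Transformer uses in place of recurrence. The function name, array shapes, and toy data are illustrative assumptions, not taken from any specific implementation surveyed here.

```python
# Illustrative sketch (assumed shapes and names): scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attend over value vectors V using query/key similarity.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    Returns (n_queries, d_v) context vectors and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights

# Toy usage: 3 query tokens attending over 4 key/value tokens.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)  # (3, 16) (3, 4)
```

Because every query attends to every key in a single matrix product, dependencies between distant tokens are captured without the sequential steps that make RNNs slow, at the cost of computation quadratic in sequence length.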