Because of the attention layer, transformers can better capture relationships between words separated by long stretches of text, whereas previous models such as recurrent neural networks (RNNs), which process a sequence one token at a time, tend to lose track of earlier words as the distance between related parts of the text grows.
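To make the contrast concrete, the following is a minimal sketch of scaled dot-product self-attention in NumPy. It is illustrative only (the function name and toy dimensions are assumptions, not from the source): every position computes weights over all other positions in a single step, so a word's output can draw directly on a word far away in the sequence, rather than having to pass information through many sequential steps as in an RNN.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Minimal self-attention sketch: every position attends to every
    other position in one step, regardless of how far apart they are."""
    d_k = queries.shape[-1]
    # Similarity of each query with every key (seq_len x seq_len)
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all value vectors
    return weights @ values

# Toy example: 5 token embeddings of dimension 4
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (5, 4): each token's output draws on all 5 tokens
```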