Transformers have become the dominant architecture for many of these sequence modeling tasks because the underlying attention mechanism can be parallelized across sequence positions, allowing for massive ...
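To make the parallelism concrete, here is a minimal sketch of scaled dot-product attention in NumPy (the function name and toy dimensions are illustrative, not from the original text): every query attends to every key through a single matrix product, so the whole sequence is processed at once rather than step by step as in a recurrent model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention over all positions at once via matrix products.

    Q, K: arrays of shape (seq_len, d_k); V: (seq_len, d_v).
    One matmul computes all query-key similarities in parallel.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values

# Toy self-attention: 4 positions, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because the core computation is just dense matrix multiplications over the full sequence, it maps directly onto GPU/TPU hardware, which is what enables the parallel training the paragraph above refers to.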