To facilitate this exploration, the typical convolutional backbone is replaced with an enhanced Vision Transformer architecture for Tokenization (ViTok), which integrates Vision Transformers (ViTs ...
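As a rough illustration of what tokenizing an image with a ViT-style front end involves, the sketch below splits an image into non-overlapping patches and flattens each patch into one token. It is a minimal sketch under stated assumptions, not the ViTok implementation: the function name `patchify`, the nested-list image layout, and the patch size of 2 are hypothetical choices made for this example, and the learned linear projection and Transformer layers that would follow in an actual ViT are omitted.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Hypothetical helper for illustration only; each returned token holds
    patch * patch * C raw values, before any learned projection.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):          # patch rows, top to bottom
        for left in range(0, width, patch):      # patch columns, left to right
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append every channel value
            tokens.append(token)
    return tokens


# Toy 4 x 4 image with 3 channels; pixel (r, c) holds the value r * 4 + c in each channel.
image = [[[r * 4 + c] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), len(tokens[0]))
```

With a 4 x 4 input and a patch size of 2, the image becomes a sequence of 4 tokens of 12 values each. A Transformer encoder would then operate on this sequence in place of the feature maps a convolutional backbone would produce.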