We use Gemma3-4B-it as the primary text encoder, conditioning on its penultimate-layer token hidden states. We also extract pooled text features from Jina CLIP v2, project them, and fuse them into the ...
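As a rough illustration of the two conditioning signals described above, the following sketch picks the penultimate-layer token states from a stack of per-layer outputs and applies a linear projection to a pooled feature vector. It is not the authors' implementation: random NumPy arrays stand in for the actual Gemma3-4B-it and Jina CLIP v2 outputs, and every name and dimension (`penultimate_hidden_states`, `project_pooled`, `d_text`, `d_pooled`, `d_model`) is a hypothetical choice made for the example. Because the source sentence is truncated before it says where the projected features are fused, the sketch stops at the projection step.

```python
import numpy as np

def penultimate_hidden_states(all_hidden_states):
    """Return the token hidden states of the second-to-last layer.

    `all_hidden_states` is assumed to be a sequence ordered from the
    first layer's output to the last layer's output.
    """
    return all_hidden_states[-2]

def project_pooled(pooled, weight, bias):
    """Linearly project a pooled feature vector to the target width."""
    return pooled @ weight + bias

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
num_layers, seq_len, d_text = 4, 6, 16
d_pooled, d_model = 8, 16

# Stand-in for per-layer token hidden states from the primary text encoder.
all_hidden_states = [
    rng.standard_normal((seq_len, d_text)) for _ in range(num_layers)
]

# Stand-in for the pooled text feature from the secondary encoder.
pooled = rng.standard_normal(d_pooled)

# Hypothetical projection parameters (learned in a real model).
weight = rng.standard_normal((d_pooled, d_model)) * 0.02
bias = np.zeros(d_model)

token_states = penultimate_hidden_states(all_hidden_states)
pooled_proj = project_pooled(pooled, weight, bias)

print(token_states.shape)  # per-token conditioning sequence
print(pooled_proj.shape)   # single projected vector, ready to be fused
```

The token states keep one vector per token, while the projected pooled feature is a single vector summarizing the whole prompt.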