BiXT model weights (Perceiving Longer Sequences with Bi-Directional Cross-Attention Transformers)
This collection includes PyTorch weights of various BiXT models trained on the ImageNet dataset, as introduced in the paper:
Markus Hiller, Krista A. Ehinger, and Tom Drummond. "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers." The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
These weights accompany the following GitHub repository, which contains all code: https://github.com/mrkshllr/BiXT/
In detail, weights for the following models are available:
ImageNet Classification Models
Default BiXT-Tiny models with 64 latents:
- BiXT-Ti/16: bixt_ti_l64_p16.zip
- BiXT-Ti/8: bixt_ti_l64_p16s8.zip
- BiXT-Ti/4: bixt_ti_l64_p16s4.zip
Previous BiXT-Tiny models fine-tuned on larger 384x384 images:
- BiXT-Ti/16-ft384: bixt_ti_l64_p16_ft384.zip
- BiXT-Ti/8-ft384: bixt_ti_l64_p16s8_ft384.zip
- BiXT-Ti/4-ft384: bixt_ti_l64_p16s4_ft384.zip
Convolutional Alternative: BiXT-Tiny w/ conv-tokeniser:
- BiXT-Ti/16 (conv): bixt_conv_ti_l64_p16.zip
Slightly larger models with an embedding dimension of 256 instead of 192 (the Tiny default):
- BiXT-d256/16: bixt_ed256_l64_p16.zip
- BiXT-d256/8: bixt_ed256LS_l64_p16s8.zip
- BiXT-d256/8-ft384: bixt_ed256LS_l64_p16s8_ft384.zip
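The archives above each contain a PyTorch checkpoint. As a rough orientation, here is a minimal loading sketch; it assumes the .zip holds a single checkpoint file (.pth/.pt/.tar) whose weights sit either at the top level or under a 'state_dict' key, and the model-factory call in the comments is purely illustrative, so please refer to the repository for the actual loading utilities and model names.

```python
import io
import zipfile

import torch


def load_bixt_state_dict(zip_path: str) -> dict:
    """Read the first checkpoint file found in the archive and return its state dict."""
    with zipfile.ZipFile(zip_path) as zf:
        ckpt_name = next(n for n in zf.namelist() if n.endswith((".pth", ".pt", ".tar")))
        # Depending on your PyTorch version, you may need torch.load(..., weights_only=False)
        # if the checkpoint stores extra (non-tensor) training metadata.
        ckpt = torch.load(io.BytesIO(zf.read(ckpt_name)), map_location="cpu")
    # Some checkpoints wrap the weights; fall back to the raw dict otherwise.
    return ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt


state_dict = load_bixt_state_dict("bixt_ti_l64_p16.zip")

# Build the matching architecture with the code from the BiXT repository, then load the weights:
# model = create_model("bixt_ti_l64_p16", num_classes=1000)  # hypothetical factory call
# model.load_state_dict(state_dict)
```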
Models for Dense Downstream Tasks
Note that for standard ImageNet training, we simply use a classification loss on the average-pooled latent embeddings. This means that for a 12-layer BiXT network, the refined patch tokens only receive a gradient up to layer 11, which is why we employ only a one-sided cross-attention in the last layer (see the BiXT model file in the repository).
For simplicity and easy transfer to dense downstream tasks, we therefore create BiXT models with a depth of 13 and train them on ImageNet (see the repository); afterwards, the last one-sided cross-attention, which exclusively refines the latent vectors, is simply discarded and the remaining (fully trained) 12-layer network is used for fine-tuning on downstream tasks.
Note: It is, of course, entirely possible to replace or extend our simple classification loss on the averaged latent vectors with other token-side losses (e.g. Masked Image Modelling) to provide a gradient signal for the token side and thereby directly train both the latent and the token refinement across all layers.
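To make the d13-to-d12 step above concrete, below is a hedged sketch of how the 13th layer could be stripped from a checkpoint before dense fine-tuning. It assumes the refinement layers appear under a 'layers.<index>.' prefix and the classification head under 'head.'; both prefixes and the backbone factory in the comments are assumptions, and the actual module names are defined in the BiXT model file in the repository.

```python
import io
import zipfile

import torch

# Load the depth-13 checkpoint (same archive-layout assumption as in the sketch above).
with zipfile.ZipFile("bixt_ti_l64_d13_p16.zip") as zf:
    ckpt_name = next(n for n in zf.namelist() if n.endswith((".pth", ".pt", ".tar")))
    ckpt = torch.load(io.BytesIO(zf.read(ckpt_name)), map_location="cpu")
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt

# Discard layer index 12 (the final one-sided cross-attention that only refines the
# latents) and the ImageNet classification head, neither of which is needed downstream.
trunk = {
    k: v
    for k, v in state_dict.items()
    if not k.startswith(("layers.12.", "head."))
}

# The remaining fully trained 12-layer trunk can then be loaded (non-strictly) into a
# depth-12 BiXT backbone and fine-tuned on the dense downstream task:
# backbone = build_bixt_backbone(depth=12)  # hypothetical factory, see the repository
# missing, unexpected = backbone.load_state_dict(trunk, strict=False)
```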
Dense (d13) BiXT-Tiny models with 64 latents:
- BiXT-Ti/16 (d13): bixt_ti_l64_d13_p16.zip
- BiXT-Ti/8 (d13): bixt_ti_l64_d13_p16s8.zip
- BiXT-Ti/4 (d13): bixt_ti_l64_d13_p16s4.zip
Previous dense (d13) BiXT-Tiny models fine-tuned on larger 384x384 images:
- BiXT-Ti/16-ft384 (d13): bixt_ti_l64_d13_p16_ft384.zip
- BiXT-Ti/8-ft384 (d13): bixt_ti_l64_d13_p16s8_ft384.zip
Dense Convolutional Alternative: BiXT-Tiny (d13) w/ conv-tokeniser:
- BiXT-Ti/16 (conv, d13): bixt_conv_ti_l64_d13_p16.zip
- BiXT-Ti/8 (conv, d13): bixt_conv_ti_l64_d13_p8.zip
Note
If you need any further information or model files, please check out the code in the GitHub repository linked above and/or reach out!
Additional information, as well as details on the performance of each pre-trained model, is also provided in the repository.