
BiXT model weights (Perceiving Longer Sequences with Bi-Directional Cross-Attention Transformers)

Dataset posted on 2025-03-10, 07:32, authored by Markus Hiller

BiXT model weights

This collection includes PyTorch weights of various BiXT models trained on the ImageNet dataset, as introduced in the paper:

Markus Hiller, Krista A. Ehinger, and Tom Drummond. "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers." The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024. (available here)

These weights support the accompanying GitHub repository, which contains all code: https://github.com/mrkshllr/BiXT/

In detail, weights for the following models are available:

ImageNet Classification Models

Default BiXT-Tiny models with 64 latents:

  • BiXT-Ti/16: bixt_ti_l64_p16.zip
  • BiXT-Ti/8: bixt_ti_l64_p16s8.zip
  • BiXT-Ti/4: bixt_ti_l64_p16s4.zip

The above BiXT-Tiny models fine-tuned on larger 384x384 images:

  • BiXT-Ti/16-ft384: bixt_ti_l64_p16_ft384.zip
  • BiXT-Ti/8-ft384: bixt_ti_l64_p16s8_ft384.zip
  • BiXT-Ti/4-ft384: bixt_ti_l64_p16s4_ft384.zip

Convolutional Alternative: BiXT-Tiny w/ conv-tokeniser:

  • BiXT-Ti/16 (conv): bixt_conv_ti_l64_p16.zip

Slightly larger models with an embedding dimension of 256 instead of 192 (the tiny default):

  • BiXT-d256/16: bixt_ed256_l64_p16.zip
  • BiXT-d256/8: bixt_ed256LS_l64_p16s8.zip
  • BiXT-d256/8-ft384: bixt_ed256LS_l64_p16s8_ft384.zip
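
Each .zip archive above contains a plain PyTorch checkpoint. The snippet below is only a rough loading sketch: the name of the checkpoint file inside the archive, the layout of the checkpoint dictionary and the model factory are assumptions on our part, so please follow the instructions in the GitHub repository linked above for the exact procedure.

    import zipfile
    import torch

    # Unpack the downloaded archive (assumed to contain a single .pth checkpoint).
    with zipfile.ZipFile("bixt_ti_l64_p16.zip") as zf:
        zf.extractall("weights/bixt_ti_l64_p16")

    # Load the checkpoint onto the CPU first; the file name inside the archive is an assumption.
    ckpt = torch.load("weights/bixt_ti_l64_p16/bixt_ti_l64_p16.pth", map_location="cpu")

    # Checkpoints are often wrapped as {"model": state_dict, ...}; fall back to the raw object.
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

    # Build the architecture via the factory provided in the BiXT repository
    # (placeholder name below) and load the weights:
    # model = create_bixt_ti_l64_p16(num_classes=1000)
    # model.load_state_dict(state_dict, strict=True)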


Models for Dense Downstream Tasks

Note that for standard ImageNet training, we simply use a standard classification loss on the average-pooled latent embeddings. This means that for a 12-layer BiXT network, the refined patch tokens only receive a gradient up to layer 11, which is why we employ only a one-sided cross-attention in the last layer (see the BiXT model file in the repository).
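
As a toy illustration of this training objective (the class and tensor names below are ours, not taken from the repository), classification reduces to mean-pooling the latent vectors and applying a single linear head:

    import torch
    import torch.nn as nn

    class LatentPoolClassifier(nn.Module):
        """Toy stand-in for the classification head: average-pool the
        latent embeddings and map the result to class logits."""
        def __init__(self, embed_dim: int = 192, num_classes: int = 1000):
            super().__init__()
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, latents: torch.Tensor) -> torch.Tensor:
            # latents: (batch, num_latents, embed_dim), e.g. 64 latents of dim 192
            pooled = latents.mean(dim=1)   # average over the latent axis
            return self.head(pooled)       # (batch, num_classes)

    logits = LatentPoolClassifier()(torch.randn(2, 64, 192))
    loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 7]))

Since the patch tokens never enter this loss, the token-refinement part of the final layer would receive no gradient, which motivates the depth-13 setup described next.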

For simplicity and easy transfer to dense downstream tasks, we therefore create BiXT models with a depth of 13 and train these on ImageNet (see the repository); afterwards, the last one-sided cross-attention, which exclusively refines the latent vectors, is discarded, and the remaining (fully trained) 12-layer network is used for fine-tuning on downstream tasks.
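
A rough sketch of this trimming step is shown below; the checkpoint file name and the "blocks.12." key prefix for the final layer are assumptions about the state-dict layout, so inspect the actual keys of the downloaded checkpoint before relying on this.

    import torch

    # Load the depth-13 checkpoint (file name is an assumption).
    ckpt = torch.load("bixt_ti_l64_d13_p16.pth", map_location="cpu")
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

    # Drop every parameter of the final (index-12) one-sided cross-attention layer,
    # keeping the fully trained 12-layer trunk ("blocks.12." is a hypothetical prefix).
    trimmed = {k: v for k, v in state_dict.items() if not k.startswith("blocks.12.")}

    torch.save({"model": trimmed}, "bixt_ti_l64_d12_p16_trimmed.pth")
    # The trimmed weights can then be loaded into a 12-layer BiXT backbone
    # (e.g. with load_state_dict(..., strict=False)) for fine-tuning on dense tasks.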

Note: It is, of course, entirely possible to replace or extend our simple classification loss on the averaged latent vectors with other token-side losses (e.g. masked image modelling) to provide a gradient signal for the token side and thereby directly train both the latent and the token refinement in all layers.

Dense (d13) BiXT-Tiny models with 64 latents:

  • BiXT-Ti/16 (d13): bixt_ti_l64_d13_p16.zip
  • BiXT-Ti/8 (d13): bixt_ti_l64_d13_p16s8.zip
  • BiXT-Ti/4 (d13): bixt_ti_l64_d13_p16s4.zip

The above dense (d13) BiXT-Tiny models fine-tuned on larger 384x384 images:

  • BiXT-Ti/16-ft384 (d13): bixt_ti_l64_d13_p16_ft384.zip
  • BiXT-Ti/8-ft384 (d13): bixt_ti_l64_d13_p16s8_ft384.zip

Dense Convolutional Alternative: BiXT-Tiny (d13) w/ conv-tokeniser:

  • BiXT-Ti/16 (conv, d13): bixt_conv_ti_l64_d13_p16.zip
  • BiXT-Ti/8 (conv, d13): bixt_conv_ti_l64_d13_p8.zip


Note

If you require any further information or model files, please check out the code in the GitHub repository linked above and/or reach out!

Additional information and details on the performance of each pre-trained model are also provided in the repository.
