The Vision Transformer (ViT) Large model is a transformer encoder pre-trained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes), both at a resolution of 224x224. It splits each image into a sequence of fixed-size 16x16 patches, linearly embeds each patch, prepends a classification token ([CLS]), adds positional embeddings, and feeds the resulting sequence into the transformer encoder. Because self-attention relates every patch to every other patch, the model captures global relationships within the image, which makes it suitable for a range of downstream image classification tasks. The model weights were converted from JAX to PyTorch by Ross Wightman.
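
The patch arithmetic described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the model's actual preprocessing code: the helper names `vit_sequence_shape` and `split_into_patches` are hypothetical, and the three-channel (RGB) input is an assumption that the description does not state.

```python
# Illustrative sketch only: helper names are hypothetical, not a library API.
IMAGE_SIZE = 224  # resolution stated in the description
PATCH_SIZE = 16   # patch size stated in the description
CHANNELS = 3      # assumption: RGB input (not stated in the description)


def vit_sequence_shape(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE, channels=CHANNELS):
    """Return (patches_per_side, num_patches, sequence_length, values_per_patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    sequence_length = num_patches + 1  # one extra position for the [CLS] token
    values_per_patch = patch_size * patch_size * channels
    return patches_per_side, num_patches, sequence_length, values_per_patch


def split_into_patches(image, patch_size):
    """Split a square single-channel grid (list of rows) into flattened patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    size = len(image)
    if size % patch_size != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square and divisible by the patch size")
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


if __name__ == "__main__":
    per_side, n_patches, seq_len, per_patch = vit_sequence_shape()
    print(per_side, n_patches, seq_len, per_patch)  # 14 196 197 768

    # Tiny 4x4 grid with 2x2 patches, just to show the ordering.
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(split_into_patches(toy, 2)[0])  # [0, 1, 4, 5]
```

Under these assumptions a 224x224 image yields 14 x 14 = 196 patches, so the encoder sees a sequence of 197 tokens once the [CLS] token is prepended, and each flattened patch holds 768 raw values before the linear embedding.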

Vision Transformer (ViT) Large

About Vision Transformer (ViT) Large

Core Capabilities

Main Tasks

Image Classification

Feature Extraction
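
The difference between the two tasks listed above can be sketched with toy numbers. This is a minimal illustration, assuming the common ViT convention that the final hidden state of the [CLS] token (position 0) serves as the image feature and that a linear head over it produces class scores. The function names, weights, and hidden states below are made up for the example and are far smaller than the real model's.

```python
import math

# Toy sketch: all values and helper names are hypothetical, not real model weights.


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_features(hidden_states):
    """Feature extraction: return the final hidden state of the [CLS] token.

    Assumes the [CLS] token sits at position 0 of the sequence.
    """
    return hidden_states[0]


def classify(hidden_states, weights, bias):
    """Classification: linear head on the [CLS] features, then softmax."""
    features = extract_features(hidden_states)
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


if __name__ == "__main__":
    # Three tokens ([CLS] plus two patches) with a hidden size of 2.
    hidden_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    # Linear head for three toy classes.
    weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 0.0]

    print(extract_features(hidden_states))  # [1.0, 0.0]
    label, probs = classify(hidden_states, weights, bias)
    print(label)  # 0
```

Feature extraction stops at the encoder output and hands the vector to some other downstream component, while classification adds the linear head and picks the highest-scoring class.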

What this tool is best suited for

Alternative Tools

CIFAR-10 and CIFAR-100 Datasets

ConvNeXt

Google AI Gemini API & MediaPipe

Vision Transformer

Hugging Face Fashion Models

Hugging Face Fashion ViT Models

Inference Endpoints

MobileNetV3