AI Engineering Glossary

Vision Transformers

Vision Transformers (ViTs) are a neural network architecture for processing and understanding visual data. Unlike traditional convolutional neural networks (CNNs), which build up features from local patterns through stacked convolutions, ViTs split the input image into fixed-size patches and treat them as a sequence of tokens (akin to words in text processing). Self-attention then lets every patch attend to every other patch, so global relationships can be captured from the first layer onward. This is especially useful for tasks like image classification and segmentation. ViTs have achieved state-of-the-art results on a range of image-based tasks and can match or outperform CNNs, particularly when pretrained on large datasets.
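The patch-sequence idea can be illustrated with a minimal sketch in plain Python. The function names `image_to_patches` and `self_attention` are hypothetical, chosen only for this example. The attention step is deliberately simplified: it uses the raw patch vectors as queries, keys, and values, whereas a real ViT first applies a learned linear projection to each patch, adds position embeddings, and uses learned multi-head attention.

```python
import math
from typing import List

Vector = List[float]
Image = List[List[Vector]]  # height x width x channels


def image_to_patches(image: Image, patch_size: int) -> List[Vector]:
    """Split an image into a row-major sequence of flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch: Vector = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's channels
            patches.append(patch)
    return patches


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention WITHOUT learned projections.

    Assumption for illustration: queries, keys, and values are the tokens
    themselves. Each output token is a weighted mix of all input tokens.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    output = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(dim)]
        output.append(mixed)
    return output


# A toy 4x4 single-channel image becomes a sequence of four 2x2 patches.
toy_image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
toy_patches = image_to_patches(toy_image, patch_size=2)
toy_output = self_attention(toy_patches)
```

Because every output token is computed from all patches at once, information from opposite corners of the image can interact in a single step, which is the global behavior the definition above contrasts with the local receptive fields of convolutions.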
