Abstract
Generative transformers have rapidly gained popularity in the computer
vision community for synthesizing high-fidelity and high-resolution
images. The best generative transformer models so far, however, still treat an
image naively as a sequence of tokens, and decode an image sequentially
following the raster scan ordering (i.e. line-by-line). We find this strategy
neither optimal nor efficient. This paper proposes a novel image synthesis
paradigm using a bidirectional transformer decoder, which we term MaskGIT.
During training, MaskGIT learns to predict randomly masked tokens by attending
to tokens in all directions. At inference time, the model begins by
generating all tokens of an image simultaneously, and then refines the image
iteratively conditioned on the previous generation. Our experiments demonstrate
that MaskGIT significantly outperforms the state-of-the-art transformer model
on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x.
In addition, we show that MaskGIT can be easily extended to various image
editing tasks, such as inpainting, extrapolation, and image manipulation.