Vision Transformer-assisted analysis of neural image compression and generation

Minchev, Kliment

This work investigates a novel application of a Vision Transformer (ViT) as a quality assessment reference metric for images reconstructed after neural image compression. The Vision Transformer adapts the Transformer attention mechanism, originally developed for language models, to image classification. Because the ViT architecture outputs a classification probability distribution over a set of training labels, it is a suitable candidate for a new method of quantitatively assessing generated image quality based on object-level deviations from the original pre-compression image. This metric is referred to as a ViT-Score. The approach complements other comparative measurement techniques based on per-pixel discrepancies (Mean Squared Error, MSE) or structural comparison (Structural Similarity Index, SSIM). This study proposes an original end-to-end deep learning framework for neural image compression, latent vector representation, reconstruction, and image quality analysis using state-of-the-art model architectures. Neural image compression and reconstruction are achieved using a Generative Adversarial Network (GAN). Results from this work demonstrate that a ViT-Score is capable of assessing the quality of a neurally compressed image. Moreover, this methodology provides valuable insights when measuring GAN output quality and can be used alongside other relevant perceived-quality metrics such as SSIM or Fréchet Inception Distance (FID).
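The abstract does not give the exact formulation of the ViT-Score, but the idea of comparing classifier probability distributions for the original and reconstructed images can be sketched as follows. This is a minimal illustrative sketch, not the thesis's definition: `vit_score`, the toy logits, and the use of symmetric KL divergence are all assumptions; a real implementation would obtain logits from a pretrained ViT.

```python
import math

def softmax(logits):
    """Convert raw classifier logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def vit_score(logits_original, logits_reconstructed):
    """Hypothetical ViT-Score: symmetric KL divergence between the
    class distributions a ViT assigns to the original and the
    reconstructed image. Lower values mean the reconstruction is
    closer at the object level; 0 means identical distributions."""
    p = softmax(logits_original)
    q = softmax(logits_reconstructed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Toy logits standing in for ViT outputs over three class labels.
identical = vit_score([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])  # ~0.0
degraded  = vit_score([2.0, 0.5, -1.0], [0.5, 2.0, -1.0])  # > 0
```

In this formulation a perfect reconstruction scores near zero, while a reconstruction that shifts the classifier's object-level judgment scores higher, which is the object-level sensitivity the abstract contrasts with per-pixel metrics like MSE.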

