Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Zeyu Liu1,2    Weicong Liang1    Zhanhao Liang1    Chong Luo    Ji Li    Gao Huang    Yuhui Yuan2,3
1 Interns at Microsoft    2 Core contribution    3 Project lead
Microsoft Research Asia         Tsinghua University         Peking University         The Australian National University


Abstract

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs.

Our solution is a series of customized text encoders, Glyph-ByT5, obtained by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, which yields the Glyph-SDXL model for design image generation. This raises text rendering accuracy from less than 20% to nearly 90% on our design image benchmark. Glyph-SDXL also gains the ability to render text paragraphs, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts.

Finally, by fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we show a substantial improvement in scene text rendering in open-domain real images. We hope these results encourage further exploration of customized text encoders for diverse and challenging tasks.
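To make the "character awareness" requirement from the abstract concrete, the sketch below shows byte-level encoding in the style of ByT5, where each UTF-8 byte of the text becomes its own token id, so no character is hidden inside a subword unit. This is a minimal illustration in plain Python, not the authors' code: the helper names `byte_level_ids` and `ids_to_text` are hypothetical, and the offset of 3 for reserved ids (padding, end-of-sequence, unknown) is assumed from the standard ByT5 tokenizer convention.

```python
# Assumed ByT5-style convention: ids 0, 1, 2 are reserved for
# padding, end-of-sequence, and unknown, so raw bytes are shifted by 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def byte_level_ids(text: str) -> list[int]:
    """Map text to one id per UTF-8 byte, followed by an EOS id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def ids_to_text(ids: list[int]) -> str:
    """Invert byte_level_ids, skipping the reserved special ids."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8")


if __name__ == "__main__":
    ids = byte_level_ids("Glyph")
    print(ids)               # one id per character for ASCII text
    print(ids_to_text(ids))  # round-trips back to the input string
```

Because every character maps to its own id (or a few ids for non-ASCII characters), the encoder sees the exact spelling of the text to be rendered, which is the property the abstract refers to.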



Method Overview

[Figure: method overview pipeline]


Glyph-ByT5 as a flexible region-wise visual text editor


[Figure: Region-wise SDEdit pipeline]


[Figure: six example pairs, each showing a DALLE-3 image alongside its edited image]



Typography layout planning with GPT-4




BibTeX


@article{liu2024glyph,
  title={Glyph-byt5: A customized text encoder for accurate visual text rendering},
  author={Liu, Zeyu and Liang, Weicong and Liang, Zhanhao and Luo, Chong and Li, Ji and Huang, Gao and Yuan, Yuhui},
  journal={arXiv preprint arXiv:2403.09622},
  year={2024}
}

@article{liu2024glyphv2,
  title={Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering},
  author={Liu, Zeyu and Liang, Weicong and Zhao, Yiming and Chen, Bohan and Li, Ji and Yuan, Yuhui},
  journal={arXiv preprint arXiv:2406.10208},
  year={2024}
}
