Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Abstract

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs.

Our solution is a series of customized text encoders, Glyph-ByT5, obtained by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, yielding the Glyph-SDXL model for design image generation. This raises text rendering accuracy from less than 20% to nearly 90% on our design image benchmark. Notably, Glyph-SDXL gains the ability to render text paragraphs, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts.

Finally, by fine-tuning Glyph-SDXL on a small set of high-quality, photorealistic images featuring visual text, we show a substantial improvement in scene text rendering in open-domain real images. We hope these results encourage further exploration of customized text encoders for diverse and challenging tasks.
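
To make the "character awareness" requirement concrete, the sketch below mimics byte-level tokenization in the style of ByT5, where every UTF-8 byte becomes its own token. It is an illustration only, not the Glyph-ByT5 code: the function names are hypothetical, and the offset of 3 assumes the reserved pad, end-of-sequence, and unknown ids used by the Hugging Face ByT5 tokenizer.

```python
# Byte-level tokenization sketch in the style of ByT5 (illustrative only).
# Assumption: ids 0, 1, 2 are reserved for pad, end-of-sequence, and unknown,
# so each UTF-8 byte b maps to id b + 3.

SPECIAL_OFFSET = 3
EOS_ID = 1


def encode_bytes(text: str) -> list[int]:
    """Map text to one id per UTF-8 byte, followed by an end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping reserved ids and anything outside the byte range."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    correct = encode_bytes("Glyph")
    typo = encode_bytes("Glyhp")
    print(correct)  # [74, 111, 124, 115, 107, 1]
    print(typo)     # [74, 111, 124, 107, 115, 1]
    # Positions where the two spellings differ are visible token by token.
    print([i for i, (a, b) in enumerate(zip(correct, typo)) if a != b])  # [3, 4]
```

Because each character is visible to the encoder as its own token (or a few tokens for non-ASCII characters), a misspelling changes specific positions in the input rather than disappearing inside a subword unit.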

Method Overview

Glyph-ByT5 as a flexible region-wise visual text editor

Region-wise SDEdit pipeline

[Figure: six example pairs, each showing a DALLE-3 image alongside its edited image.]

Typography layout planning with GPT-4

BibTeX


@article{liu2024glyph,
  title={Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering},
  author={Liu, Zeyu and Liang, Weicong and Liang, Zhanhao and Luo, Chong and Li, Ji and Huang, Gao and Yuan, Yuhui},
  journal={arXiv preprint arXiv:2403.09622},
  year={2024}
}


@article{liu2024glyphv2,
  title={Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering},
  author={Liu, Zeyu and Liang, Weicong and Zhao, Yiming and Chen, Bohan and Li, Ji and Yuan, Yuhui},
  journal={arXiv preprint arXiv:2406.10208},
  year={2024}
}
