FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications

¹The University of Tokyo, Japan; ²National Institute of Advanced Industrial Science and Technology (AIST), Japan; ³Reichman University, Israel

Eurographics 2024


FontCLIP is a semantic typography visual-language model for multilingual font applications.


Acquiring the desired font for various design tasks can be challenging and requires professional typographic knowledge. While previous font retrieval or generation works have alleviated some of these difficulties, they often lack support for multiple languages and semantic attributes beyond the training data domains. To solve this problem, we present FontCLIP, a model that connects the semantic understanding of a large vision-language model with typographical knowledge. We integrate typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model through a novel finetuning approach. We propose to use a compound descriptive prompt that encapsulates adaptively sampled attributes from a font attribute dataset focusing on Roman alphabet characters. FontCLIP's semantic typographic latent space demonstrates two unprecedented generalization abilities. First, FontCLIP generalizes to different languages including Chinese, Japanese, and Korean (CJK), capturing the typographical features of fonts across different languages, even though it was only finetuned using fonts of Roman characters. Second, FontCLIP can recognize semantic attributes that are not present in the training data. FontCLIP's dual-modality and generalization abilities enable multilingual and cross-lingual font retrieval and letter shape optimization, reducing the burden of obtaining desired fonts.


We finetune a pretrained CLIP model with an existing font-attribute dataset to obtain FontCLIP. During each finetuning iteration, we randomly select attributes based on their scores from the font-attribute dataset and create a compound descriptive prompt for each font. Simultaneously, we generate a font image and apply a random augmentation transformation to enhance variability. We finetune only the last three transformer blocks (highlighted in red) of both encoders, using a pairwise similarity loss function.
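The prompt construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the score scale (0 to 1), the sampling weight (distance of the score from 0.5), the "not ..." phrasing for low scores, and the names `sample_attributes` and `compound_prompt` are all assumptions made for illustration.

```python
import random


def sample_attributes(scores, k, rng):
    """Pick up to k distinct attributes, favoring scores far from 0.5.

    Assumption: scores lie in [0, 1], and attributes with extreme scores
    describe the font more reliably than attributes near the middle.
    """
    names = list(scores)
    weights = [abs(scores[n] - 0.5) + 1e-6 for n in names]
    chosen = []
    for _ in range(min(k, len(names))):
        # Weighted draw of one index, then remove it (sampling without replacement).
        pick = rng.choices(range(len(names)), weights=weights, k=1)[0]
        chosen.append(names.pop(pick))
        weights.pop(pick)
    return chosen


def compound_prompt(scores, k=3, rng=None):
    """Join the sampled attributes into one descriptive prompt."""
    rng = rng or random.Random()
    parts = [
        name if scores[name] >= 0.5 else f"not {name}"
        for name in sample_attributes(scores, k, rng)
    ]
    return ", ".join(parts) + " font"


# Hypothetical attribute scores for one font.
example_scores = {"serif": 0.9, "italic": 0.1, "thin": 0.5}
example_prompt = compound_prompt(example_scores, k=3, rng=random.Random(0))
```

Because attributes are resampled at every call, the same font is paired with a different prompt in each iteration, which is the source of variability on the text side in this sketch.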


Overview of the finetuning process


1. Dual-Modal and Cross-Lingual Font Retrieval

How does it work?

The system embeds the input text prompt or image with FontCLIP and computes the cosine distance between that embedding and the embedding of each font image in the database. The fonts are then returned sorted by increasing distance.
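The ranking step can be sketched as follows. This is a toy illustration under stated assumptions: the three-dimensional vectors stand in for FontCLIP embeddings and are made up, and the names `cosine_distance` and `rank_fonts` are hypothetical rather than part of any released code.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_fonts(query, font_embeddings):
    """Return font names sorted from closest to farthest from the query."""
    return sorted(
        font_embeddings,
        key=lambda name: cosine_distance(query, font_embeddings[name]),
    )


# Made-up embeddings standing in for image-encoder outputs of each font.
fonts = {
    "FontA": [1.0, 0.0, 0.0],
    "FontB": [0.7, 0.7, 0.0],
    "FontC": [0.0, 0.0, 1.0],
}
# In practice the query would come from the text or image encoder.
query = [0.9, 0.1, 0.0]
ranking = rank_fonts(query, fonts)
```

Since both modalities map into the same latent space, the same ranking function serves text queries and image queries alike in this sketch.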

User Interface
Retrieve desirable fonts with a text prompt
Retrieve desirable fonts with an image

You can retrieve desirable fonts by inputting a text prompt, an image, or a combination of both.
Thanks to the generalizability of FontCLIP, retrieval also works across languages: you can input an image of non-Roman letters to retrieve Roman fonts, and vice versa.

2. Dual-Modal and Cross-Lingual Vector Optimization

How does it work?

The initial letter, given in SVG format, is rendered to a bitmap with a differentiable rasterizer and passed through the FontCLIP image encoder. The resulting latent vector is compared with a reference latent vector obtained from a reference text prompt or image. The parameters of the Bézier curves in the input letter are then optimized iteratively by backpropagating a loss that sums this distance in the FontCLIP latent space and additional constraint terms that preserve the initial letter shape.
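One way to write this objective down is given below. The notation is an assumption introduced here, not taken from the source: $\theta$ denotes the Bézier control-point parameters, $\theta_0$ their initial values, $R$ the differentiable rasterizer, $E_{\mathrm{img}}$ the FontCLIP image encoder, $z_{\mathrm{ref}}$ the reference latent vector, $d$ a distance in the latent space, and $C_k$ shape-preserving constraint terms with weights $\lambda_k$. The page does not specify the distance, the individual constraints, the weights, or the optimizer, so the plain gradient step with step size $\eta$ is only the simplest possible choice.

```latex
\mathcal{L}(\theta)
  = d\bigl(E_{\mathrm{img}}(R(\theta)),\; z_{\mathrm{ref}}\bigr)
  + \sum_{k} \lambda_k \, C_k(\theta, \theta_0),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_{\theta} \mathcal{L}(\theta)
```

The first term pulls the rendered letter toward the reference in the latent space, while the constraint terms penalize departures from the initial outline.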

Vector Optimization Guided with a Text Prompt
Optimizing to "serif font"
Optimizing to "not serif font"
Optimizing to "italic font"
Optimizing to "not italic font"
Optimizing to "thin font"
Optimizing to "not thin font"
Vector Optimization Guided with a Reference Image
Optimized serif font Reference image
Optimizing to the reference image


  author    = {Tatsukawa, Yuki and Shen, I-Chao and Qi, Anran and Koyama, Yuki and Igarashi, Takeo and Shamir, Ariel},
  title     = {FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications},
  journal   = {Computer Graphics Forum},
  year      = {2024},