Virtual Try-On Using Images: An Ideal Application of Generative AI and Pattern Recognition

Author: Arun Subramanian

Machine Learning Engineer

Feb 9, 2024

Category: Retail

Introduction

Image-based virtual try-on aims to warp an in-store garment image onto an image of a fashion model. The fashion model may vary in physique, stature, weight, and pose, while the garment image is typically a frontal photograph of the cloth. This matters to the end consumer, who can virtually try on a garment before purchasing online, make an initial evaluation online before visiting the physical store, or even try on garments virtually in the store without needing a changing room. The term “image-based” is significant because traditional methods required more cumbersome 3D meshes to achieve the same result, whereas current machine learning methods do not, which makes this a strong application of Generative AI and pattern recognition.

Goals of a Virtual Try-On Solution

The synthesized image, after virtual try-on, is expected to be perceptually convincing and to meet the following desiderata, as postulated in the research paper VITON: An Image-Based Virtual Try-On Network (thecvf.com):

(1) The body parts and pose of the person are the same as in the original image.

(2) The clothing item in the product image deforms naturally, conditioned on the pose and body shape of the person.

(3) Detailed visual patterns of the desired product are clearly visible, which include not only low-level features like color and texture but also complicated graphics like embroidery, logo, etc.

Challenges of a Virtual Try-On Solution

Warping the Cloth

The first challenge is how to warp the cloth so that its texture and any text on it are preserved.

Generating Human Body Parts

The second challenge is how to generate human body parts such as arms or legs when they are occluded in the target image (for instance, the person in the target image is wearing a full-sleeved T-shirt) and the source garment is shorter.

Traditional Virtual Try-On Approaches

3D approaches such as Multi-Garment Net: Learning to Dress 3D People From Images (thecvf.com) were used to warp the cloth over 3D meshes. However, this requires expensive 3D scanning equipment. 2D approaches, which warp a 2D cloth image over a 2D image of a model of varying shape and pose, have therefore been gaining attention.

How 2D Virtual Try-On Methods Typically Solve the Challenges

The 2D virtual try-on problem is typically formulated as generating the warped cloth from information in the target image, such as pose and body segmentation. The warping can be determined by a Thin-Plate Spline (TPS) based transformation or by flow vectors that indicate where each pixel of the cloth moves. The latter is sometimes achieved implicitly by using a GAN, but explicit flow vectors and TPS are usually preferred. However, these approaches also fall short when they determine only a global flow field, because different parts of the body usually require different garment transformations. The deformation flows are usually learned with data-driven approaches, that is, with machine learning.
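To make the idea of flow vectors concrete, here is a minimal sketch of applying a per-pixel flow field to a cloth image. It is an illustration only, not the method of any particular paper: the function name `warp_with_flow` is hypothetical, the image is a small grayscale grid held in nested lists, and the flow is assumed to use a backward-sampling convention (each output pixel looks up where to read from in the source) with bilinear interpolation. A real system would use a tensor library and a learned flow.

```python
import math


def warp_with_flow(image, flow):
    """Backward-warp a grayscale image with a per-pixel flow field.

    image: list of rows of numbers.
    flow:  flow[y][x] = (dx, dy); output pixel (x, y) samples the input
           at (x + dx, y + dy). Samples outside the image count as 0.
    """
    height, width = len(image), len(image[0])

    def pixel(x, y):
        # Zero padding for out-of-bounds reads.
        if 0 <= x < width and 0 <= y < height:
            return image[y][x]
        return 0.0

    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            x0, y0 = math.floor(sx), math.floor(sy)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation between the four neighbouring pixels.
            top = (1 - wx) * pixel(x0, y0) + wx * pixel(x0 + 1, y0)
            bottom = (1 - wx) * pixel(x0, y0 + 1) + wx * pixel(x0 + 1, y0 + 1)
            row.append((1 - wy) * top + wy * bottom)
        warped.append(row)
    return warped


cloth = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# A uniform flow: every output pixel reads from its right-hand neighbour.
shift_left = [[(1.0, 0.0)] * 3 for _ in range(3)]
print(warp_with_flow(cloth, shift_left))
```

A uniform flow like the one in this usage example is a purely global transformation. Letting the flow differ per pixel is what allows the sleeve and the torso of a garment to deform differently.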

Machine Learning Approaches: Problem Statement, Dataset, and Key Inputs

The standard datasets used to train this mapping typically consist of images of clothing and of individuals wearing that clothing in specific poses or shapes. The pose key points (OpenPose, DensePose, etc.) and body segmentation are inferred from images of the person, while the cloth segmentation and mask are inferred from images of the cloth. Together they serve as the priors, or inputs, for learning the required warping.

An example of inputs inferred from the cloth and person images is shown in the figure. The pose map is inferred from the person's image.

Example inputs inferred from cloth and person image

Image credits to VITON: An Image-based Virtual Try-on Network.
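One way the pose key points described above might be turned into a network input is to draw one binary map per key point. The sketch below assumes this encoding purely for illustration: `keypoints_to_pose_maps` is a hypothetical helper, the key points are assumed to arrive as integer pixel coordinates from a separate pose estimator, and the `radius` parameter is an arbitrary choice.

```python
def keypoints_to_pose_maps(keypoints, height, width, radius=1):
    """Return one binary map per key point, with a filled square around it.

    keypoints: list of (x, y) integer pixel coordinates, or None for a
               joint that the pose estimator did not detect.
    """
    maps = []
    for point in keypoints:
        channel = [[0] * width for _ in range(height)]
        if point is not None:
            px, py = point
            # Fill a (2 * radius + 1)-wide square, clipped to the image.
            for y in range(max(0, py - radius), min(height, py + radius + 1)):
                for x in range(max(0, px - radius), min(width, px + radius + 1)):
                    channel[y][x] = 1
        maps.append(channel)
    return maps


# Three joints on a 4 x 4 image; the second one was not detected.
pose_maps = keypoints_to_pose_maps([(0, 0), None, (2, 2)], height=4, width=4)
for channel in pose_maps:
    print(channel)
```

The resulting maps can be stacked with the body segmentation as extra channels, so that the model sees pose and shape without seeing the original clothing.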

The machine learning models used here typically first extract deep learned features from the input cloth image and the target fashion-model image, either through convolutional networks or pyramid cascades, and then learn the warping flow field using a U-Net image-to-image translation framework. Once the warping flow field is learned, it is applied to the cloth image, and the warped cloth is placed over the target fashion model to complete the synthesis of the virtual try-on.
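The final step, placing the warped cloth over the target fashion model, can be sketched as a mask-based blend. This is a simplified illustration under stated assumptions: the `composite` function is hypothetical, images are grayscale grids in nested lists, and a warped cloth mask with values between 0 and 1 is assumed to be available. Published methods generally rely on a learned generator for this step rather than a fixed blend.

```python
def composite(person, warped_cloth, cloth_mask):
    """Blend the warped cloth over the person image.

    Where the mask is 1 the cloth is shown, where it is 0 the person is
    kept, and values in between mix the two.
    """
    return [
        [m * c + (1 - m) * p for p, c, m in zip(p_row, c_row, m_row)]
        for p_row, c_row, m_row in zip(person, warped_cloth, cloth_mask)
    ]


person = [[10, 10],
          [10, 10]]
warped_cloth = [[200, 200],
                [200, 200]]
cloth_mask = [[1, 0],
              [0.5, 0]]
print(composite(person, warped_cloth, cloth_mask))
```

Soft mask values at the garment boundary avoid a hard cut-out edge between cloth and body.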

Conclusions

Image-based virtual try-on is an ideal use case for exploiting advances in pattern recognition, machine learning, and Generative AI. As the inductive biases of deep learning models improve, alongside novel architectures and approaches, the ability to learn from a huge corpus of images of cloth and of fashion models wearing it promises to produce novel, real-time virtual try-on outputs. Further, for the end user, the application can take the form of a browser application, a mobile application, or even a virtual try-on room in the store, providing a seamless shopping experience for the customer and improved sales conversion for the merchant.