Hello! Welcome to my blog post! My time as an intern at Hypothetic, under the guidance of Simon Wiedemann, has been packed with experiments, learning, and fun times. Over these 8 months, I’ve spearheaded research and engineering on the training and implementation of 3D segmentation AI models. I hope you enjoy reading about my journey and take an interest in the things I learned along the way!
At Hypothetic, I focused on 3D segmentation AI models, transferring my 2D segmentation skills to a diverse 3D dataset. I initially used Meta’s Segment Anything Model (SAM) in a web browser, but its accuracy fell short. A more refined version called SAMHQ showed improvements, but still not enough.
Leveraging artist-created UV maps, which are images that map 3D surfaces onto a 2D plane, we could isolate “UV islands” and treat them as segments. We built a dataset using this method and trained a PointNet++ model, an architecture known for its effectiveness in 3D point cloud processing thanks to the hierarchical features it learns for complex structures. PointNet++ showed promising results, especially on extremities, but complex areas like detailed accessories remained a challenge.
When I joined the company, I brought with me my experience in segmentation, which is the art of drawing lines around an object in a 2D image. From my work at DaoAI Robotics, I understood well the challenges of detecting such objects and drawing clean, precise lines around them. The process involves having a human painstakingly draw these lines on as many images as possible, feeding this data to a machine learning model, and hoping the model can do the same on similar images. It’s an expensive and time-consuming process, especially the part involving the human. The upside is that in a controlled environment, this process is quite reliable and can produce pretty effective machine learning models. The challenge at Hypothetic was to use this skill-set to do the 3D equivalent with their vast and diverse 3D model dataset.
Understanding the data you’re working with is crucial to really understanding the challenge you’re tackling. In this case, the data was extremely diverse: some were character models, some were vehicles, some didn’t have textures, some had an extremely high number of vertices, etc. It wouldn’t make sense to hire an army of people to segment parts of the 3D models, since doing this in 3D is far more complicated than drawing lines, and the vast variation in the segments would lead to many different interpretations of what counts as a segment. Ideally, we’d like to train a machine learning model on this data and then be able to identify and segment parts of a given model, like a hand or a wheel.
Around the time I became an intern at Hypothetic, Meta had recently released their Segment Anything Model (SAM), built to take any image and create segments automatically! This was a game changer! When given a point in the image, SAM would output a mask that overlaps that part of the image, highlighting the intended object or part. SAM was quite good at this. We came up with the idea of using SAM to create these masks from images of a 3D model and project them back onto the model as a way of doing segmentation.
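For the curious, prompting SAM with a point looks roughly like the sketch below, using Meta’s segment_anything package. The checkpoint name is the public ViT-H release; the rendered view file and the pixel coordinates are placeholders standing in for our actual rendering setup, not our real pipeline.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load the public SAM ViT-H checkpoint and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# "render.png" stands in for a rendered view of one of our 3D models.
rendered_view = np.asarray(Image.open("render.png").convert("RGB"))
predictor.set_image(rendered_view)

# Prompt with a single foreground point (x, y); SAM returns candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),   # placeholder pixel coordinates
    point_labels=np.array([1]),            # 1 = foreground point
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]       # boolean HxW mask for the picked part
```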
Test model
Although the results were mostly good, the accuracy wasn’t high enough for a person to use without spending considerable time fixing the prediction. Occluded geometry was also an issue and would have required a novel algorithm to capture every surface. We investigated a fine-tuned, i.e. specialized, version of SAM called SAMHQ. It was meant to produce cleaner segments, and although the segments were better, the accuracy still wasn’t good enough. The most likely issue is that SAM and SAMHQ are trained on datasets of photographs and real images, whereas our data was all virtual 3D models. It’s too bad this approach didn’t work out, since SAM has a really nice feature where it can output an image with all the non-overlapping segments composed on it. Ideally, we’d use this to solve the issue of human subjectivity and consistency, but the results weren’t of high enough quality to be usable.
The results of asking the model for all detectable segments in the image.
Unfortunately, the results were inconsistent and didn’t detect everything.
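The “segment everything” mode mentioned above is exposed in the same segment_anything package as an automatic mask generator. A minimal sketch, with the same placeholder render as before:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# "render.png" again stands in for a rendered view of a 3D model.
rendered_view = np.asarray(Image.open("render.png").convert("RGB"))

# Each entry is a dict with a boolean "segmentation" mask plus metadata
# such as area, bounding box, predicted IoU, and stability score.
all_masks = mask_generator.generate(rendered_view)
```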
The anatomy of a 3D model includes data for the vertices, how the vertices connect to form faces, how the faces connect to form a mesh, and finally a texture image that determines how the surface of the mesh is colored. Of course, this is a simplified breakdown, but it highlights the important parts I want to talk about. After a 3D artist has built their model’s mesh, they paint it. The faces of the model are colored in, and to do so they need to be projected onto a 2D image as part of the artist’s workflow. The projection is a manual process called “UV unwrapping”, where the artist selects a collection of connected faces, flattens them out, and assigns this “island” a section on the 2D “texture” image. The result is a 2D image depicting many instances of what we call “UV islands”. We use these UV islands as segments of the 3D model! If you’re wondering what UV stands for, the letters simply label the horizontal and vertical axes, like X and Y on a graph.
Texture image for a model
Faces projected back onto the image to create UV islands
Assigning a part of the mesh a UV coordinate essentially means mapping a region of the texture image to that part of the mesh. By identifying these UV islands in the texture image, we can assign each of them a unique color and then project them back onto the mesh to produce a segmented model!
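To make that concrete, here is a minimal sketch of how one might group faces into UV islands, assuming per-face UV coordinates are already available; the function name and array layout are illustrative, not our production code. Faces that share an edge in UV space get linked together, and the connected components of that graph are the islands.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def uv_islands(faces, face_uvs):
    """Group mesh faces into UV islands (hypothetical helper).

    faces:    (F, 3) vertex indices per triangle.
    face_uvs: (F, 3, 2) UV coordinates of each triangle's corners.
    Returns an array of island labels, one per face.
    """
    # Map each UV edge (pair of UV corner positions) to the faces that use it.
    edge_to_faces = {}
    for f, tri in enumerate(face_uvs):
        for a, b in ((0, 1), (1, 2), (2, 0)):
            # Round so tiny floating-point differences don't split an edge.
            key = tuple(sorted((tuple(np.round(tri[a], 6)),
                                tuple(np.round(tri[b], 6)))))
            edge_to_faces.setdefault(key, []).append(f)

    # Faces sharing a UV edge belong to the same island.
    rows, cols = [], []
    for fs in edge_to_faces.values():
        for i in range(len(fs) - 1):
            rows.append(fs[i])
            cols.append(fs[i + 1])

    n = len(faces)
    adj = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels  # labels[f] is the UV-island (segment) id of face f
```

Each island label can then be turned into a unique color and painted back onto the mesh, which is exactly the segmented model described above.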
We now have data we can use to train our machine learning model! A treasure indeed!
A machine learning model is trained by taking data as input and then producing output. Typically, this data needs to adhere to a standardized structure. Images are easy: crop and scale them to, say, 256x256 pixels and feed them to the model. For 3D models, we have to convert them into point clouds. A point cloud represents the 3D model as a collection of points in space.
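As a rough sketch (not our exact pipeline), converting a mesh into a fixed-size point cloud can be done by sampling points on its surface, weighting each triangle by its area; the function name and the 2048-point budget are illustrative choices.

```python
import numpy as np

def mesh_to_point_cloud(vertices, faces, n_points=2048):
    """Uniformly sample points on a triangle mesh's surface (hypothetical helper).

    vertices: (V, 3) float array, faces: (F, 3) int array of vertex indices.
    Returns an (n_points, 3) array, the kind of input PointNet++ expects.
    """
    tris = vertices[faces]  # (F, 3, 3) corner positions per triangle

    # Sample faces proportionally to their area so big triangles get more points.
    areas = 0.5 * np.linalg.norm(
        np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0]), axis=1)
    chosen = np.random.choice(len(faces), size=n_points, p=areas / areas.sum())

    # Random barycentric coordinates give uniform samples within each triangle.
    u, v = np.random.rand(n_points, 1), np.random.rand(n_points, 1)
    flip = (u + v) > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]

    t = tris[chosen]
    return t[:, 0] + u * (t[:, 1] - t[:, 0]) + v * (t[:, 2] - t[:, 0])
```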