
[ICCV] PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

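This post reviews PointCLIP V2, a paper that not only improves on the original PointCLIP but also extends it to more tasks, proposing a unified framework that opens up a wider range of uses for 3D point cloud data.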

[Contributions]

Combines CLIP and GPT to improve zero-shot learning on 3D point cloud data
Proposes a unified framework applicable to a range of tasks, including 3D classification, part segmentation, and object detection

1. Introduction & Background

Can CLIP and an LLM, used appropriately, achieve unified open-world understanding of 3D data?

[Limitations of the original PointCLIP]

Sparse projection

PointCLIP simply projects the point cloud to a depth map, which looks quite different from the real images CLIP was pretrained on, and this confuses CLIP's visual encoder.

Naive text

With text inputs designed for 2D images (e.g., "a photo of a [CLASS]"), CLIP fails to recognize the target object properly.

2. Method

📌 Realistic projection

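A realistic depth map $ V $, close to a natural image, is obtained from the 3D data in four steps: Quantize, Densify, Smooth, and Squeeze.

Quantize: The 3D point cloud is converted into a 3D grid of size $ H \times W \times D $, and each point is assigned to the voxel grid $ G $ according to the equation below. If several points fall into the same voxel, the minimum depth value is assigned.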

$ G(\lceil sHx \rceil, \lceil sWy \rceil, \lceil Dz \rceil) = z $
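The Quantize step can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions: point coordinates are normalized to [0, 1], the index brackets in the formula are read as rounding up to an integer, $ s $ is treated as a scale factor in (0, 1], and empty voxels are marked with `np.inf`. The function name `quantize` and its defaults are hypothetical.

```python
import numpy as np

def quantize(points, H=8, W=8, D=8, s=1.0):
    """Scatter points (N, 3) with coordinates in [0, 1] into an H x W x D grid.

    Each occupied voxel stores a depth value z; empty voxels hold np.inf
    (an assumption of this sketch, chosen so that "minimum depth" is easy).
    """
    grid = np.full((H, W, D), np.inf)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Ceiling-based indices, shifted to 0-based and clipped to the grid.
    i = np.clip(np.ceil(s * H * x).astype(int) - 1, 0, H - 1)
    j = np.clip(np.ceil(s * W * y).astype(int) - 1, 0, W - 1)
    k = np.clip(np.ceil(D * z).astype(int) - 1, 0, D - 1)
    # Unbuffered minimum: when several points share a voxel,
    # the smallest depth value is the one that is kept.
    np.minimum.at(grid, (i, j, k), z)
    return grid

# Two of these three points land in the same voxel.
pts = np.array([[0.5, 0.5, 0.30],
                [0.4, 0.4, 0.35],
                [0.9, 0.1, 0.20]])
demo_grid = quantize(pts, H=4, W=4, D=4)
print(np.isfinite(demo_grid).sum(), "occupied voxels")
```

`np.minimum.at` is used instead of plain fancy-index assignment because the latter keeps whichever point happens to be written last, not the one with the smallest depth.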

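Densify: For visual continuity, local minimum-value pooling is applied, so that voxels belonging to the object are filled in while background voxels stay empty.

Smooth: A non-parametric Gaussian kernel smooths the shape and removes noise.

Squeeze: The depth dimension of the grid is squeezed to form a 2D depth map, which is then repeated across three (RGB) channels to match CLIP's input format.

Comparison with other projection methods

Compared with the simple projection used in the original PointCLIP, latency increases, but it is still faster than traditional methods, and accuracy improves substantially.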
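The Densify, Smooth, and Squeeze steps can be sketched in plain NumPy as below. This is only an illustration under stated assumptions, not the paper's code: the input grid stores depth in [0, 1] with `np.inf` for empty voxels, depth is converted to brightness `1 - depth` so that empty voxels become 0 and minimum-depth pooling turns into max pooling, and the window size 3 and `sigma=1.0` are arbitrary choices. All function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def to_brightness(depth_grid):
    """Map depth to 1 - depth; empty voxels (np.inf) become 0 (assumption)."""
    return np.where(np.isfinite(depth_grid), 1.0 - depth_grid, 0.0)

def _windows(vol, size):
    """All size^3 neighbourhoods of a zero-padded volume, stride 1."""
    pad = size // 2
    padded = np.pad(vol, pad, mode="constant", constant_values=0.0)
    return sliding_window_view(padded, (size, size, size))

def densify(vol, size=3):
    """Local pooling: max brightness, i.e. min depth, in each neighbourhood."""
    return _windows(vol, size).max(axis=(-3, -2, -1))

def gaussian_kernel(size=3, sigma=1.0):
    """Fixed (non-learned) 3D Gaussian kernel that sums to 1."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = g[:, None, None] * g[None, :, None] * g[None, None, :]
    return k / k.sum()

def smooth(vol, size=3, sigma=1.0):
    """Filter the volume with the fixed Gaussian kernel."""
    return (_windows(vol, size) * gaussian_kernel(size, sigma)).sum(axis=(-3, -2, -1))

def squeeze(vol):
    """Collapse the depth axis and repeat the map over three channels."""
    depth_map = vol.max(axis=2)                      # (H, W)
    return np.repeat(depth_map[None], 3, axis=0)     # (3, H, W)

def densify_smooth_squeeze(depth_grid):
    return squeeze(smooth(densify(to_brightness(depth_grid))))

# A single occupied voxel in the middle of a 7 x 7 x 7 grid.
demo = np.full((7, 7, 7), np.inf)
demo[3, 3, 3] = 0.4
image = densify_smooth_squeeze(demo)
print(image.shape)
```

Because the kernel is fixed, this version has no trainable parameters; a learnable variant would replace `gaussian_kernel` with trainable weights.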

📌 Prompting with GPT-3

GPT is used to generate the inputs for CLIP's textual encoder
Caption generation / Question answering / Paraphrase generation / Words to sentence
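As a rough illustration of the four command types, the sketch below builds one GPT command per type for a given class name. The template wording is hypothetical and written for this example; it is not quoted from the paper, and no language-model API is called here. The responses returned by the language model would then serve as the class-specific text for CLIP's textual encoder.

```python
# Hypothetical command templates, one per command type (wording is illustrative).
COMMAND_TEMPLATES = {
    "caption_generation": "Describe a depth map of a {cls}:",
    "question_answering": "How would you describe a depth map of a {cls}?",
    "paraphrase_generation": "Paraphrase this sentence: a grey depth map of a {cls}.",
    "words_to_sentence": "Make a sentence using these words: {cls}, depth, map.",
}

def build_commands(class_name):
    """Return one command string per command type for the given class."""
    return {name: tpl.format(cls=class_name)
            for name, tpl in COMMAND_TEMPLATES.items()}

for kind, command in build_commands("airplane").items():
    print(f"{kind}: {command}")
```

Keeping the templates in a dictionary makes it easy to compare which command type yields the most useful descriptions for a given dataset.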

📌 Unified Open-world learning

The original PointCLIP addressed only 3D zero-shot and few-shot classification, whereas PointCLIP V2 shows that, building on its improved performance, it can be applied to a variety of tasks.

zero-shot classification
few-shot classification: the smoothing module is made learnable
zero-shot part segmentation: the dense features ($ F_{i} $) taken before the final pooling operation are matched against the text features at every pixel → the resulting pixel-wise classification logits are back-projected into 3D space
object detection: a pretrained 3DETR-m model generates 3D bounding-box candidates, and each candidate box is passed to PointCLIP V2 for zero-shot classification
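The classification and part segmentation heads can be sketched with cosine similarities between visual and text features. The code below is a simplified illustration, not the authors' implementation: it assumes features have already been extracted, handles a single view for segmentation (points seen in several views would need their logits combined), and assumes each 3D point's pixel location was recorded during projection. All names (`classify`, `segment_points`, `pixel_of_point`) are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize along the last axis so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def classify(view_feats, text_feats, view_weights=None):
    """Zero-shot classification.

    view_feats: (M, C) one pooled feature per projected view
    text_feats: (K, C) one text feature per class
    Returns (K,) class logits, a weighted sum over the M views.
    """
    sims = l2_normalize(view_feats) @ l2_normalize(text_feats).T   # (M, K)
    if view_weights is None:
        view_weights = np.ones(len(view_feats))
    return view_weights @ sims

def segment_points(dense_feats, text_feats, pixel_of_point):
    """Zero-shot part segmentation for one view.

    dense_feats:    (H, W, C) per-pixel features taken before the final pooling
    text_feats:     (P, C) one text feature per part name
    pixel_of_point: (N, 2) integer (row, col) of each 3D point in this view
    Returns (N,) predicted part index per point.
    """
    logits = l2_normalize(dense_feats) @ l2_normalize(text_feats).T   # (H, W, P)
    rows, cols = pixel_of_point[:, 0], pixel_of_point[:, 1]
    point_logits = logits[rows, cols]          # back-projection: (N, P)
    return point_logits.argmax(axis=1)

# Toy example with 2-dimensional features and two parts.
dense = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 0.1], [0.1, 1.0]]])
parts = np.array([[1.0, 0.0], [0.0, 1.0]])
pix = np.array([[0, 0], [0, 1], [1, 1]])
print(segment_points(dense, parts, pix))
```

For detection, the same `classify` function could be applied to the points inside each candidate box, which is why the quality of the box proposals matters so much for the final result.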

3. Experiments

3D zero-shot

3D few-shot

part segmentation & object detection

This made me wonder whether the object detection results depend too heavily on the performance of the pretrained model.

4. Conclusion

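Proposes PointCLIP V2, a 3D open-world learning model that outperforms the original PointCLIP
Introduces a module for generating more realistic depth maps and uses GPT to generate text better suited to 3D
Evaluates performance on tasks beyond classification
Future work will study how to apply the approach to broader applications (e.g., outdoor 3D detection, visual grounding)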
