
[Solved] Watchdog caught collective operation timeout: setting the NCCL timeout for Multi-GPU (DDP) training

A short English summary follows further down.

After thinking about nothing but this for three days, I finally solved it. This is the timeout error that shows up when training on a large dataset with multiple GPUs.

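Error message

If you only look at the tail of the output, it says that subprocess.run() received some odd return value, so you might suspect a wrong parameter setting. But if you look at the point where the error actually occurred, you can find messages like the ones below.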

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806474 milliseconds before timing out.
torch.distributed.elastic.multiprocessing.api:Sending process 3274022 closing signal SIGTERM
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
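
To make the numbers in that first line easier to read, here is a minimal sketch that pulls the limit and the elapsed time out of the watchdog message and converts them from milliseconds. The regular expression and the function name `parse_watchdog_line` are my own assumptions, written only for this one message format; they are not part of PyTorch.

```python
import re
from datetime import timedelta

# Log line copied from the error output above.
LOG_LINE = (
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1806474 milliseconds before timing out."
)

# Hypothetical pattern, written for this specific message format only.
PATTERN = re.compile(
    r"OpType=(?P<op>\w+), Timeout\(ms\)=(?P<limit>\d+)\) "
    r"ran for (?P<elapsed>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return (op, limit, elapsed) from a watchdog timeout line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Both values are reported in milliseconds.
    limit = timedelta(milliseconds=int(match["limit"]))
    elapsed = timedelta(milliseconds=int(match["elapsed"]))
    return match["op"], limit, elapsed


op, limit, elapsed = parse_watchdog_line(LOG_LINE)
print(f"{op}: limit {limit}, waited {elapsed}")
```

So the ALLREDUCE waited just past a 30-minute limit, which fits a long preprocessing step on one rank while the other ranks sit in the collective.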

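After all kinds of trial and error, I realized that NCCL kills the process once a certain amount of time has passed during image scanning. I first tried to solve it through the environment variables of the distributed communication package, but that did not seem to work. In the end I edited the file inside the package directly so that the timeout is not enforced, and raised the wait time to 2 hours.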

Solution

def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    # Excerpt from a trainer method: relies on os, torch, torch.distributed (as dist)
    # and datetime.timedelta imported at module level; RANK and LOGGER come from the package.
    torch.cuda.set_device(RANK)
    self.device = torch.device('cuda', RANK)
    LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ['NCCL_BLOCKING_WAIT'] = '0'  # changed: do not enforce the timeout
    dist.init_process_group('nccl' if dist.is_nccl_available() else 'gloo',
                            timeout=timedelta(hours=2),  # changed: 2 h (7,200,000 ms); the log showed 1,800,000 ms
                            rank=RANK,
                            world_size=world_size)
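
One thing worth double-checking here is the unit. The log reports its limit in milliseconds, while `timedelta` takes its unit from the keyword argument, so the same number can mean very different durations. A small standard-library sketch of the arithmetic (the helper `to_ms` is a name I made up for illustration):

```python
from datetime import timedelta

# The limit printed in the log, which is given in milliseconds.
default_limit = timedelta(milliseconds=1_800_000)

# The intended new limit: two hours.
two_hours = timedelta(hours=2)


def to_ms(delta):
    """Convert a timedelta to whole milliseconds (hypothetical helper)."""
    return int(delta / timedelta(milliseconds=1))


print(default_limit)      # the 30-minute limit from the log
print(to_ms(two_hours))   # two hours expressed in milliseconds

# The same digits passed as seconds instead of milliseconds give a much longer wait.
as_seconds = timedelta(seconds=7_200_000)
print(as_seconds.days)    # whole days contained in 7,200,000 seconds
```

Either value is long enough to get past a slow scan, but only `timedelta(hours=2)` (or `seconds=7200`) is actually a 2-hour limit.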

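Things I tried

Using the latest Docker image - I suspected a package problem, but this failed too
Reducing the batch size
Increasing shared memory (RAM)
Using a single GPU -> the scan works (???), but training would take more time than one GPU can handle, so I gave up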

Watchdog caught collective operation timeout

Error message

At first glance it looks like subprocess.run() returned something unexpected, so I initially thought my parameters were wrong. However, if you scroll up to the point where the error occurred, you can find the messages below. (In my case, it happened while scanning images for training.)

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806474 milliseconds before timing out.
torch.distributed.elastic.multiprocessing.api:Sending process 3274022 closing signal SIGTERM
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

How to fix

def _setup_ddp(self, world_size):
    """Initializes and sets the DistributedDataParallel parameters for training."""
    # Excerpt from a trainer method: relies on os, torch, torch.distributed (as dist)
    # and datetime.timedelta imported at module level; RANK and LOGGER come from the package.
    torch.cuda.set_device(RANK)
    self.device = torch.device('cuda', RANK)
    LOGGER.info(f'DDP info: RANK {RANK}, WORLD_SIZE {world_size}, DEVICE {self.device}')
    os.environ['NCCL_BLOCKING_WAIT'] = '0'  # changed: do not enforce the timeout
    dist.init_process_group('nccl' if dist.is_nccl_available() else 'gloo',
                            timeout=timedelta(hours=2),  # changed: 2 h (7,200,000 ms); the log showed 1,800,000 ms
                            rank=RANK,
                            world_size=world_size)

I edited trainer.py in the ultralytics package directly and set a longer timeout limit. The lines I changed are the ones with trailing comments.
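
If you control the training script yourself rather than going through a package, the same idea can probably be applied without patching installed files, since `torch.distributed.init_process_group` accepts a `timeout` argument. This is only a sketch under assumptions: the function names `ddp_init_kwargs` and `init_ddp` and the `hours` parameter are hypothetical, and it assumes the script is started by a launcher such as torchrun that sets the rendezvous environment variables.

```python
from datetime import timedelta


def ddp_init_kwargs(rank, world_size, hours=2.0):
    """Build keyword arguments for torch.distributed.init_process_group.

    `hours` is a hypothetical knob for this sketch; 2 hours mirrors the post.
    """
    return {
        "timeout": timedelta(hours=hours),
        "rank": rank,
        "world_size": world_size,
    }


def init_ddp(rank, world_size, hours=2.0):
    """Initialize the default process group with a longer collective timeout."""
    # Imported lazily so the helper above works even without PyTorch installed.
    import torch.distributed as dist

    # Same backend choice as in the trainer excerpt: NCCL if available, else Gloo.
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend, **ddp_init_kwargs(rank, world_size, hours))


# Inspect the arguments without starting any process group.
print(ddp_init_kwargs(rank=0, world_size=2))
```

Calling `init_ddp(rank, world_size)` once per process, before any collective, would then take the place of the edited lines in trainer.py.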
