Comparing the performance of deep learning (DL) algorithms against human interobserver variability, one of the largest sources of noise in human-performed annotations, is necessary to inform the clinical use and quality assurance of DL for prostate radiotherapy.