Deep learning approaches have achieved revolutionary performance improvements on many computer vision tasks, from understanding natural images and videos to analyzing medical images. Beyond building ever more complex deep neural networks (DNNs) and collecting massive annotated datasets to obtain performance gains, increasing attention is now being paid to the shortcomings of DNNs. As recent research has shown, even when trained on millions of labeled samples, DNNs may still lack robustness to domain shift, small perturbations, and adversarial examples. On the other hand, in many real-world scenarios, e.g., in clinical applications, the number of labeled training samples is significantly smaller than for large existing deep learning benchmarks. Moreover, current deep learning models struggle to generalize to samples with novel combinations of seen elementary concepts.

Therefore, in this thesis, I address these critical needs for making modern deep learning approaches applicable in the real world, concentrating on computer vision tasks. Specifically, I focus on data efficiency, robustness, and generalization. I propose (1) DeepAtlas, a joint learning framework for image registration and segmentation that can learn DNNs for both tasks from unlabeled images and a few labeled images; (2) RandConv, a data augmentation technique that applies a random convolution layer to images during training to improve a DNN's generalization under domain shift and its robustness to image corruptions; and (3) CompGen, a comprehensive study of compositional generalization in unsupervised representation learning, covering disentanglement and emergent language models.
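To illustrate the random-convolution augmentation idea mentioned above, the following is a minimal sketch, assuming a PyTorch training setup; the function name rand_conv, the fixed kernel size, and the weight scaling are illustrative choices, not the exact RandConv implementation described later in this thesis.

```python
# Minimal sketch of random-convolution data augmentation (illustrative only):
# before a training step, the image batch is convolved with a freshly
# sampled random kernel, randomizing local texture while keeping shape.
import torch
import torch.nn.functional as F

def rand_conv(images: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Apply a randomly initialized convolution to an image batch (N, C, H, W)."""
    n, c, h, w = images.shape
    # Sample fresh convolution weights on every call; the scaling keeps the
    # output magnitude roughly comparable to the input.
    weight = torch.randn(c, c, kernel_size, kernel_size, device=images.device)
    weight = weight * (2.0 / (c * kernel_size * kernel_size)) ** 0.5
    return F.conv2d(images, weight, padding=kernel_size // 2)

# Usage: augment a batch before feeding it to the network.
batch = torch.rand(8, 3, 32, 32)            # dummy RGB images
augmented = rand_conv(batch, kernel_size=3)
```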