Estimating depth is a crucial component of computer vision, enabling many further applications such as robot vision, 3D modeling, and, above all, 2D to 3D image/video conversion. Since an infinite number of possible world scenes can produce the same image, single image depth estimation without any prior information about the scene is a highly challenging, ill-posed task. Humans, however, thanks to the data and knowledge they have accumulated over years, can perceive depth from a monocular image with no difficulty. This suggests that using monocular depth cues to simulate the human visual system's depth perception should make single image depth estimation an achievable goal. This observation has motivated several recent data-driven approaches, which learn the relationship between depth and these cues from a pool of images for which the depth is known. For such an ambiguous problem with large sources of uncertainty, neither a purely local nor a purely global view of the scene is sufficient for precise single image depth estimation. To this end, in this thesis, building on a number of robust and effective depth-related features, we introduce a patch-based framework that jointly benefits from the local and global structures of a scene. We formulate monocular depth estimation both as a similar-image-patch retrieval method and as single-level and multi-level learning models. Our experimental results demonstrate that our depth estimation models are more accurate than existing methods on a standard dataset.

Keywords: Depth estimation, 2D to 3D image/video conversion, Monocular depth perception cues, Data-driven approaches, Multi-level learning model
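
To make the retrieval formulation concrete, the following minimal sketch illustrates depth transfer by patch retrieval: training images with known depth are cut into patches, each patch is described by a simple hand-crafted feature, and each query patch receives the depth of its nearest neighbours in feature space. The patch size, the feature, and the synthetic data are assumptions chosen for illustration only; the thesis's actual monocular cues, global-structure modeling, and multi-level learning models are not reproduced here.

```python
# Illustrative sketch of depth estimation as similar-patch retrieval.
# Feature, patch size, and data are hypothetical stand-ins, not the
# thesis's actual cues or dataset.
import numpy as np

PATCH = 16  # assumed patch size in pixels


def patch_feature(patch):
    """Toy feature: mean intensity plus mean gradient magnitudes.
    A real system would use richer monocular depth cues."""
    gy, gx = np.gradient(patch.astype(np.float64))
    return np.array([patch.mean(), np.abs(gx).mean(), np.abs(gy).mean()])


def split_patches(img):
    """Yield (row, col, patch) over a non-overlapping patch grid."""
    h, w = img.shape
    for r in range(0, h - PATCH + 1, PATCH):
        for c in range(0, w - PATCH + 1, PATCH):
            yield r, c, img[r:r + PATCH, c:c + PATCH]


def build_database(train_pairs):
    """Collect (feature, median depth) for every patch of every
    (image, depth map) pair -- the pool of images with known depth."""
    feats, depths = [], []
    for img, dep in train_pairs:
        for r, c, p in split_patches(img):
            feats.append(patch_feature(p))
            depths.append(np.median(dep[r:r + PATCH, c:c + PATCH]))
    return np.array(feats), np.array(depths)


def estimate_depth(img, feats, depths, k=5):
    """Assign each query patch the mean depth of its k nearest
    neighbours in feature space (the local, retrieval-based step)."""
    out = np.zeros(img.shape)
    for r, c, p in split_patches(img):
        dist2 = ((feats - patch_feature(p)) ** 2).sum(axis=1)
        nearest = np.argsort(dist2)[:k]
        out[r:r + PATCH, c:c + PATCH] = depths[nearest].mean()
    return out


# Toy usage with synthetic arrays standing in for a real RGB-D dataset.
rng = np.random.default_rng(0)
train = [(rng.random((64, 64)), rng.random((64, 64))) for _ in range(4)]
feats, depths = build_database(train)
pred = estimate_depth(rng.random((64, 64)), feats, depths)
print(pred.shape)  # (64, 64)
```

A purely local transfer of this kind produces blocky, globally inconsistent depth; this is precisely the gap the thesis addresses by combining the local retrieval view with global scene structure.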