One of the most important aspects of an image processing algorithm is whether it can feasibly be implemented in real time. Real-time processing is necessary in many applications, such as cell phones, machine vision, industrial monitoring, and data encryption, so approaches that reduce computational complexity are important in these applications. The integral image is one such approach. It is used in a variety of algorithms, including object detection and tracking, point correspondence between images, and thresholding. Although the integral image speeds up these algorithms and reduces their complexity, computing it is itself a data-intensive and time-consuming process. Hence, considerable effort has been devoted to accelerating integral image computation on different platforms. Meanwhile, advances in CMOS technology and shrinking transistor sizes have made it possible to integrate processing circuits with image sensors, so that image capture and data processing are performed simultaneously, which improves speed and reduces cost, power, and area. The goal of this thesis is to introduce an architecture in which the integral image is computed at the same time as the image is captured. To calculate the integral image, the image is divided into adjacent non-overlapping blocks. The integral image of each block is calculated independently of the others and sent off the sensor along with the image data. The final integral image is then computed from the block values according to the proposed method. Each block is assumed to be 16×4 pixels, and the CMOS image sensor has two transistors per pixel. Integral image computation within each block is similar to the conventional serial method; however, because the scan mode is changed from row by row to column by column, the proposed method needs to store the integral value of only one pixel.
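The block-wise idea described above can be illustrated in software with the following minimal sketch. It is not the proposed hardware architecture or the thesis's own merging method: the function names are hypothetical, the 16×4 block is assumed to mean 16 rows by 4 columns, and the merge step is one straightforward way to combine independently computed block integrals, using the already merged values on the row above and the column to the left of each block.

```python
def integral_image(img):
    """Conventional serial integral image: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def block_integral_image(img, bh=16, bw=4):
    """Integral image assembled from per-block integral images.

    bh and bw are the assumed block height and width (illustrative only).
    Each block's local integral image depends only on that block's pixels,
    so the blocks could be computed independently; the merge below then
    visits the blocks in raster order.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]

    def merged(y, x):
        # Already merged global value, or 0 outside the image.
        return out[y][x] if y >= 0 and x >= 0 else 0

    for y0 in range(0, h, bh):
        for x0 in range(0, w, bw):
            block = [row[x0:x0 + bw] for row in img[y0:y0 + bh]]
            local = integral_image(block)  # independent of other blocks
            corner = merged(y0 - 1, x0 - 1)
            for ly, local_row in enumerate(local):
                for lx, value in enumerate(local_row):
                    y, x = y0 + ly, x0 + lx
                    # local sum + pixels to the left + pixels above,
                    # minus the corner region that is counted twice
                    out[y][x] = (value + merged(y, x0 - 1)
                                 + merged(y0 - 1, x) - corner)
    return out
```

For a synthetic test image such as `[[(3 * y + 5 * x) % 7 for x in range(10)] for y in range(40)]`, `block_integral_image` returns the same result as `integral_image`, including at image edges where the blocks are only partially filled. The sketch does not model the column-by-column scan or the single stored pixel value of the proposed sensor; it only shows that block results computed in isolation contain enough information to recover the full integral image.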
To evaluate the performance of the proposed architecture, the block output is computed in simulation for different inputs, and the error for each input is reported. The blob detector used in the SURF algorithm is also applied to a 160×160 sample image whose integral image is computed by both the conventional serial method and the proposed method. The outputs of the two methods are very similar, while the proposed method computes the integral image up to 500 times faster.
Keywords: CMOS image sensor, Integral image, Hardware implementation, Parallel processing