First Things First: The Birth of Computer Vision
To explain Computer Vision, we first need to define vision. Vision is a process that produces, from images of the external world, a description that is useful to the viewer and not cluttered with irrelevant information. Briefly, vision is a mapping from one representation to another: from detected two-dimensional visual information to a three-dimensional description of the world. As humans, we comprehend the ‘what’ (the detected visual information) before the ‘how’ (the mapping from one representation to another).
- For instance, consider a computer system sitting on an office desk. Human vision can perform subtle tasks such as recognition and segmentation. As a result, a person can segment out the monitor, mouse, CPU, and keyboard from the desk and recognize each object.
- Vision researchers have worked for centuries to understand the ‘how’ of human vision. Their discoveries encouraged computer scientists to replicate human vision in software, and so computer vision was born.
Computer Vision and Artificial Intelligence:
- Computer Vision is a field of artificial intelligence that tackles the inverse problem of describing the world from digital images or videos by extracting meaningful information from them. The field has become vital for categorizing, indexing, and classifying the massive amount of images and videos available on the internet.
- Moreover, Computer Vision has grown in importance as a key driver of applications in healthcare, process automation, surveillance systems, cashier systems, augmented reality, self-driving cars, and robotic vision. Deep Learning, a branch of machine learning based on Artificial Neural Networks (ANNs), made it possible to apply Computer Vision successfully in these applications. This is due to its ability to build high-level learned representations from knowledge gained at various levels.
Image Basics:
A computer displays a digital image as a two-dimensional array of picture elements, known as pixels. The values of these pixels are finite and discrete. Viewed as a matrix, an image has W columns and H rows, so its size in pixels is W × H. The origin (0,0) is the top-left pixel of the image. In an (x, y) coordinate system, x increases from left to right and y increases from top to bottom.
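
The layout described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the image is stored as a list of rows, the sizes are made-up values, and `pixel_at` is a hypothetical helper name. In practice an array library would normally be used instead of nested lists.

```python
# A tiny gray-scale image stored as H rows of W pixel values.
W, H = 4, 3  # width (columns) and height (rows); illustrative values only

# Give every pixel a value derived from its position so it is easy to inspect.
image = [[10 * y + x for x in range(W)] for y in range(H)]


def pixel_at(img, x, y):
    """Return the pixel at column x, row y; (0, 0) is the top-left corner."""
    return img[y][x]  # the row index (y) comes first, then the column index (x)


num_pixels = W * H                            # size of the image in pixels
top_left = pixel_at(image, 0, 0)              # pixel at the origin
bottom_right = pixel_at(image, W - 1, H - 1)  # largest x and largest y

print(num_pixels)    # 12
print(top_left)      # 0
print(bottom_right)  # 23
```

Note that the row index comes first when indexing, even though coordinates are usually written as (x, y). Mixing up the two orders is a common source of bugs.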

A gray-scale image has only one channel, whereas a color image has three channels: red, green, and blue (RGB). The pixel values of a digital image are generally in the range 0 to 255 (8 bits per channel). For instance, a pixel value of (0,0,0) is black, and a pixel value of (255,255,255) is white. As the value increases, the shade of gray becomes lighter in a gray-scale image, and the brightness of the color increases in a color image. Digital images are commonly represented as (C, H, W), where C is the number of channels, H is the height, and W is the width of the image.
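
As a rough illustration of the (C, H, W) layout and the 0 to 255 value range, the following sketch builds a small color image from nested Python lists. The helper names (`make_image`, `shape`, `get_rgb`, `set_rgb`) are hypothetical and chosen for this example only. Real code would typically rely on an array or imaging library.

```python
def make_image(channels, height, width, fill=0):
    """Create a (C, H, W) image as nested lists with every value set to `fill`."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]


def shape(img):
    """Return (C, H, W) for a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))


def get_rgb(img, x, y):
    """Return the (R, G, B) values at column x, row y."""
    return tuple(img[c][y][x] for c in range(3))


def set_rgb(img, x, y, rgb):
    """Write an (R, G, B) triple at column x, row y, enforcing the 0..255 range."""
    for value in rgb:
        if not 0 <= value <= 255:
            raise ValueError("pixel values must be in the range 0..255")
    for c, value in enumerate(rgb):
        img[c][y][x] = value


color = make_image(3, 2, 4)            # 3 channels, height 2, width 4; all black
set_rgb(color, 1, 0, (255, 255, 255))  # one white pixel
set_rgb(color, 2, 1, (255, 0, 0))      # one pure red pixel

print(shape(color))          # (3, 2, 4)
print(get_rgb(color, 0, 0))  # (0, 0, 0)
print(get_rgb(color, 1, 0))  # (255, 255, 255)
```

A gray-scale image fits the same structure with C = 1, so `make_image(1, H, W)` would hold a single channel of intensities.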
Image Recognition:

Image Recognition is a field of Computer Vision. It involves one or a combination of the following tasks: image classification, image localization, object detection, and image segmentation. Image classification assigns the given image to a particular class. Image localization finds the location of an object in the given image. Object detection combines image classification and image localization: objects in the given image are found, each is enclosed in a bounding box, and a label is assigned to each box. Image segmentation extracts information from an image by dividing it into parts and processing the relevant ones.
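
To make the outputs of these tasks more concrete, here is a toy sketch. It uses a fixed brightness threshold as a stand-in for segmentation and a bounding box around the bright pixels as a stand-in for localization. This is an assumption made purely for illustration: real recognition systems generally rely on learned models, and the function names, the threshold of 128, and the label text are all hypothetical.

```python
def segment(img, threshold=128):
    """Return a binary mask: 1 where the pixel value is >= threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in img]


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) around all foreground pixels, or None."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None  # nothing was segmented as foreground
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


# A 4 x 5 gray-scale image with one bright 2 x 2 object on a dark background.
image = [
    [0,   0,   0, 0, 0],
    [0, 200, 220, 0, 0],
    [0, 210, 230, 0, 0],
    [0,   0,   0, 0, 0],
]

mask = segment(image)     # segmentation: which pixels belong to the object
box = bounding_box(mask)  # localization: where the object is
detection = {"label": "bright object", "box": box}  # detection: label plus box

print(box)  # (1, 1, 2, 2)
```

The three results mirror the task definitions: the mask marks object pixels, the box gives the object's location, and the detection pairs a label with that box.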