Explore the field of computer vision covering a brief history, techniques, common applications and industry use cases.
What is Computer Vision?
Computer vision is a field of artificial intelligence that trains computers to understand the visual world. Machines with computer vision capability interpret, identify, and locate objects within a visual scene captured from digital images, videos, or camera feeds. Further, they can make decisions based on the interpretation of what they “see” using machine learning models.
The key aim is to teach computers to extract and understand data “hidden” within digital images and videos. Computer vision employs a host of techniques to understand complex higher-dimensional data from images and videos depicting our environment.
This blog post covers these techniques and touches upon the various means of scientific and technical explorations and the practical use cases of computer vision.
Computer Vision & Artificial Intelligence
As per the definition provided by Prof. Fei-Fei Li, computer vision is “a subset of mainstream artificial intelligence that deals with the science of making computers or machines visually enabled, i.e., they can analyze and understand an image.” Computer Vision emulates human vision using digital images. It also relies on techniques such as pattern recognition, widely used across other areas of artificial intelligence. Consequently, computer vision is sometimes considered a part of the artificial intelligence field or the computer science field in general.
History of Computer Vision
During the late 1950s and early 1960s, computer vision was mostly limited to image analysis experiments that tried to mimic the human visual system and gauge how closely a computer could resemble it.
One of the first breakthroughs came during the 1950s, when early neural networks were used to detect the edges of an object in an image. This approach made it possible to sort simple shapes into categories such as circles and squares. Later, during the 1970s, commercially deployed computer vision algorithms could interpret typed or handwritten text using optical character recognition.
In the 1990s and 2000s, large datasets of human images became available online for analysis, providing an impetus for facial recognition research. The accuracy of computer vision algorithms has also gone up significantly: starting from a mere 50% accuracy for object identification and classification, the techniques used today reach close to 99% accuracy, an improvement achieved within the span of a few decades.
Common Techniques and Processes in Computer Vision
Computer vision is not just about converting a picture into pixels and then using some algorithms to make sense of those pixels. It is imperative to understand the bigger picture of extracting information from the pixels and interpreting what they represent.
In principle, computer vision depends on pattern recognition techniques to self-train and comprehend visual data. It involves a processing pipeline in the following stages:
1. Input Data: Images and video frames are broken down into pixels, the smallest units of information that make up a picture. In a grayscale image, each pixel's brightness is represented by a single 8-bit number, ranging from 0 (black) to 255 (white).
2. Pre-processing: Computers usually read color as a series of three values, red, green, and blue (RGB), each on the 0–255 scale. Each pixel of a color image therefore has three values for the computer to store in addition to its position. The pixel-based representation undergoes pre-processing steps such as normalization, standardization, color correction, scaling, and resizing.
3. Selecting areas of interest: A single image takes up a lot of memory and often has too many pixels for an algorithm to iterate over efficiently. Therefore, it is essential to select specific areas of interest within the image by choosing a subset of pixels representing the main subject or the various objects in the scene. These areas are isolated with the help of cropping and image segmentation, among other techniques.
4. Feature extraction: Selected areas are then analyzed separately to build a vector representation of the scene, suitable for extracting features based on edge detection, object recognition, and transformation.
5. Prediction/Recognition: Finally, the extracted features are run through a machine learning model to arrive at predictions. The model has to be trained with thousands of images containing similar scenes to reach a meaningful accuracy.
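To make the five stages above a little more concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than a real system: the 4x4 image, the function names, the crude neighbor-difference edge detector, and the hand-set threshold that stands in for a trained model. The grayscale conversion uses the commonly quoted BT.601 luma weights. A production pipeline would normally use an image library instead of nested lists.

```python
BLACK, WHITE = (0, 0, 0), (255, 255, 255)

# 1. Input data: a 4x4 RGB image, three 0..255 values per pixel (made up).
rgb_image = [
    [BLACK, BLACK, BLACK, BLACK],
    [BLACK, WHITE, WHITE, BLACK],
    [BLACK, WHITE, WHITE, BLACK],
    [BLACK, BLACK, BLACK, BLACK],
]

def to_gray(r, g, b):
    """Collapse an RGB pixel into one 0..255 brightness value (BT.601 weights)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def normalize(img):
    """2. Pre-processing: scale 0..255 pixel values into the 0.0..1.0 range."""
    return [[p / 255 for p in row] for row in img]

def crop(img, top, left, height, width):
    """3. Area of interest: keep only a rectangular subset of pixels."""
    return [row[left:left + width] for row in img[top:top + height]]

def horizontal_gradient(img):
    """4. Feature extraction: absolute difference between horizontal
    neighbors, a very crude edge detector."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
            for row in img]

def edge_strength(img):
    """Collapse the gradient map into a single feature: mean edge strength."""
    values = [v for row in horizontal_gradient(img) for v in row]
    return sum(values) / len(values)

def predict(feature, threshold=0.25):
    """5. Prediction: a hand-set threshold standing in for a trained model."""
    return "has edges" if feature > threshold else "flat"

gray = [[to_gray(*px) for px in row] for row in rgb_image]
pixels = normalize(gray)
roi = crop(pixels, top=1, left=0, height=2, width=4)  # the two middle rows
feature = edge_strength(roi)
label = predict(feature)
print(label, round(feature, 3))
```

In a real application the final step would be a model trained on many labeled images; the threshold is only there to keep the sketch self-contained.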
Deep Learning for Computer Vision
Traditional machine learning models offer a set of elementary techniques for solving computer vision problems. These models rely on statistical learning algorithms such as linear regression, logistic regression, decision trees, or support vector machines (SVM) to detect patterns, classify images, and detect objects within them. This approach is limited when it comes to building truly intelligent computer vision systems. It also requires a lot of manual coding and training effort by developers and human operators.
Thanks to advances in artificial intelligence and innovations in deep learning and neural networks, the field of computer vision has taken great leaps in recent years, to the extent that it now outperforms humans in specific tasks related to detecting and labeling objects. The wide availability of data, along with organizations' willingness to share datasets, has allowed deep learning practitioners to use this information to make their models increasingly accurate.
Deep learning relies on neural networks. A neural network is a general-purpose function that can solve a problem representable through examples. When a neural network is provided with many labeled examples of a specific kind of data, it can extract the common patterns between those examples and encode them in a mathematical function that classifies future pieces of information.
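The idea of learning a classifying function from labeled examples can be sketched with the smallest possible "network": a single neuron trained by gradient descent. The data below is hypothetical (two-pixel "images" labeled bright or dark), and the function names, learning rate, and epoch count are assumptions picked for illustration, not a recommended setup.

```python
import math

# Hypothetical labeled examples: two-pixel "images" with brightness in 0..1.
# Label 1 means "bright", label 0 means "dark".
examples = [
    ([0.9, 0.8], 1), ([0.8, 1.0], 1), ([1.0, 0.9], 1),
    ([0.1, 0.2], 0), ([0.0, 0.1], 0), ([0.2, 0.0], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, pixels):
    """Probability that the input belongs to class 1 ("bright")."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return sigmoid(z)

def train(data, epochs=1000, learning_rate=1.0):
    """Fit one neuron with batch gradient descent on the cross-entropy loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for pixels, label in data:
            error = predict(weights, bias, pixels) - label
            for i, p in enumerate(pixels):
                grad_w[i] += error * p
            grad_b += error
        # Move the parameters against the average gradient.
        weights = [w - learning_rate * g / len(data)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(data)
    return weights, bias

weights, bias = train(examples)

# The learned function can now score examples it has never seen.
print(predict(weights, bias, [0.95, 0.9]) > 0.5)   # unseen bright example
print(predict(weights, bias, [0.05, 0.1]) > 0.5)   # unseen dark example
```

The learned weights and bias are the "mathematical function" in miniature; a deep network does the same thing with many more neurons and layers.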
In most cases, creating a good deep learning model comes down to gathering a large amount of labeled training data and tuning parameters such as the type and number of layers in the neural network and the number of training epochs. Compared to traditional machine learning, which depends on hand-crafted features, deep learning is often easier and faster to develop and deploy.
Convolutional Neural Network
The classical problem in computer vision, image processing, and machine vision is determining whether or not the image data contains specific objects, features, or activities. Currently, the best algorithms for such tasks are based on convolutional neural networks.
A plain artificial neural network (ANN) can handle the task, but because its hidden layers are fully connected, it takes a very long time to train on images. For this reason, convolutional neural networks (CNNs) are used. A Convolutional Neural Network (ConvNet/CNN) is a deep learning algorithm that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in the image, and learns to differentiate one from the other.
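A quick back-of-the-envelope calculation shows why fully connected layers are so expensive on images. The image size (224x224 RGB), the 1,000 hidden units, and the 64 filters of size 3x3 are assumed values chosen only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a convolutional layer.
    The same small kernels are reused at every image position."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

height, width, channels = 224, 224, 3   # assumed input image size

dense = dense_params(height * width * channels, 1000)
conv = conv_params(3, 3, channels, 64)

print(f"fully connected layer: {dense:,} parameters")
print(f"convolutional layer:   {conv:,} parameters")
```

Under these assumptions the fully connected layer needs roughly 150 million parameters, while the convolutional layer needs fewer than two thousand, because its weights are shared across all positions in the image.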
A CNN first reduces images using convolutional layers and pooling layers and then feeds the reduced data to fully connected layers.
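As a rough sketch of this reduction, the snippet below runs a tiny image through one convolution, a ReLU activation, and one max-pooling step, then flattens the result into the vector that a fully connected layer would receive. The 6x6 image and the hand-written vertical-edge kernel are assumptions for illustration; in a real CNN the kernel values are learned.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D convolution, written as cross-correlation,
    which is the convention deep learning libraries follow."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def relu(feature_map):
    """Keep positive responses, zero out negative ones."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[y + i][x + j]
                for i in range(size) for j in range(size))
            for x in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for y in range(0, len(feature_map) - size + 1, size)
    ]

def flatten(feature_map):
    return [v for row in feature_map for v in row]

# A 6x6 image that is dark on the left and bright on the right.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]
vertical_edge = [[-1, 0, 1]] * 3   # hand-written kernel, normally learned

feature_map = relu(conv2d(image, vertical_edge))   # 6x6 -> 4x4
pooled = max_pool(feature_map)                     # 4x4 -> 2x2
flat = flatten(pooled)                             # 2x2 -> 4 values

print(len(flatten(image)), "->", len(flatten(feature_map)), "->", len(flat))
```

The 36 input pixels shrink to 16 feature values and then to 4, while the information about where the edge sits is preserved.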
A CNN is a class of deep feedforward neural network inspired primarily by the biological visual system. Its connectivity pattern resembles that of the visual cortex, where each individual cortical neuron responds to stimuli only in a restricted region of the visual field known as the receptive field, i.e., a restricted subarea of the input. The receptive fields of different neurons overlap in such a way that they collectively cover the entire image.
In a CNN, each convolution neuron processes data only for its receptive field, and they are organized in such a way that they collectively also represent the entire image. Moreover, both the biological visual system and CNN have a hierarchy of layers that progressively extract more and more features. These layers are arranged in increasing order of complexity, starting from simple visual representations such as edges, lines, curves, and gradually more complex representations such as faces, instances, etc. This results in the ability to understand complex images.
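One way to see how deeper layers come to represent larger and more complex structures is to track how the receptive field grows as layers are stacked. The sketch below uses the standard recurrence for receptive field size along one axis; the particular stack of layers is an assumption made up for this example.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride) pairs, in order.
    Returns the receptive field size, in input pixels along one axis,
    of a single neuron after each layer."""
    rf, jump = 1, 1          # jump = distance in input pixels between
    sizes = []               # neighboring neurons of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Assumed stack: two 3x3 convolutions, a 2x2 pooling with stride 2,
# then one more 3x3 convolution.
stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
sizes = receptive_fields(stack)
print(sizes)
```

A neuron in the first layer sees only a 3-pixel-wide patch, enough for an edge, whereas a neuron after the fourth layer already covers a 10-pixel-wide region of the input, enough for a larger pattern.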
There are three main components of CNNs: the convolutional layer, the pooling layer, and the fully connected layer. Images are fed in as input, converted to tensors, and passed on to the CNN block. The CNN block has multiple convolutional layers stacked one after another: the early layers extract edges and gradients, later layers extract textures and patterns, and the deepest layers respond to parts of objects or the dominant subject. Towards the final convolutional layers, we can expect channels to resemble the original object we are attempting to classify.
Training a CNN for image classification has two parts: the forward pass and backpropagation. In the forward pass, the network computes a prediction for the input; in backpropagation, the prediction error is used to adjust the weights. The forward pass followed by backpropagation is repeated for as many epochs as we choose to train the model.
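The forward-then-backward loop can be illustrated on a deliberately tiny case: learning a single 1D convolution kernel. Because there is only one linear layer here, "backpropagation" reduces to a single gradient computation; a real CNN chains the same step through many layers. The signal, the target kernel [-1, 1], the learning rate, and the epoch count are all assumptions for this sketch.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation form)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def train_kernel(signal, target, kernel_size=2, epochs=200, learning_rate=0.5):
    kernel = [0.0] * kernel_size
    losses = []
    n = len(target)
    for _ in range(epochs):
        # Forward pass: compute predictions and the mean squared error.
        prediction = conv1d(signal, kernel)
        errors = [p - t for p, t in zip(prediction, target)]
        losses.append(sum(e * e for e in errors) / n)
        # Backward pass: dLoss/dkernel[j] = (2/n) * sum_i errors[i] * signal[i+j]
        grads = [2.0 / n * sum(errors[i] * signal[i + j] for i in range(n))
                 for j in range(kernel_size)]
        # Update: step against the gradient.
        kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]
    return kernel, losses

signal = [0, 0, 1, 1, 0, 1, 0, 0]
target = conv1d(signal, [-1, 1])   # output of the edge detector we want to learn

kernel, losses = train_kernel(signal, target)
print([round(w, 3) for w in kernel])
print(losses[0], losses[-1])
```

Starting from an all-zero kernel, each epoch nudges the weights so that the loss shrinks, and the kernel ends up very close to the edge detector that produced the targets.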
CNNs are very good at classifying fine-grained objects, a task that is often very difficult for human vision.
Computer Vision Applications
Computer vision tasks range across various applications. These are more or less well-defined measurement or processing problems, which can be solved using a variety of pre-existing methods. Some common examples of typical computer vision tasks are:
Object Identification & Verification
Object tracking and action recognition
In recent years, computer vision has also helped solve more intricate problems such as motion analysis, image restoration, and scene reconstruction. One of the contentious applications resulting from these advancements is Deepfake, a unique image synthesis technique that mimics a human character in its true facial expression and vocal tone.
Computer Vision Use Cases
With the technological advancements around deep learning coupled with cheaper and more powerful hardware, computer vision has garnered a lot of interest from the business world.
Let's take a look at some of the industry-specific use cases of computer vision:
Transportation and Motor Vehicles: This is perhaps the most talked-about use case of computer vision. By interpreting images of the surroundings, fed through a set of cameras, a computer vision system augments a vehicle's visual sense to detect the edges of roads, read traffic signs, and detect other cars, objects, and pedestrians. It helps the self-driving vehicle steer its way through streets and highways, avoid hitting obstacles, and (hopefully) safely drive its passengers to their destination.
Law Enforcement: Law enforcement agencies increasingly rely on facial recognition technology to identify criminals in video feeds. In the aftermath of Covid-19, several ancillary applications have been developed to augment facial recognition technology to aid in the enforcement of masks and social distancing in public places.
Digital Forensics: In some specialized scenarios, computer vision is used to estimate a person's height in an image, an important measurement for corroborating pieces of evidence. Computer vision techniques are also used to build a three-dimensional model from photographs of a footwear impression, a type of evidence commonly found at crime scenes, and to reconstruct shredded images. Besides, these techniques are employed to aid traditional forensic tasks such as recognizing handwritten text or fingerprints.
Real estate: In recent times, computer vision applications have been integrated with augmented reality (AR) gear to recreate virtual reality scenes fused with real-world locations. Computer vision algorithms also help AR applications detect surfaces such as tabletops, walls, and floors, a significant part of establishing depth and dimensions and placing virtual objects in the physical world. These methods find applications in real-estate environments, such as building mock interiors and providing virtual walkthroughs.
Healthcare: Computer vision algorithms have achieved some groundbreaking progress in detecting cancerous moles in skin images or finding similar symptoms in x-ray and MRI scans. The scope of computer vision in medical diagnosis continues to expand every year. It is expected to achieve significant breakthroughs in telemedicine and remote patient monitoring in the coming years.
Manufacturing: Computer Vision helps in automatic inspection in manufacturing processes. For example, industrial robots monitor the assembly lines for any exceptional incident and aid in anomaly detection of raw material or finished products.
Retail: CCTV systems augmented with computer vision algorithms help detect events, for visual surveillance or people counting in retail stores. Additionally, they can also enhance the internal processes related to inventory and stocking by visually monitoring the shelves' depletion and replenishment rates.
Agriculture: Computer vision finds a lot of applications in crop monitoring and disease detection. Many of the common anomalies in crops can be detected with the naked eye, so computer vision applications can be trained to spot the same visual cues and monitor crops at a much larger scale. Similar techniques are also applicable to food quality inspection.
Military: There is a wide array of use cases for computer vision in a military environment. Typical applications include the detection of enemy soldiers or vehicles and missile guidance. Some of these operations rely on thermal imaging and night vision cameras that require specialized computer vision algorithms to boost the "battlefield awareness" of the field troops. These applications also rely on unique, military-grade sensors, including image sensors, to get a rich set of information about a combat scene that can then support tactical decisions.
The Way Forward
The advancement in computer vision has mostly aligned with the newer technologies that have augmented artificial intelligence. But there are challenges.
Like all applications of artificial intelligence, computer vision is data-dependent, and its algorithms face various challenges related to data quality. The data they receive can be incomplete, noisy, or simply too big for a computer's memory or processing ability to handle.
However, given the current hype around AI, and the increasing funding trends, there is hope. Computer Vision's market is progressing as fast as its capabilities and is estimated to reach $26.2 billion by 2025. This is almost a 30% increase every year. It is evident in the recent technological improvements that if AI is the future, then computer vision will be the most fantastic application.
Without a doubt, computer vision applications will grow from strength to strength in the coming years to realize the dream of a self-learning, fully automated computer vision system that can completely replace human intervention for a generic use case.