By: Rohan Bhambhoria

Computer vision is a subfield of artificial intelligence concerned with understanding visual inputs. Data of interest to computer vision researchers includes images, videos, and sometimes even text, since natural language processing can promote the interpretability of images by means of captions or relationships between objects in an image.

How does it work?

Computer vision mainly deals with feeding images into convolutional neural network architectures such as LeNet[1], VGG-16[2], and, more recently, ResNet[3] and DenseNet[4], among others, which are capable of automatically extracting significant aspects of the inputs. An intuitive way to understand this: the earlier layers of the neural network extract edges from the image, deeper layers further extract corners and contours, and even deeper layers are capable of extracting finer details such as object parts, as shown in Figure 1. In doing so, a neural network is able to distinguish between different objects within the input images.
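A minimal sketch of the building block behind this: sliding a small filter over an image, the same operation a network's first layer learns (strictly, deep learning libraries compute cross-correlation, as done here). The Sobel kernel below is a hand-picked stand-in for a learned edge filter, applied to a tiny synthetic image.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a grayscale image with a square kernel."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

# Hand-picked vertical-edge (Sobel) kernel; a CNN would learn such filters.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Tiny synthetic image: dark left half, bright right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

edges = conv2d(image, sobel_x)
# The filter's response is strongest along the dark/bright boundary
# and zero in the flat regions on either side.
```

Stacking many such learned filters, with nonlinearities in between, is what lets deeper layers respond to corners, contours, and eventually object parts.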

Figure 1: Interpretable representations[5] of neural networks for input images, shown for a car classification example

How can it be applied to the real world?

There are various tasks at which humans find it very easy to perform well, one of them being grounding images. After looking at a few cars, it is easy even for a child to determine that a truck, which also has four wheels, is very similar to those cars. Deep learning models can categorize these objects as a simple image classification task when we have a large amount of distinctly pre-labelled data to train on.
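A toy sketch of what classification with labelled data amounts to: each image is reduced to a feature vector (the 2-D vectors and labels below are purely hypothetical), and a new example is assigned the label of the nearest class centroid. Real systems learn far richer features with a CNN, but the decision step is analogous.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(x, centroids):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

# Hypothetical pre-labelled training features for two categories.
train = {
    "car":   [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8]],
    "truck": [[4.0, 4.2], [3.8, 4.0], [4.1, 3.9]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}

# A new four-wheeled vehicle whose features sit near the truck cluster.
print(classify([3.9, 4.1], centroids))  # -> "truck"
```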

On the other hand, when we do not have a large amount of data under specific categories, the problem becomes more interesting. The challenge is now geared more towards grounding certain aspects of the images to a specific category which may or may not be partially present in another image. This work can be broadly categorized into three tasks which are actively being explored in real-world scenarios, but not so much in the area of trademarks or graphics created by artists: 1) image captioning, 2) object detection, and 3) image retrieval.

Image captioning can be seen as a task in which the grounding elements of an image are described in the form of a sentence. In real-world scenes, the placement of grounded objects is also important, whereas in logos or graphics the relative positions may matter less.

Object detection, on the other hand, is the task of correctly identifying the objects contained within an image and localizing them with bounding boxes. The challenge again lies in research being done mainly on real-world scenarios, with large datasets of everyday objects and models trained on the same; logos and graphics, by contrast, are often abstract.
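A minimal sketch of how bounding-box predictions are typically scored: intersection-over-union (IoU) between a predicted box and a ground-truth box, with boxes given as (x1, y1, x2, y2) corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap -> 1/7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 1.0
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.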

Finally, image retrieval specific to trademark data makes use of colors, shapes, and other identifying aspects of the images in a database; given a new, unseen image, it enables us to find similar images in that database.
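A minimal sketch of such a retrieval step: each database image is represented by a feature vector (the names and values below are hypothetical stand-ins for colour/shape descriptors), and a query is answered by ranking the database by cosine similarity to the query's vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def retrieve(query, database, top_k=2):
    """Return the top_k database keys most similar to the query vector."""
    ranked = sorted(database, key=lambda name: cosine(query, database[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical feature vectors for logos in a trademark database.
database = {
    "red_circle_logo":  [0.9, 0.1, 0.8],
    "blue_square_logo": [0.1, 0.9, 0.2],
    "red_ring_logo":    [0.8, 0.2, 0.9],
}

# A query resembling the two red marks ranks them above the blue one.
print(retrieve([0.85, 0.15, 0.85], database))
```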

The aforementioned tasks can still benefit from current trends in artificial intelligence research, where models pretrained on large amounts of data learn basic shapes and contours through deep neural networks and are in turn fine-tuned on more specific datasets. One of the interesting areas involving these methods is being explored at the Conflict Analytics Lab: the task of determining Risk of Confusion.


[1] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998, doi: 10.1109/5.726791.

[2] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556, 2014.

[3] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.

[4] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, “Densely Connected Convolutional Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 2261-2269, doi: 10.1109/CVPR.2017.243.

[5] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. 2011. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Commun. ACM 54, 10 (October 2011), 95–103. DOI:https://doi.org/10.1145/2001269.2001295