vishalprabha/SCIIS

Scalable contextual image indexing and search

CSCI 5253 - Data Center Scale Computing - Final Project

Table of Contents
  1. About The Project
  2. Getting Started
  3. Workflow of the system
  4. Authors
  5. Acknowledgments

About The Project

Humans produce roughly 2.5 quintillion bytes of data every day, and much of it is images. At that scale, finding one particular image can be hard even in a well-organized collection. With that in mind, we built a service that lets users store and retrieve images based on contextual keywords. Users upload images, and the system identifies and extracts contextual keywords from them. The user can later search the collection by contextual keyword; matching images are returned, each annotated with a safe search tag.

The project goal can be broken down into the following major components:

  • Scalable context feature extraction - The app takes images as input, stores them, and extracts contextual keywords, auto-scaling with load throughout.

  • Safe search tagging - Using details extracted by the Google Cloud Vision API, each image is tagged for violent, racy, spoofed, and adult content.

  • Scalable search - The system searches the collection by contextual keyword and auto-scales as load increases.
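
As a hedged sketch, the safe-search tagging step could map the per-category likelihood ratings the Cloud Vision API returns (adult, violence, racy, spoof) to boolean tags. The function name and the LIKELY/VERY_LIKELY threshold below are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch: convert Cloud Vision SafeSearchAnnotation
# likelihood strings into boolean safe-search tags. The threshold
# (flag LIKELY and VERY_LIKELY) is an assumption.
UNSAFE_LEVELS = {"LIKELY", "VERY_LIKELY"}

def safe_search_tags(annotation):
    """annotation: category -> likelihood string, as reported by Vision."""
    return {category: level in UNSAFE_LEVELS
            for category, level in annotation.items()}

tags = safe_search_tags({
    "adult": "VERY_UNLIKELY",
    "violence": "LIKELY",
    "racy": "POSSIBLE",
    "spoof": "UNLIKELY",
})
# Only "violence" crosses the threshold here.
```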


Project Video link

(back to top)

Components


  • Kubernetes
  • RabbitMQ
  • Redis
  • CloudSQL
  • Flask REST server
  • Google Cloud Storage (bucket)
  • Google Cloud Vision API

(back to top)

Architecture



(back to top)

Getting Started

The project follows a microservice architecture; the application can be broadly divided into five components:

  • Rest-server
  • Worker
  • Redis
  • RabbitMQ
  • Log server

Prerequisites

The following software, accounts and tools are required to get the project up and running:

  • Google cloud account with active credits
  • gcloud command-line tool
  • Docker
  • Kubernetes enabled on Docker, or Google Kubernetes Engine
  • Python
  • Redis-CLI
  • Python libraries found in the requirements file of rest-server and worker

(back to top)

Installation

  1. gcloud container clusters create --preemptible mykube - Creates a cluster of preemptible nodes on GKE.
  2. sh deploy-all.sh - Launches all the services and deployments required on GKE.
  3. Setting up the MySQL Database -
    python worker/sqlsetup.py createDB
    python worker/sqlsetup.py create
    
  4. Setting up and deploying the ingress on GKE -
    kubectl create clusterrolebinding cluster-admin-binding \
      --clusterrole cluster-admin \
      --user $(gcloud config get-value account)
    
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.0.4/deploy/static/provider/cloud/deploy.yaml
    
  5. Enabling the load balancer -
    gcloud container clusters update mykube --update-addons=HttpLoadBalancing=ENABLED
    
  6. Enabling horizontal pod autoscaling on CPU usage -
    kubectl autoscale deployment sciis-rest --cpu-percent=50 --min=1 --max=10
    kubectl autoscale deployment final-worker --cpu-percent=50 --min=1 --max=10
    

(back to top)

Workflow of the system


  • The workflow starts when a user accesses the webpage at the base URL; the request is routed through the network load balancer to the Flask server.

  • Users can then upload an image in the web application and the contents are passed to the server using a POST endpoint.

  • The MD5 hash of the image is generated and used as the ID under which the image is stored in Google Cloud Storage.

  • The input data is then curated and sent to a worker node via RabbitMQ.

  • Worker nodes dequeue the message from RabbitMQ, parse the data, retrieve the image from the Cloud Storage bucket by its MD5 value, and run it through the Google Cloud Vision API.

  • The contextual details and safe search tags extracted from the image are formatted appropriately and stored in Redis and MySQL.

  • As these processes run, all relevant debug and event logs are written to a log server via RabbitMQ.

  • The end-user can search the stored images using contextual keywords. The query runs first against Redis and then against MySQL; if matching images are found, their public URLs and safe search tags are rendered, otherwise an appropriate message is displayed.

  • All the components run in individual pods on Google Kubernetes Engine, making them easy to deploy and maintain. Horizontal scaling is enabled on the REST server and worker nodes, which spin up new pods when CPU utilization exceeds a specified threshold.
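
The MD5-keyed storage step above can be sketched as follows; the bucket name and the Cloud Storage client calls in the comment are assumptions, not the project's actual code.

```python
import hashlib

def image_id(image_bytes):
    """Hex MD5 digest of the image contents, used as its storage ID."""
    return hashlib.md5(image_bytes).hexdigest()

# Uploading to a Cloud Storage bucket would then look roughly like this
# (bucket name and client setup are assumptions):
#
#   from google.cloud import storage
#   bucket = storage.Client().bucket("sciis-images")
#   bucket.blob(image_id(data)).upload_from_string(data)
```

Because the ID is derived from the image contents, re-uploading the same image maps to the same object, which also deduplicates storage.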
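
The message handed to a worker via RabbitMQ might be shaped roughly as below; the field names and the queue name are illustrative assumptions, and the pika call in the comment is only a sketch.

```python
import json

def make_task(md5_id, bucket_name):
    """Serialize the fields a worker needs to fetch and process an image.
    The field names are assumptions, not the project's actual schema."""
    return json.dumps({"id": md5_id, "bucket": bucket_name})

# Publishing with pika would then be roughly (queue name assumed):
#   channel.basic_publish(exchange="", routing_key="toWorker",
#                         body=make_task(md5_id, "sciis-images"))
```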
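
The Redis-then-MySQL search path described above can be sketched with stand-ins for the real clients; FakeRedis and the `db` dict below replace the redis-py client and the CloudSQL query, and only the lookup order (cache first, database on a miss, then warm the cache) is the point.

```python
class FakeRedis:
    """In-memory stand-in for the Redis client (illustration only)."""
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


def search_images(keyword, cache, query_sql):
    """Return image IDs for `keyword`, preferring the cache."""
    hit = cache.get(keyword)
    if hit is not None:
        return hit
    rows = query_sql(keyword)        # MySQL lookup on cache miss
    if rows:
        cache.set(keyword, rows)     # warm the cache for future queries
    return rows


cache = FakeRedis()
db = {"beach": ["9e107d9d972f40ca", "e4d909c290d0fb1c"]}  # keyword -> image IDs

def lookup(keyword):
    return db.get(keyword, [])

first = search_images("beach", cache, lookup)   # database path, warms cache
second = search_images("beach", cache, lookup)  # served from the cache
```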

(back to top)


Authors

Vishal Prabhachandar

Srinivas Akhil Mallela

(back to top)

Acknowledgments

Links to the resources used in the project.

(back to top)