Building an AI-powered security camera

In my spare time, I have been experimenting with facial recognition using Python. Facial recognition and computer vision are becoming much simpler thanks to high-level libraries such as Face Recognition.

My goal with this script was to make something that recognised when somebody was present, and then worked out whether it was me or not. If it is me, nothing happens; otherwise the script uses the computer’s text-to-speech utility to tell the intruder to ‘Go away!’.

Here’s an example:

The code, and how it works

The script is in the repo, and is also reproduced below:

import face_recognition
import cv2
import numpy as np
import os
import glob

# Get a reference to webcam #0 (the default one)
video_capture = cv2.VideoCapture(0)

# make array of sample pictures with encodings
known_face_encodings = []
known_face_names = []
dirname = os.path.dirname(__file__)
path = os.path.join(dirname, 'people/')

# Collect the paths of all the saved jpg files
list_of_files = sorted(glob.glob(os.path.join(path, '*.jpg')))

for file_path in list_of_files:
    # Load each sample picture and compute its face encoding
    image = face_recognition.load_image_file(file_path)
    encodings = face_recognition.face_encodings(image)
    if not encodings:
        # Skip pictures in which no face could be found
        continue
    known_face_encodings.append(encodings[0])

    # Use the file name, without its folder, as the known name
    known_face_names.append(os.path.basename(file_path))

# Initialize some variables
face_locations = []
face_encodings = []
face_names = []
owner = 'Ruairidh'
process_this_frame = True

while True:
    # Grab a single frame of video
    ret, frame = video_capture.read()
    if not ret:
        # Stop if the webcam did not return a frame
        break

    # Resize frame of video to 1/4 size for faster face recognition processing
    small_frame = cv2.resize(frame, (0, 0), fx=0.25, fy=0.25)

    # Convert the image from BGR color (which OpenCV uses) to RGB color (which face_recognition uses)
    # cvtColor returns a contiguous array, which a reversed slice does not
    rgb_small_frame = cv2.cvtColor(small_frame, cv2.COLOR_BGR2RGB)

    # Only process every other frame of video to save time
    if process_this_frame:
        # Find all the faces and face encodings in the current frame of video
        face_locations = face_recognition.face_locations(rgb_small_frame)
        face_encodings = face_recognition.face_encodings(
            rgb_small_frame, face_locations)

        face_names = []
        for face_encoding in face_encodings:
            # See if the face is a match for the known face(s)
            matches = face_recognition.compare_faces(
                known_face_encodings, face_encoding)
            name = 'Unknown'

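            # np.argmin fails on an empty array, so only compare when there are known faces
            if known_face_encodings:
                face_distances = face_recognition.face_distance(
                    known_face_encodings, face_encoding)
                best_match_index = np.argmin(face_distances)
                if matches[best_match_index]:
                    name = known_face_names[best_match_index]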

            face_names.append(name.rsplit('.', 1)[0].capitalize())

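        # Warn only when an unrecognised face is present and the owner is not
        if 'Unknown' in face_names and owner not in face_names:
            # 'say' is the macOS text-to-speech command
            os.system('say Go away!')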

    process_this_frame = not process_this_frame

    # Display the results
    for (top, right, bottom, left), name in zip(face_locations, face_names):
        # Scale back up face locations since the frame we detected in was scaled to 1/4 size
        top *= 4
        right *= 4
        bottom *= 4
        left *= 4

        # Draw a box around the face
        cv2.rectangle(frame, (left, top), (right, bottom), (0, 0, 255), 2)

        # Draw a label with a name below the face
        cv2.rectangle(frame, (left, bottom - 35),
                      (right, bottom), (0, 0, 255), cv2.FILLED)
        font = cv2.FONT_HERSHEY_DUPLEX
        cv2.putText(frame, name, (left + 6, bottom - 6),
                    font, 1.0, (255, 255, 255), 1)

    # Display the resulting image
    cv2.imshow('Video', frame)

    # Hit 'q' on the keyboard to quit!
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release handle to the webcam
video_capture.release()
cv2.destroyAllWindows()

As you can see, it’s a pretty short script! To achieve the desired goal, however, it needs to do four things:

Establish whether there are any faces present.
Create a ‘map’ of the face and transform it to centre it as much as possible.
Encode the face through an embedding (more on this later).
Compare the embedding to that of a previously recognised face.

If you save a photo of yourself in the ‘people’ folder, with your name as the file name, set owner to match, and run this, the script will draw a bounding box around your face, labelled with your name. Anyone it does not recognise will be told to go away!

How face detection works

Face detection is everywhere. It’s on your phone’s camera to help with focusing, it’s used by Snapchat for all those fun filters, and it’s the first step when Facebook suggests who is in a photo.

So how does it work? A typical first step is to convert the image to black and white, since we don’t need colour to find a face. Then we look at every single pixel of the image and compare it with the pixels surrounding it. We work out the direction in which the image gets darker and draw an arrow pointing that way. This arrow is known as a gradient.

This allows us to make meaningful comparisons of faces even when the lighting differs. A darker and a brighter photo of the same face have very different pixel values, but the direction of the gradients stays largely the same, so we end up with much the same representation of the face.
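As a rough illustration of that claim, here is a small NumPy sketch (not part of the script above). The synthetic image and the helper name gradient_directions are made up for this example, and it only models a uniform change in brightness; real lighting changes are messier.

```python
import numpy as np

def gradient_directions(gray):
    """Return the per-pixel gradient angle (in radians) of a 2-D grayscale array."""
    # np.gradient returns the rate of change along rows (y) and then columns (x)
    gy, gx = np.gradient(gray.astype(float))
    # This angle points towards brighter pixels; the 'darker' arrow is simply the opposite way
    return np.arctan2(gy, gx)

# A tiny synthetic image whose brightness rises from left to right and top to bottom
image = np.add.outer(np.arange(5.0), 2.0 * np.arange(5.0))
darker = 0.5 * image      # the same scene at half the brightness
brighter = image + 40.0   # the same scene with a constant brightness offset

same_when_darker = np.allclose(gradient_directions(image), gradient_directions(darker))
same_when_brighter = np.allclose(gradient_directions(image), gradient_directions(brighter))
print(same_when_darker, same_when_brighter)
```

The pixel values differ in all three images, but the gradient directions agree, which is why gradients are a sturdier thing to compare than raw brightness.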

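OK, but isn’t keeping a gradient for every single pixel a bit computationally heavy? It is, and it is also more detail than we need. Instead, we can split the image into small squares, count how many gradients point in each direction within a square, and keep only the direction that occurs most often.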

This is known as a Histogram of oriented gradients, or HOG for short.
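To make the idea concrete, here is a simplified sketch of the per-square histogram step in NumPy. It is an illustration only, not the detector the library uses: the function names hog_cells and dominant_directions, the 16 pixel cell size and the 9 direction bins are all assumptions chosen for this example, and a full HOG descriptor also normalises neighbouring cells, which is left out here.

```python
import numpy as np

def hog_cells(gray, cell_size=16, bins=9):
    """Return one gradient-direction histogram per cell_size x cell_size square."""
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned direction in degrees, folded into the range 0 to 180
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0

    rows = gray.shape[0] // cell_size
    cols = gray.shape[1] // cell_size
    histograms = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell_size, (r + 1) * cell_size),
                      slice(c * cell_size, (c + 1) * cell_size))
            # Stronger gradients count for more than faint ones
            histograms[r, c], _ = np.histogram(
                orientation[window], bins=bins, range=(0.0, 180.0),
                weights=magnitude[window])
    return histograms

def dominant_directions(histograms):
    """Return the centre (in degrees) of the most common direction bin in each cell."""
    bin_width = 180.0 / histograms.shape[-1]
    return (np.argmax(histograms, axis=-1) + 0.5) * bin_width

# A 32 x 32 image that gets brighter from left to right
ramp = np.tile(np.arange(32.0), (32, 1))
print(dominant_directions(hog_cells(ramp)))
```

Each 16 x 16 square is reduced to a single dominant direction, so a 32 x 32 image ends up as a 2 x 2 grid of arrows rather than 1,024 of them.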

How face mapping works

So we can now detect that there is a face present somewhere. But how do we handle the fact that different photos contain a multitude of poses? It’s not like a series of passport photos, where you get the same predictable pose and expression.

Well, we can adjust the positioning of the face itself so that it’s roughly in the same place. It’s kind of creating that passport standard, but with any photo.

To achieve this, a technique called face landmark estimation is used. It locates a series of data points, or ‘landmarks’, that exist on nearly every face: things like the corners of the eyes, the end of your chin, the edges of your brow, etc. Our machine learning algorithm then looks for those data points so that faces can be lined up and compared.

Now that we know where the features are, we edit the image to place them as close to the centre as possible. It’s important to note that no 3D transforms are applied to the image, as that would warp it and ruin our comparisons. Instead we use things like rotation and scaling.
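As a minimal sketch of what ‘rotation and scaling’ means here, the following NumPy example builds a transform from just two landmarks, the eyes. The function names, the eye coordinates and the target positions are hypothetical values chosen for illustration; a real pipeline would work from many more landmarks.

```python
import numpy as np

def similarity_from_eyes(left_eye, right_eye, target_left, target_right):
    """Return a 2x3 matrix that rotates, scales and shifts the eyes onto the targets."""
    src = np.subtract(right_eye, left_eye).astype(float)
    dst = np.subtract(target_right, target_left).astype(float)
    # Uniform scale and rotation angle that turn the source eye line into the target one
    scale = np.hypot(*dst) / np.hypot(*src)
    angle = np.arctan2(dst[1], dst[0]) - np.arctan2(src[1], src[0])
    cos_a, sin_a = scale * np.cos(angle), scale * np.sin(angle)
    rotation = np.array([[cos_a, -sin_a], [sin_a, cos_a]])
    # Shift so that the left eye lands exactly on its target
    translation = (np.asarray(target_left, dtype=float)
                   - rotation @ np.asarray(left_eye, dtype=float))
    return np.hstack([rotation, translation[:, None]])

def apply_transform(matrix, point):
    """Apply a 2x3 transform to a single (x, y) point."""
    return matrix[:, :2] @ np.asarray(point, dtype=float) + matrix[:, 2]

# Hypothetical landmarks from a tilted face, and where we would like the eyes to sit
left_eye, right_eye = (30, 40), (70, 60)
target_left, target_right = (30, 50), (70, 50)

matrix = similarity_from_eyes(left_eye, right_eye, target_left, target_right)
print(apply_transform(matrix, left_eye), apply_transform(matrix, right_eye))
```

Because the transform only rotates, scales uniformly and shifts, the proportions of the face are preserved. The 2x3 matrix has the layout that cv2.warpAffine expects, so the same matrix could be used to move the pixels as well as the points.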

How encoding a face works

Now that we have our face positioned nicely, we want to extract a few basic measurements so that we can find the known face with the closest measurements, which is the one most likely to be our match. People do this intuitively since we have evolved to do so.

Computers are less able to do this, so we rely on deep learning and let the computer come up with its own measurements for recognising a face. That set of measurements is the embedding. This approach was used by Google researchers in their FaceNet work. Training such a network takes a lot of data and computing time, so for those who don’t want to do it themselves, the OpenFace project publishes pre-trained models.

How facial recognition works

Now that we have our detected and embedded photos, we want to check if there’s a match! This is one of the easiest parts as it’s a simple classification problem. We provide the embedding of our new face and see if we can categorise it with one of our previously known faces. The closest match is then returned, provided it is close enough to count as a match at all.
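The matching step can be sketched in a few lines of NumPy. This mirrors what the script does with face_distance and compare_faces, but best_match is a hypothetical helper written for this post, the names are made up, and the three-number vectors stand in for real embeddings, which are much longer. The tolerance of 0.6 is the default that compare_faces uses.

```python
import numpy as np

def best_match(known_encodings, known_names, candidate, tolerance=0.6):
    """Return the name of the closest known face, or 'Unknown' if none is close enough."""
    if len(known_encodings) == 0:
        return 'Unknown'
    # Euclidean distance between the candidate and every known embedding
    distances = np.linalg.norm(np.asarray(known_encodings) - candidate, axis=1)
    best = int(np.argmin(distances))
    return known_names[best] if distances[best] <= tolerance else 'Unknown'

# Toy embeddings for two known people
known_encodings = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
known_names = ['Ruairidh', 'Alex']

print(best_match(known_encodings, known_names, np.array([0.1, 0.0, 0.0])))
print(best_match(known_encodings, known_names, np.array([5.0, 5.0, 5.0])))
```

Lowering the tolerance makes the check stricter, which means fewer strangers mistaken for the owner but more frames in which the owner is not recognised.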

In practice

If you’re really into this, then you could implement it yourself; otherwise you could use the excellent Face Recognition Python library. The creator of the library has written an excellent article on the process, which inspired this post. Writing it up helped me solidify my own learning.