Image Preparation for Convolutional Neural Networks with TensorFlow's Keras API

In this episode, we'll go through all the necessary image preparation and processing steps to get set up to train our first convolutional neural network (CNN).

Our goal over the next few episodes will be to build and train a CNN that can accurately identify images of cats and dogs.

Obtain the data

The first thing we need to do is get and prepare the data on which we'll train the CNN.

The data set we'll use is a random subset of the data set from the Kaggle Dogs Versus Cats competition. You can download the data here.

Organize the data

We now need to organize the directory structure on disk to hold the data set. We'll manually do some parts of the organization, and programmatically do the rest.

Copy the downloaded dogs-vs-cats.zip file, and paste it into whichever directory you'd like to work from. Relative to the Jupyter notebook we'll be working in, I'm placing dogs-vs-cats.zip within a directory called data.

Now, extract dogs-vs-cats.zip, and then extract train.zip from within. We won't be working with the provided test set for the moment, so you can move the test1.zip elsewhere if you'd like.

Next, select all of the image files from within the train directory and move them into the base dogs-vs-cats directory. Lastly, delete the empty train directory.

That's it for the manual labor! At this point, we have 25,000 labeled images of cats and dogs in our dogs-vs-cats directory. Note, the labels are included in the file names. Now, we'll do the remainder of the data organization programmatically.
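Since the labels live in the file names, here's a minimal sketch of how a label can be read back from a name. It assumes the Kaggle naming pattern of label, index, and extension separated by dots (for example cat.0.jpg), and label_from_filename is a hypothetical helper for illustration, not something the rest of this episode relies on.

```python
import os

def label_from_filename(path):
    """Return the label prefix ('cat' or 'dog') from a name like 'cat.0.jpg'."""
    name = os.path.basename(path)   # drop any leading directories
    return name.split('.')[0]       # text before the first dot is the label

print(label_from_filename('dogs-vs-cats/cat.0.jpg'))   # cat
print(label_from_filename('dog.12499.jpg'))            # dog
```

This is also why the glob patterns cat* and dog* in the script further down are enough to separate the two classes.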

First, let's import all the libraries we'll be making use of over the next several CNN episodes.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten, BatchNormalization, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix
import itertools
import os
import shutil
import random
import glob
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline

Using the script below, we'll organize our data into train, validation, and test sets. We'll do this by moving subsets of the data into sub-directories for each separate data set.

# Organize data into train, valid, and test directories
os.chdir('data/dogs-vs-cats')
if not os.path.isdir('train/dog'):
    # Create one sub-directory per class inside each data set directory
    os.makedirs('train/dog')
    os.makedirs('train/cat')
    os.makedirs('valid/dog')
    os.makedirs('valid/cat')
    os.makedirs('test/dog')
    os.makedirs('test/cat')

    # Move a random sample of images into each directory. Each glob only sees
    # files still in the base directory, so the sets never overlap.
    for i in random.sample(glob.glob('cat*'), 500):
        shutil.move(i, 'train/cat')
    for i in random.sample(glob.glob('dog*'), 500):
        shutil.move(i, 'train/dog')
    for i in random.sample(glob.glob('cat*'), 100):
        shutil.move(i, 'valid/cat')
    for i in random.sample(glob.glob('dog*'), 100):
        shutil.move(i, 'valid/dog')
    for i in random.sample(glob.glob('cat*'), 50):
        shutil.move(i, 'test/cat')
    for i in random.sample(glob.glob('dog*'), 50):
        shutil.move(i, 'test/dog')

# Return to the directory the notebook lives in
os.chdir('../../')

We first check to see if the directory structure is already in place. If it's not, we proceed with the script. The script first makes train, valid, and test directories, which each contain sub-directories called dog and cat.

The full data set contains 25,000 images, half of which are cats and half of which are dogs. We don't need nearly this much data for the tasks we'll be carrying out, so we're going to work with a subset. The remainder of the script moves 1,000 samples into the training set, 200 samples into the validation set, and 100 samples into the test set. Each set has an equal number of cats and dogs.

The remainder of the unused data will remain in the base dogs-vs-cats directory. Feel free to remove this data if you'd like.
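If you'd like to confirm the result, a quick sanity check is to count the files in each sub-directory. The sketch below is one way to do that; count_images is a hypothetical helper, and it assumes the directory layout described above. After the organization script has run, we'd expect 500 per class in train, 100 per class in valid, and 50 per class in test.

```python
import os

def count_images(base_dir, splits=('train', 'valid', 'test'), classes=('cat', 'dog')):
    """Return {(split, class): number of files}, using 0 for missing directories."""
    counts = {}
    for split in splits:
        for cls in classes:
            folder = os.path.join(base_dir, split, cls)
            if os.path.isdir(folder):
                counts[(split, cls)] = len(os.listdir(folder))
            else:
                counts[(split, cls)] = 0
    return counts

# Only report if the data directory from this episode is present
if os.path.isdir('data/dogs-vs-cats'):
    for key, n in count_images('data/dogs-vs-cats').items():
        print(key, n)
```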

Scenario of not having test labels

Note, for this data set, we already have labels for the test set. Many times, we may not have corresponding labels for the test data. In fact, if we used the downloaded test set from Kaggle, rather than making our own, then we would see that the corresponding labels are not included.

In the scenario when we don't have labels for the test set, the test directory structure should instead look like this:

test\unknown\

All unlabeled test files will go into the unknown sub-directory. Note that you can name the directory something other than unknown if you prefer.
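As a rough sketch, the unlabeled layout could be created programmatically like this. It assumes the unlabeled images are .jpg files sitting directly inside the test directory; move_to_unknown and its parameters are hypothetical names used only for illustration.

```python
import glob
import os
import shutil

def move_to_unknown(test_dir, pattern='*.jpg', subdir='unknown'):
    """Move files matching `pattern` directly under `test_dir` into `test_dir/subdir`."""
    target = os.path.join(test_dir, subdir)
    os.makedirs(target, exist_ok=True)   # safe to call again if it already exists
    moved = 0
    for path in glob.glob(os.path.join(test_dir, pattern)):
        shutil.move(path, target)
        moved += 1
    return moved

# Example call (assumed path):
# move_to_unknown('data/dogs-vs-cats/test')
```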

There is one additional step that needs to be done differently if you do not have access to the test labels, which is mentioned at the end of the next section below.

This is it for all the data organization on disk. Now let's check out the data processing that needs to be done before we can pass this data to the network.

Process the data

We've already imported all the TensorFlow and Keras modules above. If you're using a GPU (not required), you can check that TensorFlow is able to identify it using the code below. It's also useful to enable memory growth on the GPU, for reasons we covered in a previous episode.

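physical_devices = tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs Available: ", len(physical_devices))
# Only enable memory growth if a GPU was found; indexing an empty list would raise an IndexError
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)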

We then assign the paths to the train, valid, and test data directories to variables.

train_path = 'data/dogs-vs-cats/train'
valid_path = 'data/dogs-vs-cats/valid'
test_path = 'data/dogs-vs-cats/test'

We can specify either an absolute path or a path relative to where our Jupyter notebook resides. I'm specifying relative paths.
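Because relative paths depend on the current working directory, it can help to check that they resolve before building the generators. Here's a small sketch of such a check; missing_dirs is a hypothetical helper, and the paths listed are the same ones assigned above.

```python
import os

def missing_dirs(paths):
    """Return the absolute form of every path that is not an existing directory."""
    return [os.path.abspath(p) for p in paths if not os.path.isdir(p)]

paths = ['data/dogs-vs-cats/train', 'data/dogs-vs-cats/valid', 'data/dogs-vs-cats/test']
print('Working directory:', os.getcwd())
print('Missing:', missing_dirs(paths))
```

An empty list means all three directories were found relative to the working directory.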

Now, we use Keras' ImageDataGenerator class to create batches of data from the train, valid, and test directories.

train_batches = ImageDataGenerator(preprocessing_function=tf.keras.applications.vgg16.preprocess_input) \
    .flow_from_directory(directory=train_path, target_size=(224,224), classes=['cat', 'dog'], batch_size=10)
valid_batches = ImageDataGenerator(preprocessing_function=tf.keras.applications.vgg16.preprocess_input) \
    .flow_from_directory(directory=valid_path, target_size=(224,224), classes=['cat', 'dog'], batch_size=10)
test_batches = ImageDataGenerator(preprocessing_function=tf.keras.applications.vgg16.preprocess_input) \
    .flow_from_directory(directory=test_path, target_size=(224,224), classes=['cat', 'dog'], batch_size=10, shuffle=False)

ImageDataGenerator.flow_from_directory() creates a DirectoryIterator, which generates batches of normalized tensor image data from the respective data directories.

Notice that, for each of the data sets, we pass preprocessing_function=tf.keras.applications.vgg16.preprocess_input to ImageDataGenerator. For now, just understand that this applies an additional processing step to the images. We'll cover what exactly this processing is when we work with the pre-trained VGG16 CNN in a future episode.

To flow_from_directory(), we first specify the path for the data. We then specify the target_size of the images, which will resize all images to the specified size. The size we specify here is determined by the input size that the neural network expects.

The classes parameter expects a list that contains the underlying class names, and lastly, we specify the batch_size.

We also specify shuffle=False only for test_batches. That's because, later, when we plot the evaluation results from the model to a confusion matrix, we'll need to be able to access the unshuffled labels for the test set. By default, the data sets are shuffled.
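To see why the order matters, here's a minimal sketch using made-up labels and predictions (hypothetical values, not output from a model). A confusion matrix pairs the i-th true label with the i-th prediction, so the two sequences have to be in the same order for the counts to mean anything. The confusion_counts helper is a plain Python stand-in written for illustration.

```python
def confusion_counts(true_labels, predicted_labels, num_classes=2):
    """Build a table where rows are true classes and columns are predicted classes."""
    table = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        table[t][p] += 1
    return table

# Hypothetical example: 0 = cat, 1 = dog, both lists in the same sample order
true_labels      = [0, 0, 0, 1, 1, 1]
predicted_labels = [0, 0, 1, 1, 1, 0]
print(confusion_counts(true_labels, predicted_labels))   # [[2, 1], [1, 2]]
```

If the true labels were shuffled independently of the predictions, the same code would still run, but the table would no longer describe the model.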

Note that if you do not know the labels for the test data, you will need to modify the test_batches variable. Specifically, set the parameters classes=None and class_mode=None in flow_from_directory().
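Here's a sketch of what that modified call could look like, assuming the unlabeled images sit in a single sub-directory of the test directory as described earlier. The names unlabeled_test_kwargs and make_unlabeled_test_batches are hypothetical; TensorFlow is imported inside the function only so the settings can be inspected without it.

```python
# Keyword arguments for flow_from_directory() when the test set has no labels
unlabeled_test_kwargs = dict(
    directory='data/dogs-vs-cats/test',
    target_size=(224, 224),
    classes=None,      # class names are inferred from the sub-directory names
    class_mode=None,   # the iterator yields image batches only, with no label arrays
    batch_size=10,
    shuffle=False,
)

def make_unlabeled_test_batches(kwargs=unlabeled_test_kwargs):
    """Build the test iterator with the same VGG16 preprocessing used for the other sets."""
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    generator = ImageDataGenerator(
        preprocessing_function=tf.keras.applications.vgg16.preprocess_input)
    return generator.flow_from_directory(**kwargs)
```

With class_mode=None, calling next() on the resulting iterator should give back a batch of images only, rather than an (images, labels) pair.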

Visualize the data

We now call next(train_batches) to generate a batch of images and labels from the training set. Note that the size of this batch is determined by the batch_size we set when we created train_batches.

imgs, labels = next(train_batches)
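As a rough guide to what comes back, the sketch below works out the shapes we'd expect for imgs and labels, along with how many batches make up one pass over a data set. It assumes three color channels stored last and one-hot labels over two classes, which is what the settings used above should produce; batch_summary is a hypothetical helper.

```python
import math

def batch_summary(num_samples, batch_size, target_size=(224, 224), channels=3, num_classes=2):
    """Return the expected image-batch shape, label-batch shape, and batches per pass."""
    image_shape = (batch_size, target_size[0], target_size[1], channels)
    label_shape = (batch_size, num_classes)
    batches_per_pass = math.ceil(num_samples / batch_size)
    return image_shape, label_shape, batches_per_pass

# Training set from this episode: 1,000 images in batches of 10
print(batch_summary(1000, 10))   # ((10, 224, 224, 3), (10, 2), 100)
```

You can compare these against imgs.shape and labels.shape in your own notebook.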

We then use this plotting function obtained from TensorFlow's documentation to plot the processed images within our Jupyter notebook.

def plotImages(images_arr):
    fig, axes = plt.subplots(1, 10, figsize=(20,20))
    axes = axes.flatten()
    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
        ax.axis('off')
    plt.tight_layout()
    plt.show()

plotImages(imgs)
print(labels)

[[0. 1.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]

This is what the first processed random batch from the training set looks like. Notice that the color appears to be distorted. This has to do with the VGG16 processing we applied to the data sets, which we'll talk about in an upcoming episode. Don't worry about it for now. Just know that the RGB pixel data has been processed in such a way that the image data looks like this before being passed to the network.

Note that dogs are represented with the one-hot encoding of [0,1], and cats are represented by [1,0].
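The position of the 1 in each row corresponds to the position of the class name in the classes list we passed to flow_from_directory(). Here's a small sketch that maps a few one-hot rows back to class names; the three rows are typed in by hand for illustration rather than taken from a generator.

```python
import numpy as np

class_names = ['cat', 'dog']   # same order as the classes list passed to flow_from_directory()

labels = np.array([[0., 1.], [0., 1.], [1., 0.]])   # a few one-hot rows, for illustration
indices = np.argmax(labels, axis=1)                  # position of the 1 in each row
names = [class_names[i] for i in indices]
print(indices.tolist(), names)   # [1, 1, 0] ['dog', 'dog', 'cat']
```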

We're now all set up to work with this data! In the upcoming episodes, we'll use this data to train a convolutional neural network.
