Neural Algorithm of Artistic Style Transfer: Understanding with PyTorch examples

Convolutional Neural Networks have brought several breakthroughs for supervised tasks in Computer Vision and other visual problems in Artificial Intelligence. Moreover, semi-supervised and unsupervised learning have reached remarkable milestones in attempts to understand how these models work. These insights have led to impressive studies which utilize pre-trained models to extract deep-learning-based features or to solve various tasks.

This post introduces the Neural Algorithm of Artistic Style as a typical example of such success. First, solving the object recognition problem with Convolutional Neural Networks is presented, followed by explanations of how these models work through visualizations of pre-trained networks. The final part applies them to reconstruct the style and content of images. Throughout the sections, experiments with PyTorch code are included.

1. Object Recognition with Convolutional Neural Network

1.1 Object Recognition

Object recognition is a typical task in Artificial Intelligence. It supports a wide range of applications but faces many challenges, and a huge number of studies have contributed to this topic. To officially validate the performance of methods, the research community runs an image classification competition on the ImageNet dataset. It provides a large collection of pictures and their labels, covering various kinds of objects: animals, plants, scenes, instruments, etc. Since 2012, the winning models, despite their different architectures and strategies, have all been Convolutional Neural Networks.

Open-source deep learning frameworks such as Keras and PyTorch provide access not only to this dataset but also to pre-trained CNN models.

1.2 Convolutional Neural Network

Convolutional Neural Network is a popular deep learning model commonly applied to visual tasks. It is inspired by biological processes in the visual cortex of cats. Two mechanisms, observing globally and focusing locally, correspond to the pooling and convolution operators. The name of the model indicates that it performs this mathematical operation.

A Convolutional Neural Network consists of several blocks. Each block contains three components: convolutions, a non-linear activation function, and pooling. The convolution step applies convolution operators between small areas of the input and k filters in parallel, producing k feature maps simultaneously. Given the same input, the values of these feature maps depend on the values of the filters, which are learned by training with labels on a particular task. Next, the most common activation function in CNNs is ReLU. Finally, pooling summarizes local information over adjacent positions and generates an output of smaller size. For example, taking the maximum over 2×2 regions reduces the image size by a factor of 2 in both height and width.
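
To make the shape changes concrete, here is a minimal sketch of one such block in PyTorch (the filter count and input size are arbitrary, not taken from any particular pre-trained model):

import torch
from torch import nn

# one convolution block: 64 filters of size 3x3, ReLU, then 2x2 max pooling
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # 64 feature maps, same spatial size as the input
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),       # halves the height and width
)
x = torch.rand(1, 3, 224, 224)  # a dummy RGB image as a batch of size 1
print(block(x).shape)           # torch.Size([1, 64, 112, 112])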

After the convolution blocks come fully connected layers. The last layer is a softmax function, which normalizes the output values so that they approximate probabilities of the object labels.
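
As a quick illustration of this final step, the sketch below runs a pre-trained VGG19 from torchvision (introduced in more detail in section 1.3) on a dummy input and turns the raw scores into probabilities with softmax; the random input is only meant to show the shapes:

import torch
import torchvision.models as models

vgg19 = models.vgg19(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)            # a dummy input batch
with torch.no_grad():
    logits = vgg19(x)                     # shape (1, 1000): one score per ImageNet class
    probs = torch.softmax(logits, dim=1)  # normalize the scores into probabilities
    top5 = probs.topk(5)                  # the five most likely classes and their probabilities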

Figure 1.1: Architecture of VGG19. It receives an input of size (224, 224, 3), corresponding to an image with width=224, height=224 and depth=3 for the RGB color channels. The first layer includes 64 filters, each of size 3×3. After that, max pooling summarizes each 2×2 region so that the output becomes 112×112. Source: https://www.researchgate.net/publication/325137356_Breast_cancer_screening_using_convolutional_neural_network_and_follow-up_digital_mammography

For more detail on Convolutional Neural Networks, please read this article.

1.3 Visualization of CNN models in object recognition task

Deep learning models have a reputation as black boxes: we know the inputs, the labels, and the weights learned as outcomes, but we have little idea how to explain the results. Fortunately, previous studies have discovered ways of visualizing CNN models, which help us understand and leverage them better. Some simply show the values of filters as images, with short discussions of, for example, relationships between filters or their values. Others visualize the results of the convolution operators between the input image and the filters.

The figures below illustrate some filters of AlexNet and VGG19 and their responses when fed the input image (figure 1.2).

Figure 1.2: This is a photo of a famous dog, Maru Taro. I use it as the input for the models.

Structure of VGG19:

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (17): ReLU(inplace=True)
  (18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU(inplace=True)
  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (22): ReLU(inplace=True)
  (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (24): ReLU(inplace=True)
  (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (26): ReLU(inplace=True)
  (27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (29): ReLU(inplace=True)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): ReLU(inplace=True)
  (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (33): ReLU(inplace=True)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): ReLU(inplace=True)
  (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)

Visualizations for VGG19 are shown in figures 1.3 and 1.4. Its first convolution layer has 64 filters; the convolution layer at index (7) has 128 filters.

Figure 1.3A VGG19: 64 filters of the first convolution layer
Figure 1.3B VGG19: After applying the first convolution layer
Figure 1.4A VGG19: After applying the convolution layer at index (7) (the first 64 filters)
Figure 1.4B VGG19: After applying the convolution layer at index (7) (the last 64 filters)

Visualization for AlexNet is in figure 1.5. Its first layer has 64 filters.

Figure 1.5A AlexNet: 64 filters of the first convolution layer
Figure 1.5B AlexNet: After applying the first convolution layer

In PyTorch, it is very easy to access the values of the weights as well as the layer outputs. The script below shows how to load an image and the models, and how to extract these values in order to visualize them.

import torch
import torchvision
import torchvision.models as models
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

'''read image as numpy matrix with shape (72, 72, 3)'''
datapath='maru.jpg'
np_img = mpimg.imread(datapath)

'''convert the image from numpy to a tensor with the correct shape.
The tensor has shape (batch, channel, height, width)'''
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
tensor_img = transform(np_img).float().unsqueeze(0) # add a batch dimension -> shape (1, 3, 72, 72)

'''load pre-trained models'''
vgg19 = models.vgg19(pretrained=True) #load VGG19
alexnet = models.alexnet(pretrained=True) #load AlexNet

'''access layers and outputs of the pre-trained models'''
seq = vgg19.features # type: torch.nn.Sequential
weight_layer0 = seq[0].weight.data     # filters of the first convolution layer, shape (64, 3, 3, 3)
weight_layer10 = seq[10].weight.data   # filters of the convolution layer at index (10)
output_layer0 = seq[0](tensor_img)     # feature maps after the first convolution layer
output_layer10 = seq[0:11](tensor_img) # feature maps after the layer at index (10)
output_final = seq(tensor_img)         # feature maps after the whole convolutional part
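
The plotting itself is standard matplotlib. Below is a minimal sketch of how the filter and feature-map grids in figures 1.3 and 1.4 could be produced from the tensors above (the grid layout, normalization, and color map are my own choices):

def show_grid(maps, n_cols=8, title=None):
    '''plot a stack of 2D maps (filters or feature maps) as a grid of grayscale images'''
    n = maps.shape[0]
    n_rows = (n + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for i, ax in enumerate(axes.flat):
        ax.axis('off')
        if i < n:
            m = maps[i]
            m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # rescale to [0, 1] for display
            ax.imshow(m, cmap='gray')
    if title:
        fig.suptitle(title)
    plt.show()

# 64 filters of the first convolution layer, shown for one input channel
show_grid(weight_layer0[:, 0, :, :].numpy(), title='VGG19: filters of the first layer')
# 64 feature maps after the first convolution layer
show_grid(output_layer0[0].detach().numpy(), title='VGG19: outputs of the first layer')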

2. Neural Algorithm of Artistic Style

The algorithm leverages models pre-trained on the image classification task. It does not use the classification step itself, only the weights of the inner layers that transform inputs into class probabilities. From these it extracts information representing the objects in images. Furthermore, compared to previous studies, it extracts abstract style rather than hand-crafted visual features such as contrast, color, edges, etc. That means the authors do not need to deal with the challenge of conveying and fusing such meaningful information to computers; instead, the model works directly with learned features. The paper then shows how to reconstruct the content and style of images individually and even re-combine them. To demonstrate this ability, the authors conducted experiments that redraw images in the artistic style of famous paintings, called style transfer.

Figure 2.1 An example of style transfer: the left is the raw content image, the middle is the artistic style image, and the right is the generated result.

2.1 Extract features of Content and Style

  • Content Extractor

After training on the classification task, CNN models can recognize objects in images. Since they successfully deal with a wide range of variation in object appearance, pre-trained models are able to distinguish which information is related to objects and which is not. Some previous works also present methods that visualize which pixels or features have a significant impact on the recognition result. It is therefore reasonable to assume that content information exists in, and can be captured as, features from pre-trained models. According to the authors, it is the set of responses of the convolution layers. Through experiments, they selected the second convolution layer of the fourth block (conv4_2) to represent the content of images.

For the implementation, I define a content extractor inheriting from the nn.Module class of PyTorch. Check this link to see how to customize nn.Module. Following the description above, here is the Content_Extractor class:

class Content_Extractor(torch.nn.Module):
    def __init__(self, model):
        super(Content_Extractor, self).__init__()
        self.model = model
        self.layers = [21] # index of conv4_2 in vgg19.features
    
    def forward(self, x):
        return [self.model.features[0:i+1](x) for i in self.layers]   

content_extractor = Content_Extractor(vgg19) # a Content_Extractor instance

In the forward function, computing the values of the low-level layers over and over makes the time cost high. I avoid this by computing each selected layer from the previous result, as in the script below:

def forward(self, x):
    # return [self.model.features[0:i+1](x) for i in self.layers]
    if len(self.layers) == 0:
        return []
    prev_layer = self.layers[0]
    # compute the first selected layer from the input image ...
    content = [self.model.features[0:self.layers[0]+1](x)]
    for i in self.layers[1:]:
        # ... and each following selected layer from the previous result
        content.append(self.model.features[prev_layer+1:i+1](content[-1]))
        prev_layer = i
    return content
  • Style Extractor

The authors suppose that style is consistent across the different filter responses within a layer. That is to say, how an image is drawn is global information, like the content. Passing through different convolution filters, the image is transformed but keeps its style consistent across the different responses of a layer. Style is therefore captured by the correlations between these responses. The correlations are given by the Gram matrix G, where G_{ij}^l is the inner product between the vectorised feature maps i and j in layer l:

G_{ij}^l = \sum_k{F_{ik}^l F_{jk}^l}

Through experiments, the authors selected the first convolution layer of each block (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1) to represent the style of images.

def gram_matrix(feature_map):
    batch, n_filters, h, w = feature_map.shape # feature_map has shape (batch, channels, height, width)
    features = feature_map.view(batch * n_filters, h * w) # flatten each feature map into a vector
    G = torch.mm(features, features.t()) # inner products between all pairs of feature maps
    return G.div(batch * n_filters * h * w) # normalize by the number of elements

class Style_Extractor(torch.nn.Module):
    def __init__(self, model):
        super(Style_Extractor, self).__init__()
        self.model = model
        self.layers= [0, 5, 10, 19, 28] #conv1_1, conv2_1, conv3_1, conv4_1, conv5_1

    def forward(self, x):
        # return [gram_matrix(self.model.features[0:i+1](x)) for i in self.layers]
        if len(self.layers) == 0:
            return []
        prev_layer = self.layers[0]
        cnn_layers = [self.model.features[0:self.layers[0]+1](x)]
        for i in self.layers[1:]:
            cnn_layers.append(self.model.features[prev_layer+1:i+1](cnn_layers[-1]))
            prev_layer = i
        style = []
        for i in range(len(cnn_layers)):
            style.append(gram_matrix(cnn_layers[i]))
        return style    

2.2 Loss of reconstruction

  • Content and Style reconstruction

To reconstruct the content or style of an image, we start from a random image. Let the representations of the original image and the random image be t and x respectively. With the content extractor and the style extractor, we can capture their content and style. F_C(u, l) and F_S(u, l) are the content feature and the style feature of image u at layer l. To measure the quality of the reconstruction, we compute the loss function:

L(t, x, l) = \dfrac{1}{2} \sum \bigg(F(t, l) - F(x,l)\bigg)^2

This is essentially the mean squared error (MSE) between t and x at layer l, where F is F_C or F_S depending on whether the target information is content or style. The total loss is a weighted sum over the selected layers:

L(t, x) = \sum_l{w_l L(t,x,l)}

Reconstructing the style or content then becomes minimizing the total loss, which can be done with standard gradient descent algorithms.

Here I define the loss of reconstruction as a class in PyTorch:

class Reconstruction_Loss(torch.nn.Module):
    def __init__(self, extractor, target_img):
        super(Reconstruction_Loss, self).__init__()   
        self.extractor = extractor
        # the features of the target image are fixed, so detach them from the graph
        self.target = [l.detach() for l in self.extractor(target_img)]

    def forward(self, x):
        F = self.extractor(x)
        loss_list =[torch.nn.functional.mse_loss(F[l], self.target[l]) \
                        for l in range(len(self.extractor.layers))]
        total_loss=0
        for loss in loss_list:
            total_loss += loss
        return total_loss

Using this class with Content_Extractor or Style_Extractor, we obtain a model that reconstructs content or style respectively. To train this model, I define the reconstruct function below. Check this link if you are not familiar with training PyTorch models.

def reconstruct(model, target, epochs):
    # the image to optimize, initialized with random noise on the same device as the target
    img = torch.rand(target.shape, device=target.device).requires_grad_(True)
    optimizer = torch.optim.Adam([img], lr=0.1)
    for epoch in range(epochs):
        img.data.clamp_(0, 1) # keep pixel values in [0, 1]
        optimizer.zero_grad()
        loss = model(img)
        loss.backward()
        optimizer.step()
        if (epoch==0 or epoch % 100==99):
            print('Epoch %d: Loss=%.4f' % (epoch, loss.item()))
    img.data.clamp_(0, 1)
    return img

Before running this code, we need to:

  • prepare tensor variables for the input images and the results;
  • turn off gradient computation for the pre-trained model;
  • declare a content or style model by combining Reconstruction_Loss with Content_Extractor or Style_Extractor.

def convert_tensortonp(tensor_img):
    '''convert a tensor of shape (1, 3, h, w) back to a numpy image of shape (h, w, 3)'''
    return np.stack([tensor_img.data[0, 0, :, :].numpy(), \
                     tensor_img.data[0, 1, :, :].numpy(), \
                     tensor_img.data[0, 2, :, :].numpy()], axis = 2)

def convert_totensorimg(np_img, requires_grad=False):
    '''convert a numpy image of shape (h, w, 3) to a tensor of shape (1, 3, h, w)'''
    tensor_img = transform(np_img).float().unsqueeze(0)
    return tensor_img.requires_grad_(requires_grad)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

vgg19 = models.vgg19(pretrained=True).to(device)
for param in vgg19.parameters():
    param.requires_grad_(False)

target_tensor = convert_totensorimg(np_img, requires_grad=False).to(device) # np_img is the Maru photo loaded earlier
model = Reconstruction_Loss(Content_Extractor(vgg19), target_tensor)

Finally, call the reconstruction function and see what happens:
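
A minimal usage sketch (the number of epochs is my own choice; reconstructing style works the same way with a Style_Extractor instead of a Content_Extractor):

reconstructed = reconstruct(model, target_tensor, epochs=500)
plt.imshow(convert_tensortonp(reconstructed.detach().cpu()))
plt.axis('off')
plt.show()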

Figure 2.2: Examples of content reconstruction for Maru (left) and style reconstruction for Starry Night (right).
  • Style Transfer

Similar to the content and style reconstructions, but no longer individually, style transfer strives to minimize both of them simultaneously. The loss is a weighted sum of the content and style losses, where the ratio of the content weight to the style weight is often 1:10^5.

L_{trans} = w_c L_c + w_s L_s

For the implementation, I define a class Style_Transfer_Loss to compute this loss. Like Reconstruction_Loss, it relies on the content and style features, but running a Content_Extractor and a Style_Extractor separately would require visiting the layers of vgg19 once each. Instead of running vgg19 twice in total, once for content and once for style, I create a new model for Style_Transfer_Loss from the original with the function create_styletransfer_model.

Modify the original model as follows:

  • For brevity, create two small loss modules, Content_Loss and Style_Loss (playing the roles of Content_Extractor and Style_Extractor combined with Reconstruction_Loss); see the sketch after this list.
  • Add instances of Content_Loss and Style_Loss right after the appropriate convolution layers.
  • According to the article, changing the pooling from max to average improves the gradient flow.
  • The ReLU layers use inplace=True in the default setting, which overwrites the convolution outputs in place. Changing this to inplace=False makes the model create new tensors for the activations, so the outputs that the loss modules depend on stay intact.
  • Drop the redundant layers after the last selected one.
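
The Content_Loss and Style_Loss modules themselves are not listed in this post. A minimal sketch of what they could look like, following the pass-through pattern assumed by the forward function of Style_Transfer_Loss below (each module stores its loss and returns its input unchanged so it can sit inside the Sequential):

class Content_Loss(torch.nn.Module):
    def __init__(self, target):
        super(Content_Loss, self).__init__()
        self.target = target.detach()  # fixed content features of the content image

    def forward(self, x):
        # store the loss and pass the feature maps through unchanged
        self.loss = torch.nn.functional.mse_loss(x, self.target)
        return x

class Style_Loss(torch.nn.Module):
    def __init__(self, target_feature):
        super(Style_Loss, self).__init__()
        self.target = gram_matrix(target_feature).detach()  # fixed Gram matrix of the style image

    def forward(self, x):
        self.loss = torch.nn.functional.mse_loss(gram_matrix(x), self.target)
        return x
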
from torch import nn
def create_styletransfer_model(self, content_tensor, style_tensor, \
                                      content_layers, style_layers):
    layer_cnt = 0
    seq = nn.Sequential()
    for layer in vgg19.features:
        if isinstance(layer, nn.MaxPool2d):
            layer = nn.AvgPool2d(kernel_size=2, stride=2) # average pooling instead of max pooling
        elif isinstance(layer, nn.ReLU):
            layer = nn.ReLU(inplace=False) # keep the convolution outputs intact for the loss modules
        seq.add_module('{}'.format(layer_cnt), layer)
        if isinstance(layer, nn.Conv2d):
            if layer_cnt in content_layers:
                # fixed content features of the content image at this layer
                target = vgg19.features[:layer_cnt+1](content_tensor).detach()
                content_loss = Content_Loss(target)
                seq.add_module("ContentLoss_{}".format(layer_cnt), content_loss)
                self.content_loss_list.append(content_loss)
            if layer_cnt in style_layers:
                # fixed style features of the style image at this layer
                target = vgg19.features[:layer_cnt+1](style_tensor).detach()
                style_loss = Style_Loss(target)
                seq.add_module("StyleLoss_{}".format(layer_cnt), style_loss)
                self.style_loss_list.append(style_loss)
        layer_cnt += 1
        # stop once all selected content and style layers have been added
        if (len(content_layers)==0 or layer_cnt>content_layers[-1]) \
            and (len(style_layers)==0 or layer_cnt>style_layers[-1]):
            break
    return seq

With the configuration content_layers = [7] (conv2_2) and style_layers = [0, 5, 10] (conv1_1, conv2_1, conv3_1), the model inside Style_Transfer_Loss looks like this:

Style_Transfer_Loss(
  (model): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (StyleLoss_0): Style_Loss()
    (1): ReLU()
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU()
    (4): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (StyleLoss_5): Style_Loss()
    (6): ReLU()
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (ContentLoss_7): Content_Loss()
    (8): ReLU()
    (9): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (StyleLoss_10): Style_Loss()
  )
)

Here is the full Style_Transfer_Loss class:

CONFIG_CONTENT_LAYERS, CONFIG_STYLE_LAYER = [21], [0, 5, 10, 19, 28]
CONFIG_WEIGHT_STYLE = [0.2, 0.2, 0.2, 0.05, 0.35]

class Style_Transfer_Loss(nn.Module):
    def __init__(self, content_tensor, style_tensor):
        super(Style_Transfer_Loss, self).__init__()
        #hyper-parameters
        self.content_weight, self.style_weight = 1, 1e5
        self.content_loss_list, self.style_loss_list = [], []
        self.content_layers = CONFIG_CONTENT_LAYERS
        self.style_layers = CONFIG_STYLE_LAYER
        self.weight_style_list = CONFIG_WEIGHT_STYLE
        #init
        self.content_loss, self.style_loss = 1e8, 1e8
        self.create_model = create_styletransfer_model
        seq = self.create_model(self, content_tensor, style_tensor, \
                               self.content_layers, self.style_layers)
        self.model = seq.to(device)

    def forward(self, x):
        L_content, L_style = 0, 0
        self.model(x) 
        for l in self.content_loss_list:
            L_content += l.loss
        for i in range(len(self.style_loss_list)):
            l = self.style_loss_list[i]
            w = self.weight_style_list[i]
            L_style += l.loss * w
        self.content_loss = L_content * self.content_weight
        self.style_loss = L_style * self.style_weight
        self.loss = self.content_loss + self.style_loss
        return self.loss  
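
Training then mirrors the reconstruct function above. A minimal sketch, assuming content_tensor and style_tensor have already been prepared with convert_totensorimg and moved to the device (the optimizer, learning rate, and epoch count are my own choices):

st_loss = Style_Transfer_Loss(content_tensor, style_tensor)

# start from the content image; starting from random noise also works but converges more slowly
img = content_tensor.clone().requires_grad_(True)
optimizer = torch.optim.Adam([img], lr=0.05)

for epoch in range(1000):
    img.data.clamp_(0, 1)
    optimizer.zero_grad()
    loss = st_loss(img)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 99:
        print('Epoch %d: content=%.4f style=%.4f' % \
              (epoch, st_loss.content_loss.item(), st_loss.style_loss.item()))
img.data.clamp_(0, 1)
result = convert_tensortonp(img.detach().cpu())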

3. Experiments with Style Transfer

Figure 3.1 Raw images: the content image (top) and the style images (bottom). Rain Princess (bottom left) is characterized by colorful blurred pieces, while Starry Night (bottom right) is distinctive for the waves in the sky and inside the objects.

Let's run some experiments with the photo of the village as the content image, and two famous paintings, Rain Princess and Starry Night, as style images. The outputs are expected to keep the global object arrangement of the content while drawing those objects with the texture of the style, such as the colorful, glossy pieces of Rain Princess or the blue waves of Starry Night.

  • Size of content image

The size of the input is a simple but important configuration parameter. With a small enough value, we can both observe results and keep the running time short. With a high value, the effect of the style is more detailed and complex, but training may take much longer to complete.
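
For example, the content image can be resized before being converted to a tensor. A minimal sketch with torchvision (the file name and target size here are placeholders):

from PIL import Image
import torchvision.transforms as transforms

resize_transform = transforms.Compose([
    transforms.Resize((256, 256)),  # smaller sizes train faster; larger sizes give finer style detail
    transforms.ToTensor(),
])
content_tensor = resize_transform(Image.open('content.jpg')).unsqueeze(0).to(device)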

Figure 3.2 Experiments with different sizes of the content image: the left is 256 and the right is 512. The other configurations are the same.
  • Convolution Layer for Content and Style Configuration

Different layers contain different types of visual information about the style. According to the paper and my experience, low-level layers relate more to color and contrast; the middle layers can capture edges and the strokes of the artist; high-level layers are more sophisticated, mixing the previous information to form abstract features.

In some cases, the loss of the convolution layers in block 4 is very high. This imbalance can be removed by keeping their weights small.

Figure 3.3 Experiments with different convolution layers in the style extractor. Top left: [conv1_1], top right: [conv3_1], bottom left: [conv4_1], bottom right: [conv5_1]. All pictures are generated from the content image with size=(512, 512).

Combining these configurations and running with the full size of the content image, the generated images show more sophisticated patterns with smooth coloring and thick texture.

Figure 3.4 Experiments with different combinations of convolution layers in the style extractor. Left: [conv1_1, conv2_1], right: [conv2_1, conv3_1]. All pictures are generated from the content image at full size. Compared to the smaller size (figure 3.3), the pattern of glossy pieces is weaker in the texture of the generated image (right).
Figure 3.5 Experiments with different combinations of convolution layers in the style extractor. Left: [conv1_1, conv2_1], right: [conv2_1, conv3_1]. All pictures are generated from the content image at full size. Compared to the smaller size (figure 3.3), the pattern of blue waves is thicker in the texture of the generated image (right).

Other results for another content image (left): the middle uses the style of Rain Princess, and the right combines the style of Starry Night.
