For my Metis capstone project, I decided to make a tourist attraction recommender, but unlike normal travel and vacation planning websites, this recommender uses an image as input.
The motivation behind this project, besides an interest in image classification and neural networks, was to create a machine learning model that could help alleviate some of the stress associated with vacation planning. Whether you want to recreate a past experience from some old travel photos, or you've just come across an image on your social media feed that looks like a cool destination for your next vacation, this model can help provide some inspiration.
If you find yourself in need of some vacation planning help, use my Streamlit web app that I deployed using an Amazon EC2 instance and S3 storage. It will also show a visual representation of your location relative to the recommended attractions to help you decide between recommendations.
The first step in creating this model is not only gathering the dataset needed to train our neural network but also defining the scope of the project. I decided to only look at attractions located inside the continental United States. This keeps recommendations reachable by road trip and makes them more affordable and feasible for those looking to travel on a budget.
Images of the top 30 attractions in each state in the continental United States were scraped from Tripadvisor. Up to about 60 images were taken from each attraction's image gallery, for a total of about 75,000 images from about 1,500 different tourist attractions.
Images were scraped using Selenium and Beautiful Soup. Selenium was used for web browser automation: searching for a state, finding the top attractions page for that state, and collecting the links to the individual attraction pages. Then, using Beautiful Soup, all image URLs were scraped from the first page of each attraction's image gallery, about 30 to 60 images.
For the purpose of this blog, I will not go into much depth on how the image dataset was created, but will instead focus on the model itself. For more information on web scraping images from Tripadvisor and web browser automation using Selenium, refer to my GitHub repo.
It should also be noted that these images were uploaded to Tripadvisor by tourists as part of their attraction reviews. Some images are mislabeled (for example, images uploaded from someone's general travel photo gallery) and have no connection to the attraction.
At most 60 images were scraped from each attraction's image gallery, and some attractions only had 30 images. This is not enough data to train a neural network to classify an image as a specific attraction, even before accounting for the images that are mislabeled or unrelated to the attraction. So the decision was made to assign each attraction to an attraction class and then use a neural network to classify an image into one of those classes. After classification, attractions within the predicted attraction class are recommended to the user.
After creating a data frame of attractions, locations, and a list of image URLs from web scraping, attractions needed to be clustered and labeled into attraction classes. This was done using topic modeling on attraction names. Again, we will not focus on the code for this part, as it pertains to preparing the inputs for our model rather than the modeling itself. A lot of hand labeling was also done. Attraction labeling could be done in many different ways. One alternative would be to also scrape attraction reviews and perform topic modeling on the reviews instead of just the attraction names, which contain limited information (there are a lot of attractions with ‘Park’ in their name).
Before topic modeling, text was pre-processed using the natural language processing Python package NLTK (Natural Language Toolkit). Text was converted to lowercase, stripped of punctuation and stop words, and then lemmatized.
Next, topic modeling was done using the scikit-learn Python package for machine learning. Multiple topic models were explored: Latent Dirichlet Allocation (LDA) was used with a Count Vectorizer, and Non-Negative Matrix Factorization (NMF) was used with both a Count Vectorizer and a TF-IDF Vectorizer. NMF with a TF-IDF Vectorizer gave the clearest and most concise classes and was chosen for the final topic model. Topic modeling and additional hand labeling yielded eight attraction class labels: parks, beaches/oceans, gardens/zoos, art, sports, entertainment, museums, and landmarks.
After attraction labeling, a new data frame was created with each row representing a single image, with an image url, attraction name, location, and class label. Images are now ready to be loaded into an array and preprocessed.
For pre-processing, an image is loaded from the URL, resized to 150 x 150 x 3 (the last dimension holding the red, green, and blue color channels), and then transformed into a NumPy array. An image array is created with each element corresponding to a row in the original data frame.
Neural Network with Transfer Learning using VGG-16
Three initial neural networks were built: a Convolutional Neural Network (CNN), a CNN with regularization, and a neural network with transfer learning using the pre-trained VGG-16 computer vision model. Training and scoring all models on a small subset of the dataset showed that the neural network with transfer learning outperformed the CNN models. Looking at each model's confusion matrix and its accuracy and loss graphs, there appeared to be severe overfitting as well as class imbalance.
What is Transfer Learning and VGG-16?
Transfer learning is when a model trained for one problem is reused in some way on a different problem. For neural networks, transfer learning means reusing some layers of a trained model in a new model to solve the problem of interest. This can greatly decrease training time and improve the accuracy of the new model if the model used for transfer learning has been trained on a very large and diverse dataset.
VGG-16 (also called OxfordNet) is a convolutional neural network architecture named after the Visual Geometry Group from Oxford, who developed it. It is a computer vision model, used to identify features in an image.
Handling Class Imbalance with Class Removal and Random Under Sampling
Two classes were removed by merging them into existing classes: the sports class was merged into the entertainment class and the art class into the museums class. Random under-sampling was then done on the parks class, randomly removing about 10,000 images to bring it down to the size of the museums class (the second largest class).
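As a rough illustration of these two steps, here is a minimal NumPy sketch that merges classes and then randomly under-samples one class. The function name `undersample_class`, the `MERGE` mapping, and the toy class counts are hypothetical and are not taken from the project code.

```python
import numpy as np

# Hypothetical mapping for the two merged classes described above.
MERGE = {"sports": "entertainment", "art": "museums"}

def undersample_class(labels, target_class, target_size, seed=0):
    """Return sorted row indices to keep after randomly under-sampling one class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    class_idx = np.flatnonzero(labels == target_class)
    other_idx = np.flatnonzero(labels != target_class)
    if len(class_idx) > target_size:
        # Sample without replacement so no image is kept twice.
        class_idx = rng.choice(class_idx, size=target_size, replace=False)
    return np.sort(np.concatenate([class_idx, other_idx]))

# Toy labels with made-up counts standing in for the real dataset.
raw = ["parks"] * 10 + ["museums"] * 3 + ["art"] + ["sports"] * 2 + ["entertainment"]
merged = np.array([MERGE.get(label, label) for label in raw])

# Shrink "parks" to the size of the second largest class ("museums").
museum_count = int((merged == "museums").sum())
keep = undersample_class(merged, "parks", target_size=museum_count)
balanced = merged[keep]
```

The returned indices can be used to filter both the label array and the image array so the two stay aligned.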
Preprocessing and Test, Train, and Validation Set Creation
All image color values range from 0 to 255, with 255 being the highest intensity. To rescale the image dataset to the range 0 to 1, we divide all image arrays in the dataset by 255. Our attraction class labels are also one-hot encoded because neural networks need numeric inputs and cannot work with text labels directly.
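The rescaling and one-hot encoding can be sketched in plain NumPy as follows. The helper names `scale_pixels` and `one_hot` are hypothetical, and the original project may well have used library utilities for this instead.

```python
import numpy as np

def scale_pixels(images):
    """Rescale 8-bit pixel values (0 to 255) to floats between 0 and 1."""
    return np.asarray(images, dtype=np.float32) / 255.0

def one_hot(labels):
    """Return the sorted class names and a one-hot matrix with one row per label."""
    classes, idx = np.unique(labels, return_inverse=True)
    return classes, np.eye(len(classes), dtype=np.float32)[idx]

# A single 1 x 1 RGB "image" is enough to show the scaling.
images = np.array([[[[0, 128, 255]]]], dtype=np.uint8)
scaled = scale_pixels(images)

# Columns of the encoded matrix follow the sorted class names.
classes, encoded = one_hot(np.array(["parks", "museums", "parks"]))
```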
After fixing class imbalance, image augmentation was done on the remaining images in the training dataset in order to reduce overfitting. A horizontal flip, a random rotation by an angle between -30 and 30 degrees, and random noise were each applied to all images. Keeping the originals alongside the three augmented copies roughly quadrupled the training dataset, bringing the number of images from about 38,000 to about 150,000.
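Below is a minimal NumPy sketch of two of the three augmentations, the horizontal flip and the random noise. The random rotation is left out because it would need an image library, and the Gaussian noise model and its `sigma` value are assumptions rather than the settings used in the project.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an H x W x C image left to right."""
    return image[:, ::-1, :]

def add_noise(image, sigma=0.05, rng=None):
    """Add Gaussian noise to an image scaled to [0, 1] and clip back into range."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(42)
image = rng.random((150, 150, 3))  # stand-in for one preprocessed image

# Keep the original next to its augmented copies.
augmented = np.stack([image, horizontal_flip(image), add_noise(image, rng=rng)])
```

Augmentation is applied only to the training split so that validation and test scores still reflect unmodified images.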
Creating VGG-16 Feature Vectors For All Images
Using the VGG-16 model, a feature vector was created for every image. The last few layers are removed from the VGG-16 model so that its output is a larger feature vector. The remaining layers are then frozen so the model weights do not change during training.
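One way to do this with Keras is sketched below. Using `include_top=False` to drop the final fully connected layers is my assumption about how the layers were removed, and the helper names `build_feature_extractor` and `extract_features` are hypothetical.

```python
import numpy as np
from tensorflow.keras.applications import VGG16

def build_feature_extractor(weights="imagenet", input_shape=(150, 150, 3)):
    """Load VGG-16 without its fully connected top layers and freeze it."""
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # weights stay fixed during later training
    return base

def extract_features(extractor, images):
    """Run images through the frozen network and flatten each output to one vector."""
    feature_maps = extractor.predict(images, verbose=0)
    return feature_maps.reshape(len(images), -1)
```

Calling `build_feature_extractor()` with the default argument downloads the ImageNet weights on first use. With 150 x 150 inputs, the final convolutional block outputs a 4 x 4 x 512 feature map, so each flattened vector has 8,192 values.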
Neural Network with Transfer Learning
A neural network is then trained using the output of our VGG-16 model. L2 regularizers are added to each dense layer, along with dropout layers, to reduce overfitting.
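A minimal Keras sketch of such a classifier is shown below. The layer sizes, dropout rate, L2 strength, and optimizer are all placeholder assumptions, not the values used in the final model. The six output units correspond to the six classes left after merging.

```python
from tensorflow.keras import Input, Sequential, layers, regularizers

def build_classifier(n_features=8192, n_classes=6, l2_strength=0.01, dropout_rate=0.5):
    """Small dense network on top of precomputed VGG-16 feature vectors."""
    l2 = regularizers.l2(l2_strength)
    model = Sequential([
        Input(shape=(n_features,)),
        layers.Dense(256, activation="relu", kernel_regularizer=l2),
        layers.Dropout(dropout_rate),
        layers.Dense(64, activation="relu", kernel_regularizer=l2),
        layers.Dropout(dropout_rate),
        layers.Dense(n_classes, activation="softmax", kernel_regularizer=l2),
    ])
    # Categorical cross-entropy matches the one-hot encoded class labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

classifier = build_classifier()
```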
Final Neural Network Testing
Final testing on the test dataset showed improved class balance and a large reduction in overfitting. However, looking at the confusion matrix, the model does not perform as well when distinguishing between some pairs of classes: museums and landmarks; entertainment and museums; landmarks and entertainment; and parks and beaches/oceans.
Finding Attraction Recommendations
Attraction recommendations are found by finding the attractions within the predicted attraction class that are closest to the input image. This is done by creating feature vectors for images and measuring the cosine distance between vectors. Two different feature vectors are used to calculate distance: the VGG-16 feature vector previously used to train the neural network and a color distribution feature vector.
The color distribution vector was found by splitting each image into five sections (four corner sections and a center ellipse section) and then finding the red, green, and blue color distributions (measured in intensity from 0 to 255, with 255 being the most intense) with a specified number of bins. The number of bins sets the length of the distribution vector for each color in each section of the image. All color vectors from each section are combined into one vector representing the entire image's color distribution.
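The idea can be sketched in NumPy as follows. The ellipse size, the default of 8 bins, the normalization of each histogram to proportions, and the helper names are assumptions made for illustration, so the original implementation may differ in these details.

```python
import numpy as np

def region_masks(height, width):
    """Boolean masks: four corner regions (center ellipse cut out) plus the ellipse."""
    rows, cols = np.mgrid[0:height, 0:width]
    cy, cx = height / 2.0, width / 2.0
    # Center ellipse spanning 75% of each image dimension (an assumed size).
    ry, rx = 0.375 * height, 0.375 * width
    ellipse = ((rows - cy) / ry) ** 2 + ((cols - cx) / rx) ** 2 <= 1.0
    top, left = rows < cy, cols < cx
    corners = [top & left, top & ~left, ~top & left, ~top & ~left]
    return [corner & ~ellipse for corner in corners] + [ellipse]

def color_distribution_vector(image, bins=8):
    """Concatenate per-region, per-channel histograms of an 8-bit RGB image."""
    features = []
    for mask in region_masks(*image.shape[:2]):
        for channel in range(3):
            values = image[..., channel][mask]
            hist, _ = np.histogram(values, bins=bins, range=(0, 256))
            # Convert counts to proportions so region size does not dominate.
            features.append(hist / max(values.size, 1))
    return np.concatenate(features)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(150, 150, 3), dtype=np.uint8)
vector = color_distribution_vector(image, bins=8)
```

With 5 sections, 3 color channels, and 8 bins, the resulting vector has 5 x 3 x 8 = 120 entries.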
Now that we have all of the functions we need to get our color distribution feature vectors and our VGG-16 feature vectors for an image, we can create two new columns in the original data frame for each vector. This allows us to group our original data frame by attraction labels and easily compute distances after an input image is classified.
To find distances between our input image and every image in a specified class, we first find the cosine distance from the input image's color distribution feature vector to every color distribution feature vector in the class, and then do the same for the VGG-16 feature vectors. This gives two distance arrays, d(rgb) and d(vgg), each covering every image in our class dataset. Each distance array must be normalized so that the two are weighted evenly before a scalar is applied. After normalization, we multiply the d(vgg) distances by a scalar and then sum the two distances together to get our total distance.
The scalar depends on the attraction class in which we are finding distances. VGG-16 feature vectors were observed to be more important for man-made attraction classes, and the color distribution feature vectors were found to be more important for nature-oriented attractions. A larger scalar was used for museums, landmarks, gardens/zoos, and entertainment, and a smaller scalar was used for beaches/oceans. A scalar of one was used for parks.
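The combination of the two distances can be sketched as below. The post does not say which normalization was used, so min-max scaling is an assumption here, and the function names and the toy two-dimensional vectors are hypothetical. The sketch also assumes no feature vector is all zeros.

```python
import numpy as np

def cosine_distances(query, matrix):
    """Cosine distance from one query vector to every row of a matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - sims

def min_max_normalize(d):
    """Rescale distances to [0, 1] (assumed normalization scheme)."""
    span = d.max() - d.min()
    return (d - d.min()) / span if span > 0 else np.zeros_like(d)

def total_distance(q_rgb, q_vgg, class_rgb, class_vgg, vgg_scalar=1.0):
    """Normalized d(rgb) plus the class-dependent scalar times normalized d(vgg)."""
    d_rgb = min_max_normalize(cosine_distances(q_rgb, class_rgb))
    d_vgg = min_max_normalize(cosine_distances(q_vgg, class_vgg))
    return d_rgb + vgg_scalar * d_vgg

# Toy class with three images, each described by two tiny feature vectors.
class_rgb = [[1, 0], [0, 1], [1, 1]]
class_vgg = [[1, 0], [0, 1], [1, 1]]
total = total_distance([1, 0], [0, 1], class_rgb, class_vgg, vgg_scalar=2.0)
closest = int(np.argmin(total))
```

In this toy case the third image is moderately close on both measures and ends up with the smallest total distance, even though it is not the closest on either measure alone.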
After calculating the total distances, the class data frame is grouped by attraction names, and the mean of all image distances for each attraction is found. The attractions with the smallest mean distances are recommended to the user.
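The grouping step can be sketched with pandas as follows. The column name `attraction`, the `recommend` helper, and the toy distances are hypothetical.

```python
import pandas as pd

def recommend(class_df, total_distances, top_n=3):
    """Average per-image distances by attraction and return the closest attractions."""
    scored = class_df.assign(distance=total_distances)
    return scored.groupby("attraction")["distance"].mean().nsmallest(top_n)

# Toy class data frame: one row per image.
class_df = pd.DataFrame({"attraction": ["A", "A", "B", "B", "C"]})
recommendations = recommend(class_df, [0.2, 0.4, 0.1, 0.9, 0.25], top_n=2)
```

Averaging over all of an attraction's images means a single very close image (such as the 0.1 for attraction B here) is not enough to put that attraction at the top of the list.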
Testing the Final Model
The model is not perfect, but looking at the top recommendations, at least one of them could be a great destination for someone's next vacation!