Natural Language Processing Coding Example

Are you working with a large website that is full of useful content, but whose pages were not labeled when they were created? Then you will definitely want to try this Natural Language Processing model for page categorization.

The objective of this model was to use the available labeled pages to assign a likely page purpose to unlabeled pages based on the content of those pages. The model could be used to assign an objective in the customer journey, or the subject matter category the page should fall into within the site.

If you are planning on building a page recommender system for your content in the future, this could be a very useful model for preparing your data for upcoming machine learning projects.
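To make the setup concrete, here is a minimal sketch of the shape of data this walkthrough assumes: one file of scraped page content keyed by URL, and a smaller file of hand-labeled pages. The column names ('Page URL', 'Page Content', 'Page Purpose') mirror the ones used in the code below; the example rows themselves are hypothetical.

import pandas as pd

# Hypothetical example of 'Page_Content.csv' -- every page on the site with its scraped text
content_example = pd.DataFrame({
    'Page URL': ['/admissions', '/about/history'],
    'Page Content': ['Apply today to join our programs...',
                     'Our school was founded over a century ago...']
})

# Hypothetical example of 'Page Purpose Labeld Data.csv' -- the subset of pages labeled by hand
labeled_example = pd.DataFrame({
    'Page URL': ['/admissions'],
    'Page Purpose': ['Convert']
})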

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import category_encoders

Importing The Dataset

dataset = pd.read_csv('Page_Content.csv')
labeled_dataset = pd.read_csv('Page Purpose Labeld Data.csv')

Data Cleaning

# creating the labeled training set
df = pd.merge(dataset, labeled_dataset, how = "right", on = ['Page URL'])
df.shape
# removing rows with empty 'Page Content' cells
df = df.dropna(subset = ["Page Content"], how = "all")
df.shape
(430, 3)
# creating a label encoder (did not use in final model)
#enc = category_encoders.one_hot.OneHotEncoder(cols = ['Page Purpose'], drop_invariant = False, use_cat_names = True, return_df = True,)
# df2 = enc.fit_transform(df)
# df2 = df2.drop(['Page Purpose_-1'], axis = 1)
# df2
# labels represented in the labeled training set
sns.countplot(df['Page Purpose'])

Text Cleaning

# importing text cleaning libraries
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')  # make sure the stopword list is available locally
# Text cleaning function
def message_cleaning(message):
    # remove punctuation characters
    test_punc_removed = [char for char in message if char not in string.punctuation]
    test_punc_removed_join = ''.join(test_punc_removed)
    # strip anything that is not a letter (numbers, stray symbols)
    test_punc_removed_join_nums = re.sub('[^a-zA-Z]+', ' ', test_punc_removed_join)
    # drop English stopwords and return the remaining tokens
    test_punc_removed_join_nums_clean = [word for word in test_punc_removed_join_nums.split() if word.lower() not in stopwords.words('english')]
    return test_punc_removed_join_nums_clean
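# Quick sanity check of the cleaning function on a hypothetical sample sentence
# (not taken from the dataset): punctuation, numbers, and stopwords are removed
# and only the remaining tokens come back.
message_cleaning("This school is the best school in 2020!")
# ['school', 'best', 'school']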
# Vectorizing the text
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = message_cleaning)
df_countvectorizer = vectorizer.fit_transform(df['Page Content'])
print(vectorizer.get_feature_names())
print(df_countvectorizer.toarray())
[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
df_countvectorizer.shape
(430, 9150)

Splitting Train & Test Sets

# labeling the X and y data
X = df_countvectorizer
label = df['Page Purpose'].values
y = label
# Splitting the training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# Training the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
NB_Classifier = MultinomialNB()
NB_Classifier.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Evaluating the Model

from sklearn.metrics import classification_report, confusion_matrix
y_predict_train = NB_Classifier.predict(X_train)
# confusion matrix to check accuracy on the training set
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)
print(classification_report(y_train, y_predict_train))
              precision    recall  f1-score   support

     Convert       0.99      0.98      0.99       154
      Engage       0.92      0.92      0.92        13
     Enhance       1.00      1.00      1.00         9
      Evolve       0.92      1.00      0.96        45
     Inspire       1.00      1.00      1.00        17
       Trust       1.00      0.98      0.99       106

    accuracy                           0.98       344
   macro avg       0.97      0.98      0.98       344
weighted avg       0.98      0.98      0.98       344

# confusion matrix to check accuracy on the test set
y_predict_test = NB_Classifier.predict(X_test)
cm2 = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm2, annot = True)
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support

   Add Value       0.00      0.00      0.00         2
     Convert       0.95      1.00      0.98        42
      Engage       0.00      0.00      0.00         2
     Enhance       1.00      0.75      0.86         4
      Evolve       0.85      1.00      0.92        11
     Inspire       0.00      0.00      0.00         2
       Trust       0.88      0.91      0.89        23

    accuracy                           0.90        86
   macro avg       0.53      0.52      0.52        86
weighted avg       0.85      0.90      0.87        86

In this case we were working with a small amount of labeled data, so we validated the model with a train/test split; for the final model, however, we want to take advantage of the whole labeled dataset for training.
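With this little labeled data, a single train/test split can give a fairly noisy estimate of performance. As an optional extra step (not part of the original notebook), a quick k-fold cross-validation sketch like the one below gives a more stable picture before committing to training on everything:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 5-fold cross-validation on the full vectorized, labeled data
scores = cross_val_score(MultinomialNB(), X, y, cv = 5)
print(scores.mean(), scores.std())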

Training On The Whole Labeled Dataset

from sklearn.naive_bayes import MultinomialNB 
NB_Classifier = MultinomialNB()
label = df["Page Purpose"].values
# training the model on the full labeled training set
NB_Classifier.fit(df_countvectorizer, label)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
# sample test 
testing_sample = ["This school is the best! "] testing_sample_countvectorizer = vectorizer.transform(testing_sample) NB_Classifier.predict(testing_sample_countvectorizer)
array(['Convert'], dtype='<U9')
# loop to predict and assign a label to every page in the dataset
Page_Purpose = []
for row in dataset['Page Content'].fillna(''):
    # fillna('') guards against pages with empty content cells
    testing_sample_countvectorizer = vectorizer.transform([row])
    purpose = NB_Classifier.predict(testing_sample_countvectorizer)[0]
    Page_Purpose.append(purpose)

dataset['Page_Purpose'] = Page_Purpose
# Saving predictions to a csv
dataset.to_csv('Page Purpose Whole Labaled Dataset.csv', index = False)
final_df = pd.read_csv('Page Purpose Whole Labaled Dataset.csv')
# distribution of final predicted labels
sns.countplot(final_df['Page_Purpose'])

With our model we achieved roughly 90% accuracy (a weighted average F1-score of 0.87) on the held-out test set. Given our limited labeled data, this was a great result. Moving forward in this project, we will now be able to assess these pages' performance based on the categories they should fit into and make adjustments to those assignments as needed.
