Natural Language Processing - Episode 1

This episode references this notebook. It will familiarize you with the dataset and the baseline relevant to the business problem we want to solve.

1. Load the Data

We are going to build a model that classifies customer reviews as positive or negative sentiment, using the Women's E-Commerce Clothing Reviews Dataset. Here is what the data looks like:

import pandas as pd
df = pd.read_parquet('train.parquet')
print(f'num of rows: {df.shape[0]}')
    num of rows: 20377

The data is stored in a Parquet file, a framework-agnostic, columnar storage format that you are likely to encounter in the wild. It works seamlessly with pandas and is commonly available as an export format when your data already lives in a database.

df.head()
       labels                                             review
    0       0  Odd fit: I wanted to love this sweater but the...
    1       1  Very comfy dress: The quality and material of ...
    2       0  Fits nicely but fabric a bit thin: I ordered t...
    3       1  Great fit: Love these jeans, fit and style... ...
    4       0  Stretches out, washes poorly. wish i could ret...
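
Before settling on a baseline, it can help to check how the two labels are distributed. The snippet below is a minimal sketch of that check. It uses a small made-up DataFrame (`toy_df`, a hypothetical stand-in for the real training data) with the same `labels` and `review` columns, so the numbers it produces say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for the training data; the real df comes from train.parquet.
toy_df = pd.DataFrame({
    "labels": [0, 1, 0, 1, 1],
    "review": [
        "Odd fit",
        "Very comfy dress",
        "Fits nicely but fabric a bit thin",
        "Great fit",
        "Lovely top",  # invented example row
    ],
})

# Fraction of examples per label; the largest share belongs to the majority class.
class_share = toy_df["labels"].value_counts(normalize=True)
majority_label = class_share.idxmax()

print(class_share)
print(f"majority label: {majority_label}")
```

Running the same two lines on `df` should show which label dominates the real training set and by how much.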

2. Fit a Baseline Model

Before we begin training a model, it is useful to set a baseline. One such baseline is the majority-class classifier, which labels every example with the majority class. We can then compute our performance metrics for this baseline, which in this case are accuracy and the area under the ROC curve (AUC):

from sklearn.metrics import accuracy_score, roc_auc_score

valdf = pd.read_parquet('valid.parquet')
# Predict the positive class (1) for every validation example.
baseline_predictions = [1] * valdf.shape[0]
base_acc = accuracy_score(valdf.labels, baseline_predictions)
base_rocauc = roc_auc_score(valdf.labels, baseline_predictions)

msg = 'Baseline Accuracy: {}\nBaseline AUC: {}'
print(msg.format(round(base_acc, 3), round(base_rocauc, 3)))
    Baseline Accuracy: 0.773
    Baseline AUC: 0.5
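
The baseline above hard-codes the label 1. As an alternative sketch, not the approach used in the notebook, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` can pick the majority class from the training labels instead. The arrays `y_train` and `y_valid` below are small invented examples standing in for `df.labels` and `valdf.labels`, and the all-zero feature columns are placeholders, since this classifier ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical toy labels; in the episode these would be df.labels and valdf.labels.
y_train = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_valid = np.array([1, 0, 1, 1, 0, 1])

# DummyClassifier ignores the features, so a placeholder column of zeros is enough.
X_train = np.zeros((len(y_train), 1))
X_valid = np.zeros((len(y_valid), 1))

# Learn the most frequent training label and predict it for every validation row.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
preds = baseline.predict(X_valid)

toy_acc = accuracy_score(y_valid, preds)
toy_auc = roc_auc_score(y_valid, preds)
print(f"Toy baseline accuracy: {toy_acc:.3f}")
print(f"Toy baseline AUC: {toy_auc:.3f}")
```

The accuracy of such a baseline equals the share of the majority class in the validation labels. Its AUC is 0.5 because a constant prediction cannot rank any positive example above a negative one, which is why accuracy alone can look flattering on imbalanced data.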

Now that we understand the dataset and the problem a bit more, we can start building our model. We will draw upon machine learning techniques from natural language processing to see if we can train an algorithm to predict the sentiment of these fashion reviews.