Fixing Mislabeled Demos with Machine Learning

Collin Cierlak
GumGum Tech Blog
May 17, 2024

Introduction

We have used demo ad campaigns extensively at GumGum to show off new kinds of ad units and to give potential clients an idea of what their ad campaigns could look like with us. Since their inception, these demos have carried metadata fields, filled out when they were built, that aid in organizing and searching our demo catalog; one such field is “industry”. An industry can be thought of as the general sector of the economy an ad is most relevant to, and industries are organized hierarchically under broader “industry categories”. For example:

  • “Internet Services” industry belongs to the “Technology” category
  • “Hair Care” industry belongs to the “Beauty & Personal Care” category
Industries and their Industry Categories
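
For illustration only, the hierarchy amounts to a one-to-many mapping from category to industries. A minimal sketch using just the two examples above (the real taxonomy has many more entries):

industry_taxonomy = {
    "Technology": ["Internet Services"],
    "Beauty & Personal Care": ["Hair Care"],
}

# Reverse lookup from a specific industry to its broader category
industry_to_category = {
    industry: category
    for category, industries in industry_taxonomy.items()
    for industry in industries
}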

The Problem

Unfortunately, due to a minor bug, users of our demo managing application were accidentally assigning industry categories instead of specific industries to demos. We eventually found and fixed the bug in the application, but the damage had already been done: all our demos were saved with an industry category id in place of an industry id. Fortunately, a clever colleague (shout out to @seth_13181) noticed that we could match some of these demos against the names of the ad campaigns that were built from them, and so we were able to recover the correct industry metadata for those demos. Out of the total 22,000 demos, this gave us a neat set of ~11,000 records to use as training data, which we enriched with features extracted from the demos’ other metadata.
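
As a rough sketch of that matching step (the dataframes and column names below are illustrative toy data, not our actual schema), the recovery boiled down to a join on the campaign name:

import pandas as pd

# Toy frames with hypothetical column names; the real schema differs.
df_demos = pd.DataFrame({
    "id": [1, 2, 3],
    "campaign_name": ["L'Oreal Feria 2015", "New England Toyota Dealers", "Acme Teaser"],
})
df_campaigns = pd.DataFrame({
    "name": ["L'Oreal Feria 2015", "New England Toyota Dealers"],
    "industry_id": [42, 7],
})

# Demos whose name matches a real campaign inherit its industry_id...
df_matched = df_demos.merge(df_campaigns, left_on="campaign_name", right_on="name", how="inner")

# ...and the rest become the set we need to predict.
df_unmatched = df_demos[~df_demos.campaign_name.isin(df_campaigns.name)]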

Below we will explain how we built a throwaway multi-class classification model that we used to back-fill our missing industries, reaching a class-weighted f1 score of ~0.90, by following these steps:

  1. Exploring our data with pandas and plotly (EDA)
  2. Applying categorical feature selection tests and transformations
  3. Extracting and decomposing a text feature
  4. Building a full pipeline and selecting a model

1. EDA

Let’s start with some exploratory data analysis (EDA). After pulling our matched and unmatched demos into pandas dataframes, df_matched and df_unmatched respectively, let’s take a quick peek at the data we’ll be training on to get an idea of what kind of features will be available for us to use.

df_matched.head()

Based on their names, it seems that all our features are categorical, with the exception of campaign_name, and that our problem setup will be multi-class classification. Next we want to see if the distribution of industry categories is at least similar enough between the two sets to reasonably assume we could accurately predict the unmatched demos from our training set. It could be that the matched demos come from only a specific subset of industry categories, which would hamper our ability to predict the missing industries. My preferred graphing library is plotly, and we can easily build a few histograms to view our data using the plotly_express functions.

import plotly_express as px

px.histogram(df_unmatched, x='industry_category_id', title='Unmatched Demos').show()
px.histogram(df_matched, x='industry_category_id', title='Matched Demos').show()
Distribution of industry categories in unmatched demos
Distribution of industry categories in matched demos (training data)

Thankfully it appears as though the distributions are similar enough to proceed after removing just one industry category (18).
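
A minimal sketch of that filter, assuming category 18 is the one that appears on only one side of the split:

# Drop the single industry category (id 18) that only appears on one side of
# the split, so the training and prediction distributions line up.
df_matched = df_matched[df_matched.industry_category_id != 18]
df_unmatched = df_unmatched[df_unmatched.industry_category_id != 18]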

Now we can start our feature selection process, beginning by taking basic statistical measurements of our training data with the .describe() function, then transposing the result with .T so it fits on screen.

df_matched.describe().T

Notice that we’re holding onto a few columns that definitely won’t help us and a few that probably won’t. We can drop the id column, since that’s just the demo id, as well as the is_animated column, since it’s entirely empty. Then there are features like desktop_only, is_concept, is_lightbox, and is_interactive: boolean columns with mean values of ~5% or less, indicating that they are sparsely populated. We can’t say for certain yet that these features should be dropped; they may have some predictive power for particular industries, but we’ll keep them in mind for pruning later based on their low variance. Also, based on the count column, a few columns have missing values; let’s take a closer look using .isna().

# Drop useless columns
df_matched = df_matched.drop(columns=['id', 'is_animated'])

df_matched.isna().sum()
industry_category_id     0
business_unit_id         0
agency_id               27
mobile_only              0
desktop_only             0
is_direct                0
is_concept               0
is_lightbox              0
is_responsive            0
is_interactive           0
campaign_name            0
industry_id              0

So we have a handful of missing values for agency_id. Given that this is a column of ids, we can fill in the missing values with an arbitrary unused id such as “0” (we know 0 is unused because the min value for this column in our .describe() output is 1).

2. Categorical Features

Now let’s move on to graphing our feature and target distributions with plotly to get a better feel for our data.

# Fill the missing agency_id values with the unused id 0
df_matched = df_matched.fillna(0)

# Melt the dataframe first so plotly can easily split histograms on columns
(px.histogram(df_matched.drop(columns=['campaign_name']).melt(),
              x='value', color='variable', facet_col='variable',
              facet_col_wrap=3, facet_col_spacing=.04, facet_row_spacing=.03)
   .update_yaxes(matches=None, showticklabels=True)
   .update_layout(showlegend=False)
   .show())

Our visualization confirms the low variance we saw in .describe(). As tempting as it is to apply a VarianceThreshold transformer from sklearn to remove these features once we build our pipeline… perhaps there’s a better way to decide whether we should keep them.
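
For reference, the naive VarianceThreshold version would look something like this (the 0.05 cutoff is an arbitrary illustration, not a value we used):

from sklearn.feature_selection import VarianceThreshold

# The naive approach: drop any feature whose variance falls below an
# arbitrary cutoff (0.05 here, purely for illustration).
selector = VarianceThreshold(threshold=0.05)
selector.fit(df_matched.drop(columns=['campaign_name', 'industry_id']))
print(selector.get_feature_names_out())  # features that would survive the cutoff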

Making use of a few feature selection tests (namely chi2, f_classif, and mutual_info_classif), we can take a slightly different approach to identifying features to remove. These tests rank features by how strong a relationship they have with our target and can only be used on classification problems (regression has similar feature tests available in sklearn).

Each has its own limitations:

  • f_classif computes ANOVA F-values and thus only captures linear relationships
  • chi2 can capture non-parametric relationships but cannot be run on data containing negative values
  • mutual_info_classif can also capture non-parametric relationships but requires more data than chi2 to be accurate

Luckily, none of those prevent us from trying all 3 on our data:

import pandas as pd
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2

features = df_matched.drop(columns=['campaign_name', 'industry_id']).columns

chi2_results = chi2(df_matched[features], df_matched.industry_id)
chi2_df = pd.DataFrame(chi2_results, index=['chi2_stat', 'chi2_p_value'], columns=features)

anova_results = f_classif(df_matched[features], df_matched.industry_id)
anova_df = pd.DataFrame(anova_results, index=['f_stat', 'anova_p_value'], columns=features)

mutual_info_results = mutual_info_classif(df_matched[features], df_matched.industry_id)
mutual_info_df = pd.DataFrame([mutual_info_results], index=['mi_stat'], columns=features)

# combine results into one dataframe to view
pd.concat([chi2_df, anova_df, mutual_info_df], axis=0).T

All three provide us with a test statistic to rank our features, but chi2 and f_classif also give us a p-value to verify that the statistic is sufficiently significant. None of our results have p-values high enough to be suspicious, but as a general rule, be wary of any around 0.05 or higher. Here are the top 3 features from each test:

  • chi2 — agency_id, industry_category_id, unit_type_id
  • f_classif — industry_category_id, is_direct, is_concept
  • mutual_info_classif — industry_category_id, agency_id, is_direct
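
Those top-3 lists can be read straight off the combined results, for example (the original snippet never stores the concatenated dataframe, so we assign it to results_df here for convenience):

# Combined results: one row per feature, one column per statistic / p-value
results_df = pd.concat([chi2_df, anova_df, mutual_info_df], axis=0).T

# Top 3 features by each test statistic
for stat in ['chi2_stat', 'f_stat', 'mi_stat']:
    print(stat, list(results_df[stat].nlargest(3).index))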

Note that is_concept would have been dropped if we had naively filtered our features on variance alone. We can also see some overlap in the top features, and now we must decide which test to use in our pipeline. One approach is to check the impact on our prediction accuracy using a test model, like DecisionTreeClassifier. We’re testing with a tree-based model since we know we have some non-parametric relationships between our features and target.

Let’s begin setting up our pipeline so we can try them out. Since our categorical features are nominal (that is, the ids have no ordinal rank), we need to use a OneHotEncoder after filtering; the OneHotEncoder will one-hot encode our categorical ids so that our decision tree will not treat them as continuous features.

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# stratify on our target given the classes are not evenly distributed
df_train, df_test = train_test_split(df_matched, train_size=.8, random_state=981, stratify=df_matched.industry_id)

categorical_features = ['agency_id', 'industry_category_id', 'unit_type_id', 'business_unit_id', 'mobile_only',
                        'desktop_only', 'is_direct', 'is_concept', 'is_lightbox', 'is_responsive', 'is_interactive']

# taking the top 3 features for each test
feature_tests = [
    ('anova', SelectKBest(f_classif, k=3)),
    ('mutual info', SelectKBest(mutual_info_classif, k=3)),
    ('chi2', SelectKBest(chi2, k=3)),
]

for feature_test_step in feature_tests:
    feature_test_pipeline = Pipeline(steps=[
        feature_test_step,
        ('ohe', OneHotEncoder(handle_unknown='ignore')),
        ('dt', DecisionTreeClassifier(class_weight='balanced'))
    ])

    fit_pipeline = feature_test_pipeline.fit(df_train[categorical_features], df_train.industry_id)
    predictions = fit_pipeline.predict(df_train[categorical_features])

    # print out f1 score for each test
    print(feature_test_step[0], metrics.f1_score(df_train.industry_id, predictions, average='weighted', zero_division=0))

anova         0.695
mutual info   0.735
chi2          0.754

Filtering with chi2 gives us the best f1 score for our decision tree classifier, so we should clearly pick that one and move on… right? Well, we already know that each test had different features in its top results, so it doesn’t feel right to just leave them out of our training data, especially when using a tree-based model that can pick up on non-parametric relationships. Fortunately we do have a way to combine them: FeatureUnion.

One caveat to be aware of: the output of a FeatureUnion will include duplicate features if the outputs of our feature tests overlap, so we keep only a small number of top-scoring features from the two most distinct test results, chi2 and f_classif. Though it’s outside the scope of this exercise, FeatureUnion is also quite useful for combining more complex feature transformations, such as mixing PCA output features with the original features.
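
As a quick aside, that PCA-plus-originals pattern would look roughly like the sketch below; it is not something we use in this pipeline, and it assumes dense numeric input:

from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# A FeatureUnion that appends 2 principal components to the original features.
# FunctionTransformer() with no arguments is an identity transform that passes
# the original columns through unchanged.
pca_plus_originals = FeatureUnion([
    ('original', FunctionTransformer()),
    ('pca', PCA(n_components=2)),
])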

Applying those two tests with the same k gives us a slight boost in f1 score:

from sklearn.pipeline import FeatureUnion

feature_test_pipeline = Pipeline(steps=[
    ('feature test', FeatureUnion([
        ('chi2', SelectKBest(chi2, k=3)),
        ('ftest', SelectKBest(f_classif, k=3))
    ])),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
    ('dt', DecisionTreeClassifier(class_weight='balanced'))
])

fit_pipeline = feature_test_pipeline.fit(df_train[categorical_features], df_train.industry_id)
predictions = fit_pipeline.predict(df_train[categorical_features])

print('feature union', metrics.f1_score(df_train.industry_id, predictions, average='weighted', zero_division=0))

feature union   0.792

An f1 score of 0.792 is not bad but before moving on, let’s take another look at that campaign_name feature we skipped past at the top.

3. A Text Feature Appears

From our initial problem setup we may recall that these names are given to every demo upon creation by the user of the demo builder application, and that we were able to match them to existing campaigns in our system to create our training data.

Our original df_matched.head() results

It would not be unreasonable to assume that they have some latent features we could extract to improve our model’s predictive power. Let’s get a sense of what we can pull out of a campaign name by applying text feature extraction techniques, namely tokenizing and token counting. The idea here is to split campaign names like “New England Toyota Dealers” and “L’Oreal Feria 2015” into tokens [“New”, “England”, “Toyota”, “Dealers”, “L’Oreal”, “Feria”, “2015”] and count occurrences of those tokens across all campaign names. We expect to see strong relationships between words such as “Toyota”, “Dealers”, and “L’Oreal” and industries such as “Auto Dealers” and “Make-Up and Cosmetics”.

From sklearn we can use a CountVectorizer to split up the campaign names into tokens and count them all in one go. CountVectorizer yields us a sparse matrix which we can wrap in a DataFrame to easily display in a bar graph.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
campaign_tokens = cv.fit_transform(df_matched.campaign_name)
df_campaign_tokens = pd.DataFrame(campaign_tokens.toarray(), columns=cv.get_feature_names_out())
df_campaign_tokens_totals = df_campaign_tokens.sum().sort_values(ascending=False)
px.bar(df_campaign_tokens_totals, orientation='h')
bottom distribution cut off to fit on screen

We see a few words that appear to be useless, like “2021”, “of”, and “q4”. Though it is tempting to treat them as “stop words” to remove, as would be part of the typical preprocessing workflow for NLP problems, we need to recognize that this isn’t technically an NLP problem: we are using a CountVectorizer more like a OneHotEncoder on a set of named entities, not on a text corpus. By not removing these words and not setting any document frequency limits we will create a high-dimensionality problem, but we’ll deal with that next. There is also one parameter whose default value subtly reduces dimensionality in a way we don’t want, and we should adjust it to better suit this task: lowercase=False.

cv = CountVectorizer(lowercase=False)
campaign_tokens = cv.fit_transform(df_matched.campaign_name)
df_campaign_tokens = pd.DataFrame(campaign_tokens.toarray(), columns=cv.get_feature_names_out())
df_campaign_tokens_totals = df_campaign_tokens.sum().sort_values(ascending=False)
px.bar(df_campaign_tokens_totals, orientation='h')

By not forcing all tokens to lowercase (the default behavior of CountVectorizer), we allow our model to pick up on brand names, which are usually capitalized in our data and could be lost if counted case-insensitively.
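
A quick way to see the effect is to compare vocabulary sizes under both settings (this comparison is just illustrative; the exact counts depend on our data):

from sklearn.feature_extraction.text import CountVectorizer

# The cased vocabulary is larger, but it preserves capitalized brand names
# as distinct tokens instead of merging them with lowercase words.
vocab_lowered = CountVectorizer(lowercase=True).fit(df_matched.campaign_name).vocabulary_
vocab_cased = CountVectorizer(lowercase=False).fit(df_matched.campaign_name).vocabulary_
print(len(vocab_lowered), len(vocab_cased))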

Now we should address our dimensionality problem; not doing so would mean adding ~5,000 features, one for each distinct token, to our dataset. We will reduce the dimensionality with singular value decomposition (SVD) by applying the TruncatedSVD transformer. We cannot use principal component analysis (PCA) outright because the output of CountVectorizer is a sparse matrix and PCA requires centering the data before decomposition. TruncatedSVD works without centering, and applied to term-count matrices it is known as latent semantic analysis (LSA), making it better suited for text features. Let’s try to reduce the dimensionality roughly tenfold by setting the n_components parameter to 500.

from sklearn.decomposition import TruncatedSVD

text_pipeline = Pipeline(steps=[
    ('cv', CountVectorizer(lowercase=False)),
    ('lsa', TruncatedSVD(n_components=500))
])
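
If we want a sanity check on how much signal 500 components retain, the fitted TruncatedSVD exposes an explained-variance ratio; a quick sketch (illustrative only, we didn’t tune n_components further):

# Fit the text pipeline on its own to see how much of the token-count
# variance the 500 components keep.
fitted_text = text_pipeline.fit(df_matched.campaign_name)
retained = fitted_text.named_steps['lsa'].explained_variance_ratio_.sum()
print(f'variance retained by 500 components: {retained:.2%}')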

4. Building the Pipeline

Now that we have our text feature, let’s combine it with our categorical features using a ColumnTransformer and create our full pipeline. We’ll try out a few tree-based models and compare them against a DummyClassifier for good measure.

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, mutual_info_classif, chi2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn import metrics
from sklearn.model_selection import train_test_split

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

target = 'industry_id'

features = list(df_matched.columns)
features.remove(target)

text_feature = 'campaign_name'
categorical_features = features.copy()
categorical_features.remove(text_feature)

df_train, df_test = train_test_split(df_matched, train_size=.7, random_state=981, stratify=df_matched[target])

categorical_pipeline = Pipeline(steps=[
    ('feature test filter', FeatureUnion([
        ('chi2', SelectKBest(chi2, k=3)),
        ('ftest', SelectKBest(f_classif, k=3))
    ])),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

text_pipeline = Pipeline(steps=[
    ('cv', CountVectorizer(lowercase=False)),
    ('lsa', TruncatedSVD(n_components=500))
])

features_transformer = ColumnTransformer([
    ('categorical', categorical_pipeline, categorical_features),
    ('text', text_pipeline, text_feature)
])

models_to_try = [
    ('dummy', DummyClassifier()),
    ('dt', DecisionTreeClassifier(class_weight='balanced')),
    ('et', ExtraTreeClassifier(class_weight='balanced')),
    ('rf', RandomForestClassifier(class_weight='balanced')),
]

for model in models_to_try:
    pipeline = Pipeline(steps=[
        ('transform features', features_transformer),
        model
    ])

    pipeline_fit = pipeline.fit(df_train[features], df_train.industry_id)
    predictions = pipeline_fit.predict(df_test[features])
    score = metrics.f1_score(df_test.industry_id, predictions, average='weighted', zero_division=0)
    print(model[0], score)

dummy   0.040
dt      0.844
et      0.769
rf      0.898

RandomForestClassifier scores the highest against our test set, with a final f1 score of 0.898, so this is the model we’ll select to fill in the missing industries in our unmatched demo data. We could try to squeeze out a slightly better score with some hyperparameter tuning (tweaking n_estimators, max_features, and max_depth), but this model will only be used once before being thrown away, so this seems like a fine place to stop.
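
For completeness, the back-fill itself is just a refit on all of the matched data followed by a predict on the unmatched demos; roughly along these lines (a sketch, assuming df_unmatched has the same feature columns as df_matched):

# Refit the winning pipeline on all matched demos, then back-fill the missing
# industry ids on the unmatched demos.
final_pipeline = Pipeline(steps=[
    ('transform features', features_transformer),
    ('rf', RandomForestClassifier(class_weight='balanced')),
])
final_pipeline.fit(df_matched[features], df_matched[target])
df_unmatched['industry_id'] = final_pipeline.predict(df_unmatched[features])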

Conclusion

To summarize: we visualized our features with plotly to get a better feel for our data, used a union of three feature selection tests to programmatically filter our categorical columns, and finally extracted and decomposed a text feature from our campaign_name column. After testing three tree-based classifiers, we found that a Random Forest model gave us the best results, predicting our missing industry ids with an f1 score of ~0.90. Thanks for following along!

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
