Introducing Discovery Ad Performance Analysis

Posted by Manisha Arora, Nithya Mahadevan, and Aritra Biswas, gPS Data Science team

Overview of Discovery Ads and need for Ad Performance Analysis

Discovery ads, launched in May 2019, allow advertisers to easily extend their reach of social ads users across YouTube, Google Feed and Gmail worldwide. They provide brands a new opportunity to reach 3 billion people as they explore their interests and search for inspiration across their favorite Google feeds (YouTube, Gmail, and Discover) — all with a single campaign. Learn more about Discovery ads here.

Due to these uniquenesses, customers need a data driven method to identify textual & imagery elements in Discovery Ad copies that drive Interaction Rate of their Discovery Ad campaigns, where interaction is defined as the main user action associated with an ad format—clicks and swipes for text and Shopping ads, views for video ads, calls for call extensions, and so on.

Interaction Rate = interaction / impressions

“Customers need a data driven method to identify textual & imagery elements in Discovery Ad copies that drive Interaction Rate of their campaigns.”

– Manisha Arora, Data Scientist

Our analysis approach:

The Data Science team at Google is investing in a machine learning approach to uncover insights from complex unstructured data and provide machine learning based recommendations to our customers. Machine Learning helps us study what works in ads at scale and these insights can greatly benefit the advertisers.

We follow a six-step based approach for Discovery Ad Performance Analysis:

  • Understand Business Goals
  • Build Creative Hypothesis
  • Data Extraction
  • Feature Engineering
  • Machine Learning Modeling
  • Analysis & Insight Generation

To begin with, we work closely with the advertisers to understand their business goals, current ad strategy, and future goals. We closely map this to industry insights to draw a larger picture and provide a customized analysis for each advertiser. As a next step, we build hypotheses that best describe the problem we are trying to solve. An example of a hypothesis can be -”Do superlatives (words like “top”, “best”) in the ad copy drive performance?”

“Machine Learning helps us study what works in ads at scale and these insights can greatly benefit the advertisers.”

Manisha Arora, Data Scientist

Once we have a hypothesis we are working towards, the next step is to deep-dive into the technical analysis.

Data Extraction & Pre-processing

Our initial dataset includes raw ad text, imagery, performance KPIs & target audience details from historic ad campaigns in the industry. Each Discovery ad contains two text assets (Headline and Description) and one image asset. We then apply ML to extract text and image features from these assets.

Text Feature Extraction

We apply NLP to extract the text features from the ad text. We pass the raw text in the ad headline & description through Google Cloud’s Language API which parses the raw text into our feature set: commonly used keywords, sentiments etc.

Example: 

Image Feature Extraction

We apply Image Processing to extract image features from the ad copy imagery. We pass the raw images through Google Cloud’s Vision API & extract image components including objects, person, background, lighting etc.

Following are the holistic set of features that are extracted from the ad content:

Feature Design


Text Feature Design

There are two types of text features being included in DisCat:

1. Generic text feature

a. These are features returned by Google Cloud’s Language API including sentiment, word / character count, tone (imperative vs indicative), symbols, most frequent words and so on.

2. Industry-specific value propositions

a. These are features that only apply to a specific industry (e.g. finance) that are manually curated by the data science developer in collaboration with specialists and other industry experts.

  • For example, for the finance industry, one value proposition can be “Price Offer”. A list of keywords / phrases that are related to price offers (e.g. “discount”, “low rate”, “X% off”) will be curated based on domain knowledge to identify this value proposition in the ad copies. NLP techniques (e.g. wordnet synset) and manual examination will be used to make sure this list is inclusive and accurate.

Image Feature Design

Like the text features, image features can largely be grouped into two categories:

1. Generic image features

a. These features apply to all images and include the color profile, whether any logos were detected, how many human faces are included, etc.

b. The face-related features also include some advanced aspects: we look for prominent smiling faces looking directly at the camera, we differentiate between individuals vs. small groups vs. crowds, etc.

2. Object-based features

a. These features are based on the list of objects and labels detected in all the images in the dataset, which can often be a massive list including generic objects like “Person” and specific ones like particular dog breeds.

b. The biggest challenge here is dimensionality: we have to cluster together related objects into logical themes like natural vs. urban imagery.

c. We currently have a hybrid approach to this problem: we use unsupervised clustering approaches to create an initial clustering, but we manually revise it as we inspect sample images. The process is:

  • Extract object and label names (e.g. Person, Chair, Beach, Table) from the Vision API output and filter out the most uncommon objects
  • Convert these names to 50-dimensional semantic vectors using a Word2Vec model trained on the Google News corpus
  • Using PCA, extract the top 5 principal components from the semantic vectors. This step takes advantage of the fact that each Word2Vec neuron encodes a set of commonly adjacent words, and different sets represent different axes of similarity and should be weighted differently
  • Use an unsupervised clustering algorithm, namely either k-means or DBSCAN, to find semantically similar clusters of words
  • We are also exploring augmenting this approach with a combined distance metric:

d(w1, w2) = a * (semantic distance) + b * (co-appearance distance)

where the latter is a Jaccard distance metric

Each of these components represents a choice the advertiser made when creating the messaging for an ad. Now that we have a variety of ads broken down into components, we can ask: which components are associated with ads that perform well or not so well?

We use a fixed effects1 model to control for unobserved differences in the context in which different ads were served. This is because the features we are measuring are observed multiple times in different contexts i.e. ad copy, audience groups, time of year & device in which ad is served.

The trained model will seek to estimate the impact of individual keywords, phrases & image components in the discovery ad copies. The model form estimates Interaction Rate (denoted as ‘IR’ in the following formulas) as a function of individual ad copy features + controls:

We use ElasticNet to spread the effect of features in presence of multicollinearity & improve the explanatory power of the model:

“Machine Learning model estimates the impact of individual keywords, phrases, and image components in discovery ad copies.”

– Manisha Arora, Data Scientist

 

Outputs & Insights

Outputs from the machine learning model help us determine the significant features. Coefficient of each feature represents the percentage point effect on CTR.

In other words, if the mean CTR without feature is X% and the feature ‘xx’ has a coeff of Y, then the mean CTR with feature ‘xx’ included will be (X + Y)%. This can help us determine the expected CTR if the most important features are included as part of the ad copies.

Key-takeaways (sample insights):

We analyze keywords & imagery tied to the unique value propositions of the product being advertised. There are 6 key value propositions we study in the model. Following are the sample insights we have received from the analyses:

Shortcomings:

Although insights from DisCat are quite accurate and highly actionable, the moel does have a few limitations:

1. The current model does not consider groups of keywords that might be driving ad performance instead of individual keywords (Example – “Buy Now” phrase instead of “Buy” and “Now” individual keywords).

2. Inference and predictions are based on historical data and aren’t necessarily an indication of future success.

3. Insights are based on industry insights and may need to be tailored for a given advertiser.

DisCat breaks down exactly which features are working well for the ad and which ones have scope for improvement. These insights can help us identify high-impact keywords in the ads which can then be used to improve ad quality, thus improving business outcomes. As next steps, we recommend testing out the new ad copies with experiments to provide a more robust analysis. Google Ads A/B testing feature also allows you to create and run experiments to test these insights in your own campaigns.

Summary

Discovery Ads are a great way for advertisers to extend their social outreach to millions of people across the globe. DisCat helps break down discovery ads by analyzing text and images separately and using advanced ML/AI techniques to identify key aspects of the ad that drives greater performance. These insights help advertisers identify room for growth, identify high-impact keywords, and design better creatives that drive business outcomes.

Acknowledgement

Thank you to Shoresh Shafei and Jade Zhang for their contributions. Special mention to Nikhil Madan for facilitating the publishing of this blog.

Notes

  1. Greene, W.H., 2011. Econometric Analysis, 7th ed., Prentice Hall;

    Cameron, A. Colin; Trivedi, Pravin K. (2005). Microeconometrics: Methods and Applications