Vigilance and Details: Interactive Debugging Outschool Search with Promoted

by Brindi (Outschool), Parima (Outschool), Elsa (Promoted), Andrew (Promoted)

At Outschool, we take search quality extremely seriously. We want to make sure our users are consistently served relevant and accurate search results. When we get a report that search results are not quite right, we jump into action to find out exactly what happened and fix our algorithm so that similar reports don’t happen again. Below is an example of how we typically debug search ranking errors.

The Outschool team continuously monitors search quality reports. Red alert! A user had a search result that wasn’t as expected! The search query was “national geographic level 3 flex.” The item “Book Club: National Geographic Level 3 English Reading Club FLEX Series 1” was expected to be first, but it ranked 11th. Other items that didn’t share many title keywords ranked higher. The Outschool team reaches out to our search partners at Promoted, who immediately jump onto debugging the issue with us.

Red alert! Fix search immediately! The Promoted-Outschool team jumps into action.

First, we try repeating the same search on Outschool to reproduce the bug. However, now it doesn’t appear: the expected listing is 2nd or 3rd in our test queries. That’s not perfect, but it’s good to see that the search ranking issue may be transient or intermittent. Still, that is not good enough! We dig deeper.

We need to reproduce the bad search result set exactly as the user reported it, with exhaustive details about why it happened, so that we can fix the root cause. Since our search results are personalized, it is not trivial to reproduce the results as users see them. Promoted’s backend (“Search Introspection”) logs all searches for all users along with their associated parameters. These data logs become the starting point for our investigations.

  1. We use Promoted’s Manager tool to search for the past searches of the target query, “national geographic level 3 flex.”

  2. Promoted creates a spreadsheet of all items delivered and not delivered, including scores and thousands of top features. This is a regular Google spreadsheet, so we can use comments, search, sorting, sharing, and CSV exports, just like any spreadsheet. [Some details may have been modified to protect confidential business information.]

The Outschool and Promoted teams investigate together. First, we look at the values for “Predict Click,” “Predict Purchase,” and “Ecv3” (the expected dollar value if an item is purchased). Outschool uses a mixture of these three terms, called the “Blender Utility Score,” to decide what to show. Typically, “Predict Click” is most associated with search relevance.

Problems can arise if the ranking model over-indexes on ECV instead of the relevance of a search result. What if we ignored the purchase probability and ECV and ranked only on “Predict Click”?
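To make the over-indexing concrete, here is a minimal sketch of how a blended utility score can let a high ECV dominate relevance. The weights, field names, and numbers are illustrative assumptions, not Outschool’s or Promoted’s actual formula:

```python
# Hypothetical blend of the three model outputs described above.
# Weights and field names are assumptions for illustration only.

def blender_utility_score(item, w_click=1.0, w_purchase=1.0, w_ecv=1.0):
    """Combine predicted click, predicted purchase, and expected value."""
    return (w_click * item["predict_click"]
            + w_purchase * item["predict_purchase"]
            + w_ecv * item["ecv3"])

items = [
    {"title": "Item A", "predict_click": 0.12, "predict_purchase": 0.030, "ecv3": 45.0},
    {"title": "Item B", "predict_click": 0.31, "predict_purchase": 0.004, "ecv3": 12.0},
]

# Blended ranking vs. a click-only re-ranking.
blended = sorted(items, key=blender_utility_score, reverse=True)
click_only = sorted(items, key=lambda i: i["predict_click"], reverse=True)
```

In this toy example, the blended score ranks Item A first because its large ECV swamps the other terms, while the click-only sort prefers the more relevant-looking Item B.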

(Video: changing the sort order by clicks.)

While this helps, sorting by click probability still places the target item 4th, not 1st or 2nd. Let’s look at some important click features to see why that could be.

(Video: inspecting features.)

The score is composed of thousands of machine learning features. Through Promoted’s logs, we’re able to rank and sort by the features that are most important. On initial inspection, the text-matching features appear correct. There is no strong signal of personalization, which can sometimes distort results. What does catch our attention, though, is that the target item lacks recent engagement history compared to other high-ranking results with high text-matching scores. Therefore, the issue is not a missing or broken feature, but a comparative “cold start” issue: the model is more confident recommending similar items with more training data than potentially better-matched items with less data.

The Solution

How can we fix the cold start issue? It is unlikely to be substantially fixed by improving the click model with better text-matching features or by training on more data. The Outschool search team and Promoted’s team hop onto a quick call to iron out a solution:

(Video: inspecting the number of search tokens per search over time.)

We review the distribution of search terms over time. A relatively low percentage of searches include 4 or more keywords, and the Outschool team knows that these tend to be highly specific searches.
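A check like this amounts to counting tokens per query in the search logs. The query list below is invented for illustration; in practice these queries would come from the search introspection logs described above:

```python
# Illustrative check of how often searches contain 4 or more keywords.
# The query log is made up for the example.

queries = [
    "national geographic level 3 flex",
    "spanish",
    "minecraft",
    "algebra 1",
    "book club national geographic",
]

long_queries = sum(1 for q in queries if len(q.split()) >= 4)
share = long_queries / len(queries)  # fraction of highly specific searches
```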

From this experience, the following rule is created: if a listing matches 4 or more search terms in the same order, even if words in between don’t match, then we rank those items first (i.e., longest common subsequence >= 4). This rule would still rank the target item second in this case. But for this query, that is acceptable and perhaps ideal performance, and it would be unlikely to conflict with the ML optimization system on less-targeted queries.

Here are the new results:

The expected item now ranks 2nd. Items with similar names also rank highly.

Much better!

We confirm that this rule is sometimes triggered, but not too frequently: OK, it looks safe.

Great! This change probably affects too few queries to impact A/B metrics, but it matters to the people making these specific searches. We will monitor over time to see whether more people on Outschool start making longer search queries as their behavior evolves given dramatically better search results.

About the Author

Parima Shah

My name is Parima. I am a professional problem solver and creative coder.