Analyzing ESPN FC Daily Transcripts with Databricks

Rohit Bhagwat
6 min read · May 24, 2023

I spend countless hours every weekend watching football (soccer). As if that isn’t enough, I also spend an embarrassingly long time watching analysis on ESPN FC. Since I watch the ESPN FC Daily show every day, I wondered how often the panel ends up discussing the GOAT (Greatest Of All Time) debate, Lionel Messi, and other recurring topics.

(Here’s the YouTube channel)

Being a data engineer at heart, I naturally decided to build a data pipeline in Databricks to analyze these YouTube videos and get the answer.

Project Goals

The objectives of this little project were:

  • Retrieve all transcripts using the YouTube API
  • Load the transcripts from ESPN FC channel’s Daily playlist into Databricks Delta tables
  • Utilize HuggingFace transformers to summarize the transcript and extract entities
  • Determine how many times the GOAT debate was discussed
  • Experiment with two Large Language Models (LLMs): Dolly and GPT-3.5

Another personal goal was to spend less time watching content and more time playing FIFA 23 on my Xbox!
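To make the counting goal concrete, here is a minimal sketch in plain Python of how GOAT mentions could be tallied once the transcripts are available as text. It is not the pipeline itself: the keyword pattern, the function names, and the sample transcripts are all assumptions made up for illustration, and a simple regex is only a rough stand-in for the entity extraction and LLM steps listed above.

```python
import re
from typing import Dict, List

# Assumed keyword pattern: a "GOAT mention" here is simply the word "GOAT"
# or the phrase "greatest of all time", matched case-insensitively.
GOAT_PATTERN = re.compile(r"\b(goat|greatest of all time)\b", re.IGNORECASE)


def count_goat_mentions(transcript: str) -> int:
    """Return how many GOAT-related phrases appear in one transcript."""
    return len(GOAT_PATTERN.findall(transcript))


def episodes_discussing_goat(transcripts: Dict[str, str]) -> List[str]:
    """Return the IDs of episodes that mention the GOAT debate at least once."""
    return [
        video_id
        for video_id, text in transcripts.items()
        if count_goat_mentions(text) > 0
    ]


# Made-up sample data (hypothetical video IDs and text), for illustration only.
sample_transcripts = {
    "video_1": "Is Messi the GOAT? Some say the greatest of all time is Pele.",
    "video_2": "Arsenal drop points again in the title race.",
}

goat_episodes = episodes_discussing_goat(sample_transcripts)
print(f"{len(goat_episodes)} of {len(sample_transcripts)} episodes mention the GOAT debate")
```

The same logic could later be applied per row of a transcripts table instead of a Python dictionary; keyword matching will miss debates that never use the word "GOAT", which is one reason to try the language models as well.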

Process Overview

Let’s go over the steps involved:

1. Prerequisites and Setup

Before starting, ensure you have:

  • Access to a Databricks Workspace

Rohit Bhagwat

Data & Analytics Professional, Aspiring Data Scientist, Runner, Gadget enthusiast