Analyzing ESPN FC Daily Transcripts with Databricks
I spend countless hours every weekend watching football (soccer). As if that isn’t enough, I also spend an embarrassingly long time watching analysis on ESPN FC. As a daily viewer of the ESPN FC Daily show, I wondered how often the pundits end up discussing the GOAT (Greatest Of All Time) debate, Lionel Messi, and other recurring topics.
Being a data engineer at heart, I naturally decided to build a data pipeline in Databricks to analyze these YouTube videos and get the answer.
Project Goals
The objectives of this little project were:
- Retrieve all transcripts using the YouTube API
- Load the transcripts from ESPN FC channel’s Daily playlist into Databricks Delta tables
- Use Hugging Face Transformers to summarize each transcript and extract entities
- Determine how many times the GOAT debate was discussed
- Test with the following Large Language Models (LLMs): Dolly and GPT-3.5
Another personal goal was to spend less time watching content and more time playing FIFA 23 on my Xbox!
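Before reaching for an LLM, the GOAT-counting goal can be approximated with a plain keyword match. The sketch below is only a rough baseline, not the pipeline built in this post: the `GOAT_PATTERN` keyword list, the `count_goat_episodes` helper, and the sample transcripts are all assumptions made up for illustration.

```python
import re

# Hypothetical keyword list; the post does not specify which phrases count as
# "the GOAT debate", so this pattern is an assumption.
GOAT_PATTERN = re.compile(r"\b(goat|greatest of all time)\b", re.IGNORECASE)


def count_goat_episodes(transcripts):
    """Return how many transcripts mention the GOAT debate at least once.

    `transcripts` maps a video id to its full transcript text.
    """
    return sum(1 for text in transcripts.values() if GOAT_PATTERN.search(text))


# Made-up sample data, standing in for transcripts loaded from a Delta table.
sample = {
    "video_1": "Is Messi the GOAT after the World Cup?",
    "video_2": "Transfer news and weekend previews.",
    "video_3": "The greatest of all time debate never ends.",
}

print(count_goat_episodes(sample))
```

A keyword match like this misses paraphrases ("best player ever") and cannot tell a passing mention from a full segment, which is part of the motivation for trying summarization and LLMs on the same transcripts.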
Process Overview
Let’s go over the steps involved:
1. Prerequisites and Setup
Before starting, ensure you have:
- Access to a Databricks Workspace