Scraping one of the main Spanish TV broadcasters with cookies, Python and yt-dlp
As a little project I wondered: would it be possible to scrape the contents of a Spanish channel website so I could get their content offline and without any loss? That way you would be able to watch it anywhere, with no ads and no resolution limits. (This broadcaster is also known for its bad practices with ad breaks, on-screen elements and a website that is unbrowsable without an adblocker.)
I was thinking about publishing the code at first, but I think it will be safer and more productive to write about the process of doing it, so you can extend it to any website you like:
If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime.
This is generic information and the URLs provided in this article are not real. It is only for educational purposes about working with Python, requests and cookies.
Analyzing the problem
To be able to see any content on that channel's platform, you need a user account. That means we will have to create one to be able to send our requests to their internal API.
For this you can use any modern browser with the developer tools open in the Network tab. That way you will be able to see the login sequence.
In this case, the login is a POST request to an API endpoint sending your username and password in the request body. The server then generates a session token inside a cookie, which allows you to interact with the internal API.
You can replicate that in Python like this:
import requests

session = requests.Session()  # create a session object that stores cookies for future requests
credentials = {"username": "myuser", "password": "mypassword"}
session.post("http://api.web.com/login", data=credentials)
You can use the python-dotenv package to store credentials in an external file.
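A minimal sketch of that approach, assuming a .env file next to the script with SCRAPER_USER and SCRAPER_PASS entries (the variable names are up to you):

import os
from dotenv import load_dotenv  # pip3 install python-dotenv

load_dotenv()  # read the key=value pairs from the local .env file into the environment
credentials = {
    "username": os.getenv("SCRAPER_USER"),
    "password": os.getenv("SCRAPER_PASS"),
}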
Choosing example content
The best way to scrape something is to use actual data, but be careful: some websites have request limits and may block you with an HTTP 429 Too Many Requests error. One way of avoiding this is dumping the JSON responses and content locally and scraping over that local copy.
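As a sketch, a small helper can wrap the GET requests (the cache filename here is arbitrary, pick whatever layout suits you):

import os

def get_cached(session, url, cache_file):
    # reuse the local copy if we already fetched this URL
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    # otherwise hit the site once and keep the response for later runs
    text = session.get(url).text
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(text)
    return text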
After logging in, we will send a GET request to the URL of the content we want to scrape (e.g. http://web.com/episode/example). Use a browser in developer mode first to see what you need to look for. In my case, the useful data was stored inside two <script> tags.
For this, you may find the re and BeautifulSoup4 packages useful for filtering the info you need. In this case, a simple call to re was all I needed:
import re

html = session.get("http://web.com/episode/example").text
matches = re.findall(r"<script>(.*?)</script>", html, re.DOTALL)  # re.DOTALL lets the match span multiple lines
Inside those tags there's another URL that will lead you to the specific content API endpoint. The response is a JSON document where you can scrape the metadata and the streaming URL.
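For example, something along these lines, keeping in mind that the URL pattern below is made up and you will have to adapt it to whatever your <script> tags actually contain:

# made-up pattern: adapt it to the URLs embedded in your <script> tags
api_urls = re.findall(r'https?://api\.web\.com/content/[^"\s]+', matches[0])
data = session.get(api_urls[0]).json()  # parse the JSON response into a dict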
You can use the .get() method on the parsed JSON dict, or the jmespath package, to fetch the data you want from a JSON response into variables.
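A short sketch of both approaches; the key names ("title", "media.streams") are hypothetical and depend on the JSON your endpoint returns:

import jmespath  # pip3 install jmespath

# plain dict access, with a fallback value if the key is missing
title = data.get("title", "untitled")

# jmespath queries nested structures with a single expression
items = jmespath.search("media.streams", data)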
Interfacing your code with yt-dlp
yt-dlp is a fork of youtube-dl, a Python utility to download video from lots of sources. You can install it with:
pip3 install yt-dlp
Suppose we already extracted a filename string and the m3u8 playlist URL of the content we want to download. We can pass them to yt-dlp like this:
import yt_dlp

# the options we will pass to yt-dlp:
# in this case only the outtmpl property, to specify the
# output filename we extracted in a previous request
ydl_opts = {'outtmpl': title + '.mp4'}

for item in items:
    if "mpegurl" in item.get("type", ""):
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            error_code = ydl.download([item.get("url")])
        break  # stop after the first HLS playlist
That will get your file downloaded: yt-dlp will take care of fetching the separate audio and video streams and merging them into a single file (it relies on ffmpeg for the merge, so make sure it is installed).
I hope you find this information useful and get your hands dirty with whatever comes to mind. Scraping can be very powerful for automation, IoT, Telegram notifications…