Parsing Web Pages to Extract Titles with BeautifulSoup - Fcodenotes

The simplest way to extract a web page's title is to read the <title> tag in the HTML structure. The following snippet demonstrates this approach:

from bs4 import BeautifulSoup
# HTML content of the web page
html_content = "<html><head><title>Example Page</title></head><body><h1>Hello, World!</h1></body></html>"
# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')
# Extract the title
title = soup.title.string
# Print the extracted title
print(title)
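
The snippet above assumes the page always has a non-empty <title> tag. When the tag is missing, soup.title is None and accessing .string raises an AttributeError. A minimal sketch of a more defensive version follows; the helper name extract_title is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the <title> text, or None when the tag is missing or empty.

    extract_title is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag;
    # .string is None when the tag exists but holds no text.
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


print(extract_title("<html><head><title> Example Page </title></head></html>"))
print(extract_title("<html><body><p>No title here</p></body></html>"))
```

Returning None instead of raising lets the caller decide what to do with untitled pages.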

Sometimes the page title lives in an <h1> tag instead. In that case, BeautifulSoup can extract the text inside the <h1> tag. Here is an example:

from bs4 import BeautifulSoup
# HTML content of the web page
html_content = "<html><body><h1>Example Page</h1><p>Hello, World!</p></body></html>"
# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')
# Extract the title from the h1 tag
title = soup.h1.string
# Print the extracted title
print(title)
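
One caveat: .string returns None when the <h1> contains several child nodes, for example text mixed with an <em> tag. A small sketch using get_text() handles that case; the helper name extract_h1 is an assumption made for illustration:

```python
from bs4 import BeautifulSoup


def extract_h1(html):
    """Return the text of the first <h1>, including text in nested tags.

    extract_h1 is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is None:
        return None
    # get_text() gathers every string inside the tag; strip=True trims each
    # piece and the " " separator joins the pieces with a single space.
    return h1.get_text(" ", strip=True)


nested = "<html><body><h1>Example <em>Page</em></h1></body></html>"
print(BeautifulSoup(nested, "html.parser").h1.string)  # None: h1 has two children
print(extract_h1(nested))                              # Example Page
```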

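A page may also declare its title in a <meta> tag, such as the Open Graph og:title property. In that case, the title is stored in the tag's content attribute. Here is an example:

from bs4 import BeautifulSoup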
# HTML content of the web page
html_content = "<html><head><meta property='og:title' content='Example Page'></head><body><h1>Hello, World!</h1></body></html>"
# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')
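# Extract the title from the og:title meta tag
# (soup.meta would return the first <meta> of any kind, e.g. a charset tag)
title = soup.find('meta', attrs={'property': 'og:title'})['content']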
# Print the extracted title
print(title)
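
The three approaches on this page can be combined into a single lookup with fallbacks. The sketch below is one possible arrangement, not a definitive implementation: the function name find_page_title and the priority order (og:title, then <title>, then <h1>) are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup


def find_page_title(html):
    """Try og:title, then <title>, then the first <h1>; return None if all fail.

    find_page_title and this priority order are illustrative assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")

    # 1. Open Graph meta tag: <meta property="og:title" content="...">
    og = soup.find("meta", attrs={"property": "og:title"})
    if og is not None and og.get("content"):
        return og["content"].strip()

    # 2. The <title> tag in the document head
    if soup.title is not None and soup.title.string:
        return soup.title.string.strip()

    # 3. The first <h1> in the body
    h1 = soup.find("h1")
    if h1 is not None:
        text = h1.get_text(" ", strip=True)
        if text:
            return text

    return None


html_content = "<html><head><title>Example Page</title></head><body><h1>Hello, World!</h1></body></html>"
print(find_page_title(html_content))
```

With the sample HTML above there is no og:title tag, so the function falls through to the <title> tag.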