How can I retrieve the page title of a webpage using Python?

How can I retrieve the page title (the HTML title tag) of a webpage using Python?

Here is a simplified version of @Vinko Vrsalovic's answer:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

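Notes:

soup.title finds the first title element anywhere in the HTML document
title.string assumes the title has exactly one child node and that this child is a string; otherwise it returns None

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup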

I always use lxml for tasks like this. You could also use BeautifulSoup.

import lxml.html
t = lxml.html.parse(url)
print t.find(".//title").text

Edit based on the comments:

from urllib2 import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print p.find(".//title").text

No HTML parsing library is needed. Fetch the page with Requests and slice the title out of the response text:

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
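The slice above gives a wrong result when either tag is missing, because str.find then returns -1. A minimal sketch of the same string-slicing idea with that case guarded is shown here; the helper name `title_from_text` is hypothetical, and it assumes a lowercase `<title>` tag without attributes, just like the snippet above.

```python
def title_from_text(html_text):
    """Return the text between <title> and </title>, or None if either is missing."""
    open_tag, close_tag = '<title>', '</title>'
    start = html_text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)                  # skip past the opening tag
    end = html_text.find(close_tag, start)  # search only after the opening tag
    if end == -1:
        return None
    return html_text[start:end].strip()


sample = '<html><head><title> Friends - IMDb </title></head></html>'
print(title_from_text(sample))
print(title_from_text('<p>no title here</p>'))
```

With Requests, this would be called as `title_from_text(n.text)`.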

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()

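For such a simple task this may be overkill, but if you plan to do more than that, it is saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content, and regexes or some other parser to parse the HTML).

Links:
BeautifulSoup
mechanize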

#!/usr/bin/env python
#coding:utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()

#This parses the content
soup = BeautifulSoup(data)
title = soup.find('title')

#This outputs the content :)
print title.renderContents()

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        # Remember whether the tag that just opened is <title>
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
# Decode the bytes; calling str() on them would give the "b'...'" repr instead
html_string = urlopen(url).read().decode('utf-8', errors='replace')

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
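HTMLParser may deliver the text of an element in more than one handle_data call, so keeping only one chunk can truncate a title. The following is a sketch of a variant that collects every chunk until the closing tag and works on an in-memory string, so it can be tried without network access. The names `TitleCollector` and `extract_title` are hypothetical and not part of the answer above.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect all text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        # Join the chunks; report None when no title text was seen
        return ''.join(self.parts).strip() or None


def extract_title(html_string):
    parser = TitleCollector()
    parser.feed(html_string)
    parser.close()
    return parser.title


print(extract_title('<html><head><title>Example Domain</title></head></html>'))
print(extract_title('<p>no title</p>'))
```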

Use soup.select_one to target the title tag:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

Using a regular expression:

import re
# raw_html holds the page source as a string
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
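A plain `<title>(.*?)</title>` pattern misses titles that span several lines, use uppercase tags, or carry attributes. Here is a hedged sketch of a slightly more tolerant pattern; the names `TITLE_RE` and `regex_title` are hypothetical, and a regular expression remains only an approximation of real HTML parsing.

```python
import html
import re

# Allow attributes on the tag, any letter case, and newlines inside the title.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def regex_title(raw_html, default='No title'):
    match = TITLE_RE.search(raw_html)
    if not match:
        return default
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return ' '.join(text.split())


print(regex_title('<TITLE lang="en">\n  Tom &amp; Jerry\n</TITLE>'))
print(regex_title('<p>nothing</p>'))
```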

Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without breaking it; if anything unexpected happens,
get_title() returns None.
When Parser() downloads the page, it converts it to ASCII,
regardless of the charset used in the page, ignoring any errors.
It would be trivial to change to_ascii() to convert the data to UTF-8 or any other encoding. Just add an encoding argument and rename the function to something like to_encoding().
By default, HTMLParser() breaks on broken HTML, even on trivial things like mismatched tags. To prevent this, I replaced HTMLParser()'s error method with a function that ignores errors.

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

'''
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    # Always return a str, because HTMLParser.feed() expects text
    if is_bytes(data):
        return data.decode('ascii', errors='ignore')
    if not is_string(data):
        data = str(data)
    return data.encode('ascii', errors='ignore').decode('ascii')

class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before any parsing happens
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False

def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
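The answer mentions that to_ascii() could take an encoding argument and be renamed to to_encoding(). A minimal sketch of what that might look like follows; the exact signature and the UTF-8 default are assumptions, and the function always returns a str so the result can be passed to HTMLParser.feed().

```python
def to_encoding(data, encoding='utf-8'):
    """Return data as a str limited to characters that `encoding` can represent."""
    if isinstance(data, bytes):
        # Bytes from the network: decode, silently dropping undecodable bytes.
        return data.decode(encoding, errors='ignore')
    if not isinstance(data, str):
        data = str(data)
    # Round-trip through the codec to drop characters it cannot represent.
    return data.encode(encoding, errors='ignore').decode(encoding)


print(to_encoding('Caf\u00e9'.encode('utf-8')))
print(to_encoding('Caf\u00e9', encoding='ascii'))
```

Inside Parser.__init__, the call would then read `self.feed(to_encoding(urlopen(url).read()))`.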

soup.title.string actually returns a unicode string.
To convert it to a plain (byte) string in Python 2, you need to do
string = string.encode('ascii', 'ignore')
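To make the effect of the 'ignore' error handler concrete, here is a small sketch that uses the IMDb title from the earlier answer as sample data. It only illustrates what the call does to non-ASCII characters; it is not a recommendation to discard them.

```python
title = u'Friends (TV Series 1994\u20132004) - IMDb'

# 'ignore' silently drops every character that ASCII cannot represent,
# so the en dash (U+2013) between the years disappears.
ascii_bytes = title.encode('ascii', 'ignore')
print(ascii_bytes)

# 'replace' substitutes '?' instead, which at least shows that something was lost.
replaced_bytes = title.encode('ascii', 'replace')
print(replaced_bytes)

# UTF-8 can represent the whole title, so the round trip is lossless.
print(title.encode('utf-8').decode('utf-8') == title)
```

On Python 3, str is already Unicode, so encoding is only needed when bytes are required.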

In Python 3, we can call urlopen from urllib.request and BeautifulSoup from the bs4 library to fetch the page title.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

Here we use 'lxml', the most efficient parser.

Using lxml ...

Get it from the page meta tags marked up according to the Facebook Open Graph protocol: