Python regular expression for HTML parsing (BeautifulSoup)

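I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that returns the value of fooId, given that I know the line in the HTML follows the format

<input type="hidden" name="fooId" value="[id is here]" />

Can someone provide an example in Python that parses the HTML for the value?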

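For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust. I am only contributing the BeautifulSoup example, since you already know which regex to use :-)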

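from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 was imported as "from BeautifulSoup import BeautifulSoup"

# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag. The attribute filters go in attrs because
# find()'s first parameter is itself called "name".
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
# The value of the third attribute of the desired tag (attrs is a dict in bs4)
value = list(fooId.attrs.items())[2][1]
# or index it directly via fooId['value']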

I agree with Vinko: BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.

from bs4 import BeautifulSoup  # BeautifulSoup 4
# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()
# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag. The attribute filters go in attrs because
# find()'s first parameter is itself called "name".
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId['value']  # The value attribute

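import re
# inputHTML is assumed to already hold the page source as a string
# Capture the value attribute of the hidden fooId field
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is', value)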

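Parsing is one of those areas where you really don't want to roll your own if you can avoid it, because you will be chasing down edge cases and bugs for years to come.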
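
To make the edge-case concern concrete, here is a minimal sketch using only the standard `re` module. It assumes a pattern tied to the exact layout from the question; the helper name `extract_exact` and the sample strings are hypothetical, chosen only for illustration. The same field stops matching as soon as the attributes are reordered or the tag name changes case.

```python
import re

# Pattern tied to the exact layout shown in the question (illustrative only).
EXACT = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')


def extract_exact(html):
    """Return the fooId value, or None when the markup deviates from the layout."""
    match = EXACT.search(html)
    return match.group(1) if match else None


same_layout = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
reordered = '<input name="fooId" type="hidden" value="12-3456789-1111111111" />'
uppercase = '<INPUT type="hidden" name="fooId" value="12-3456789-1111111111" />'

print(extract_exact(same_layout))  # 12-3456789-1111111111
print(extract_exact(reordered))    # None
print(extract_exact(uppercase))    # None
```

All three inputs are equivalent to a browser, yet only the first one is found by the fixed pattern.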

I would recommend BeautifulSoup. It has a very good reputation, and judging from the docs it looks pretty easy to use.

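Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing understands variations in case, whitespace, and attribute presence, absence, and order, yet this kind of basic tag extraction is simpler to write than with BS.

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give a regex fits; it also shows how to avoid matching a tag that sits inside a comment: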

html ="""<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />

<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print foundTags[0].dump()
print

# print out the value attribute for all matched tags
for inpTag in foundTags:
print inpTag.value

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

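You can see that pyparsing not only matches these unpredictable variations, it also returns the data in an object that makes it easy to read out the individual tag attributes and their values.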
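
For comparison, a similar tolerance to case, attribute order, and commented-out tags can be sketched with Python's standard-library `html.parser` module. This is a different technique from pyparsing and is not part of the answer above; the class name `HiddenInputCollector`, the helper `hidden_values`, and the sample markup are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect the value attribute of hidden <input> tags with a given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and it reports
        # comments through handle_comment, so commented-out tags never get here.
        attributes = dict(attrs)
        if (tag == "input"
                and attributes.get("type") == "hidden"
                and attributes.get("name") == self.field_name):
            self.values.append(attributes.get("value"))


def hidden_values(html, field_name):
    """Return the values of all matching hidden inputs, in document order."""
    collector = HiddenInputCollector(field_name)
    collector.feed(html)
    collector.close()
    return collector.values


sample = """<html><body>
<input type="hidden" name="fooId" value="first" />
<INPUT NAME="fooId" type="hidden" value="second">
<!-- <input type="hidden" name="fooId" value="commented out" /> -->
<input type="text" name="fooId" value="not hidden" />
</body></html>"""

print(hidden_values(sample, "fooId"))  # ['first', 'second']
```

The sketch compares the `type` and `name` values case-sensitively, so it would need adjusting if those values vary in case.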

/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
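
If the regex route is taken anyway, one possible way to keep the pattern readable is to compile it with `re.VERBOSE` and use named groups. The following is a sketch under that assumption; the constant `HIDDEN_INPUT` and the group names `name` and `value` are illustrative and do not come from the answer above.

```python
import re

# The same pattern as above, spread over several lines with re.VERBOSE.
# Whitespace in the pattern is ignored, so \s is spelled out where it matters.
HIDDEN_INPUT = re.compile(r'''
    <input \s+ type="hidden"
    \s+ name="(?P<name>[A-Za-z0-9_]+)"
    \s+ value="(?P<value>[A-Za-z0-9_\-]*)"
    \s* />
''', re.VERBOSE)

s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
match = HIDDEN_INPUT.match(s)
fields = match.groupdict() if match else {}
print(fields.get("value"))  # 12-3456789-1111111111
```

Checking `match` before reading the groups avoids an AttributeError when the markup does not fit the pattern.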