关于.net:在C#中解析html的最佳方法是什么?

关于.net:在C#中解析html的最佳方法是什么?

What is the best way to parse html in C#?

我正在寻找一个库/方法来解析一个html文件,该文件具有比通用xml解析库更多的html特定功能。


Html敏捷包

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse"out of the web" HTML files. The parser is very tolerant with"real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).


您可以使用TidyNet.Tidy将HTML转换为XHTML,然后使用XML解析器。

另一种选择是使用内置引擎mshtml:

1
2
3
4
5
6
using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);

这允许您使用类似javascript的函数,如getElementById()


我发现了一个名为Fizzler的项目,它采用jQuery / Sizzler方法来选择HTML元素。它基于HTML Agility Pack。它目前处于测试阶段,只支持CSS选择器的一个子集,但是使用CSS选择器而不是令人讨厌的XPath非常酷且令人耳目一新。

http://code.google.com/p/fizzler/


你可以做很多事情,而不必在第三方产品和mshtml(即互操作)上疯狂。使用System.Windows.Forms.WebBrowser。从那里,您可以在HtmlDocument上执行"GetElementById"或在HtmlElements上执行"GetElementsByTagName"。如果你想与浏览器实际交互(例如模拟按钮点击),你可以使用一点反射(imo比Interop更小的邪恶)来做到这一点:

1
var wb = new WebBrowser()

...告诉浏览器导航(与此问题相关)。然后在Document_Completed事件上,您可以模拟这样的点击。

1
2
3
4
5
var doc = wb.Browser.Document
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);

你可以做类似的反思,提交表格等。

请享用。


我编写了一些提供"LINQ to HTML"功能的代码。我想我会在这里分享一下。它基于Majestic 12.它采用Majestic-12结果并生成LINQ XML元素。此时,您可以使用针对HTML的所有LINQ to XML工具。举个例子:

1
2
3
4
5
6
7
8
9
        IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

        foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {

            if (anchorTag.Attribute("href") == null)
                continue;

            Console.WriteLine(anchorTag.Attribute("href").Value);
        }

我想使用Majestic-12,因为我知道它有很多关于HTML的内置知识。我发现,将Majestic-12结果映射到LINQ将接受的内容,因为XML需要额外的工作。我所包含的代码进行了大量的清理,但是当你使用它时,你会发现被拒绝的页面。您需要修复代码才能解决这个问题。抛出异常时,请检查exception.Data ["source"],因为它可能设置为导致异常的HTML标记。以一种很好的方式处理HTML有时候并不是微不足道的......

所以现在预期实际上很低,这里是代码:)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Majestic12ToXml {
public class Majestic12ToXml {

    static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {

        HTMLparser parser = OpenParser();
        parser.Init(htmlAsBytes);

        XElement currentNode = new XElement("document");

        HTMLchunk m12chunk = null;

        int xmlnsAttributeIndex = 0;
        string originalHtml ="";

        while ((m12chunk = parser.ParseNext()) != null) {

            try {

                Debug.Assert(!m12chunk.bHashMode);  // popular default for Majestic-12 setting

                XNode newNode = null;
                XElement newNodesParent = null;

                switch (m12chunk.oType) {
                    case HTMLchunkType.OpenTag:

                        // Tags are added as a child to the current tag,
                        // except when the new tag implies the closure of
                        // some number of ancestor tags.

                        newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                        if (newNode != null) {
                            currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                            newNodesParent = currentNode;

                            newNodesParent.Add(newNode);

                            currentNode = newNode as XElement;
                        }

                        break;

                    case HTMLchunkType.CloseTag:

                        if (m12chunk.bEndClosure) {

                            newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                            if (newNode != null) {
                                currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                newNodesParent = currentNode;
                                newNodesParent.Add(newNode);
                            }
                        }
                        else {
                            XElement nodeToClose = currentNode;

                            string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                            while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                nodeToClose = nodeToClose.Parent;

                            if (nodeToClose != null)
                                currentNode = nodeToClose.Parent;

                            Debug.Assert(currentNode != null);
                        }

                        break;

                    case HTMLchunkType.Script:

                        newNode = new XElement("script","REMOVED");
                        newNodesParent = currentNode;
                        newNodesParent.Add(newNode);
                        break;

                    case HTMLchunkType.Comment:

                        newNodesParent = currentNode;

                        if (m12chunk.sTag =="!--")
                            newNode = new XComment(m12chunk.oHTML);
                        else if (m12chunk.sTag =="![CDATA[")
                            newNode = new XCData(m12chunk.oHTML);
                        else
                            throw new Exception("Unrecognized comment sTag");

                        newNodesParent.Add(newNode);

                        break;

                    case HTMLchunkType.Text:

                        currentNode.Add(m12chunk.oHTML);
                        break;

                    default:
                        break;
                }
            }
            catch (Exception e) {
                var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason:" + e.Message, e);

                // the original html is copied for tracing/debugging purposes
                originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                    .Take(m12chunk.iChunkLength)
                    .Select(B => (char)B).ToArray());

                wrappedE.Data.Add("source", originalHtml);

                throw wrappedE;
            }
        }

        while (currentNode.Parent != null)
            currentNode = currentNode.Parent;

        return currentNode.Nodes();
    }

    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

        string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement discoveredParent = null;

        // Get a list of all ancestors
        List<XElement> ancestors = new List<XElement>();
        XElement ancestor = nextPotentialParent;
        while (ancestor != null) {
            ancestors.Add(ancestor);
            ancestor = ancestor.Parent;
        }

        // Check if the new tag implies a previous tag was closed.
        if ("form" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("td" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE =>"tr" != XE.Name)
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("tr" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE => !("table" == XE.Name
                                    ||"thead" == XE.Name
                                    ||"tbody" == XE.Name
                                    ||"tfoot" == XE.Name))
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("thead" == m12chunkCleanedTag
                  ||"tbody" == m12chunkCleanedTag
                  ||"tfoot" == m12chunkCleanedTag) {


            discoveredParent = ancestors
                .TakeWhile(XE =>"table" != XE.Name)
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }

        return discoveredParent ?? nextPotentialParent;
    }

    static string CleanupTagName(string originalName, string originalHtml) {

        string tagName = originalName;

        tagName = tagName.TrimStart(new char[] { '?' });  // for nodes <?xml >

        if (tagName.Contains(':'))
            tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

        return tagName;
    }

    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

        result = null;
        string attributeName = originalName;

        if (string.IsNullOrEmpty(originalName))
            return false;

        if (_startsAsNumeric.IsMatch(originalName))
            return false;

        //
        // transform xmlns attributes so they don't actually create any XML namespaces
        //
        if (attributeName.ToLower().Equals("xmlns")) {

            attributeName ="xmlns_" + xmlnsIndex.ToString(); ;
            xmlnsIndex++;
        }
        else {
            if (attributeName.ToLower().StartsWith("xmlns:")) {
                attributeName ="xmlns_" + attributeName.Substring("xmlns:".Length);
            }  

            //
            // trim trailing "
            //
            attributeName = attributeName.TrimEnd(new char[] { '"' });

            attributeName = attributeName.Replace(":","_");
        }

        result = attributeName;

        return true;
    }

    static Regex _weirdTag = new Regex(@"^<!\[.*\]>$");       // matches"<![if !supportEmptyParas]>"
    static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches"<%@ ... %>"
    static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches"<!-Extra_Images->"

    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

        if (string.IsNullOrEmpty(m12chunk.sTag)) {

            if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                return new XElement("doctype");

            if (_weirdTag.IsMatch(originalHtml))
                return new XElement("REMOVED_weirdBlockParenthesisTag");

            if (_aspnetPrecompiled.IsMatch(originalHtml))
                return new XElement("REMOVED_ASPNET_PrecompiledDirective");

            if (_shortHtmlComment.IsMatch(originalHtml))
                return new XElement("REMOVED_ShortHtmlComment");

            // Nodes like"<br" will end up with a m12chunk.sTag==""...  We discard these nodes.
            return null;
        }

        string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement result = new XElement(tagName);

        List<XAttribute> attributes = new List<XAttribute>();

        for (int i = 0; i < m12chunk.iParams; i++) {

            if (m12chunk.sParams[i] =="<!--") {

                // an HTML comment was embedded within a tag.  This comment and its contents
                // will be interpreted as attributes by Majestic-12... skip this attributes
                for (; i < m12chunk.iParams; i++) {

                    if (m12chunk.sTag =="--" || m12chunk.sTag =="-->")
                        break;
                }

                continue;
            }

            if (m12chunk.sParams[i] =="?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                continue;

            string attributeName = m12chunk.sParams[i];

            if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                continue;

            attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
        }

        // If attributes are duplicated with different values, we complain.
        // If attributes are duplicated with the same value, we remove all but 1.
        var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);

        foreach (var duplicatedAttribute in duplicatedAttributes) {

            if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                throw new Exception("Attribute value was given different values");

            attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
            attributes.Add(duplicatedAttribute.First());
        }

        result.Add(attributes);

        return result;
    }

    static HTMLparser OpenParser() {
        HTMLparser oP = new HTMLparser();

        // The code+comments in this function are from the Majestic-12 sample documentation.

        // ...

        // This is optional, but if you want high performance then you may
        // want to set chunk hash mode to FALSE. This would result in tag params
        // being added to string arrays in HTMLchunk object called sParams and sValues, with number
        // of actual params being in iParams. See code below for details.
        //
        // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
        oP.SetChunkHashMode(false);

        // if you set this to true then original parsed HTML for given chunk will be kept -
        // this will reduce performance somewhat, but may be desireable in some cases where
        // reconstruction of HTML may be necessary
        oP.bKeepRawHTML = false;

        // if set to true (it is false by default), then entities will be decoded: this is essential
        // if you want to get strings that contain final representation of the data in HTML, however
        // you should be aware that if you want to use such strings into output HTML string then you will
        // need to do Entity encoding or same string may fail later
        oP.bDecodeEntities = true;

        // we have option to keep most entities as is - only replace stuff like  
        // this is called Mini Entities mode - it is handy when HTML will need
        // to be re-created after it was parsed, though in this case really
        // entities should not be parsed at all
        oP.bDecodeMiniEntities = true;

        if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
            oP.InitMiniEntities();

        // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
        // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
        // this only works if auto extraction is enabled
        oP.bAutoExtractBetweenTagsOnly = true;

        // if true then comments will be extracted automatically
        oP.bAutoKeepComments = true;

        // if true then scripts will be extracted automatically:
        oP.bAutoKeepScripts = true;

        // if this option is true then whitespace before start of tag will be compressed to single
        // space character in string:"", if false then full whitespace before tag will be returned (slower)
        // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
        // a waste of CPU cycles
        oP.bCompressWhiteSpaceBeforeTag = true;

        // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
        // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
        // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
        // or open
        oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

        return oP;
    }
}
}

之前已经提到过Html Agility Pack - 如果你想提速,你可能还想看看Majestic-12 HTML解析器。它的处理相当笨重,但它提供了非常快速的解析体验。


我认为@ Erlend使用HTMLDocument是最好的方法。但是,我也很幸运使用这个简单的库:

SgmlReader


没有第三方lib,可以在Console和Asp.net上运行的WebBrowser类解决方案

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;

class ParseHTML
{
    public ParseHTML() { }
    private string ReturnString;

    public string doParsing(string html)
    {
        Thread t = new Thread(TParseMain);
        t.ApartmentState = ApartmentState.STA;
        t.Start((object)html);
        t.Join();
        return ReturnString;
    }

    private void TParseMain(object html)
    {
        WebBrowser wbc = new WebBrowser();
        wbc.DocumentText ="feces of a dummy";        //;magic words        
        HtmlDocument doc = wbc.Document.OpenNew(true);
        doc.Write((string)html);
        this.ReturnString = doc.Body.InnerHtml +" do here something";
        return;
    }
}

用法:

1
2
3
4
string myhtml ="<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);

我过去使用过ZetaHtmlTidy来加载随机网站,然后用xpath点击内容的各个部分(例如/ html / body // p [@ class ='textblock'])。它运作良好,但有一些特殊的网站,它有问题,所以我不知道这是否是绝对最好的解决方案。


解析HTML的麻烦在于它不是一门精确的科学。如果你正在解析它是XHTML,那么事情会容易得多(正如你所提到的,你可以使用一般的XML解析器)。因为HTML不一定是格式良好的XML,所以在尝试解析它时会遇到很多问题。它几乎需要在逐个站点的基础上完成。


我写了一些用于在C#中解析HTML标记的类。 如果满足您的特殊需求,它们既简单又简单。

您可以阅读有关它们的文章并在http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c下载源代码。

在http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class上还有一篇关于通用解析助手类的文章。


试试这个脚本。

http://www.biterscripting.com/SS_URLs.html

当我用这个网址时,

1
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")

它显示了该主题页面上的所有链接。

1
2
3
4
5
6
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.webp
.
.
.

您可以修改该脚本以检查图像,变量等。


根据您的需要,您可以选择功能更丰富的库。我尝试了大多数/所有解决方案,但是Html Agility Pack的主要内容和优势是什么。它是一个非常宽容和灵活的解析器。


如果您需要在页面上看到JS的影响[并且您准备启动浏览器],请使用WatiN


您可以使用HTML DTD和通用XML解析库。


推荐阅读