Extract a Table from an HTML File with PowerShell or VBS

Developer https://www.devze.com 2023-01-14 17:27 Source: Internet
I have a two-part problem that needs fixing. I'll do my best to describe it, then break down what I "think" the steps are.

I am trying to get a specific table in a webpage and email it to myself.

At the moment I am using GnuWin32 wget.exe. I'd rather use PowerShell natively, but for some reason I couldn't; perhaps the method I was using couldn't render the ASPX page. Using wget I was able to save a local HTML copy of the ASPX page.

Now I have been attempting to parse the file and extract a specific table. In this particular case the table begins with <table border="0" cellpadding="2" cellspacing="2" width="300px"> and ends with </table> and there are no nested tables.

I've thrown some regex at my problem (yes I know regex may not be the tool I need here) but to no avail.

--- Amended: here is where I am now...

$content = (new-object System.Net.WebClient).DownloadString($url)
$found = $content -cmatch '(?si)<table border="0" cellpadding="2" cellspacing="2" width="300px"[^>]*>(.*?)Total Queries</td>(.*?)</tr>(.*?)</table>'
$result = $matches[3]
$result


I've done this sort of thing with PowerShell. It is pretty straightforward:

PS> $url = "http://www.windowsitpro.com/news/PaulThurrottsWinInfoNews.aspx"
PS> $content = (new-object System.Net.WebClient).DownloadString($url)
PS> $content -match '(?s)<table[^>]+border\s*=\s*"0"\s*.*?>(.*?)</table>'
True
PS> $matches[1]

        <tr>
          <snip>
        </tr>

Just substitute width for border and 300px for 0 in your regex, e.g.:

PS> $content -match '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'
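
A minimal sketch of that match against an inline sample, so it can be tried without downloading anything. The $content here-string is made up for illustration; only the pattern comes from the answer above.

```powershell
# Hypothetical sample page: two tables, only the second has width="300px".
$content = @'
<html><body>
<table border="1" width="100px"><tr><td>skip me</td></tr></table>
<table border="0" cellpadding="2" cellspacing="2" width="300px">
<tr><td>Total Queries</td><td>42</td></tr>
</table>
</body></html>
'@

$pattern = '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'

# -match returns $true/$false and fills the automatic $matches variable.
if ($content -match $pattern) {
    $tableBody = $matches[1].Trim()   # group 1 = everything between the tags
} else {
    $tableBody = $null                # no matching table found
}
$tableBody
```

Because [^>]+ cannot cross a closing angle bracket, the width attribute has to sit inside the same opening tag, which is why the first table is skipped.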

In the case of matching multiple tables, you have to switch from -match, which is a Boolean operator that only looks for a single match, to Select-String, which can find all matches, e.g.:

PS> $pattern = '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'
PS> $content | Select-String -AllMatches $pattern |
                Foreach {$_.Matches | Foreach {$_.Groups[1].Value}}

Essentially, all matches will be in the $_.Matches collection. If you know that the table is always the third one, you can access it like so:

... | Foreach {$_.Matches[2].Groups[1].Value}
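
As a rough, self-contained illustration of the Select-String approach, the sketch below runs the same pattern over a made-up sample containing three matching tables. The sample markup and the $tables variable name are assumptions for the example.

```powershell
# Hypothetical sample with three tables that share the width="300px" attribute.
$content = @'
<table width="300px"><tr><td>first</td></tr></table>
<table width="300px"><tr><td>second</td></tr></table>
<table width="300px"><tr><td>third</td></tr></table>
'@

$pattern = '(?s)<table[^>]+width\s*=\s*"300px"\s*.*?>(.*?)</table>'

# Piping one string yields one MatchInfo object; with -AllMatches its
# Matches collection holds every hit, and Groups[1] is the table body.
$tables = @($content | Select-String -AllMatches $pattern |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Groups[1].Value })

$tables.Count    # number of tables found
$tables[2]       # body of the third table (index is zero-based)
```

Wrapping the pipeline in @( ) keeps $tables an array even when only one table matches, so indexing behaves the same either way.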


A while ago I wrote a function called Get-MarkupTag. This gets you away from having to use regular expressions directly (it does so under the covers). It also attempts to turn HTML into XML, at which point getting out the data is pretty simple.

To do this with Get-MarkupTag, you'd do something like:

$webClient = New-Object Net.Webclient -Property @{UseDefaultCredentials=$true}
$html = $webClient.DownloadString($url)
$table = Get-MarkupTag -html $html -tag "table" |
    Where-Object { $_.Tag -like '<table border="0" cellpadding="2" cellspacing="2" width="300px">*' } |
    Select-Object -expandProperty Xml
$table.tr |  # Row
    Foreach-Object {
        $_.Td # Column
    }
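
Get-MarkupTag itself is not reproduced here. As a rough illustration of the "turn it into XML" idea, the sketch below casts an already-extracted table to [xml] directly. It assumes the fragment is well-formed XML (closed tags, quoted attributes, no bare & or &nbsp;), which real-world HTML often is not; the sample markup is made up.

```powershell
# Hypothetical, well-formed table fragment (e.g. $matches[0] from a regex match).
$tableHtml = @'
<table border="0" cellpadding="2" cellspacing="2" width="300px">
  <tr><td>Total Queries</td><td>42</td></tr>
  <tr><td>Failed Queries</td><td>3</td></tr>
</table>
'@

# The cast throws if the fragment is not well-formed XML.
$tableXml = [xml]$tableHtml

# Walk rows, then cells; text-only cells surface as plain strings.
$rows = @($tableXml.table.tr | ForEach-Object {
    ($_.td | ForEach-Object { $_.Trim() }) -join ': '
})
$rows
```

If the cast fails on a real page, that is usually a sign the markup needs cleaning first, which is the part a helper like Get-MarkupTag attempts to handle.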

Hope this helps


I'd tackle it this way using VBScript.

  • Replace all double quotes with single quotes, just for ease of reading and writing the code, i.e. myHTMLString = Replace(myHTMLString, """", "'")

  • Determine whether the file contains your table. It sounds like it doesn't have an id or name attribute. Too bad, but failing that, use InStr to find the starting position of the table: Dim tableStartsAt : tableStartsAt = InStr(myHTMLString, "<table border='0'") Be careful with the attributes here, as you're at the mercy of the table having its attributes moved around without you noticing! InStr returns 0 when nothing matches; in that case, perhaps email that status to yourself as a warning that some maintenance is needed.

  • Now that you have the start position of your table, find its end tag, i.e. Dim tableEndsAt : tableEndsAt = InStr(tableStartsAt, myHTMLString, "</table>")

  • Get the HTML string, including the closing tag: Dim myTable : myTable = Mid(myHTMLString, tableStartsAt, tableEndsAt - tableStartsAt + Len("</table>"))

  • Put that into an email and send it using VBScript. Ensure you have Mail.IsHTML = True. Here's another VBScript sending email question.
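
The same find-start, find-end, substring steps can be sketched in PowerShell rather than VBScript, to match the other examples in this thread. Get-TableHtml is a hypothetical helper name, and the sample HTML, addresses, and SMTP server are placeholders.

```powershell
# Sketch of the index-based extraction. Get-TableHtml is a hypothetical
# helper, not an existing cmdlet.
function Get-TableHtml {
    param(
        [string]$Html,
        [string]$StartMarker = '<table border="0"'
    )
    $endTag = '</table>'
    $ignoreCase = [System.StringComparison]::OrdinalIgnoreCase

    $start = $Html.IndexOf($StartMarker, $ignoreCase)
    if ($start -lt 0) { return $null }          # table not found

    $end = $Html.IndexOf($endTag, $start, $ignoreCase)
    if ($end -lt 0) { return $null }            # no closing tag after the start

    # Include the closing tag in the returned fragment.
    return $Html.Substring($start, $end - $start + $endTag.Length)
}

$sample = '<p>intro</p><table border="0" width="300px"><tr><td>42</td></tr></table><p>end</p>'
$table = Get-TableHtml -Html $sample
$table

# Mailing the result (placeholder addresses and server, left commented out):
# Send-MailMessage -To 'me@example.com' -From 'report@example.com' `
#     -Subject 'Query table' -Body $table -BodyAsHtml -SmtpServer 'smtp.example.com'
```

A $null result is the "table not found" case from the second step, which is the point at which a warning email would be sent instead.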


I thought the HuddledMasses Get-Web cmdlets had an option to read in tables as XML.
