Extracting URLs from an Emacs buffer?

How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?

Input:

<html>
 <a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
 <h1>Emacs Lisp</h1>
 <a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>

Output:

http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News

I've seen the re-search-forward function mentioned several times during my search. Here's what I think I need to do based on what I've read so far.

(defun extra-urls (file)
 ...
 (setq buffer (...
 (while
        (re-search-forward "http://" nil t)
        (when (match-string 0)
...
))
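For reference, a minimal, runnable completion of that sketch might look like the following. The function name collect-hrefs and the regexp are my own illustration, not from the original post:

;; A minimal completion of the sketch above: collect every href value
;; in the current buffer into a list.  `collect-hrefs' is a
;; hypothetical name, not from the original post.
(defun collect-hrefs ()
  (let ((urls '()))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "href=\"\\([^\"]+\\)\"" nil t)
        (push (match-string 1) urls)))
    (nreverse urls)))  ; `push' builds the list backwards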


I took Heinzi's solution and came up with the final solution I needed. I can now take a list of files, extract all URLs and titles, and place the results in one output buffer.

(defun extract-urls (fname)
  "Extract href URLs and titles from FNAME.
Write them to buffer `new-urls.csv' in | separated format."
  (let ((in-buf (find-file-noselect fname)) ; open the file without displaying it
        (u1 '()))
    (with-current-buffer in-buf
      (goto-char (point-min))             ; in case the buffer was already open
      (while (re-search-forward "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (concat (match-string 1) "|" (match-string 2) "\n") u1))) ; url|title
    (kill-buffer in-buf)                  ; don't leave a mess of buffers
    (with-current-buffer (get-buffer-create "new-urls.csv") ; send results to a new buffer
      (mapc #'insert (nreverse u1)))      ; nreverse: `push' built the list backwards
    (switch-to-buffer "new-urls.csv")))   ; finally, show the new buffer

;; Process a list of files
;;
(mapc #'extract-urls
      '("/tmp/foo.html"
        "/tmp/bar.html"))


If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:

(defun getlinks ()
  (goto-char (point-min))
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  (goto-char (point-min))
  (replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
  (goto-char (point-min))
  (replace-regexp "\n+" "\n")
  (goto-char (point-min))
  (replace-regexp "^LINK:\\(.*\\)$" "\\1"))

It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".

Detailed HOWTO: (1) copy the above function into the Emacs *scratch* buffer, (2) hit C-x C-e after the final ")" to load the function, (3) open your example HTML file, (4) execute the function with M-: (getlinks).

Note that the third replace-regexp collapses the runs of newlines left behind by the second step into single newlines, which is what removes the blank lines.
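If you would rather avoid calling replace-regexp from Lisp (the byte-compiler flags it as meant for interactive use), the same first step can be written with re-search-forward and replace-match. A sketch of the equivalent loop:

;; Equivalent to the first replace-regexp above, written with the
;; non-interactive primitives instead.
(goto-char (point-min))
(while (re-search-forward
        "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" nil t)
  (replace-match "LINK:\\1|\\2"))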


You can use the `xml' library to parse the HTML. To parse your particular file, the following does what you want:

(require 'xml)

(defun my-grab-html (file)
  (interactive "fHtml file: ")
  (let ((res (car (xml-parse-file file)))) ; car because xml-parse-file returns a list of nodes
    (mapc (lambda (n)
            (when (consp n) ; skip whitespace; the parser preserves it as strings
              (let ((link (cdr (assq 'href (xml-node-attributes n)))))
                (when link
                  (insert link)
                  (insert "|")
                  (insert (car (xml-node-children n))) ; grab the text for the link
                  (insert "\n")))))
          (xml-node-children res))))

This does not recursively parse the HTML to find all the links, but it should get you started in the direction of the general solution.
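A sketch of that general solution: recurse through the parse tree and emit url|text for every <a> node at any depth. The function name my-grab-html-recursive is my own; only the xml-node-* accessors come from xml.el:

(require 'xml)

;; Walk the whole parse tree and insert url|text for every <a> node,
;; however deeply nested.  `my-grab-html-recursive' is a hypothetical
;; name, not from the answer above.
(defun my-grab-html-recursive (node)
  (when (consp node)                    ; skip whitespace text nodes
    (when (eq (xml-node-name node) 'a)
      (let ((link (cdr (assq 'href (xml-node-attributes node))))
            (text (car (xml-node-children node))))
        (when (and link (stringp text))
          (insert link "|" text "\n"))))
    (mapc #'my-grab-html-recursive (xml-node-children node))))

;; Usage: (my-grab-html-recursive (car (xml-parse-file "/tmp/foo.html")))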
