I'm trying to extract information from an HTML file (bookmarks exported from Google Chrome).
It contains text in the format shown below,
and I would like to extract just the website addresses that come after <DT><A HREF=
and before ADD_DATE=
I'm considering using sed, awk, or Python, so answers using any of the three are welcome.
So far I only know how to print the lines containing <DT><A HREF=
with awk:
awk '/<DT><A HREF="*"/' favorit.html
I suppose I should combine this with sed.
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3 ADD_DATE="0" LAST_MODIFIED="1309451494" PERSONAL_TOOLBAR_FOLDER="true">Barre de favoris</H3>
<DL><p>
<DT><H3 ADD_DATE="1281455379" LAST_MODIFIED="1309422816">brain</H3>
<DL><p>
<DT><A HREF="http://gmazars.info/conf/index.html" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABVklEQVQ4jcWSoVcjQQyHP3i4VM/ojiZ+sRTHva7u2nqulit2/4Haes5XM7WsH/SuntUz+k5wb6AtjxMIoiYvk1+SLzn7cXv7hy/Y+VeSPxRwzqGqiEjxjbUlLiKoavEv3gfWDw9YY0g5Y63l1/0917MZqsrPuzsAFk1z4BeBeV0DsFwuy8epc+y9p65rRIScM6rK3vvTDlSVEEIJ/H58LO8xRqqrK0IIOOdo2/ZU4NiMteSUyDnz3HVUVcVEhL7vGWM8hei952Y2w1j7ymO9ZtE0AOy9R1W5PGr/oAP/9IQxhs1mg4jQdV0Zo+97ckpUVcV2uz0QOPv2QyojrFYrAHa7HSJSiMd/wGQyKUnWmLKxgy3EcSTlXODN65oYIxMRps4x9D1d17FoGtq2ZYzxTSC8vLyCtPatqgjDMJBTYhgGUs6EELh8dy+fQjw+ro/sU4j/Swb4C6whlU0nCEWKAAAAAElFTkSuQmCC">Computer Vision Resource</A>
<DT><A HREF="http://research.google.com/" ADD_DATE="1281455379">Google Research</A>
<DT><A HREF="http://research.microsoft.com/en-us/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABl0lEQVQ4jaWSTWuTQRRGz52575uYhaYF0VZbk0ChyaJWCvpv3Opv0607QfzYuBKtgktTsjAUTFuSkmTembku0uzaEtoDsxlmnjk8d2QWs3ELdB4zN00QQEPO3NTBCWiVbhkQopGXO3aRJAIYZgvNqwMEnVUzYjKcczjnEBGqEBYHvOPSgkTADO8EPRmN+PLxA831dbzzpJSYjMc86XTY6faIMSIimNnCRoSFHzgztKiVPG61ePf2Dd29p+wdHPCoKCkKZTT6hxMhhEC9XieEQIwVXz9/4tnzF2xsbKLJMjFGBkd9ZvMZv75/Q7XgbrPJ8fAvm1vbjM9OqULAeWW73WLQ7/P75yEvX73GhSqQyfT293HeE1Pk/sMHDI7+YJapqsDxcEi2zE6vy3Q6ZavdprO7y+R8grz/cWjzqsKJEGMkzOcAlLUaKUUAUkyUtZKiLDkdnXCn0UDVU5Ylmi0vRgZ4VRpFwbJ6LXRZO4aRUuLe2hpm+WLqhpoZy7UKSysAcw630q1rcKu+fBlmhpaqiFz3Ya+m8J7/gTTD19UuBM8AAAAASUVORK5CYII=">Microsoft Research - Turning Ideas into Reality</A>
<DT><A HREF="http://techresearch.intel.com/articles/index.html" ADD_DATE="1281455379">Intel Labs</A>
<DT><A HREF="http://www.ibm.com/developerworks/" ADD_DATE="1309092502" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAWUlEQVQ4jWNkSJ3xn4ECwESJ5uFkwP9Z6XCMjQ8TQwYwPhMDAwMDY9pMBmQaxmZMm4nVEGQxFlxOgylCNhSnF7ABbC6A8ZENhYcButNgYUDIBYyjKXEQGAAAPiUyGXrLJGMAAAAASUVORK5CYII=">IBM developerWorks : IBM's resource for developers and IT professionals</A>
<DT><A HREF="http://www.siam.org/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAWklEQVQ4jd2RwQrAIAxDU7+8fvnbYR6s2sMUxlguoYGGVyrJkWDfj5YdfY8AeJmATj3BKo9ZKwmDnGzO/BHBXADFTDKruhV9zkfV1XW5RgIAa7/cVjlZ/knBBdfjeU7uim0TAAAAAElFTkSuQmCC">SIAM: Society for Industrial and Applied Mathematics</A>
<DT><A HREF="http://www.wolfram.com/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC7UlEQVQ4jYWTX0hTcRzFz3Y37931zv2x5dzUiWZz2TQFkUVYVEpiVFCziB4CoyAYQUEGQmi9S0QREb3ZQ2aEoVQP1bJSg7SmpqKmU6bOOV3debfddb23p8XUovP2+f3O98eB7/kR+I88N1sOGfc6ptxut/S3e3kyPL7SUOnZV7szwQONjZbC9s6uw/OgEmdtTpehraVFlWBF8gM+glDaFkLvArsPNLMG6rP2ScctbnxEjtiqq83hfMCHZa6MZcG4yLKXEjOyjZHuq8uk89wPSA5ZNKgSVYsCoOyJYUxfOMP4CYulsvBoQfed539NAABzKUK4h9eqrVazSqjIQOoKh4mVUUwPRiwnwUCg+YVkvwwAWk9cvGp6OnidyhFSQ3QckVEa+jwGYq4Sy1wEA0OrKNBTOCdGgFg8+K3Y4qBEmleAZ+QAMFdjuWcu1wUrFCyKi7Kxq0SJ0h06pOtSQGrWYDDGkG9TAFV6CCuLW4rcfROGn3Oz0QLTMQUANNQ3hKvPnq6SB7XjSsUatDY1mD0lILcYIPnGIIo9EM06hG0p4DI5iGtUgDdqqm0P73r+rFEd4JvHPUuY6f+OeKYeqwYaZLoGuepM5DBpYPu8WJrmQaQo4Cste533qtMDAMRtl4u8orS+CLwfONoXWgIoEiRDIaoiEedXsdU7j1/Dk/AOs/CNxqEJk0gLsvaay/UfW7s/TBG128v3+yd9W32IunO1wvfsckuJCBmEeT/IsB/E9DQENg5Br8TEzzC+cDKEBAqpoR/HD5+u82/qwYi9aFm1w6yfGZsFHViA0Z6H7OJ8oN+L/qFfeLst69rX8eC9PC1jNWenra/BXaeT6YBVemOxS48MJvYZ0iXOapciNrMUAi0tKR3S26YmKnlm3V+Q4prKwMHSC1MmUzpLph2hqUIh7KPRKdeSOkRkfiK2rGNMzo2p/6mX5acGXtaeqU1wW5ZD1XWjxZ7s2VTlZHGVxe38vPdTgut8vVFc7x1K9vwGs2EnwS3T7R0AAAAASUVORK5CYII=">Wolfram Research: Mathematica, Technical and Scientific Software</A>
<DT><A HREF="http://www.mathworks.com/" ADD_DATE="1281455379">The MathWorks - MATLAB and Simulink for Technical Computing</A>
<DT><A HREF="http://www.youtube.com/user/GoogleTechTalks" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABS0lEQVQ4jZ2SP0hCURTGf+clQdAQ2tKsUHNbQ9SgCI5BU/sbpSnaWrPRxamlyUJsaKlcWqIaoigKk+ckNKVOBYF6Gu57r/fyT9a3nO+ce893zr3noKr6VM7qTmZaVVWTdkkPLhs6LvAJaPvQ2L/Av/18nFXAr36eTyugdu7GLxC0fQLvr0cKaLXT0w+XewnVTm+ogMWY2N+M8dLVvnjEIyJiiCpTc+uc5dOICHbumvkJobuywULEImmXQgLitvVvRKjVILPk6QFj6ilweoVoPKo4ze+DRCzYIDhvP2JhmE9MzA5I9sqMhhW6uLgMTtO4TtP4uLxQNLxQhNs6lC+AwBRC8CbiIREzIlu7kEq5wcYIgd9a958qAwQqFUKfendv/JMT2NsGS4z/8Ahrq2YK7c+W3/mwrQjsGWKKMzMZNVzj0UDasF3oj0u9JV8pACCxycNu1gAAAABJRU5ErkJggg==">YouTube - Chaîne de GoogleTechTalks</A>
<DT><A HREF="http://groups.csail.mit.edu/vision/welcome/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAB0klEQVQ4jY2T30tTYRjHP+mcp8mWzBFKJVk4CyVZI+wH1Z1FQtFdV3Uxr4bX/gkS4eXoquxGvOgmuhg0jWRRwhgrBks7Z81AZzuy49Kj+6Fm62LsXafZYd+r9+F9v5/neb+875FyuVzGRJlPIZR3MwDslQoADNwZpeviLQCazMwAywuv0bNpAKySjZ8/kuiZpNi3mJl3dY3cqswV3wQu9xAA4clHFAuFwwFyMIC6FBV1tbO982zNILUZmgjArq4RDz6jxzuM88wFceDYiV5aHS5R2xzt/D74VQ/IfYtx1N7BJd8Ts1vVSYSoJsI4T/U1ZDrIb9YD1lMJOvuvNgTYK+6ItQVAUyLsl/Kc9N5uCFDYzKIpkRpgay1Ji9RmCKsqTYmQjofJylFK2zmK2xsAfH45iVWyVQDp2BzHz182GOVgAOX9KwAku5Nuz00cXb2oiTAAnocTlQlS89OsL8fpv+sHIDU/zeKbFwC4b9zn9PUHhsn0TBJdXalloH5ZoMc7jMs9RPT5ON9jswyOjNI3MvbfDAwhXht7CsDHgJ/cqsy9x28PzeLfEAUAKj9ubbEyyVJoyvDS/lZTswU18YHmllYjoKqNla+mnQH2S3k6us+J+g88sKj7zLO6HwAAAABJRU5ErkJggg==">MIT CSAIL Computer Vision Research Group</A>
<DT><A HREF="http://www.youtube.com/watch?v=9K8X__I2O2A&feature=related" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABS0lEQVQ4jZ2SP0hCURTGf+clQdAQ2tKsUHNbQ9SgCI5BU/sbpSnaWrPRxamlyUJsaKlcWqIaoigKk+ckNKVOBYF6Gu57r/fyT9a3nO+ce893zr3noKr6VM7qTmZaVVWTdkkPLhs6LvAJaPvQ2L/Av/18nFXAr36eTyugdu7GLxC0fQLvr0cKaLXT0w+XewnVTm+ogMWY2N+M8dLVvnjEIyJiiCpTc+uc5dOICHbumvkJobuywULEImmXQgLitvVvRKjVILPk6QFj6ilweoVoPKo4ze+DRCzYIDhvP2JhmE9MzA5I9sqMhhW6uLgMTtO4TtP4uLxQNLxQhNs6lC+AwBRC8CbiIREzIlu7kEq5wcYIgd9a958qAwQqFUKfendv/JMT2NsGS4z/8Ahrq2YK7c+W3/mwrQjsGWKKMzMZNVzj0UDasF3oj0u9JV8pACCxycNu1gAAAABJRU5ErkJggg==">YouTube - Hello World through custom UART to HyperTerminal</A>
A very quick solution would be to use regular expressions in Python.
Assuming that the variable s contains your HTML string:
import re
s = ''' <DT><A HREF="http://gmazars.info/conf/index.html"
<DT><A HREF="http://research.google.com/"
<DT><A HREF="http://research.microsoft.com/en-us/"
<DT><A HREF="http://techresearch.intel.com/articles/index.html"
'''
# findall returns a list of the captured groups, i.e. the URLs between the quotes
print(re.findall(r'HREF="(.*?)"', s))
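If a regex feels too fragile for HTML, one alternative is the standard library's html.parser. The following is only a sketch under assumptions: the names BookmarkLinkParser and extract_links are made up for illustration, and the sample string stands in for the contents of the exported file.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collect the HREF value of every <A> tag (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names,
        # so <A HREF="..."> arrives here as tag 'a' with attribute 'href'.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Small stand-in for the exported bookmarks file.
sample = '''<DT><H3 ADD_DATE="1281455379">brain</H3>
<DT><A HREF="http://research.google.com/" ADD_DATE="1281455379">Google Research</A>
<DT><A HREF="http://www.siam.org/" ADD_DATE="1281455379">SIAM</A>
'''

for url in extract_links(sample):
    print(url)
```

To run it on the real export, the text of favorit.html could be read with open(...).read() and passed to extract_links in place of sample.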
A rather hacky way to do it would be to read the file into a string and pull out the substring you want with a regex capture group. I'm not familiar with any of those languages, but I'm sure they can do that fairly easily.
Something like:
(DT><A HREF=){1}(.*)(ADD_DATE=){1}
and \2 (the second capture group) would grab the information you wanted, still wrapped in its double quotes. Or something similar; it's been a while since I used it.
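As a rough Python sketch of this "capture what sits between the two markers" idea: the pattern below is a tightened variant of the one above, not the same regex. It assumes the URL is always double-quoted and that ADD_DATE= directly follows it, as in the sample file. The names LINK_RE and links_between_markers are hypothetical.

```python
import re

# The single capture group holds the text between HREF=" and the closing quote
# that comes right before ADD_DATE=, so the quotes themselves are not captured.
LINK_RE = re.compile(r'<DT><A HREF="([^"]*)"\s+ADD_DATE=')


def links_between_markers(text):
    """Return every URL found between the HREF= and ADD_DATE= markers."""
    # With exactly one group in the pattern, findall returns the group contents.
    return LINK_RE.findall(text)


sample = '''<DT><H3 ADD_DATE="1281455379" LAST_MODIFIED="1309422816">brain</H3>
<DT><A HREF="http://research.microsoft.com/en-us/" ADD_DATE="1281455379">Microsoft Research</A>
<DT><A HREF="http://www.wolfram.com/" ADD_DATE="1281455379">Wolfram Research</A>
'''

print(links_between_markers(sample))
```

Using [^"]* instead of .* keeps the match from running past the closing quote, which is why the quotes can be left outside the group.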
awk -F"\"" '/<DT><A HREF=/{print $2}' file
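For comparison, the same field-splitting trick can be written in Python. This is a sketch only, with a made-up function name (second_quote_field): it keeps lines containing <DT><A HREF= and splits them on the double quote, so awk's $2 corresponds to index 1 of the resulting list.

```python
def second_quote_field(lines):
    """Mimic the awk one-liner: for matching lines, yield the second
    double-quote-separated field (awk's $2, Python index 1)."""
    for line in lines:
        if "<DT><A HREF=" in line:
            fields = line.split('"')
            # Guard against a matching line that has no quote at all.
            if len(fields) > 1:
                yield fields[1]


sample_lines = [
    '<DT><H3 ADD_DATE="1281455379" LAST_MODIFIED="1309422816">brain</H3>',
    '<DT><A HREF="http://www.mathworks.com/" ADD_DATE="1281455379">The MathWorks</A>',
    '<DT><A HREF="http://www.ibm.com/developerworks/" ADD_DATE="1309092502">IBM developerWorks</A>',
]

for url in second_quote_field(sample_lines):
    print(url)
```

Since an open file object iterates line by line, the opened favorit.html could be passed in directly instead of sample_lines.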