开发者

How to print certain lines and certain parts of line in a text file

开发者 https://www.devze.com 2023-03-16 15:25 出处:网络
I\'m trying to extract information from an HTML file(Google Chrome Bookmarks exported) It contains text of the following format

I'm trying to extract information from an HTML file(Google Chrome Bookmarks exported)

It contains text of the following format and I would like to extract just the website adresses after <DT><A HREF= and before ADD_DATE=

I'm considering using SED and AWK or Python So any answers from the three languages are welcome

So far I just know how to print lines containing <DT><A HREF= with awk

awk '/<DT><A HREF="*"/' favorit.html

i suppose i should combine this with sed

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="0" LAST_MODIFIED="1309451494" PERSONAL_TOOLBAR_FOLDER="true">Barre de favoris</H3>
    <DL><p>
        <DT><H3 ADD_DATE="1281455379" LAST_MODIFIED="1309422816">brain</H3>
        <DL><p>
            <DT><A HREF="http://gmazars.info/conf/index.html" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABVklEQVQ4jcWSoVcjQQyHP3i4VM/ojiZ+sRTHva7u2nqulit2/4Haes5XM7WsH/SuntUz+k5wb6AtjxMIoiYvk1+SLzn7cXv7hy/Y+VeSPxRwzqGqiEjxjbUlLiKoavEv3gfWDw9YY0g5Y63l1/0917MZqsrPuzsAFk1z4BeBeV0DsFwuy8epc+y9p65rRIScM6rK3vvTDlSVEEIJ/H58LO8xRqqrK0IIOOdo2/ZU4NiMteSUyDnz3HVUVcVEhL7vGWM8hei952Y2w1j7ymO9ZtE0AOy9R1W5PGr/oAP/9IQxhs1mg4jQdV0Zo+97ckpUVcV2uz0QOPv2QyojrFYrAHa7HSJSiMd/wGQyKUnWmLKxgy3EcSTlXODN65oYIxMRps4x9D1d17FoGtq2ZYzxTSC8vLyCtPatqgjDMJBTYhgGUs6EELh8dy+fQjw+ro/sU4j/Swb4C6whlU0nCEWKAAAAAEl开发者_开发知识库FTkSuQmCC">Computer Vision Resource</A>
            <DT><A HREF="http://research.google.com/" ADD_DATE="1281455379">Google Research</A>
            <DT><A HREF="http://research.microsoft.com/en-us/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABl0lEQVQ4jaWSTWuTQRRGz52575uYhaYF0VZbk0ChyaJWCvpv3Opv0607QfzYuBKtgktTsjAUTFuSkmTembku0uzaEtoDsxlmnjk8d2QWs3ELdB4zN00QQEPO3NTBCWiVbhkQopGXO3aRJAIYZgvNqwMEnVUzYjKcczjnEBGqEBYHvOPSgkTADO8EPRmN+PLxA831dbzzpJSYjMc86XTY6faIMSIimNnCRoSFHzgztKiVPG61ePf2Dd29p+wdHPCoKCkKZTT6hxMhhEC9XieEQIwVXz9/4tnzF2xsbKLJMjFGBkd9ZvMZv75/Q7XgbrPJ8fAvm1vbjM9OqULAeWW73WLQ7/P75yEvX73GhSqQyfT293HeE1Pk/sMHDI7+YJapqsDxcEi2zE6vy3Q6ZavdprO7y+R8grz/cWjzqsKJEGMkzOcAlLUaKUUAUkyUtZKiLDkdnXCn0UDVU5Ylmi0vRgZ4VRpFwbJ6LXRZO4aRUuLe2hpm+WLqhpoZy7UKSysAcw630q1rcKu+fBlmhpaqiFz3Ya+m8J7/gTTD19UuBM8AAAAASUVORK5CYII=">Microsoft Research - Turning Ideas into Reality</A>
            <DT><A HREF="http://techresearch.intel.com/articles/index.html" ADD_DATE="1281455379">Intel Labs</A>
            <DT><A HREF="http://www.ibm.com/developerworks/" ADD_DATE="1309092502" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAWUlEQVQ4jWNkSJ3xn4ECwESJ5uFkwP9Z6XCMjQ8TQwYwPhMDAwMDY9pMBmQaxmZMm4nVEGQxFlxOgylCNhSnF7ABbC6A8ZENhYcButNgYUDIBYyjKXEQGAAAPiUyGXrLJGMAAAAASUVORK5CYII=">IBM developerWorks : IBM&#39;s resource for developers and IT professionals</A>
            <DT><A HREF="http://www.siam.org/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAWklEQVQ4jd2RwQrAIAxDU7+8fvnbYR6s2sMUxlguoYGGVyrJkWDfj5YdfY8AeJmATj3BKo9ZKwmDnGzO/BHBXADFTDKruhV9zkfV1XW5RgIAa7/cVjlZ/knBBdfjeU7uim0TAAAAAElFTkSuQmCC">SIAM: Society for Industrial and Applied Mathematics</A>
            <DT><A HREF="http://www.wolfram.com/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC7UlEQVQ4jYWTX0hTcRzFz3Y37931zv2x5dzUiWZz2TQFkUVYVEpiVFCziB4CoyAYQUEGQmi9S0QREb3ZQ2aEoVQP1bJSg7SmpqKmU6bOOV3debfddb23p8XUovP2+f3O98eB7/kR+I88N1sOGfc6ptxut/S3e3kyPL7SUOnZV7szwQONjZbC9s6uw/OgEmdtTpehraVFlWBF8gM+glDaFkLvArsPNLMG6rP2ScctbnxEjtiqq83hfMCHZa6MZcG4yLKXEjOyjZHuq8uk89wPSA5ZNKgSVYsCoOyJYUxfOMP4CYulsvBoQfed539NAABzKUK4h9eqrVazSqjIQOoKh4mVUUwPRiwnwUCg+YVkvwwAWk9cvGp6OnidyhFSQ3QckVEa+jwGYq4Sy1wEA0OrKNBTOCdGgFg8+K3Y4qBEmleAZ+QAMFdjuWcu1wUrFCyKi7Kxq0SJ0h06pOtSQGrWYDDGkG9TAFV6CCuLW4rcfROGn3Oz0QLTMQUANNQ3hKvPnq6SB7XjSsUatDY1mD0lILcYIPnGIIo9EM06hG0p4DI5iGtUgDdqqm0P73r+rFEd4JvHPUuY6f+OeKYeqwYaZLoGuepM5DBpYPu8WJrmQaQo4Cste533qtMDAMRtl4u8orS+CLwfONoXWgIoEiRDIaoiEedXsdU7j1/Dk/AOs/CNxqEJk0gLsvaay/UfW7s/TBG128v3+yd9W32IunO1wvfsckuJCBmEeT/IsB/E9DQENg5Br8TEzzC+cDKEBAqpoR/HD5+u82/qwYi9aFm1w6yfGZsFHViA0Z6H7OJ8oN+L/qFfeLst69rX8eC9PC1jNWenra/BXaeT6YBVemOxS48MJvYZ0iXOapciNrMUAi0tKR3S26YmKnlm3V+Q4prKwMHSC1MmUzpLph2hqUIh7KPRKdeSOkRkfiK2rGNMzo2p/6mX5acGXtaeqU1wW5ZD1XWjxZ7s2VTlZHGVxe38vPdTgut8vVFc7x1K9vwGs2EnwS3T7R0AAAAASUVORK5CYII=">Wolfram Research: Mathematica, Technical and Scientific Software</A>
            <DT><A HREF="http://www.mathworks.com/" ADD_DATE="1281455379">The MathWorks - MATLAB and Simulink for Technical Computing</A>
            <DT><A HREF="http://www.youtube.com/user/GoogleTechTalks" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABS0lEQVQ4jZ2SP0hCURTGf+clQdAQ2tKsUHNbQ9SgCI5BU/sbpSnaWrPRxamlyUJsaKlcWqIaoigKk+ckNKVOBYF6Gu57r/fyT9a3nO+ce893zr3noKr6VM7qTmZaVVWTdkkPLhs6LvAJaPvQ2L/Av/18nFXAr36eTyugdu7GLxC0fQLvr0cKaLXT0w+XewnVTm+ogMWY2N+M8dLVvnjEIyJiiCpTc+uc5dOICHbumvkJobuywULEImmXQgLitvVvRKjVILPk6QFj6ilweoVoPKo4ze+DRCzYIDhvP2JhmE9MzA5I9sqMhhW6uLgMTtO4TtP4uLxQNLxQhNs6lC+AwBRC8CbiIREzIlu7kEq5wcYIgd9a958qAwQqFUKfendv/JMT2NsGS4z/8Ahrq2YK7c+W3/mwrQjsGWKKMzMZNVzj0UDasF3oj0u9JV8pACCxycNu1gAAAABJRU5ErkJggg==">YouTube - Chaîne de GoogleTechTalks</A>
            <DT><A HREF="http://groups.csail.mit.edu/vision/welcome/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAB0klEQVQ4jY2T30tTYRjHP+mcp8mWzBFKJVk4CyVZI+wH1Z1FQtFdV3Uxr4bX/gkS4eXoquxGvOgmuhg0jWRRwhgrBks7Z81AZzuy49Kj+6Fm62LsXafZYd+r9+F9v5/neb+875FyuVzGRJlPIZR3MwDslQoADNwZpeviLQCazMwAywuv0bNpAKySjZ8/kuiZpNi3mJl3dY3cqswV3wQu9xAA4clHFAuFwwFyMIC6FBV1tbO982zNILUZmgjArq4RDz6jxzuM88wFceDYiV5aHS5R2xzt/D74VQ/IfYtx1N7BJd8Ts1vVSYSoJsI4T/U1ZDrIb9YD1lMJOvuvNgTYK+6ItQVAUyLsl/Kc9N5uCFDYzKIpkRpgay1Ji9RmCKsqTYmQjofJylFK2zmK2xsAfH45iVWyVQDp2BzHz182GOVgAOX9KwAku5Nuz00cXb2oiTAAnocTlQlS89OsL8fpv+sHIDU/zeKbFwC4b9zn9PUHhsn0TBJdXalloH5ZoMc7jMs9RPT5ON9jswyOjNI3MvbfDAwhXht7CsDHgJ/cqsy9x28PzeLfEAUAKj9ubbEyyVJoyvDS/lZTswU18YHmllYjoKqNla+mnQH2S3k6us+J+g88sKj7zLO6HwAAAABJRU5ErkJggg==">MIT CSAIL Computer Vision Research Group</A>
            <DT><A HREF="http://www.youtube.com/watch?v=9K8X__I2O2A&feature=related" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABS0lEQVQ4jZ2SP0hCURTGf+clQdAQ2tKsUHNbQ9SgCI5BU/sbpSnaWrPRxamlyUJsaKlcWqIaoigKk+ckNKVOBYF6Gu57r/fyT9a3nO+ce893zr3noKr6VM7qTmZaVVWTdkkPLhs6LvAJaPvQ2L/Av/18nFXAr36eTyugdu7GLxC0fQLvr0cKaLXT0w+XewnVTm+ogMWY2N+M8dLVvnjEIyJiiCpTc+uc5dOICHbumvkJobuywULEImmXQgLitvVvRKjVILPk6QFj6ilweoVoPKo4ze+DRCzYIDhvP2JhmE9MzA5I9sqMhhW6uLgMTtO4TtP4uLxQNLxQhNs6lC+AwBRC8CbiIREzIlu7kEq5wcYIgd9a958qAwQqFUKfendv/JMT2NsGS4z/8Ahrq2YK7c+W3/mwrQjsGWKKMzMZNVzj0UDasF3oj0u9JV8pACCxycNu1gAAAABJRU5ErkJggg==">YouTube - Hello World through custom UART to HyperTerminal</A>


A very quick solution would be to use regular expressions in Python.

Assuming that the variable s contains your HTML string:

import re

s = '''   <DT><A HREF="http://gmazars.info/conf/index.html" 
            <DT><A HREF="http://research.google.com/" 
            <DT><A HREF="http://research.microsoft.com/en-us/" 
            <DT><A HREF="http://techresearch.intel.com/articles/index.html" 
'''

print re.findall("HREF=\"(.*?)\"", s)


A rather hacky way to do it would be to save it as a string and make a substring using a regex. I'm not familiar with any of those languages, but I'm sure they can do that fairly easily.

Something like:

(DT><A HREF=){1}(.*)(ADD_DATE=){1}

and /2 would grab the information you wanted. Or something similar, it's been a while since I used it.


awk -F"\"" '/<DT><A HREF=/{print $2}' file
0

精彩评论

暂无评论...
验证码 换一张
取 消