
grepping out invalid URIs

Source: https://www.devze.com, 2023-02-14 12:09 (reposted from the web)
I have DBpedia's N-Triples files. Some of them contain non-absolute URIs, that is, URIs that don't start with http://. This causes problems when parsing.

That is, I have some triples with URIs like <www.example.com> instead of <http://www.example.com>.

I'd like to grep them out by negating them.

I tried grep -v "^(<http)", without success.

Any suggestion?

Edit

I probably explained this badly. These URIs aren't necessarily at the beginning of the line; my mistake was using the '^' operator as NOT. Also, I want to filter them out, with grep -v.

These are some sample lines:

<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .

<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .


grep -P '^(?!<http).*'

(?!...) is a negative lookahead. I did not test it, so if it does not work, search the web for 'regex negative lookahead'; that should do the job.


To handle multiple URIs per line, a regex that works as a starting point is:

grep -P '<(?!http(s)?:\/\/).*>'
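Since the question asks to filter the offending lines out, here is a rough sketch of that pattern combined with -v. It assumes GNU grep built with PCRE support (needed for -P); the function name drop_non_http_lines and the file names are made up for illustration.

```shell
# Hypothetical helper: read N-Triples on stdin and print only the lines
# in which every '<' is immediately followed by http:// or https://.
drop_non_http_lines() {
    # -P enables Perl-compatible regexes, required for the (?!...) lookahead.
    # -v inverts the match, so any line containing a '<' that is NOT
    # followed by http:// or https:// is dropped.
    grep -vP '<(?!http(s)?:\/\/).*>'
}

# Example usage (input.nt and cleaned.nt are placeholder file names):
# drop_non_http_lines < input.nt > cleaned.nt
```

Be aware that this also drops lines whose URIs use another scheme (ftp:, mailto:, ...) and lines where a literal happens to contain a '<' followed later by a '>', so it may be worth eyeballing the removed lines before discarding them.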


"^(<http)" would only match if "<http" is at the beginning of the line. Is that true in your case?
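A quick way to check this on the sample lines from the question, as a sketch: without -E, grep uses basic regular expressions, where ( and ) are literal characters, so the original command removes nothing. With -E the parentheses group, and every sample line starts with <http, so everything is removed. The variable names are only for illustration.

```shell
# The two sample lines from the question.
sample='<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .
<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .'

# Basic regex (the default): '(' and ')' are literal, so no line matches
# and -v keeps every line. -c prints the number of lines kept.
kept_basic=$(printf '%s\n' "$sample" | grep -vc '^(<http)')

# Extended regex: the parentheses are a group, every line starts with
# <http, so -v keeps none. grep exits non-zero when it selects no lines,
# hence the '|| true'.
kept_extended=$(printf '%s\n' "$sample" | grep -Evc '^(<http)' || true)

echo "basic: $kept_basic kept, extended: $kept_extended kept"
```

Either way, an anchored pattern only looks at the first URI on each line, which is why it cannot catch a bad URI in the object position.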
