
Extracting dates from HTML meta data in FAST-ESP

Developer https://www.devze.com 2022-12-28 05:07 Source: web
During document processing I want to extract all dates from the HTML meta data and then identify the latest one, which will be used to populate a date field (dtgeneric1).

<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data

Inspection using spy stages shows that our pipeline already adds meta_* attributes but the meta data names will be different across documents from different sources.

#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes

Ideally we would like to pass all the meta_* attributes to a Python stage and use that to work out which are dates and which is the largest, but there seems to be no way of specifying "all meta attributes" as input.

Has anyone done something similar, and can you offer any advice on the best way to do this?

Thanks

Neil


I suppose a custom stage would do the job: it takes all the relevant date attributes as input, compares them to find the newest date, and writes that value to the output field.
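
The comparison itself could look something like the sketch below. It is plain Python and deliberately leaves out the docproc wiring, so how the attributes get collected from the document and how the result is written to dtgeneric1 are not shown. The function names, the `meta_` prefix filter, the second date format, and the `meta_author` sample attribute are all assumptions for illustration; only the `YYYY/MM/DD HH:MM:SS` layout comes from the question.

```python
from datetime import datetime

# Assumed date layouts; only the first one appears in the question.
DATE_FORMATS = ("%Y/%m/%d %H:%M:%S", "%Y/%m/%d")


def parse_date(text):
    """Return a datetime if text matches a known format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def latest_meta_date(attributes, prefix="meta_"):
    """Return the newest date among attributes whose name starts with prefix.

    `attributes` is a plain dict of attribute name -> string value.
    Values that do not parse as dates are ignored.
    """
    dates = []
    for name, value in attributes.items():
        if not name.startswith(prefix):
            continue
        parsed = parse_date(str(value))
        if parsed is not None:
            dates.append(parsed)
    return max(dates) if dates else None


# Sample input modelled on the spy-stage output in the question.
attrs = {
    "meta_originalpublicationdate": "2010/04/21 12:06:36",
    "meta_lastmodificationdate": "2010/04/22 14:10:16",
    "meta_author": "Neil",  # hypothetical non-date attribute
}
latest = latest_meta_date(attrs)
```

Trying to parse every value and skipping the ones that fail means the stage does not need to know the meta data names in advance, which matters when they differ between sources. If no attribute parses as a date, the function returns None and the date field can be left unset.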
