text-extraction
Extracting data from a web page
I am doing a school project that requires extracting data from web pages. To be precise, I need a library or open-source program to extract human-readable content from HTML/text data. Something like web b...
2023-02-27 02:37 Category: Q&A
How to extract just plain text from .doc & .docx files? [closed]
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
2023-02-25 04:18 Category: Q&A
Add rules and extract text from PDF using C# .NET
I want to build a PDF text extraction tool with features similar to this application (A-PDF Data Extractor): http://www.a-pdf.com/data-extractor/index.htm
2023-02-11 15:11 Category: Q&A
Text extraction with NSRegularExpression
Given an NSString *test = @"...href="/functions?q=KEYWORD\x26amp..."; How can I extract the word KEYWORD from the string using NSRegularExpression?
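The question asks for NSRegularExpression; the pattern itself is portable, so here is a minimal sketch in Python's `re` module instead (a swapped-in language, not the asker's). It assumes the keyword always follows `href="/functions?q=` and ends at the first backslash, ampersand, or double quote. The function name `extract_query` is hypothetical.

```python
import re

# Sample modelled on the question; the leading and trailing "..." are elided in the source.
test = r'...href="/functions?q=KEYWORD\x26amp...'


def extract_query(text):
    """Return the value of the q parameter, or None if the link is not found."""
    # Capture everything after "q=" up to a backslash, an ampersand, or a quote.
    match = re.search(r'href="/functions\?q=([^\\&"]+)', text)
    return match.group(1) if match else None


print(extract_query(test))
```

The same pattern string should carry over to NSRegularExpression with the backslashes and quotes escaped for an Objective-C string literal, though that has not been verified here.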
2023-02-04 16:54 Category: Q&A
How to extract a substring using regex
I have a string that contains two single quotes (the ' character). The data I want is between the single quotes.
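One way to approach this, sketched in Python under the assumption that the string contains exactly one quoted section and no escaped quotes inside it. The sample input and the function name `between_single_quotes` are made up for illustration.

```python
import re


def between_single_quotes(text):
    """Return the text between the first pair of single quotes, or None."""
    # '([^']*)' matches a quote, captures every non-quote character, then a closing quote.
    match = re.search(r"'([^']*)'", text)
    return match.group(1) if match else None


sample = "mydata = 'the data i want' and some more"  # hypothetical input
print(between_single_quotes(sample))  # prints: the data i want
```

The negated character class `[^']*` stops at the first closing quote, which avoids the over-matching a greedy `.*` would cause if more quotes appeared later in the string.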
2023-02-03 12:54 Category: Q&A
Preserve "long" spaces in PDFBox text extraction
I am using PDFBox to extract text from a PDF. The PDF has a simple tabular structure, and the columns are widely spaced from each other...
2023-02-03 09:29 Category: Q&A
Is there a way to use readability (text extraction algorithm) and a custom algorithm in Python to extract links from text?
2023-02-02 21:37 Category: Q&A
Stripping HTML but retaining block/inline structure
I would like to convert HTML to plain text but retain the minimum structure. All sections that contain content only the browser needs to see, such as <script> and <style>, should be stripped co...
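A minimal sketch of one possible approach, using only Python's standard-library `html.parser`. The snippet above is truncated, so this assumes "minimum structure" means one output line per block-level element, with inline elements kept on the same line. The tag sets and the names `TextExtractor` and `html_to_text` are illustrative choices, not part of the question.

```python
from html.parser import HTMLParser

# Assumed tag sets; extend them to match the markup being processed.
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}


class TextExtractor(HTMLParser):
    """Collect visible text, breaking lines at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


page = ("<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Hello <b>bold</b> world</p><p>Second</p></body></html>")
print(html_to_text(page))
```

Inline tags such as `<b>` are ignored rather than turned into breaks, so "Hello bold world" stays on one line, while each `<h1>` and `<p>` produces its own line.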
2023-01-29 09:46 Category: Q&A
Regex for extracting only TR with TDs
Good morning. I'm trying to get a table row (TR) that must have one or more table cells (TDs). Having this string...
2023-01-27 02:49 Category: Q&A
"Unexpected color space /R11" while parsing a PDF file with text and image
System.ArgumentException was unhandled by user code Message=Unexpected color space /R11 Source=itextsharp...
2023-01-26 14:42 Category: Q&A