开发者

Convert a pdf file to text in C# [closed]

开发者 https://www.devze.com 2022-12-14 11:09 出处:网络
Closed. This question needs to be more focused. It is not currently accepting answers. Want to improve this question? Update the question so it focuses on one problem only by editing this
Closed. This question needs to be more focused. It is not currently accepting answers.

Want to improve this question? Update the question so it focuses on one problem only by editing this post.

Closed 5 years ago.

Improve this question 开发者_StackOverflow中文版

I need to convert a .pdf file to a .txt file

How can I do this in C#?


I've had the need myself and I used this article to get me started: http://www.codeproject.com/KB/string/pdf2text.aspx


Ghostscript could do what you need. Below is a command for extracting text from a pdf file into a txt file (you can run it from a command line to test if it works for you):

gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "test.pdf" -c quit >"test.txt"

Check here: codeproject: Convert PDF to Image Using Ghostscript API for details on how to use ghostscript with C#


The concept of converting PDF to text is not really straight forward and you wont see anyone posting a code here that will convert PDF to text straight. So your best bet now is to use a library that would do the job for you... a good one is PDFBox, you can google it. You'll probably find it written in java but fortunately you can use IKVM to convert it to .Net....


As an alternative to Don's solution there I found the following:

Extract Text from PDF in C# (100% .NET)


Docotic.Pdf library can extract text from PDF files (formatted or not).

Here is a sample code that shows how to extract formatted text from a PDF file and save it to an other file.

public static void ExtractFormattedText(string pdfFile, string textFile)
{
    using (PdfDocument doc = new PdfDocument(pdfFile))
    {
        string text = doc.GetTextWithFormatting();
        File.WriteAllText(textFile, text);
    }
}

Also, there is an article on our site that shows other options for extraction of text from PDF files.

Disclaimer: I work for Bit Miracle, vendor of the library.


    public void PDF_TEXT()
    {
        richTextBox1.Text =  string.Empty;

        ReadPdfFile(@"C:\Myfile.pdf");  //read pdf file from location
    }


    public void ReadPdfFile(string fileName)
    {

 string strText = string.Empty;
 StringBuilder text = new StringBuilder();
   try
    {
    PdfReader reader = new PdfReader((string)fileName);
    if (File.Exists(fileName))
    {
    PdfReader pdfReader = new PdfReader(fileName);

   for (int page = 1; page <= pdfReader.NumberOfPages; page++)
      {

 ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

 string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

          text.Append(currentText);

                }
                pdfReader.Close();
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
        richTextBox1.Text = text.ToString();

    }



    private void Save_TextFile_Click(object sender, EventArgs e)
    {
        SaveFileDialog sfd = new SaveFileDialog();

        DialogResult messageResult = MessageBox.Show("Save this file into Text?", "Text File", MessageBoxButtons.OKCancel);

        if (messageResult == DialogResult.Cancel)
        {

        }
        else
        {
            sfd.Title = "Save As Textfile";
            sfd.InitialDirectory = @"C:\";
            sfd.Filter = "TextDocuments|*.txt";


            if (sfd.ShowDialog() == DialogResult.OK)
            {
                if (richTextBox1.Text != "")
                {
                    richTextBox1.SaveFile(sfd.FileName, RichTextBoxStreamType.PlainText);
                    richTextBox1.Text = "";
                    MessageBox.Show("Text Saved Succesfully", "Text File");

                }
                else
                {
                    MessageBox.Show("Please Upload Your Pdf", "Text File",
                    MessageBoxButtons.OKCancel, MessageBoxIcon.Asterisk);
                }

            }

        }

    }
0

精彩评论

暂无评论...
验证码 换一张
取 消