Re-formatting free-text to fixed format text (C#)

Developer https://www.devze.com 2023-02-06 01:23 Source: web
I have a problem that seems quite straightforward, but I cannot find a clean and simple solution.

  • I have some freely formatted text. This text can be quite long and contains lines of varying length (> 120 characters), paragraphs and empty lines.

  • I need to present this text in a fixed format (say 120 characters per line and 25 lines per page), while keeping the original formatting of paragraphs and empty lines.

A page break should not fall in the middle of a word. Ideally, a page break should be placed so that no single line of a new paragraph is left at the bottom of a page; the whole paragraph should move to the next page instead.

Simplified sample (input text):


Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec at magna at tellus vehicula eleifend. Vivamus at est erat. Phasellus eget tincidunt tellus. Integer ultrices dolor a magna congue imperdiet. 

Duis est sem, aliquet id fermentum sed, mollis nec metus. Phasellus porttitor porttitor sodales. Aliquam tincidunt convallis massa, sed tempus erat ornare in. Sed scelerisque, lorem accumsan imperdiet accumsan, mauris turpis molestie augue, vehicula egestas tellus quam ac nulla. 

In porta augue ac dolor imperdiet semper. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Proin lacus neque, tempor nec feugiat sed, posuere sed lorem. 

Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Nulla metus neque, volutpat vitae pharetra rutrum, malesuada in dolor. 

"Fixed" width formatted with page breaks (output of program):


Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Donec at magna at tellus vehicula eleifend. Vivamus at 
est erat. Phasellus eget tincidunt tellus. Integer 
ultrices dolor a magna congue imperdiet. 

Duis est sem, aliquet id fermentum sed, mollis nec metus. 
Phasellus porttitor porttitor sodales. Aliquam tincidunt 
convallis massa, sed tempus erat ornare in. Sed scelerisque, 
lorem accumsan imperdiet accumsan, mauris turpis molestie 
augue, vehicula egestas tellus quam ac nulla. 
[pagebreak]
In porta augue ac dolor imperdiet semper. Vestibulum ante 
ipsum primis in faucibus orci luctus et ultrices posuere 
cubilia Curae; Proin lacus neque, tempor nec feugiat sed, 
posuere sed lorem. 

Class aptent taciti sociosqu ad litora torquent per conubia 
nostra, per inceptos himenaeos. Nulla metus neque, volutpat 
vitae pharetra rutrum, malesuada in dolor. 

Anyone have any ideas?


Phase 1

  1. Read the text into a single string.
  2. Split the string into an array of lines (lines[]) on the newline character (\n).

Phase 2

  1. Initialize a StringBuilder.
  2. Loop through the lines array and split each line into a words array on the space character. Then loop through the words array and append each word to the StringBuilder. When the current line length exceeds your threshold, insert a newline character. When you reach the end of a line's words, check whether the StringBuilder already ends with a newline character (which happens when the last output line was exactly the threshold length), and then add the newline characters needed for a paragraph break.
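
The two phases above could be sketched roughly as follows. This is only an illustration under some assumptions, not the answerer's own code: the names SimpleWrapper, Wrap and maxWidth are made up for the example, and each input line is treated as one paragraph. Unlike the description above, the sketch ends every input line with a single newline, so that empty lines in the input carry through unchanged instead of being doubled.

```csharp
using System;
using System.Text;

static class SimpleWrapper
{
    // Hypothetical helper: wraps each input line (treated as one paragraph)
    // so that no output line is longer than maxWidth characters.
    public static string Wrap(string text, int maxWidth)
    {
        // Phase 1: split the text into lines on the newline character.
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        // Phase 2: rebuild the text word by word.
        var sb = new StringBuilder();
        foreach (string line in lines)
        {
            int lineLength = 0; // characters written to the current output line
            string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                // Break before a word that would push the line past the threshold.
                // A word longer than maxWidth still ends up alone on an overlong line.
                if (lineLength > 0 && lineLength + 1 + word.Length > maxWidth)
                {
                    sb.Append('\n');
                    lineLength = 0;
                }
                if (lineLength > 0)
                {
                    sb.Append(' ');
                    lineLength++;
                }
                sb.Append(word);
                lineLength += word.Length;
            }
            // End of the input line; an empty input line becomes an empty output line.
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Calling SimpleWrapper.Wrap(text, 120) would give the 120-character format from the question; page breaks are not handled here.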


Assuming that you're using a single non-proportional font (so the width of a line is specified as a number of characters rather than a number of centimetres)...

There are several parts to your problem.

First, you want to word-wrap your text paragraphs into lines of no more than n characters. The basic approach is to first process each paragraph so that it is a single line of text (if you don't already have an input in that form), then use a variable as a 'cursor' - place it at index n and then step it backwards until you find some whitespace. This is the end of the last word that will fit in that line. Copy this line out into a list of lines, and repeat to break the string up into word-wrapped paragraphs.

(Note: There are some cases you will have to handle here: you may have to split at punctuation characters and hyphens, and you may have to cope with a "single word" that is longer than the formatting width. For more advanced formatting you may want to add a hyphenation dictionary so you can split words with a hyphen.)
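
As a rough sketch of the 'cursor' approach, assuming the paragraph is already a single line of text: the names CursorWrapper and WrapParagraph are invented for this example, and the only special case handled is a word longer than n, which is simply cut at n characters. Splitting at punctuation or hyphens is left out.

```csharp
using System;
using System.Collections.Generic;

static class CursorWrapper
{
    // Hypothetical helper: breaks one paragraph (a single line of text)
    // into lines of at most n characters.
    public static List<string> WrapParagraph(string paragraph, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        var lines = new List<string>();
        string rest = paragraph.Trim();

        while (rest.Length > n)
        {
            // Place the cursor at index n and step backwards until whitespace is found.
            int cursor = n;
            while (cursor > 0 && !char.IsWhiteSpace(rest[cursor]))
                cursor--;

            // No whitespace found: a single word longer than n, so cut it at n.
            if (cursor == 0)
                cursor = n;

            // Copy the line out and continue with the remainder of the paragraph.
            lines.Add(rest.Substring(0, cursor).TrimEnd());
            rest = rest.Substring(cursor).TrimStart();
        }

        if (rest.Length > 0)
            lines.Add(rest);
        return lines;
    }
}
```

For example, WrapParagraph("Lorem ipsum dolor sit amet", 11) yields the three lines "Lorem ipsum", "dolor sit" and "amet".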

Once you have your paragraphs you need to apply a similar algorithm to break the document into pages one at a time. Again, start with a 'cursor' position that is m lines into the line-list (where m is the page length). However, you want "widow and orphan" control, so you need to add some logic, e.g.:

  • If the first line(s) of a page are blank, delete them (so you don't get whitespace at the top of the page). This of course means more lines will flow onto the bottom of the page.
  • If the first line of a page is the end of a paragraph (i.e. the second line of the page is blank) then you may want to fix the orphaned line by moving the final line of the previous page to the top of this page. (But only if the orphan is the tail of a paragraph rather than just a very short paragraph!)
  • If the last line of a page is the start of a new paragraph (i.e. the second-to-last line is blank) then move it to the start of the next page.
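
One possible way to put these three rules into code is sketched below. It assumes the document is already a list of wrapped lines in which an empty string marks a blank line; Paginator, Paginate and IsBadBreak are names invented for the example. Instead of moving lines after the fact, the sketch pulls the page cursor back while the break would strand a single line of a paragraph, which has the same effect as the rules above.

```csharp
using System;
using System.Collections.Generic;

static class Paginator
{
    static bool IsBlank(string line)
    {
        return string.IsNullOrWhiteSpace(line);
    }

    // True when cutting just before lines[end] would strand one line of a paragraph.
    static bool IsBadBreak(IList<string> lines, int end)
    {
        // The cut only splits a paragraph if the lines on both sides are non-blank.
        if (IsBlank(lines[end - 1]) || IsBlank(lines[end])) return false;

        // The last line of the page is the first line of a paragraph.
        bool startsParagraph = end >= 2 && IsBlank(lines[end - 2]);
        // The first line of the next page is the last line of a paragraph.
        bool leavesTail = end + 1 >= lines.Count || IsBlank(lines[end + 1]);
        return startsParagraph || leavesTail;
    }

    // Hypothetical helper: splits wrapped lines into pages of at most m lines.
    public static List<List<string>> Paginate(IList<string> lines, int m)
    {
        if (m < 1) throw new ArgumentOutOfRangeException(nameof(m));

        var pages = new List<List<string>>();
        int pos = 0;
        while (pos < lines.Count)
        {
            // Skip blank lines at the top of a page.
            while (pos < lines.Count && IsBlank(lines[pos])) pos++;
            if (pos == lines.Count) break;

            // Start with the cursor m lines in, then pull it back while the break is bad.
            int end = Math.Min(pos + m, lines.Count);
            while (end < lines.Count && end - pos > 1 && IsBadBreak(lines, end)) end--;

            var page = new List<string>();
            for (int i = pos; i < end; i++) page.Add(lines[i]);
            // Drop blank lines left at the bottom of the page.
            while (page.Count > 0 && IsBlank(page[page.Count - 1])) page.RemoveAt(page.Count - 1);

            pages.Add(page);
            pos = end;
        }
        return pages;
    }
}
```

With m = 5 and the lines a1, a2, a3, (blank), b1, b2, b3, the first page ends after a3 and the second page holds b1 to b3, so the first line of the second paragraph is not left alone at the bottom of page one.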

Fundamentally the process is pretty simple, but there are a lot of little complexities relating to how you want to handle the word-wrap and page-wrap. A simple algorithm won't take long to knock up, but you could spend a lot of time tweaking and improving it to achieve the "best" (in your eyes at least) results.


Ok, if we split text into words and paragraphs, then we can simply add word by word to output:

using System.IO; //StreamReader, StreamWriter (the members below live inside a class)

const int linewidth = 50;

static void Main(string[] args) {

  using(StreamReader r = new StreamReader("text1.txt")) {
    using(StreamWriter w = new StreamWriter("text2.txt")) {

      int written = 0; //characters written to the current output line

      while(true) {
        string word = ReadWord(r);
        if(word == null) break; //end of file
        if(word == "") {
          //end of paragraph
          w.Write("\r\n\r\n");
          written = 0;
          continue;
        }

        if(written > 0 && written + word.Length > linewidth) {
          //endline
          w.Write("\r\n");
          written = 0;
        }
        //a word at the start of a line loses its leading spaces
        if(written == 0) word = word.TrimStart(' ');
        w.Write(word);
        written += word.Length;
      }
    }
  }
}

So we need some smart "word reader":

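static int c = -1; //one character of lookahead; -1 means nothing buffered or end of file

static string ReadWord(StreamReader r) {
  string word = "";
  bool started = false; //true once the word contains a non-space character

  if(c == -1) c = ReadChar(r);

  while(true) {
    if(c == -1) {
      //eof
      if(!started) return null;
      return word;
    }
    if(c == '\r') {
      //ReadChar reports an empty line as '\r'
      if(started) break; //finish the current word first
      c = ReadChar(r);
      return ""; //paragraph break
    }
    if(c != ' ') started = true;
    word += (char)c;
    c = ReadChar(r);
    if(c == ' ' && started) break; //a space after the text ends the word
  }

  return word;
}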

And this word reader needs a smart character reader, which treats all line ends as spaces and recognizes empty lines as paragraphs:

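static bool lineend = false; //true while the last thing read was a line end

static int ReadChar(StreamReader r) {
  int c = r.Read();
  if(c == '\n') c = r.Read(); //skip the LF of a CRLF pair
  if(c == '\r') {
    if(lineend) return '\r'; //second line end in a row: empty line, i.e. paragraph break
    lineend = true;
    return ' '; //an ordinary line end counts as a space
  }
  lineend = false;
  return c;
}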

As you can see, I use no internal array buffers, so the program can handle arbitrarily large files, though it is probably not as fast as an in-memory algorithm working on strings.

Words longer than a line are written on their own lines (see Main).

Only spaces and CRLF are treated as word delimiters. In a real-world situation you should probably extend this to TAB and other whitespace characters.
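
One way to extend the delimiters, sketched here under the assumption that the character reader keeps using '\r' and '\n' for line ends, is to map every other whitespace character to a plain space before the word reader sees it. Whitespace and Normalize are hypothetical names, not part of the code above.

```csharp
using System;

static class Whitespace
{
    // Hypothetical helper: turns TAB and other whitespace into a plain space,
    // but leaves end of file (-1) and the line-end characters untouched.
    public static int Normalize(int c)
    {
        if (c == -1 || c == '\r' || c == '\n') return c;
        return char.IsWhiteSpace((char)c) ? ' ' : c;
    }
}
```

ReadChar could then finish with return Whitespace.Normalize(c); instead of return c;.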
