I am looking for an algorithm that takes a string and splits it into a given number of parts. The parts should contain only complete words (so the string is split at whitespace), and they should be of nearly the same length, or be as long as possible.
I know it is not that hard to code a function that can do what I want but I wonder whether there is a well-proven and fast algorithm for that purpose?
Edit: To clarify my question, I'll describe the problem I am trying to solve.
I generate images with a fixed width. Into these images I write user names using GD and Freetype in PHP. Since I have a fixed width I want to split the names into 2 or 3 lines if they don't fit into one.
To fill as much space as possible, I want to split the names so that each line holds as many words as possible. By this I mean that a line should contain as many words as necessary to keep each line's length close to the average line length of the whole text block. So if there are one long word and two short words, the two short words should share a line if that makes all lines about equally long.
(Then I compute the text block width using 1, 2 or 3 lines, and if it fits into my image I render it. Only if there are 3 lines and it still won't fit do I decrease the font size until everything fits.)
Example:
This is a long text
should be displayed something like this:
This is a
long text
or:
This is
a long
text
but not:
This
is a long
text
and also not:
This is a long
text
I hope this explains more clearly what I am looking for.
If you're talking about line-breaking, take a look at Dynamic Line Breaking, which gives a Dynamic Programming solution to divide words into lines.
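For a fixed number of lines, the dynamic-programming idea can be sketched roughly as follows. This is a minimal Python sketch and not the code from the linked article: the name `split_balanced` and the cost function (minimise the longest line, measured in characters) are my own assumptions, and a pixel-width measure could be substituted for `len`.

```python
from functools import lru_cache


def split_balanced(text, num_lines):
    """Split text into num_lines lines of whole words, minimising the longest line."""
    words = text.split()
    n = len(words)
    if num_lines < 1 or n < num_lines:
        raise ValueError("need at least one word per line")

    # prefix[i] = total characters in words[:i], so line lengths are cheap to compute
    prefix = [0]
    for w in words:
        prefix.append(prefix[-1] + len(w))

    def line_len(i, j):
        # length of words[i:j] joined by single spaces (requires j > i)
        return prefix[j] - prefix[i] + (j - i - 1)

    @lru_cache(maxsize=None)
    def best(i, k):
        # (cost, break positions) for laying out words[i:] on exactly k lines
        if k == 1:
            return line_len(i, n), ()
        result = None
        # leave at least one word for each of the remaining k - 1 lines
        for j in range(i + 1, n - k + 2):
            rest_cost, rest_breaks = best(j, k - 1)
            cost = max(line_len(i, j), rest_cost)
            if result is None or cost < result[0]:
                result = (cost, (j,) + rest_breaks)
        return result

    _, breaks = best(0, num_lines)
    bounds = [0, *breaks, n]
    return [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


print(split_balanced("This is a long text", 2))
print(split_balanced("This is a long text", 3))
```

On the example from the question this yields "This is a" / "long text" for two lines and "This is" / "a long" / "text" for three, which are the layouts the question asks for.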
I don't know about proven, but it seems like the simplest and most efficient solution would be to divide the length of the string by N then find the closest white space to the split locations (you'll want to search both forward and back).
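The code below seems to work, though there are plenty of error conditions that it doesn't handle. It looks like it runs in O(n), where n is the number of parts you want, since each split point needs one scan outward from its target index (each scan is itself bounded by the length of the string).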
using System;

class Program
{
    static void Main(string[] args)
    {
        var s = "This is a string for testing purposes. It will be split into 3 parts";
        var p = s.Length / 3; // ideal length of each part
        var w1 = 0;
        var w2 = FindClosestWordIndex(s, p);
        var w3 = FindClosestWordIndex(s, p * 2);
        Console.WriteLine(string.Format("1: {0}", s.Substring(w1, w2 - w1).Trim()));
        Console.WriteLine(string.Format("2: {0}", s.Substring(w2, w3 - w2).Trim()));
        Console.WriteLine(string.Format("3: {0}", s.Substring(w3).Trim()));
        Console.ReadKey();
    }

    // Returns the index of the space closest to startIndex, or -1 if s contains no space.
    public static int FindClosestWordIndex(string s, int startIndex)
    {
        int wordAfterIndex = -1;
        int wordBeforeIndex = -1;
        // Scan forward for the next space.
        for (int i = startIndex; i < s.Length; i++)
        {
            if (s[i] == ' ')
            {
                wordAfterIndex = i;
                break;
            }
        }
        // Scan backward for the previous space.
        for (int i = startIndex; i >= 0; i--)
        {
            if (s[i] == ' ')
            {
                wordBeforeIndex = i;
                break;
            }
        }
        // If a space exists on only one side, use that one.
        if (wordAfterIndex == -1)
            return wordBeforeIndex;
        if (wordBeforeIndex == -1)
            return wordAfterIndex;
        // Otherwise pick whichever space is nearer to startIndex.
        if (wordAfterIndex - startIndex <= startIndex - wordBeforeIndex)
            return wordAfterIndex;
        else
            return wordBeforeIndex;
    }
}
The output for this is:
1: This is a string for
2: testing purposes. It will
3: be split into 3 parts
Again, following Brian's answer, I made a PHP version of his code:
// Input text
$txt = "This is a really long string that should be broken up onto lines of about the same number of characters.";
// Number of lines
$numLines = 3;
/* Do it, result comes as an array: */
$aResult = splitLinesByClosestWhitespace($txt, $numLines);
/* Output result: */
if ($aResult) {
    for ($x = 1; $x <= $numLines; $x++)
        echo "Line " . $x . ": " . $aResult[$x] . "<br>";
} else {
    echo "Not enough spaces to generate the lines!";
}
/**********************/
/**
 * Splits a string into multiple lines of the closest possible same length,
 * using the closest whitespaces
 * @param string $txt String to split
 * @param integer $numLines Number of lines
 * @return array|false
 */
function splitLinesByClosestWhitespace($txt, $numLines)
{
    $p = intval(strlen($txt) / $numLines);
    $aTxtIndx = array();
    $aTxt = array();
    // Check we have enough whitespaces to generate the number of lines
    $wsCount = count(explode(" ", $txt)) - 1;
    if ($wsCount < $numLines)
        return false;
    // Get the indexes
    for ($x = 1; $x <= $numLines; $x++) {
        $aTxtIndx[$x] = FindClosestWordIndex($txt, $p * ($x - 1));
    }
    // Do the split. substr() takes a length, not an end index,
    // and trim() belongs on the extracted text, not on the index.
    for ($x = 1; $x <= $numLines; $x++) {
        if ($x != $numLines)
            $aTxt[$x] = trim(substr($txt, $aTxtIndx[$x], $aTxtIndx[$x + 1] - $aTxtIndx[$x]));
        else
            $aTxt[$x] = trim(substr($txt, $aTxtIndx[$x]));
    }
    return $aTxt;
}
/**
 * Finds the closest word to a string index
 * @param string $s String to search
 * @param integer $startIndex Index at which to find the closest word
 * @return integer
 */
function FindClosestWordIndex($s, $startIndex)
{
    $wordAfterIndex = 0;
    $wordBeforeIndex = 0;
    // Scan forward for the next space
    for ($i = $startIndex; $i < strlen($s); $i++) {
        if ($s[$i] == ' ') {
            $wordAfterIndex = $i;
            break;
        }
    }
    // Scan backward for the previous space
    for ($i = $startIndex; $i >= 0; $i--) {
        if ($s[$i] == ' ') {
            $wordBeforeIndex = $i;
            break;
        }
    }
    // Pick whichever space is nearer to $startIndex
    if ($wordAfterIndex - $startIndex <= $startIndex - $wordBeforeIndex)
        return $wordAfterIndex;
    else
        return $wordBeforeIndex;
}
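Partitioning into exactly equal sizes is NP-complete in general (it is the partition problem). That applies when the pieces may be grouped in any order, though; here the words must stay in order, so a good split can be found in polynomial time.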
Working Python code by David Eppstein:
- Wrap.py - Break paragraphs into lines, attempting to avoid short lines.
- SMAWK.py - Same thing in O(n).
The way word-wrap is usually implemented is to place as many words as possible onto one line, and break to the next when there is no more room. This assumes, of course, that you have a maximum-width in mind.
Regardless of what algorithm you use, keep in mind that unless you are working with a fixed-width font, you want to work with the physical width of the word, not the number of letters.
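To make both points concrete, here is a minimal Python sketch of greedy wrapping with a pluggable width function. The names `greedy_wrap` and `measure` are hypothetical; `measure` stands in for whatever call reports the rendered width of a string (in the asker's PHP/GD setup that would presumably be built on `imagettfbbox`), and it defaults to `len`, which only counts characters.

```python
def greedy_wrap(text, max_width, measure=len):
    """Greedy word wrap: fill each line until the next word would overflow max_width.

    measure is a stand-in for a real text-measuring call (an assumption here);
    the default, len, simply counts characters.
    """
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > max_width:
            # the word does not fit: close the current line and start a new one
            lines.append(current)
            current = word
        else:
            # either it fits, or it is a lone over-long word that cannot be split
            current = candidate
    if current:
        lines.append(current)
    return lines


print(greedy_wrap("This is a long text", 9))
print(greedy_wrap("This is a long text", 14))
```

With a width of 14 characters the greedy approach gives "This is a long" / "text", which is exactly the unbalanced layout the question wants to avoid, so greedy wrapping on its own is probably not enough when the lines should be of similar length.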
Following Brian's answer, I made a JavaScript version of his code: http://jsfiddle.net/gmoz22/CPGY2/.
/**
 * Trims a string for older browsers.
 * Defined only if trim() is not already available on String.prototype,
 * since overriding a native method is a huge performance hit (this guard is
 * generally recommended when extending native objects).
 * It must be defined before the first call below.
 */
if (!String.prototype.trim)
{
    String.prototype.trim = function(){return this.replace(/^\s+|\s+$/g, '');};
}
// Input text
var txt = "This is a really long string that should be broken up onto lines of about the same number of characters.";
// Number of lines
var numLines = 3;
/* Do it, result comes as an array: */
var aResult = splitLinesByClosestWhitespace(txt, numLines);
/* Output result: */
if (aResult)
{
    for (var x = 1; x <= numLines; x++)
        document.write( "Line " + x + ": " + aResult[x] + "<br>" );
} else {
    document.write("Not enough spaces to generate the lines!");
}
/**********************/
// Original algorithm by http://stackoverflow.com/questions/2381525/algorithm-split-a-string-into-n-parts-using-whitespaces-so-all-parts-have-nearl/2381772#2381772, rewritten for JavaScript by Steve Oziel
/**
 * Splits a string into multiple lines of the closest possible same length,
 * using the closest whitespaces
 * @param {string} txt String to split
 * @param {integer} numLines Number of lines
 * @returns {Array|false} Lines at indexes 1..numLines, or false if there are too few spaces
 */
function splitLinesByClosestWhitespace(txt, numLines)
{
    var p = Math.floor(txt.length / numLines);
    var aTxtIndx = [];
    var aTxt = [];
    var x;
    // Check we have enough whitespaces to generate the number of lines
    var wsCount = txt.split(" ").length - 1;
    if (wsCount < numLines)
        return false;
    // Get the indexes
    for (x = 1; x <= numLines; x++)
    {
        aTxtIndx[x] = FindClosestWordIndex(txt, p * (x-1) );
    }
    // Do the split
    for (x = 1; x <= numLines; x++)
    {
        if (x != numLines)
            aTxt[x] = txt.slice(aTxtIndx[x], aTxtIndx[x+1]).trim();
        else
            aTxt[x] = txt.slice(aTxtIndx[x]).trim();
    }
    return aTxt;
}
/**
 * Finds the closest word to a string index
 * @param {string} s String to search
 * @param {integer} startIndex Index at which to find the closest word
 * @returns {integer}
 */
function FindClosestWordIndex(s, startIndex)
{
    var wordAfterIndex = 0;
    var wordBeforeIndex = 0;
    var i;
    // Scan forward for the next space
    for (i = startIndex; i < s.length; i++)
    {
        if (s[i] == ' ')
        {
            wordAfterIndex = i;
            break;
        }
    }
    // Scan backward for the previous space
    for (i = startIndex; i >= 0; i--)
    {
        if (s[i] == ' ')
        {
            wordBeforeIndex = i;
            break;
        }
    }
    // Pick whichever space is nearer to startIndex
    if (wordAfterIndex - startIndex <= startIndex - wordBeforeIndex)
        return wordAfterIndex;
    else
        return wordBeforeIndex;
}
It works fine when the number of desired lines is not too close to the number of whitespaces. In the example I gave, there are 19 whitespaces and it starts to misbehave when you ask to break it into 17, 18 or 19 lines. Edits welcome!
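One possible way around that edge case, sketched in Python rather than JavaScript: collect the positions of all spaces first, then pick for each cut the nearest space that has not been used yet while keeping enough spaces in reserve for the remaining cuts. The function name `split_by_closest_space` is hypothetical, and the sketch assumes words are separated by single spaces with no leading or trailing whitespace.

```python
def split_by_closest_space(text, num_lines):
    """Cut text at the spaces nearest to the ideal, evenly spaced cut points."""
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    cuts_needed = num_lines - 1
    if num_lines < 1 or len(spaces) < cuts_needed:
        return None  # not enough spaces for the requested number of lines

    ideal = len(text) / num_lines
    cuts = []
    first = 0  # index into `spaces` of the first space still available
    for k in range(1, num_lines):
        # keep enough spaces in reserve for the cuts that come after this one
        last = len(spaces) - (cuts_needed - k)
        candidates = range(first, last)
        pick = min(candidates, key=lambda idx: abs(spaces[idx] - k * ideal))
        cuts.append(spaces[pick])
        first = pick + 1

    bounds = [0, *cuts, len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


txt = "This is a really long string that should be broken up onto lines of about the same number of characters."
for line in split_by_closest_space(txt, 3):
    print(line)
```

Because every cut is taken from the list of real space positions and the cuts are strictly increasing, asking for as many lines as there are words simply yields one word per line, and asking for more returns `None`.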