What's the simplest way to remove extraneous leading numbers?

I have data that is reliably in this format:

    1. New York Times - USA
    2. Guardian - UK
    3. Le Monde - France

I'm using this code to parse out the newspaper and country values:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    if (hyphenIndex > -1)
    {
        newspaper = unparsedText.substring(0, hyphenIndex);
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

But this produces newspaper values of:

    1. New York Times
    2. Guardian
    3. Le Monde

What's the simplest change to make to end up with newspaper values of:

    New York Times
    Guardian
    Le Monde

Here is a regex based solution:

    // Strings are immutable, so keep the result of replaceAll
    input = input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");

The regex works in multiline mode (?m) and deletes:

- Leading digit(s) at the start of a line, followed by a dot, followed by any amount of whitespace.
- Optional whitespace, then a hyphen, then everything after it up to the end of the line.

I'm assuming there are no hyphens in the newspaper name.
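
To see what that pattern does with the sample data from the question, here is a minimal sketch. The class and method names (`StripDemo`, `newspapersOnly`) are made up for illustration, and the input is assumed to be the three lines joined with `\n`.

```java
public class StripDemo {
    // Hypothetical helper wrapping the replaceAll call from this answer.
    // (?m) makes ^ and $ match at the start and end of every line.
    static String newspapersOnly(String input) {
        return input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");
    }

    public static void main(String[] args) {
        String input = "1. New York Times - USA\n"
                     + "2. Guardian - UK\n"
                     + "3. Le Monde - France";
        // Each line loses its leading "N. " and its trailing " - Country".
        System.out.println(newspapersOnly(input));
    }
}
```

Note that this throws the country away, so it only fits if you need the newspaper names alone. A hyphen inside a name would cut the name short at that hyphen.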

Surely just find the index of the first '.' and use substring(from,to) to get the bit in the middle.

Something like:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    int dotIndex = unparsedText.indexOf(".");
    if (hyphenIndex > -1)
    {
        // Take the text between the first '.' and the first '-',
        // then drop the spaces left on either side.
        newspaper = unparsedText.substring(dotIndex + 1, hyphenIndex).trim();
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

If it really is reliably in that format, it seems that the easiest (and likely most efficient) way to do this would be to find the first instance of the . character, and then take a substring starting from dotIndex + 1. In fact you could combine this with your current substring operation (based on the position of the dash) to extract the newspaper name in one go.

If the format is a little less reliable, you could use a regex to match digits followed by a separator character followed by whitespace, and remove that. But in this case, that seems like overkill.
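
For completeness, a sketch of what that regex variant might look like. The helper name `stripLeadingNumber` and the set of accepted separators (`.`, `)`, `:`) are assumptions for illustration, not anything taken from the question.

```java
public class LeadingNumber {
    // Hypothetical helper: removes an optional prefix such as "1. ", "12) " or " 3:".
    // Lines without a number-plus-separator prefix are returned unchanged.
    static String stripLeadingNumber(String line) {
        return line.replaceFirst("^\\s*\\d+\\s*[.):]\\s*", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingNumber("1. New York Times"));
        System.out.println(stripLeadingNumber("12) Guardian"));
        System.out.println(stripLeadingNumber(" 3:Le Monde"));
    }
}
```

Because the pattern requires a separator after the digits, a name that merely starts with a number and has no separator is left alone.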

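    // Pattern has no public constructor, so use Pattern.compile.
    // Starting with a letter avoids an empty match at the leading digit.
    java.util.regex.Matcher m = java.util.regex.Pattern.compile("[a-zA-Z][a-zA-Z ]*").matcher(unparsedText);
    if (m.find()) {
        System.err.println(unparsedText.substring(m.start(), m.end()).trim());
    }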

Note #1: assuming newspaper cannot contain numbers.

Note #2: haven't tested.

If the entries all follow the format you gave, you could look for the full stop after the number, e.g.

    int dotIndex = unparsedText.indexOf(".");

and then

    newspaper = unparsedText.substring(dotIndex + 2, hyphenIndex - 1);

Note that you want to start 2 characters after the . and stop 1 character before the - to exclude the surrounding spaces. Alternatively, use trim().

String#split(String regex) would work if you split on . and -.

    [0] => "1"
    [1] => " New York Times "
    [2] => " USA"

Then just trim the results you want.
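
A sketch of that approach, assuming every line contains both a `.` and a `-`. The class and method names (`SplitDemo`, `parse`) are made up for illustration. Since `split` takes a regex, the two delimiters go in a character class, where `.` is literal.

```java
public class SplitDemo {
    // Hypothetical helper: returns { newspaper, country } for one input line.
    static String[] parse(String unparsedText) {
        // A limit of 3 keeps any later '.' or '-' inside the last part.
        String[] parts = unparsedText.split("[.-]", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Unexpected format: " + unparsedText);
        }
        // parts[0] is the leading number; trim the other two pieces.
        return new String[] { parts[1].trim(), parts[2].trim() };
    }

    public static void main(String[] args) {
        String[] result = parse("1. New York Times - USA");
        System.out.println("Newspaper: " + result[0] + " Country: " + result[1]);
    }
}
```

As with the other answers, this breaks if the newspaper name itself contains a `.` or a `-`.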

This regex should work:

    Pattern pattern = Pattern.compile("\\d+\\.\\s(.*)\\s-.*");
    Matcher matcher = pattern.matcher("1. New York Times - USA");
    // A match must be attempted before any group can be read
    Assert.assertTrue(matcher.matches());
    String newspaper = matcher.group(1);
    Assert.assertEquals("New York Times", newspaper);

I would do it like this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Application
    {
        public static void main ( final String[] args )
        {
            final String[] lines = new String[] { "1. New York Times - USA", "2. Guardian - UK", "3. Le Monde - France" };

            final Pattern p = Pattern.compile ( "\\.\\s+(.*?)\\s+-\\s+(.*)" );

            for ( final String unparsedText : lines )
            {
                String newspaper;
                String country;

                final Matcher m = p.matcher ( unparsedText );

                if ( m.find () )
                {
                    newspaper = m.group ( 1 );
                    country = m.group ( 2 );

                    System.out.println ( "Newspaper: " + newspaper + " Country: " + country );
                }
            }
        }
    }