Recognise an arbitrary date string [closed]

Developer https://www.devze.com 2023-01-18 18:51 Source: Internet
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 1 year ago.

I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and day (e.g. 12/12/10); I just need to classify the string as being a date, rather than convert it to a Date object. So this is really a classification problem rather than a parsing problem.

I will have pieces of text such as:

"bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"

and I need to be able to recognise the start and end boundary for each date string within.

I was wondering if anyone knew of any java libraries that can do this. My google-fu hasn't come up with anything so far.

UPDATE: I need to be able to recognise the widest possible set of ways of representing dates. Of course, the naive solution might be to write an if statement for every conceivable format, but a pattern recognition approach with a trained model is ideally what I'm after.


Use JChronic

You may want to use DateParser2 from edu.mit.broad.genome.utils package.


You can loop all available date formats in Java:

for (Locale locale : DateFormat.getAvailableLocales()) {
    for (int style = DateFormat.FULL; style <= DateFormat.SHORT; style++) {
        DateFormat df = DateFormat.getDateInstance(style, locale);
        try {
            df.parse(dateString);
            // Parsed successfully: either return true or return the Date object obtained
        } catch (ParseException ex) {
            continue; // unparseable, try the next one
        }
    }
}

This however won't account for any custom date formats.


Rules that might help you in your quest:

  1. Make or find some sort of database of known words that match months, both abbreviated and full names, like Jan or January. The search must be case insensitive, because fEBruaRy is also a month, although the person typing it must have been drunk. If you plan to search for non-English months, a database is also needed, because no heuristic will find out that "Wrzesień" is Polish for September.
  2. For English only, check for ordinal numbers and also make a database for the numbers 1 to 31. These will be useful for days and months. If you want to use this approach for other languages, then you will have to do your own research.
  3. Once again for English only, check for "Anno Domini" and "Before Christ", that is, AD and BC respectively. They can also appear in the forms A.D. and B.C.
  4. Concerning numbers themselves that will represent days, months and years, you must know where your limit is. Is it 0-9999, or more? That is, do you want to search for dates that represent years beyond year 9999? If no, then strings that have 1-4 consecutive digits are good guesses for a valid day, month or year.
  5. Days and months have one or two digits. Leading zeros are acceptable, so strings with a format of 0*, where * can be 1-9 are acceptable.
  6. Separators can be tricky, but if you don't allow inconsistent formatting like 10/20\1999, then you will save yourself a lot of grief. This is because 10*20*1999 can be a valid date, with * usually being one element of set {-,_, ,:,/,\,.,','}, but it's possible that * is a combination of 2 or 3 elements of mentioned set. Once again, you must choose acceptable separators. 10?20?1999 can be a valid date for somebody with a weird sense of elegance. 10 / 20 / 1999 can also be a valid date, but 10_/20_/1999 would be a very strange one.
  7. There are cases with no separator. For example: 10Jan1988. These cases use words from 1.
  8. There are special cases, like February 28th or 29th, depending on leap year. Also, months with 30 or 31 days.

I think these are enough for a "naive" classification, a linguist expert might help you more.
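Rules 1, 4 and 5 above can be sketched as small token classifiers. This is only an illustration under stated assumptions: the class and method names (DateTokens, isMonthWord, isDayNumber, isYearNumber) are hypothetical, only English month names are covered, and the three-letter abbreviations are simply taken from the first three letters of each full name.

```java
import java.time.Month;
import java.time.format.TextStyle;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class DateTokens {
    private static final Set<String> MONTH_WORDS = new HashSet<>();

    static {
        for (Month m : Month.values()) {
            String full = m.getDisplayName(TextStyle.FULL, Locale.ENGLISH)
                           .toLowerCase(Locale.ROOT);
            MONTH_WORDS.add(full);                  // e.g. "january"
            MONTH_WORDS.add(full.substring(0, 3));  // e.g. "jan" (assumed abbreviation)
        }
    }

    // Rule 1: case-insensitive lookup in the month "database".
    static boolean isMonthWord(String token) {
        return MONTH_WORDS.contains(token.toLowerCase(Locale.ROOT));
    }

    // Rule 5: one or two digits, leading zero allowed, value 1..31.
    static boolean isDayNumber(String token) {
        if (!token.matches("\\d{1,2}")) {
            return false;
        }
        int value = Integer.parseInt(token);
        return value >= 1 && value <= 31;
    }

    // Rule 4: one to four consecutive digits, i.e. 0..9999.
    static boolean isYearNumber(String token) {
        return token.matches("\\d{1,4}");
    }
}
```

For other languages the month set would have to be filled from a separate word list, as rule 1 points out.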

Now, an idea for your algorithm. Speed doesn't matter. There might be multiple passes over the same string. Optimize when it starts to matter. When you doubt that you have found a date string, store it somewhere "safe" in a ListOfPossibleDates and do an examination once again, with more rigid rules using combinations from 1. to 8. When you believe a date string is valid, feed it to the Date class to see if it's really valid. 32nd March 1999 is not valid, when you convert it to a format that Date will understand.
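The final "is it really valid" check can be sketched as below. Note the assumption: instead of the older Date class mentioned above, this uses java.time.LocalDate, whose factory method rejects impossible day/month combinations (including February 29th in non-leap years, see rule 8). The class name CalendarCheck is hypothetical.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

// Hypothetical helper: strict calendar check for an already extracted candidate.
final class CalendarCheck {
    // Returns true only if the year/month/day combination exists in the ISO calendar.
    static boolean isRealDate(int year, int month, int day) {
        try {
            LocalDate.of(year, month, day); // throws for 32 March, 29 Feb 1999, month 13, ...
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }
}
```

With this, isRealDate(1999, 3, 32) is rejected while isRealDate(2000, 2, 29) is accepted.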

One important recurring pattern is lookbehind and lookahead. When you believe a valid entity (day, month, year) has been found, you'll have to see what lies before and after it. A stack-based mechanism or recursion might help here.

Steps:

  1. Search your string for words from rule 1. If you find any of them, note that location. Note the month. Now, go a few characters behind and a few ahead to see what awaits you. If there are no spaces before and after your month, and there are numbers, like in rule 7., check them for validity. If one of them represents a day (must be 1-31) and the other a year (must be 0-9999, possibly with AD or BC), you have one candidate. If there are the same separators before and after, look for rules from 6. Always remember that you must be sure that a valid combination exists. So, 32Jan1999 won't do.
  2. Search your string for other English words, from rules 2. and 3. Proceed as in step 1.
  3. Search for separators. Empty space will be the trickiest. Try to find them in pairs. So, if you have one "/" in your string, find another one and see what they have in between. If you find a combination of separators, do the same thing. Also, use the algorithm from step 2.
  4. Search for digits. Valid ones are 0-9999 with leading zeroes allowed. If you find one, look for separators like in step 3.

Since there are countless possibilities, you won't be able to catch them all. Once you have found a pattern that you believe could occur again, store it somewhere so you can use it as a regex for parsing other strings.

Let's take your example, "bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla". After you extract the first date, 12 Jan 09, then use the rest of that string ("bla bla bla 01/04/10 bla bla bla") and apply all above steps once again. This way you'll be sure you didn't miss anything.
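Step 1 (find a month word, then look behind and ahead) could be sketched roughly as follows. This is a minimal sketch under several assumptions: only the whitespace-separated "day month year" order is handled, only English month names are recognised, punctuation attached to a token is not stripped, and the names MonthAnchoredScan and findSpans are hypothetical. Numeric forms such as 01/04/10 are left to the later steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of step 1: anchor on a month word, inspect its neighbours.
final class MonthAnchoredScan {
    private static final Pattern TOKEN = Pattern.compile("\\S+");
    private static final Pattern MONTH = Pattern.compile(
        "jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
        + "|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?",
        Pattern.CASE_INSENSITIVE);

    // Returns {start, end} offsets (end exclusive) of each "day month year" hit.
    static List<int[]> findSpans(String text) {
        List<int[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }
        List<int[]> spans = new ArrayList<>();
        for (int i = 1; i + 1 < tokens.size(); i++) {
            String before = slice(text, tokens.get(i - 1)); // lookbehind
            String current = slice(text, tokens.get(i));
            String after = slice(text, tokens.get(i + 1));  // lookahead
            if (MONTH.matcher(current).matches() && isDay(before) && isYear(after)) {
                spans.add(new int[] {tokens.get(i - 1)[0], tokens.get(i + 1)[1]});
            }
        }
        return spans;
    }

    private static String slice(String text, int[] span) {
        return text.substring(span[0], span[1]);
    }

    private static boolean isDay(String s) {
        return s.matches("\\d{1,2}") && Integer.parseInt(s) >= 1 && Integer.parseInt(s) <= 31;
    }

    private static boolean isYear(String s) {
        return s.matches("\\d{2}|\\d{4}");
    }
}
```

On the example string this reports a single span covering "12 Jan 09"; the offsets it returns are the start and end boundaries the question asks for.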

I hope these suggestions will be of at least some help. If no library exists that does all these dirty steps (and more) for you, then you have a tough road ahead. Good luck!


I did it with a huge regex (self created):

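import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\\b" is the regex word boundary; a bare "\b" in a Java string literal would be a backspace character
public static final String DATE_REGEX = "\\b([0-9]{1,2} ?([\\-/\\\\] ?[0-9]{1,2} ?| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ?)([\\-/\\\\]? ?('?[0-9]{2}|[0-9]{4}))?)\\b";
public static final Pattern DATE_PATTERN = Pattern.compile(DATE_REGEX, Pattern.CASE_INSENSITIVE); // Case insensitive is to match also "mar" and not only "Mar" for March

public static boolean containsDate(String str)
{
    Matcher matcher = DATE_PATTERN.matcher(str);
    // matches() tests the whole string; use matcher.find() to search inside longer text
    return matcher.matches();
}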

This matches the following dates:

06 Sep 2010
12-5-2005
07 Mar 95
30 DEC '99
11\9\2001

But not these:

444/11/11
bla11/11/11
11/11/11blah

It also matches dates between symbols like [],(), ,:

Yesterday (6 nov 2010)

It matches dates without year:

Yesterday, 6 nov, was a rainy day...

But it also matches:

86-44/1234
00-00-0000
11\11/11

And these no longer look like dates. But this is something you can solve by checking whether the numbers are possible values for a month, day and year.
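One way to do that check is sketched below. It is not the regex above but a narrower, hypothetical one for purely numeric dates: a backreference forces both separators to be the same, and the first two numbers are then range-checked in either day/month or month/day order. The names NumericDateFilter and findPlausible are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: numeric dates only, with consistent separators and a range check.
final class NumericDateFilter {
    // Group 2 captures the first separator; \2 requires the second one to be identical.
    private static final Pattern NUMERIC = Pattern.compile(
        "\\b(\\d{1,2})([-/\\\\])(\\d{1,2})\\2(\\d{4}|\\d{2})\\b");

    // Returns every substring that looks like a plausible numeric date.
    static List<String> findPlausible(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = NUMERIC.matcher(text);
        while (m.find()) {
            int first = Integer.parseInt(m.group(1));
            int second = Integer.parseInt(m.group(3));
            // Accept either ordering, since the question does not need to tell them apart.
            boolean dayMonth = first >= 1 && first <= 31 && second >= 1 && second <= 12;
            boolean monthDay = first >= 1 && first <= 12 && second >= 1 && second <= 31;
            if (dayMonth || monthDay) {
                hits.add(m.group());
            }
        }
        return hits;
    }
}
```

Inside the loop, m.start() and m.end() give the boundaries of each hit if offsets are needed rather than the substring. Day-in-month limits (30 vs. 31 days, leap years) are still not checked here.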


Here is a simple Natty example:

import com.joestelmach.natty.*;
import java.util.Date;
import java.util.List;

List<Date> dates = new Parser().parse("Start date 11/30/2013 , end date Friday, Sept. 7, 2013").get(0).getDates();
System.out.println(dates.get(0));
System.out.println(dates.get(1));

// Output:
// Sat Nov 30 11:14:30 BDT 2013
// Sat Sep 07 11:14:30 BDT 2013


I am sure researchers in information extraction have looked at this problem, but I couldn't find a paper.

One thing you can try is a three-step process. (1) After collecting as much data as you can, extract features. Some features that come to mind: the number of numbers that appear in the string, the number of numbers from 1-31, the number of numbers from 1-12, the number of month names, and so on. (2) Learn from the features using some type of binary classification method (an SVM, for example). (3) When a new string comes by, extract its features and query the SVM for a prediction.
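The feature-extraction part could look roughly like this. It is a sketch of the four features named above only; the class name DateFeatures, the feature order and the English-only month list are assumptions, and training or querying the classifier is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical feature extractor for a date / not-date classifier.
final class DateFeatures {
    private static final Pattern NUMBER = Pattern.compile("\\d+");
    private static final Pattern MONTH = Pattern.compile(
        "\\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
        + "|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns {numbers, numbers in 1..31, numbers in 1..12, month names}.
    static int[] extract(String s) {
        int numbers = 0;
        int upTo31 = 0;
        int upTo12 = 0;
        int months = 0;
        Matcher n = NUMBER.matcher(s);
        while (n.find()) {
            numbers++;
            // Only runs of one or two digits are range-checked (also avoids overflow).
            if (n.group().length() <= 2) {
                int value = Integer.parseInt(n.group());
                if (value >= 1 && value <= 31) upTo31++;
                if (value >= 1 && value <= 12) upTo12++;
            }
        }
        Matcher m = MONTH.matcher(s);
        while (m.find()) {
            months++;
        }
        return new int[] {numbers, upTo31, upTo12, months};
    }
}
```

The resulting vectors would then be fed, together with date / not-date labels, to whatever SVM or other binary classifier implementation you choose.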


java.time

You can specify as many custom patterns as you wish using DateTimeFormatter. All you need to do is to specify patterns as optional by enclosing them within square brackets. DateTimeFormatterBuilder provides you with many more things e.g. case-insensitive parsing, defaulting to a missing unit (e.g. HOUR_OF_DAY) etc.

Demo:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.stream.Stream;

public class Main {
    public static void main(String[] args) {
        // DateTimeFormatter parser = DateTimeFormatter.ofPattern("[M/d/uu[ H:m]][d MMM u][M.d.u][E MMM d, u]", Locale.ENGLISH);
        final DateTimeFormatter parser = new DateTimeFormatterBuilder()
                    .parseCaseInsensitive() // parse in case-insensitive manner
                    .appendPattern("[M/d/uu[ H:m]][d MMM u][M.d.u][E MMM d, u]")
                    .toFormatter(Locale.ENGLISH);
        
        // Test
        Stream.of(
                    "Thu Apr 1, 2021",
                    "THU Apr 1, 2021",
                    "01/06/10",
                    "1 Jan 2009",
                    "1.2.2010",
                    "asdf"
                ).forEach(s -> {
                    try {
                        LocalDate.parse(s, parser);
                        System.out.println(true);
                    } catch(DateTimeParseException e) {
                        System.out.println(false);
                    }
                });     
    }   
}

Output:

true
true
true
true
true
false

Learn more about the modern date-time API from Trail: Date Time.
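Since the question asks for start and end boundaries inside longer text, the same kind of formatter can be tried at each position with DateTimeFormatter.parseUnresolved and a ParsePosition. The following is only a sketch under assumptions: the pattern list is a shortened version of the one in the demo, the class and method names (FormatterScan, findSpans, tryParseAt) are hypothetical, and candidates are required to start and end at a non-alphanumeric boundary.

```java
import java.text.ParsePosition;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: scan free text for substrings the formatter accepts.
final class FormatterScan {
    private static final DateTimeFormatter PARSER = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("[M/d/uu][d MMM u][M.d.u]")
            .toFormatter(Locale.ENGLISH);

    // Returns {start, end} offsets (end exclusive) of each recognised date.
    static List<int[]> findSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            boolean atBoundary = i == 0 || !Character.isLetterOrDigit(text.charAt(i - 1));
            int end = atBoundary ? tryParseAt(text, i) : -1;
            if (end > i) {
                spans.add(new int[] {i, end});
                i = end;   // continue after the date just found
            } else {
                i++;
            }
        }
        return spans;
    }

    // Returns the end offset of a date starting at 'start', or -1 if there is none.
    private static int tryParseAt(String text, int start) {
        ParsePosition pos = new ParsePosition(start);
        TemporalAccessor parsed = PARSER.parseUnresolved(text, pos);
        int end = pos.getIndex();
        // All sections are optional, so "nothing consumed" also counts as a failure.
        if (parsed == null || pos.getErrorIndex() >= 0 || end == start) {
            return -1;
        }
        if (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
            return -1;     // the match stops in the middle of a word or number
        }
        try {
            LocalDate.parse(text.substring(start, end), PARSER); // resolve to a real date
            return end;
        } catch (DateTimeParseException e) {
            return -1;
        }
    }
}
```

Trying every position is slow for large inputs, but it keeps the idea visible: the formatter decides what a date is, and the ParsePosition reports where it ends.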


Maybe you should use regular expressions?

Hopefully this one would work for mm-dd-yyyy format:

^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$

Here (0[1-9]|1[012]) matches a month 01..12, (0[1-9]|[12][0-9]|3[01]) matches a day 01..31 and (19|20)\d\d matches a year 1900..2099.

Fields can be delimited by a dash, a space, a slash or a dot.
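In Java the expression could be used as below. This is a small sketch; the class name UsDateRegex and the method name looksLikeDate are hypothetical. Note that the backslashes must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the mm-dd-yyyy expression above.
final class UsDateRegex {
    private static final Pattern MM_DD_YYYY = Pattern.compile(
        "^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\\d\\d$");

    // True if the whole string has the mm-dd-yyyy shape with in-range fields.
    static boolean looksLikeDate(String s) {
        return MM_DD_YYYY.matcher(s).matches();
    }
}
```

The expression only checks field ranges, so a string such as 02/31/2010 still passes, and the two separators are allowed to differ.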

Regards, Serge


It is virtually impossible to recognize all possible date formats as dates using "standard" algorithms. That's just because there are so many of them.

We humans are capable of doing that only because we have learned that something like 2010-03-31 resembles a date. In other words, I would suggest using machine learning algorithms and teaching your program to recognize valid date sequences. With Google Prediction API that should be feasible.

Or you can use Regular Expressions as suggested above, to detect some but not all date formats.


What I would do is look for date characteristics, rather than the dates themselves. For example, you could search for slashes (to get dates of the form 1/1/1001), dashes (1 - 1 - 1001), and month names and abbreviations (Jan 1 1001 or January 1 1001). When you get a hit for one of these, collect the nearby words (2 on each side should be fine) and store them in an array of strings. Once you have scanned all the input, check this string array with a function that goes into a bit more depth and pulls out the actual date strings, using the methods found here. The important thing is just getting the general dates down to a manageable level.
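The first pass described above might be sketched like this. The assumptions are: tokens are split on whitespace, a "hit" is any token containing a slash or dash or equal to an English month name, and the names CandidateWindows and windows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical first pass: collect small word windows around date-like hints.
final class CandidateWindows {
    private static final Pattern MONTH = Pattern.compile(
        "jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
        + "|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?",
        Pattern.CASE_INSENSITIVE);

    private static boolean isHint(String token) {
        return token.indexOf('/') >= 0
            || token.indexOf('-') >= 0
            || MONTH.matcher(token).matches();
    }

    // For every hint token, returns it together with 'radius' words on each side.
    static List<String> windows(String text, int radius) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (isHint(tokens[i])) {
                int from = Math.max(0, i - radius);
                int to = Math.min(tokens.length, i + radius + 1);
                out.add(String.join(" ", Arrays.copyOfRange(tokens, from, to)));
            }
        }
        return out;
    }
}
```

Each returned window is then small enough to hand to a stricter second-pass check.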


Check this one: https://github.com/zoho/hawking. Developed by the ZOHO ZIA team.

Hawking Parser is a Java-based NLP parser for parsing date and time information. The most popular parsers out there like Heidel Time, SuTime, and Natty Date time parser are distinctly rule-based. As such, they often tend to struggle with parsing date/time information where more complex factors like context, tense, multiple values, and more need to be considered.

With this in mind, Hawking Parser is designed to address a lot of these challenges and has many distinct advantages over other available date/time parsers.

It's an open-source library under GPL v3, and the best one. To see why it's the best, check out this blog post that explains it in detail: https://www.zoho.com/blog/general/zias-nlp-based-hawking-date-time-parser-is-now-open-source.html

P.S: I'm one of the developers of this project


Usually dates are characters separated by a back/forward slash or a dash. Did you consider a regular expression?

I am assuming you are not looking to classify dates of the type Sunday, October 3rd 2010 and so on


I don't know of any library that can do this, but writing your own wouldn't be incredibly hard. Assuming your dates are all formatted with slashes like 12/12/12, you could verify that the string contains two '/' characters. You could go further and check the values between the slashes. For instance, if you have:

30/12/10

Then you know that 30 is the day and 12 is the month. However, if you get 30/30/10, you know that even though it has the correct format, it cannot be a date because there is no month 30.
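A minimal sketch of that check follows, assuming slash-separated numeric fields and accepting either day/month or month/day order. The class name SlashDateCheck is hypothetical.

```java
// Hypothetical sketch: split on '/', then check the values between the slashes.
final class SlashDateCheck {
    static boolean looksLikeDate(String s) {
        String[] parts = s.split("/", -1);      // -1 keeps empty trailing fields
        if (parts.length != 3) {
            return false;                        // not exactly two slashes
        }
        for (String part : parts) {
            if (!part.matches("\\d{1,4}")) {
                return false;                    // every field must be a short number
            }
        }
        int first = Integer.parseInt(parts[0]);
        int second = Integer.parseInt(parts[1]);
        boolean dayMonth = first >= 1 && first <= 31 && second >= 1 && second <= 12;
        boolean monthDay = first >= 1 && first <= 12 && second >= 1 && second <= 31;
        return dayMonth || monthDay;
    }
}
```

So 30/12/10 is accepted, while 30/30/10 is rejected because neither number can be a month.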


I don't know of any library that does this either. I would suggest a mix of nested recursive functions and regular expressions (a lot) to match strings and try to come up with a best guess to see if it can be a date. Dates can be written in a lot of different ways, some people might write them out as "Sunday, October 3 2010" or "Sunday, October 3rd 2010" or "10/03/2010" or "10/3/2010" and a whole bunch of different ways (even more if you are considering dates in other languages/cultures).


You could always check to see if there are two '/' characters in a string.

public static boolean isDate(String date) { // e.g. isDate("12/25/2010")
    int counter = 0;
    for (int i = 0; i < date.length(); i++) {
        if ("/-.".indexOf(date.charAt(i)) != -1) // Any separator symbol can be used.
            counter++;
    }
    return counter == 2; // true if there are exactly two separator symbols in the string
}

You can do something similar to check to see if everything else is an integer.
