How do I translate this Perl regular expression in

Posted 2020-04-09 08:14

How would you translate this Perl regex into Java?

/pattern/i

While this compiles, it does not match "PattErn" for me; it fails:

Pattern p = Pattern.compile("/pattern/i");
Matcher m = p.matcher("PattErn");

System.out.println(m.matches()); // prints "false"

Tags: java regex perl
3 answers
萌系小妹纸
Reply #2 · 2020-04-09 08:35

Java regexes do not have delimiters, and use a separate argument for modifiers:

 Pattern p = Pattern.compile("pattern", Pattern.CASE_INSENSITIVE);
The star\"
Reply #3 · 2020-04-09 08:36

The equivalent of Perl's:

/pattern/i

in Java would be:

Pattern p = Pattern.compile("(?i)pattern");

Or simply do:

System.out.println("PattErn".matches("(?i)pattern"));

Note that "string".matches("pattern") validates the pattern against the entire input string. In other words, the following would return false:

"foo pattern bar".matches("pattern")
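To make the whole-string behaviour concrete, here is a minimal sketch contrasting matches() with find() on the same compiled pattern. The class and method names (MatchVersusFind, matchesWhole, findsAnywhere) are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Pattern;

public class MatchVersusFind {
    // (?i) is the inline form of Pattern.CASE_INSENSITIVE
    private static final Pattern PATTERN = Pattern.compile("(?i)pattern");

    // True only if the entire input is the pattern (hypothetical helper)
    static boolean matchesWhole(String input) {
        return PATTERN.matcher(input).matches();
    }

    // True if the pattern occurs anywhere in the input, which is how
    // Perl's =~ /pattern/i behaves (hypothetical helper)
    static boolean findsAnywhere(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("PattErn"));          // true
        System.out.println(matchesWhole("foo pattern bar"));  // false
        System.out.println(findsAnywhere("foo PaTTerN bar")); // true
    }
}
```

If you want Perl-style "does it occur somewhere" semantics, find() is usually the call to reach for.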
啃猪蹄的小仙女
Reply #4 · 2020-04-09 08:39

How would you translate this Perl regex into Java?

/pattern/i

You can't.

There are a lot of reasons for this. Here are a few:

  • Java doesn't support as expressive a regex language as Perl does. It lacks grapheme support (like \X) and full property support (like \p{Sentence_Break=SContinue}), is missing Unicode named characters, doesn't have a (?|...|...|) branch reset operator, doesn’t have named capture groups or a logical \x{...} escape before Java 7, has no recursive regexes, etc etc etc. I could write a book on what Java is missing here: Get used to going back to a very primitive and awkward to use regex engine compared with what you’re used to.

  • Another, even worse problem is that you have lookalike faux amis like \w and \b and \s, and even \p{alpha} and \p{lower}, which behave differently in Java compared with Perl; in some cases the Java versions are completely unusable and buggy. That’s because Perl follows UTS#18 but before Java 7, Java did not. You must add the UNICODE_CHARACTER_CLASSES flag from Java 7 to get these to stop being broken. If you can’t use Java 7, give up now, because Java had many many many other Unicode bugs before Java 7 and it just isn’t worth the pain of dealing with them.

  • Java handles linebreaks via ^ and $ and ., but Perl expects Unicode linebreaks to be \R. You should look at UNIX_LINES to understand what is going on there.

  • Java does not by default apply any Unicode casefolding whatsoever. Make sure to add the UNICODE_CASE flag to your compilation. Otherwise you won’t get things like the various Greek sigmas all matching one another.

  • Finally, it is different because at best Java only does simple casefolding, while Perl always does full casefolding. That means that you won’t get \xDF to match "SS" case insensitively in Java, and similar related issues.

In summary, the closest you can get is to compile with the flags

 CASE_INSENSITIVE | UNICODE_CASE | UNICODE_CHARACTER_CLASSES

which is equivalent to an embedded "(?iuU)" in the pattern string.
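As a rough illustration of what those flags change, the sketch below runs a few probes with and without them. It assumes Java 7 or later; the class name UnicodeFlagsDemo, the constant PERLISH, and the helper matchesWith are invented for this example. Non-ASCII characters are written as Unicode escapes so the source file encoding doesn't matter.

```java
import java.util.regex.Pattern;

public class UnicodeFlagsDemo {
    // The flag combination from the summary above (the name is an assumption)
    static final int PERLISH = Pattern.CASE_INSENSITIVE
                             | Pattern.UNICODE_CASE
                             | Pattern.UNICODE_CHARACTER_CLASSES;

    // Hypothetical helper: whole-string match of input against regex with flags
    static boolean matchesWith(int flags, String regex, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String capitalSigma = "\u03A3"; // GREEK CAPITAL LETTER SIGMA
        String smallSigma   = "\u03C3"; // GREEK SMALL LETTER SIGMA
        String eAcute       = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String sharpS       = "\u00DF"; // LATIN SMALL LETTER SHARP S

        // CASE_INSENSITIVE alone only folds US-ASCII letters
        System.out.println(matchesWith(Pattern.CASE_INSENSITIVE, smallSigma, capitalSigma)); // false
        System.out.println(matchesWith(PERLISH, smallSigma, capitalSigma));                  // true

        // \w is ASCII-only unless UNICODE_CHARACTER_CLASSES is set
        System.out.println(matchesWith(0, "\\w", eAcute));       // false
        System.out.println(matchesWith(PERLISH, "\\w", eAcute)); // true

        // The embedded form behaves like the flag form
        System.out.println(matchesWith(0, "(?iuU)" + smallSigma, capitalSigma)); // true

        // Simple case folding only: sharp s does not match "SS"
        System.out.println(matchesWith(PERLISH, sharpS, "SS")); // false
    }
}
```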

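And remember that matches() in Java doesn’t mean “match” in the Perl sense, perversely enough: it succeeds only when the whole string matches, so use find() to search within a string.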


EDIT

And here’s the rest of the story...

While this compiles, it does not match "PattErn" for me; it fails:

   Pattern p = Pattern.compile("/pattern/i");
   Matcher m = p.matcher("PattErn");
   System.out.println(m.matches()); // prints "false"

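You shouldn’t have slashes around the pattern. Java treats the slashes and the trailing i as literal characters to match, not as delimiters and a modifier.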

The best you can do is to translate

$line = "I have your PaTTerN right here";
if ($line =~ /pattern/i) {
    print "matched.\n";
}

this way

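import java.util.regex.*;

public class PerlToJava {
    public static void main(String[] args) {
        String line     = "I have your PaTTerN right here";
        String pattern  = "pattern";
        // the flag constants live on Pattern, so they need the Pattern. prefix
        Pattern regcomp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE
                                                 | Pattern.UNICODE_CASE
                        // comment next line out for legacy Java \b\w\s breakage
                                                 | Pattern.UNICODE_CHARACTER_CLASSES
                                         );
        Matcher regexec = regcomp.matcher(line);
        // find() searches anywhere in the line, like Perl's =~
        if (regexec.find()) {
            System.out.println("matched");
        }
    }
}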

There, see how much easier that isn’t? :)

Another thing you lose with Java, because Java doesn’t actually know a regex from a doubly linked list from a hole in its head, is compile-time compilation of patterns. Me, I’ve always found compile time the best time for compilation, but try telling Java that. Java makes it really tough to realize that very simple program-sanity measure, something you really need to do in every program all the time. This design flaw is a royal pain in the butt, because halfway through your program you take an exception for something that should have been caught during compile time when the rest of your program was being compiled. Just about as exasperating as coitus interruptus, because you were well on your way to getting your business done and BANG everything is ruined.

I didn’t implement the solution to that vexing annoyance in my code above, but you can fake it with some static initialization.
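One way that static-initialization trick might look is sketched below. The class name PrecompiledPatterns and its members are hypothetical. Note that this is still a run-time check: it only moves the failure from an arbitrary call site to class initialization, where a bad regex surfaces as an ExceptionInInitializerError wrapping the PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PrecompiledPatterns {
    // Compiled once, when the class is initialized. A malformed regex
    // fails here rather than halfway through the program.
    static final Pattern WORD = Pattern.compile("pattern",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Reuses the precompiled pattern on every call
    static boolean containsWord(String line) {
        return WORD.matcher(line).find();
    }

    // Hypothetical helper: reports whether a regex string compiles at all,
    // e.g. for validating patterns that come from configuration
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsWord("I have your PaTTerN right here")); // true
        System.out.println(isValidRegex("(unclosed"));                      // false
    }
}
```

Keeping all such constants in one class that is loaded at startup means any typo in a regex shows up immediately instead of on the first code path that happens to use it.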
