Asked by: Kuldeep Singh · Asked: 6/12/2020 · Last edited by: Stefan Zobel, Kuldeep Singh · Updated: 6/13/2020 · Views: 1816
Scanner.findAll() and Matcher.results() work differently for same input text and pattern
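Q:
I came across this interesting behaviour while splitting a properties string with a regex, and I can't find the root cause.
I have a string containing text in the form of properties key=value pairs. I have a regex that splits the string into keys and values based on the position of =. It treats the first = as the split point; the value may also contain =.
I tried two different approaches in Java.
Using the Scanner.findAll() method
This does not behave as expected. It should extract and print all keys based on the pattern, but I found it behaving oddly. I have one key-value pair as below: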
SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important .....}
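The key that should be extracted is SectionError.ErrorMessage=, but it also treats errorlevel= as a key.
The interesting point is that if I remove a single character from the properties String passed in, it behaves correctly and extracts only the SectionError.ErrorMessage= key.
Using the Matcher.results() method
This works fine. No matter what we put in the properties string, there is no problem.
Sample code I tried: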
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

import static java.util.regex.Pattern.MULTILINE;

public class MessageSplitTest {
    static final Pattern pattern = Pattern.compile("^[a-zA-Z0-9._]+=", MULTILINE);

    public static void main(String[] args) {
        final String properties =
            "SectionOne.KeyOne=first value\n" + // removing one char from here would make the scanner method print expected keys
            "SectionOne.KeyTwo=second value\n" +
            "SectionTwo.UUIDOne=379d827d-cf54-4a41-a3f7-1ca71568a0fa\n" +
            "SectionTwo.UUIDTwo=384eef1f-b579-4913-a40c-2ba22c96edf0\n" +
            "SectionTwo.UUIDThree=c10f1bb7-d984-422f-81ef-254023e32e5c\n" +
            "SectionTwo.KeyFive=hello-world-sample\n" +
            "SectionThree.KeyOne=first value\n" +
            "SectionThree.KeyTwo=second value additional text just to increase the length of the text in this value still not enough adding more strings here n there\n" +
"SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message}\n" +
"SectionFour.KeyOne=sixth value\n" +
"SectionLast.KeyOne=Country";
        printKeyValuesFromPropertiesUsingScanner(properties);
        System.out.println();
        printKeyValuesFromPropertiesUsingMatcher(properties);
    }

    private static void printKeyValuesFromPropertiesUsingScanner(String properties) {
        System.out.println("===Using Scanner===");
        try (Scanner scanner = new Scanner(properties)) {
            scanner
                .findAll(pattern)
                .map(MatchResult::group)
                .forEach(System.out::println);
        }
    }

    private static void printKeyValuesFromPropertiesUsingMatcher(String properties) {
        System.out.println("===Using Matcher===");
        pattern.matcher(properties).results()
            .map(MatchResult::group)
            .forEach(System.out::println);
    }
}
Output:
===Using Scanner===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
errorlevel=
SectionFour.KeyOne=
SectionLast.KeyOne=
===Using Matcher===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
SectionFour.KeyOne=
SectionLast.KeyOne=
What could be the root cause of this? Does Scanner's findAll work differently from Matcher?
Please let me know if more information is needed.
A:
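The documentation of Scanner mentions the word "buffer" a lot. This suggests that Scanner does not know about the whole string it is reading from, and only holds a small part of it in a buffer at a time. This makes sense, because Scanners are designed to read from streams as well, and reading everything from a stream might take a long time (or forever!) and use a lot of memory.
In the source code of Scanner, there is indeed a CharBuffer: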
// Internal buffer used to hold input
private CharBuffer buf;
Because of the length and contents of your string, the scanner decided to load everything up to...
SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very...
                          ^
                          somewhere here
(It could be anywhere in the word "errorlevel")
...into the buffer. Then, once that half of the string has been read, the other half of the string starts like this:
errorlevel=Warning {HelpMessage:This is very...
errorlevel= is now at the start of the string, which causes the pattern to match.
Matcher does not use a buffer. It stores the whole string it is matching against in a field:
/**
* The original string being matched.
*/
CharSequence text;
Therefore, this behaviour is not observed with Matcher.
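To make the "start of the string" effect concrete, here is a minimal sketch that uses only Matcher. It does not reproduce Scanner's internals; it merely imitates a buffer boundary by cutting a shortened sample line (my own abbreviation of the line from the question) just before errorlevel. The class name LineStartDemo and the helper keys are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LineStartDemo {
    // Same pattern as in the question.
    static final Pattern PATTERN = Pattern.compile("^[a-zA-Z0-9._]+=", Pattern.MULTILINE);

    // Collects every match of PATTERN in the given text.
    static List<String> keys(String text) {
        return PATTERN.matcher(text).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened sample line (assumption: abbreviated from the question's input).
        String line = "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message}";

        // Whole line visible: only the real key sits at a line start.
        System.out.println(keys(line)); // [SectionError.ErrorMessage=]

        // Text cut just before "errorlevel": the tail now begins the input,
        // so ^ matches there as well.
        String tail = line.substring(line.indexOf("errorlevel"));
        System.out.println(keys(tail)); // [errorlevel=]
    }
}
```

With MULTILINE, ^ matches at the beginning of the input as well as after a line terminator, so whatever happens to be first in the text being matched is treated as a line start.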
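Sweeper's answer is correct: this is an issue of Scanner's buffer not containing the entire string. We can simplify the example to trigger the issue specifically: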
import java.util.Scanner;
import java.util.StringJoiner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

static final Pattern pattern = Pattern.compile("^ABC.", Pattern.MULTILINE);

public static void main(String[] args) {
    String testString = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";
    // pad so that the "ABC4" following "ABC3" starts at offset 1024
    String properties = "X".repeat(1024 - testString.indexOf("ABC4")) + testString;
    String s1 = usingScanner(properties);
    System.out.println("Using Scanner: "+s1);
    String m = usingMatcher(properties);
    System.out.println("Using Matcher: "+m);
    if(!s1.equals(m)) System.out.println("mismatch");
    if(s1.equals(usingScannerNoStream(properties)))
        System.out.println("Not a stream issue");
}

private static String usingScanner(String source) {
    return new Scanner(source)
        .findAll(pattern)
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}

// same matching as findAll, but with a manual loop instead of a Stream
private static String usingScannerNoStream(String source) {
    Scanner s = new Scanner(source);
    StringJoiner sj = new StringJoiner(" + ");
    for(;;) {
        String match = s.findWithinHorizon(pattern, 0);
        if(match == null) return sj.toString();
        sj.add(match);
    }
}

private static String usingMatcher(String source) {
    return pattern.matcher(source).results()
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}
This prints:
Using Scanner: ABC1 + ABC3 + ABC4 + ABC4
Using Matcher: ABC1 + ABC3 + ABC4
mismatch
Not a stream issue
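This example prepends as many X characters as needed to align the start of the false positive match with the size of the buffer. Scanner's initial buffer size is 1024, though it may be enlarged when needed.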
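The padding arithmetic can be checked on its own. The following sketch only inspects the constructed string and does not involve Scanner; the class name PaddingDemo and the helper padded are hypothetical names for illustration, and "X".repeat requires Java 11 or later.

```java
public class PaddingDemo {
    static final String TEST_STRING = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";

    // Builds the padded input the same way as the example above.
    static String padded() {
        return "X".repeat(1024 - TEST_STRING.indexOf("ABC4")) + TEST_STRING;
    }

    public static void main(String[] args) {
        String properties = padded();

        // Offset of the "ABC4" that directly follows "ABC3" in the unpadded string.
        System.out.println(TEST_STRING.indexOf("ABC4")); // 19

        // After padding, that same "ABC4" starts exactly at offset 1024.
        System.out.println(properties.indexOf("ABC4"));  // 1024

        // The last character before that offset is the '3' of "ABC3".
        System.out.println(properties.charAt(1023));     // 3
    }
}
```

So the first 1024 characters end with ABC3, and the next chunk of input begins with ABC4, which is not at the start of a line in the full string.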
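Since findAll ignores the scanner's delimiter, just like findWithinHorizon, this code also shows that a manual loop with findWithinHorizon exhibits the same behavior. In other words, this is not an issue of the Stream API used.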
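Since Scanner will enlarge the buffer when needed, we can work around the issue by first using a match operation that forces the entire contents to be read into the buffer before performing the intended match operation, e.g.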
private static String usingScanner(String source) {
    Scanner s = new Scanner(source);
    s.useDelimiter("(?s).*").hasNext();
    return s
        .findAll(pattern)
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}
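This particular hasNext() call, with a delimiter that consumes the entire string, forces the string to be buffered completely without advancing the position. The subsequent findAll() operation ignores both the delimiter and the result of the hasNext() check, but no longer exhibits the issue, because the buffer is completely filled.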
Of course, this destroys the advantage of Scanner when parsing an actual stream.
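If the input is line-oriented, as the properties text in the question is, one possible alternative that keeps streaming is to skip Scanner and match each line separately. This is a different technique from the workaround above, sketched under the assumption that every key starts at the beginning of a line and that a single line fits in memory. The class name LineByLineKeys and the method keys are hypothetical names for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineByLineKeys {
    // No MULTILINE flag and no ^ needed: each line is matched from its own start.
    static final Pattern KEY = Pattern.compile("[a-zA-Z0-9._]+=");

    // Reads the source line by line and collects the key at the start of each line.
    static List<String> keys(Reader source) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = KEY.matcher(line);
                // lookingAt() only succeeds if the match begins at the start of the line.
                if (m.lookingAt()) {
                    result.add(m.group());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "SectionOne.KeyOne=first value\n"
                + "SectionError.ErrorMessage=errorlevel=Warning\n"
                + "SectionLast.KeyOne=Country";
        System.out.println(keys(new StringReader(input)));
        // [SectionOne.KeyOne=, SectionError.ErrorMessage=, SectionLast.KeyOne=]
    }
}
```

Note that readLine() splits on \n, \r and \r\n only, whereas ^ in MULTILINE mode recognizes a few more line terminators, so the two approaches can differ on unusual input.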