Asked by: T. Dakin  Asked: 12/4/2020  Updated: 12/4/2020  Views: 81
How to use Scanner.useDelimiter() to match two characters next to each other followed by a word?
Q:
I am trying to parse a plain .txt file with the following general structure:
[[Title]]
CATEGORIES: text, text, text
some text etc...
[[Next Title]]
CATEGORIES: text, text, text
Next other text etc ...
In my code, I use this pattern:
Scanner inputScanner = new Scanner(fileEntry);
inputScanner.useDelimiter("\\]\\]|\\[\\[");
while (inputScanner.hasNext()) {
// Get title of wiki article and contents
String wikiName = inputScanner.next();
String wikiContents = inputScanner.next();
}
But it also splits on things like:
"[some text [ some other text ] some more text ]"
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s"
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]"
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]"
"observed is not some nonphysical world of [[consciousness]], mind, or mental life "
I want the scanner to split only when it sees
'[[' or ']] CATEGORIES'
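but I am not sure how to do that, since I am not good with patterns or regular expressions. Can anyone suggest a pattern that might work? I tried looking at other delimiter questions and the javadocs, but had a hard time applying them to my problem. Thanks for taking the time to help!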
A:
1 upvote
Prasanna
12/4/2020
#1
To match the titles correctly, we can use a positive lookahead in the regex:
\[\[(?=.*]]\nCATEGORIES:)|]]\n(?=CATEGORIES:)
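Explanation:
- \[\[(?=.*]]\nCATEGORIES:) matches [[ only when it is followed on the same line by any sequence of characters, then ]], a line break, and the string CATEGORIES:. Because this is a positive lookahead, only the [[ itself is matched and consumed as the delimiter.
- Similarly, ]]\n(?=CATEGORIES:) matches ]] plus the line break only when the string CATEGORIES: follows.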
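To see the effect of the lookahead, the sketch below lists what each delimiter pattern matches in a small sample. The class name DelimiterMatches, the helper find, and the sample text are made up for illustration; only the two patterns come from the question and the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterMatches {
    // Delimiter from the question: matches every [[ or ]].
    static final String LOOSE = "\\]\\]|\\[\\[";
    // Delimiter from the answer: [[ or ]] only next to a CATEGORIES: line.
    static final String STRICT = "\\[\\[(?=.*]]\\nCATEGORIES:)|]]\\n(?=CATEGORIES:)";

    // Hypothetical helper: returns every substring the regex matches in text.
    static List<String> find(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "[[Title]]\nCATEGORIES: a, b\nsee [[link]] here\n";
        System.out.println(find(LOOSE, sample).size());  // 4
        System.out.println(find(STRICT, sample).size()); // 2
    }
}
```

The loose pattern also fires on the inline [[link]], which is why article text ended up being returned as separate tokens; the strict pattern leaves it alone.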
Updated snippet (the delimiter here additionally tolerates whitespace around the line break):
String text = "[[title1]] \n" +
"CATEGORIES: [some text [ some other text ] some more text ]\n" +
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s\n" +
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]\n" +
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]\n" +
"observed is not some nonphysical world of [[consciousness]], mind, or mental life\n" +
"[[title2]]\n" +
"CATEGORIES: [[some more text]]";
Scanner inputScanner = new Scanner(text);
inputScanner.useDelimiter("\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\n(?=\\s*CATEGORIES:)");
while (inputScanner.hasNext()) {
String wikiName = inputScanner.next();
String wikiContents = inputScanner.next();
System.out.printf("Name:%s\nContents:%s\n\n", wikiName, wikiContents);
}
Output:
Name:title1
Contents:CATEGORIES: [some text [ some other text ] some more text ]
[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s
[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]
[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]
observed is not some nonphysical world of [[consciousness]], mind, or mental life
Name:title2
Contents:CATEGORIES: [[some more text]]
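For reuse, the same idea can be wrapped in a small self-contained class. This is only a sketch: the class name WikiSplitter, the method parse, and the choice to trim tokens and return a map are assumptions, not part of the original answer. It also guards the second next() call, which would otherwise throw NoSuchElementException if the input ended right after a title.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

class WikiSplitter {
    // Same delimiter as in the snippet above.
    static final String DELIMITER =
        "\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\\n(?=\\s*CATEGORIES:)";

    // Hypothetical helper: maps each article title to its contents,
    // preserving the order in which the articles appear.
    static Map<String, String> parse(String text) {
        Map<String, String> articles = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                String title = scanner.next().trim();
                // A trailing title without contents yields an empty string.
                String contents = scanner.hasNext() ? scanner.next().trim() : "";
                articles.put(title, contents);
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        String text = "[[A]]\nCATEGORIES: x\nbody [[link]] more\n"
                    + "[[B]]\nCATEGORIES: y\n";
        parse(text).forEach((title, contents) ->
            System.out.printf("Name:%s%nContents:%s%n%n", title, contents));
    }
}
```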
Comments
0 upvotes
T. Dakin
12/5/2020
Thank you very much. If you don't mind explaining, does '?=.*' allow the regex to capture the string inside?
1 upvote
Prasanna
12/5/2020
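@T.Dakin, just like your regex, I use [[ as the delimiter. However, I use a positive lookahead (?=...), so [[ matches only if it is immediately followed by .*]]\nCATEGORIES. The lookahead only checks that text; it does not capture or consume it. You can read more about lookahead at regular-expressions.info/lookaround.html.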
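A short way to check that a lookahead does not consume the text it inspects is to compare the first match with and without it. The class name LookaheadIsZeroWidth, the helper firstMatch, and the sample string are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadIsZeroWidth {
    // Hypothetical helper: the text consumed by the first match, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "[[Title]]\nCATEGORIES: a, b";
        // With the lookahead, only the two brackets are part of the match.
        System.out.println(firstMatch("\\[\\[(?=.*]]\\nCATEGORIES)", text).length()); // 2
        // Without it, the title and the word CATEGORIES are consumed as well.
        System.out.println(firstMatch("\\[\\[.*]]\\nCATEGORIES", text).length());     // 20
    }
}
```

Because the title is not consumed by the delimiter, the Scanner can still return it as the next token.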
Comments
"\\]\\]\s*?CATEGORIES|\\[\\["
Body mass index]] CATEGORIES: Body shape, Human weight, Human height, Medical signs, Ratios, Belgian inventions The body mass index (BMI), or Quetelet index,