java.util.Scanner and Wikipedia

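Q:

I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches. It mostly works, but reading certain words gives me errors. After looking at the code and running some checks, it turns out that for some words the encoding doesn't seem to be recognized, or the content is no longer readable. This is the code used to fetch the page: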

//Start

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    // "\\Z" matches the end of input, so next() returns the whole page as one token
    scanner.useDelimiter("\\Z");
    content = scanner.next();
//  if (word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
//End

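The problem shows up with the word "pubblico" on the Italian Wikipedia. The println output for pubblico looks like this (cut): ï¿ï¿1/2]Ksr>ï¿1/2~E 1/21Aï¿1/2ï¿1/2ï¿1/2Eï¿1/2ER3tHZï¿1/24vï¿1/2ï¿1/2&PZjtcï¿1/2¿1/2ï¿1/2Dï¿1/27_|ï¿1/2ï¿1/2ï¿1/2ï¿1/2=8ï¿1/2ï¿1/2Ø}

Any idea why? Looking at the page source and the headers, they are the same as for other pages, with the same encoding...

It turns out the content is gzip-compressed. Can I tell Wikipedia not to send me compressed pages, or is decompressing it the only way? Thanks.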

wikipedia java.util.scanner

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
// Content-Type looks like "text/html; charset=UTF-8"; it may be null
String ctype = connection.getContentType();
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi >= 0)
    // Decode with the charset the server declared ("charset=" is 8 characters long)
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8)));
else
    // No charset declared: fall back to the platform default
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);

You can also pass the charset directly to the Scanner constructor, as shown in another answer.
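As a rough illustration of passing the charset straight to the constructor, here is a minimal sketch using `Scanner(InputStream, String charsetName)`. The class name `CharsetScannerSketch` and the helpers `charsetOf` and `readAll` are made up for this example, and the in-memory byte array stands in for the real connection stream. It uses `\\A` as the delimiter, a common idiom for reading everything in one token, instead of the `\\Z` used in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Scanner;

class CharsetScannerSketch {

    // Hypothetical helper: pulls the charset out of a Content-Type value such as
    // "text/html; charset=UTF-8", returning the fallback when none is declared.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (idx < 0) {
            return fallback;
        }
        String value = contentType.substring(idx + "charset=".length()).trim();
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end).trim();
        }
        return value.isEmpty() ? fallback : value;
    }

    // Hypothetical helper: reads the whole stream as one token, decoding the
    // bytes with the named charset instead of the platform default.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Stand-in for connection.getInputStream(): UTF-8 bytes held in memory
        byte[] body = "citt\u00e0 pubblico".getBytes(StandardCharsets.UTF_8);
        String charset = charsetOf("text/html; charset=UTF-8", "ISO-8859-1");
        System.out.println(readAll(new ByteArrayInputStream(body), charset));
    }
}
```

With a real connection, the call would presumably look like `readAll(connection.getInputStream(), charsetOf(connection.getContentType(), "UTF-8"))`. Note that this only addresses character decoding; it does nothing about a compressed response body.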

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
// Try to ask for an uncompressed response by sending an empty Accept-Encoding header
connection.addRequestProperty("Accept-Encoding", "");
// Print the Content-Encoding the server actually used
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();

The encoding doesn't change. Why?
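One thing that might be worth trying is to name the `identity` coding explicitly rather than sending an empty header value. The sketch below only prepares the request and does not open the connection. The class name `IdentityRequestSketch` and the helper `prepare` are made up for this example, and nothing in this thread confirms that the server honors the header, so `getContentEncoding()` should still be checked on the response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

class IdentityRequestSketch {

    // Hypothetical helper: prepares (but does not open) a connection that
    // asks the server for an uncompressed body.
    static URLConnection prepare(String pageUrl) {
        try {
            URLConnection connection = new URL(pageUrl).openConnection();
            // setRequestProperty replaces any earlier value for this header,
            // whereas addRequestProperty appends another value to it
            connection.setRequestProperty("Accept-Encoding", "identity");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = prepare("http://it.wikipedia.org/wiki/pubblico");
        // Shows the header value that will be sent; no network traffic happens here
        System.out.println(connection.getRequestProperty("Accept-Encoding"));
    }
}
```

A server is free to ignore the request, so code that reads the body should still be prepared to handle a `gzip` or `deflate` response.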

0 votes Marco Beggio 2/13/2009 #5

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;              // Stream the downloaded page is read from
String encoding = connection.getContentEncoding();    // Content encoding used by the server (identity, gzip, deflate)
// Pick the right decompressor for reading the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if (encoding != null && encoding.equals("deflate")) {
    // Inflater(true) expects raw deflate data without the zlib header
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that pulls the page out of the stream and into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = scanner.next();

So this works!!
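The approach above can be tried without touching the network. The following is a minimal sketch, not the poster's exact code: it compresses a small page in memory and then reads it back through the same kind of Content-Encoding switch. The class name `ContentEncodingSketch` and the helpers `decode`, `readAll`, and `gzip` are made up for this example, UTF-8 is assumed for the page text, and the `deflate` branch assumes raw deflate data as in the answer's `new Inflater(true)`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class ContentEncodingSketch {

    // Hypothetical helper: wraps the raw body in a decompressor chosen from the
    // Content-Encoding header value; null or unknown values pass through as-is.
    static InputStream decode(InputStream raw, String contentEncoding) {
        try {
            if ("gzip".equalsIgnoreCase(contentEncoding)) {
                return new GZIPInputStream(raw);
            }
            if ("deflate".equalsIgnoreCase(contentEncoding)) {
                // Assumption carried over from the answer: raw deflate, no zlib header
                return new InflaterInputStream(raw, new Inflater(true));
            }
            return raw;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: reads the whole stream as one token in the given charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Test fixture: gzip-compresses a string in memory, standing in for the server.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<html>pubblico</html>");
        InputStream decoded = decode(new ByteArrayInputStream(compressed), "gzip");
        System.out.println(readAll(decoded, "UTF-8"));
    }
}
```

With a real connection, `decode(connection.getInputStream(), connection.getContentEncoding())` would take the place of the in-memory stream. Passing an explicit charset to the Scanner keeps the result independent of the platform default.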
