Why aren't these various encodings allowing me to properly display Portuguese?

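Q:

I'm doing some text mining that involves Portuguese text. Some of my custom text-mining functions also contain other special characters.

I'm not an expert on this topic. When many of my characters started displaying incorrectly, I assumed I needed to change the file encoding. I tried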

ISO-8859-1
ISO-8859-7
UTF-8
WINDOWS-1252

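None of them improved the display of the characters. Do I need a different encoding, or am I doing something wrong?

For example, when I try to read this list of stop words from GitHub: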

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")

they come out like this:

tail(stop_words, 17)

206    tivÃ©ramos
207         tenha
208      tenhamos
209        tenham
210       tivesse
211   tivÃ©ssemos
212      tivessem
213         tiver
214      tivermos
215       tiverem
216         terei
217         terÃ¡
218       teremos
219        terÃ£o
220         teria
221     terÃamos
222        teriam

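I also tried stringsAsFactors = F.

I don't speak Portuguese, but my intuition tells me that the euro and copyright symbols aren't in its alphabet. It also seems to be changing some accented lowercase e's into uppercase A's with a different accent.

In case it helps: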

Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I also tried changing the locale, as well as stri_encode(stop_words$V1, "", "UTF-8") and tail(enc2native(as.vector(stop_words[,1])),17).

r text encoding character-encoding

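@OriolMirosa I had the problem before changing the encoding from my system default (ISO-8859-1). I tried changing it in RStudio (Reopen with Encoding) and then pulling the data again. I also tried changing it with the stringi package. I think the answer below is right and it is somehow being double encoded, but I don't know why or how to fix it.

0 upvotes Oriol Mirosa 7/28/2017

Have you tried enc2utf8(as.vector(stop_words[,1])) or enc2native(as.vector(stop_words[,1]))?

0 upvotes Hack-R 7/28/2017

@OriolMirosa I hadn't tried that, thanks. I just tried it after reading your comment, but the problem persists.

0 upvotes Oriol Mirosa 7/28/2017

Hmm... What system are you on? Do you use RStudio? What font does your R terminal use? Can you see tildes and other Latin characters in the terminal? (If your keyboard is English, press alt+e and then e to get "é".)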

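A:

1 upvote Alexandre Mercier Aubin 7/28/2017 #1

It looks like your UTF-8 is being double encoded.

Here is a chart of characters in UTF-8: http://www.i18nqa.com/debug/utf8-debug.html.
Now look at the "Actual" column.

As you can see, the printed characters seem to represent the actual values rather than the encoded ones.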
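To make one row of that chart concrete, here is a minimal sketch of how "é" can turn into "Ã©". It assumes the bytes are misread as Latin-1, which agrees with Windows-1252 for these two bytes; the variable names are only illustrative.

```r
# "é" written as an escape so the encoding of this script does not matter
original <- "\u00e9"

# UTF-8 stores "é" as two bytes: c3 a9
utf8_bytes <- charToRaw(enc2utf8(original))

# Drop the information that these bytes are UTF-8 ...
unmarked <- rawToChar(utf8_bytes)

# ... and read them as Latin-1 instead, one character per byte
misread <- iconv(unmarked, from = "latin1", to = "UTF-8")

nchar(misread)  # two characters where there used to be one
```

The two resulting characters are U+00C3 (Ã) and U+00A9 (©), which is the pair that shows up in tivÃ©ramos.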

A temporary workaround would be to decode one layer of UTF-8.
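A rough sketch of what decoding one layer could look like, assuming every stray character falls in the Latin-1 range. The garbled string is built by hand here and is not read from the gist.

```r
# The garbled word, spelled with escapes: "tiv" + U+00C3 + U+00A9 + "ramos"
garbled <- "tiv\u00c3\u00a9ramos"

# Step 1: map each character back to the single byte it was read from
fixed <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: declare that those bytes are really UTF-8
Encoding(fixed) <- "UTF-8"

fixed
```

On non-Windows platforms "latin1" means ISO-8859-1, so a character that exists only in Windows-1252 (such as €) would make iconv return NA. In that case a platform-specific name such as "WINDOWS-1252" may be needed instead.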

Update:

After installing R, I tried to reproduce the problem.
Here is my console log, with a short explanation:

First, I copy-pasted your code:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃamos
222      teriam

OK, so it didn't work as is, so I added the encoding argument at the end of the read.table call. Here is the result when I tried it with lowercase utf-8:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",encoding="utf-8")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃamos
222      teriam

Finally, I used UTF-8 in capital letters, and now it works properly:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt", encoding = "UTF-8")
> tail(stop_words, 17)
            V1
206  tivéramos
207      tenha
208   tenhamos
209     tenham
210    tivesse
211 tivéssemos
212   tivessem
213      tiver
214   tivermos
215    tiverem
216      terei
217       terá
218    teremos
219      terão
220      teria
221   teríamos
222     teriam

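You may have forgotten to put the encoding argument at the end of read.table, or tried it in lowercase instead of uppercase. What I take from this is that the encoding argument does not convert anything: it only tells R that the strings are already UTF-8, and only the exact spellings "latin1" and "UTF-8" are recognised. Without that declaration, R assumes the text is in your locale's native encoding (Windows-1252 here) and shows each UTF-8 byte as its own character.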
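The difference between the two spellings can be checked offline with a small sketch. It writes two of the words to a temporary file (a stand-in for the gist, not the real data) and inspects how read.table marks the strings.

```r
# Write the UTF-8 bytes of "terá" and "terão" to a temporary file
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("ter\u00e1\nter\u00e3o\n"), path)

# "UTF-8" is recognised: the strings are marked as UTF-8, not converted
marked <- read.table(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# "utf-8" is not recognised: the strings are left unmarked
unmarked <- read.table(path, encoding = "utf-8", stringsAsFactors = FALSE)

Encoding(marked$V1)
Encoding(unmarked$V1)

unlink(path)  # remove the temporary file
```

Unmarked strings are displayed in the native encoding, which is why they only look wrong when the locale is not UTF-8. read.table also has a fileEncoding argument, which re-encodes the input while it is read instead of just marking it; that may be worth trying if marking alone is not enough.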
