Unknown character for Turkish character

Asked by: ARAT  Asked: 11/2/2023  Updated: 11/2/2023  Views: 78

Q:

I have a DataFrame with two columns: (1) Turkish cities and (2) their corresponding values.

import pandas as pd

dict_ = {'City': {0: 'ADANA',
  1: 'ANKARA',
  2: 'ANTALYA',
  3: 'AYDIN',
  4: 'BALIKESİR',
  5: 'BURSA',
  6: 'DENİZLİ',
  7: 'DÜZCE',
  8: 'DİYARBAKIR',
  9: 'ELAZIĞ',
  10: 'GAZİANTEP',
  11: 'GİRESUN',
  12: 'HATAY',
  13: 'KAHRAMANMARAŞ',
  14: 'KARABÜK',
  15: 'KARS',
  16: 'KAYSERİ',
  17: 'KIRIKKALE',
  18: 'KIRKLARELİ',
  19: 'KIRŞEHİR',
  20: 'KOCAELİ',
  21: 'KONYA',
  22: 'KÜTAHYA',
  23: 'MANİSA',
  24: 'MARDİN',
  25: 'MERSİN',
  26: 'MUĞLA',
  27: 'ORDU',
  28: 'OSMANİYE',
  29: 'SAKARYA',
  30: 'SAMSUN',
  31: 'TRABZON',
  32: 'UŞAK',
  33: 'YALOVA',
  34: 'ZONGULDAK',
  35: 'ÇORUM',
  36: 'İSTANBUL',
  37: 'İZMİR'},
 'Value': {0: 15,
  1: 25,
  2: 19,
  3: 2,
  4: 6,
  5: 5,
  6: 3,
  7: 1,
  8: 1,
  9: 1,
  10: 7,
  11: 2,
  12: 31,
  13: 5,
  14: 1,
  15: 1,
  16: 4,
  17: 5,
  18: 1,
  19: 1,
  20: 6,
  21: 4,
  22: 2,
  23: 1,
  24: 1,
  25: 5,
  26: 5,
  27: 4,
  28: 3,
  29: 2,
  30: 3,
  31: 2,
  32: 2,
  33: 1,
  34: 2,
  35: 2,
  36: 221,
  37: 6}}

data = pd.DataFrame(dict_)

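When I try to capitalize the City column (first letter uppercase, the remaining letters lowercase), I run into a strange character problem.

data['City'].apply(str.capitalize)

The lowercase version of "İ" turns into a character that I cannot identify, for example: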

[Two screenshots, not reproduced here.]

import unicodedata
unicodedata.name("i̇")
# TypeError: name() argument 1 must be a unicode character, not str

I have tried many solutions, but to no avail!

python pandas string uppercase capitalize

Comments

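0 upvotes tripleee 11/2/2023
It looks like your character is in decomposed form, so I with dot gets lowercased to i with dot. Or perhaps you are using a font that does not properly support this glyph. We cannot tell from a screenshot which bytes are in your file.
0 upvotes Matt Pitkin 11/2/2023
See the discussion here: bugs.python.org/issue34723
0 upvotes Matt Pitkin 11/2/2023
See also stackoverflow.com/questions/19030948/...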
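The decomposed form mentioned in the comments can be checked without a screenshot. The sketch below uses only the standard library and assumes Python 3, where `str.lower()` and `str.capitalize()` apply locale-independent Unicode case mappings. It inspects the result one code point at a time, which also shows why `unicodedata.name()` raised a `TypeError` on the whole string.

```python
import unicodedata

# Lowercasing U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) yields two code points.
lowered = "İ".lower()
print(len(lowered))

# unicodedata.name() accepts exactly one character, so look at each one in turn.
for ch in lowered:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The same thing happens inside str.capitalize(): every "İ" after the first
# letter becomes "i" followed by a combining dot.
capitalized = "DENİZLİ".capitalize()
print([f"U+{ord(ch):04X}" for ch in capitalized])
```

The second code point is U+0307 (COMBINING DOT ABOVE), so the "unknown" character is an ordinary "i" followed by a combining mark rather than a single letter.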

Answers:

1 upvote Faruk Karaca 11/2/2023 #1
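def turkish_title_case(text):
    # Lowercase the Turkish capitals by hand so that "İ" becomes a plain "i"
    # (str.lower() would give "i" + U+0307) and "I" becomes dotless "ı".
    turkish_correction = {"İ": "i", "I": "ı", "Ç": "ç", "Ğ": "ğ", "Ü": "ü", "Ş": "ş", "Ö": "ö"}
    # Remember whether the word starts with a dotted i before it is rewritten.
    starts_with_dotted_i = text.startswith(("İ", "i"))

    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)
    text = text.capitalize()

    # capitalize() turns a leading "i" into "I"; restore "İ" only for words that
    # began with a dotted i, so names starting with dotless "I" stay unchanged.
    if starts_with_dotted_i:
        text = "İ" + text[1:]

    return text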

Given that the city names are fixed, this may work for this case.
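The same idea can also be written with `str.translate`, mapping the dotted/dotless pair explicitly in both directions before the built-in case methods see it. This is a minimal sketch, not a full Turkish casing implementation: the names `_TR_LOWER`, `_TR_UPPER` and `tr_capitalize` are hypothetical, and it assumes the input is in composed form (no "i" followed by U+0307).

```python
# Hypothetical helper names; only str.maketrans / str.translate are standard.
_TR_LOWER = str.maketrans("İI", "iı")  # dotted İ -> i, dotless I -> ı
_TR_UPPER = str.maketrans("iı", "İI")  # i -> dotted İ, ı -> dotless I


def tr_capitalize(text: str) -> str:
    """Capitalize one word using the Turkish dotted/dotless i rules."""
    if not text:
        return text
    # Handle the two special letters first, then let str.lower() do the rest.
    lowered = text.translate(_TR_LOWER).lower()
    # Uppercase only the first letter, again fixing i/ı before str.upper().
    return lowered[0].translate(_TR_UPPER).upper() + lowered[1:]


for city in ["İSTANBUL", "DİYARBAKIR", "KIRŞEHİR", "ÇORUM"]:
    print(tr_capitalize(city))
```

With a DataFrame like the one in the question, this could be applied as `data["City"].apply(tr_capitalize)`.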

1 upvote Matt Pitkin 11/2/2023 #2

Based on this solution, you could try the unicode_tr package, which can be installed with:

pip install unicode_tr

With this, you can do:

import pandas as pd
from unicode_tr import unicode_tr

dict_ = {
    'City': {
        0: 'ADANA',
        1: 'ANKARA',
        2: 'ANTALYA',
        3: 'AYDIN',
        4: 'BALIKESİR',
        5: 'BURSA',
        6: 'DENİZLİ',
        7: 'DÜZCE',
        8: 'DİYARBAKIR',
        9: 'ELAZIĞ',
        10: 'GAZİANTEP',
        11: 'GİRESUN',
        12: 'HATAY',
        13: 'KAHRAMANMARAŞ',
        14: 'KARABÜK',
        15: 'KARS',
        16: 'KAYSERİ',
        17: 'KIRIKKALE',
        18: 'KIRKLARELİ',
        19: 'KIRŞEHİR',
        20: 'KOCAELİ',
        21: 'KONYA',
        22: 'KÜTAHYA',
        23: 'MANİSA',
        24: 'MARDİN',
        25: 'MERSİN',
        26: 'MUĞLA',
        27: 'ORDU',
        28: 'OSMANİYE',
        29: 'SAKARYA',
        30: 'SAMSUN',
        31: 'TRABZON',
        32: 'UŞAK',
        33: 'YALOVA',
        34: 'ZONGULDAK',
        35: 'ÇORUM',
        36: 'İSTANBUL',
        37: 'İZMİR'
    },
    'Value': {
        0: 15,
        1: 25,
        2: 19,
        3: 2,
        4: 6,
        5: 5,
        6: 3,
        7: 1,
        8: 1,
        9: 1,
        10: 7,
        11: 2,
        12: 31,
        13: 5,
        14: 1,
        15: 1,
        16: 4,
        17: 5,
        18: 1,
        19: 1,
        20: 6,
        21: 4,
        22: 2,
        23: 1,
        24: 1,
        25: 5,
        26: 5,
        27: 4,
        28: 3,
        29: 2,
        30: 3,
        31: 2,
        32: 2,
        33: 1,
        34: 2,
        35: 2,
        36: 221,
        37: 6
    }
}

data = pd.DataFrame(dict_)

data["City"].apply(unicode_tr.capitalize)

Output:

0             Adana
1            Ankara
2           Antalya
3             Aydın
4         Balıkesir
5             Bursa
6           Denizli
7             Düzce
8        Diyarbakır
9            Elazığ
10        Gaziantep
11          Giresun
12            Hatay
13    Kahramanmaraş
14          Karabük
15             Kars
16          Kayseri
17        Kırıkkale
18       Kırklareli
19         Kırşehir
20          Kocaeli
21            Konya
22          Kütahya
23           Manisa
24           Mardin
25           Mersin
26            Muğla
27             Ordu
28         Osmaniye
29          Sakarya
30           Samsun
31          Trabzon
32             Uşak
33           Yalova
34        Zonguldak
35            Çorum
36         İstanbul
37            İzmir
Name: City, dtype: object