Asked by: ARAT · Asked: 11/2/2023 · Updated: 11/2/2023 · Views: 78
Unknown character for Turkish character
Q:
I have a DataFrame with two columns: (1) Turkish cities and (2) their corresponding values.
import pandas as pd

dict_ = {'City': {0: 'ADANA',
1: 'ANKARA',
2: 'ANTALYA',
3: 'AYDIN',
4: 'BALIKESİR',
5: 'BURSA',
6: 'DENİZLİ',
7: 'DÜZCE',
8: 'DİYARBAKIR',
9: 'ELAZIĞ',
10: 'GAZİANTEP',
11: 'GİRESUN',
12: 'HATAY',
13: 'KAHRAMANMARAŞ',
14: 'KARABÜK',
15: 'KARS',
16: 'KAYSERİ',
17: 'KIRIKKALE',
18: 'KIRKLARELİ',
19: 'KIRŞEHİR',
20: 'KOCAELİ',
21: 'KONYA',
22: 'KÜTAHYA',
23: 'MANİSA',
24: 'MARDİN',
25: 'MERSİN',
26: 'MUĞLA',
27: 'ORDU',
28: 'OSMANİYE',
29: 'SAKARYA',
30: 'SAMSUN',
31: 'TRABZON',
32: 'UŞAK',
33: 'YALOVA',
34: 'ZONGULDAK',
35: 'ÇORUM',
36: 'İSTANBUL',
37: 'İZMİR'},
'Value': {0: 15,
1: 25,
2: 19,
3: 2,
4: 6,
5: 5,
6: 3,
7: 1,
8: 1,
9: 1,
10: 7,
11: 2,
12: 31,
13: 5,
14: 1,
15: 1,
16: 4,
17: 5,
18: 1,
19: 1,
20: 6,
21: 4,
22: 2,
23: 1,
24: 1,
25: 5,
26: 5,
27: 4,
28: 3,
29: 2,
30: 3,
31: 2,
32: 2,
33: 1,
34: 2,
35: 2,
36: 221,
37: 6}}
data = pd.DataFrame(dict_)
When I try to capitalize the City column (first letter uppercase, the remaining letters lowercase), I run into a strange character problem.
data['City'].apply(str.capitalize)
The lowercase form of "İ" is turned into a character I cannot identify, and looking it up with unicodedata fails:
import unicodedata
unicodedata.name("i̇")
# TypeError: name() argument 1 must be a unicode character, not str
I have tried many solutions, but none of them worked!
Answers:
1 upvote
Faruk Karaca
11/2/2023
#1
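def turkish_title_case(text):
    # Lowercase the Turkish-specific capitals by hand, because the default
    # lowercasing maps "İ" to "i" plus a combining dot and "I" to dotted "i".
    turkish_correction = {"İ": "i", "I": "ı", "Ç": "ç", "Ğ": "ğ", "Ü": "ü", "Ş": "ş", "Ö": "ö"}
    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)
    text = text.capitalize()
    # capitalize() turns a leading "i" into "I"; restore the dotted capital "İ".
    # Caveat: a name that starts with dotless "I" (e.g. "ISPARTA") also gets "İ" here.
    turkish_correction = {"I": "İ"}
    for turkish, corrected in turkish_correction.items():
        text = text.replace(turkish, corrected)
    return text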
Since the set of city names is fixed, this may be good enough for this case.
1 upvote
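For context on why a manual mapping is needed at all, and why the `unicodedata.name` call in the question raises a `TypeError`: Python's default lowercasing of "İ" (U+0130) yields two code points rather than one, so the result cannot be passed to `name()` as a single character. A minimal sketch that inspects them one at a time (the variable name `lowered` is only for illustration):

```python
import unicodedata

# Default (locale-independent) lowercasing of capital dotted I.
lowered = "İ".lower()

# The result is two code points, which is why unicodedata.name(lowered) fails.
print(len(lowered))

# Inspect each code point separately instead.
for ch in lowered:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE
```

The same default rules also lowercase "I" to dotted "i" rather than Turkish dotless "ı", so `str.capitalize` alone cannot give correct Turkish casing.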
Matt Pitkin
11/2/2023
#2
Based on this solution, you could try the unicode_tr package, which can be installed with:
pip install unicode_tr
With this, you can do:
import pandas as pd
from unicode_tr import unicode_tr
dict_ = {
'City': {
0: 'ADANA',
1: 'ANKARA',
2: 'ANTALYA',
3: 'AYDIN',
4: 'BALIKESİR',
5: 'BURSA',
6: 'DENİZLİ',
7: 'DÜZCE',
8: 'DİYARBAKIR',
9: 'ELAZIĞ',
10: 'GAZİANTEP',
11: 'GİRESUN',
12: 'HATAY',
13: 'KAHRAMANMARAŞ',
14: 'KARABÜK',
15: 'KARS',
16: 'KAYSERİ',
17: 'KIRIKKALE',
18: 'KIRKLARELİ',
19: 'KIRŞEHİR',
20: 'KOCAELİ',
21: 'KONYA',
22: 'KÜTAHYA',
23: 'MANİSA',
24: 'MARDİN',
25: 'MERSİN',
26: 'MUĞLA',
27: 'ORDU',
28: 'OSMANİYE',
29: 'SAKARYA',
30: 'SAMSUN',
31: 'TRABZON',
32: 'UŞAK',
33: 'YALOVA',
34: 'ZONGULDAK',
35: 'ÇORUM',
36: 'İSTANBUL',
37: 'İZMİR'
},
'Value': {
0: 15,
1: 25,
2: 19,
3: 2,
4: 6,
5: 5,
6: 3,
7: 1,
8: 1,
9: 1,
10: 7,
11: 2,
12: 31,
13: 5,
14: 1,
15: 1,
16: 4,
17: 5,
18: 1,
19: 1,
20: 6,
21: 4,
22: 2,
23: 1,
24: 1,
25: 5,
26: 5,
27: 4,
28: 3,
29: 2,
30: 3,
31: 2,
32: 2,
33: 1,
34: 2,
35: 2,
36: 221,
37: 6
}
}
data = pd.DataFrame(dict_)
data["City"].apply(unicode_tr.capitalize)
Output:
0 Adana
1 Ankara
2 Antalya
3 Aydın
4 Balıkesir
5 Bursa
6 Denizli
7 Düzce
8 Diyarbakır
9 Elazığ
10 Gaziantep
11 Giresun
12 Hatay
13 Kahramanmaraş
14 Karabük
15 Kars
16 Kayseri
17 Kırıkkale
18 Kırklareli
19 Kırşehir
20 Kocaeli
21 Konya
22 Kütahya
23 Manisa
24 Mardin
25 Mersin
26 Muğla
27 Ordu
28 Osmaniye
29 Sakarya
30 Samsun
31 Trabzon
32 Uşak
33 Yalova
34 Zonguldak
35 Çorum
36 İstanbul
37 İzmir
Name: City, dtype: object
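If adding a dependency is not an option, roughly the same result can be sketched with built-in string methods only. This is not how unicode_tr is implemented; it is an assumption-laden alternative that pre-maps just the two letter pairs whose Turkish casing differs from Python's default (İ/i and I/ı). The names `tr_capitalize`, `_TR_LOWER` and `_TR_UPPER` are hypothetical, and "ISPARTA" is an extra example that is not in the question's data.

```python
# Translation tables for the letters whose Turkish casing differs from the default.
_TR_LOWER = str.maketrans({"İ": "i", "I": "ı"})
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def tr_capitalize(text: str) -> str:
    """Uppercase the first character and lowercase the rest, Turkish style."""
    if not text:
        return text
    # Map i/ı to their Turkish capitals first, then let upper() handle the rest.
    head = text[0].translate(_TR_UPPER).upper()
    # Map İ/I to their Turkish lowercase forms first, then let lower() handle the rest.
    tail = text[1:].translate(_TR_LOWER).lower()
    return head + tail


cities = ["İSTANBUL", "DİYARBAKIR", "KIRŞEHİR", "ISPARTA"]
print([tr_capitalize(city) for city in cities])
# ['İstanbul', 'Diyarbakır', 'Kırşehir', 'Isparta']
```

With the DataFrame from the question, this would be applied as `data["City"].apply(tr_capitalize)`. Like `str.capitalize`, it only capitalizes the first character of the whole string, so multi-word names would need to be split first.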