Asked by: Horseman  Asked: 1/31/2023  Updated: 2/1/2023  Views: 1830
String manipulation in polars
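Q:
I have a dataset in polars that has no header so far. The header should be taken from the first row of the data. Before setting this row as the header, I want to manipulate its entries.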
import polars as pl
# Creating a dictionary with the data
data = {
    "Column_1": ["ID", 4, 4, 4, 4],
    "Column_2": ["LocalValue", "B", "C", "D", "E"],
    "Column_3": ["Data\nField", "Q", "R", "S", "T"],
    "Column_4": [None, None, None, None, None],
    "Column_5": ["Global Value", "G", "H", "I", "J"],
}
# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌──────────┬────────────┬──────────┬──────────┬──────────────┐
│ Column_1 ┆ Column_2   ┆ Column_3 ┆ Column_4 ┆ Column_5     │
│ ---      ┆ ---        ┆ ---      ┆ ---      ┆ ---          │
│ str      ┆ str        ┆ str      ┆ f64      ┆ str          │
╞══════════╪════════════╪══════════╪══════════╪══════════════╡
│ ID       ┆ LocalValue ┆ Data     ┆ null     ┆ Global Value │
│          ┆            ┆ Field    ┆          ┆              │
│ null     ┆ B          ┆ Q        ┆ null     ┆ G            │
│ null     ┆ C          ┆ R        ┆ null     ┆ H            │
│ null     ┆ D          ┆ S        ┆ null     ┆ I            │
│ null     ┆ E          ┆ T        ┆ null     ┆ J            │
└──────────┴────────────┴──────────┴──────────┴──────────────┘
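First, I want to replace line breaks and spaces between words with underscores. I also want to insert underscores into camel case (e.g. TestTest -> Test_Test). Finally, all entries should be lowercase. For this I wrote the following function: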
def clean_dataframe_columns(df):
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    for entry in header:
        if entry:
            entry = (
                entry.replace("\n", "_")
                .replace("(?<=[a-z])(?=[A-Z])", "_")
                .replace("\s", "_")
                .to_lowercase()
            )
        else:
            entry = "no_column"
        cleaned_headers.append(entry)
    df.columns = cleaned_headers
    return df
Unfortunately, I get the following error. What am I doing wrong?
AttributeError                            Traceback (most recent call last)
Cell In[13], line 1
----> 1 df1 = clean_dataframe_columns(df)

Cell In[12], line 7, in clean_dataframe_columns(df)
      4 for entry in header:
      5     if entry:
      6         entry = (
----> 7             entry.str.replace("\n", "_")
      8             .replace("(?<=[a-z])(?=[A-Z])", "_")
      9             .replace("\s", "_")
     10             .to_lowercase()
     11         )
     12     else:
     13         entry = "no_column"

AttributeError: 'str' object has no attribute 'str'
The goal is the following dataframe:
shape: (4, 5)
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id  ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ ---         ┆ ---        ┆ ---       ┆ ---          │
│ i64 ┆ str         ┆ str        ┆ f64       ┆ str          │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4   ┆ B           ┆ Q          ┆ null      ┆ G            │
│ 4   ┆ C           ┆ R          ┆ null      ┆ H            │
│ 4   ┆ D           ┆ S          ┆ null      ┆ I            │
│ 4   ┆ E           ┆ T          ┆ null      ┆ J            │
└─────┴─────────────┴────────────┴───────────┴──────────────┘
A:
2 upvotes
glebcom
1/31/2023
#1
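In the loop for entry in header: you are iterating over plain Python strings, so you should use the corresponding str methods (for example .lower() instead of .to_lowercase()). Note also that str.replace only substitutes literal text, so the regex patterns need re.sub.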
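A minimal sketch of that difference, using only the standard library. The helper name clean_header is hypothetical, and the sample values are the header row from the question; the whitespace pattern \s+ is an assumption about how runs of spaces or newlines should be treated.

```python
import re

def clean_header(entry):
    """Illustrative helper (hypothetical name): snake_case one header cell."""
    if entry is None:
        return "no_column"
    # Regex substitution needs re.sub; str.replace would search for the
    # pattern text literally and never match.
    entry = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", entry)
    # Assumption: each run of whitespace (spaces, newlines) becomes one "_".
    entry = re.sub(r"\s+", "_", entry)
    # Plain Python strings use .lower(), not polars' .to_lowercase().
    return entry.lower()

header = ["ID", "LocalValue", "Data\nField", None, "Global Value"]
cleaned = [clean_header(h) for h in header]
print(cleaned)
```

The list in cleaned can then be assigned to df.columns or passed to df.rename as a mapping from old to new names.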
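Rewritten solution:
import re

def get_cols(raw_col):
    if raw_col is None:
        return "no_column"
    # Insert "_" at each lower-to-upper case boundary (CamelCase -> Camel_Case)
    raw_col = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", raw_col)
    return raw_col.replace("\n", "_").replace(" ", "_").lower()

def clean_dataframe_columns(df):
    # The first row holds the raw header values
    raw_cols = df.head(1).transpose().to_series().to_list()
    # Rename, drop the header row, then repair the "id" column
    # (fill_null(4) is specific to the example data)
    return df.rename({
        col: get_cols(raw_col) for col, raw_col in zip(df.columns, raw_cols)
    }).slice(1).with_columns(pl.col("id").fill_null(4).cast(pl.Int32))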
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id  ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ ---         ┆ ---        ┆ ---       ┆ ---          │
│ str ┆ str         ┆ str        ┆ f64       ┆ str          │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4   ┆ B           ┆ Q          ┆ null      ┆ G            │
│ 4   ┆ C           ┆ R          ┆ null      ┆ H            │
│ 4   ┆ D           ┆ S          ┆ null      ┆ I            │
│ 4   ┆ E           ┆ T          ┆ null      ┆ J            │
└─────┴─────────────┴────────────┴───────────┴──────────────┘
1 upvote
Horseman
2/1/2023
#2
I solved the problem myself with this approach:
import re

# Written as a method; the enclosing class is not shown here.
def clean_select_columns(self, df: pl.DataFrame) -> pl.DataFrame:
    """
    Clean columns from a dataframe.

    :param df: input Dataframe
    :return: Dataframe with cleaned columns

    The function takes a loaded Dataframe and performs the following operations:
    Transposes the first row of the dataframe to get the header
    Cleans the header names by:
    1. Inserting an underscore before each capitalized word (CamelCase to snake_case)
    2. Removing newlines, spaces and question marks
    3. Converting all names to lowercase
    4. Naming columns with no names as "no_column_X", where X is a unique integer
    Returns the dataframe with the cleaned column names.
    """
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    i = 0
    for entry in header:
        if entry:
            entry = (
                re.sub(r"(?i)([\n ?])", "",
                       re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry))
                .lower()
            )
        else:
            entry = f"no_column_{i}"
        cleaned_headers.append(entry)
        i += 1
    df.columns = cleaned_headers
    return df
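To see what the two nested re.sub calls above do to individual header cells, here is a small standard-library sketch. The helper name clean_entry and the extra sample "local value" are hypothetical; the patterns are the ones from the function, minus the (?i) flag, which has no effect on a character class without letters.

```python
import re

def clean_entry(entry):
    # Insert "_" before each capital letter that starts a word (not at the
    # very beginning of the string) ...
    with_underscores = re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry)
    # ... then delete newlines, spaces and question marks, and lowercase.
    return re.sub(r"[\n ?]", "", with_underscores).lower()

samples = ["ID", "LocalValue", "Data\nField", "Global Value", "local value"]
results = {s: clean_entry(s) for s in samples}
for raw, cleaned in results.items():
    print(repr(raw), "->", cleaned)
```

Note that the underscore comes from the capital letter, not from the whitespace, so an all-lowercase header such as "local value" would be joined into "localvalue". If that case matters, replacing whitespace with "_" instead of deleting it may be the safer choice.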