Can't write the correct set of regular expressions

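Q:

I'm using:

Python 3.11.1
Windows 10 Pro
requests 2.31.0
beautifulsoup4 4.12.2
pandas 2.1.2
Jupyter (I'm writing in Jupyter, but I will finish the code in PyCharm)

I fetch text with an HTTP request from my university's website, where they publish the class schedule. I get the text, but it comes out in a scattered order, as you can see in the image and the text file (links below). I can't write a regular expression that makes the text readable. Please help me solve this.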

from bs4 import BeautifulSoup
import re
import requests
import pandas as pd

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSNR-Gvp7MBcYQo0GM5nU3UC7DSIGMCwKq-eQGIY_alqORpe1pvZ00PI63wNuOyiJbZI_AP6nSeWWop/pubhtml'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html')

table = soup.find_all('table')[0]

world_titles = table.find_all('td')

world_table_titles = [title.text.strip() for title in world_titles]

# here I can't write reg exp to make text readable
clean_titles = [re.sub(r'[",\s]+', '', title) for title in world_table_titles]

print(clean_titles)

I would appreciate instructions on how to make the text readable, in this form:

ПОНЕДЕЛЬНИК

Время АСОИ-1-23
8:00
Физика (пр) Нарманбетова Г. Ж
9:30    Математика Алыкулова К.Б
11:00   Основы экономики, менеджмента и маркетинга Алтыбаева Ш.И.
12:40   Русский язык Омуркулова Г.М.

I know it's a lot to ask, but I'm really stuck.

Text file: regular_expression

I have been watching YouTube tutorials and trying regex101 and AI chatbots, but nothing has helped.

python regex pandas

A:

# read a raw version of the Google Sheet HTML
# (`page` is the requests response from the question's code)
from io import StringIO
import pandas as pd
tmp = (pd.read_html(StringIO(page.text))[0].iloc[4:, 1:]
           .dropna(how="all").T.set_index(4).T)

# de-duplicate the repeated headers (optional?)
s = tmp.columns.to_series()
tmp.columns = (s.str.cat(s.groupby(level=0).cumcount().add(1)
                .astype(str), sep="-").where(s.duplicated(keep=False), s))

df = tmp.set_index(["Время-1", "Время-2"]).rename_axis(columns=None)

This produces a DataFrame with a hierarchical index (MultiIndex), and loc gives you the expected output:

df.loc["ПОНЕДЕЛЬНИК", "АСОИ-1-23"]

Время-2
8:00                        Физика (пр) Нарманбетова Г. Ж.
9:30                              Математика Алыкулова К.Б
11:00    Основы экономики, менеджмента и маркетинга Алт...
12:40                         Русский язык Омуркулова Г.М.
Name: АСОИ-1-23, dtype: object
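
The 11:00 entry above is cut off with "..." because pandas limits the displayed column width by default. A minimal sketch of lifting that limit for one print, assuming a small hand-built Series (the name `lessons` and its contents are a hypothetical stand-in for the result of the `loc` call, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for df.loc["ПОНЕДЕЛЬНИК", "АСОИ-1-23"]
lessons = pd.Series(
    {
        "9:30": "Математика Алыкулова К.Б",
        "11:00": "Основы экономики, менеджмента и маркетинга Алтыбаева Ш.И.",
    },
    name="АСОИ-1-23",
)

# Default repr: cells longer than display.max_colwidth are shortened
short_view = repr(lessons)

# Inside the context manager the width limit is disabled,
# so every cell is rendered in full
with pd.option_context("display.max_colwidth", None):
    full_view = repr(lessons)

print(full_view)
```

The option is only changed inside the `with` block, so other output in the notebook keeps the default behaviour.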

Output (the whole table):

print(df)

                          АСОИ-1-23 ауд.-1  ...      ЭУБДМ-1-23 ауд.-15
Время-1     Время-2                         ...                        
ПОНЕДЕЛЬНИК 8:00     Физика (пр)...    338  ...  Физика (пр)...     338
            9:30     Математика ...    407  ...  Математика ...     407
            11:00    Основы экон...    411  ...  Основы экон...     411
...                             ...    ...  ...             ...     ...
ПЯТНИЦА     9:30     Физика(лек)...    338  ...  Физика(лек)...     338
            11:00    Введение в ...    422  ...             NaN     NaN
            12:40    Кыргыз тил ...    405  ...  Кыргыз тил ...     405

[20 rows x 30 columns]
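
The header de-duplication step may be easier to follow on a tiny table. The sketch below is only an illustration under assumptions: the frame `raw`, its values, and the helper `dedupe_columns` are hypothetical and merely mimic the repeated "Время" and "ауд." headers of the published sheet. It uses a simplified variant of the same idea (a running count per repeated label).

```python
import pandas as pd

# Hypothetical miniature of the sheet: two "Время" columns (day and time)
# followed by group / room pairs whose room header "ауд." repeats.
raw = pd.DataFrame(
    [
        ["ПОНЕДЕЛЬНИК", "8:00", "Физика (пр)", "338", "Физика (пр)", "338"],
        ["ПОНЕДЕЛЬНИК", "9:30", "Математика", "407", "Математика", "407"],
        ["ВТОРНИК", "8:00", "Русский язык", "405", "Русский язык", "405"],
    ],
    columns=["Время", "Время", "АСОИ-1-23", "ауд.", "ЭУБДМ-1-23", "ауд."],
)


def dedupe_columns(columns):
    """Append -1, -2, ... to repeated labels; leave unique labels untouched."""
    s = pd.Series(list(columns))
    # Running occurrence number of each label, starting at 1
    counter = s.groupby(s).cumcount().add(1).astype(str)
    # Only labels that occur more than once receive the numeric suffix
    return (s + "-" + counter).where(s.duplicated(keep=False), s).tolist()


raw.columns = dedupe_columns(raw.columns)

# Day and time become a two-level row index, as in the answer
demo = raw.set_index(["Время-1", "Время-2"])

# Select one day for one group
monday = demo.loc["ПОНЕДЕЛЬНИК", "АСОИ-1-23"]
print(monday)
```

Once the labels are unique, `set_index` can address the day and time columns by name, which is what makes the `loc` lookup by day and group possible.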

For fun, if you also want to clone the formatting, you can use a Styler:

import numpy as np  # needed by fmt_aya below

def fmt_outeridx(ser):
    return ["""background-color: #00ffff; font-weight: bold;
            font-size: 14pt;text-align: center;""" for _ in ser]

def fmt_aya(ser):
    return np.where(ser.index.str.startswith("ауд"),
                    "background-color:#ffff99", "")

(
    df.style
        .set_properties(**{"font-weight": "bold",
            "border": "1px solid", "text-align": "center"})
        .apply_index(fmt_outeridx, axis=0, level=0)
        .apply_index(fmt_outeridx, axis=1)
        .apply(fmt_aya, axis=1)
)
