Regular expression to match and extract parts from a URL in Python

Asked by: Rafa S Asked: 10/11/2023 Updated: 10/14/2023 Views: 93

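Q:

I am trying to extract the Artifactory instance name, the repository name, and the artifact name into three variables from complete artifact URLs such as the following.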

"https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"

"https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

Artifactory instance -> artifactory.intuit.veg.com and artifactory.skopeo.marvel.org

Repository name -> annual-budget-local and bulletins_virtual

Artifact name -> manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz and manifests-approved/po09ij/annual-f3c.tgz

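I can use various combinations of split and regex, but I would like to understand how efficiently I can do this in Python. Any guidance would be very helpful.

Should I match the strings before and after the word artifactory and then perform an additional split operation to get the artifact name?

python-3.x regex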


A:

2 upvotes Aymen Azoui 10/11/2023 #1

Try this:

import re

def extract_artifactory_data(url):
    pattern = r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
    match = re.match(pattern, url)
    
    if not match:
        return None
    
    return match.group("instance"), match.group("repo"), match.group("artifact")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)

print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
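
The pattern above can be read piece by piece. The following is a minimal sketch, not the answerer's own code, that spells out the same pattern with re.VERBOSE comments. The name ARTIFACTORY_URL and the None check before unpacking are additions made here for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be annotated.
# ARTIFACTORY_URL is an illustrative name, not part of the original answer.
ARTIFACTORY_URL = re.compile(
    r"""
    https://                # literal scheme
    (?P<instance>[^:/]+)    # host: everything up to the first ':' or '/'
    (?::\d+)?               # optional port such as ':443' (not captured)
    /artifactory/           # literal path prefix
    (?P<repo>[^/]+)         # repository: the next path segment
    /
    (?P<artifact>.+)        # artifact: the rest of the URL
    """,
    re.VERBOSE,
)


def extract_artifactory_data(url):
    """Return (instance, repo, artifact), or None if the URL does not match."""
    match = ARTIFACTORY_URL.match(url)
    if match is None:
        return None
    return match.group("instance", "repo", "artifact")


result = extract_artifactory_data(
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
)
if result is not None:  # check before unpacking: None cannot be unpacked
    instance, repo, artifact = result
    print(instance, repo, artifact)
```

Checking for None before unpacking matters because the function returns None for a URL that does not match, and unpacking None into three names raises a TypeError.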

Comments

0 upvotes Rafa S 10/11/2023
Could you please explain? :)
2 upvotes Cem Polat 10/11/2023 #2

Here is a code example that splits the URLs as you described:

import re

# Sample URLs
urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
]

for url in urls:
    match = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
    if match:
        instance_name, repository_name, artifact_name = match.groups()
    else:
        instance_name, repository_name, artifact_name = "N/A", "N/A", "N/A"

    print("Artifactory Instance:", instance_name)
    print("Repository Name:", repository_name)
    print("Artifact Name:", artifact_name)

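For the regular expression https://([^/]+).+?/([^/]+)/(.+)$

https://: this part of the pattern matches the literal characters "https://" at the start of the URL.

([^/]+): this is a capturing group that matches one or more characters that are not a forward slash (/). It is enclosed in parentheses, so the matched content is captured and can be extracted later. Because ':' is not excluded, this group also keeps the port when one is present (artifactory.intuit.veg.com:443 for the first URL).

.+?/: this part of the pattern matches one or more characters (.+?) followed by a forward slash (/). .+? is a non-greedy match, meaning it matches as few characters as possible while still allowing the rest of the pattern to match. For the sample URLs it consumes /artifactory.

([^/]+): like the first capturing group, this matches one or more characters that are not a forward slash and captures them.

(.+)$: this part of the pattern matches and captures one or more characters up to the end of the line ($). This lets it capture everything after the second capturing group, up to the end of the URL.

The search function uses the regular expression to find these three groups in the input string; they are then unpacked into instance_name, repository_name and artifact_name.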
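
To see what each group captures, here is a small sketch that runs the pattern on the first sample URL. Because `[^/]+` does not exclude ':', the first group keeps the port. The second pattern, PORTLESS, is an assumed variant (not from the answer) that stops the host at ':' and skips an optional port.

```python
import re

url = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"

# Pattern from the answer: the first group runs up to the first '/',
# so it includes ':443'.
original = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
print(original.groups())

# Assumed variant: stop the host at ':' or '/', then skip an optional port.
PORTLESS = r'https://([^/:]+)(?::\d+)?.+?/([^/]+)/(.+)$'
variant = re.search(PORTLESS, url)
print(variant.groups())
```

If the port should not be part of the instance name, the variant (or the named-group pattern from answer #1) may be the safer choice.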

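1 upvote DuesserBaest 10/11/2023 #3

Very similar to Aymen Azoui's solution, but with minor optimizations.

  1. Uses the regex library, which IMHO should be preferred over re
  2. Detects http:// as well as https://
  3. Possessive quantifiers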
(?x)
^                                  # start of pattern
https?                             # http with an optional s
://
(?P<artifactory_instance>[^/:]++)  # capture everything up to the next ":" or "/"
(?::\d++)?                         # if you encounter a port match it (optional)
/artifactory/
(?P<repository>[^/]++)             # match repository by capturing everything up to next "/"
/
(?P<artifact_names>.++)            # match the rest of URL to artifact names
$

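On regex101 (https://regex101.com/r/7Ww4ui/1) the possessive quantifiers are omitted, because the re flavor implemented on regex101 does not handle them (Python's built-in re only gained possessive quantifiers in version 3.11).

Or as executable code: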

import regex 

def extract_artifactory_data(url):
    pattern = r"^https?://(?P<artifactory_instance>[^/:]++)(?::\d++)?/artifactory/(?P<repository>[^/]++)/(?P<artifact_names>.++)$"
    match = regex.match(pattern, url)
    
    if not match:
        return None
    
    return match.group("artifactory_instance"), match.group("repository"), match.group("artifact_names")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)

print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
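
As a rough illustration of what the possessive quantifiers in this answer change: a possessive `++` never gives characters back, so a pattern that would need backtracking simply fails. The sketch below uses the standard re module, which accepts this syntax from Python 3.11 onward. On older versions the pattern raises re.error, which the sketch catches. The sample string and variable names are made up for illustration.

```python
import re

sample = "aaa"

# Greedy: a+ first takes "aaa", then gives one "a" back so the final "a" can match.
greedy = re.fullmatch(r"a+a", sample)
print("greedy matched:", greedy is not None)

try:
    # Possessive: a++ takes "aaa" and never gives anything back, so the
    # trailing "a" has nothing left to match and the whole match fails.
    possessive = re.fullmatch(r"a++a", sample)
    print("possessive matched:", possessive is not None)
except re.error:
    # Python < 3.11: the built-in re module rejects possessive quantifiers.
    possessive = None
    print("possessive quantifiers are not supported by this re version")
```

In the URL pattern, each possessive group is followed by a character it cannot contain, so the captured values should be the same as with plain `+`. The possessive form mainly avoids pointless backtracking when a URL does not match.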

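Comments

0 upvotes Rafa S 10/11/2023
Great, thanks for your input here! Could you elaborate on the regex syntax as well? :) Thanks!
1 upvote Rafa S 10/11/2023
Perfect, thanks for the details.
1 upvote DuesserBaest 10/11/2023
If you agree with this answer, please consider accepting it.
3 upvotes treuss 10/11/2023 #4

It should be pointed out that there is also the urllib.parse module for splitting URLs. There is no reason to reinvent the wheel.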

from urllib.parse import urlparse

urls = [ "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz", "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"]

for url in urls:
    o = urlparse(url)
    instance, repo = o.hostname, o.path.split('/')[2]
    print(instance, repo)