Asked by: Rafa S Asked: 10/11/2023 Updated: 10/14/2023 Views: 93
Regular expression to match and extract parts from an URL in Python
Q:
I am trying to extract the Artifactory instance name, the repository name, and the artifact name from a full artifact URL into 3 variables, as shown below.
"https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
"https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
Artifactory instance -> artifactory.intuit.veg.com and artifactory.skopeo.marvel.org
Repository name -> annual-budget-local and bulletins_virtual
Artifact name -> manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz and manifests-approved/po09ij/annual-f3c.tgz
I could do this with several combinations of split and regex, but I would like to understand how to do it efficiently in Python. Any guidance would be very helpful.
Should I match the strings before and after the word artifactory and then perform an additional split to get the artifact name?
A:
Try this:
import re

def extract_artifactory_data(url):
    # Named groups: host (without the optional port), repository, and the remaining path
    pattern = r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
    match = re.match(pattern, url)
    if not match:
        return None
    return match.group("instance"), match.group("repo"), match.group("artifact")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)
print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
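One thing to watch with the function above: it returns None when the URL does not match, so unpacking the result directly would raise a TypeError for an unexpected URL. Below is a minimal sketch of one way to guard against that. The names parse_artifact_url and ArtifactParts are hypothetical (they are not part of the answer), and the second test URL is made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Same pattern as in the answer above, compiled once for reuse.
ARTIFACT_URL = re.compile(
    r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
)


class ArtifactParts(NamedTuple):  # hypothetical container, not from the answer
    instance: str
    repo: str
    artifact: str


def parse_artifact_url(url: str) -> Optional[ArtifactParts]:
    """Return the three parts, or None when the URL has a different shape."""
    match = ARTIFACT_URL.match(url)
    if match is None:
        return None
    # groupdict() keys are exactly the named groups: instance, repo, artifact
    return ArtifactParts(**match.groupdict())


for url in [
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz",
    "https://example.invalid/not-an-artifact",  # made-up URL that should not match
]:
    parts = parse_artifact_url(url)
    if parts is None:
        print("no match:", url)
    else:
        print(parts.instance, parts.repo, parts.artifact)
```

Checking for None before unpacking keeps the failure case explicit instead of surfacing as an unrelated TypeError.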
Here is a code example that splits the URLs as you described:
import re

# Sample URLs
urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
]

for url in urls:
    # Note: the first group stops only at "/", so an explicit port such as ":443" is captured with the host
    match = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
    if match:
        instance_name, repository_name, artifact_name = match.groups()
    else:
        instance_name, repository_name, artifact_name = "N/A", "N/A", "N/A"
    print("Artifactory Instance:", instance_name)
    print("Repository Name:", repository_name)
    print("Artifact Name:", artifact_name)
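For the regular expression https://([^/]+).+?/([^/]+)/(.+)$ :
https://: This part of the pattern matches the literal characters "https://" at the start of the URL.
([^/]+): This is a capturing group that matches one or more characters that are not a forward slash (/). It is enclosed in parentheses, which means the matched content is captured and can be extracted later.
.+?/: This part of the pattern matches one or more characters (.+?) followed by a forward slash (/). .+? is a non-greedy match, meaning it matches as few characters as possible while still allowing the rest of the pattern to match.
([^/]+): Like the first capturing group, this matches one or more characters that are not a forward slash and captures them.
(.+)$: This part of the pattern matches one or more characters up to the end of the line ($) and captures them. This lets it capture everything after the second capturing group, up to the end of the URL.
The search function uses the regular expression to match the instance_name, repository_name, and artifact_name groups in the input string.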
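To check the breakdown above against one of the sample URLs, here is a small sketch. One detail it makes visible: because the first group is [^/]+, it also captures an explicit port, so the first sample URL yields artifactory.intuit.veg.com:443 rather than the bare host name the question asks for. The second pattern is an assumed tweak of my own (not from the answer) that excludes ":" from the first group and skips an optional port.

```python
import re

url = (
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/"
    "manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
)

# Pattern exactly as explained above.
original = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
print(original.groups())

# Assumed variant: stop the first group at ":" and skip an optional port.
adjusted = re.search(r'https://([^/:]+)(?::\d+)?.+?/([^/]+)/(.+)$', url)
print(adjusted.groups())
```

Both patterns still rely on the lazy .+?/ to step over the /artifactory/ segment, so neither verifies that the segment is literally "artifactory".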
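Very similar to Aymen Azouis's solution, but with a few minor optimizations:
- Uses the regex library, which IMHO should be preferred over re
- Detects http:// as well as https://
- Uses possessive quantifiers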
(?x)
^ # start of pattern
https? # http with an optional s
://
(?P<artifactory_instance>[^/:]++) # capture everything up to the next ":" or "/"
(?::\d++)? # if you encounter a port match it (optional)
/artifactory/
(?P<repository>[^/]++) # match repository by capturing everything up to next "/"
/
(?P<artifact_names>.++) # match the rest of URL to artifact names
$
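On regex101 (https://regex101.com/r/7Ww4ui/1) the possessive quantifiers are omitted, because the re module as implemented on regex101 does not handle them (re itself supports possessive quantifiers only from Python 3.11 onward).
Or as executable code: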
import regex  # third-party package, not the standard-library re module

def extract_artifactory_data(url):
    pattern = r"^https?://(?P<artifactory_instance>[^/:]++)(?::\d++)?/artifactory/(?P<repository>[^/]++)/(?P<artifact_names>.++)$"
    match = regex.match(pattern, url)
    if not match:
        return None
    return match.group("artifactory_instance"), match.group("repository"), match.group("artifact_names")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)
print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
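The commented pattern in this answer can also be run without installing the third-party regex package. The sketch below uses the standard re module with the re.VERBOSE flag and ordinary greedy quantifiers (+) in place of the possessive ones (++), which is the same simplification used in the regex101 link. The constant name ARTIFACT_URL is my own, and the sample URL is the question's second URL with the scheme changed to http:// to exercise the optional "s".

```python
import re

# Verbose form of the pattern, with plain greedy quantifiers (+) instead of
# possessive ones (++), so it also works on Python versions before 3.11.
ARTIFACT_URL = re.compile(
    r"""
    ^                                  # start of pattern
    https?                             # http with an optional s
    ://
    (?P<artifactory_instance>[^/:]+)   # everything up to the next ":" or "/"
    (?::\d+)?                          # optional port
    /artifactory/
    (?P<repository>[^/]+)              # everything up to the next "/"
    /
    (?P<artifact_names>.+)             # the rest of the URL
    $
    """,
    re.VERBOSE,
)

match = ARTIFACT_URL.match(
    "http://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
)
if match:
    print(match.groupdict())
```

For inputs of this shape the greedy and possessive versions should capture the same groups; the possessive form mainly avoids needless backtracking when a URL fails to match.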
It should be pointed out that there is also the urllib.parse module for splitting URLs. There is no reason to reinvent the wheel.
from urllib.parse import urlparse

urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz",
]

for url in urls:
    o = urlparse(url)
    # hostname drops the port; path segment 2 follows "" and "artifactory"
    instance, repo = o.hostname, o.path.split('/')[2]
    print(instance, repo)
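The snippet above prints only the instance and the repository. As a sketch of how the same urllib.parse approach could also return the artifact name, the path can be split with a maxsplit of 3 so that the slashes inside the artifact name are preserved. The function name split_artifact_url is hypothetical, and the check for the "artifactory" prefix is my own assumption about what should count as a valid URL.

```python
from urllib.parse import urlparse


def split_artifact_url(url):  # hypothetical helper name
    parsed = urlparse(url)
    # The path looks like "/artifactory/<repo>/<artifact...>"; maxsplit=3
    # keeps the slashes inside the artifact name intact. A shorter path
    # makes the unpacking below raise ValueError.
    _, prefix, repo, artifact = parsed.path.split("/", 3)
    if prefix != "artifactory":
        raise ValueError(f"unexpected path prefix: {prefix!r}")
    # hostname excludes the port, so ":443" is dropped automatically
    return parsed.hostname, repo, artifact


urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz",
]
for url in urls:
    instance, repo, artifact = split_artifact_url(url)
    print(instance, repo, artifact)
```

This avoids a regular expression entirely, at the cost of relying on the fixed position of the path segments.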