Language agnostic: Getting parts of a URL (Regex)

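Given the URL (single line): http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

The subdomain (test)

The domain (example.com)

The path without the file (/dir/subdir/)

The file (file.html)

The path with the file (/dir/subdir/file.html)

The URL without the path (http://test.example.com)

(add any other that you think would be useful)

The regex should work correctly even if I enter the following URL:

http://example.example.com/example/example/example.html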

A single regex to parse and break up a
full URL, including query parameters
and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegEx positions:

url: RegExp['$&'],

protocol:RegExp.$2,

host:RegExp.$3,

path:RegExp.$4,

file:RegExp.$6,

query:RegExp.$7,

hash:RegExp.$8

You can then easily parse the host further (it is '.' delimited).
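As a rough illustration, the sketch below applies hometoast's regex to the URL from the question and then splits the host on '.' to get the subdomain and domain. It is written in Python; the function name `split_url` and the dictionary keys are made up for this example, and treating the last two labels as the domain is a simplifying assumption that breaks for hosts such as domain.co.uk.

```python
import re

# hometoast's pattern; groups: 2 = protocol, 3 = host, 4 = path, 6 = file.
URL_RE = re.compile(
    r'^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$'
)

def split_url(url):
    """Return the parts asked for in the question, or None if the regex does not match."""
    m = URL_RE.match(url)
    if m is None:
        return None
    protocol, host, path, file = m.group(2), m.group(3), m.group(4), m.group(6)
    labels = host.split('.')
    return {
        'subdomain': '.'.join(labels[:-2]),  # assumption: everything before the last two labels
        'domain': '.'.join(labels[-2:]),     # assumption: the last two labels
        'path': path,
        'file': file,
        'path_with_file': path + file,
        'url_without_path': (protocol + '://' if protocol else '') + host,
    }

parts = split_url('http://test.example.com/dir/subdir/file.html')
print(parts)
```

Note that the pattern requires a file name after the last slash, so a bare http://test.example.com/ does not match and `split_url` returns None for it.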

What I would do is use something like this:

/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

Then further parse "the rest" to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
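A minimal Python sketch of this two-step idea, assuming the second step only needs to separate path, query and fragment. The function name `parse_in_two_steps` and the result keys are invented for the example; the first-pass regex is the one shown above.

```python
import re

# First pass: protocol, host, optional port, and everything else.
FIRST_PASS = re.compile(r'^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$')

def parse_in_two_steps(url):
    m = FIRST_PASS.match(url)
    if m is None:
        return None
    proto, host, port, rest = m.groups()
    # Second pass over "the rest": cut off the fragment first, then the query.
    before_fragment, _, fragment = rest.partition('#')
    path, _, query = before_fragment.partition('?')
    return {'proto': proto, 'host': host, 'port': port,
            'path': path, 'query': query, 'fragment': fragment}

result = parse_in_two_steps('http://www.example.com:123/foo/bar.html?fox=trot#foo')
print(result)
```

As in the regex itself, the protocol keeps its trailing colon and the port keeps its leading colon.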

I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
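Outside a browser the same "let a built-in parser do it" approach is usually available too. As one example, and not something the answer above mentions, Python's standard library has `urllib.parse.urlsplit`, which exposes similar fields:

```python
from urllib.parse import urlsplit

parts = urlsplit('http://www.example.com:123/foo/bar.html?fox=trot#foo')

# Attribute names differ slightly from the browser's: scheme has no trailing ':',
# and query/fragment come back without the leading '?' and '#'.
for name in ('scheme', 'netloc', 'hostname', 'port', 'path', 'query', 'fragment'):
    print(name + ':', getattr(parts, name))
```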

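I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee et al., is: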

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
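For reference, here is a small Python sketch that runs the specification's regex on the same example. In Python the slashes need no escaping; the variable names are chosen for this example, and the mapping of groups 2, 4, 5, 7 and 9 to scheme, authority, path, query and fragment follows the specification.

```python
import re

URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

m = URI_RE.match('http://www.ics.uci.edu/pub/ietf/uri/#Related')

# Groups that did not take part in the match come back as None
# (shown as <undefined> in the listing above).
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)
```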

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

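I found that the highest-voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

It cannot handle port numbers.

The hash part is broken.

The following is a modified version:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by an anonymous user: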

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}
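To show how the group positions line up, here is a Python sketch that uses the pattern from getFileName together with the position constants listed above. The sample URL and the `parts` dictionary are assumptions made for the example.

```python
import re

# Group numbers as listed in the answer.
SCHEMA, DOMAIN, PORT, PATH, FILE, QUERYSTRING, HASH = 2, 3, 5, 6, 8, 9, 12

URL_RE = re.compile(
    r'^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)'
    r'([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$',
    re.IGNORECASE,
)

m = URL_RE.match('http://www.example.com:123/foo/bar.html?fox=trot#foo')
parts = {
    'schema': m.group(SCHEMA),
    'domain': m.group(DOMAIN),
    'port': m.group(PORT),
    'path': m.group(PATH),
    'file': m.group(FILE),
    'querystring': m.group(QUERYSTRING),  # group 9 keeps the leading '?'
    'hash': m.group(HASH),
}
print(parts)
```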

I needed a regex to match all URLs and made this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all URLs, with any protocol, even URLs like

ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp","user","pass","www.cs","server","com","8080","/dir1/dir2/","file.php","param1=value1","hashtag"]

A URL like

mailto://admin@www.cs.server.com

looks like this:

["mailto","admin", undefined,"www.cs","server","com", undefined, undefined, undefined, undefined, undefined]
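The first result can be checked with a short Python sketch. It assumes that every quantifier missing from the pattern as displayed is a `*`. The name `ANY_URL_RE` is made up for the example. Python reports an optional group that matched the empty string as '' rather than leaving it undefined, so the mailto result would not come out identical there.

```python
import re

ANY_URL_RE = re.compile(
    r'(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?'
    r'(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?'
    r'(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?'
)

url = 'ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag'
groups = list(ANY_URL_RE.match(url).groups())
print(groups)
```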

I was trying to solve this in JavaScript, which should be handled by:

var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

{
"hash":"#foobar/bing/bo@ng?bang",
"search":"?foo=bar&bingobang=&king=kong@kong.com",
"pathname":"/path/wah@t/foo.js",
"port":"890",
"hostname":"example.com",
"host":"example.com:890",
"password":"b",
"username":"a",
"protocol":"http:",
"origin":"http://example.com:890",
"href":"http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}

However, this isn't cross-browser (https://developer.mozilla.org/en-us/docs/web/api/url), so I cobbled this together to pull out the same parts as above:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence, who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) and came up with the regex this was originally based on.

The parts are in this order:

var keys = [
"href", // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
"origin", // http://user:pass@host.com:81
"protocol", // http:
"username", // user
"password", // pass
"host", // host.com:81
"hostname", // host.com
"port", // 81
"pathname", // /directory/file.ext
"search", // ?query=1
"hash" // #anchor
];
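A sketch of how the keys pair up with the match groups, written in Python for brevity. It assumes the pattern reads as below (with `*` quantifiers on the password, path and hash parts), that the whole match is `href`, and that capture groups 1 to 10 follow the remaining keys in order. The names `LITE_URL_RE` and `parse` are invented for the example.

```python
import re

KEYS = ['href', 'origin', 'protocol', 'username', 'password', 'host',
        'hostname', 'port', 'pathname', 'search', 'hash']

LITE_URL_RE = re.compile(
    r'^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?'
    r'(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?'
    r'((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?'
)

def parse(url):
    m = LITE_URL_RE.match(url)
    # Group 0 is the whole match (href); groups 1..10 follow the key order.
    return dict(zip(KEYS, (m.group(0),) + m.groups()))

parsed = parse('http://user:pass@host.com:81/directory/file.ext?query=1#anchor')
print(parsed)
```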

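There is also a small library which wraps it and provides query params:

https://github.com/sadams/lite-url (also available on Bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge it with thanks.

Here is a much more readable solution (in Python, but it applies to any regex):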

import re

def url_path_to_dict(path):
    # Each named group is one URL component; all quantifiers are lazy.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

Prints:

{
'host': 'example.example.com',
'user': None,
'path': '/example/example/example.html',
'query': None,
'password': None,
'port': None,
'schema': 'http'
}

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP/FTP, subdomains, folders, files, etc.

I found it from a quick Google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

Subdomains and domains are difficult because the subdomain can have several parts, as can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/

the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)
the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
the path with the file : http://[^/]+/(.*)
the URL without the path : (http://[^/]+/)

(Markdown isn't very friendly to regexes)

This improved version should work as reliably as a parser.

// Applies to URI, not just URL or URN:
// http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
//
// http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
//
// (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
//
// http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
//
// $@ matches the entire uri
// $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
// $2 matches authority (host, user:pwd@host, etc)
// $3 matches path
// $4 matches query (http GET REST api, etc)
// $5 matches fragment (html anchor, etc)
//
// Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
// Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
//
// (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
//
// Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
{
if( !schemes )
schemes = '[^\\s:\/?#]+'
else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
throw TypeError( 'expected URI schemes' )
return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
}

// http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
function uriSchemesRegExp()
{
return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
}

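/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

From my answer on a similar question. It works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, and broken fragment identifiers).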
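The named groups map directly onto Python's `re` module, which uses the same `(?P<name>...)` syntax. The sketch below assumes the quantifiers read as `.*?` for the credentials and `*` elsewhere. The two sample URLs are made up to exercise the credentials and the single-character file name.

```python
import re

URL_RE = re.compile(
    r'^((?P<scheme>https?|ftp):\/)?\/?'
    r'((?P<username>.*?)(:(?P<password>.*?)|)@)?'
    r'(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?'
    r'(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?'
    r'(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$'
)

full = URL_RE.match('http://user:pass@www.example.com:8080/dir/file.html?x=1#top').groupdict()
short = URL_RE.match('http://example.com/a').groupdict()
print(full)
print(short)
```

Note that the `port`, `query` and `fragment` groups keep their leading ':', '?' and '#'.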

Here is one that is complete and doesn't rely on any protocol.

function getServerURL(url) {
    var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
    console.log(m[1]); // Remove this
    return m[1];
}

getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")

Prints

http://dev.test.se

http://dev.test.se

//ajax.googleapis.com

//

www.dev.test.se

www.dev.test.se

www.dev.test.se

www.dev.test.se

//dev.test.se

http://www.dev.test.se

http://localhost:8080

https://localhost:8080
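The same pattern works unchanged in other regex flavours. A minimal Python sketch, with `get_server_url` as a made-up port of the function above and a handful of the inputs from the list:

```python
import re

# Same pattern as getServerURL: optional "<anything>//" prefix, then everything
# up to the first '/', '?', '#' or ';'.
SERVER_RE = re.compile(r'(^(?:(?:.*?)?//)?[^/?#;]*)')

def get_server_url(url):
    return SERVER_RE.match(url).group(1)

samples = [
    'http://dev.test.se/',
    '//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js',
    'www.dev.test.se?abc=32',
    'https://localhost:8080?sdsa',
    '//',
]
results = [get_server_url(s) for s in samples]
print(results)
```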

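None of the above worked for me. Here's what I ended up using:

/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/

I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language. https://gist.github.com/voodoogq/4057330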

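You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The only difficult task is breaking the host into subdomain, domain name and TLD.

There is no standard way to do so, and simple string parsing or a regex cannot be relied on to produce the correct result. At first I used a regex function, but it did not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs. Once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.

However, the list needs to be maintained, since new TLDs can appear. As far as I know, publicsuffix.org maintains the latest list, and you can use the domainname-parser tools from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.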
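A minimal sketch of the list-based approach in Python. The `SUFFIXES` set is a tiny hand-picked stand-in for the real public suffix list, and `split_host` is a made-up helper; a real implementation would load the full list from publicsuffix.org and handle its wildcard and exception rules.

```python
# Illustrative subset only; NOT the real public suffix list.
SUFFIXES = {'com', 'org', 'uk', 'co.uk'}

def split_host(host):
    """Split a host into subdomain, domain and TLD using the suffix set."""
    labels = host.lower().split('.')
    # Scan from the left so the longest matching suffix wins ("co.uk" before "uk").
    for i in range(len(labels)):
        suffix = '.'.join(labels[i:])
        if suffix in SUFFIXES:
            if i == 0:
                return None  # the host is itself a suffix, so there is no domain
            return {
                'subdomain': '.'.join(labels[:i - 1]),
                'domain': labels[i - 1],
                'tld': suffix,
            }
    return None  # no known suffix

print(split_host('sub1.sub2.domain.co.uk'))
print(split_host('test.example.com'))
```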

This answer is also helpful: Get the subdomain from a URL

CallMeLaNN

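Java offers a URL class that will do this. Query URL Objects.

On a side note, PHP offers parse_url().

I tried a few of these and they didn't cover my needs, especially the highest-voted one, which didn't catch a URL without a path (http://example.com/).

Also, the lack of group names made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

So this is my version, slightly modified, with the source being the highest-voted version here: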

^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

I'd advise not using a regex. An API call like WinHttpCrackUrl() is less error-prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

Regexp to get the URL path without the file.

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i)

It can be useful for adding a relative path to this URL.

String s ="https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex ="(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1:" + s.replaceAll(regex,"$1"));
System.out.println("2:" + s.replaceAll(regex,"$2"));
System.out.println("3:" + s.replaceAll(regex,"$3"));
System.out.println("4:" + s.replaceAll(regex,"$4"));

This will produce the following output:
1:https://
2:www.thomas-bayer.com
3:/
4:axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";
the output will be the following:
1:https://
2:www.thomas-bayer.com
3:?
4:wsdl=qwerwer&ttt=888

Enjoy.
Yosi Lev

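I know you're claiming language-agnostic on this, but can you tell us what you're using, just so we know what regex capabilities you have?

If you have support for non-capturing groups, you can modify hometoast's expression so that the subexpressions you aren't interested in capturing are set up like this: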

(?:SOMESTUFF)

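You'd still have to copy and paste (and slightly modify) the regex into multiple places, but this makes sense: you're not just checking whether the subexpression exists, but whether it exists as part of a URL. Using the non-capturing modifier for subexpressions gives you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since it only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So: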

https?

would match 'http' or 'https' just fine.
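Both points can be seen in a few lines of Python. The patterns here are cut-down illustrations written for this example, not hometoast's full expression.

```python
import re

url = 'https://test.example.com/dir/'

# Capturing group: the protocol shows up in the result.
capturing = re.match(r'(https?|ftp)://([^/]+)', url)
# Non-capturing group: same match, but only the host is captured.
non_capturing = re.match(r'(?:https?|ftp)://([^/]+)', url)

print(capturing.groups())      # ('https', 'test.example.com')
print(non_capturing.groups())  # ('test.example.com',)

# `s?` and `[s]?` accept exactly the same strings.
same = all(
    bool(re.fullmatch(r'https?', candidate)) == bool(re.fullmatch(r'http[s]?', candidate))
    for candidate in ('http', 'https', 'httpss', 'htt')
)
print(same)  # True
```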

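Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern, which is then used to compare against a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So each enumeration has its own regex, depending on where it should look inside the URL.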
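A sketch of that design using Python's `enum` module. The member names, the example.com patterns and the helper `find_supported` are all hypothetical and stand in for whatever URLs the program really supports.

```python
import re
from enum import Enum

class SupportedUrl(Enum):
    # Each member carries the regex that recognises its kind of URL (made-up patterns).
    DOCS = r'^https?://docs\.example\.com/'
    FILES = r'^ftp://files\.example\.com/.+\.zip$'

    def get_regex_pattern(self):
        return re.compile(self.value)

    def supports(self, url):
        return self.get_regex_pattern().search(url) is not None

def find_supported(url):
    """Return the first enum member whose pattern matches, or None."""
    return next((kind for kind in SupportedUrl if kind.supports(url)), None)

print(find_supported('https://docs.example.com/guide'))
print(find_supported('http://other.example.org/'))
```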

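hometoast's suggestion is great, but in my case I think it wouldn't help (unless I copy and paste the same regex into all the enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)

The regex to do full parsing is quite horrendous. I've included named backreferences for legibility and broken each part onto a separate line, but it still looks like this: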

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

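What makes it so verbose is that, except for the protocol or the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So in the last few cases (the host, path, file, querystring, and fragment) we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this:

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible: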

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$
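One way to assemble the final pattern from the template is plain string substitution. The Python sketch below is an illustration under a few assumptions: the lines are joined without separators, the placeholder is replaced with `str.replace`, and the querystring line is used without a `;` after the placeholder, since the entity pattern already ends in one. The sample URL is made up.

```python
import re

HTMLENTITY = (r'&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil'
              r'|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);')

TEMPLATE = ''.join([
    r'^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?',
    r'(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?',
    r'(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?',
    r'(?P<file>(?:{{htmlentity}}|[^?#])+)',
    r'(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?',
    r'(?:#(?P<fragment>.*))?$',
])

# Expand the mustache-style placeholder before compiling.
URL_RE = re.compile(TEMPLATE.replace('{{htmlentity}}', HTMLENTITY))

parts = URL_RE.match('http://www.example.com:8080/dir/subdir/file.html?x=1#top').groupdict()
print(parts)
```

In this pattern the `host` group includes the port, and `path` has neither a leading nor a trailing slash.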

In JavaScript, of course, you can't use named backreferences (named capture groups only arrived with ES2018), so the regex becomes

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path is \4, the file is \5, the querystring is \6, and the fragment is \7.

//USING REGEX
/**
* Parse URL to get information
*
* @param url the URL string to parse
* @return parsed the URL parsed or null
*/
var UrlParser = function (url) {
"use strict";

var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
matches = regx.exec(url),
parser = null;

if (null !== matches) {
parser = {
href : matches[0],
withoutHash : matches[1],
url : matches[2],
origin : matches[3],
protocol : matches[4],
protocolseparator : matches[5],
credhost : matches[6],
cred : matches[7],
user : matches[8],
pass : matches[9],
host : matches[10],
hostname : matches[11],
port : matches[12],
pathname : matches[13],
segment1 : matches[14],
segment2 : matches[15],
search : matches[16],
hash : matches[17]
};
}

return parser;
};

var parsedURL=UrlParser(url);
console.log(parsedURL);