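Java: Using a Regular Expression to Extract the Protocol, Domain, and Image Extension from the src Attribute of an img Tag
Regular expression (written as a Java string literal): src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)
Sample code: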
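import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The wrapper class and method names are illustrative; url is the HTML or text to scan.
public class ImgSrcDemo {
    static void printImageSources(String url) {
        Pattern pattern = Pattern.compile("src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");
        Matcher matcher = pattern.matcher(url);
        while (matcher.find()) {
            System.out.println("-------------------");
            String host = matcher.group(4); // group 4 is the domain
            // Rebuild the URL: protocol + slashes + domain + slash + path + "." + extension
            String imgUrl = matcher.group(2) + matcher.group(3) + matcher.group(4) + matcher.group(5) + matcher.group(6) + "." + matcher.group(7);
            System.out.println(host);
            System.out.println(imgUrl);
        }
    }
}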
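The title mentions the protocol, the domain, and the image extension, while the sample code prints only the host and the rebuilt URL. The following is a minimal sketch of a helper that returns all three values per match. The class name `ImgSrcExtractor` and the method `extract` are hypothetical names introduced for illustration; the pattern is the one from this article, and the sample URL is taken from the run output further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative only.
class ImgSrcExtractor {
    // Same pattern as in the article, compiled once and reused.
    static final Pattern IMG_SRC = Pattern.compile(
        "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Returns one {protocol, domain, extension} triple per match,
    // taken from groups 2, 4, and 7 respectively.
    static List<String[]> extract(String html) {
        List<String[]> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            result.add(new String[] { m.group(2), m.group(4), m.group(7) });
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg\" />";
        for (String[] parts : extract(html)) {
            // prints: http: | static.228.cn | jpg
            System.out.println(String.join(" | ", parts));
        }
    }
}
```

Note that group 2 keeps the trailing colon of the protocol (`http:`), so a caller that wants the bare scheme name would have to strip it.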
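Regular expression breakdown: src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)
- “src=”: matches the literal text src= at the start of the attribute.
- $1 “(\"|'| |)”: matches what directly follows src=, which may be a double quote, a single quote, a space, or nothing. Examples: src='https://*****.png'; src="https://*****.png"; src=https://*****.png; src= https://*****.png
- $2 “([\\S]{1,}?|[/]{1,})”: [\\S]{1,}? matches the protocol as one or more non-whitespace characters, non-greedily; [/]{1,} matches a leading slash or run of slashes. Image URLs are often written in the protocol-relative form //img.xxx.com/imgs/test.png so that they follow the protocol of the hosting page. This part therefore accepts http:, https:, ftp:, or / (for a URL that starts with //, this group captures only the first slash).
- $3 “([/]{1,})”: matches the slashes after the protocol, for example the // in https:// and http://, or the second slash of a leading //.
- $4+$5 “(.+?)([/]{1,})”: for https://img.xxx.com/imgs/test.png, (.+?) matches from the slashes after the protocol up to the next /, and the text in between is the domain: $4=img.xxx.com; $5=/
- $6+$7 “(.+?)\\.(png|jpg|jpeg)”: for https://img.xxx.com/imgs/test.png, $6=imgs/test and $7=png. \\. matches a literal dot; the backslash is the escape character and is doubled because the pattern is written as a Java string literal.
- $8 “(\"|'| |/>)”: matches the end of the src attribute in the same way $1 matches its start; it accepts ", ', a space, or /> as the terminator.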
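To make the breakdown above concrete, here is a small sketch that prints every capture group for the example URL used in the list, https://img.xxx.com/imgs/test.png. The class name `ImgSrcGroups` and the method `groups` are hypothetical and exist only for this illustration; the pattern is unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class ImgSrcGroups {
    static final Pattern IMG_SRC = Pattern.compile(
        "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Returns the groups of the first match, indexed by group number
    // (index 0 is the whole match), or null when nothing matches.
    static String[] groups(String html) {
        Matcher m = IMG_SRC.matcher(html);
        if (!m.find()) {
            return null;
        }
        String[] g = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            g[i] = m.group(i);
        }
        return g;
    }

    public static void main(String[] args) {
        String[] g = groups("<img src='https://img.xxx.com/imgs/test.png'>");
        for (int i = 1; i < g.length; i++) {
            System.out.println("$" + i + " = " + g[i]);
        }
    }
}
```

For this input the groups line up with the list: $2 is `https:`, $3 is `//`, $4 is `img.xxx.com`, $5 is `/`, $6 is `imgs/test`, $7 is `png`, and both $1 and $8 are the single quote.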
Output of a run that, after the host and the rebuilt URL, also prints groups 1 to 7 for each match:
- -------------------
- path1
- host1/path1/name1.jpg
- '
- host1
- /
- path1
- /
- name1
- jpg
- -------------------
- -------------------
- paht2
- host2/paht2/name2.png
- '
- host2
- /
- paht2
- /
- name2
- png
- -------------------
- -------------------
- path3
- host3/path3/name3.png
- host3
- /
- path3
- /
- name3
- png
- -------------------
- -------------------
- imgsa.baidu.com
- //imgsa.baidu.com/exp/w=480/sign=306a19aebb3533faf5b6922698d2fdca/1ad5ad6eddc451daa8799c4bbcfd5266d1163286.jpg
- "
- /
- /
- imgsa.baidu.com
- /
- exp/w=480/sign=306a19aebb3533faf5b6922698d2fdca/1ad5ad6eddc451daa8799c4bbcfd5266d1163286
- jpg
- -------------------
- -------------------
- static.228.cn
- http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg
- "
- http:
- //
- static.228.cn
- /
- upload/Image/201705/1496220590906_8212_x
- jpg
- -------------------
- -------------------
- static.228.cn
- //static.228.cn/upload/Image/201705/1496220556164_5314_x.jpg
- "
- /
- /
- static.228.cn
- /
- upload/Image/201705/1496220556164_5314_x
- jpg
- -------------------
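The output above shows two behaviors worth guarding against. For a value without a protocol such as host1/path1/name1.jpg, group 2 captures host1 and group 4 captures path1, so group 4 is not the first path segment. For a protocol-relative value such as //static.228.cn/..., group 2 is a single slash. One possible guard, sketched below, trusts group 4 only when group 2 ends with a colon or is exactly one slash. This rule is an assumption added here, not part of the original code, and the names `ImgSrcHostGuard` and `hostOrNull` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical guard around the article's pattern; names are illustrative only.
class ImgSrcHostGuard {
    static final Pattern IMG_SRC = Pattern.compile(
        "src=(\"|'| |)([\\S]{1,}?|[/]{1,})([/]{1,})(.+?)([/]{1,})(.+?)\\.(png|jpg|jpeg)(\"|'| |/>)");

    // Returns group 4 of the first match when group 2 looks like a scheme
    // ("http:") or the first slash of a protocol-relative URL ("/").
    // Returns null when nothing matches or when group 2 is something else.
    static String hostOrNull(String html) {
        Matcher m = IMG_SRC.matcher(html);
        if (!m.find()) {
            return null;
        }
        String prefix = m.group(2);
        boolean hasScheme = prefix.endsWith(":");
        boolean protocolRelative = prefix.equals("/");
        return (hasScheme || protocolRelative) ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(hostOrNull("<img src='host1/path1/name1.jpg'>"));
        System.out.println(hostOrNull(
            "<img src=\"//static.228.cn/upload/Image/201705/1496220556164_5314_x.jpg\" />"));
        System.out.println(hostOrNull(
            "<img src=\"http://static.228.cn/upload/Image/201705/1496220590906_8212_x.jpg\" />"));
    }
}
```

With these inputs the first call yields null and the other two yield static.228.cn. Whether a value without a protocol should be rejected or treated differently depends on the pages being processed.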
Regular expression reference: the Runoob tutorial (菜鸟教程), http://www.runoob.com/java/java-regular-expressions.html