[Python Web Scraping Study Notes] 5. Regular Expressions

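Advantages of regular expressions

Conciseness: a regular expression can describe the shared features of a whole set of strings very compactly ("one line is worth a thousand words"). It is mainly used for string matching.

Using regular expressions

Compilation: converting a string that follows regular expression syntax into a regular expression pattern (in Python, typically with re.compile).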

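The compiled pattern corresponds to a whole set of strings. Before compilation there is only a single string that happens to follow regular expression syntax; it is not yet a regular expression in the true sense.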
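As a minimal sketch of the compile step, assuming Python's standard re module and using a 6-digit postal code pattern purely for illustration (the variable names are hypothetical):

```python
import re

# A raw string that follows regex syntax; on its own it is just text.
pattern_text = r"[1-9]\d{5}"

# Compiling turns the string into a reusable pattern object.
postcode_re = re.compile(pattern_text)

# The compiled object stands for a whole set of strings: every
# 6-digit sequence whose first digit is not 0.
samples = ["100081", "012345", "BIT 100081"]
results = [bool(postcode_re.fullmatch(s)) for s in samples]
print(results)  # [True, False, False]
```

fullmatch is used here so that the whole sample must fit the pattern; search would instead find the pattern anywhere inside a longer string.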

Regular expression syntax

A regular expression is built from characters and operators.

  • Operators:

[Figure: table of regular expression operators]

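Note: \w matches A-Z, a-z, 0-9, and the underscore. In Python 3 this holds exactly only with the re.ASCII flag; by default, \w in a str pattern also matches Unicode word characters such as Chinese characters.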
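The behaviour of \w can be checked directly. The sketch below assumes Python 3 and a made-up sample string; it compares the default (Unicode-aware) matching with the re.ASCII flag, which restricts \w to A-Z, a-z, 0-9, and the underscore:

```python
import re

text = "user_01@例子.com"  # hypothetical sample input

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['user_01', '例子', 'com']
print(ascii_words)    # ['user_01', 'com']
```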

Regular expression syntax examples

P(Y|YT|YTH|YTHO)?N : the group Y, YT, YTH, or YTHO appears 0 or 1 times, so this matches PN, PYN, PYTN, PYTHN, and PYTHON

PYTHON+ : the final N repeats 1 or more times (PYTHON, PYTHONN, ...)

PY[TH]ON : square brackets give the allowed values for a single character, so this position is T or H (PYTON or PYHON)

PY[^TH]ON : this position must hold exactly one character, and it cannot be T or H

PY[^TH]?ON : this position cannot be T or H, and it may also be empty (PYON)

PY{0,3}N : the character before the braces, Y, repeats 0 to 3 times (PN, PYN, PYYN, PYYYN). Python writes this range as {0,3} or {,3}; {:3} is not quantifier syntax in the re module
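The syntax examples can be tried out with a small helper. This is only a sketch: the accepts function and the candidate list are made up for illustration, the patterns are written without inner spaces, and the last one uses Python's {0,3} form for "0 to 3 times":

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["PN", "PYN", "PYTHON", "PYTHONN", "PYTON", "PYHON",
              "PYXON", "PYON", "PYYYN", "PYYYYN"]

optional_group = accepts(r"P(Y|YT|YTH|YTHO)?N", candidates)
one_or_more = accepts(r"PYTHON+", candidates)
char_class = accepts(r"PY[TH]ON", candidates)
negated_class = accepts(r"PY[^TH]ON", candidates)
optional_negated = accepts(r"PY[^TH]?ON", candidates)
bounded_repeat = accepts(r"PY{0,3}N", candidates)

print(optional_group)    # ['PN', 'PYN', 'PYTHON']
print(one_or_more)       # ['PYTHON', 'PYTHONN']
print(char_class)        # ['PYTON', 'PYHON']
print(negated_class)     # ['PYXON']
print(optional_negated)  # ['PYXON', 'PYON']
print(bounded_repeat)    # ['PN', 'PYN', 'PYYYN']
```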

Classic regular expression examples

^[A-Za-z]+$ : a string made up only of the 26 English letters; ^ and $ anchor the start and end of the string

^[A-Za-z0-9]+$ : a string made up only of the 26 English letters and digits; ^ and $ anchor the start and end of the string

^-?\d+$ : a string in integer form

[1-9]\d{5} : a postal code in mainland China, 6 digits

\d{3}-\d{8}|\d{4}-\d{7} : a domestic (Chinese) phone number

[\u4e00-\u9fa5] : a Chinese character (by Unicode code point)

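Chinese characters occupy the Unicode code points U+4E00 to U+9FA5 (these are code points, not UTF-8 byte values), a range of 20,902 characters.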
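A few of the classic patterns can be exercised as follows. This is a sketch with made-up sample strings and constant names, assuming Python 3, where \uXXXX escapes inside a str pattern are interpreted by the re module:

```python
import re

LETTERS = re.compile(r"^[A-Za-z]+$")
INTEGER = re.compile(r"^-?\d+$")
PHONE = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")
CHINESE = re.compile(r"[\u4e00-\u9fa5]")

letters_ok = bool(LETTERS.match("Python"))    # True
letters_bad = bool(LETTERS.match("Python3"))  # False: contains a digit
integer_ok = bool(INTEGER.match("-42"))       # True
integer_bad = bool(INTEGER.match("4.2"))      # False: contains a dot

# search finds the pattern anywhere inside a longer text.
phone = PHONE.search("Tel: 010-68913536").group()  # '010-68913536'

# findall returns every single Chinese character in the text.
chinese_chars = CHINESE.findall("Python爬虫2024")   # ['爬', '虫']

# Both endpoints are included, hence the + 1.
range_size = 0x9FA5 - 0x4E00 + 1  # 20902
```

The anchored patterns (^ ... $) validate a whole string, while the unanchored ones are better suited to extracting pieces from page text.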

  • Regular expression for matching an IP address (an IPv4 address has 4 segments, each in the range 0-255)

((25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)

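Note: in 1\d{2}|[1-9]?\d, the order of the alternatives on either side of | must not be swapped. Alternatives are tried from left to right and the first one that succeeds is used, so the three-digit case has to come first. With the order reversed, the last segment, which has nothing after it to force backtracking, would stop after two digits of a three-digit number.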
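The effect of the alternative order can be seen in a short sketch. The names OCTET, OCTET_REVERSED, IPV4, and IPV4_REVERSED are hypothetical; IPV4 is assembled to be the same expression as the one given in the text, and the sample text is made up:

```python
import re

# One 0-255 segment, with the three-digit alternatives first.
OCTET = r"(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)"
# The same alternatives in reversed order, for comparison only.
OCTET_REVERSED = r"([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"

IPV4 = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)
IPV4_REVERSED = re.compile(r"(" + OCTET_REVERSED + r"\.){3}" + OCTET_REVERSED)

text = "host 192.168.1.100 is up"

good = IPV4.search(text).group()
truncated = IPV4_REVERSED.search(text).group()

print(good)       # 192.168.1.100
print(truncated)  # 192.168.1.10

# For validating a whole string, fullmatch rejects out-of-range segments.
print(IPV4.fullmatch("256.1.1.1"))  # None
```

In the first three segments the following \. forces backtracking, so the order only changes the result for the last segment. When the whole string must be an address, fullmatch avoids partial matches such as 56.1.1.1 inside 256.1.1.1.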