ZhengZixiang / nlp_corpus

A list of NLP corpora, datasets, and other language toolkits

nlp_corpus

Corpus Resources

Datasets

Toolkits

RegExp List

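  • Email address
import re
# Non-capturing group so re.findall returns whole matches; ^...$ requires text to be a single address
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)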
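  • Mobile phone number
# Non-capturing group so re.findall returns whole numbers; ^...$ requires text to be a single number
cellphone_pattern = r'^(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)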
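  • ID card number
# Non-capturing groups so re.findall returns whole IDs; ^...$ requires text to be a single ID
IDCards_pattern = r'^[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX]$'
IDs = re.findall(IDCards_pattern, text, flags=0)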
  • QQ number
[1-9]([0-9]{5,11})
  • Domestic (mainland China) landline number
[0-9-()()]{7,18}
  • IP address
(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
  • Username
[A-Za-z0-9_\-\u4e00-\u9fa5]+
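
The bare patterns in this list carry no anchors, so how they are applied matters. The following is a minimal sketch, not part of the original repository, of one way to use them from Python. The helper names `is_valid` and `find_cellphones` are hypothetical, and the lookaround version of the cellphone pattern is an assumption made here so that numbers can be located inside longer text.

```python
import re

# Bare patterns copied unchanged from the list above.
QQ_PATTERN = r'[1-9]([0-9]{5,11})'
IP_PATTERN = (r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.'
              r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.'
              r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.'
              r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)')
USERNAME_PATTERN = r'[A-Za-z0-9_\-\u4e00-\u9fa5]+'


def is_valid(pattern, candidate):
    """Return True only if the entire candidate string matches the pattern.

    Hypothetical helper: re.fullmatch supplies the anchoring that the
    bare patterns do not contain themselves.
    """
    return re.fullmatch(pattern, candidate) is not None


# Assumption of this sketch: the cellphone pattern from the list with ^...$
# replaced by digit lookarounds and a non-capturing group, so that
# findall returns whole numbers found anywhere in the text.
CELLPHONE_IN_TEXT = re.compile(
    r'(?<!\d)(?:13[0-9]|14[0-9]|15[0-9]|17[0-9]|18[0-9])\d{8}(?!\d)'
)


def find_cellphones(text):
    """Return every 11-digit mobile number embedded in free text."""
    return CELLPHONE_IN_TEXT.findall(text)
```

Whole-string matching is the safer default for validation: `is_valid(IP_PATTERN, '256.1.1.1')` is False, whereas `re.search` with the same pattern would succeed by matching the substring `56.1.1.1`.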
