BITCS-Information-Retrieval-2020 / search-bit-tdxl

search-bit-tdxl created by GitHub Classroom


Retrieval Module - TDXL

Project Overview

We build a comprehensive search engine for academic papers: for each paper, users can retrieve not only the PDF but also multimodal information such as oral-presentation videos, datasets, and source code. This repository implements the retrieval module, which is exposed in two ways: a URL-based HTTP API and a pip-installable Python package.

TDXL Team Responsibilities

| Name | Student ID | Responsibility |
| --- | --- | --- |
| 王洪飞 | 3520180030 | Elasticsearch and server runtime environment setup |
| 高佳蕊 | 3520200005 | PDF-to-text module |
| 冯昭凯 | 5720202085 | MongoDB-to-Elasticsearch import module |
| 刘君 | 3520190034 | Wrapping Elasticsearch queries as a URL API |
| 刘旭冬 | 3520190035 | Elasticsearch query module |
| 王大为 | 3520190039 | URL interface packaging and pip packaging |
| 纪校锋 | 3120201030 | Video-to-text module |

Data Processing

We mainly processed data from two sites provided by the crawler team:

  • CrossMinds: video (partial), PDF (partial), abstract, title, publication date, and other fields
  • ACL Anthology: video (partial), PDF (partial), dataset (partial), abstract, title, publication date, and other fields
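Each crawled record is normalized into a single indexed document. An illustrative sketch of the document shape (field names follow the search-response example below; values are placeholders):

    # Illustrative document shape; values are placeholders, and optional
    # fields may be empty strings for papers that lack them
    record = {
        "title": "...",
        "abstract": "...",
        "author": "...",
        "publish_at": "2020",
        "publisher": "...",
        "pdf": "http://.../paper.pdf",         # partial coverage
        "video": "http://.../talk.mp4",        # partial coverage
        "dataset_url": "http://.../data.zip",  # ACL Anthology only
    }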

Usage

  • URL API

    1. Keyword suggestion
    Endpoint details
    URL: http://39.96.43.48/suggestion
    Method: GET

    Example URL

    http://39.96.43.48/suggestion?keyword=A


    Request parameters

    | Field | Description | Type | Notes | Required |
    | --- | --- | --- | --- | --- |
    | keyword | Search keyword | string | | |

    Example response

     [
         {"value": "Audrey Acken article"},
         {"value": "Audrey Acken"}
     ]
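    A quick way to exercise this endpoint from Python (a minimal sketch; any HTTP client works):

      import requests

      # Query the suggestion endpoint; requests URL-encodes the keyword
      resp = requests.get("http://39.96.43.48/suggestion",
                          params={"keyword": "A"}, timeout=4)
      print(resp.json())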
    2. Search endpoint
    Endpoint details
    URL: http://39.96.43.48/search
    Method: GET

    Example URL

    http://39.96.43.48/search?theme=author&query=uclanlp&page=1&order=1&year=2020


    Request parameters

    | Field | Description | Type | Notes | Required |
    | --- | --- | --- | --- | --- |
    | query | Search keyword | string | | |
    | theme | Search topic | string | | |
    | page | Page number | string | | |
    | order | Sort order | string | | |
    | year | Publication year filter | string | | |

    Example response
     {
       "code": 1, 
       "data": [
         {
           "abstract": "With the COVID-19 pandemic raging world-wide since the beginning of the 2020 decade, the need for monitoring systems to track relevant information on social media is vitally important. This paper describes our submission to the WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. We investigate the effectiveness for a variety of classification models, and found that domain-specific pre-trained BERT models lead to the best performance. On top of this, we attempt a variety of ensembling strategies, but these attempts did not lead to further improvements. Our final best model, the standalone CT-BERT model, proved to be highly competitive, leading to a shared first place in the shared task. Our results emphasize the importance of domain and task-related pre-training.", 
           "author": "Lilei,Hanmeimei", 
           "dataset_url": "http://39.96.43.48/static/dadataset/W18-0516.Datasets.zip", 
           "keyword_in_video_time": "00:2:58", 
           "pdf": "http://39.96.43.48/static/pdf/2020.wnut-1.47.pdf", 
           "publish_at": "2018", 
           "publisher": "Association for Computational Linguistics", 
           "title": "Pace English", 
           "video": "http://39.96.43.48/static/video/123.mp4"
         }, 
         {
           "abstract": "With the COVID-19 pandemic raging world-wide since the beginning of the 2020 decade, the need for monitoring systems to track relevant information on social media is vitally important. This paper describes our submission to the WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. We investigate the effectiveness for a variety of classification models, and found that domain-specific pre-trained BERT models lead to the best performance. On top of this, we attempt a variety of ensembling strategies, but these attempts did not lead to further improvements. Our final best model, the standalone CT-BERT model, proved to be highly competitive, leading to a shared first place in the shared task. Our results emphasize the importance of domain and task-related pre-training.", 
           "author": "Lilei,Hanmeimei", 
           "dataset_url": "http://39.96.43.48/static/dadataset/W18-0516.Datasets.zip", 
           "keyword_in_video_time": "00:2:58", 
           "pdf": "", 
           "publish_at": "2018", 
           "publusher": "Association for Computational Linguistics", 
           "titile": "Pace English", 
           "video": ""
         }
       "total": 2
       ]
     }
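    Each page returns at most 10 hits (see the query module below) and the response carries a total count, so results can be paged through exhaustively; a minimal sketch assuming those two conventions:

      import requests

      def search_all(query, **filters):
          # Page through /search until "total" hits have been collected;
          # assumes 10 hits per page and a top-level "total" count
          hits, page = [], 1
          while True:
              params = {"query": query, "page": page, **filters}
              data = requests.get("http://39.96.43.48/search",
                                  params=params, timeout=4).json()
              hits.extend(data["data"])
              if not data["data"] or len(hits) >= data["total"]:
                  return hits
              page += 1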
  • pip package (pip install search_utils-1.0.0.tar.gz)

      In your project directory, import the package with `import search_utils`.
    
    Source code and usage example
      import requests


      def _build_url_params(url, params):
          # Build the query string by hand; note that values are not
          # URL-encoded, so keep keywords to simple ASCII
          url = url + '?'
          for k, v in params.items():
              url += '{}={}&'.format(k, v)
          return url.rstrip('&')


      def get_suggestion(keyword):
          # Keyword-suggestion endpoint
          url = "http://39.96.43.48/suggestion"
          if not keyword:
              return

          params = {
              'keyword': keyword
          }
          request_url = _build_url_params(url, params)
          return requests.get(request_url, timeout=4).json()


      def search(query, theme=None, page=None, order=None, year=None):
          # Search endpoint; only non-empty filters are sent
          url = "http://39.96.43.48/search"
          if not query:
              return

          params = {
              'query': query
          }
          if theme:
              params['theme'] = theme
          if page:
              params['page'] = page
          if order:
              params['order'] = order
          if year:
              params['year'] = year
          request_url = _build_url_params(url, params)
          return requests.get(request_url, timeout=4).json()


      if __name__ == '__main__':
          # Keyword suggestion
          print(get_suggestion("Audrey"))
          theme = "author"
          query = "uclanlp"
          page = 1
          order = 1
          year = 2020
          # Search
          print(search(query, theme=theme, page=page, order=order, year=year))
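    A variant sketch, not part of the released package: requests can build and percent-encode the query string itself via its params argument, which avoids the hand-rolled _build_url_params above:

      import requests

      def search_encoded(query, **filters):
          # Hypothetical variant of search(): let requests URL-encode
          # the parameters instead of concatenating them by hand
          params = {'query': query}
          params.update({k: v for k, v in filters.items() if v})
          return requests.get("http://39.96.43.48/search",
                              params=params, timeout=4).json()

      print(search_encoded("uclanlp", theme="author", year=2020))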

Overview of the Other Modules

  • MongoDB import into Elasticsearch
      Uses the pymongo and elasticsearch packages to connect the two stores and import the fields above (and others) into Elasticsearch.
    
    Source code example
      import pymongo
      from elasticsearch import Elasticsearch

      ES = ['127.0.0.1:9200']
      # Create the Elasticsearch client
      es = Elasticsearch(
          ES,
          # Sniff the cluster nodes on startup
          sniff_on_start=True,
          # Refresh node info when a node connection fails
          sniff_on_connection_fail=True,
          # Refresh node info every 60 seconds
          sniffer_timeout=60)

      def insert_es(id_, datas):
          # ignore=[400, 409]: skip mapping errors and duplicate ids
          es.create(index="search", doc_type="doc",
                    id=id_, ignore=[400, 409], body=datas)

      def mongo_to_es():
          i = 0
          myclient = pymongo.MongoClient("mongodb://localhost:27017/")
          mydb = myclient["crossmind"]
          mycol = mydb["crossmind"]
          for y in mycol.find():
              # Drop Mongo's ObjectId and flatten nested fields
              y.pop('_id')
              y['author'] = y['author']['name']
              y['description'] = ''.join([aa.replace('\n', '') for aa in y["description"]])
              # pdf_to_text comes from the PDF-to-text module below
              pdf_list = pdf_to_text(y['pdf_path'].split('/')[-1])
              if pdf_list:
                  y['pdf_text'] = ''.join([aa.replace('\n', '') for aa in pdf_list])
              insert_es(i, y)
              i += 1
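    Indexing one document per es.create() call round-trips to the cluster for every record; a hedged alternative sketch using the elasticsearch.helpers.bulk API (assuming the same es client and a list docs of cleaned records):

      from elasticsearch.helpers import bulk

      def bulk_insert_es(docs):
          # Stream all documents to Elasticsearch in batched bulk requests
          actions = (
              {"_index": "search", "_type": "doc", "_id": i, "_source": d}
              for i, d in enumerate(docs)
          )
          bulk(es, actions)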
  • PDF to text
      Uses pdfminer to convert PDFs to plain text.
    
    Source code example
      from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
      from pdfminer.converter import PDFPageAggregator
      from pdfminer.layout import LAParams
      from pdfminer.pdfparser import PDFParser
      from pdfminer.pdfdocument import PDFDocument
      from pdfminer.pdfpage import PDFPage

      # Open the PDF document
      fp = open('5ee96b86b1267e24b0ec2354.pdf', 'rb')
      # Create a parser tied to the file
      parser = PDFParser(fp)
      # Create the document object and connect it to the parser
      doc = PDFDocument(parser=parser)
      parser.set_document(doc=doc)
      # For an encrypted PDF, supply the password here
      # doc._initialize_password()
      # Create a PDF resource manager
      resource = PDFResourceManager()
      # Layout analysis parameters
      laparam = LAParams()
      # Create an aggregator device
      device = PDFPageAggregator(resource, laparams=laparam)
      # Create a page interpreter
      interpreter = PDFPageInterpreter(resource, device)
      # Iterate over the pages
      for page in PDFPage.get_pages(fp):
          # Process the page with the interpreter
          interpreter.process_page(page)
          # Fetch the aggregated layout from the device
          layout = device.get_result()
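    The loop above stops at the aggregated layout object; to actually pull the text out, each page's layout can be iterated for text boxes (a minimal continuation, assuming pdfminer's LTTextBox layout class):

      from pdfminer.layout import LTTextBox

      # Inside the page loop: collect the text of every text box
      page_text = []
      for obj in layout:
          if isinstance(obj, LTTextBox):
              page_text.append(obj.get_text())
      print(''.join(page_text))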
  • Video to text
      Uses tencentcloud-sdk-python to convert videos to text, recording when each utterance occurs in the video.
    
    Source code example
     def video_to_text(video_url,video_id):
         """
         input:
             video_url
             video_id
         output:
             results 
                 format: [{time_start:xxx,time_end:xxx,text:xxx,vid:xxx},...]
         """
         from tencentcloud.common import credential
         from tencentcloud.common.profile.client_profile import ClientProfile
         from tencentcloud.common.profile.http_profile import HttpProfile
         from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException 
         from tencentcloud.asr.v20190614 import asr_client, models 
         import base64
         import io 
         import time
         import sys 
         if sys.version_info[0] == 3:
             sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf-8')
         # post the audio to tencent cloud platform
         try: 
             # <Your SecretId><Your SecretKey> of tencent cloud
             cred = credential.Credential("*****", "****") 
             # set http request
             httpProfile = HttpProfile()
             httpProfile.endpoint = "asr.tencentcloudapi.com"
             clientProfile = ClientProfile()
             clientProfile.httpProfile = httpProfile
             clientProfile.signMethod = "TC3-HMAC-SHA256"  
             client = asr_client.AsrClient(cred, "ap-beijing", clientProfile) 
             # set params and create recognition task
             req = models.CreateRecTaskRequest()
             params = {"ChannelNum":1,"ResTextFormat":0,"SourceType":0}
             req._deserialize(params)
             req.EngineModelType = "16k_en"
             req.Url = video_url
             resp = client.CreateRecTask(req)
             # set result query
             req = models.DescribeTaskStatusRequest()
             params = '{"TaskId":%s}'%resp.Data.TaskId
             req.from_json_string(params)
             # polling
             resp = client.DescribeTaskStatus(req)
             while resp.Data.Status != 2:
                 print("video ID: "+str(video_id)+" is "+resp.Data.StatusStr,flush=True)
                 time.sleep(1)
                 resp = client.DescribeTaskStatus(req)
             # process the result
             print("video ID: "+str(video_id)+" is "+resp.Data.StatusStr,flush=True)
             results = []
             lines = resp.Data.Result.split('\n')
             for line in lines:
                 if len(line) < 1:
                     continue
              # Each line looks like "[start,end]  text"; use a name that
              # does not shadow the time module imported above
              span, text = line.split(']')[0][1:], line.split(']')[1][2:-1]
              result_dict = {} 
              result_dict["time_start"], result_dict["time_end"] = span.split(',')[0], span.split(',')[1]
                 result_dict["text"] = text
                 result_dict["vid"] = video_id
                 results.append(result_dict)
             return results
         except TencentCloudSDKException as err: 
             print(err) 
      r = video_to_text("http://39.96.43.48/static/video/123.mp4", 0)
      print(r)
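    The search response exposes a keyword_in_video_time field; a sketch of how such a lookup could sit on top of video_to_text's output (find_keyword_time is a hypothetical helper, not the project's actual implementation):

      def find_keyword_time(results, keyword):
          # Return the start time of the first transcript segment
          # mentioning the keyword, or None if it never appears
          for seg in results or []:
              if keyword.lower() in seg["text"].lower():
                  return seg["time_start"]
          return None

      # Reuses the transcript r produced above
      print(find_keyword_time(r, "BERT"))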
  • Elasticsearch query module
      Assembles a complete Elasticsearch query from the given search conditions.
    
    Source code example
      from elasticsearch import Elasticsearch

      es = Elasticsearch(['127.0.0.1:9200'])

      def search_es(key_word, themes='', order='1', year='', page=1):
          # Base query body
          query = {"query": {"bool": {"should": []}}}
          if themes:
              # Themes were given: search only the requested fields
              themes = themes.split(',')
              for theme in themes:
                  # The public theme "abstract" maps to the ES field "description"
                  theme = theme if theme != 'abstract' else "description"
                  query["query"]["bool"]["should"].append(
                      {"match": {theme: key_word}})
          else:
              # No theme: search author, abstract, and title by default
              query["query"]["bool"]["should"].append(
                  {"match": {"author": key_word}})
              query["query"]["bool"]["should"].append(
                  {"match": {"description": key_word}})
              query["query"]["bool"]["should"].append({"match": {"title": key_word}})
          # Add a date-range filter if a year was given
          if year:
              start_time, end_time = parse_year(year)  # parse_year defined elsewhere
              if start_time and end_time:
                  query["query"]["bool"]["must"] = [
                      {"range": {"published_at": {"gte": start_time, "lte": end_time}}}]
              query["sort"] = {"published_at": {"order": "asc"}
                               } if order == "1" else {"published_at": {"order": "desc"}}
          # Pagination: 10 hits per page
          query["size"] = 10
          query["from"] = (page - 1) * 10
          ret = es.search(body=query, index='ceshi3')
          return ret['hits']['hits'], ret['hits']['total']

      if __name__ == '__main__':
          theme = "abstract"
          order = "2"
          year = "2020"
          page = 1
          key_word = 'HazyResearch'
          search_es(key_word, theme, order, year, page)
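    parse_year is referenced above but not shown; a hypothetical reconstruction, assuming published_at is indexed as a year string as in the response examples:

      def parse_year(year):
          # Hypothetical helper: turn "2020" into an inclusive
          # (start, end) range; returns (None, None) on bad input
          try:
              y = int(year)
              return str(y), str(y)
          except (TypeError, ValueError):
              return None, None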
