xuchen / pupsniffer

Automatically exported from code.google.com/p/pupsniffer

This is Pup (Parallel URL Pattern) Sniffer, an efficient multilingual Web corpus tool.

What is it

An implementation and enhancement based on the following paper (Kit and Ng 2007):

Chunyu Kit and Jessica Y. H. Ng. 2007. An intelligent Web agent to mine bilingual parallel pages via automatic discovery of URL pairing patterns. In 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops: Workshop on Agents and Data Mining Interaction (ADMI-07), pp. 526-529. Silicon Valley, California, November 2-5, 2007.

It discovers the URL patterns of parallel webpages and downloads them. For instance, it tells you that the following two webpages have the same content in different languages:

• English: http://www.legco.gov.hk/yr99-00/english/fc/esc/minutes/es061099.htm
• Chinese: http://www.legco.gov.hk/yr99-00/chinese/fc/esc/minutes/es061099.htm

Accuracy: It retrieves 98% true parallel webpages on 20 selected Hong Kong websites (Kit and Ng 2007).

Based on this original algorithm, we implemented three enhancement algorithms to crawl more credible bilingual Web pages. More detailed information about this work is available at:
http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/
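
To make the idea of a URL pairing pattern concrete, here is a minimal sketch. It is not the algorithm from Kit and Ng (2007) and not PupSniffer's own code: it assumes, for simplicity, that parallel pages differ in exactly one '/'-separated URL segment, and it counts which segment pair (such as english/chinese) recurs most often. The class and method names, and the page-a.htm and page-b.htm URLs, are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UrlPairPatternSketch {

    /**
     * Returns "x<->y" (x before y alphabetically) if the two URLs differ in
     * exactly one '/'-separated segment, otherwise null.
     */
    static String differingSegments(String urlA, String urlB) {
        String[] a = urlA.split("/");
        String[] b = urlB.split("/");
        if (a.length != b.length) {
            return null;
        }
        String pattern = null;
        for (int i = 0; i < a.length; i++) {
            if (!a[i].equals(b[i])) {
                if (pattern != null) {
                    return null; // more than one segment differs
                }
                // Normalize the order so both directions count as one pattern.
                pattern = a[i].compareTo(b[i]) < 0
                        ? a[i] + "<->" + b[i]
                        : b[i] + "<->" + a[i];
            }
        }
        return pattern;
    }

    /** Counts how often each candidate pattern occurs over all URL pairs. */
    static Map<String, Integer> countPatterns(List<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (int i = 0; i < urls.size(); i++) {
            for (int j = i + 1; j < urls.size(); j++) {
                String p = differingSegments(urls.get(i), urls.get(j));
                if (p != null) {
                    Integer seen = counts.get(p);
                    counts.put(p, seen == null ? 1 : seen + 1);
                }
            }
        }
        return counts;
    }

    /** Returns the most frequent pattern, or null if there is none. */
    static String bestPattern(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String base = "http://www.legco.gov.hk/yr99-00/";
        // The first two URLs come from the example above; the rest are made up.
        List<String> urls = Arrays.asList(
                base + "english/fc/esc/minutes/es061099.htm",
                base + "chinese/fc/esc/minutes/es061099.htm",
                base + "english/fc/esc/minutes/page-a.htm",
                base + "chinese/fc/esc/minutes/page-a.htm",
                base + "english/fc/esc/minutes/page-b.htm",
                base + "chinese/fc/esc/minutes/page-b.htm");
        Map<String, Integer> counts = countPatterns(urls);
        System.out.println("Candidate patterns: " + counts);
        System.out.println("Most frequent: " + bestPattern(counts));
    }
}
```

File-name differences between unrelated pages also show up as candidate patterns, but a genuine language marker recurs once per page pair, so it tends to dominate the counts as the number of pages grows.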

Data Set & Evaluation

Please visit the Pupsniffer evaluation website:

http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/

This website provides a Web interface for evaluating the bilingual Web pages collected by Pupsniffer. To run the evaluation, please register an account through the evaluation website at:
http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/register.jsp
After you complete the registration form, your account will be registered and we will send you an email about your role and rights on this website. It may take several hours before you receive this email. Once you have received the email and been granted a role and rights, you can log in to the website at:
http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/login.html
There are four kinds of evaluation as follows:
  1. Evaluation for Original Algorithm
  2. Evaluation for Incremental Algorithm of Bilingual Webpages Extracting
  3. Evaluation for Algorithm of Weak Keys Rescuing
  4. Evaluation for Algorithm of Bilingual Deep Webpages Detecting

Just select one of them. On the evaluation website, two lists of URL pairs are provided. The user can check these URL pairs through a convenient interface. If a URL pair DOES NOT correspond to a pair of bilingual Web pages, the user can select its check box.

Publications

Chengzhi Zhang, Xuchen Yao and Chunyu Kit. Finding More Bilingual Web Pages with High Credibility via Link Analysis. In: Proceedings of the 6th Workshop on Building and Using Comparable Corpora (BUCC2013). August 8, 2013, Sofia, Bulgaria.
Chunyu Kit and Jessica Y. H. Ng. 2007. An intelligent Web agent to mine bilingual parallel pages via automatic discovery of URL pairing patterns. In Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops: Workshop on Agents and Data Mining Interaction (ADMI-07), Silicon Valley, California, November 2-5, 2007.

Local Version Support (from v1.2)

From version 1.2, PupSniffer supports crawling from the file system. The reason is that the internal crawler used in PupSniffer is not a full-featured crawler: for instance, it doesn't follow web pages generated by JavaScript or Flash, and its download speed isn't satisfactory. External web crawling tools, such as wget or Apache Nutch, are therefore encouraged. When the crawling job is done, point PupSniffer to the directory where the pages were saved, and it will read the local web pages and analyze their URL patterns.
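
How PupSniffer maps saved files back to URLs is not described here. As a rough illustration only, the sketch below assumes the saved directory is laid out as host/path (the default layout of a recursive wget download) and that the pages were served over http. The class name, method names, and the "mirror" directory are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class LocalMirrorSketch {

    /**
     * Maps a saved page back to a URL, assuming the layout root/host/path
     * and the http scheme. Both are assumptions of this sketch.
     */
    static String toUrl(File root, File page) {
        String rootPath = root.getAbsolutePath();
        String pagePath = page.getAbsolutePath();
        String prefix = rootPath.endsWith(File.separator)
                ? rootPath
                : rootPath + File.separator;
        if (!pagePath.startsWith(prefix)) {
            throw new IllegalArgumentException(page + " is not under " + root);
        }
        String relative = pagePath.substring(prefix.length());
        // Use '/' in the URL regardless of the platform's file separator.
        return "http://" + relative.replace(File.separatorChar, '/');
    }

    /** Recursively collects all regular files below dir. */
    static void collectPages(File dir, List<File> out) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collectPages(child, out);
            } else {
                out.add(child);
            }
        }
    }

    /** Returns the reconstructed URL of every file saved under root. */
    static List<String> collectUrls(File root) {
        List<File> pages = new ArrayList<File>();
        collectPages(root, pages);
        List<String> urls = new ArrayList<String>();
        for (File page : pages) {
            urls.add(toUrl(root, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // "mirror" is a placeholder; pass the real download directory instead.
        File root = new File(args.length > 0 ? args[0] : "mirror");
        for (String url : collectUrls(root)) {
            System.out.println(url);
        }
    }
}
```

The resulting URL list could then be fed to a pattern analysis step; the directory that PupSniffer itself reads is set in config.txt.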

How to Run it

Modify config.txt accordingly. You must have Java 1.6 to run it.

Under Linux/Mac: ./run.sh
Under Windows: run.cmd

Build Instructions

You can do either:

  1. Set up an Eclipse project with this package. Eclipse builds Pup Sniffer for you.
  2. Run "ant jar" if you have Ant installed.

Note

  1. All source code and data sets of Pupsniffer are free and released under the GNU/GPL License.
  2. Pupsniffer is language-independent. You can modify 'config.txt' in Pupsniffer to crawl bilingual URL pairs according to your requirements.




