wsheffel/PubMed-Text-Mining-Tool

Project Name:
A Simple Text Mining Tool for Analyzing Research Paper Abstracts

Description:
This project is a text mining tool that works on search results from the
National Center for Biotechnology Information's PubMed database
(http://www.ncbi.nlm.nih.gov/pubmed).
It uses Perl and Python for text processing and statistical analysis.

Modules and files (partial list):
pubmed_result.txt     -- results downloaded from PubMed
export.csv            -- results exported from IEEE Xplore
raw_data.json         -- results selected from pyspider sqlite database
keywords.txt          -- keywords provided by user
preProcess.pl         -- take pubmed_result.txt as input;
                         convert it into a format that later steps can
                         process easily
csvParser.py          -- take export.csv as input;
                         convert the data to JSON and append it to
                         raw_data.json;
                         only the IEEE Xplore format is supported
jsonParser.pl         -- take raw_data.json as input;
                         convert the data into the same format that
                         preProcess.pl produces
splitFunction.pl      -- core split function; shared by raw data parsers
myFormat.txt          -- generated by preProcess.pl;
                         jsonParser.pl appends its data after preProcess.pl
stem.pl               -- take myFormat.txt as input;
                         stem each word in every sentence
stemDict.txt          -- stemmed words and their corresponding original words;
                         generated by stem.pl
stemmedSentence.txt   -- stemmed words in sentences; generated by stem.pl
selectSentence.pl     -- take stemmedSentence.txt as input;
                         use stemKeyword.pl as a sub-module;
                         scan all stemmed sentences and select those that
                         contain the given keywords; if no keywords are
                         provided, use myFormat.txt as the result instead
stemKeyword.pl        -- take keywords.txt as input; stem the keywords;
                         helper module for selectSentence.pl
stemFunction.pl       -- core stem function; Porter stemmer
pmidList.txt          -- pmid list file; generated by selectSentence.pl
dict.py               -- take stemDict.txt as input; eliminate stop words and
                         compute simple statistics
stats_words.txt       -- stemmed words and their frequencies; generated by
                         dict.py
htmlGenerator.py      -- use pmidList.txt to generate a simple webpage for easy
                         database access
PMIDList.html         -- simple webpage containing titles and URLs
nextStep.py           -- access the original raw data; extract the original
                         entries of the articles listed in pmidList.txt; keep
                         the original format: MEDLINE or JSON
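
As a rough illustration of the kind of statistic dict.py is described as
producing (word frequencies with stop words removed), here is a minimal Python
sketch. It is not the project's code: the layout of stemDict.txt is not
documented here, so the sketch works on a plain list of words, and both the
function name word_frequencies and the tiny stop-word list are assumptions made
for the example.

```python
from collections import Counter

# Illustrative stop-word list (an assumption; dict.py may use a different one).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def word_frequencies(words, stop_words=STOP_WORDS):
    """Count words case-insensitively, skipping stop words and empty tokens."""
    counts = Counter()
    for word in words:
        token = word.strip().lower()
        if token and token not in stop_words:
            counts[token] += 1
    return counts


if __name__ == "__main__":
    # Made-up, already-stemmed sample words.
    sample = "the gene is express in the cell and the gene mutat".split()
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

Counter.most_common() returns the words sorted by descending frequency, which
is the order one would probably want in a file such as stats_words.txt.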

HOWTO:
= Generate Raw Data

==== pubmed_result.txt
1. Run a search on http://www.ncbi.nlm.nih.gov/pubmed.
2. Press "Send to" at the top right of the page and select "File" & "MEDLINE".
   Press "Create File".
3. Put the downloaded file "pubmed_result.txt" into the same directory as the
   scripts.
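
To show what the downloaded file looks like to a program, here is a minimal
Python sketch that reads MEDLINE-style text. It is not preProcess.pl and makes
no claim about how that script works. It assumes the usual MEDLINE layout: a
field tag in the first four columns followed by "- ", continuation lines
indented by six spaces, and a blank line between records. The function name
parse_medline is hypothetical, and the PMIDs in the usage note are
placeholders.

```python
def parse_medline(text):
    """Return a list of {tag: value} dicts, one per MEDLINE record.

    Repeated tags (for example AU) are joined with "; " to keep this short.
    """
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith("      ") and tag is not None:
            # Continuation of the previous field.
            current[tag] += " " + line.strip()
        elif len(line) > 5 and line[4] == "-":
            # New field: 4-column tag, "-", a space, then the value.
            tag = line[:4].strip()
            value = line[6:].strip()
            if tag in current:
                current[tag] = current[tag] + "; " + value
            else:
                current[tag] = value
    if current:
        records.append(current)
    return records
```

For example, parse_medline("PMID- 11111111\nTI  - A title\n") yields one record
with the keys "PMID" and "TI".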

==== raw_data.json
1. Go to pyspider's data folder
2. Type "sqlite3 result.db" in the command line
3. Type ".output raw_data.json", then
   "select result from resultdb_YourProjectName;"
   (You may want to type ".table" first to check the name of the current
   resultdb table before selecting.)
4. Type ".quit" and copy "raw_data.json" into the same directory as the
   scripts.
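
The same export can probably be scripted instead of typed into the sqlite3
shell. The following Python sketch uses the standard sqlite3 module and writes
one result per line, which is what the shell's default output mode produces for
a single-column select. The function name export_results is hypothetical, and
the table name must be replaced with your own resultdb table.

```python
import sqlite3


def export_results(db_path, table, out_path):
    """Write the `result` column of `table` to out_path, one row per line.

    Returns the number of rows written.
    """
    # Table names cannot be bound as SQL parameters, so validate the name
    # before putting it into the statement.
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name: %r" % table)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute('SELECT result FROM "{}"'.format(table)).fetchall()
    finally:
        conn.close()
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for (result,) in rows:
            if result is None:
                continue
            # The column may come back as bytes or text depending on how it
            # was stored (an assumption; check your own database).
            if isinstance(result, bytes):
                result = result.decode("utf-8")
            out.write(str(result) + "\n")
            written += 1
    return written
```

A call would look like
export_results("result.db", "resultdb_YourProjectName", "raw_data.json").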

==== export.csv
1. Run a search on http://ieeexplore.ieee.org/Xplore/home.jsp
2. Press "Export to CSV" to the right of "Download Citations".
3. Rename the file to "export.csv" and save it into the same directory as
   the scripts.
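
For orientation, this is roughly what "convert the CSV to JSON and append it to
raw_data.json" can look like in Python. It is a sketch, not csvParser.py: it
assumes the CSV starts with a header row and writes one JSON object per line,
and the function name append_csv_as_json is hypothetical. The JSON layout that
jsonParser.pl actually expects is not documented here.

```python
import csv
import json


def append_csv_as_json(csv_path, json_path):
    """Append each CSV row to json_path as one JSON object per line.

    Column names are taken from the CSV header row. Returns the row count.
    """
    count = 0
    # newline="" lets the csv module handle quoted fields with line breaks.
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(json_path, "a", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Opening the output in "a" mode keeps whatever is already in raw_data.json, so
results from pyspider and from IEEE Xplore can live in the same file.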

= Use This Tool

1. Type make<RETURN> in the command line; this may take several minutes,
   depending on the size of the raw data
2. Type make<SPACE>html<RETURN> in the command line to generate PMIDList.html.
   Open PMIDList.html in any web browser for easy access
3. Type make<SPACE>next<RETURN> in the command line to back up the current data
   and generate new raw data for a new round of make
4. Change the keywords in keywords.txt and go to step 1
5. If needed, revert manually using the time stamps in the backup folder
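
To give an idea of what the page from step 2 contains, here is a minimal
Python sketch that turns (PMID, title) pairs into a list of links. It is not
htmlGenerator.py: the layout of pmidList.txt is not documented here, so the
input is a plain list of pairs, the function name build_page is hypothetical,
and the link pattern simply appends the PMID to the PubMed URL given above.

```python
import html

# Assumed link pattern: the PubMed URL from this README plus the PMID.
PUBMED_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"


def build_page(entries):
    """Build a simple HTML page with one link per (pmid, title) pair."""
    items = []
    for pmid, title in entries:
        items.append('<li><a href="{}{}">{}</a></li>'.format(
            PUBMED_URL,
            html.escape(str(pmid), quote=True),  # safe inside the attribute
            html.escape(title),                  # safe as element text
        ))
    return "<html><body><ul>\n{}\n</ul></body></html>\n".format("\n".join(items))
```

Escaping the title matters because article titles often contain characters
such as "&" or "<" that would otherwise break the page.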

= Installation (Ubuntu as example):
#install perl, python and make.
#you can install build-essential too.
$sudo apt-get install perl python make

#install CPAN for perl modules
$sudo perl -MCPAN -e shell
#press <RETURN> until the installation is finished
$sudo cpan
cpan[1]> install Lingua::EN::Sentence
cpan[2]> install Unicode::Normalize
#quit cpan shell
cpan[3]> exit
#DONE

#install sqlite command line tool
$sudo apt-get install sqlite3
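
After installing, a quick way to confirm that the command line tools are on the
PATH is a few lines of Python. This is only a convenience sketch; the function
name missing_tools is hypothetical, and the tool names are the ones installed
above.

```python
import shutil


def missing_tools(tools=("perl", "python", "make", "sqlite3")):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```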

For details about pyspider, see:
https://github.com/binux/pyspider

LICENSE:
See LICENSE