This is a very lightweight data-gathering framework for searching and collecting information from web sources such as Bing, Stack Overflow, and others.
# install this package from PyPI
pip install trawler
# or, for the latest code
pip install git+https://github.com/rrmerugu/trawler.git#egg=trawler
# install the Selenium components, including drivers (you need Chrome installed on your machine)
npm install selenium-standalone@latest -g
selenium-standalone install # installs the drivers
selenium-standalone start # starts the selenium server
pip install -r requirements/requirements.txt
from trawler import TrawlIt
trawl = TrawlIt(kw="MongoDB", generate_kws=True, browser="bing", method="requests")
#trawl = TrawlIt(kw="MongoDB", generate_kws=True, browser="bing")
trawl.run() # this will gather data for all generated keywords and save it to MongoDB
trawl.generated_keywords # access the generated keywords ['learning MongoDB', 'Programming with MongoDB', 'MongoDB tutorials' ]
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
# or
trawl = TrawlIt(kw="Python Exception Error", browser="stackoverflow")
trawl.run() # this will gather data and save it to MongoDB
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
trawl = TrawlIt(kw="django", browser="stackoverflow-doc")
trawl.run() # this will gather the topics from the stackoverflow documentation
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
trawl = TrawlIt(kw="django", browser="wordpress")
trawl.run() # this will gather data from WordPress
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
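Every example above ends with trawl.stop() so that no idle browser instance is left behind. If run() or later processing raises an exception, that call is skipped. One way to guard against this is a small context manager. This is a minimal sketch, not part of trawler: the running() helper and the FakeTrawl stand-in are hypothetical, and the only assumption about TrawlIt is that it exposes run(), stop() and data as shown above.

```python
from contextlib import contextmanager


@contextmanager
def running(trawl):
    """Run a trawl-like object and always call stop() afterwards.

    Hypothetical helper: `trawl` is only assumed to expose run() and stop().
    """
    try:
        trawl.run()
        yield trawl
    finally:
        trawl.stop()  # executes even if run() or the with-body raises


class FakeTrawl:
    """Stand-in used only to demonstrate the helper without a browser."""

    def __init__(self):
        self.data = []
        self.stopped = False

    def run(self):
        self.data = ["result"]

    def stop(self):
        self.stopped = True


fake = FakeTrawl()
with running(fake) as t:
    collected = list(t.data)  # read the data while the session is open
```

With the real package, the same pattern would presumably read `with running(TrawlIt(kw="django", browser="wordpress")) as trawl:`, assuming TrawlIt behaves as in the examples above.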
from trawler.browsers.wordpress import BrowseWordPress
stack = BrowseWordPress(max_page=1, base_url="http://econbrowser.com/")
# stack = BrowseWordPress(kw="invaana", max_page=1, base_url="http://econbrowser.com")
stack.search()
stack.data # returns the data
Currently, this framework supports automating searches with:
- Bing
- Bing Images
- Bing Keywords
- Stack Overflow
- Stack Overflow Documentation
- WordPress
Please read and understand https://advertise.bingads.microsoft.com/en-in/resources/policies/web-crawling before using this framework. Make sure you comply with the privacy and crawling policies of each website before you crawl it.
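One machine-readable signal a site may publish is its robots.txt file. The sketch below uses Python's standard urllib.robotparser to check whether a URL may be fetched. It is not part of trawler and is no substitute for reading a site's written policies. The robots.txt content, the "my-crawler" user agent and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `url` may be fetched by `user_agent` per `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


allowed_public = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://example.com/posts/1")
allowed_private = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://example.com/private/x")
```

For a live site, RobotFileParser can also load the file itself via set_url() followed by read().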