Scraper is a toy to extract web data just by mouse.
- All css styles and layouts come from https://dexi.io.
- All images including logo comes from https://dexi.io.
- All functions come from https://dexi.io. but are self-implemented.
Dexi.io provides such amazing tool, so I just want to make a little copy from it.
Here is a gif to show the little toy how to work.
- Open a web page in Scraper.
- Move mouse on the page to choose which part to extract.
- Or select DOM element in
Elements
tab like using Chrome Web Debug tool to select element. - All these operations will be written out to a robot definition file.
Run
means running theses operations defined in robot JSON file.Result
tab shows the extracted raw text.
example 1:
Features
- Not need coding, but need web concepts include html,XPath.
- using XPath selector , but dexi.io using CSS selector.
- using phantomjs engine at backend
Installation
To install scraper, you should install rails and download phantomjs.
After install rails.
For Centos ,Mac and Windows, you should update scraper project
cd <scraper root>
bundle update
then all dependencies will be installed.
For Mac, installing nokogiri gem may have some problem, but keep some patience.
If you cannot link http://rubygems.org or https://rubygems.org, you had better use proxy to update.
Set path for phantomjs
Edit ./config/scraper.yml
and modify the path.
The path must be like
development:
phantomjs_full_path : /home/ym/ym/phantomjs/bin/phantomjs
For Windows, the path is like this
development:
phantomjs_full_path : d:/work/bin/phantomjs.exe
Start scraper
Enter scraper root directory, start server.
cd <scraper root>
rails server
For Windows, you can run cmd.exe to open a cmdline window.
then server will start successfully like this below.
[ym@centos7 scraper]$ rails server
=> Booting WEBrick
=> Rails 4.2.6 application starting in development on http://localhost:3000
=> Run `rails server -h` for more startup options
=> Ctrl-C to shutdown server
[2016-11-03 17:59:46] INFO WEBrick 1.3.1
[2016-11-03 17:59:46] INFO ruby 2.2.1 (2015-02-26) [x86_64-linux]
[2016-11-03 17:59:46] INFO WEBrick::HTTPServer#start: pid=4052 port=3000
Then you can open chrome browser(not support IE or Firefox or Safari), and open http://127.0.0.1:3000.
Only support Chrome browser
You know scraper is a toy, so I just develop it using my favorite browser.
Documentation and Help
If you don't know how to play this little toy, you can search some document for dexi.io.
Contributing
This project is just a little toy, so it has much more bugs. If you find them ,please let me know.