This is a project for the Practical Software Engineering course of ITM13
- This repository represents the whole project, not a container for partial solutions. Your implementations should be included in Java Resources/src.
- Please do not create unnecessary packages; if there is an existing package you could use - USE IT!
- Take code reviews seriously: submitted code should be reusable, easy to understand, and well documented.
- Try your best to produce "clean" code
- Always attach a description to your pull request, so that @MICSTI or I know the aim of your pull request without having to take a deep look at your code or commit messages.
- Root path to Java files will be src/at/fhj/itm/pswe
- Package PageCrawler: contains all files and the whole functionality of the page crawler
  - Subpackages LinkCrawler and WordAnalyzer, according to the different tasks of the algorithm
  - Each subpackage contains related packages such as Model, Business, Helper, ... (see the sketch after this list)
- Package Database: contains everything with regard to database access
  - Examples would be DAOs or connection classes.
- Package REST: like a standard WildFly package, contains the endpoints
  - May contain Helper and Business packages to format data
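As a minimal illustration of this layout, a class in the link-crawler model layer could be declared as follows. This is only a sketch: the class name CrawledLink is a hypothetical example, and lower-case package names are assumed per Java convention.

```java
// Hypothetical example class illustrating the package layout described above.
package at.fhj.itm.pswe.pagecrawler.linkcrawler.model;

public class CrawledLink {
    private final String url;

    public CrawledLink(String url) {
        this.url = url;
    }

    public String getUrl() {
        return url;
    }
}
```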
In case you get the "Filename too long" error message while cloning the repository on a Windows system, there is a git config command to fix it:
git config --system core.longpaths true
In case of editing front-end files (the /WebContent directory), please make use of the gulp automation toolkit.
Please do not edit the .min files; those are generated by gulp.
- Install gulp globally using the node package manager:
  $ npm install --global gulp
- Run gulp inside the /WebContent directory:
  $ gulp
- Position: Wildfly-Path/bin/result/crawl
- Naming: subdomain_domain_tld_MM_DD_YYYY-HH_MM
Example for pswengi.bamb.at, where crawling started on 23.11.2015 14:21:
pswengi_bamb_at_11_23_2015-14_21.txt
The filename can be gathered from the Init_LinkCrawler object with .getFilename();
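A minimal sketch of how such a filename can be produced with SimpleDateFormat. Init_LinkCrawler and its getFilename() method come from the repository; the buildFilename helper below is only an illustrative assumption:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class FilenameExample {

    // Illustrative helper: turns "pswengi.bamb.at" into
    // "pswengi_bamb_at_11_23_2015-14_21.txt" (pattern MM_DD_YYYY-HH_MM).
    static String buildFilename(String host, Date crawlStart) {
        String hostPart = host.replace('.', '_');
        String datePart = new SimpleDateFormat("MM_dd_yyyy-HH_mm").format(crawlStart);
        return hostPart + "_" + datePart + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(buildFilename("pswengi.bamb.at", new Date()));
    }
}
```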
- Line 1: URL
  http://pswengi.bamb.at
- Line 2: date of crawling the page (in dd:MM:yyyy)
  13:12:2015
- Lines 3/5/7/...: URL of the text that is located on the next line
  http://pswengi.bamb.at/article1.html
- Lines 4/6/8/...: gathered text from the "current" URL; repeats for each visited URL
  Here are some randomly generated words :P
- Last line: time how long the crawler was running (in hh:mm:ss)
  0:0:43
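A minimal sketch of reading this format back in, under the assumption that the file follows exactly the line layout above. CrawlResultReader and its variable names are hypothetical, not part of the repository:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrawlResultReader {

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("pswengi_bamb_at_11_23_2015-14_21.txt"));

        String rootUrl = lines.get(0);                  // line 1: URL
        String crawlDate = lines.get(1);                // line 2: date (dd:MM:yyyy)
        String duration = lines.get(lines.size() - 1);  // last line: runtime (hh:mm:ss)

        // Lines 3/4, 5/6, ... alternate between article URL and its text.
        Map<String, String> textsByUrl = new LinkedHashMap<>();
        for (int i = 2; i + 1 < lines.size() - 1; i += 2) {
            textsByUrl.put(lines.get(i), lines.get(i + 1));
        }

        System.out.println(rootUrl + " crawled on " + crawlDate
                + " in " + duration + ", " + textsByUrl.size() + " pages");
    }
}
```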
##Subsites
- Subsites (Word/Site Overview) are called via a servlet
- URL: TermStatistics/SiteOverview/{idOfSite}
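A minimal sketch of how such a servlet might extract the site ID from the request path, assuming a standard javax.servlet wildcard mapping. The class name SiteOverviewServlet and the response body are illustrative assumptions, not the repository's actual code:

```java
import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative mapping for TermStatistics/SiteOverview/{idOfSite}.
@WebServlet("/TermStatistics/SiteOverview/*")
public class SiteOverviewServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // getPathInfo() returns e.g. "/42" for .../SiteOverview/42
        String path = req.getPathInfo();
        if (path == null || path.length() < 2) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "missing site id");
            return;
        }
        int idOfSite = Integer.parseInt(path.substring(1));
        resp.getWriter().println("Overview for site " + idOfSite);
    }
}
```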
##Input Validation
- Due to CORS, the URL check may fail even if the URL is valid
- The crawl depth has to be at least 1
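A minimal sketch of the crawl-depth rule; the helper name validateCrawlDepth is a hypothetical illustration, not an existing method in the repository:

```java
// Hypothetical validation helper for the crawl-depth rule above.
public class InputValidation {

    static int validateCrawlDepth(int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("Crawl depth has to be at least 1, got: " + depth);
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println(validateCrawlDepth(3)); // ok
        // validateCrawlDepth(0) would throw IllegalArgumentException
    }
}
```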
##REST Endpoints
- /rest/action/crawler/{crawlerid}
  -> restart the crawler for an already saved website in the database
- /rest/article/{articleid}/words
  -> get all words, with amounts, from one article identified by its ID
- /rest/article/{articleid}/words/{num}
  -> get a limited number of words (limited by num), with amounts, from one article identified by its ID
- /rest/website
  -> all websites in the database
- /rest/website/{websiteid}/articles
  -> all articles on this website
- /rest/website/{websiteid}/words
  -> all words and amounts on this website
- /rest/website/{websiteid}/words/{num}
  -> all words and amounts on this website, limit the number of words by num
- /rest/website/{websiteid}/period/10.11.2015/30.11.2015
  -> words of one site in the given period
- /rest/word/{word}/websites
  -> all websites of a specific word with corresponding amounts
- /rest/word/{word}/period/10.11.2015/30.11.2015
  -> one word with all dates & amounts in the given period
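Since WildFly ships JAX-RS, an endpoint such as /rest/website/{websiteid}/words/{num} would typically look like the following sketch. The class and method names are illustrative assumptions, not the repository's actual resource classes, and the /rest prefix is assumed to come from an @ApplicationPath("/rest") application class:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative JAX-RS resource for the website endpoints listed above.
@Path("/website")
public class WebsiteEndpoint {

    @GET
    @Path("/{websiteid}/words/{num}")
    @Produces(MediaType.APPLICATION_JSON)
    public String wordsOfWebsite(@PathParam("websiteid") int websiteId,
                                 @PathParam("num") int num) {
        // A real implementation would query the Database package (e.g. a DAO)
        // and return the top `num` words with their amounts as JSON.
        return "[]"; // placeholder result
    }
}
```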