- Java 8
- Scala > 2.10
- Fast (branch SPARSE_DATASTRUCTURE), ListFlattener, WebPageTraverserWrapper
The system generates sitemaps, given a homepage and a maximum depth as input.
git clone https://github.com/fabiana001/HyLiEn
cd HyLiEn
sbt clean compile publish-local
git clone https://fabianaLanotte@bitbucket.org/datatoknowledge/url2vec.git
cd url2vec
sbt clean compile assembly
The previous commands generate a jar with all required dependencies in ./target/scala-2.11/Url2vec-assembly-1.0.jar.
Remember to install the local dependencies first.
Run
java -cp /path/to/target/scala-2.11/Url2vec-assembly-1.0.jar graphgenerator.withListConstraint.GraphGeneratorWithListConstraintMain https://cs.illinois.edu/ 0 2
The jar takes as **Input:**
- the starting URL (e.g. https://cs.illinois.edu/)
- the max number of characters to extract from each web page (e.g. 0)
- the max depth to crawl (e.g. 2)
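As a rough illustration of what a max depth means for a crawl, here is a minimal Python sketch of a depth-limited breadth-first visit. It is not the project's crawler: the `crawl` function and the toy link graph are hypothetical, and treating the homepage as depth 0 is an assumption.

```python
from collections import deque

def crawl(start, links, max_depth):
    """Breadth-first visit of `links` (url -> list of urls), stopping at max_depth."""
    depth = {start: 0}  # assumption: the starting page sits at depth 0
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth

# Hypothetical toy link graph standing in for real web pages.
toy_links = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}
reached = crawl("/", toy_links, 2)
```

With a max depth of 2, the page `/d` (three links away from the start) is never visited.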
**Output:** Folder containing the following files:
- SequenceCss.txt contains sequences whose data items are tuples in the format (url, dom-path with Css)
- SequenceCssIds.txt contains sequences whose data items are tuples in the format (u_i, d_j, t_h), where u_i = url, d_j = dom-path with Css, t_h = anchor text
- SequenceCssMapDom is a map that associates a code with each dom-path with Css
- SequenceCssMapText is a map that associates a code with each anchor text
- SequenceCssMapUrl is a map that associates a code with each url
- Sequence.txt contains sequences whose data items are tuples in the format (url, dom-path without Css)
- SequenceCIDs.txt contains sequences whose data items are tuples in the format (u_i, d_j, t_h), where u_i = url, d_j = dom-path without Css, t_h = anchor text
- SequenceMapDom is a map that associates a code with each dom-path without Css
- SequenceMapText is a map that associates a code with each anchor text
- SequenceMapUrl is a map that associates a code with each url
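The relationship between the sequence files and the three maps can be sketched as follows. This is only an in-memory illustration in Python: the `encode` function is hypothetical, the on-disk layout of the real files is not shown here, and starting the codes at 0 is an assumption.

```python
def encode(sequences):
    """Replace each (url, dom_path, anchor_text) field with an integer code."""
    url_map, dom_map, text_map = {}, {}, {}

    def code(table, key):
        # Assign the next free integer the first time a value is seen.
        return table.setdefault(key, len(table))

    encoded = [
        [(code(url_map, u), code(dom_map, d), code(text_map, t)) for u, d, t in seq]
        for seq in sequences
    ]
    return encoded, url_map, dom_map, text_map

# Hypothetical one-sequence example with made-up urls and dom-paths.
sequences = [[
    ("https://x/a", "html/body/ul/li/a", "Home"),
    ("https://x/b", "html/body/ul/li/a", "About"),
]]
encoded, url_map, dom_map, text_map = encode(sequences)
```

Both tuples share the same dom-path, so they receive the same dom-path code while the url and anchor-text codes differ.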
####2. FrequentSequenceExtractorMain####
**Input:** folder, generated by khachaturian, containing the sequence files, and the minimum support threshold. Example:
/home/fabiana/git/khachaturian/src/test/resources/inputfileFrequentSequenceExtractor 0.01
**Output:** Same output as khachaturian
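To make the minimum support threshold concrete, the sketch below computes the support of a pattern as the fraction of sequences that contain it. It is a Python illustration, not the extractor's code: the function names are hypothetical, and both the relative (fraction-based) definition and the fact that gaps are allowed between matched items are assumptions.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    # Each `item in it` consumes the iterator up to the match, preserving order.
    return all(item in it for item in pattern)

def support(database, pattern):
    """Fraction of sequences in `database` that contain `pattern`."""
    return sum(contains(s, pattern) for s in database) / len(database)

# Hypothetical database of four sequences over integer codes.
database = [[1, 2, 3], [1, 3], [2, 3], [1, 2]]
frequent = support(database, [1, 3]) >= 0.5
```

A pattern is kept as frequent when its support reaches the threshold, so a threshold of 0.01 keeps every pattern found in at least 1% of the sequences.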
####3. JsonCreatorMain####
**Input:** file containing the sequence database and the minimum support threshold.

**Output:** Same output as khachaturian, plus a JSON file containing the tree of closed sequences.
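One way to picture a tree of sequences serialized as JSON is a prefix tree in the nested name/children layout that d3 hierarchy examples commonly read. The Python sketch below is an assumption about the general shape only: `build_tree` is hypothetical, and the actual schema written by JsonCreatorMain is not documented here.

```python
import json

def build_tree(sequences, root_name="root"):
    """Merge sequences that share a prefix into one nested {name, children} tree."""
    root = {"name": root_name, "children": []}
    for seq in sequences:
        node = root
        for item in seq:
            # Reuse the child with this name if it already exists.
            child = next((c for c in node["children"] if c["name"] == item), None)
            if child is None:
                child = {"name": item, "children": []}
                node["children"].append(child)
            node = child
    return root

# Hypothetical sequences over symbolic items.
tree = build_tree([["a", "b"], ["a", "c"], ["d"]])
tree_json = json.dumps(tree, indent=2)
```

The sequences that start with `a` end up under a single `a` node with two children.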
If you want to visualize the output using the d3.js library, go here or to the folder d3/example in this project.
Remember: to use d3.js in a web browser (e.g. Chrome), XMLHttpRequest calls (d3.json etc.) must be allowed when running files from the local file system (file:///).
Source 1
Source 2
For this reason you need to serve that code from a web server. From the command line, run:
python -m SimpleHTTPServer 8888 &
Once this is running, go to http://localhost:8888/. For more details go here
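The SimpleHTTPServer module exists only in Python 2; in Python 3 the same functionality lives in http.server, so the equivalent command is python3 -m http.server 8888. The same can also be done from a script, as in this minimal sketch (the `serve` helper is hypothetical, and it needs Python 3.7 or later for the `directory` argument):

```python
import functools
import http.server
import threading

def serve(directory, port=8888):
    """Serve `directory` over HTTP on localhost in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server  # call server.shutdown() to stop it
```

Calling `serve("d3/example")` from the project root would then make the example page reachable at http://localhost:8888/, assuming that folder holds the HTML and JSON files.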
- URL redirects: the system does not handle redirects to other web pages. This can be a problem in terms of performance, since redirected pages whose domain differs from that of the input homepage are discarded.
- HDTM: returns a tree organized using topic modeling. A web page A in the discovered hierarchy is the parent of a web page B if the topic of A is more general than the topic of B.
To run HDTM we need 2 files:
- edges.txt: contains the edges of the web graph to analyze, in the format "idNode tab idNode" (e.g. 3 tab 5)
- vertex.txt: contains the vertices of the web graph to analyze, in the format "idNode tab {idToken space}+" (e.g. 3 tab 4 space 5 space 6). Note that idNode and idToken must start from 3 (the numbers 0, 1 and 2 are reserved as special values). Moreover, each row of vertex.txt must have at least 2 idTokens. The class VertexConverter converts the vertex file generated by the Url2Vec project into a file that HDTM can use.
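The two file formats can be illustrated with a short Python sketch that renders both files from an in-memory graph. This is not the VertexConverter class: the `to_hdtm` function is hypothetical, and dropping nodes with fewer than 2 tokens (together with their edges) is only one assumed way of satisfying the constraints above.

```python
def to_hdtm(edges, tokens_by_node, offset=3):
    """Render edges.txt and vertex.txt contents with ids starting at `offset`."""
    # Keep only nodes with at least 2 tokens, as the vertex file requires.
    kept = sorted(n for n, toks in tokens_by_node.items() if len(toks) >= 2)
    node_id = {n: i + offset for i, n in enumerate(kept)}
    vocab = sorted({t for n in kept for t in tokens_by_node[n]})
    token_id = {t: i + offset for i, t in enumerate(vocab)}

    vertex_lines = []
    for n in kept:
        ids = " ".join(str(token_id[t]) for t in tokens_by_node[n])
        vertex_lines.append(f"{node_id[n]}\t{ids}")
    # Drop edges that touch a removed node.
    edge_lines = [
        f"{node_id[a]}\t{node_id[b]}"
        for a, b in edges
        if a in node_id and b in node_id
    ]
    return "\n".join(edge_lines), "\n".join(vertex_lines)

# Hypothetical graph: node "c" has a single token and is therefore dropped.
edges_txt, vertex_txt = to_hdtm(
    [("a", "b"), ("a", "c")],
    {"a": ["x", "y"], "b": ["y", "z"], "c": ["x"]},
)
```

Node ids and token ids are numbered independently here, each starting from 3, which is how the note above is read in this sketch.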