README

Dependencies

Java 8
Scala > 2.10
Fast (branch SPARSE_DATASTRUCTURE), ListFlattener, WebPageTraverserWrapper

What is this repository for?

The system generates sitemaps, given a homepage URL and a maximum crawl depth as input.

To Create a Jar

Install HyLiEn

git clone https://github.com/fabiana001/HyLiEn
cd HyLiEn
sbt clean compile publish-local

Install graph generator

git clone https://fabianaLanotte@bitbucket.org/datatoknowledge/url2vec.git
cd url2vec
sbt clean compile assembly

The previous commands generate a jar with all required dependencies in ./target/scala-2.11/Url2vec-assembly-1.0.jar.

How do I get set up?

Remember to install the local dependencies

Mains to run

Crawl a website

Run

java -cp /path/to/target/scala-2.11/Url2vec-assembly-1.0.jar graphgenerator.withListConstraint.GraphGeneratorWithListConstraintMain https://cs.illinois.edu/ 0 2

Input: the jar requires the following arguments, in order:

the starting URL (e.g. https://cs.illinois.edu/)
the maximum number of characters to extract from each web page (e.g. 0)
the maximum depth to crawl (e.g. 2)
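
To make the role of the max depth argument concrete, here is a minimal sketch of a depth-limited crawl. It is not the project's implementation: the breadth-first order, the convention that the homepage sits at depth 0, the `crawl` and `get_links` names, and the in-memory `LINKS` graph are all assumptions made for illustration.

```python
from collections import deque


def crawl(start_url, max_depth, get_links):
    """Breadth-first traversal that does not expand pages at max_depth.

    get_links is a hypothetical callback returning the outgoing links
    of a page; a real crawler would download and parse the page here.
    Returns a dict mapping each reached URL to its depth.
    """
    depths = {start_url: 0}  # assumption: the homepage is at depth 0
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue  # pages at the maximum depth are kept but not expanded
        for link in get_links(url):
            if link not in depths:
                depths[link] = depth + 1
                queue.append(link)
    return depths


# Hypothetical link graph standing in for a real website.
LINKS = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/a/1"],
    "https://example.org/a/1": ["https://example.org/a/1/x"],
}

depths = crawl("https://example.org/", 2, lambda url: LINKS.get(url, []))
```

With a max depth of 2, the page `https://example.org/a/1/x` (three links away from the homepage) is never reached.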

Output: Folder containing the following files:

SequenceCss.txt contains sequences whose data items are tuples in the format (url, dom-path with Css)
SequenceCssIds.txt contains sequences whose data items are tuples in the format (u_i, d_j, t_h), where u_i = url, d_j = dom-path with Css, t_h = anchor text.
SequenceCssMapDom is a map that associates a code with each dom-path with Css
SequenceCssMapText is a map that associates a code with each anchor text
SequenceCssMapUrl is a map that associates a code with each url
Sequence.txt contains sequences whose data items are tuples in the format (url, dom-path without Css)
SequenceCIDs.txt contains sequences whose data items are tuples in the format (u_i, d_j, t_h), where u_i = url, d_j = dom-path without Css, t_h = anchor text.
SequenceMapDom is a map that associates a code with each dom-path without Css
SequenceMapText is a map that associates a code with each anchor text
SequenceMapUrl is a map that associates a code with each url
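
The relationship between the sequence files and the three maps can be sketched as follows. This is only an illustration: the function name `encode`, the choice of integer codes starting at 0, and the sample data are assumptions, and the actual on-disk format of the files is not shown here.

```python
def encode(sequences):
    """Replace each (url, dom_path, anchor_text) tuple with integer codes.

    Returns the encoded sequences plus the three lookup maps
    (url -> code, dom-path -> code, anchor text -> code).
    """
    url_codes, dom_codes, text_codes = {}, {}, {}

    def code(table, key):
        # Assumption: codes are assigned in order of first appearance, from 0.
        return table.setdefault(key, len(table))

    encoded = [
        [(code(url_codes, u), code(dom_codes, d), code(text_codes, t))
         for (u, d, t) in sequence]
        for sequence in sequences
    ]
    return encoded, url_codes, dom_codes, text_codes


# Hypothetical sample: two sequences of (url, dom-path, anchor text) tuples.
sequences = [
    [("https://example.org/", "html/body/ul/li/a", "Home"),
     ("https://example.org/about", "html/body/ul/li/a", "About")],
    [("https://example.org/", "html/body/div/a", "Home")],
]

encoded, url_codes, dom_codes, text_codes = encode(sequences)
```

Each distinct url, dom-path and anchor text receives one code, so the encoded sequences can be translated back with the three maps.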

FrequentSequenceExtractorMain

Input: a folder, generated by khachaturian, containing the sequence files, and the minimum support threshold. Example:

/home/fabiana/git/khachaturian/src/test/resources/inputfileFrequentSequenceExtractor 0.01

Output: the same output as khachaturian.
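
As a rough guide to what the minimum support threshold means, the sketch below uses the usual definition from sequential pattern mining: the support of a pattern is the fraction of sequences in the database that contain it as a (not necessarily contiguous) subsequence. Whether khachaturian uses exactly this definition is an assumption, and the names `contains`, `support` and `database` are made up for the example.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence."""
    remaining = iter(sequence)
    # "item in remaining" consumes the iterator up to the first match,
    # so later items of the pattern must appear after earlier ones.
    return all(item in remaining for item in pattern)


def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains(s, pattern) for s in database) / len(database)


# Hypothetical sequence database with four short sequences.
database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b"],
]

min_support = 0.5
is_frequent = support(database, ["a", "c"]) >= min_support
```

Under this definition a threshold of 0.01 keeps every pattern that appears in at least 1% of the sequences.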

JsonCreatorMain

Input: a file containing the sequence database, and the minimum support threshold.

Output: the same output as khachaturian, plus a JSON file containing the tree of closed sequences.

To visualize the output with the d3.js library, go here or see the d3/example folder in this project.

Remember: web browsers such as Chrome block XMLHttpRequest calls (d3.json etc.) when a page is opened from the local file system (file:///) (Source 1, Source 2). For this reason you need to serve that code from a web server. From the command line, run the following (this is the Python 2 module; on Python 3 the equivalent is python3 -m http.server 8888):

python -m SimpleHTTPServer 8888 &

Once this is running, go to http://localhost:8888/. For more details go here

Problem to fix

URL redirects: the system does not handle redirects to other web pages. This can be a problem in terms of performance, since redirected pages whose domain differs from that of the input homepage are discarded.
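
The effect can be illustrated with a small domain check. This is a sketch under assumptions, not the project's code: it compares host names exactly, while the real system may compare domains in a different way, and `same_domain` is a hypothetical helper.

```python
from urllib.parse import urlparse


def same_domain(homepage, url):
    """True if url has the same host name as the homepage (assumed rule)."""
    return urlparse(homepage).hostname == urlparse(url).hostname


homepage = "https://cs.illinois.edu/"

# A page on the same host is kept.
kept = same_domain(homepage, "https://cs.illinois.edu/about")

# A page that redirects to another host would be discarded,
# even if it belongs to the same organization.
discarded = not same_domain(homepage, "https://illinois.edu/about")
```

A page that is only reachable through a redirect to a different host is therefore missing from the generated sitemap.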

Competitors

HDTM: returns a tree organized using topic modeling. A web page A in the discovered hierarchy is the parent of a web page B if the topic of A is more generic than the topic of B. To run HDTM we need 2 files:
- edges.txt: contains the edges of the web graph to analyze, in the format "idNode tab idNode" (e.g. 3 tab 5)
- vertex.txt: contains the vertices of the web graph to analyze, in the format "idNode tab {idToken space}+" (e.g. 3 tab 1 space 5 space 2). Note that idNode and idToken must start from 3 (the numbers 0, 1 and 2 are used as special values). Moreover, each row in vertex.txt must have at least 2 idToken values. The class VertexConverter converts the vertex file generated by the Url2Vec project into a file that HDTM can use
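
The two file formats can be sketched as below. This is not the VertexConverter class: the functions `edge_lines` and `vertex_lines`, the idea of shifting zero-based ids by 3, and the choice to reject rows with fewer than 2 tokens (rather than, say, skipping them) are assumptions made for illustration.

```python
ID_OFFSET = 3  # ids 0, 1 and 2 are reserved by HDTM as special values


def edge_lines(edges):
    """Format (source, target) pairs of zero-based node ids for edges.txt."""
    return [f"{source + ID_OFFSET}\t{target + ID_OFFSET}" for source, target in edges]


def vertex_lines(vertices):
    """Format {node id: [token ids]} with zero-based ids for vertex.txt."""
    lines = []
    for node, tokens in vertices.items():
        if len(tokens) < 2:
            # Assumption: invalid rows are rejected instead of dropped.
            raise ValueError(f"node {node} needs at least 2 tokens")
        shifted = " ".join(str(token + ID_OFFSET) for token in tokens)
        lines.append(f"{node + ID_OFFSET}\t{shifted}")
    return lines


# Hypothetical web graph: node 0 links to node 2; node 0 has tokens 1 and 0.
edges = edge_lines([(0, 2)])
vertices = vertex_lines({0: [1, 0]})
```

Writing the returned lines to edges.txt and vertex.txt, one per line, yields files whose ids all start from 3.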
