fquellec / Project

A Deep Analysis of the DeepWeb Evolution


Since its first appearance in 2009, the term "Deep Web" has designated the parts of the World Wide Web that are not indexed by standard search engines. With the development of TOR, a FOSS anonymity network, a whole digital world was born and has been growing ever since. Making the most of the anonymity it provides, TOR users have over time developed a complex infrastructure in this Deep Web that makes the discussion, advertisement and purchase of any service or item deemed illegal by local authorities accessible to all.

Although this anonymity remains intact, tools have been developed to scrape and archive most services available on the TOR network. From forums to marketplaces, including search engines, messaging services, etc. - the archive explored in the scope of this Project is as vast as the web is Deep. This Project will try to get an overview of its content and extract some meaning from it, in order to understand what this data says about the people behind such services and those using them.

Research questions

A list of research questions you would like to address during the project.


DN Archives (2013-2015)

  • Description

The archive contains mostly scraped HTML pages from the many marketplaces, forums and other services (e.g. the Grams search engine) that were active during the period mentioned in the title. This raw data is organized first by service, then by date (meaning that for every service, one can go to a specific date and see a list of HTML pages). All the directories are compressed as tar.gz archives. The whole archive is about 60 GiB when compressed and is estimated to be about 1 TiB when fully uncompressed.
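Given the size of the uncompressed data, it may be convenient to index a service's pages without extracting the whole archive. The following is a minimal sketch using only the Python standard library. It assumes a hypothetical layout in which each tar.gz contains paths of the form `<service>/<YYYY-MM-DD>/.../<page>.html`; the exact directory naming in the archive has not been verified, and the names `snapshot_date` and `group_html_by_date` are illustrative.

```python
import re
import tarfile
from collections import defaultdict

# Assumption: snapshot directories are named YYYY-MM-DD (not verified against the archive).
DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def snapshot_date(member_name):
    """Return the first path component that looks like YYYY-MM-DD, or None."""
    for part in member_name.split("/"):
        if DATE_DIR.match(part):
            return part
    return None


def group_html_by_date(archive_path):
    """Map each snapshot date to the HTML member names stored under it.

    The archive is streamed member by member, so nothing is written to disk.
    """
    pages = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links and non-HTML files (images, CSS, ...).
            if not member.isfile():
                continue
            if not member.name.lower().endswith((".html", ".htm")):
                continue
            date = snapshot_date(member.name)
            if date is not None:
                pages[date].append(member.name)
    return dict(pages)
```

Calling `group_html_by_date("some_service.tar.gz")` (a placeholder file name) would return a dictionary keyed by date, which gives a quick view of how many pages were scraped per day for that service.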

  • Data Management and Processing

Despite the enormous size of this archive, a large amount of processing work is expected in order to filter out all the HTML formatting data.

List the dataset(s) you want to use, and some ideas on how you expect to get, manage, process and enrich it/them. Show us you've read the docs and some examples, and that you have a clear idea of what to expect. Discuss data size and format if relevant.
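As an illustration of the HTML filtering step, here is a minimal sketch that reduces a scraped page to its visible text using only the standard library's `html.parser`. The class and function names (`TextExtractor`, `html_to_text`) are hypothetical, and the choice to drop `script` and `style` contents is an assumption about what counts as formatting data; a real pipeline would likely need per-marketplace rules on top of this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(markup):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For example, `html_to_text("<h1>Listing</h1><p>Price: 0.5 &amp; up</p>")` yields the plain string with tags removed and the entity decoded. Keeping the extractor free of third-party dependencies makes it easy to run across many workers, at the cost of less tolerance for badly malformed markup than a dedicated HTML library would offer.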

A list of internal milestones up until project milestone 2

Add here a sketch of your planning for the next project milestone.

Questions for TAs

Add here some questions you have for us, in general or project-specific.
