FriendlyToS is a project to explain Terms of Services in a friendly way to the users of the web.
This is the technical implementation of FriendlyToS and contains four subprojects:
- HTML and MD of ToSes being monitored ('./documents')
- FriendlyToS' Django website ('./frndlyts')
- Terms of Service monitoring framework ('./tosWatch')
- Natural language ToS annotation system ('./tosGrokr')
- Web Crawler for ToS' ('./tosCrawl')
- create Goal words list from Anton, Veil, &etc for parsing
- pg. 14 of Anton & Earp
- (i.e. TRACK, AGGREGATE, PROVIDE)
- Pull goal/vulnerability descriptions from Anton & Earp
- type up in spreadsheet with extra fields of metadata
- (i.e. "PREVENT use of cookies to send spam")
- Initial Crawl list (source ToSBACK)
- Source for list of trackers: Ghostery
- Decide if ToS diff engine is in tosWatch or tosGrokr
- Decide when a paragraph has changed enough to be considered a different paragraph instead of merly a revision of a paragraph.
- Decide if and how comments on a paragraph will relate to later versions of that paragraph.
- OS packages
- python-lxml
- or libxml2-dev and libxslt1-dev plus lxml in requirements.txt (haven't tried this yet)
- git
- python-lxml
- Python
- Use requirements.txt
- nltk punkt
- Use nltk.download() in an interactive session to download the punkt package
Database diagram at https://cacoo.com/diagrams/DFHkFqrvdvr863xi
TODO: finish db schema (draft 1) TODO: decide what tool implements database schema TODO: research South for db migrations
Paragraph
key > prev Paragarph
key > next Paragraph
fk > first Sentence
fk > prev revision of paragraph;
key > version
Sentence
fk > parent Setence
key > prev Paragarph
key > next Sentence
Comments
fkey > user
fkey > paragraph_id
Use Django's logging, which is basically just Python's logging with a couple of added methods.
Django and Python both do not provide a handler for logging to a database, so we'll need to write that.
Need to define log formats and the dictConfig
We should define what happens for each log level, and when they are used in FriendlyToS:
- critical - use DB and AdminEmailHandler handlers
- error - use DB handler
- warning - use DB handler
- info - use DB handler
- debug - ???
Should filter for DB messages and send those to a file.
Pro Bono Privacy Initiative - Provides data privacy expertise to non-profits
Python Parsing: lxml - http://lxml.de/ - looks like a possible solution for building DOMs that can be queried via Xpath. I've had some success retreiving the content I want from a page, but more experimentation is needed.
lxml supports at least three seperate parsers:
- libxml2 - The default
- BeautifulSoup - Via lxml.html.ElementSoup
- html5lib - For parsing HTML 5, via lxml.html.html5parser
There are some functions/methods in Lxml that might be of interest:
- lxml.html.HtmlElement.interlinks() - Returns (element, attribute, link, pos) for every link in the element
- lxml.html.diff.htmldiff(doc1, doc2) - A diff function that wraps differences in <ins> and <del> elements
- lxml.html.diff.html_annotate - Another diff function that behaves like svn blame
- See http://lxml.de/lxmlhtml.html#html-diff for the above two
Python Markdown Reading: python-markdown2
Python and Git GitPython - Python library for interacting with git repos
Thoughts on Scraping
Default lxml seems to work on the few sites tried so far. However, it might be a good idea to support multiple forms of scraping. Lxml includes three parsers (default, BeautifulSoup, HTML5). Regexs could be a fallback. BTE is a potential last result.
Dumping content into the database: convert <div>, <span>, <p>, <hX>, <tr>,   into paragraphs in the table. Will have to split <pre> on newlines. Will we keep the formatting of lists? And will we keep links?
Thoughts on Errors in Scraping
Should log IOErrors when they are thrown.
A message should be generated when a urlopen results in some error response (4xx). The message should include the url attempted, the error returned, and the timestamp of the attempt.
A message (ScrapeError ?) should be generated when html.parse returns an empty list, as we can assume that the xpath query in the database has successfully tested before. The generated message should include a timestamp of when the scrape was attempted, the last-modified header from the response, the url called, and the xpath query attempted.
Javascript Runtimes
- https://github.com/davisp/python-spidermonkey/ - Python module based on SpiderMonkey
- Bug tracking and old code/instructions at http://code.google.com/p/python-spidermonkey/
- Possible issue: Doesn't provide useful error feedback from Javascript execution
- Possible issue: Can't call functions defiend in javascript (WTF?? how does it do anything then?)
- Coolness: Smooth transition of objects to/from Javascript
- TODO: Install and play around with it.
- https://developer.mozilla.org/en/SpiderMonkey - C/C++ Javascript runtime
- http://www.mozilla.org/rhino/ - This is for Java
Future Blog Content Ideas
Discussing history of law, some background on privacy policies, policy analysis of various new laws/cases/etc..., some background of privacy theory
Important Laws and Statutes
- Communications Decency Act
- DMCA
- Video Privacy Protection Act
Interesting or Related Projects
- Collusion - Firefox plugin that lets you see connections between trackers you have encountered.
- Privacy Score - As of 3/5/2012, scores 1600+ websites based on how risky a privacy policy is to a user.
- OpenCalais - Semantic analysis and linking service provided for free by Thomson Reuters.
- WikiSummary - WikiSummary created summaries of a handful of ToSes five years ago.
Academic References
- Barth - Design and analysis of privacy policies (Dissertation)
- Anton - A Requirements taxonomy for reducing web site privacy vulnerabilities
- Anton - A taxonomy for web site privacy requirements
- FTC - Privacy Online: Fair information practices in the marketplace
- Schwaig - Compliance to the fair information practices: How are the Fortune 500 handling online privacy disclosures?
- Vail - An empirical study of consumer perceptions and comprehension of web site privacy policies
- Williams - Internet Privacy Policies: A composite index for measuring compliance to the fair information principles