pdwittig / ZiegleIt

ZiegleIt

Ever wanted to be able to rip through books like your name was Alex Ziegler? Well, now you can ZiegleIt! With our app you can quickly generate a summary of any article on the web to expedite your learning. When you're short on time, ZiegleIt.

##MVP

Our MVP will have the following features:

  • Scrape web URLs and copy content with Nokogiri
  • Take scraped content and generate a summary based on a fixed(?) compression ratio
  • Deliver content to a txt file
  • For now, we expect the algorithm to be optimized for Wikipedia articles
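
The summary and delivery steps could look roughly like the sketch below. It assumes the scraped text has already been split into sentences, and it treats the compression ratio as the fraction of sentences to keep. The `summarize` helper, the sample sentences, the placeholder score, and the `summary.txt` filename are all made up for illustration.

```ruby
# Hypothetical helper: keep the highest-scoring sentences, in their original order.
# `ratio` is the fraction of sentences to keep (0.0..1.0).
# The block returns a numeric score for a sentence.
def summarize(sentences, ratio)
  keep = (sentences.length * ratio).ceil
  ranked = sentences.each_with_index.sort_by { |sentence, index| [-yield(sentence), index] }
  ranked.first(keep).sort_by { |_, index| index }.map(&:first)
end

# Made-up sample input standing in for scraped content.
sentences = [
  "The quick brown fox jumps over the lazy dog.",
  "It barks.",
  "Foxes are small omnivorous mammals found on several continents.",
  "They run."
]

# Placeholder score: sentences with more words win.
summary = summarize(sentences, 0.5) { |sentence| sentence.split.length }

# Deliver the summary to a txt file, one sentence per line.
File.write("summary.txt", summary.join("\n"))
```

Passing the score as a block keeps the ratio logic separate from whatever scoring the algorithm ends up using.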

##Document Structure

  • Title (depth: 1)
    • Chapter (2)
    • Chapter (2)
    • Chapter (2)
      • Section (3)
        • Sub Section (4)
          • Paragraph (5)
            • Sentence (6)
              • Word (6.content)
          • Paragraph
          • Paragraph
        • Sub Section
        • Sub Section
      • Section
      • Section
    • Chapter

##Parsing Rules v1

  1. There are a lot of blank elements (I'm guessing closing tags?), so first and foremost we need a guard clause that keeps nodes with blank inner_xml out of the content:
     `if node.inner_xml != ""`
  2. Break when `See also</span>` is reached in the current Nokogiri node's inner XML. This is checked by matching a Regexp:
     `break if node.inner_xml.match(/(See also<\/span>)/)`
  3. The current section's text ends once a `</span>` tag is reached, because the text just before that tag is the next section heading. The heading is picked out with a Regexp as well:
     `puts "Section: #{node.inner_xml.match(/[ \w]+(?=<\/span>)/)}"`
  4. The table of contents is ignored (we are building our own, after all). We achieve this by excluding any node that has an h2 parent with inner XML of Contents, which we add to our guard clause:
     `if (node.inner_xml != "") && (node.inner_xml != "Contents")`
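
Put together, the four rules might run in a single loop like the sketch below. To stay dependency-free it walks plain strings that stand in for each node's `inner_xml` rather than real Nokogiri nodes. The sample strings, the `sections` hash, and the `"Introduction"` default name are assumptions made for illustration.

```ruby
# Stand-ins for node.inner_xml values, in document order (made-up sample data).
inner_xmls = [
  "",
  "Contents",
  "<span>Early life</span>",
  "He was born in a small town.",
  "",
  "<span>See also</span>",
  "List of related topics"
]

# Section heading => array of content strings.
sections = Hash.new { |hash, key| hash[key] = [] }
current = "Introduction" # hypothetical name for text before the first heading

inner_xmls.each do |xml|
  # Rule 2: stop at the "See also" heading.
  break if xml.match(/(See also<\/span>)/)
  # Rules 1 and 4: skip blank nodes and the table of contents heading.
  next if xml == "" || xml == "Contents"

  # Rule 3: text followed by a closing span is a heading, so start a new section.
  heading = xml.match(/[ \w]+(?=<\/span>)/)
  if heading
    current = heading.to_s
  else
    sections[current] << xml
  end
end

puts sections.inspect
```

The guard clause is written with `next` instead of a wrapping `if`; the effect is the same, and it keeps the loop body flat.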

##Algorithm

##Next Steps

  • Start calculating some word scores
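
As a possible starting point, a word score could simply be the word's relative frequency after stop words are removed. This is an assumed scoring scheme, not one the project has settled on. The `word_scores` method and the short `STOP_WORDS` list are hypothetical.

```ruby
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = %w[the a an of and is in to].freeze

# Score each word by its share of all non-stop words in the text.
def word_scores(text)
  words = text.downcase.scan(/[a-z']+/).reject { |word| STOP_WORDS.include?(word) }
  counts = words.each_with_object(Hash.new(0)) { |word, tally| tally[word] += 1 }
  counts.each_with_object({}) do |(word, count), scores|
    scores[word] = count.to_f / words.length
  end
end

scores = word_scores("The cat sat. The cat ran to the dog.")
puts scores.inspect
```

A sentence score could then be the sum or average of its word scores, which is one way to feed the compression step described in the MVP.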

##Resources
