Tshas1 / Capstone

Capstone project

Next Word Prediction App

Amber Wang

3/31/2018

Introduction

  • This project is for the Data Science Capstone, the tenth course of the Coursera Data Science Specialization. The course focuses on analyzing a large corpus of text documents to discover structure in the data and how words are put together, with the goal of building a predictive text model.
  • Contents
    • Text data analysis: analysis of the corpus to understand the relationships between words and word pairs
    • Predictive modeling: build basic n-gram models and develop algorithms for text prediction
    • Shiny app development: produce a web-based Shiny app interface to predict next words

Modeling

  1. Getting and cleaning the data: profanity was removed first, then the text was tokenized into words
  2. Exploratory data analysis: the frequencies of words and word pairs were calculated
  3. Modeling: 2-gram through 7-gram models were built to support word prediction
  4. Prediction model:
    • Katz's back-off model was used to predict the next word
    • The model iterates from the 7-gram table down to the 2-gram table, matching the last n-1 words of the input against each
    • If no n-gram matches (an unseen n-gram), the most frequent word, 'the', is returned
    • To improve efficiency, word pairs that appear fewer than 5 times in the corpus were removed
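
The steps above can be sketched in a few lines. This is a minimal illustration in Python, not the project's actual code: the function names, the one-word `PROFANITY` placeholder list, and the regular-expression tokenizer are all assumptions made for the example. It prunes every n-gram order at the same threshold, although the slides only mention word pairs. It also implements only the plain "longest matching context first" lookup described in the list, and leaves out the probability discounting that a full Katz back-off model applies.

```python
from collections import Counter, defaultdict
import re

# Hypothetical placeholder; the slides do not say which profanity list was used.
PROFANITY = {"darn"}


def tokenize(text, banned=PROFANITY):
    """Lowercase the text, split it into word tokens, and drop banned words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in banned]


def build_ngram_tables(sentences, max_n=7, min_count=5):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of next words.

    N-grams seen fewer than min_count times are dropped to keep the tables small.
    """
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1

    pruned = {}
    for n, table in tables.items():
        pruned[n] = {}
        for context, nexts in table.items():
            kept = Counter({w: c for w, c in nexts.items() if c >= min_count})
            if kept:
                pruned[n][context] = kept
    return pruned


def predict_next(tables, query, max_n=7, default="the"):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = tokenize(query)
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # the query is too short for this n-gram order
        context = tuple(tokens[-(n - 1):])
        nexts = tables.get(n, {}).get(context)
        if nexts:
            return nexts.most_common(1)[0][0]
    return default  # unseen context: fall back to the most frequent word


# Toy corpus, repeated so that the n-grams survive the min_count threshold.
corpus = ["I love New York"] * 5 + ["I love you"] * 6
tables = build_ngram_tables(corpus)
print(predict_next(tables, "they love"))  # you
print(predict_next(tables, "zebra"))      # the
```

In the first query the 3-gram context "they love" was never seen, so the lookup backs off to the 2-gram context "love", where "you" (6 occurrences) outranks "new" (5).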

Results

  • The data analysis and model building writeups can be found on GitHub
  • The Shiny app for prediction can be found here
  • The app takes the following inputs:
    1. the query word or phrase entered by the user
    2. the number of next words to predict
  • The predicted next word(s) are shown in order from most to least frequently used
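
The ranking behaviour described above can be illustrated with a short sketch. It is written in Python for illustration only and is not the app's code; the function name `top_predictions`, the alphabetical tie-breaking rule, and the example counts are all assumptions.

```python
from collections import Counter


def top_predictions(next_word_counts, k):
    """Return up to k candidate words, most frequent first.

    Ties are broken alphabetically so the output order is deterministic.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(next_word_counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical counts of the words observed after some query phrase.
counts = Counter({"you": 12, "it": 7, "the": 7, "new": 5})
print(top_predictions(counts, 3))  # ['you', 'it', 'the']
```

If the user asks for more words than there are candidates, the function simply returns all of them.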
