thejonsnow / CourseProject


StockTwits Search Engine Team-RSJB CS410

Project Video

https://drive.google.com/file/d/1-M08jBvJin0lV7XjcFDc5k3yOELH6CJ7/view

Project Overview

Team RSJB developed a search engine for StockTwits that lets users enter terms or phrases to find relevant posts on that site. The search engine is built with a BM25 ranking algorithm and a Flask front end, and the dataset was generated using the requests and BeautifulSoup (bs4) Python libraries. The Flask web app sends the user’s query to the BM25 search algorithm, which is hosted as a Google Cloud Platform (GCP) cloud function, and displays the 15 most relevant results. To keep recording new posts over time, a scraper is deployed as a second GCP cloud function that runs every hour.

The code structure is as follows. The Flask web app is located in src/rsjb_WebApp.

The file “requirements.txt” is also present in this folder and lists the libraries needed to run the app. They can be installed with “pip install -r requirements.txt” or “python -m pip install -r requirements.txt” (use “python3” if that is how Python 3 is invoked on your system). It may be helpful to install these libraries inside a virtual environment (instructions can be found at https://docs.python.org/3/library/venv.html). After installing the dependencies, you can run the Flask app with “python TwitApp.py” (or “python3 TwitApp.py”). Once it is running, open a browser and enter “localhost:5000” in the address bar. From there, you can enter a query of your choice, and the top 15 StockTwits posts will be displayed once the function retrieves the relevant data.

In “src/search_algorithm”, there is a Jupyter notebook documenting how the search algorithm was designed, tested, and demonstrated to work. In “src/scripts” are Jupyter notebooks containing research code in which web scraping was explored and code was developed to retrieve StockTwits posts. In “src/cloud” are “f1.py” and “f2.py”, the functions currently running in the cloud; they are adaptations of the search algorithm and web scraper described above. That folder also contains the web scraper Jupyter notebooks that were run locally to generate the roughly 200,000 data points the search algorithm ranks against.

Working through those notebooks, you first install the required pip libraries, then import them, set up Google credentials, and start the function. If any errors arise while starting the application or other parts of the codebase, they are most likely due to missing pip packages, and the error message should reflect that. If an error points to a missing library, running “pip install <missing_library>” should fix it.

William Skedd was responsible for connecting the various pieces of the codebase, adjusting existing code so it would run correctly, debugging where needed, and running scrapers on a local machine to generate a dataset large enough for the search algorithm. He first spent time researching how to connect the different pieces of our system. After that research, he decided to use Google Cloud Platform for our cloud functions, Firestore to store indexing variables for the search algorithm, and Firestore Storage to store the CSVs containing our scraped data.

After Sam finished the search algorithm, William designed the CSV that would hold all of the scraped posts, along with its headers. He then familiarized himself with and debugged GCP cloud functions in order to deploy the search algorithm to the cloud. After deploying the function, adjustments were needed to get the query from the function’s request parameters into a form usable as input to the search algorithm, and to convert the returned values into an array of JSON objects for the front end. This work is in “cloud/f1.py”. Then, after Jeremy concluded his work on web scraping, William adjusted the scraping functions to record more data and perform error handling before deploying them to GCP; the final web scraping code can be found in “cloud/f2.py”. He also scheduled that function to run every hour so new data is added to our dataset CSV.

Because the cloud functions could not run for long periods and cost money to run frequently, William also built local web scrapers and ran them long enough to examine 500,000 post ids and retrieve around 200,000 usable data points for our dataset. He kept the scrapers running and handled the various errors that occurred (see “cloud/scraper(2,3,4,5).ipynb”). He wrote “cloud/combine_csvs.ipynb” to combine the CSVs generated by each local scraper and stored in Firestore Storage, and to save the result into the CSV used by the search cloud function. Lastly, William researched how to call the search algorithm via a POST request using Google auth libraries in Python, send query data to the function, and handle responses in the front end of our application. His code was written in Python within Jupyter notebooks and used the “google-auth” library to make secure requests to our cloud functions; the “requests” and “bs4” libraries were used in the scraping function. William used Google Cloud Platform, Firestore, and Firestore Storage to host the different parts of the backend for our project.
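As a rough illustration of this setup, the following is a hedged sketch of how a front end might call the search cloud function with an authenticated POST request using the “google-auth” and “requests” libraries. The function URL and the “query”/“results” field names are placeholders, not the project’s actual values.

```python
# Hypothetical sketch of calling the search cloud function from the front end.
# The URL and the "query" payload field are placeholders, not the project's values.
import requests
import google.auth.transport.requests
import google.oauth2.id_token

CLOUD_FN_URL = "https://REGION-PROJECT.cloudfunctions.net/search_function"  # placeholder

def search_stocktwits(query: str) -> list:
    # Fetch an identity token for the function URL so the request is authorized
    # (this relies on Google application-default credentials being configured).
    auth_req = google.auth.transport.requests.Request()
    id_token = google.oauth2.id_token.fetch_id_token(auth_req, CLOUD_FN_URL)

    # Send the query to the cloud function and return its JSON array of results.
    response = requests.post(
        CLOUD_FN_URL,
        json={"query": query},
        headers={"Authorization": f"Bearer {id_token}"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

# Example: results = search_stocktwits("tesla earnings")
```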

Jeremy Bao was responsible for writing the code used to scrape posts from StockTwits. First, he spent some time reading tutorials on obtaining information from websites, as he had no prior experience with this (aside from a single web scraping workshop attended five years earlier). While teammates had previously suggested using Selenium and BeautifulSoup, he determined that it was sufficient to use the “requests” library to fetch the contents of web pages and BeautifulSoup to extract information from them. He discussed with his group members how StockTwits posts could be crawled sequentially and determined that each post could be accessed at “https://stocktwits.com/message/<id>”: every post is given a unique id number, and these ids are assigned sequentially. So the first post ever made could be found at “https://stocktwits.com/message/1”, the second at “https://stocktwits.com/message/2”, and so on, though many of the early posts had been deleted. (When you browse StockTwits manually, the username of each post’s author appears between the “.com” and the “message”, as in “https://stocktwits.com/Prospectus/message/4”, but the usernames were not initially needed to reach the pages containing each post.) He also found, by creating a StockTwits account and making a post, that StockTwits posts cannot be edited, so we would only need to store each post’s contents once. His initial explorations are in “src/scripts/Try_Beautiful_Soup.ipynb”, where he experimented with web scraping, figured out how to extract the text from StockTwits posts (this involved some use of the “Inspect Element” tool provided by various browsers), and investigated how dates could be extracted from those posts (they could be extracted from older posts, but not from ones made in the past few days). Jeremy did all of his coding online, in Google Colab.

William then created “src/scripts/twitsscraper.ipynb”, in which he extracted authors from StockTwits posts and wrote code to store information about the posts and flags marking where crawling should start. When Jeremy came back to look at his code, he realized that StockTwits had changed their addressing scheme, breaking our previous code. Now, entering “https://stocktwits.com/message/<N>”, where N is some integer, would not take you to the N-th post submitted to StockTwits. You could still access each post if you knew the username of its author (by going to “https://stocktwits.com/<username>/message/<N>”), but we did not know which user had authored each post. Jeremy investigated various alternative approaches. Getting a token for StockTwits’ API would be difficult, and he could not find any information about it on their site. He also read that even with a token, only 500 requests could be made per hour using that API, which would be insufficient. He also looked into the “pytwits” library; however, it was poorly documented, he could not get it to work, and it appeared to require an API token anyway. Jeremy spent some time searching the internet and StockTwits’ website for a solution.

Eventually, he tried typing “https://stocktwits.com/user/message/4” into the address bar on a whim and successfully reached the fourth post ever made to StockTwits. Apparently, you now had to put “user” (or any other text) between “.com” and “message” in order to access StockTwits posts without entering a username. He worked on scraping information from these posts in “src/scripts/twitsscraper_1_1.ipynb”, where he wrote code to iterate through post ids until 1,000 non-deleted posts had been retrieved and then store the ids, text content, and authors of those posts. One complication was encountered at this point: the method William had developed to scrape post authors no longer worked. While post authors appear in several places in each post’s HTML when the page is viewed in a web browser, when the pages were fetched with requests and BeautifulSoup the authors could only be found in a JSON object stored inside a “script” element with an id of “__NEXT_DATA__”. The “json” library had to be used to extract information from these JSON objects. Jeremy timed how long it took to crawl 1,000 non-deleted posts and found that it took a bit over 5 minutes, which led our group to decide against crawling every one of the roughly 500 million posts currently on StockTwits, as it would take too long. He also suggested storing only post id numbers, authors, and text to save storage space, since the other information could easily be computed from those values, but the others decided to store more information.
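As a rough sketch of the approach described above, the following hypothetical code fetches one post through the “user” URL pattern and pulls the author and text out of the JSON stored in the “__NEXT_DATA__” script element. The exact keys inside that JSON object are assumptions and would need to be checked against the real pages.

```python
# Sketch of retrieving a single post via the "user" URL pattern and pulling the
# embedded JSON out of the "__NEXT_DATA__" script element. The JSON keys used
# below (where the body and author live) are assumptions, not confirmed values.
import json
import requests
from bs4 import BeautifulSoup

def fetch_post(post_id: int):
    url = f"https://stocktwits.com/user/message/{post_id}"
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None  # deleted or otherwise unavailable post

    soup = BeautifulSoup(resp.text, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if script is None or script.string is None:
        return None

    data = json.loads(script.string)
    # Hypothetical path to the message data -- adjust to the actual structure.
    message = data["props"]["pageProps"]["message"]
    return {
        "id": post_id,
        "author": message["user"]["username"],
        "text": message["body"],
    }
```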

Jeremy’s code was written in Python and contained in Jupyter notebooks. It used the “requests” library to fetch the web pages containing StockTwits posts, BeautifulSoup to scrape information from them, the “json” library to handle the JSON object embedded in each page’s HTML, the “csv” library to write information to CSV files, and the “time” library to measure how long things took to run. “firebase_admin” was imported, though it was mostly used for William’s exploratory code. “pandas” was imported because Jeremy used it while following a web scraping tutorial, and it was used in the version of his code deployed to Google Cloud Platform. Jeremy’s code can be found in “src/scripts” in our repository. It is not run during normal operation of our web application, but the code that scrapes information from StockTwits and stores it on Google Firestore was based on it.
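To show how the “csv” and “time” libraries fit into this workflow, here is a hedged sketch of the crawl-and-store loop: it iterates over post ids until a target number of non-deleted posts have been collected, writes them to a CSV, and times the run. It reuses the hypothetical fetch_post() helper from the previous sketch; the starting id and output filename are placeholders.

```python
# Sketch of the crawl-and-store loop: collect non-deleted posts, write them to a
# CSV, and time the run. fetch_post() is the hypothetical helper sketched above.
import csv
import time

def crawl(start_id: int, target: int = 1000, out_path: str = "posts.csv"):
    start_time = time.time()
    collected = 0
    post_id = start_id

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "author", "text"])  # CSV header
        while collected < target:
            post = fetch_post(post_id)
            if post is not None:  # skip deleted posts
                writer.writerow([post["id"], post["author"], post["text"]])
                collected += 1
            post_id += 1

    print(f"Collected {collected} posts in {time.time() - start_time:.1f} seconds")
```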

Sewoong (Sam) Lee was responsible for developing the search algorithm for our search engine. To use the Okapi BM25 algorithm, he researched various libraries containing text ranking algorithms and ultimately selected the “rank-bm25” library because of its API structure. By wrapping the library and adding features, his search-algorithm class supports updating the ranker, calculating scores, and printing and plotting the top search results for a given corpus and query. pandas, NumPy, and matplotlib were also used to develop the algorithm. He then connected the algorithm to the formatted CSV file so that teammates could use it easily. After making the algorithm compatible with the CSV files, he implemented the code for interfacing with JSON objects: for use in the web application, the algorithm returns its search results both as an array of JSON objects and as a file with the “.json” extension. In short, the algorithm takes its input in CSV format and returns search results in JSON format. An example of converting the JSON file into a pandas DataFrame is also included. Lastly, he wrote documentation and summarized results in a Jupyter notebook for readability.
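As a rough illustration of this CSV-in / JSON-out design, the following sketch builds a ranker with the “rank-bm25” library and returns the top 15 posts as a JSON array. The CSV column names and the simple whitespace tokenizer are assumptions, and the real class wraps the library with more features than shown here.

```python
# Minimal sketch of the CSV-in / JSON-out flow around rank-bm25.
# Assumed CSV columns: "id", "author", "text" (placeholders, not confirmed names).
import pandas as pd
from rank_bm25 import BM25Okapi

def search(csv_path: str, query: str, n: int = 15) -> str:
    df = pd.read_csv(csv_path)
    corpus = df["text"].fillna("").astype(str).tolist()

    # Build the BM25 ranker over a whitespace-tokenized corpus.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    scores = bm25.get_scores(query.lower().split())

    # Indices of the n highest-scoring posts.
    top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

    # Return a JSON array string (to_json can also write a ".json" file directly).
    return df.iloc[top_idx][["id", "author", "text"]].to_json(orient="records")
```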

Ritik Kulkarni was responsible for creating the web application using the Flask web framework along with HTML and CSS. Initially, he took time to learn Flask and understand how its features are implemented. The Flask application “TwitApp.py” created by Ritik consists of two HTML pages, “index.html” and “data.html”, along with CSS style files. “index.html” is the page where the user enters their query, and “data.html” is the page where the top 15 search results for that query are displayed. The basic structure of the Flask application is as follows (a minimal code sketch of this flow appears after the list):

  1. The Flask program is executed and serves the site at a specific URL (currently localhost).
  2. When a user navigates to that URL, they are directed to the main application page (an HTML page called index.html).
  3. On that page, the user enters the query they want to search in the provided text field.
  4. When the user clicks the Search button, the Flask application sends the query to the search algorithm on Google Cloud Platform (GCP).
  5. The algorithm returns the top 15 results for that query to the Flask application.
  6. The results are then processed into a presentable format and displayed to the user (using an HTML page called data.html).
  7. The user can now view the results. To search with another query, they can use the “Go Back” button to return to the main application window and enter a new query.
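
Below is a minimal sketch of this flow, not the project’s actual “TwitApp.py”. The route names, the form field name “query”, and the search_stocktwits() helper (from the earlier cloud-function sketch) are assumptions.

```python
# Minimal Flask sketch of the flow above; names of routes, templates' form field,
# and the search_stocktwits() helper are hypothetical.
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Step 2: serve the search page with the text field and Search button.
    return render_template("index.html")

@app.route("/data", methods=["POST"])
def data():
    # Steps 4-6: read the query, send it to the cloud function, render the results.
    query = request.form.get("query", "")
    results = search_stocktwits(query)  # hypothetical helper returning a list of dicts
    return render_template("data.html", query=query, results=results)

if __name__ == "__main__":
    app.run(host="localhost", port=5000, debug=True)
```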
