roohy / CS599_HW1

CS599_HW1

S3_Downloader

The S3_Connector.py file in the S3_Downloader directory downloads files from the S3 repo. To run, it needs the boto3 Python library, which can be installed using pip3: sudo pip3 install boto3

This file runs on Python 3.
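As a rough illustration of what a boto3-based downloader can look like, here is a minimal sketch. It is not the contents of S3_Connector.py; the function names, the bucket name, and the destination directory are assumptions made for the example. The client is passed in as a parameter so the helper can be exercised without network access.

```python
import os


def local_path_for(key, dest_dir):
    """Map an S3 object key such as 'a/b.txt' to a path under dest_dir."""
    return os.path.join(dest_dir, *key.split("/"))


def download_bucket(client, bucket, dest_dir, prefix=""):
    """Download every object under `prefix` and return the local paths written.

    `client` is expected to behave like boto3.client("s3").
    """
    written = []
    # list_objects_v2 returns results in pages, so a paginator is used.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip zero-byte "folder" placeholder objects.
                continue
            path = local_path_for(key, dest_dir)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            client.download_file(bucket, key, path)
            written.append(path)
    return written


# Hypothetical usage (bucket and directory names are placeholders):
#   import boto3
#   download_bucket(boto3.client("s3"), "example-bucket", "downloads")
```

Credentials are left to boto3's normal configuration (environment variables or the shared credentials file), which is an assumption about how the project is set up.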

tika_connection

(In the end, tika_connection was never used.) tika_connection uses Python 2 and Tika for Python to detect file types so that they can be used in later analysis. To access these files later, the code stores them in a MongoDB collection. This code uses Tika for Python and PyMongo. If you have MongoDB on your machine, you can use adminMongo to manage it and add new collections to it: https://github.com/mrvautin/adminMongo
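The detect-then-store flow described above can be sketched roughly as follows. This is an illustration, not the project's code: the function name, the `path` and `mime` field names, and the database and collection names in the usage comment are all assumptions. The detector and the collection are passed in so the helper itself has no hard dependency on Tika or MongoDB.

```python
import os


def record_file_types(root_dir, detect, collection):
    """Walk root_dir, detect each file's MIME type, and upsert it into collection.

    `detect` is any callable mapping a file path to a MIME type string.
    `collection` is expected to behave like a pymongo Collection.
    Returns the number of files processed.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mime = detect(path)
            # Upsert keyed on the path, so re-running does not duplicate records.
            collection.update_one(
                {"path": path},
                {"$set": {"mime": mime}},
                upsert=True,
            )
            count += 1
    return count


# Hypothetical wiring with the real libraries (names below are placeholders):
#   from tika import detector
#   from pymongo import MongoClient
#   files = MongoClient("mongodb://localhost:27017")["cs599"]["files"]
#   record_file_types("downloads", detector.from_file, files)
```

Keying the upsert on the file path is a design choice made for the sketch; the original code may identify files differently.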

pymongo can be installed using pip: sudo pip install pymongo

To connect to the S3 servers and do the basic processing, we developed the testTika project. It should be run first, with the same MongoDB URI. You can get the project from: https://github.com/roohy/TikaParser_CS599_HW1

#S3_Download

This is for people with enough disk space. They can download everything to their drives and then run the parser or any other tool. We did not have enough disk space.

#Analysis

The analysis folder contains the offline analysis code, for use when there is enough disk space, which we did not have. If you do, and you have downloaded the files using S3_Downloader and populated the DB with the offline files, you can use this part. (Removed from the project)

#Online Analysis

This folder contains code that should be run with access to the database populated by the TikaParser code (https://github.com/roohy/TikaParser_CS599_HW1). It produces the JSON outputs required for our website to work. The website can be found under the site subfolder. The JSON files located in the data section of the site are generated by the analysis code in main. Configuration can be changed in the configs file. :-)
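To make the "database to JSON" step concrete, here is a small sketch that counts records per MIME type and writes the result as JSON. The field name `mime`, the output shape, and the function names are assumptions for illustration; the actual analysis code and the format the site expects may differ.

```python
import json
from collections import Counter


def mime_counts(documents, field="mime"):
    """Count documents per MIME type, most frequent first.

    `documents` is any iterable of dicts, for example the cursor
    returned by a pymongo collection's find().
    """
    counts = Counter(doc.get(field, "unknown") for doc in documents)
    return [{"type": t, "count": n} for t, n in counts.most_common()]


def write_json(rows, path):
    """Write the rows to `path` as indented JSON."""
    with open(path, "w") as fh:
        json.dump(rows, fh, indent=2)


# Hypothetical usage (collection and output path are placeholders):
#   write_json(mime_counts(files.find()), "site/data/types.json")
```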

#Tika similarity

This part contains our code for Tika similarity: two simple scripts that run over the downloaded files and cluster them. The scripts are written in bash and use Python 2 and Python 3.
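The README does not say which similarity measure or clustering method these scripts use. Purely as an illustration of the general idea, the sketch below groups files by the Jaccard similarity of their feature sets (for example, metadata key names) using a simple greedy pass. The measure, the threshold, and all names here are assumptions, and the sketch targets Python 3.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_clusters(items, threshold=0.5):
    """Group items whose features are similar to a cluster's first member.

    `items` maps a file name to an iterable of features.
    Returns a list of clusters, each a list of file names.
    """
    clusters = []  # pairs of (representative feature set, member names)
    for name, feats in items.items():
        for rep, members in clusters:
            if jaccard(rep, feats) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((set(feats), [name]))
    return [members for _rep, members in clusters]
```

A greedy pass like this depends on input order and is only meant to show the shape of the computation, not to reproduce the scripts' results.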

#Results

Our work resulted in an updated Tika MIME repo, which can be found at: https://github.com/roohy/updatedTika

We were able to find some new file types among the files typed as octet-stream.

About

License: Apache License 2.0
