Description
Crawler for job descriptions from the job search engine monster.de. The job descriptions were then analyzed using RapidMiner: we built a document-term matrix over the whole corpus of real job offers plus fictional job descriptions of our "dream jobs". Then we used cosine similarity to find the job offers most similar to our dream jobs.
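The similarity search can be sketched in a few lines of plain Python. This is a minimal illustration of the document-term-matrix / cosine-similarity idea, not the actual RapidMiner workflow; the toy corpus below is made up for demonstration:

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus standing in for the crawled job offers.
job_offers = [
    "data scientist machine learning python statistics",
    "java backend developer spring enterprise",
    "data analyst sql reporting dashboards",
]
dream_job = "data science python machine learning"

# Term-frequency vectors, i.e. the rows of a document-term matrix.
offer_vectors = [Counter(doc.split()) for doc in job_offers]
dream_vector = Counter(dream_job.split())

# Rank the real offers by similarity to the dream job.
scores = [cosine_similarity(dream_vector, v) for v in offer_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(job_offers[best])
```

In the real project the same computation runs inside RapidMiner on the full crawled corpus, after tokenization and cleaning.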
Context
Master programme Data Science & Business Analytics
Lecture Introduction to Data Science
At University of Media, Stuttgart (DE)
Goal / Task
Come up with a use case for clustering, classification, or text analysis, and implement a proof of concept using a self-service analytics tool like RapidMiner.
Authors
Sanna and me (dynobo)
Timeline
Mar. 2017 - Apr. 2017
Repo
Web crawler implemented with Scrapy in Python; RapidMiner workflows for data cleaning, preparation, and similarity search.
Install dependencies:
- scrapy
In /spiders/jobs_spider.py, in the class linkSpider, adjust the variable url to get results for the desired region and keywords.
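Conceptually, the url variable combines the search keywords and region. The query format shown here is only an assumption for illustration; check the actual monster.de search URL in your browser and in jobs_spider.py:

```python
# Hypothetical illustration of how the search URL is composed;
# the real query format used by monster.de may differ.
base = "https://www.monster.de/jobs/suche/"
keywords = "Data-Science"   # adjust to your desired keywords
region = "Stuttgart"        # adjust to your desired region
url = f"{base}?q={keywords}&where={region}"
print(url)
```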
First, extract links to job offers from the search results:
scrapy crawl links -a search=datascience -o datascience.json
scrapy crawl links -a search=itinstuttgart -o itinstuttgart.json
Second, extract the job descriptions:
scrapy crawl jobs -a search=datascience -o datascience.xml
scrapy crawl jobs -a search=itinstuttgart -o itinstuttgart.xml
Third, combine the XML files:
xml_grep --pretty_print indented --wrap items --descr '' --cond "item" *.xml > jobs.xml
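If xml_grep (part of the Perl XML::Twig tools) is not available, the merge can be approximated with a short Python script. This is a sketch assuming each crawl output is a well-formed XML file whose item elements should be collected under a single items root:

```python
import glob
import xml.etree.ElementTree as ET

# Collect all <item> elements from every crawled .xml file
# and wrap them under one <items> root element.
root = ET.Element("items")
for path in sorted(glob.glob("*.xml")):
    if path == "jobs.xml":  # skip a previously merged output file
        continue
    for item in ET.parse(path).getroot().iter("item"):
        root.append(item)

ET.ElementTree(root).write("jobs.xml", encoding="utf-8", xml_declaration=True)
```

Unlike the xml_grep command above, this does not pretty-print the output, but the resulting jobs.xml is structurally equivalent for further processing.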