AcipenserSturio / 2021-2-level-ctlr


Technical Track of Computer Tools for Linguistic Research (2021/2022)

This repository is part of the compulsory course Computer Tools for Linguistic Research at the National Research University Higher School of Economics.

This technical track is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset that has a certain structure and appropriate content, and to perform morphological analysis of it using various natural language processing (NLP) libraries. See the dataset requirements for details.

Instructors:

Project Timeline

  1. Scrapper
    1. Short summary: Your code can automatically parse a media website of your choice and save texts with their metadata in a proper format (a configuration sketch follows this list)
    2. Deadline: March 25th, 2022
    3. Format: each student works in their own PR
    4. Dataset volume: 5-7 articles
    5. Design document: ./docs/scrapper.md
    6. Additional resources:
      1. List of media websites to select from: see the Resources section on this page
  2. Pipeline
    1. Short summary: Your code can automatically process raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.
    2. Deadline: April 29th, 2022
    3. Format: each student works in their own PR
    4. Dataset volume: 5-7 articles
    5. Design document: ./docs/pipeline.md
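The scrapper stage is typically driven by a small configuration file listing seed URLs and the number of articles to collect (5-7, per the dataset volume above). The snippet below is only a sketch: the file name scrapper_config.json and the keys seed_urls and total_articles are illustrative assumptions; the actual format is defined in ./docs/scrapper.md.

```python
import json
from pathlib import Path

# Hypothetical file name and keys; the real format is defined in ./docs/scrapper.md
CRAWLER_CONFIG_PATH = Path(__file__).parent / "scrapper_config.json"


def load_config(crawler_path):
    """Load seed URLs and the requested number of articles (5-7)."""
    with crawler_path.open(encoding="utf-8") as file:
        config = json.load(file)
    seed_urls = config["seed_urls"]
    total_articles = config["total_articles"]
    # The 5-7 range mirrors the dataset volume requirement above
    if not 5 <= total_articles <= 7:
        raise ValueError("Dataset volume must be 5-7 articles")
    return seed_urls, total_articles


if __name__ == "__main__":
    urls, number_of_articles = load_config(CRAWLER_CONFIG_PATH)
    print(f"Crawling {number_of_articles} articles from {len(urls)} seed URLs")
```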

Lecture history

| Date | Lecture topic | Important links |
|------|---------------|-----------------|
| 21.02.2022 | Lecture: Exceptions: built-in and custom, for error handling and information exchange. | Introduction tutorial |
| 25.02.2022 | Practice: Programming assignment: main concept and implementation details. | N/A |
| 04.03.2022 | Lecture: installing external dependencies with python -m pip install -r requirements.txt; learning the requests library: basics, tricks. Practice: downloading your website pages, working with exceptions. | Exceptions practice, requests practice |
| 11.03.2022 | Lecture: learning the beautifulsoup4 library: finding elements and getting data from them. Practice: parsing your website pages. | beautifulsoup4 practice |
| 18.03.2022 | Lecture: working with the file system via pathlib, shutil. Practice: parsing dates, creating and removing folders. | Dates practice, pathlib practice |
| 25.03.2022 | First deadline: crawler assignment. | N/A |
| 01.04.2022 | EXAM WEEK: no lecture or seminars. | N/A |
| 08.04.2022 | Lecture: Programming assignment (Part 2): main concept and implementation details. Lemmatization and stemming. Existing tools for morphological analysis. | N/A |
| 15.04.2022 | Lecture: morphological analysis via pymystem3, pymorphy2. Practice: analyzing words. | pymystem3 basics, pymorphy2 basics |
| 22.04.2022 | Lecture: information retrieval with re. Practice: analyzing web server logs. | re basics |
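The 04.03 and 11.03 sessions introduce requests and beautifulsoup4, the two libraries the scrapper component relies on. The sketch below shows only the general download-and-parse pattern; the URL and the tags being searched for are placeholders, since the actual markup depends on the media website each student chooses.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with a page from the media website you have chosen
URL = "https://example.com/news/article-1"

response = requests.get(URL, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses

soup = BeautifulSoup(response.text, "html.parser")

# The tags below are placeholders and will differ per website
title = soup.find("h1")
paragraphs = soup.find_all("p")

print(title.get_text(strip=True) if title else "No title found")
print("\n".join(p.get_text(strip=True) for p in paragraphs))
```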

Technical solution

| Module | Description | Component | Needed to aim for a mark of at least |
|--------|-------------|-----------|--------------------------------------|
| pathlib | module for working with file paths | scrapper | 4 |
| requests | module for downloading web pages | scrapper | 4 |
| BeautifulSoup4 | module for finding information on web pages | scrapper | 4 |
| PyMuPDF | optional module for opening and reading PDF files | scrapper | 4 |
| lxml | optional module for parsing HTML as a structure | scrapper | 6 |
| wget | optional module for downloading web pages | scrapper | 6 |
| pymystem3 | module for morphological analysis | pipeline | 6 |
| pymorphy2 | module for morphological analysis | pipeline | 8 |
| pandas | module for table data analysis | pipeline | 10 |
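Both morphological analyzers listed for the pipeline component expose a similar lemmatize-and-tag workflow. A minimal sketch of each, assuming a short Russian input string:

```python
from pymorphy2 import MorphAnalyzer
from pymystem3 import Mystem

TEXT = "Мама мыла раму"

# pymystem3: analyze() returns a list of tokens with lemma and grammar info
mystem = Mystem()
for item in mystem.analyze(TEXT):
    analyses = item.get("analysis")
    if analyses:  # whitespace and punctuation tokens carry no analysis
        print(item["text"], "->", analyses[0]["lex"], analyses[0]["gr"])

# pymorphy2: parse() returns ranked hypotheses for a single word
morph = MorphAnalyzer()
for word in TEXT.split():
    parsed = morph.parse(word)[0]
    print(word, "->", parsed.normal_form, parsed.tag)
```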

The software solution is built on top of three components:

  1. scrapper.py - a module for finding articles on the chosen media website, extracting their text and dumping it to the file system (a minimal dumping sketch follows this list). Students need to implement it.
  2. pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
  3. article.py - a module providing the article abstraction that encapsulates low-level manipulations with an article.
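A rough sketch of the dumping step from scrapper.py, assuming a file-naming scheme of N_raw.txt and N_meta.json; the actual names, paths, and metadata fields are prescribed by article.py and the design documents, so everything below is illustrative only.

```python
import json
from pathlib import Path

# Hypothetical output directory; the actual location is defined by the course constants
ASSETS_PATH = Path("tmp") / "articles"


def dump_article(article_id, text, meta):
    """Save raw text and metadata for one article (illustrative naming scheme)."""
    ASSETS_PATH.mkdir(parents=True, exist_ok=True)
    (ASSETS_PATH / f"{article_id}_raw.txt").write_text(text, encoding="utf-8")
    with (ASSETS_PATH / f"{article_id}_meta.json").open("w", encoding="utf-8") as file:
        json.dump(meta, file, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    dump_article(1, "Article text goes here",
                 {"id": 1, "title": "Example", "url": "https://example.com"})
```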

Handing over your work

Order of handing over:

  1. the lab work is accepted for an oral presentation;
  2. the student explains how the program works and demonstrates it in action;
  3. the student completes a small task from the mentor that requires slight code modifications;
  4. the student receives a mark:
    1. equal to the expected one, if all the steps above are completed and the mentor is satisfied with the answers;
    2. one point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answers;
    3. one point lower than the expected one, if the lab is handed over up to one week after the deadline and the criteria from 4.1 are satisfied;
    4. two points lower than the expected one, if the lab is handed over more than one week after the deadline and the criteria from 4.1 are satisfied.

NOTE: a student may improve their mark for the lab if they complete tasks of the next level after handing it over.

A lab work is accepted for an oral presentation if all the criteria below are satisfied:

  1. there is a Pull Request (PR) with a correctly formatted name: Laboratory work #<NUMBER>, <SURNAME> <NAME> - <UNIVERSITY GROUP NAME>. Example: Laboratory work #1, Kuznetsova Valeriya - 19FPL1.
  2. the PR contains a filled target_score.txt file with the expected mark. Acceptable values: 4, 6, 8, 10.
  3. the PR has a green status (all checks pass).
  4. the PR has the done label set by the mentor.

Resources

  1. Academic performance: link
  2. Media websites list: link
  3. Python programming course from previous semester: link
  4. Scrapping tutorials: YouTube series (in Russian)
  5. HOWTO: Running tests

About

License: MIT License

