NiccoloGranieri / SpanExtract

Span Extract

A tool to extract a span of lines around a found marker in a text file.

Easy how-to guide

Mac / UNIX

It is recommended, for ease of use, to store both the script and the text file to search in the same folder before following this step-by-step guide.

  • Open the terminal
  • Navigate using the cd command to the folder where both the files are stored. For example, if your files are both stored on the Desktop:
cd /Users/_your_user_name_/Desktop
  • Type the python3 command, followed by the name of the Span Extract script, the command line argument --fileParsing and the name of your file, including extensions. If the script name is unchanged, your command will look something like this:
python3 SpanExtract.py --fileParsing myFileName.txt
  • This will generate a .txt file called myFileName_Parsed.txt that contains a span of 10 lines around each marker found. In the script, the default markers are different kinds of laughter found in a transcript: LAUGHS, laughs, laughing, chuckles, chuckling, hehe, heh, ehh, thh.
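
To make the idea of a "span around a marker" concrete, here is a minimal sketch of the technique. It is not the code in SpanExtract.py: the function name extract_spans is hypothetical, and it assumes that a span of N means N lines before and N lines after the marker line, and that markers are matched as plain substrings.

```python
# Default markers as listed in this guide.
DEFAULT_MARKERS = ["LAUGHS", "laughs", "laughing", "chuckles",
                   "chuckling", "hehe", "heh", "ehh", "thh"]


def extract_spans(lines, markers=DEFAULT_MARKERS, span=10):
    """Return one window of lines for every line that contains a marker.

    Assumption: `span` lines are kept on each side of the marker line,
    clipped at the start and end of the file.
    """
    windows = []
    for index, line in enumerate(lines):
        if any(marker in line for marker in markers):
            start = max(0, index - span)
            end = min(len(lines), index + span + 1)
            windows.append(lines[start:end])
    return windows


# Tiny made-up transcript with a single marker on line 15.
transcript = [f"line {n}" for n in range(30)]
transcript[15] = "A: that was funny (LAUGHS)"
windows = extract_spans(transcript, span=3)
```

With span=3 the single window holds seven lines: three before the marker, the marker line itself, and three after.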

Optional Command Line Arguments

The script's behaviour can be modified through a series of command line arguments. Command line arguments are keywords preceded by "--" that set the script to behave in different ways. Below is a list of all the available command line arguments.

--mode

The --mode command line argument, followed by a 0 or a 1, lets us switch between the different modes of the script. Mode 0 looks for tags and extracts the lines around them. To set personalised markers, see the --tags argument; to change the span of lines extracted, see the --span argument; to take duplicate lines into consideration, see the --duplicates argument.

python3 SpanExtract.py --fileParsing myFileName.txt --mode 0

Mode 1 looks for tags and extracts the lines around them only if a second feature is found in the surrounding lines. To set personalised markers, see the --tags argument; to change the second feature, see the --featureTwo argument. As noted under --featureTwo, this mode ignores the --span and --duplicates arguments.

python3 SpanExtract.py --fileParsing myFileName.txt --mode 1

--tags

If you would like to search for your own markers, use the --tags command line argument and add them one after the other in inverted commas. For example, if I wanted to look for the markers:

  • ((smile))
  • giggle
  • (hug)

I would type:

python3 SpanExtract.py --fileParsing myFileName.txt --mode 0 --tags '((smile))' 'giggle' '(hug)'
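
The inverted commas matter because the shell would otherwise try to interpret the parentheses itself. As an illustration of how several tags after a single flag can be collected, here is a sketch using Python's argparse module. How SpanExtract.py actually parses its arguments is not shown in this guide, so the parser below, including the default mode of 0, is an assumption.

```python
import argparse

# Hypothetical parser mirroring the flags described in this guide.
parser = argparse.ArgumentParser(description="Sketch of SpanExtract-style options")
parser.add_argument("--fileParsing", help="name of the .txt file to parse")
parser.add_argument("--mode", type=int, choices=[0, 1], default=0)  # default is assumed
# nargs="+" gathers every value after --tags into one list.
parser.add_argument("--tags", nargs="+",
                    default=["LAUGHS", "laughs", "laughing", "chuckles",
                             "chuckling", "hehe", "heh", "ehh", "thh"])

# Same arguments as the command above, as the shell would pass them
# once the quotes have been removed.
args = parser.parse_args(["--fileParsing", "myFileName.txt", "--mode", "0",
                          "--tags", "((smile))", "giggle", "(hug)"])
print(args.tags)
```

Each quoted marker arrives as a separate list item, so markers containing parentheses are kept exactly as typed.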

--span

If you would like to change the range of lines saved around the found markers, use the --span command line argument followed by a number. For example, if I wanted only 3 lines before and after the marker, I would type:

python3 SpanExtract.py --fileParsing myFileName.txt --mode 0 --span 3

--duplicates

If you would like the script to take duplicates into account and remove them, so that the output file contains no duplicate lines, add the --duplicates command line argument. This argument is valid only in mode 0.

python3 SpanExtract.py --fileParsing myFileName.txt --mode 0 --duplicates

If, instead, you would like the script to ignore duplicates and print the span every time a marker is found, you should set the mode to 1. To do so, just put a 1 after the number that sets the lines saved.
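
Duplicates typically appear when two markers sit close together and their spans overlap. The sketch below shows one common way to remove them while keeping the original line order. The function name drop_duplicate_lines is hypothetical and this is not necessarily how SpanExtract.py does it. Note that this approach would also drop a line that genuinely occurs twice in the transcript.

```python
def drop_duplicate_lines(lines):
    """Keep the first occurrence of each line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Two overlapping spans written one after the other (made-up data).
overlapping = ["a", "b (LAUGHS)", "c", "b (LAUGHS)", "c", "d"]
unique = drop_duplicate_lines(overlapping)
```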

--featureTwo

When in --mode 1, it is also possible to change the second feature to look for in the span around the marker found. The second feature is 'you' by default, but it can be changed to one or more features by calling the command line argument --featureTwo.

python3 SpanExtract.py --fileParsing myFileName.txt --mode 1 --featureTwo 'his'

Note: This mode effectively ignores the --span and --duplicates arguments explained previously. It always outputs 4 lines: the line where the marker has been found, the one previous line, and the following two lines.
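
The mode 1 behaviour described above could be sketched as follows. This is an illustration only: the function name extract_with_feature is hypothetical, and it assumes that the second feature is searched as a plain substring anywhere in the 4-line window, including the marker line itself.

```python
def extract_with_feature(lines, markers, features=("you",)):
    """Return 4-line windows around markers that also contain a second feature.

    Window layout: one line before the marker, the marker line,
    and the two lines after it (clipped at the file boundaries).
    """
    results = []
    for index, line in enumerate(lines):
        if not any(marker in line for marker in markers):
            continue
        window = lines[max(0, index - 1):index + 3]
        # Substring match, so "you" would also match "your" (assumed behaviour).
        if any(feature in candidate for candidate in window for feature in features):
            results.append(window)
    return results


# Made-up transcript for the demonstration.
transcript = ["A: hello", "B: did you see that", "A: yes (LAUGHS)",
              "B: right", "A: anyway", "B: bye"]
with_you = extract_with_feature(transcript, ["LAUGHS"])
with_his = extract_with_feature(transcript, ["LAUGHS"], features=("his",))
```

Here the default feature 'you' appears in the line before the marker, so one window is returned, while searching for 'his' returns nothing.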

--fileParsing and --folderParsing

One of these two arguments is essential for the script to run. --fileParsing, as previously explained, lets us set the .txt file to parse. --folderParsing instead lets us run the script iteratively over all the .txt files in a specified folder. The syntax is the same for both: the name of the .txt file is given in the first case, and the name of the folder, with a forward slash / at the end, in the second.

python3 SpanExtract.py --fileParsing myFileName.txt
python3 SpanExtract.py --folderParsing myFolderName/
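
To illustrate what iterating over a folder involves, here is a small sketch using pathlib. The helper names text_files_in and parsed_name are hypothetical and not taken from the script. The _Parsed.txt naming follows the output name mentioned earlier in this guide, and applying it to every file in the folder is an assumption.

```python
import tempfile
from pathlib import Path


def text_files_in(folder):
    """Return the .txt files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.txt"))


def parsed_name(path):
    """Build the output name, e.g. myFileName.txt -> myFileName_Parsed.txt."""
    path = Path(path)
    return path.with_name(f"{path.stem}_Parsed.txt")


# Demonstration in a throwaway folder so nothing on disk is touched.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.txt", "a.txt", "notes.md"):
        Path(folder, name).write_text("hello\n")
    found = [p.name for p in text_files_in(folder)]
    outputs = [parsed_name(p).name for p in text_files_in(folder)]
```

Only the two .txt files are picked up, and each one gets its own _Parsed.txt output name.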

--verbose

This last command line argument, when called, enables verbose mode. Verbose mode prints to the terminal window a series of useful messages about the parsing of the chosen text file. This mode is useful when trying to figure out why a certain file is being parsed a certain way, or to check that the process is running smoothly.

python3 SpanExtract.py --fileParsing myFileName.txt --verbose

About

License:MIT License

