Create extra columns for governor word, lemma, POS and function

| Argument | Type | Purpose |
| --- | --- | --- |
| skip_morph | bool | Enable if you'd like to skip the parsing of morphological and extra fields |
| v2 | bool/'auto' | CONLL-U version of file. By default, detect from data |
| drop | list | List of column names you don't need |
| add_meta | bool | Add columns for sentence-level metadata |
| categories | bool | Convert columns to categorical format where possible |
| file_index | bool | Include filename in index levels |
| extra_fields | list/'auto' | Names of extra fields in the last column. By default, detect from data |
| kwargs | dict | Additional arguments to pass to pandas.read_csv() |

Configuring these arguments can increase speed a lot, so if speed is important to you, turn off the features you don't need.
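As a rough illustration of the advice above, the arguments from the table could be collected into one dictionary of keyword arguments. This is only a sketch: the call itself is left commented out, the entry point name `conll_df`, the file path and the dropped column are assumptions, and how much time each switch saves will depend on your data.

```python
# Hypothetical "lean" configuration built only from the arguments listed above.
fast_options = dict(
    skip_morph=True,    # skip parsing of morphological and extra fields
    add_meta=False,     # no sentence-level metadata columns
    categories=False,   # skip the conversion to categorical columns
    file_index=False,   # keep the filename out of the index levels
    drop=['e'],         # assumed example of a column you do not need
)

# Assumed usage; uncomment once the package and a CONLL-U file are available:
# from conll_df import conll_df
# df = conll_df('path/to/file.conllu', **fast_options)
```

Keeping the options in a single dictionary makes it easy to time a full load against a lean one and see which features are worth their cost.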
Where to from here?
If you're working with Python and CONLL-U, you might want to take a look at tücan, which provides a command-line and web-app interface for exploring CONLL-U datasets.
Alternatively, there's plenty of cool stuff you can do with Pandas by itself. Here are some toy examples:

    import pandas as pd

    def searcher(df, column, query, inverse=False):
        """Search column for regex query"""
        bool_ix = df[column].str.contains(query)
        return df[bool_ix] if not inverse else df[~bool_ix]

    # attach the function to every DataFrame as a .search() method
    pd.DataFrame.search = searcher

    # get nominal subjects starting with a, b or c
    df.search('f', 'nsubj').search('w', '^[abc]').head().to_html()

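
| s | i | w | l | x | p | g | f | e | type | gender | Case | Definite | Degree | Foreign | Gender | Mood | Number | Person | Poss | Reflex | Tense | Voice | Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 4.0 | authorities | authority | NOUN | NNS | 5 | nsubj | _ | _ | _ | _ | _ | _ | _ | _ | _ | Plur | _ | _ | _ | _ | _ | _ |
| 8 | 2.0 | cells | cell | NOUN | NNS | 4 | nsubj | _ | _ | _ | _ | _ | _ | _ | _ | _ | Plur | _ | _ | _ | _ | _ | _ |
| 9 | 3.0 | announcement | announcement | NOUN | NN | 6 | nsubj:pass | _ | _ | _ | _ | _ | _ | _ | _ | _ | Sing | _ | _ | _ | _ | _ | _ |
| 12 | 3.0 | commander | commander | NOUN | NN | 7 | nsubj | _ | _ | _ | _ | _ | _ | _ | _ | _ | Sing | _ | _ | _ | _ | _ | _ |
| 12 | 9.0 | bombings | bombing | NOUN | NNS | 11 | nsubj | _ | _ | _ | _ | _ | _ | _ | _ | _ | Plur | _ | _ | _ | _ | _ | _ |
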
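To try the search idea without a parsed corpus, here is a minimal, self-contained sketch on a made-up frame. The toy data and the names `toy`, `loose`, `strict` and `others` are illustrative assumptions; only the `s`/`i` index levels and the `w` and `f` columns mirror the output shown above. It also calls the function directly instead of attaching it to `pd.DataFrame`.

```python
import pandas as pd

def searcher(df, column, query, inverse=False):
    """Keep rows whose `column` matches the regex `query` (or drop them if inverse)."""
    mask = df[column].str.contains(query)
    return df[~mask] if inverse else df[mask]

# Toy stand-in for a parsed CONLL-U frame; the values are made up for illustration.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)], names=['s', 'i']
)
toy = pd.DataFrame(
    {
        'w': ['authorities', 'arrested', 'cells', 'divide', 'announcement'],
        'f': ['nsubj', 'root', 'nsubj', 'root', 'nsubj:pass'],
    },
    index=index,
)

loose = searcher(toy, 'f', 'nsubj')       # substring match: also keeps nsubj:pass
strict = searcher(toy, 'f', '^nsubj$')    # anchored: plain nsubj only
others = searcher(toy, 'f', 'nsubj', inverse=True)

print(loose['w'].tolist())
print(strict['w'].tolist())
print(others['w'].tolist())
```

Because `str.contains` treats the query as a regular expression and matches anywhere in the string, an unanchored `'nsubj'` also returns subtypes such as `nsubj:pass`, as in the third row of the output above. Anchor the pattern if you want exact labels only.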
Create a concordancer

    def _conclines(match, df=False, column=False):
        """Apply this to each matching row to add its left and right context"""
        # the row name is the (sentence, token) index pair
        s, i = match.name
        sent = df['w'].loc[s]
        match['left'] = sent.loc[:i-1].str.cat(sep=' ')
        match['right'] = sent.loc[i+1:].str.cat(sep=' ')
        formatted = match['w']
        if column != 'w':
            formatted += '/' + match[column]
        match['match'] = formatted
        return match

    def conc(df, column, query):
        """Build simple concordancer"""
        # get query matches
        matches = df[df[column].str.contains(query)]
        # add left and right columns
        lines = matches.apply(_conclines, df=df, column=column, axis=1)
        return lines[['left', 'match', 'right']]

    pd.DataFrame.conc = conc
    lines = df.head(1000).conc('l', 'be')
    lines.head(10).to_html()

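The same concordance idea can be tried on a tiny hand-made frame. The sketch below is a variant, not the code above: it builds the result rows in a loop instead of modifying each row inside `apply`, and the function name `concordance`, the toy sentences and the variable names are assumptions made for illustration. It expects the `s`/`i` index levels and a `w` column, as in the examples above.

```python
import pandas as pd

def concordance(df, column, query):
    """Return left context, match and right context for each row matching `query`."""
    words = df['w']
    rows = []
    for (s, i), token in df[df[column].str.contains(query)].iterrows():
        sent = words.loc[s]  # words of sentence s, indexed by token position i
        match = token['w'] if column == 'w' else f"{token['w']}/{token[column]}"
        rows.append({
            's': s,
            'i': i,
            'left': ' '.join(sent.loc[:i - 1]),
            'match': match,
            'right': ' '.join(sent.loc[i + 1:]),
        })
    out = pd.DataFrame(rows, columns=['s', 'i', 'left', 'match', 'right'])
    return out.set_index(['s', 'i'])

# Two made-up sentences: word forms in 'w', lemmas in 'l'.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3)], names=['s', 'i']
)
toy = pd.DataFrame(
    {
        'w': ['The', 'cells', 'are', 'dividing', 'It', 'was', 'announced'],
        'l': ['the', 'cell', 'be', 'divide', 'it', 'be', 'announce'],
    },
    index=index,
)

result = concordance(toy, 'l', 'be')
print(result)
```

Label-based slicing with `.loc` is inclusive at both ends, which is why the contexts stop at `i - 1` and start at `i + 1`. It also relies on the token positions within each sentence being sorted.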