chbrown / liwc-python

Linguistic Inquiry and Word Count (LIWC) analyzer

liwc

This repository is a Python package implementing two basic functions:

  1. Loading (parsing) a Linguistic Inquiry and Word Count (LIWC) dictionary from the .dic file format.
  2. Using that dictionary to count category matches on provided texts.

This is not an official LIWC product nor is it in any way affiliated with the LIWC development team or Receptiviti.

Obtaining LIWC

The LIWC lexicon is proprietary, so it is not included in this repository.

The lexicon data can be acquired (purchased) from liwc.net.

  • If you are a researcher at an academic institution, please contact Dr. James W. Pennebaker directly.
  • For commercial use, contact Receptiviti, the company that holds the exclusive commercial license.

Finally, please do not open an issue in this repository with the intent of subverting encryption implemented by the LIWC developers. If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable *.dic file, please contact the distributor directly.

Setup

Install from PyPI:

pip install liwc

Example

This example reads the LIWC dictionary from a file named LIWC2007_English100131.dic, which looks like this:

%
1   funct
2   pronoun
[...]
%
a   1   10
abdomen*    146 147
about   1   16  17
[...]
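
To make the layout of the excerpt concrete, here is a minimal sketch of a reader for this two-section format. It is not this package's implementation: the function name `read_dic_sketch` is hypothetical, and the toy dictionary below uses invented categories and entries, not real LIWC data. It assumes the first `%`-delimited section maps category ids to names and that every line after the second `%` is a word pattern followed by category ids.

```python
def read_dic_sketch(lines):
    """Parse .dic-style lines into (lexicon, category_names).

    Hypothetical helper for illustration only; assumes a well-formed file.
    """
    categories = {}  # category id (str) -> category name
    lexicon = {}     # word pattern -> list of category names
    section = 0      # counts the '%' delimiter lines seen so far
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line == "%":
            section += 1
            continue
        fields = line.split()
        if section == 1:
            # category section: "<id> <name>"
            categories[fields[0]] = fields[1]
        elif section == 2:
            # word section: "<pattern> <id> <id> ..."
            lexicon[fields[0]] = [categories[cid] for cid in fields[1:]]
    return lexicon, list(categories.values())


# Invented toy dictionary (NOT real LIWC data)
toy_dic = """%
1\tgreeting
2\tfarewell
%
hello\t1
bye*\t2
ciao\t1\t2
"""
toy_lexicon, toy_category_names = read_dic_sketch(toy_dic.splitlines())
```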

Loading the lexicon

import liwc
parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')

  • parse is a function that maps a token of text (a string) to the LIWC categories it matches (a list of strings)
  • category_names lists all LIWC categories in the lexicon (a list of strings)
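
Entries such as abdomen* in the excerpt end in an asterisk. The sketch below illustrates what matching a token against such entries could look like, assuming a trailing * marks a prefix match and every other entry must match exactly. The function `match_categories` and the toy lexicon are hypothetical, and the linear scan is only for illustration; it makes no claim about how this package performs its lookup.

```python
def match_categories(token, lexicon):
    """Return the categories of every lexicon pattern that matches token.

    Hypothetical helper: a pattern ending in '*' matches any token that
    starts with the text before the '*'; other patterns match exactly.
    """
    matched = []
    for pattern, categories in lexicon.items():
        if pattern.endswith("*"):
            if token.startswith(pattern[:-1]):
                matched.extend(categories)
        elif token == pattern:
            matched.extend(categories)
    return matched


# Invented toy lexicon (NOT real LIWC data)
toy_lexicon = {"hello": ["greeting"], "bye*": ["farewell"]}

prefix_hit = match_categories("byebye", toy_lexicon)  # matches 'bye*'
exact_hit = match_categories("hello", toy_lexicon)    # matches 'hello'
no_hit = match_categories("hell", toy_lexicon)        # 'hello' has no wildcard
```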

Analyzing text

import re

def tokenize(text):
    # you may want to use a smarter tokenizer
    for match in re.finditer(r'\w+', text, re.UNICODE):
        yield match.group(0)

gettysburg = '''Four score and seven years ago our fathers brought forth on
  this continent a new nation, conceived in liberty, and dedicated to the
  proposition that all men are created equal. Now we are engaged in a great
  civil war, testing whether that nation, or any nation so conceived and so
  dedicated, can long endure. We are met on a great battlefield of that war.
  We have come to dedicate a portion of that field, as a final resting place
  for those who here gave their lives that that nation might live. It is
  altogether fitting and proper that we should do this.'''.lower()
# tokenize() returns a generator, so the tokens can be iterated over only once
gettysburg_tokens = tokenize(gettysburg)

Now, count all the categories in all of the tokens, and print the results:

from collections import Counter
gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
print(gettysburg_counts)
#=> Counter({'funct': 58, 'pronoun': 18, 'cogmech': 17, ...})
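
Raw counts depend on the length of the text, so you may want relative frequencies instead. The following is a minimal sketch of that step, not part of this package: `category_proportions` is a hypothetical helper, and `toy_parse` is an invented stand-in for the `parse` function so the snippet runs without a lexicon. It materializes the tokens into a list first, because a generator can be consumed only once and the token total is needed as well.

```python
from collections import Counter


def category_proportions(tokens, parse):
    """Map each category to (category matches) / (total number of tokens).

    Hypothetical helper; parse is any function from a token to its categories.
    """
    tokens = list(tokens)  # materialize: a generator would be exhausted after one pass
    if not tokens:
        return {}
    counts = Counter(category for token in tokens for category in parse(token))
    return {category: count / len(tokens) for category, count in counts.items()}


def toy_parse(token):
    # Invented stand-in for the real parse function (NOT real LIWC data)
    return ["greeting"] if token == "hello" else []


toy_proportions = category_proportions(iter(["hello", "world", "hello", "there"]), toy_parse)
```

With the real lexicon loaded, the same helper could be called as category_proportions(tokenize(gettysburg), parse).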

N.B.:

  • The LIWC lexicon only matches lowercase strings, so you will most likely want to lowercase your input text before passing it to parse(...). In the example above, I call .lower() on the entire string, but you could alternatively incorporate that into your tokenization process (e.g., by using spaCy's token.lower_).
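
As a sketch of the second option without any extra dependency, the lowercasing can be folded into the regex tokenizer itself. The name `tokenize_lower` is hypothetical; it is a variant of the tokenize function above, not something this package provides.

```python
import re


def tokenize_lower(text):
    """Yield each run of word characters in text, lowercased."""
    for match in re.finditer(r"\w+", text):
        yield match.group(0).lower()


sample_tokens = list(tokenize_lower("Four score, AND seven years ago"))
```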

License

Copyright (c) 2012-2020 Christopher Brown. MIT Licensed.
