explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

stop_words assigned but not used?

ExplodingCabbage opened this issue

Maybe I'm being dense, but when I search the entire repo case-insensitively for stop_words, it looks like you're defining a list of stop words but never using it. Every match in the Sublime search below is an assignment; all you seem to do is define STOP_WORDS constants in language_data.py files and then assign those constants to the stop_words class property of a Language's Defaults, without ever then reading from it:

Searching 535 files for "stop_words"

/home/mark/spaCy/spacy/language.py:
  153      tagger_features = Tagger.feature_templates # TODO -- fix this
  154  
  155:     stop_words = set()
  156  
  157      lex_attr_getters = {

/home/mark/spaCy/spacy/de/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/de/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/en/__init__.py:
   29          tag_map = dict(language_data.TAG_MAP)
   30  
   31:         stop_words = set(language_data.STOP_WORDS)
   32  

/home/mark/spaCy/spacy/en/language_data.py:
    4  
    5  # improved list from Stone, Denis, Kwantes (2010)
    6: STOP_WORDS = set("""
    7  a about above across after afterwards again against all almost alone 
    8  along already also although always am among amongst amoungst amount 

/home/mark/spaCy/spacy/es/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  

/home/mark/spaCy/spacy/es/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/fr/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/fr/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/it/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/it/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/pt/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/pt/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

19 matches across 13 files
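For anyone who wants to repeat the check without Sublime, here is a rough sketch that separates lines assigning to a stop_words name from lines that mention it in any other way. The function names classify_line and scan are made up for this example, and it only looks at .py files, which is narrower than the 535-file search above.

```python
import re
from pathlib import Path

# A line counts as an "assignment" when a name containing "stop_words"
# is the assignment target, e.g. "stop_words = set()" or "STOP_WORDS = set(...)".
ASSIGNMENT = re.compile(r"^\s*\w*stop_words\w*\s*=(?!=)", re.IGNORECASE)
MENTION = re.compile(r"stop_words", re.IGNORECASE)


def classify_line(line):
    """Return 'assignment', 'other', or None if the line has no match."""
    if not MENTION.search(line):
        return None
    return "assignment" if ASSIGNMENT.match(line) else "other"


def scan(root):
    """Yield (path, line_number, kind, text) for each matching line under root."""
    for path in sorted(Path(root).rglob("*.py")):
        lines = path.read_text(errors="ignore").splitlines()
        for number, line in enumerate(lines, 1):
            kind = classify_line(line)
            if kind:
                yield path, number, kind, line.strip()
```

Running scan over a checkout and filtering for kind == "other" would list any line that uses the name without assigning to it. Note that a line such as stop_words = set(language_data.STOP_WORDS) is reported once, as an assignment, even though Sublime counts it as two matches.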

Does this list still have a purpose, or should it be culled? I thought I'd flag this up before any of you dutifully hunt down stop word lists for the new languages you're adding!

Apologies if there's some reason for this to exist that I'm missing.

Aside: the English STOP_WORDS list contains some surprising entries like "computer", "fire", and "mill" that seem arbitrary to treat as stop words. I've tracked the source down to http://onlinelibrary.wiley.com/store/10.1111/j.1756-8765.2010.01108.x/asset/supinfo/TOPS_1108_sm_supmat.pdf?v=1&s=715bd019aab0c2df0c269b487209c1342143a0a6, and it seems that this was indeed the stop word list used in http://onlinelibrary.wiley.com/doi/10.1111/j.1756-8765.2010.01108.x/full. Regardless, if this list is sticking around, perhaps these seemingly inappropriate entries should be addressed.

Thanks.

What's supposed to happen is that the IS_STOP attribute in the Language class maps to a function that looks up the word in the stop list. I see that this got broken somewhere.
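To make the intended wiring concrete, here is a minimal sketch using simplified stand-in classes, not spaCy's actual code. BaseDefaults, EnglishDefaults, create_lex_attr_getters and the string key used for IS_STOP are all assumptions for illustration, as is the choice to lowercase the word before the lookup.

```python
# Simplified stand-ins for illustration; these are not spaCy's real classes.
IS_STOP = "IS_STOP"  # placeholder key; the real attribute ID is assumed to differ


class BaseDefaults:
    stop_words = set()

    @classmethod
    def create_lex_attr_getters(cls):
        # Resolve cls.stop_words here rather than on the base class, so a
        # subclass's own stop list is the one that gets consulted.
        stop_words = frozenset(cls.stop_words)
        # Lowercasing before the lookup is an assumption of this sketch.
        return {IS_STOP: lambda string: string.lower() in stop_words}


class EnglishDefaults(BaseDefaults):
    stop_words = set("a about above across after".split())


getters = EnglishDefaults.create_lex_attr_getters()
is_stop = getters[IS_STOP]
```

With this shape, is_stop("About") is true for the English defaults, while the same getter built from BaseDefaults answers false for every word because its stop list is empty. The point is only that something has to read stop_words when the getters are built; if nothing does, the list is assigned and never used, which is what the search above shows.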

Agree about the English stop list.

Re English stopwords: I'm currently in the process of reorganising the language data. Just posted an update here: #649

Put a band-aid on this for now. A more satisfying fix will come alongside the data reorganisation.
