dipanjanS / text-analytics-with-python

Learn how to process, classify, cluster, and summarize text data, and to understand its syntax, semantics, and sentiment, with the power of Python! This repository contains the code and datasets used in my book, "Text Analytics with Python," published by Apress/Springer.

Convert code base for Python 3.x

dipanjanS opened this issue · comments

Python 3 is the future. A lot of legacy code and systems still run on Python 2 (including our applications, which is why I wrote this book in Python 2 in the first place), but we need to slowly start migrating and building our code, apps, and systems on Python 3.

Looking for experts in Python 3.x as well as NLP and text analytics who could help migrate each chapter's codebase to Python 3.x, since I am occupied for a major part of this year with other projects. I do have some parts of it ready for Python 3.x and can offer help and support whenever needed.

Successful codebase migrations will make sure you are mentioned as a contributor in the acknowledgements and contributor list of this repository and project. You will also get a mention in future editions of the book whenever one is in the pipeline.

I think I can help you with this.

If anyone is interested, I have updated almost all of Chapters 1 to 4. Chapters 5 to 7 are showing an 'lfs' error; I will try to resolve that later, but feel free to fork and make pull requests.

Here's the repo: text-analytics-with-python

Thanks, but as I said, we need to follow a structured workflow and approach instead of working in an ad-hoc manner so that merges are hassle-free. Please hold off on further conversions because I need to restructure the current repo and put out a plan. I will do so in a couple of days.

Okay, I will hold off on it. I would like to note that the pattern module does not seem to support Python 3 yet. This will hopefully change in the future, but Chapters 5-7 (and, I think, some of Chapter 4) are blocked for now.
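One possible stopgap for the pattern-dependent chapters (just a sketch of my own, not code from the book) would be to guard the import and fall back to a stand-in, so the rest of a chapter can still run end to end during the port:

```python
# Hypothetical guard, not from the book's codebase: pattern has no Python 3
# release, so degrade gracefully instead of failing at import time.
try:
    from pattern.en import sentiment  # Python 2 only as of this thread
    HAS_PATTERN = True
except ImportError:
    HAS_PATTERN = False

    def sentiment(text):
        # Crude stand-in mimicking pattern's (polarity, subjectivity) shape,
        # only so dependent chapter code can run during the migration.
        positive = {'good', 'great', 'excellent'}
        negative = {'bad', 'poor', 'terrible'}
        words = text.lower().split()
        score = sum((w in positive) - (w in negative) for w in words)
        return (score / max(len(words), 1), 0.0)
```

The stand-in is obviously not a real sentiment model; the point is only to keep imports from crashing on Python 3 until pattern (or a port of it) catches up.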

Sure, and yes, I'm aware of the issue with pattern. There is an unofficial port, but sadly it has been incomplete for the last couple of years. I've thought of some strategies to tackle this. Let me restructure the current repository, then we can get started on this in more detail. I'll update once that is done, and we can port and merge chapter by chapter.

Here is the first phase of the plan; once each step is done, it will be checked off to keep track. I am currently on vacation, so I will update you all as soon as the restructuring is done.

  • Re-structure current repository @dipanjanS

  • Contributors to pull in latest changes

  • Port code for chapters 1-3 and send pull requests for each chapter separately

  • Merge subsequent pull requests to main repository after review @dipanjanS

  • Look into the pattern repository and necessary modules needed @dipanjanS

  • Discuss strategies for porting remaining chapters and post the plan for the same

@dipanjanS
This idea might sound a bit weird, but do you think it makes sense to add type hints to the Python 3.x code examples?

It just might make it a bit easier to read through the code in the book, and it enables code completion and correct jump-to-definition in PyCharm.

@ambientlight Sorry, I'm a bit tied up with work and a couple of other things, so I'm not finding time to look into this. Maybe I will sometime soon. Regarding your query: are you talking about type hints as in specifying the data type per variable in the code? If so, maybe we can look into it once the entire code is ported.

@dipanjanS Got it, thanks a lot!
Method parameters and return types would, I think, be good enough; a variable's type is normally evident from the right-hand side of the expression.

I have ported a few things up to Chapter 4. Something like this:

import re
from typing import Dict, List, Optional, Tuple

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer


class Normalizer:

    stopwords: List[str] = nltk.corpus.stopwords.words('english')
    wnl = WordNetLemmatizer()

    @staticmethod
    def tokenize_text(text: str) -> List[str]:
        tokens: List[str] = nltk.word_tokenize(text)
        tokens = [token.strip() for token in tokens]
        return tokens

    @staticmethod
    def expand_contractions(text: str, contraction_mapping: Dict[str, str]) -> str:
        contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                          flags=re.IGNORECASE | re.DOTALL)

        def expand_match(contraction):
            match = contraction.group(0)
            first_char = match[0]
            expanded_contraction = \
                contraction_mapping.get(match) \
                if contraction_mapping.get(match) \
                else contraction_mapping.get(match.lower())

            expanded_contraction = first_char + expanded_contraction[1:]
            return expanded_contraction

        expanded_text = contractions_pattern.sub(expand_match, text)
        expanded_text = re.sub("'", "", expanded_text)
        return expanded_text

    # Annotate text tokens with POS tags
    @staticmethod
    def pos_tag_text(text: str) -> List[Tuple[str, Optional[str]]]:
        # convert Penn treebank tag to wordnet tag
        def penn_to_wn_tags(pos_tag):
            if pos_tag.startswith('J'):
                return wn.ADJ
            elif pos_tag.startswith('V'):
                return wn.VERB
            elif pos_tag.startswith('N'):
                return wn.NOUN
            elif pos_tag.startswith('R'):
                return wn.ADV
            else:
                return None

        tagged_text = nltk.pos_tag(Normalizer.tokenize_text(text))
        tagged_lower_text = [(word.lower(), penn_to_wn_tags(pos_tag)) for word, pos_tag in tagged_text]
        return tagged_lower_text

I can contribute the typing later on if it would be appropriate.

I am not sure whether everything is now ported to Python 3; if not, I can contribute. I will check out the repo and add some tests for Python 3.
Bhushan

@ambientlight @pribond

Sure, thanks for the interest. The code is currently in Python 2. Unfortunately I am a bit preoccupied with several things at work and one of my books. I'm planning to resume this around the end of August, hopefully, or even earlier.

I still need to refactor the repository so that we have the code separate for Python 2 and 3. I will notify all in this thread once we are ready to start porting.

@dipanjanS What's the status of this issue? I'd be happy to help out.

@dipanjanS is there any plan to convert this to Jupyter notebooks?

Sorry folks, I'm a bit tied up with multiple engagements at the moment. The following is what I promise as soon as I can get to it.

  • Code in both Python 2 and 3
  • Jupyter notebooks besides normal code files

Collaborating with some folks from work for better output and ease of communication. In case I need additional help I will update here.

@dipanjanS I can help you with this if this issue is still open. I think creating Jupyter notebooks will make it more interactive. Let me know if you need help on this.

Thanks

Hi,

Can you please help me with the latest code for Python 3.5 on a 64-bit operating system? I am using Visual Studio 2017 to run the code.

I would say, use a Jupyter notebook rather than Visual Studio. Converting Python 2 into Python 3 is simple.

Kindly go through the book to get the details of what has been used. For now the code runs on Python 2.7.x, and you can use the Anaconda distribution; the same is mentioned in the book. Work is in progress to convert the code to Python 3 as well as Jupyter notebooks. Once that is done, it will be updated here.

Can you please share any step-by-step guideline documents on how to convert the code from Python 2.x to 3.x using Jupyter notebooks?

Jupyter notebooks are not used for code conversion; they are a mechanism to run code, document your findings, and share them with others easily if needed. You need to use your own logic and utility libraries like 2to3 or six to convert the code.
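To give a flavour of what those tools target (a hedged, standard-library-only sketch, not code from the book): six mainly helps you write one codebase that behaves the same on both versions, for example by dispatching on the interpreter version the way `six.text_type` does:

```python
# Sketch of the "write once, run on 2 and 3" style that six encourages.
from __future__ import print_function, unicode_literals

import sys

# six.text_type does exactly this dispatch; shown inline for illustration.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 -- Python 2 built-in name

def normalize_token(token):
    """Lowercase and strip a token the same way on Python 2 and 3."""
    return text_type(token).strip().lower()

print(normalize_token('  Hello '))  # hello
```

2to3, by contrast, rewrites Python 2 source into Python 3 source in one shot (e.g. `print` statements into `print()` calls), so it suits a clean break rather than a dual-version codebase.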

Any plans to port the code to Python 3 in 2018?

@peterotool Thanks for bringing this up! Yep, work is already underway on this; we are planning to bring out a new, revised edition of this book with all code in Python 3, adding new examples, use cases, and so on. Stay tuned! The book is going to come back better and with more content!

@dipanjanS Is it possible to create a chatbot using some deep learning architecture?

@samuelxmli Can you please stop spamming the same question everywhere? You have already created two issues/comments. Closing this issue, since I have replied on the other thread and soon we will be doing a revised version of this book in Python 3.x.