cgpotts / cs224u

Code for Stanford CS224u

Trouble with hw_wordsim: Combining PPMI and LSA

AndrewLim1990 opened this issue

In the second homework question for hw_wordsim, we are provided with the following instructions:

Gigaword with LSA at different dimensions [0.5 points]
We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyper-parameter k – the dimensionality of the final representations – that will impact performance. For this problem, write a wrapper function run_ppmi_lsa_pipeline that does the following:

1. Takes as input a count pd.DataFrame and an LSA parameter k.
2. Reweights the count matrix with PPMI.
3. Applies LSA with dimensionality k.
4. Evaluates this reweighted matrix using full_word_similarity_evaluation. The return value of run_ppmi_lsa_pipeline should be the return value of this call to full_word_similarity_evaluation.
The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem.

The function test_run_ppmi_lsa_pipeline will test your function on the count matrix in data/vsmdata/giga_window20-flat.csv.gz.

When I construct run_ppmi_lsa_pipeline with the following steps, I get the wrong output:

  1. Reweight input count_df using: ppmi_df = vsm.pmi(count_df, positive=True)
  2. Perform LSA using: vsm.lsa(ppmi_df, k)
  3. Calculate similarities

This leads to a similarity evaluation for men of 0.57, not the expected 0.16.

However, when I construct run_ppmi_lsa_pipeline using the following steps (without applying PPMI), I get the correct output:

  1. Perform LSA using vsm.lsa(count_df, k)
  2. Calculate similarities

This leads to the expected similarity for men of 0.16.
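
To make the difference between the two variants concrete, here is a minimal, self-contained sketch that runs both on a toy matrix. The helpers `ppmi` and `lsa` are hypothetical NumPy stand-ins written for this illustration, not the course's `vsm.pmi` and `vsm.lsa`, and the toy counts are made up. The evaluation step is left out because `full_word_similarity_evaluation` belongs to the course code.

```python
import numpy as np
import pandas as pd


def ppmi(count_df):
    """Positive PMI reweighting of a co-occurrence count matrix (illustrative stand-in)."""
    counts = count_df.to_numpy(dtype=float)
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)
    col_totals = counts.sum(axis=0, keepdims=True)
    # Expected count of each cell if rows and columns were independent.
    expected = row_totals @ col_totals / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    # Zero counts yield -inf (or nan); PPMI maps those and all negatives to 0.
    pmi[~np.isfinite(pmi)] = 0.0
    return pd.DataFrame(
        np.maximum(pmi, 0.0), index=count_df.index, columns=count_df.columns
    )


def lsa(df, k):
    """Truncated SVD: top-k left singular vectors scaled by their singular values."""
    u, s, _ = np.linalg.svd(df.to_numpy(dtype=float), full_matrices=False)
    return pd.DataFrame(u[:, :k] * s[:k], index=df.index)


# Made-up toy counts, only for illustration.
toy_counts = pd.DataFrame(
    [[10, 0, 3], [0, 8, 1], [2, 1, 6]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

with_ppmi = lsa(ppmi(toy_counts), k=2)  # first variant: PPMI, then LSA
lsa_only = lsa(toy_counts, k=2)         # second variant: LSA on raw counts

print(with_ppmi)
print(lsa_only)
```

The two outputs differ because PPMI rescales every cell before the SVD is taken, so an expected value computed from one variant cannot be used to check the other.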

Is there a mistake in the instructions/expected results? Perhaps I did something wrong? Any and all help would be appreciated.

@AndrewLim1990 Many thanks for this! You're quite right: the test is incorrectly assuming an input raw count matrix. I suppose I need tests for these tests! I will push an update later today.

@cgpotts Awesome! Thanks for the update.

This is fixed now. Thanks again!