Trouble with hw_wordsim: Combining PPMI and LSA

Question

AndrewLim1990 opened this issue 4 years ago

In the second homework question for hw_wordsim, we are provided with the following instructions:

Gigaword with LSA at different dimensions [0.5 points]
We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyper-parameter 𝑘, the dimensionality of the final representations, that will impact performance. For this problem, write a wrapper function run_ppmi_lsa_pipeline that does the following:

1. Takes as input a count pd.DataFrame and an LSA parameter k.
2. Reweights the count matrix with PPMI.
3. Applies LSA with dimensionality k.
4. Evaluates this reweighted matrix using full_word_similarity_evaluation. The return value of run_ppmi_lsa_pipeline should be the return value of this call to full_word_similarity_evaluation.
The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem.

The function test_run_ppmi_lsa_pipeline will test your function on the count matrix in data/vsmdata/giga_window20-flat.csv.gz.

When I construct run_ppmi_lsa_pipeline with the following steps, I get the wrong output:

1. Reweight the input count_df using: ppmi_df = vsm.pmi(count_df, positive=True)
2. Perform LSA using: vsm.lsa(ppmi_df, k)
3. Calculate similarities

This leads to a similarity evaluation for men of 0.57, not the expected 0.16.
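
For readers following along, the reweight-then-reduce steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the course's vsm implementation: the names ppmi, lsa, and ppmi_lsa and the toy count matrix are made up for the example, and the evaluation step (full_word_similarity_evaluation) is left out.

```python
import numpy as np


def ppmi(counts):
    """Positive PMI reweighting of a (words x contexts) count matrix.

    Hypothetical helper for illustration; not the vsm.pmi source.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Expected count of each cell if rows and columns were independent.
    expected = row_sums @ col_sums / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    # log(0) gives -inf and 0/0 gives nan; treat both as 0.
    pmi[~np.isfinite(pmi)] = 0.0
    # "Positive" PMI: clip negative values to 0.
    return np.maximum(pmi, 0.0)


def lsa(matrix, k):
    """Truncated SVD: keep the top k dimensions, scaled by singular values.

    Hypothetical helper for illustration; not the vsm.lsa source.
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def ppmi_lsa(counts, k):
    """Reweight with PPMI first, then reduce to k dimensions."""
    return lsa(ppmi(counts), k)


# Toy counts (assumed data): rows 0-1 share contexts, as do rows 2-3.
toy_counts = np.array([
    [10, 0, 3, 1],
    [8, 1, 2, 0],
    [0, 9, 1, 7],
    [1, 7, 0, 8],
])
reduced = ppmi_lsa(toy_counts, k=2)
```

Swapping ppmi_lsa(toy_counts, k=2) for lsa(toy_counts.astype(float), k=2) gives the variant that skips PPMI, which is the difference between the two attempts described in this issue.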

However, when I construct run_ppmi_lsa_pipeline using the following steps (without applying PPMI), I get the correct output:

1. Perform LSA using: vsm.lsa(count_df, k)
2. Calculate similarities

This leads to the expected similarity for men of 0.16.

Is there a mistake in the instructions/expected results? Perhaps I did something wrong? Any and all help would be appreciated.

Christopher Potts · Answer 1 · Tue Feb 11 2020 06:15:53 GMT+0800 (China Standard Time)

@AndrewLim1990 Many thanks for this! You're quite right: the test is incorrectly assuming an input raw count matrix. I suppose I need tests for these tests! I will push an update later today.

andrew.lim · Answer 2 · Tue Feb 11 2020 06:16:42 GMT+0800 (China Standard Time)

@cgpotts Awesome! Thanks for the update.

Christopher Potts · Answer 3 · Tue Feb 11 2020 09:03:54 GMT+0800 (China Standard Time)

This is fixed now. Thanks again!