Trouble with hw_wordsim: Combining PPMI and LSA
AndrewLim1990 opened this issue · comments
In the second homework question for hw_wordsim, we are provided with the following instructions:
Gigaword with LSA at different dimensions [0.5 points]
We might expect PPMI and LSA to form a solid pipeline that combines the strengths of PPMI with those of dimensionality reduction. However, LSA has a hyperparameter k (the dimensionality of the final representations) that will impact performance. For this problem, write a wrapper function run_ppmi_lsa_pipeline that does the following:
1. Takes as input a count pd.DataFrame and an LSA parameter k.
2. Reweights the count matrix with PPMI.
3. Applies LSA with dimensionality k.
4. Evaluates this reweighted matrix using full_word_similarity_evaluation. The return value of run_ppmi_lsa_pipeline should be the return value of this call to full_word_similarity_evaluation.
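The steps above can be sketched as follows. This is a minimal sketch, not the course's implementation: the `ppmi` and `lsa` helpers below are standard stand-ins for `vsm.pmi` and `vsm.lsa`, and the evaluation step is passed in as a parameter (in the homework it would be `full_word_similarity_evaluation`).

```python
import numpy as np
import pandas as pd

def ppmi(df):
    """Positive PMI reweighting of a raw count DataFrame (stand-in for vsm.pmi)."""
    X = df.to_numpy(dtype=float)
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)   # per-word totals
    col = X.sum(axis=0, keepdims=True)   # per-context totals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(X * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0         # zero counts give -inf; clamp to 0
    return pd.DataFrame(np.maximum(pmi, 0.0), index=df.index, columns=df.columns)

def lsa(df, k):
    """Truncated SVD to k dimensions (stand-in for vsm.lsa)."""
    U, s, _ = np.linalg.svd(df.to_numpy(dtype=float), full_matrices=False)
    return pd.DataFrame(U[:, :k] * s[:k], index=df.index)

def run_ppmi_lsa_pipeline(count_df, k, evaluate):
    """Reweight the counts with PPMI, reduce with LSA, then evaluate.

    `evaluate` stands in for full_word_similarity_evaluation, whose return
    value this wrapper passes through unchanged."""
    ppmi_df = ppmi(count_df)
    lsa_df = lsa(ppmi_df, k)
    return evaluate(lsa_df)
```

The wrapper is deliberately thin: all the modeling decisions live in the reweighting and reduction steps, so the only thing to get right is their order (PPMI first, then LSA).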
The goal of this question is to help you get a feel for how much LSA alone can contribute to this problem.
The function test_run_ppmi_lsa_pipeline will test your function on the count matrix in data/vsmdata/giga_window20-flat.csv.gz.
When I construct `run_ppmi_lsa_pipeline` with the following steps, I get the wrong output:
- Reweight the input `count_df` using `ppmi_df = vsm.pmi(count_df, positive=True)`
- Perform LSA using `vsm.lsa(ppmi_df, k)`
- Calculate similarities

This leads to a similarity evaluation of 0.57 for `men`, not the expected 0.16.
However, when I construct `run_ppmi_lsa_pipeline` with the following steps (without applying PPMI), I get the correct output:
- Perform LSA using `vsm.lsa(count_df, k)`
- Calculate similarities

This leads to the expected `men` similarity of 0.16.
Is there a mistake in the instructions or the expected results, or have I done something wrong? Any help would be appreciated.
@AndrewLim1990 Many thanks for this! You're quite right: the test is incorrectly assuming an input raw count matrix. I suppose I need tests for these tests! I will push an update later today.
@cgpotts Awesome! Thanks for the update.
This is fixed now. Thanks again!