percyliang / brown-cluster

C++ implementation of the Brown word clustering algorithm.

Is there any limit for the vocab size (#types)?

rasoolims opened this issue

The code fails with a segmentation fault (core dumped) when I run it on a huge text file (about 20M types, 14GB in size). I have already used wcluster on files with far fewer types and it worked well.

Is there any limit for the vocabulary size (#types)?

I'm not sure what the exact limit is, but I'm not surprised that it failed with 20M types. You can try the restrict command-line option to limit the run to a smaller vocabulary. The Brown clustering algorithm dates back to a time when people didn't have 14GB text files to work with.
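
For concreteness, here is a hedged sketch of what such an invocation might look like. The --text and --c flags follow the README's basic usage, and --min-occur (the flag mentioned later in this thread) is assumed to drop types occurring fewer than N times; verify the exact option names against ./wcluster --help.

```
# Hedged sketch, not a verified command line: restrict the vocabulary
# by ignoring rare types so the type count stays manageable.
# --min-occur 5 is assumed to keep only words seen at least 5 times.
./wcluster --text input.txt --c 1000 --min-occur 5
```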

I noticed that at the end of March a new commit was made, labeled "Enable >= 2^31 tokens in input data", so I thought it would have addressed the issue raised here. However, I still run into a problem similar to the one rasoolims described: I can successfully run the code only on a file of about 10M tokens (700K types); with bigger files it fails with a segmentation fault (core dumped).
Any suggestions?

thanks

Did you try using the flag to restrict the vocabulary?

Do you mean the min-occur flag?
It seems to affect only efficiency.
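
If the min-occur flag alone doesn't avoid the crash, one hedged workaround (my own suggestion, not something documented in this repo) is to cap the vocabulary outside of wcluster by mapping rare tokens to a single placeholder before clustering:

```
# Hedged sketch of an external pre-filtering step (not part of wcluster):
# keep only tokens seen at least 5 times and map everything else to <unk>,
# so the type count handed to wcluster stays small.
tr ' ' '\n' < input.txt | sort | uniq -c | awk '$1 >= 5 {print $2}' > keep.txt
awk 'NR==FNR {keep[$1] = 1; next}
     {for (i = 1; i <= NF; i++) if (!($i in keep)) $i = "<unk>"; print}' \
    keep.txt input.txt > input.filtered.txt
./wcluster --text input.filtered.txt --c 1000
```

For a 14GB file the sort step is slow, but it only has to run once, and the filtered file keeps the original token count while drastically reducing the number of types.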

I know this is late and probably no longer important to the OP, but for anyone else facing the same issue, this PR fixed it for me.