unnest_tokens error

Question

kanatea opened this issue 4 years ago

Hi!

I started to get the following error message when I attempted to tokenize tweets with unnest_tokens():
Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, : 'breaks' are not unique

The dataset I'm running this on consists of several Twitter pools bound together with rbind(), but the issue seems to be limited to certain sets within that dataset (I ran the command on the smaller sub-datasets and got the error message for two out of three of them). Other datasets bound in the same way work fine with this command, so I'm wondering what the issue is.

I tried the solution from the 2018 issue, but it didn't solve the problem. I would appreciate any input!
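
For readers who arrive here with the same message: the error text comes from base R's cut(), which stops when its break points contain duplicates. The minimal sketch below only reproduces that message in isolation. The values are made up, and nothing in this thread establishes why unnest_tokens() ends up with duplicate breaks for this particular data.

```r
# Base R only: cut() refuses break points that contain duplicates.
# The numbers here are arbitrary illustration values, not the tweet data.
res <- tryCatch(
  cut(1:5, breaks = c(1, 1, 3), include.lowest = TRUE),
  error = function(e) conditionMessage(e)
)

# Show the captured error message.
print(res)
```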

Julia Silge · Answer 1 · Sat Oct 31 2020 03:03:39 GMT+0800 (China Standard Time)

Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for me to recreate your problem so that I can understand it and/or fix it; I need to be able to see something about the data that is causing this error.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

Julia Silge · Answer 2 · Tue Nov 03 2020 23:08:10 GMT+0800 (China Standard Time)

Thanks so much for working with me on this @kanatea! What you have shared here is an image/screenshot, rather than a reprex. The idea of a reprex is that you give me enough information to be able to run the code myself and we can find out what the problem is together. I unfortunately can't run this myself, because I don't have access to this data.

Check out reprex do's and don'ts here to find out some ways to get effective help; especially pay attention to the recommendations on how to share data.
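
One common way to share data as text rather than as a screenshot or file is base R's dput(), which prints code that rebuilds an object. The sketch below assumes a small made-up data frame called tweets standing in for a few rows of the real data; the column names and values are hypothetical.

```r
# Hypothetical stand-in for a few rows of the real tweet data.
tweets <- data.frame(
  id = 1:3,
  text = c("first example tweet", "second one", NA),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object;
# that printed code can be pasted straight into a post.
dput(tweets)

# Round-trip check: evaluating the deparsed code rebuilds the data frame.
rebuilt <- eval(parse(text = deparse(tweets)))
stopifnot(isTRUE(all.equal(rebuilt, tweets)))
```

Anyone can then copy the printed structure(...) call into their own session and work with the same rows.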

Kana Tateishi · Answer 3 · Fri Nov 06 2020 03:20:59 GMT+0800 (China Standard Time)

Hi there!

Sorry for the trouble earlier; R wasn't allowing me to export the reprex. Also, we realized the problem isn't the code itself but, for some reason, the data, so I am attaching the data and reprex here in a zip file:
sampledata.zip

Thank you so much!

Julia Silge · Answer 4 · Sat Nov 07 2020 01:30:17 GMT+0800 (China Standard Time)

Hello there, @kanatea! Unfortunately, loading up whole R workspaces with an .Rdata file isn't going to set us up for success in being able to solve a problem like this.

To be able to get help with R problems online, it's important to be able to create small, self-contained examples that demonstrate what you are running into. I really do recommend you check out the resources on the reprex site, like maybe this article on how to use datapasta + reprex, or this article on the RStudio Community site that has some more info.

I recommend taking the part of the data that is causing problems and shrinking it down to the smallest, simplest example dataset that still demonstrates the problem, then cutting down the code you share until you reach the first thing that errors. This page has some great slides to demonstrate.
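
A rough sketch of that shrinking process, in base R, might look like the following. Everything in it is an assumption for illustration: failing_step() is a hypothetical stand-in for the real call (for example, a wrapper around unnest_tokens()), and it fails on empty text purely so the sketch runs without the real data. That is not a claim about what causes the error in this issue.

```r
# Hypothetical stand-in for the real call, e.g.
# function(d) tidytext::unnest_tokens(d, word, text).
# It fails on empty text only so this sketch is runnable on its own.
failing_step <- function(d) {
  if (any(!nzchar(d$text))) stop("simulated failure")
  d
}

# TRUE if the step errors on this subset of rows.
fails <- function(d) {
  inherits(try(failing_step(d), silent = TRUE), "try-error")
}

# Repeatedly keep whichever half still fails, until halving no longer helps.
shrink <- function(d) {
  while (nrow(d) > 1) {
    mid <- nrow(d) %/% 2
    first <- d[seq_len(mid), , drop = FALSE]
    second <- d[(mid + 1):nrow(d), , drop = FALSE]
    if (fails(first)) {
      d <- first
    } else if (fails(second)) {
      d <- second
    } else {
      break  # the error needs rows from both halves; stop shrinking
    }
  }
  d
}

# Made-up data standing in for the problematic tweets.
tweets <- data.frame(
  id = 1:8,
  text = c("a b", "c d", "e", "f g", "", "h", "i j", "k"),
  stringsAsFactors = FALSE
)

small <- shrink(tweets)
dput(small)  # paste this output, plus the failing call, into the reprex
```

Halving keeps the number of runs small even for large datasets; if the loop stops early, the rows that remain are still a much smaller example to share.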

I know this is tough if you haven't done it much before:

This seems like a lot of work!

Julia Silge · Answer 5 · Thu Dec 10 2020 00:37:46 GMT+0800 (China Standard Time)

Feel free to open a new issue in the future! 🙌

github-actions · Answer 6 · Thu Mar 24 2022 08:08:27 GMT+0800 (China Standard Time)

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.