juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/

unnest_tokens error

kanatea opened this issue · comments

Hi!

I started to get the following error message when I attempted to tokenize tweets with unnest_tokens():
Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, : 'breaks' are not unique

The dataset I'm running is made up of several Twitter pools combined with rbind(), but the issue seems to affect only certain sets within that dataset: I ran the command on the smaller sub-datasets and got the error message for two out of three of them. Other datasets combined in the same way work fine with this command, so I'm wondering what the issue is.

I tried the solution from the 2018 issue, but it didn't solve the problem. Would appreciate any input!

Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for me to recreate your problem so that I can understand it and/or fix it; I need to be able to see something about the data that is causing this error.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

Thanks so much for working with me on this @kanatea! What you have shared here is an image/screenshot, rather than a reprex. The idea of a reprex is that you give me enough information to be able to run the code myself and we can find out what the problem is together. I unfortunately can't run this myself, because I don't have access to this data.

Check out reprex do's and don'ts here to find out some ways to get effective help; especially pay attention to the recommendations on how to share data.

Hi there!

Sorry for the trouble earlier; R wasn't allowing me to export the reprex. Also, we realized the problem isn't the code itself but, for some reason, the data, so I am attaching the data and reprex here in a zip file:
sampledata.zip

Thank you so much!

Hello there, @kanatea! Unfortunately, loading up whole R workspaces with an .Rdata file isn't going to set us up for success in being able to solve a problem like this.

To be able to get help with R problems online, it's important to be able to create small, self-contained examples that demonstrate what you are running into. I really do recommend you check out the resources on the reprex site, like maybe this article on how to use datapasta + reprex, or this article on the RStudio Community site that has some more info.

I recommend taking the part of the data that is causing you problems and cutting it down to the smallest, simplest example dataset that still demonstrates the problem. Then cut down the code you are sharing until you find the first thing that errors. This page has some great slides to demonstrate.
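As a rough illustration of that shrinking step, here is a minimal sketch in base R. The data frame `tweets`, its columns `id` and `text`, and the choice of rows 2 to 4 are all hypothetical stand-ins, not the data from this issue; the idea is only to show how a small subset can be turned into pasteable code with `dput()`.

```r
# Hypothetical stand-in for the real tweet data; column names are assumptions.
tweets <- data.frame(
  id = 1:6,
  text = c("first tweet here", "second tweet", "third tweet",
           "fourth tweet", "fifth one", "sixth"),
  stringsAsFactors = FALSE
)

# Step 1: keep only a handful of rows. With real data, these would be
# rows that still trigger the error when run on their own.
small <- tweets[2:4, , drop = FALSE]

# Step 2: print R code that recreates `small`, ready to paste into an issue
# in place of an attached file.
dput(small)

# Step 3 (only if tidytext is installed): run the failing call on the
# small data alone, so the first error is easy to see.
if (requireNamespace("tidytext", quietly = TRUE)) {
  try(tidytext::unnest_tokens(small, word, text))
}
```

The output of `dput()` plus the single failing call is usually enough to paste into `reprex::reprex()`, so no workspace file or zip needs to be shared.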

I know this is tough if you haven't done it much before:

This seems like a lot of work!

Feel free to open a new issue in the future! 🙌

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.