Quotation marks cause inaccurate word count

Question

leegrey opened this issue 11 years ago · comments

It seems that any block of characters that begins with a non-alphanumeric character is not being counted toward the word count. This means that the first word of any line of dialog, or the first word of any parenthetical statement, is being ignored.

For example, none of these words would be counted:

"foo" 'foo' (foo) {foo}

Depending on the content, this behaviour can put the word count quite far off.

James Brooks · Answer 1 · Mon Sep 16 2013 22:18:44 GMT+0800 (China Standard Time)

@leegrey which branch are you using? WordCount actually splits by one single space character. Nothing fancy goes on.

Lee Grey · Answer 2 · Tue Sep 17 2013 18:37:12 GMT+0800 (China Standard Time)

Hi. I'm using the master branch in Sublime Text 2 on OS X 10.6.8. Both the version from the package manager and the one from GitHub (the same?) have the same issue. Basically, if I write "foo" "foo" "foo" "foo" "foo" (any text in quote marks) as many times as I like, it will say I have a word count of zero. It is not a huge deal, but it is inaccurate. @jbrooksuk - do you see the behaviour I'm describing?

Tito · Answer 3 · Tue Sep 17 2013 19:06:06 GMT+0800 (China Standard Time)

Interesting problem, I'm working on a fix.

Pro App · Answer 4 · Wed Sep 18 2013 22:00:53 GMT+0800 (China Standard Time)

I think there is also a problem if words are separated by other whitespace characters, like the tab character. Could you please check that?
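To illustrate the whitespace concern, here is a minimal Python sketch. It is not the plugin's actual code; `count_by_space` and `count_by_whitespace` are hypothetical helpers that compare splitting on a single space with splitting on any run of whitespace.

```python
def count_by_space(text):
    # Split on a single space only; tabs and newlines stay inside tokens.
    return len([token for token in text.split(" ") if token])


def count_by_whitespace(text):
    # str.split() with no argument splits on any run of whitespace.
    return len(text.split())


sample = "one\ttwo three\nfour"
print(count_by_space(sample))       # 2
print(count_by_whitespace(sample))  # 4
```

With a single-space split, "one\ttwo" and "three\nfour" are each treated as one token, which is the kind of undercount described above.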

Tito · Answer 5 · Wed Sep 18 2013 22:10:09 GMT+0800 (China Standard Time)

I think the problems described here are fixed. :-)

Pro App · Answer 6 · Thu Sep 19 2013 00:24:39 GMT+0800 (China Standard Time)

Have you considered the case where a sentence contains only &^% *^% #$^? It is still 3 valid words. I proposed a new pull request that solves this problem. Please have a look. Thank you.

Tito · Answer 7 · Thu Sep 19 2013 00:26:26 GMT+0800 (China Standard Time)

A real world example will help, &^% *^% #$^? is not in my dictionary. :-)

James Brooks · Answer 8 · Thu Sep 19 2013 00:38:11 GMT+0800 (China Standard Time)

I could totally understand adding a fix for this if it was SublimeText/CharacterCount but it's words. You can get the character count by selecting all of the text. WordCount should only ever count actual words.

Tito · Answer 9 · Thu Sep 19 2013 02:33:33 GMT+0800 (China Standard Time)

@jbrooksuk WordCount already counts characters too :-P hehe

Pro App · Answer 10 · Thu Sep 19 2013 03:41:46 GMT+0800 (China Standard Time)

For example, if there is a line of text like this:
{+-} + {/} = {+-/}

Will the code be able to capture it and display the number of words and characters? It seems to me that it is reasonable to report that it has 5 words, 20 characters. But the first thing is that we need to capture that line with a regular expression that accepts any sequence of characters, given there is a non-space character, as pointed out in the pull request I proposed:

Pref.wrdRx = re.compile("^.*\S+.*$", re.U)

What do you think?
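A rough Python sketch of what the proposed pattern would accept. The `count_tokens` helper is hypothetical, and splitting on whitespace before matching is an assumption about how the pattern would be applied, not something confirmed in this thread.

```python
import re

# Pattern as proposed above; a raw string avoids invalid-escape warnings.
proposed_rx = re.compile(r"^.*\S+.*$", re.U)


def count_tokens(line, rx=proposed_rx):
    # Hypothetical helper: split on whitespace, keep tokens the regex accepts.
    return sum(1 for token in line.split() if rx.match(token))


print(count_tokens("{+-} + {/} = {+-/}"))  # 5
print(count_tokens('"foo" (foo)'))         # 2
```

Because `\S+` only requires one non-space character, every whitespace-separated token is counted, whether or not it contains letters.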

Tito · Answer 11 · Thu Sep 19 2013 03:47:33 GMT+0800 (China Standard Time)

These are not words.

Pro App · Answer 12 · Thu Sep 19 2013 03:51:35 GMT+0800 (China Standard Time)

Some software, like Microsoft Word, does count it :). So that is something that needs to be considered.
By the way, even if they are not words, the plugin should still report the number of characters, shouldn't it?

Tito · Answer 13 · Thu Sep 19 2013 04:04:23 GMT+0800 (China Standard Time)

Don't know, aren't you using it? :-P

Pro App · Answer 14 · Thu Sep 19 2013 04:41:09 GMT+0800 (China Standard Time)

@titoBouzout: I dig into the code more than I use it :D
Another option is changing \S (any non-space character) to \w (a character from a-zA-Z0-9_), like this:

Pref.wrdRx = re.compile("^.*\w+.*$", re.U)
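For comparison, a small Python sketch of how the `\w` variant would behave. `count_tokens` is again a hypothetical helper, and the whitespace split is an assumption rather than the plugin's confirmed logic.

```python
import re

# The \w variant suggested above: a token needs at least one word character.
word_rx = re.compile(r"^.*\w+.*$", re.U)


def count_tokens(line, rx=word_rx):
    # Hypothetical helper: split on whitespace, keep tokens the regex accepts.
    return sum(1 for token in line.split() if rx.match(token))


print(count_tokens('"foo" (foo) {foo}'))    # 3
print(count_tokens("{+-} + {/} = {+-/}"))   # 0
print(count_tokens("&^% *^% #$^"))          # 0
```

Under this variant, quoted and parenthesised words are counted, while tokens made only of symbols are not.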

quondammelody · Answer 15 · Sun Dec 29 2013 04:29:57 GMT+0800 (China Standard Time)

This mostly works, but contractions like I'd, would've, don't, etc. count as zero words. One way to fix this would be to change the current regex from ^[^\w]?\w+[^\w]$ to ^[^\w]?(\w|')+[^\w]$, which counts contractions, as well as things like hack'n'slash, as one word (previously zero words).
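A hedged Python sketch of the contraction fix. It assumes the trailing class in both patterns was meant to be `[^\w]*` (as printed, without a quantifier, neither pattern would match a bare word such as foo); that reading is an assumption, and `count_words` is a hypothetical helper.

```python
import re

# Assumed forms of the two patterns, with "*" on the trailing class.
old_rx = re.compile(r"^[^\w]?\w+[^\w]*$", re.U)
new_rx = re.compile(r"^[^\w]?(\w|')+[^\w]*$", re.U)


def count_words(text, rx):
    # Hypothetical helper: split on whitespace, keep tokens the regex accepts.
    return sum(1 for token in text.split() if rx.match(token))


sample = "I'd would've don't hack'n'slash"
print(count_words(sample, old_rx))  # 0
print(count_words(sample, new_rx))  # 4
```

The old pattern fails on these tokens because the apostrophe ends the `\w+` run and word characters follow it; allowing `'` inside the run lets the whole token match.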

Tito · Answer 16 · Thu Jan 16 2014 06:57:56 GMT+0800 (China Standard Time)

Thanks @quondammelody fixed :)

Peter Poulsen · Answer 17 · Sat Mar 21 2015 17:50:35 GMT+0800 (China Standard Time)

As an extension to this issue: words that start with `` are not counted as words

Tito · Answer 18 · Wed Mar 25 2015 10:57:16 GMT+0800 (China Standard Time)

Which word starts with ``?

James Brooks · Answer 19 · Wed Mar 25 2015 19:15:01 GMT+0800 (China Standard Time)

@titoBouzout that's not the problem. It's when you've got a code block, say:

hello

Now hello doesn't count from what I understand.

Tito · Answer 20 · Wed Mar 25 2015 19:33:28 GMT+0800 (China Standard Time)

It counts


Peter Poulsen · Answer 21 · Wed Mar 25 2015 19:54:51 GMT+0800 (China Standard Time)

@titoBouzout I've encountered the problem in a LaTeX document where a quotation is done as: ``word''. This results in the first word in all quotations in LaTeX not being counted.
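A minimal Python sketch of why LaTeX-style quotes could be missed. Both patterns are assumptions: `single_lead_rx` is a guess at a pattern allowing at most one leading non-word character, and `multi_lead_rx` is a hypothetical relaxation, not the fix actually applied to the plugin.

```python
import re

# Assumed pattern: at most one leading non-word character is allowed.
single_lead_rx = re.compile(r"^[^\w]?(\w|')+[^\w]*$", re.U)
# Hypothetical relaxation: any number of leading non-word characters.
multi_lead_rx = re.compile(r"^[^\w]*[\w']+[^\w]*$", re.U)


def count_words(text, rx):
    # Hypothetical helper: split on whitespace, keep tokens the regex accepts.
    return sum(1 for token in text.split() if rx.match(token))


sample = "He said ``hello world'' twice"
print(count_words(sample, single_lead_rx))  # 4
print(count_words(sample, multi_lead_rx))   # 5
```

The token ``hello begins with two backticks, so a pattern that tolerates only one leading non-word character rejects it, which matches the symptom that the first word of each quotation goes uncounted.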

Tito · Answer 22 · Thu Mar 26 2015 13:05:01 GMT+0800 (China Standard Time)

Okay, I updated the regular expression. If you update, it should work now, maybe.