Silly values on web status interface during 3TB raid1 resync (overflow?)
GoogleCodeExporter opened this issue
What steps will reproduce the problem?
1. Make a dual-disk 3TB RAID 1.
2. It starts resyncing.
3. Look at the web interface.
What is the expected output? What do you see instead?
I get on the web interface:
RAID
Dev. Capacity Level State Status Action Done ETA
md0 2794.0 GB raid1 active OK resync 107% -10.7min
This corresponds (roughly) to /proc/mdstat:
$ cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sda2[1] sdb2[0]
2929740112 blocks super 1.2 [2/2] [UU]
[======>..............] resync = 30.2% (885531648/2929740112) finish=300.7min speed=113284K/sec
bitmap: 16/22 pages [64KB], 65536KB chunk
unused devices: <none>
$
What Alt-F version are you using? Have you flashed it?
Alt-F 0.1RC3 Flashed.
What is the box hardware revision level? A1, B1 or C1? (look at the label
at the box bottom)
N/A
What is your disk configuration? Standard, RAID (what level)...
Raid1
What operating system are you using on your computer? Using what browser?
Chrome linux.
Please provide any additional information below.
Original issue reported on code.google.com by brian.br...@gmail.com
on 1 May 2013 at 1:53
It is an arithmetic issue (big numbers with 3TB disks); probably the awk %d should
be replaced with %f.
The issue must be at /usr/www/cgi-bin/status.cgi, at around line 295 (where
$mdev is md0 in your case)
compl=$(drawbargraph $(awk '{printf "%d", $1 * 100 / $3}' /sys/block/$mdev/md/sync_completed))
speed=$(cat /sys/block/$mdev/md/sync_speed)
exp=$(awk '{printf "%.1fmin", ($3 - $1) * 512 / 1000 / '$speed' / 60}' /sys/block/$mdev/md/sync_completed 2> /dev/null)
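For illustration, the same arithmetic can be run standalone; the sync_completed and sync_speed values below are the ones posted later in this thread, with echo standing in for reading the sysfs file:

```shell
# Feed the script's own awk formulas sample values:
# sync_completed = "310542592 / 1564512928", sync_speed = 89338 K/sec
echo '310542592 / 1564512928' \
    | awk '{printf "%d%%  %.1fmin\n", $1 * 100 / $3, ($3 - $1) * 512 / 1000 / 89338 / 60}'
# prints: 19%  119.8min
```

The percentage and ETA come out plausible here because awk itself is not overflowing; the problem is the values being fed in.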
If it is still resyncing, can you please post the output of
cat /sys/block/md0/md/sync_completed
cat /sys/block/md0/md/sync_speed
Thanks
Original comment by whoami.j...@gmail.com
on 1 May 2013 at 2:46
Still going. The web interface currently says:
md0 2794.0 GB raid1 active OK resync 20% 142.2min
which I would think was OK, except /proc/mdstat sadly disagrees :-(
$ cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sda2[1] sdb2[0]
2929740112 blocks super 1.2 [2/2] [UU]
[===============>.....] resync = 78.6% (2303036672/2929740112) finish=108.9min speed=95859K/sec
bitmap: 6/22 pages [24KB], 65536KB chunk
unused devices: <none>
$ cat /sys/block/md0/md/sync_completed
310542592 / 1564512928
$ cat /sys/block/md0/md/sync_speed
89338
awk saying 20% is about right for those sync_completed numbers.
I also manually tried some big numbers in awk, and it doesn't seem to
overflow, so I think awk must already be using floating point or longs for
those calculations.
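That check is easy to reproduce: awk stores numbers as C doubles, so operands well past 2^31 divide without wrapping. Using the figures from the /proc/mdstat output above:

```shell
# 2303036672 and 2929740112 both exceed 2^31, yet the division is fine:
# this reproduces the 78.6% that /proc/mdstat itself reported.
awk 'BEGIN { printf "%.1f\n", 2303036672 * 100 / 2929740112 }'
# prints: 78.6
```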
So looks like we have a kernel overflow issue here...
Yeah, I just had a look at the kernel source (md.c, the sync_completed_show function).
It uses unsigned long in 2.6.25, and has since been fixed to use long long.
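The posted numbers bear this out. sync_completed counts 512-byte sectors, i.e. twice the 1K block count shown in /proc/mdstat, and on a 32-bit box (the DNS-323 is ARM) an unsigned long wraps at 2^32:

```shell
# 2929740112 blocks (from /proc/mdstat) = 5859480224 sectors; reduced
# modulo 2^32 this is exactly the bogus total sync_completed reports.
awk 'BEGIN {
    sectors = 2929740112 * 2
    printf "%d\n", sectors % 4294967296
}'
# prints: 1564512928
```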
It might be wise to change the web script to parse it out of /proc/mdstat instead!
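Something along those lines might look like this (a rough sketch, not from the shipped script; the sed pattern is my own and only handles a resync line like the one shown above):

```shell
# Extract the resync percentage from an mdstat-style progress line.
line='[======>..............]  resync = 30.2% (885531648/2929740112) finish=300.7min speed=113284K/sec'
pct=$(printf '%s\n' "$line" | sed -n 's/.*resync = *\([0-9.]*\)%.*/\1/p')
echo "$pct"
# prints: 30.2
```

In the real script the line would of course come from grepping /proc/mdstat for the $mdev entry rather than a literal string.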
Original comment by brian.br...@gmail.com
on 1 May 2013 at 6:12
/proc/mdstat contains information of a very different type and format; it is
difficult to parse.
I'm trying to port Alt-F to a more recent kernel, 3.8.11, and perhaps that will
solve the issue.
Original comment by whoami.j...@gmail.com
on 24 May 2013 at 11:48
From what I saw in the kernel source, it was definitely fixed by that
version.
Original comment by brian.br...@gmail.com
on 25 May 2013 at 11:31
I can confirm this on my recently flashed D-Link DNS-323 running Alt-F 0.1RC3. I
built a 2x3TB RAID 1 array and am seeing the same here. Currently:
RAID
Dev. Capacity Level State Status Action Done ETA
md0 2794.0 GB raid1 active OK resync 210% -152.4min
Original comment by crazymac...@gmail.com
on 28 Jun 2013 at 10:47
Original comment by whoami.j...@gmail.com
on 29 Jun 2013 at 4:27
- Changed state: Accepted
Same here, see my ticket on sourceforge for details:
https://sourceforge.net/p/alt-f/tickets/10/
RAID
Dev. Capacity Level State Status Action Done ETA
md0 2794.0 GB raid1 active OK resync 158% -6517.8min
How can I make sure this is a false positive and that the resync has actually
completed?
Stephane
Original comment by stephane...@gmail.com
on 2 Sep 2013 at 6:29
I tried the same commands on my box; our problems look similar:
$ cat /sys/block/md0/md/sync_completed
2564345088 / 1564512928
$ cat /sys/block/md0/md/sync_speed
150809
$ cat /proc/mdstat
Personalities : [linear] [raid1]
md0 : active raid1 sda2[1] sdb2[0]
2929740112 blocks super 1.2 [2/2] [UU]
[========>............] resync = 43.8% (1285270528/2929740112) finish=189.7min speed=144439K/sec
bitmap: 13/22 pages [52KB], 65536KB chunk
unused devices: <none>
Original comment by stephane...@gmail.com
on 2 Sep 2013 at 6:47
Don't worry, cat /proc/mdstat is telling the truth about what's happening; it's
only the other numbers used by the web interface that are overflowing. (We
found it was a since-fixed kernel bug.)
Original comment by brian.br...@gmail.com
on 2 Sep 2013 at 7:32
Yep I realized that. Thanks Brian!
Original comment by stephane...@gmail.com
on 2 Sep 2013 at 9:46