Upgrade 3.62 -> 3.63 - sorted needle write error: write file.sdx: bad file descriptor
SystemZ opened this issue · comments
I've seen data loss bugs in versions before 3.63, so I upgraded, but there may still be data loss; it's hard to tell.
I expected either no write errors, or an easy way to rewrite volumes to fix the problem, but I'm stuck 😢
I use official docker images for all my nodes.
I upgraded my volume servers by setting the container's entrypoint to /bin/sleep 90000 and using the 3.63 image.
This way the volume server didn't start before I ran weed fix inside the container shell, if I understood the instructions in #5348 correctly.
After weed fix, I started the volume server with the newer image.
I repeated this for all 4 of my volume servers, one by one.
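For reference, the per-node procedure described above can be sketched roughly like this. The image tag, container name, mount paths, master address, and the `weed fix` flags are all assumptions for illustration; adjust them to your deployment and check `weed fix -h` for your version:

```shell
# 1. Recreate the volume server container with a no-op entrypoint so it
#    sleeps instead of serving before the index is repaired.
docker run -d --name seaweedfs-volume \
  -v /docker/containers/seaweedfs/volume:/data \
  --entrypoint /bin/sleep \
  chrislusf/seaweedfs:3.63 90000

# 2. Rebuild the index inside the container shell (weed fix regenerates
#    the .idx from the .dat file). Flags assumed; verify with weed fix -h.
docker exec -it seaweedfs-volume weed fix -dir /data -volumeId 55

# 3. Remove the override and start the volume server normally on 3.63.
docker rm -f seaweedfs-volume
docker run -d --name seaweedfs-volume \
  -v /docker/containers/seaweedfs/volume:/data \
  chrislusf/seaweedfs:3.63 volume -dir /data -mserver master:9333
```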
I noticed that during volume server start, some volumes are broken
W0312 17:58:14.860120 volume_checking.go:121 data file /data/depo_93069.dat actual 10910368424 bytes expected 10908271200 bytes!
I tried restarting the volume servers and the master; no change.
I tried volume.fsck, but it didn't solve the problem.
volume.fsck -reallyDeleteFromVolume -verifyNeedles -forcePurging -collection mastodon -volumeId 74
dataNode:192.168.2.2:8072 volume:74 entries:27811 orphan:1 0.00% 14928B
temporarily marked 74 on server 192.168.2.2:8072 writable for forced purge
marked 74 on server 192.168.2.2:8072 writable for forced purge
purging orphan data for volume 74...
error: findExtraChunksInVolumeServers: purging volume 74: delete fileId 74,e183c00000000: sorted needle write error: write /data/mastodon_74.sdx: bad file descriptor
Any idea how to fix this, preferably without data loss?
I think the problematic volumes like this one had pretty heavy churn (lots of data trashed in and out).
"some volumes are broken": do they fail to load, or just become read-only?
3.63 removed incorrect logic that automatically "fixed" volume data; the errors now need to be fixed manually.
There should not be any data loss.
W0312 17:58:14.860120 volume_checking.go:121 data file /data/depo_93069.dat actual 10910368424 bytes expected 10908271200 bytes!
For this particular warning, you can truncate the file /data/depo_93069.dat to 10908271200 bytes:
Those volumes are read-only; they seem to load.
How can I truncate the file?
truncate -s [number of bytes] filename
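As a sanity check, this is all `truncate` does, demonstrated here on a throwaway file rather than a real volume (on a real volume you would pass the expected byte count from the warning, e.g. 10908271200 for /data/depo_93069.dat, and only while the volume server is stopped):

```shell
# Demonstration on a scratch file; never truncate a .dat file that a
# running volume server still has open.
f=$(mktemp)
truncate -s 4096 "$f"      # simulate a .dat file that grew too large
truncate -s 1024 "$f"      # shrink it to the "expected" size from the log
size=$(stat -c %s "$f")
echo "$size"               # prints 1024
rm -f "$f"
```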
I still have problems with this :/
Case 1
Before truncating, the volume server log showed this at start:
W0313 05:54:23.722144 volume_checking.go:121 data file /data/depo_55.dat actual 11451781248 bytes expected 11449684024 bytes!
After truncating, there is a needle verify error during fsck:
truncate -s 11449684024 /data/depo_55.dat
# volume server restart, it loaded without errors in log about volume 55
> volume.fsck -collection depo -verifyNeedles -volumeId 55
total 5920 directories, 25770 files
failed to read 55:157934 needle status of file REDACTED: rpc error: code = Unknown desc = EOF
Total entries:5479 orphan:0 0.00% 0B
This could be normal if multiple filers or no filers are used.
no orphan data
and an error in the volume server log:
E0313 05:59:54.112599 needle_read.go:45 /data/depo_55.dat read 0 dataSize 2097224 offset 11449684024 fileSize 11449684024: EOF
Case 2
Replication 010, different sizes on the two hosts.
strawberry uses EXT4, nas uses XFS.
I haven't modified it yet.
I0313 06:52:01.357663 volume_loading.go:142 loading memory index /data/danbooru_4.idx to memory
I0313 06:52:01.360191 disk_location.go:182 data file /data/danbooru_4.dat, replication=010 v=3 size=1754610808 ttl=
root@strawberry:~# ls -al /docker/containers/seaweedfs/volume/danbooru_4.dat
-rw-r--r-- 1 root root 1754610808 Mar 3 16:33 /docker/containers/seaweedfs/volume/danbooru_4.dat
root@strawberry:~# ls -ls --block-size=1k /docker/containers/seaweedfs/volume/danbooru_4.dat
10485764 -rw-r--r-- 1 root root 1713488 Mar 3 16:33 /docker/containers/seaweedfs/volume/danbooru_4.dat
W0313 06:17:06.842800 volume_checking.go:121 data file /data/danbooru_4.dat actual 1754610800 bytes expected 1754340568 bytes!
I0313 06:17:06.843036 volume_loading.go:128 volumeDataIntegrityChecking failed data file /data/danbooru_4.dat actual 1754610800 bytes expected 1754340568 bytes
root@nas:~# ls -al /mnt/user/seaweedfs/volume/danbooru_4.dat
-rw-r--r-- 1 root root 1754610800 Mar 3 16:33 /mnt/user/seaweedfs/volume/danbooru_4.dat
root@nas:~# ls -ls --block-size=1k /mnt/user/seaweedfs/volume/danbooru_4.dat
10485760 -rw-r--r-- 1 root root 1713488 Mar 3 16:33 /mnt/user/seaweedfs/volume/danbooru_4.dat
I had the same problem and weed compact solved it.
For case 1, the truncate did not work well. You can use weed fix, since the .dat file is correct.
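A minimal sketch of that suggestion, run with the volume server stopped; the flags are assumed from the weed fix help text, so verify against your version:

```shell
# Rebuild the .idx from the (intact) .dat file for the affected volume.
# -dir, -collection and -volumeId are assumed flag names; check weed fix -h.
weed fix -dir /data -collection depo -volumeId 55
```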
Case 1
Seems OK after running weed fix again after the version upgrade.
I had unmarked it as read-only earlier; not sure if that somehow changed anything.
[admin@MikroTik Core] > /log/print where message~"depo_55"
07:24:23 container,info,debug I0315 06:24:23.995707 volume_loading.go:142 loading memory index /data/depo_55.idx to memory
07:24:23 container,info,debug I0315 06:24:23.997279 disk_location.go:182 data file /data/depo_55.dat, replication=000 v=3 size=11449684024 ttl=
> volume.fsck -collection depo -verifyNeedles -volumeId 55
total 5920 directories, 25770 files
Total entries:5479 orphan:0 0.00% 0B
This could be normal if multiple filers or no filers are used.
no orphan data
Case 3
Similar to cases 1 and 2, but it's on a volume server that's easier for me to work on.
I tried compact but it didn't help.
Note: the time inside the container is GMT+0, while on the host where weed compact is executed it is GMT+1.
root@nas:/mnt/cache/ssd/seaweed# ./weed363 compact -volumeId 93069 -collection depo -dir /mnt/cache/ssd/seaweed/volume/
I0315 07:43:19.386159 volume_loading.go:91 readSuperBlock volume 93069 version 3
W0315 07:43:19.386292 volume_checking.go:121 data file /mnt/cache/ssd/seaweed/volume/depo_93069.dat actual 10910368424 bytes expected 10908271200 bytes!
I0315 07:43:19.386358 volume_loading.go:128 volumeDataIntegrityChecking failed data file /mnt/cache/ssd/seaweed/volume/depo_93069.dat actual 10910368424 bytes expected 10908271200 bytes
I0315 07:43:19.387426 volume_loading.go:91 readSuperBlock volume 93069 version 3
docker logs -f seaweedfs-volume-nvme |& grep depo_93069
W0315 06:44:09.712158 volume_checking.go:121 data file /data/depo_93069.dat actual 10910368424 bytes expected 10908271200 bytes!
I0315 06:44:09.712168 volume_loading.go:128 volumeDataIntegrityChecking failed data file /data/depo_93069.dat actual 10910368424 bytes expected 10908271200 bytes
I0315 06:44:09.713530 disk_location.go:182 data file /data/depo_93069.dat, replication=000 v=3 size=10910368424 ttl=
CLI docs
The compact default method is 0, but it's not listed in the help.
seaweedfs/weed/command/compact.go
Line 31 in 54ee732
I created a PR to fix it:
#5379
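For reference, the method can be selected explicitly; the `-method` flag name comes from the compact.go source referenced above, and its semantics weren't documented in the help at the time:

```shell
# Explicitly request the default compaction method (0); see
# weed/command/compact.go for what each method value does.
./weed363 compact -dir /mnt/cache/ssd/seaweed/volume/ -collection depo -volumeId 93069 -method 0
```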
Questions
Is there any way to list the files located on a given volume ID via weed shell?
I'd like to check whether the files on the problematic volumes are OK, but I need to locate them first.
I was able to move the files out of all my SeaweedFS collections.
All of my collections except one weren't frequently changed; only one ~20 GB file was lost there, due to I/O problems in the logs.
The last remaining collection, a very active one (frequent creates and deletes), hit multiple I/O problems and probably lost multiple small files.
I'm guessing that tailing a volume while a vacuum ran was the problem in 3.62.
I don't use SeaweedFS anymore; I migrated the one small remaining bucket to MinIO, so I'm marking this as closed.