Dealing with corrupt git repositories

Question

dblandin opened this issue 10 years ago · comments

Occasionally, I'll get git output like the following during deploys:

From https://github.com/dscout/dscout
542f131..9fcc475 master -> origin/master
error: unable to read sha1 file of Gemfile (df8c83095564185f146e937a36f5a30267d747f7)
error: unable to read sha1 file of Gemfile.lock (de830d7489153e2e1d9d14708ca19a9e841b1c4d)
error: unable to read sha1 file of app/middleware/content_type_correction.rb (2fbfe4469a82722ee8f9bf3212c1282541603e92)
error: unable to read sha1 file of app/middleware/font_access_headers.rb (22d7334324f748d4585c225b7cbc6a2d8b098b23)
error: unable to read sha1 file of app/models/mission.rb (9b8d0a941bde2142a63dd2368025282638c6bc02)
error: unable to read sha1 file of app/models/snippet.rb (e11c908c36498141a298434050f15c90b7b4a8d5)
error: unable to read sha1 file of app/services/assignment_destroyer.rb (bbff88562509f5cb2c2f135bb74d6bb24afac3d9)
error: unable to read sha1 file of app/services/group_messenger.rb (f603f81c3811f2ed19b3180d015538c328c603eb)
error: unable to read sha1 file of config/application.rb (f82ae6f44d58fefb1e1fa7a9a4fa31b77f96f54b)
error: unable to read sha1 file of spec/factories/message_factory.rb (db36db83391e9ed057595c6d772e575309518871)
error: unable to read sha1 file of spec/requests/v2/messaging_spec.rb (2aea3ea57a1655f6eb294f904563dd05c8f1c315)
error: unable to read sha1 file of spec/services/assignment_destroyer_spec.rb (36a1c1095dd7f5b6927e6445edec5357ede59d60)
fatal: Could not reset index file to revision '9fcc475a5b510f8ee76bf945b96b99346e685b89'.
 % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 809 0 695 100 114 1301 213 --:--:-- --:--:-- --:--:-- 1299
100 1573 0 1459 100 114 2730 213 --:--:-- --:--:-- --:--:-- 2727

The deploy continues on, which usually doesn't make a big difference when using a Capistrano provider. I'm using Capistrano v3. We don't usually alter the Capistrano configuration, and the remote server will fetch the git repository itself.

But the repo in the working directory obviously doesn't get checked out to the right SHA. And for other providers, this will be more problematic (such as deploying straight to S3).

Has anyone run into this problem on their own setups? Any ideas towards fixing this problem?

I usually end up deleting the working directory which forces heaven to re-clone the repository during the next deploy.

I suppose another question is: should the deploy continue on if any task during a provider fails, or should the entire deploy fail immediately and skip any following tasks?

Corey Donohoe · Answer 1 · Fri Sep 05 2014 03:46:57 GMT+0800 (China Standard Time)

I think we should figure out what's causing the corruption. Are we screwing up the locking somehow? Is there some kind of race condition with fetching and some other integration?

Deleting it definitely feels suboptimal. I'd probably favor it failing rather than behaving in a possibly unintended fashion.

devon blandin · Answer 2 · Fri Sep 05 2014 03:52:34 GMT+0800 (China Standard Time)

It could be something odd with POSIX::Spawn::Child or perhaps there's a more reliable way of checking out the right code than:

execute_and_log(["git", "fetch"])
execute_and_log(["git", "reset", "--hard", sha])
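On the earlier question of whether a deploy should stop at the first failing task, here is a minimal sketch of a fail-fast variant of the two calls above. It assumes plain Ruby with the standard library's Open3; the names run_or_fail, CommandFailed, and fetch_and_reset are hypothetical and are not part of heaven.

```ruby
require "open3"

# Hypothetical error type: raised when a deploy step exits non-zero.
class CommandFailed < StandardError; end

# Runs a command given as an argv array and returns its combined
# stdout/stderr. Passing the command as separate arguments avoids
# shell interpolation. Raises CommandFailed on a non-zero exit so
# later steps never run against a half-updated checkout.
def run_or_fail(argv)
  output, status = Open3.capture2e(*argv)
  unless status.success?
    raise CommandFailed,
          "#{argv.join(' ')} failed (exit #{status.exitstatus}):\n#{output}"
  end
  output
end

# Hypothetical stand-in for the fetch/reset pair shown above.
def fetch_and_reset(sha)
  run_or_fail(["git", "fetch"])
  run_or_fail(["git", "reset", "--hard", sha])
end
```

With this shape, a failed git reset would raise before any later deploy task runs, rather than letting the deploy continue from the wrong revision.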

Corey Donohoe · Answer 3 · Fri Sep 05 2014 11:30:22 GMT+0800 (China Standard Time)

That approach has been solid for us for a few years but we don't use posix spawn.

devon blandin · Answer 4 · Fri Sep 05 2014 12:02:38 GMT+0800 (China Standard Time)

Isn't it used here?

https://github.com/atmos/heaven/blob/master/app/models/provider/capistrano.rb#L22

Then I might have to do some digging to see what's causing those git errors to come up.

Corey Donohoe · Answer 5 · Fri Sep 05 2014 12:12:31 GMT+0800 (China Standard Time)

Sorry, yeah. It is in use there. We've been using the same approach with capistrano run commands for years in our internal scripts.

devon blandin · Answer 6 · Fri Sep 05 2014 23:31:57 GMT+0800 (China Standard Time)

I'm at a loss. I know that it keeps happening. There don't seem to be any obvious issues with the fetch and reset approach.

Corey Donohoe · Answer 7 · Tue Sep 09 2014 05:03:04 GMT+0800 (China Standard Time)

Are you possibly getting multiple events close together, where the different commands executing are leaving things in a bad state? Normally git leaves a lock file around that's pretty easy to identify, though you don't seem to be getting that message.

devon blandin · Answer 8 · Sat Sep 13 2014 03:49:54 GMT+0800 (China Standard Time)

No, requests come in multiple times a day but rarely close together.

Corey Donohoe · Answer 9 · Sat Sep 13 2014 05:14:48 GMT+0800 (China Standard Time)

I'm sure you've already tried, but fsck the drive maybe?

Corey Donohoe · Answer 10 · Wed Oct 01 2014 00:45:13 GMT+0800 (China Standard Time)

@dblandin is this still happening to you?

devon blandin · Answer 11 · Wed Oct 01 2014 01:07:47 GMT+0800 (China Standard Time)

Last happened a few days ago:

------- stderr --------
From https://github.com/dscout/dscout
 ac60d35..4f2607e master -> origin/master
* [new branch] remove-news-from-dashboard -> origin/remove-news-from-dashboard
error: unable to read sha1 file of app/assets/javascripts/***.hamlbars (c17112f82f64f65a01cdfd4b654b407ca78c0ec7)
error: unable to read sha1 file of app/assets/javascripts/***.hamlbars (08cc22eccfcb6414b8ed20261c8871d1f9a10215)
error: unable to read sha1 file of app/assets/javascripts/***.hamlbars (e6b883b68b47db5b08b8210b2318e0c0bfe2373f)
error: unable to read sha1 file of app/assets/javascripts/***.hamlbars (71a7c0982fae608d3d57147d1fe549175d5626ff)
error: unable to read sha1 file of app/assets/javascripts/***.hamlbars (64ea0a90c1428db4d0168d7150a545f13a94b112)
error: unable to read sha1 file of app/assets/javascripts/***.hamlbars (bc189196cab256374799624adc0d1d28426a8069)
error: unable to read sha1 file of app/controllers/***.rb (5e6f115b8e0a568eb98b3013d9726036dbb93691)
error: unable to read sha1 file of app/controllers/***.rb (fa0e708f677b9dc3be6f955a72c94981f4341d9c)
error: unable to read sha1 file of app/queries/***.rb (89262e36fc54a7cedc3effb912f3bfa52ffd23de)
error: unable to read sha1 file of app/services/***.rb (4f5f64e95fea206778b03e24ee509e44f10b88e9)
error: unable to read sha1 file of config/environments/staging.rb (c619f7584471ab39e3bd7f07f48af346fa8fc9bb)
error: unable to read sha1 file of config/locales/en.yml (c2677210ef81f60634cbb09c656c54c1a0e224ca)
error: unable to read sha1 file of db/migrate/***.rb (90b2d2c210687788d0cdbe76c0601540af4e0c81)
error: unable to read sha1 file of db/schema.rb (6ba36c45096c5a22fff7e9afe94f1d5a0e095298)
error: unable to read sha1 file of lib/***.rb (7fdd5aa850c6a073fbacf287b1aa1e83fc479104)
error: unable to read sha1 file of lib/tasks/***.rake (d1443d53a39deac66f5eb5de91b0309610b29102)
error: unable to read sha1 file of spec/lib/***.rb (aa1ebc97324847fd58372856d3039aa7d56279d7)
error: unable to read sha1 file of spec/requests/***.rb (2abc06f8c39325af67a5af3f5d6ca7cde0bf62c5)
error: unable to read sha1 file of spec/requests/***.rb (358ac45080622396ed1e09e43db8fe3270bc96a3)
error: unable to read sha1 file of spec/requests/***.rb (9e23b87258892268521c704ad33268acf20a0232)
fatal: Could not reset index file to revision '4f2607e59119ae3385410bc411ee51a8235ca9f8'.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 803 0 689 100 114 2493 412 --:--:-- --:--:-- --:--:-- 2487
100 1570 0 1456 100 114 5256 411 --:--:-- --:--:-- --:--:-- 5256

fsck didn't reveal any problems either:

** /dev/rdisk2
** Root file system
   Executing fsck_hfs (version hfs-226.1.1).
** Verifying volume when it is mounted with write access.
** Checking Journaled HFS Plus volume.
   The volume name is Macintosh HD
** Checking extents overflow file.
** Checking catalog file.
** Checking multi-linked files.
** Checking catalog hierarchy.
** Checking extended attributes file.
** Checking volume bitmap.
** Checking volume information.
** The volume Macintosh HD appears to be OK.

devon blandin · Answer 12 · Thu Nov 13 2014 01:39:29 GMT+0800 (China Standard Time)

After looking at how Capistrano handles fetching/resetting, I've modified my providers to use the following checkout and sync methods:

def checkout(revision)
  unless File.exist?(checkout_directory)
    log "Cloning #{repository_url} into #{checkout_directory}"
    execute_and_log(["git", "clone", clone_url, checkout_directory])
  end
end

def sync(revision)
  Dir.chdir(checkout_directory) do
    log "Fetching the latest code"
    execute_and_log(["git", "config", "remote.origin.url", clone_url])
    execute_and_log(["git", "config", "remote.origin.fetch", "+refs/heads/*:refs/remotes/origin/*"])
    execute_and_log(["git", "fetch", "origin"])
    execute_and_log(["git", "fetch", "--tags", "origin"])
    execute_and_log(["git", "checkout", "--force", "-B", "deploy", sha])
  end
end

Hopefully this lessens the frequency of the git object errors that have crept up occasionally.

devon blandin · Answer 13 · Tue Nov 18 2014 06:40:10 GMT+0800 (China Standard Time)

Still running into this issue unfortunately. Current working theory is that the working directory gets into this state most often after a git rebase and git push --force...

$ git checkout --force -B deploy 6c77800512c5ca4ffb8016a911f9ed9e981a571b
error: unable to read sha1 file of app/components/snippet/snippet_preview.coffee (1400ba1080fd0877ea686210d0908fcafe6156ea)
error: unable to read sha1 file of app/components/submission/submission_preview.coffee (32a3c6854b5aeb7e412e545afc10eda1d785113a)
error: unable to read sha1 file of app/routes.coffee (d4af8c200c4b580b17b57b59d13b069f09e7ed6e)
error: unable to read sha1 file of test/collections/submissions_test.coffee (9051e40707a86bdd852231b18bbc04d62c91c4d9)

$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (16748/16748), done.
broken link from    tree db5db7d20bcbdf3b3aef22e8e9d7e08033aec0c9
              to    blob ff293e66a327287a417aa084fcab01a01b07c298
broken link from    tree 52cba6ae218747458e036e3401d6d18c7920d4f9
              to    blob 1400ba1080fd0877ea686210d0908fcafe6156ea
broken link from  commit d3e3e3105caca444b9b68ebf2fe3ca9de90dffc7
              to    tree 674d2d97a6be816365c6f110c00a38e87dd6d49f
broken link from  commit 0ba98c4c73a7116d4c7ad52cc9cf8fe36131cf04
              to    tree 02c1df7bc77b9972f3366e4504c78fb63482cbfe
broken link from  commit 8b46f850040c5b36f322ba676ca90a18bef9cfbe
              to    tree c8e02007a9993a2da29745d7a9fe978693d6ae51
broken link from  commit 9558069ae80b512186a6223a5152c5821fddd18b
              to    tree 88b9a805b7506d3f18679d6ddc404219fa61692a
broken link from  commit 611e83db1a06ccc754357bce9015efdf519642e1
              to    tree d931545631f9ded7323b48c7cb1704eb18ffa2ec
broken link from  commit f83ff5902472232c8f23dcd2a908694382e2455a
              to    tree 1a2efda6a2bd9b20feb2cdb8f1067a5cf2342f7b
broken link from  commit 2ce2e81ea071eb5924568b2b80ce21e17a59b13e
              to    tree 7e4b8fe925ed7e5002b117abc60ebbbcf594d063
broken link from  commit d8a84efcdccb7cdec6ab5ef14d73529b30850d41
              to    tree 5d4d31aefb8c952c013c378a8ca3b313c8715274
broken link from  commit c2816871c8b7ce37d33503a3f5d3df13db3bdfe9
              to    tree 02ed2203dbcb334bc2e0c72365d332c1b0c519b7
broken link from  commit 2b746a0f83b81d980ddc8ea1114fcbf2a5342ea1
              to    tree 61a6b791ef0efdd828a20bbff88d2b71408fa066
broken link from  commit c0ff53ca3bcfa237ac2b141ef148820aa50da4c5
              to    tree 0b9f97eff08c5efbf7e745c10a7d855a418d402b
broken link from  commit 81b7ba7907acd9a53f3299aadc61e3ea1586ccb6
              to    tree 03e7f4dee6dd6e574437bd12e16505af3631ea66
broken link from  commit 13797ac47a27a3fe24540968df288dab3769ac0c
              to    tree 75a5f69cd37bd2b7f78027f02a4dc64f3e1acb3b
broken link from  commit b8f1e3b3262f7261c6855489ce17cbf18425cc83
              to    tree 225758d079291ec8e83188b63ecf9116a470dff9
missing blob 1400ba1080fd0877ea686210d0908fcafe6156ea
missing blob bc000f1d7b8269baba72de1a0492a774c92409f8
missing blob 7717f5815c80f6bc3b388cc9aaad22a3b73553ba
missing blob 0a1f05a786c2892be353639f7aec7f0a7b3889e3
missing blob ff293e66a327287a417aa084fcab01a01b07c298
missing tree 1a2efda6a2bd9b20feb2cdb8f1067a5cf2342f7b
missing blob be2f861639f98a35ed0c5b86a89e98cbd43a6982
missing tree d931545631f9ded7323b48c7cb1704eb18ffa2ec
missing blob b238585846d48cf2dd11f5e67b478ca2667c8b7a
missing tree 7e4b8fe925ed7e5002b117abc60ebbbcf594d063
missing tree 5d4d31aefb8c952c013c378a8ca3b313c8715274
missing tree 674d2d97a6be816365c6f110c00a38e87dd6d49f
missing blob 9051e40707a86bdd852231b18bbc04d62c91c4d9
missing tree 225758d079291ec8e83188b63ecf9116a470dff9
missing tree 0b9f97eff08c5efbf7e745c10a7d855a418d402b
missing blob 32a3c6854b5aeb7e412e545afc10eda1d785113a
missing tree 75a5f69cd37bd2b7f78027f02a4dc64f3e1acb3b
missing tree 61a6b791ef0efdd828a20bbff88d2b71408fa066
missing blob d4af8c200c4b580b17b57b59d13b069f09e7ed6e
missing tree 88b9a805b7506d3f18679d6ddc404219fa61692a
missing tree 02c1df7bc77b9972f3366e4504c78fb63482cbfe
dangling commit d7c2f5d5b6e0417235f73fccb7b61613573b0b94
missing tree c8e02007a9993a2da29745d7a9fe978693d6ae51
missing tree 03e7f4dee6dd6e574437bd12e16505af3631ea66
missing tree 02ed2203dbcb334bc2e0c72365d332c1b0c519b7
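For anyone who wants to detect this state before a deploy instead of after, here is a rough sketch that scans git fsck output for "missing" lines like the ones above. It assumes the output format shown in this comment; missing_objects and repository_corrupt? are hypothetical helper names, not part of heaven or git.

```ruby
require "open3"

# Extracts the objects that `git fsck` reports as missing.
# Only lines shaped like "missing blob <40-hex sha>" are matched;
# "dangling" and "broken link" lines are ignored.
def missing_objects(fsck_output)
  fsck_output.each_line.map do |line|
    match = line.match(/\Amissing (blob|tree|commit|tag) ([0-9a-f]{40})\s*\z/)
    { type: match[1], sha: match[2] } if match
  end.compact
end

# Hypothetical check: runs `git fsck` inside dir and reports whether
# any objects are missing. Assumes git is on the PATH.
def repository_corrupt?(dir)
  output, _status = Open3.capture2e("git", "fsck", chdir: dir)
  !missing_objects(output).empty?
end
```

A provider could call repository_corrupt?(checkout_directory) before syncing and fall back to a fresh clone when it returns true. That is only a workaround; it does not explain why the objects go missing.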

Corey Donohoe · Answer 14 · Tue Nov 18 2014 10:11:39 GMT+0800 (China Standard Time)

Do you guys rebase/force that often? We just do merges at work.

devon blandin · Answer 15 · Tue Nov 18 2014 10:46:06 GMT+0800 (China Standard Time)

Never directly on master, but we'll occasionally rebase branches off of master and force push the updated branch.

devon blandin · Answer 16 · Wed Nov 26 2014 02:41:33 GMT+0800 (China Standard Time)

I'm leaning towards using a clean temporary directory for each deploy to get around this problem. It has become a significant pain at work. I don't see us changing the way we're rebasing anytime soon.

Corey Donohoe · Answer 17 · Wed Nov 26 2014 03:58:59 GMT+0800 (China Standard Time)

That's probably a good idea. rsync is usually a good option if you can keep the original copy nice and clean.

devon blandin · Answer 18 · Thu Nov 27 2014 02:39:38 GMT+0800 (China Standard Time)

Seems pretty stable so far:

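require 'fileutils'
require 'tmpdir'
# A module to include for easy access to writing to a transient filesystem
module LocalLogFile
  # Lazily creates one temporary directory per provider instance
  def working_directory
    @working_directory ||= Dir.mktmpdir
  end

  # Removes the temporary directory and everything in it
  def cleanup_working_directory
    FileUtils.rm_r(working_directory)
  end
end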

class DefaultProvider
  ...
  def run!
    Timeout.timeout(timeout) do
      setup
      execute unless Rails.env.test?
      notify
      record
    end
  rescue StandardError => e
    Rails.logger.info e.message
    Rails.logger.info e.backtrace
  ensure
    update_output
    cleanup_working_directory
    status.failure! unless completed?
  end
end
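As an alternative shape for the same idea, Dir.mktmpdir also has a block form that removes the directory when the block exits, including when it raises. A minimal sketch, assuming the deploy work can run inside a block; with_clean_checkout_dir and its "deploy-" prefix are hypothetical and are not how heaven is structured.

```ruby
require "tmpdir"

# Yields a fresh, empty temporary directory and returns the block's
# value. Dir.mktmpdir's block form deletes the directory afterwards,
# even if the block raises, so no separate cleanup step is needed.
def with_clean_checkout_dir(prefix = "deploy-")
  Dir.mktmpdir(prefix) do |dir|
    yield dir
  end
end
```

Usage would look something like this:

```ruby
with_clean_checkout_dir do |dir|
  # Clone into dir and run the deploy from there.
  puts "deploying from #{dir}"
end
```

The trade-off is a full clone on every deploy, which costs time and bandwidth but never reuses a possibly damaged object store.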

devon blandin · Answer 19 · Tue Feb 17 2015 05:22:35 GMT+0800 (China Standard Time)

Closing this issue for now. Wasn't able to figure out the root cause or a fix for it but starting from a fresh working directory for each deployment seems to work well.

Hopefully no one else encounters the same problem. Happy to submit a patch if this comes up for anyone else.

Jingfei Hu · Answer 20 · Thu Dec 17 2015 13:34:30 GMT+0800 (China Standard Time)

It's a pity that I've come across this problem now. It seems there's no fix for it, right?

devon blandin · Answer 21 · Fri Dec 18 2015 02:39:55 GMT+0800 (China Standard Time)

@Live2Learn Unfortunately I didn't come up with a better solution for this issue.

Here's the commit where I set up a temp directory during deploys: https://github.com/dscout/heaven/commit/932500542745719cb460a0727cf0a3657dc8a7d9