system hang with rsync

Vincent Stemen vince.dragonfly at hightek.org
Mon Feb 18 15:56:35 PST 2008


Hello.

Doing a backup to our DragonFly file server, using rsync, hangs the
server under certain conditions.  

It only happens when I use the *--delete* option to rsync and seems to
occur only in directories with large multi-GB files.  The directory that
consistently reproduces it is my TV recording directory, where I have
multiple files ranging from 1 to 4 GB.  The directory on the server
totals about 67 GB and contains old files that need to be deleted.  The
directory on the client machine is currently about 79 GB.

While the server is hung, I can still ping it and switch virtual
consoles but, other than that, all consoles are just frozen.  While it
is frozen, if I hit <cntl>c I see the '^C' on the screen but get no
other response.

When I run *top* in another console, it also freezes and stops
updating, always showing zero or very little process load.  On this last
test, after I killed rsync on the client side with <cntl>c, *top* briefly
updated after 3 minutes, then stayed frozen for 3.5 more minutes.  After
a total of 6.5 minutes, the server came alive again.

It does not seem to be related to ssh because I ran the rsync daemon on
the server and ran the same test without ssh and got the same results.
Here is the output of the last test on the client side.

  $ rsync -HOav -x --delete . alexandria::tv/recordings
  /tv/recordings
  building file list ... done
  deleting The Universe (Jupiter: The Giant Planet).info

Hangs here.  I waited a while and finally hit <cntl>c.

  ^Crsync error: received SIGUSR1 or SIGINT (code 20) at rsync.c(163)

The .info file and the .avi file were both gone on the server after
that, but I am not sure whether the .avi file was deleted during one of
the other tests.

I did get a couple of entries in /var/log/messages on the server with the
following error when I ran it with the rsync daemon:

  rsyncd[3933]: rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]

I did not get that error when running over ssh, so I don't know whether
it is related to the freezing problem.

On the server, I replaced the recording directory with one that had
a subset of the files, about 5 recordings that needed to be deleted, and
it worked fine.  Then I populated a fresh directory with hard links to
all the files in the original directory, and it still worked fine.

i.e.
    mv recordings recordings.bak
    mkdir recordings
    ln recordings.bak/* recordings

So far as I can tell, it only happens when the directory holds more than
a certain amount of data and rsync has to actually delete the files, not
just unlink a secondary hard link.
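
The two cases can be separated without rsync.  What follows is only a
minimal sketch, not a confirmed reproduction of the hang: the file size
(SIZE_MB), the file names, and the temporary directory are assumptions
chosen for illustration.  It removes a secondary hard link first, which
only drops a directory entry, and then times the removal of the last
link, which is when the file system has to free the data blocks.

```shell
#!/bin/sh
# Sketch: compare unlinking a secondary hard link with removing the last link.
# SIZE_MB, the file names, and the temp directory are illustrative assumptions.
set -e

SIZE_MB=${SIZE_MB:-8}     # raise to several thousand to approximate a multi-GB file
dir=$(mktemp -d "${TMPDIR:-/tmp}/unlinktest.XXXXXX")

# Create the test file and a second hard link to it.
dd if=/dev/zero of="$dir/big" bs=1048576 count="$SIZE_MB" 2>/dev/null
ln "$dir/big" "$dir/big.link"
links_before=$(ls -ld "$dir/big" | awk '{print $2}')

# Case 1: remove the secondary link; the data blocks stay allocated.
rm "$dir/big.link"
links_after=$(ls -ld "$dir/big" | awk '{print $2}')

# Case 2: remove the last link; the blocks must now be freed.
start=$(date +%s)
rm "$dir/big"
end=$(date +%s)
elapsed=$((end - start))

echo "link count before: $links_before, after: $links_after"
echo "final rm took about ${elapsed}s"
rmdir "$dir"
```

If the final rm alone stalls the machine with a large SIZE_MB, that would
point at the file removal itself rather than at rsync.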

I have been able to back up just about every other directory on our
client machines without any problems.

Does anybody have any theories about what might be happening?

The client machine is a NetBSD machine.
The DragonFly server is running version 1.10.1-RELEASE.  I also
tested with a 1.11.0-DEVELOPMENT kernel and got the same results.






