Accelerating the transfer of small files with rsync and xargs

Rsync is a great tool for transferring and synchronizing files between computers and servers. It is available on most popular Linux distributions, and if it is not installed yet, you can typically install it with your package manager.

Rsync’s drawback lies in its sequential transfer over a single remote connection, which results in lengthy transfer times for large numbers of small files.

While it’s possible to experiment with rsync arguments to improve transfer speed, the single connection and sequential process remain limiting factors.

One way to work around this limitation is to combine rsync with another command called xargs.

Xargs reads items from standard input and executes a given command one or more times, appending the items it reads to any initial arguments.

Piping the list of files to be transferred into xargs lets xargs launch multiple instances of rsync concurrently, effectively running the transfers in parallel.
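
To illustrate the mechanics before involving rsync, here is a minimal sketch that only uses echo (the file names are made up for the example): each input line replaces the % placeholder, and -P allows several invocations to run at the same time.

printf '%s\n' file1.txt file2.txt file3.txt | xargs -P3 -I% echo "transferring %"

Swap echo for rsync and you have the core of the approach used below.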

The task can be divided into two steps:

  • Retrieve the list of files that need to be transferred
  • Output the list of files and use it as input for xargs to call rsync.

Below, we present an example of a bash script that demonstrates the concept.

#!/bin/bash

# Remote server information
REMOTE_USER="myuser"
REMOTE_SERVER="192.168.10.33"
REMOTE_DIR="/opt/myfiles/"

# Local server
LOCAL_DIR="/opt/local-backup"

# Number of processes for xargs
NUM_PROC=10

# Create a unique temporary file
TEMP_FILE=$(mktemp)

# Get a list of remote files and save it locally
ssh ${REMOTE_USER}@${REMOTE_SERVER} "ls -1 ${REMOTE_DIR}/" > $TEMP_FILE

# Feed the list of remote files to xargs
cat $TEMP_FILE | xargs -n1 -P${NUM_PROC} -I% rsync -avz --progress \
  ${REMOTE_USER}@${REMOTE_SERVER}:${REMOTE_DIR}/% $LOCAL_DIR/

# Remove the temporary file
rm $TEMP_FILE

Let’s break down each line and explain what it does:

  • REMOTE_USER="myuser"
    • The user allowed to access the remote server
  • REMOTE_SERVER="192.168.10.33"
    • The remote server we are trying to access
  • REMOTE_DIR="/opt/myfiles/"
    • The directory on the remote server we are trying to access
  • LOCAL_DIR="/opt/local-backup"
    • This variable represents the local directory where the files will be saved
  • NUM_PROC=10
    • This variable defines how many processes xargs should start when processing input. With a value of 10, xargs will keep up to 10 rsync processes running at a time while feeding them input.
    • You have to be careful and try to gauge a good number of processes; a higher number might actually slow things down. A simple way to compare settings with the time command is shown after this list.
  • TEMP_FILE=$(mktemp)
    • This variable uses the mktemp command to generate a unique temporary file
  • ssh ${REMOTE_USER}@${REMOTE_SERVER} "ls -1 ${REMOTE_DIR}/" > $TEMP_FILE
    • This command uses ssh to get a listing of all the files in the remote directory and saves it to the temporary file
  • cat $TEMP_FILE | xargs -n1 -P${NUM_PROC} -I% rsync -avz --progress ${REMOTE_USER}@${REMOTE_SERVER}:${REMOTE_DIR}/% $LOCAL_DIR/
    • The first part of the command uses cat to output the list of remote files
    • The second part of the command tells xargs what to do with the input
      • -n1 : use at most one argument per command invocation
      • -P${NUM_PROC} : run up to ${NUM_PROC} (here 10) processes at a time
      • -I% : replace occurrences of "%" with the names read from standard input
    • The last part is the rsync command to execute for each file.
  • rm $TEMP_FILE
    • Cleaning up by removing the temporary file
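
To gauge a good value for NUM_PROC, one simple approach is to edit the value in the script, run it against a test directory, and compare wall-clock times. The script name below is just a placeholder for whatever you saved the script as:

# Try NUM_PROC=5, 10, 20, ... in the script and compare the timings
chmod +x parallel-rsync.sh
time ./parallel-rsync.sh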

It’s not a perfect solution, but it’s a quick one that relies on commands most Linux machines already have available. One possible alternative is to use the GNU parallel command to accomplish a similar task, as sketched below.
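
Assuming GNU parallel is installed, a roughly equivalent one-liner might look like the following sketch, reusing the variables from the script above; parallel substitutes each input line for {} and -j controls how many jobs run concurrently:

cat $TEMP_FILE | parallel -j ${NUM_PROC} rsync -avz ${REMOTE_USER}@${REMOTE_SERVER}:${REMOTE_DIR}/{} $LOCAL_DIR/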

I hope this post was useful to someone.


