Archiving a large WordPress uploads folder

Whilst trying to create an archive (.tar.gz) of a WordPress site's uploads folder I kept hitting issues with the command getting cut off for various reasons (poor connection, SSH host disconnecting, etc.). I also found that as the archive grew larger it took longer and longer to add files to it, making it more likely that something would interrupt the process.

To combat this I needed a way to create the archive in small chunks, so that if it was interrupted I could easily see what the last chunk was and resume. I ended up with this bash script:

#!/bin/bash
#
# This script compresses the files in the uploads directory into one archive per
# year/month folder, excluding any files that contain a resolution in the filename.
#
# Usage: ./compress-uploads.sh -s 2020 -e 2021
#
# The -s and -e flags are optional. If they are not provided, the script defaults to
# compressing all files from 10 years ago to the current year. If they are provided,
# the script compresses all files from the start year to the end year, inclusive.
#
# Author: Philip John <phil@philipjohn.me.uk>
# Author URI: https://philipjohn.me.uk
# License: MIT
# License URI: https://opensource.org/licenses/MIT
# Version: 1.0.0

# Get the start and end year from the command line arguments.
while getopts s:e: flag
do
	case "${flag}" in
		s) START=${OPTARG};;
		e) END=${OPTARG};;
	esac
done

# Set the default start and end years to use.
DEFAULT_START_YEAR=$(date -d "10 years ago" +%Y) # 10 years ago (needs GNU date)
DEFAULT_END_YEAR=$(date +%Y) # This year

# Set the start and end year to the default values if they were not provided.
START_YEAR="${START:=$DEFAULT_START_YEAR}"
END_YEAR="${END:=$DEFAULT_END_YEAR}"

# Let's gooooooo!
for year in `seq $START_YEAR $END_YEAR`
do
	for month in `seq -s " " -w 01 12`
	do
		# Skip months that have no uploads folder.
		[ -d "uploads/$year/$month" ] || continue

		# Tell the user what we're doing.
		echo "Compressing all $year/$month upload files"

		# Find looks for all files in the uploads/year/month/ directory that do not contain a
		# resolution (e.g. 150x150) in the filename. The -print0 flag is used to handle filenames
		# with spaces. The tar command then appends the files to that month's .tar archive.
		mkdir -p uploads-archives
		find "uploads/$year/$month/" -type f ! -name '*[0-9]x[0-9]*.*' -print0 | tar -rvf "uploads-archives/uploads-$year-$month.tar" --null -T -

		# Compress the archive.
		gzip "uploads-archives/uploads-$year-$month.tar"
	done
done

Intended for larger sites, it looks for 10 years’ worth of year/month upload folders and archives each one in turn, so an interrupted run only needs to redo the month it was working on.

Individual month folders can still be quite large. In that case the find step can be split further into chunks by the first character of the file name: first find any files matching uploads/2013/01/a*, add them to the archive, then uploads/2013/01/b* and so on…

Using find’s case-insensitive match (-iname) on the filenames means we don’t need to run through the alphabet twice.
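The first-character chunking is not part of the script as listed, so here is a rough sketch of how it might look. The function names list_chunk and archive_month_in_chunks are made up for illustration, and the loop assumes file names start with a letter or a digit; anything else would need its own pass.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): list the files in one
# month folder whose names start with a given character, case-insensitively,
# skipping resized images. Output is NUL-delimited, ready to pipe into tar.
list_chunk() {
	local dir="$1" char="$2"
	find "$dir" -type f -iname "${char}*" ! -name '*[0-9]x[0-9]*.*' -print0
}

# Hypothetical wrapper: append one month to an archive, one character at a time.
# File names starting with anything other than a-z or 0-9 are not covered here.
archive_month_in_chunks() {
	local dir="$1" archive="$2" char
	for char in {a..z} {0..9}; do
		echo "Adding files starting with '$char'"
		list_chunk "$dir" "$char" | tar -rvf "$archive" --null -T -
	done
}

# Example (assumed paths):
# archive_month_in_chunks uploads/2013/01 uploads-archives/uploads-2013-01.tar
```

Because -iname ignores case, the single pass over a to z also picks up files such as Avocado.png, which is why the alphabet only needs to be walked once.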

It can be placed inside the wp-content/ folder and is used like so:

./compress-uploads.sh

By default it will start 10 years ago (2013 at the time of writing) and end at the current year. However, you can use the -s (start) and -e (end) flags to set your own start and end years:

./compress-uploads.sh -s 2020 -e 2021

The above will compress the uploads/2020/ and uploads/2021/ folders.
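The script takes whatever is passed to -s and -e at face value. If you want it to fail early on a typo, something along these lines could be added before the loops. This is only a sketch: valid_year_range is a hypothetical helper, and the four-digit rule is my assumption about what a sensible year looks like.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeeds only if both
# arguments are four-digit years and the start is not after the end.
valid_year_range() {
	local start="$1" end="$2"
	[[ "$start" =~ ^[0-9]{4}$ ]] || return 1
	[[ "$end" =~ ^[0-9]{4}$ ]] || return 1
	# 10# forces base 10 so a leading zero is not read as octal.
	(( 10#$start <= 10#$end ))
}

# Possible use after START_YEAR and END_YEAR have been set:
# if ! valid_year_range "$START_YEAR" "$END_YEAR"; then
# 	echo "Usage: $0 [-s start_year] [-e end_year]" >&2
# 	exit 1
# fi
```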

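The script above writes a separate archive for each year/month folder. You can change the grouping by changing the file name in the tar command. Because tar cannot append (-r) to a compressed archive, keep the plain .tar name here and gzip the file once it is complete:

# Create year-based archives (gzip each one after its month loop finishes):
tar -rvf uploads-archives/uploads-$year.tar --null -T -

# Create month-based archives (what the script above does):
tar -rvf uploads-archives/uploads-$year-$month.tar --null -T -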
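Since every month ends up in its own predictably named file, a re-run can work out where to resume by looking at what is already in uploads-archives/. Here is a minimal sketch of that check. The month_status function and its three labels are hypothetical, not something the original script does.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): report the state of one
# month's archive so a re-run can skip months that are already finished.
#   complete -> the gzipped archive exists and passes gzip's integrity test
#   partial  -> an uncompressed .tar was left behind, or the .gz is damaged
#   missing  -> nothing has been written for this month yet
month_status() {
	local dir="$1" year="$2" month="$3"
	local base="$dir/uploads-$year-$month.tar"
	if [ -f "$base.gz" ] && gzip -t "$base.gz" 2>/dev/null; then
		echo "complete"
	elif [ -f "$base" ] || [ -f "$base.gz" ]; then
		echo "partial"
	else
		echo "missing"
	fi
}

# Possible use at the top of the month loop:
# status=$(month_status uploads-archives "$year" "$month")
# [ "$status" = "complete" ] && continue
# [ "$status" = "partial" ] && rm -f "uploads-archives/uploads-$year-$month.tar" "uploads-archives/uploads-$year-$month.tar.gz"
```

Removing a partial file before redoing the month matters because tar -r appends: running it again on a leftover .tar would add the same files a second time.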

Let me know if you found this useful, or if you have ideas for improvements!
