Resize completely a wordpress.com blog’s media gallery

I found no convenient way to resize a whole media gallery on wordpress.com, with the free plan which does not allow to install plugins. Aside from that, I find strange wordpress itself still does not prevent duplicates media using checksum or else.

I had a blog with a media gallery reaching the limit of upload on a free plan. And it contained tons of very high-res pictures that actually could be downsized without posing any problem.

I found no convenient way to replace images (with the free plan and no plugins). If you reupload the same file, after deleting it, it will get an extra suffix -1/-2, etc: wordpress clearly keep the deleted media names in the database and prevent it to be reused so it is a no go.

The only solution I found was to:

  • export both post and media files;
  • delete all files and post;
  • run scripts to update images files and xml export/import files;
  • reimport everything with new filenames.

It is not perfect, some data will be lost, namely old galleries and some post front image choice, etc. Follows the first approach (1, 2, 3) and the second one (1, 2+3, 4) that tries to do smarter things (but is then more likely to break soon).

1. Renaming and downsizing the images

# to run in the exported media directory (extracted in a single directory)    
date=`date +%H%M%S`
backup=`mktemp --directory --tmpdir=$PWD -t $date-XXX.bak`
    
for file in  *.png *.jpg *.jpeg; do
	# skip if not a file
	if [ ! -f "$file" ]; then continue; fi

	# rename:
	newfile="BLOGNAME.wordpress.com-"$file
	echo "$file => $newfile"
	
	# limit to 1600 in max size  - to check up to decent size amount for the full
	convert "$file" -resize 1600x1600\> "$newfile"
	mv "$file" $backup/
	
done

2. Updating post export/import with new images filenames

#!/usr/bin/perl
# to be saved as as perl script file, 
# edit (especial THISBLOG and 2020/03, YYYY/MM of the new upload that will be added automatically in the URL of the media file during upload)
# and then 
# run against the exports files like
# chmod +x ./thiscript.pl
# ./thiscript.pl 

# note: wordpress.com rename -- in - during upload

use strict;


open(IN, "< $ARGV[0]");
open(OUT, "> edited.$ARGV[0]");
while (<IN>) {
    s@\.files\.wordpress.com\/\d{4}\/\d{2}\/@.files.wordpress.com/2020/03/THISBLOG.wordpress.com-re.up-@ig;
    print OUT $_;
}
close(IN);
close(OUT);

3. Checking if images appears in posts

#!/usr/bin/perl
#
# make sure every image downloaded actually exist in posts
# also to download and run against the xml import/export files and the media directory
# ./thiscript.pl THISBLOG.wordpress.2020-03-29.001.xml #../images_updated

use strict;
use warnings;
use Regexp::Common qw/URI/;
use File::Basename;
use File::Copy;

my %images;

open(IN, "< $ARGV[0]");
while (<IN>) {
    my $url;
    next unless /($RE{URI}{HTTP}{-keep})/;
    $url = $1;
    next unless $url =~ /THISBLOG\.files\./;
    my ($basename, $parentdir, $extension) = fileparse($url, qr/\.[^.]*$/);

    if ($extension  =~ /^\.(png|jpg|jpeg|gif)$/i) {
	#print "$basename$extension ($url)\n";
	$images{"$basename$extension"} = "$basename$extension"
	    unless $images{"$basename$extension"};
    } else {
	#print "IGNORE $basename$extension ($url)\n";
    }

}

while (my($image, ) = sort(each(%images))) {
    print "$image\n";
    if ($ARGV[1]) {
	if ((-d "$ARGV[1]") and (-e "$ARGV[1]/$image")) {
	    mkdir("$ARGV[1]/valide") unless -d "$ARGV[1]/valide";
	    copy("$ARGV[1]/$image", "$ARGV[1]/valide/") or print "failed to copy $image to $ARGV[1]/valide/\n";
	}
    }
}

This scripts are primitive (sure, even the blog name and upload YYYY/MM was hardcoded). Since platforms like wordpress.com often changes, this might no longer works at another time. Many pages on the Internet claims you can simply erase and reupload image with the same name: this clearly no longer works. This page could save you some trial-and-error process to give a solution that works as of today.

Note also that some wordpress.com theme have front page image and other selections that wont be carried over.

2+3. Updated script to update xml and check images

I wrote the following script for a smoother process. Being more complex, it is more likely to be fragile also. It still miss handling/removing some specific outdated wp:metadata. This one list duplicates, unused and missing files.

#!/usr/bin/perl
#
# ./renew-url+checkimages.pl FOLDER_OF_XML FOLDER_OF_IMAGES (no subdirs)

use strict;
use warnings;
use Regexp::Common qw/URI/;
use File::Basename;
use File::Copy;
use File::Slurp;
use Cwd 'abs_path';
use MIME::Base64;
use URI;

my $blog = "zukowka";
my $newprefix = "zukowka.wordpress.com-re.up-";
my $upload_yyyymm = "2020/03"; # time of the new upload
my $images_types = "png|jpg|jpeg|gif";

my %oldurl_newurl;       # {original} = new  
my %imagebase64_newurl;    # {base64} = new_url


die "$0 FOLDER_OF_XML FOLDER_OF_IMAGES (no subdirs)" unless $ARGV[0] and$ARGV[1];



# get dir full path 
my $xmldir;
$xmldir = abs_path($ARGV[0])
    if $ARGV[0];
# get dir full path 
my $imgdir;
$imgdir = abs_path($ARGV[1])
    if $ARGV[1];


my ($unused,$dupes,$missing,$mapped);
open(DUPES, "> $0.duplicatedimages");  # same image found with different names/URL
open(MISSING, "> $0.missingimages");   # file listed in xml not found in the image directory
open(UNUSED, "> $0.unusedimages");   #  file found in the image directory but not listed
open(MAPPING, "> $0.mapping");


# test access to dir
print "XML dir (ARG1): $xmldir
images dir (ARG2): $imgdir\n";
chdir($xmldir) if -d $xmldir or die "unable to enter $xmldir, exiting"; 
chdir($imgdir) if -d $imgdir or die "unable to enter $imgdir, exiting";

# - slurp all urls in import files
# - check if the image listed exist, store with base64 so we keep only one
chdir($xmldir);
while (defined(my $file = glob("*.xml"))) {
    print "### slurp $file ###\n";
    open(IN, "< $file");
    while (<IN>) {
	while (/($RE{URI}{HTTP}{-scheme => qr@https?@}{-keep})/g) {
	    # remove the http/https and possible args to keep the most portable url 
	    my $uri = URI->new($1);
	    my $url = $uri->authority.$uri->path;

	    
	    # images URL always start with $blog.files.wordpress.com
	    next unless $url =~ /^$blog\.files\./;
	    
	    my ($basename, $parentdir, $extension) = fileparse($url, qr/\.[^.]*$/);
	    my $newurl = "$blog.files.wordpress.com/$upload_yyyymm/$newprefix$basename$extension";
	    
	    # skip if the url was already mapped
	    if ($oldurl_newurl{$url}) {
		#print "SEEN ALREADY (skipping) $url\n";
		next;
	    }
	    
	    # work only on images
	    next unless lc($extension) =~ /^\.($images_types)$/i;
	    
	    # check if the relevant image exists in the image folder
	    my $newimage = "$imgdir/$newprefix$basename$extension";
	    unless (-e $newimage) { 
		print "MISSING (skipping) $newimage\n";
		print MISSING "$newimage\n";
		$missing++;
		next;
	    }
	    
	    # get image base64
	    my $base64 = encode_base64(read_file($newimage));
	    
	    # find out if this exact image is already known
	    if ($imagebase64_newurl{$base64}) {
		# already known, will point to the first one found
		print "DUPES $newprefix$basename$extension\n";
		print DUPES "$newprefix$basename$extension:\n\t$url => $imagebase64_newurl{$base64}\n";
		$dupes++;
		
		# URLs will point to the first image found
		#   full form like http://blog.files.wordpress.com/YYYY/MM/file.jpg
		$oldurl_newurl{$url} = $imagebase64_newurl{$base64};
		#   short form like http://blog.wordpress.com/file/
		my ($base64_basename, $base64_parentdir, $base64_extension) = fileparse($imagebase64_newurl{$base64}, qr/\.[^.]*$/);
		$oldurl_newurl{"$blog.wordpress.com/$basename/"} = "$blog.wordpress.com/$base64_basename/" if
		    "$blog.wordpress.com/$basename/" ne "$blog.wordpress.com/$base64_basename/";
		
	    } else {
		# store base64 with full form url to
		$imagebase64_newurl{$base64} = $newurl;
		
		# URLs will point to the first image found
		#   full form like http://blog.files.wordpress.com/YYYY/MM/file.jpg
		$oldurl_newurl{$url} = $newurl;
		#   short form like http://blog.wordpress.com/file/
		$oldurl_newurl{"$blog.wordpress.com/$basename/"} = "$blog.wordpress.com/$newprefix$basename/" if
		    "$blog.wordpress.com/$basename/" ne "$blog.wordpress.com/$newprefix$basename/";
			
	    }
	}
    }
    close(IN);
}

# store mappings
my %used;
while (my($oldurl,$newurl) = sort(each(%oldurl_newurl))) {
    print MAPPING "$oldurl => $newurl\n";
    my ($basename, $parentdir, $extension) = fileparse($newurl, qr/\.[^.]*$/);
    $used{"$basename$extension"} = 1;
    $mapped++;
}


# build import xml 
chdir($xmldir);
mkdir("$xmldir/renew") unless -d "$xmldir/renew";
while (defined(my $file = glob("*.xml"))) {
    print "### rewrite $file ###\n";
    open(OUT, "> renew/renewed-$file");
    open(IN, "< $file");
    while (<IN>) {
	my $line = $_;
	# check every url
	while (/($RE{URI}{HTTP}{-scheme => qr@https?@}{-keep})/g) {
	    # remove the http/https and possible args to keep the most portable url 
	    my $uri = URI->new($1);
	    my $url = $uri->authority.$uri->path;
	    
	    # update if mapping registered
	    if ($oldurl_newurl{$url}) {
		$line =~ s/$url/$oldurl_newurl{$url}/g;
		#print "$url -> $oldurl_newurl{$url}\n"
	    }
	}
	print OUT $line;
    }
    close(OUT);
    close(IN);	
}

# finally list useless images
chdir($imgdir);
mkdir("$xmldir/renew") unless -d "$xmldir/renew";
while (defined(my $file = glob("*"))) {    
    # work only on images
    my ($basename, $parentdir, $extension) = fileparse($file, qr/\.[^.]*$/);
    next unless lc($extension) =~ /^\.($images_types)$/i;

    # check if registered yet
    next if $used{$file};

    # if we reach this point, this media is unknown
    $unused++;
    print UNUSED $file, "\n";
}

close(MAPPING);
close(DUPES);
close(MISSING);
close(UNUSED);


print "=============================
$mapped mapped URL
$missing missing files (!)
$dupes duplicated files/links
$unused unused files\n";

# EOF

Grabbing new images post_id

There are used in gallery in the form . To use this script, you must compare your new XML produced by the previous script and new XML export made by wordpress.com AFTER uploading the new images.

The point is to get post_id from newly uploaded images and to map them to the old removed images post_id.


#!/usr/bin/perl
#
# ./update-post_id.pl FOLDER_OF_XML_UPDATED FOLDER_OF_XML_AFTER_IMAGE_REUPLOAD
#
#
# galleries are } -->
# refering to <wp:post_id>27</wppost_id> of image attachements.
#
# reuploaded files have new post_id along with new metadata hardcoded in the database
# 	<guid isPermaLink="false">http://XX.files.wordpress.com/2020/03/XXX-img_20160814_125558.jpg</guid>
#       <wp:post_type>attachment</wp:post_type>
# should match
#       <wp:attachment_url>https://XX.files.wordpress.com/2020/03/XXX-img_20160814_125558.jpg</wp:attachment_url>


use strict;
use warnings;
use Regexp::Common qw/URI/;
use File::Basename;
use File::Copy;
use File::Slurp;
use Cwd 'abs_path';
use MIME::Base64;
use URI;
use XML::LibXML;

die "$0 FOLDER_OF_XML_UPDATED FOLDER_OF_XML_AFTER_IMAGE_REUPLOAD" unless $ARGV[0] and$ARGV[1];



# get dir full path 
my $xmldir;
$xmldir = abs_path($ARGV[0])
    if $ARGV[0];
# get dir full path 
my $xmlafterreuploaddir;
$xmlafterreuploaddir = abs_path($ARGV[1])
    if $ARGV[1];

# test access to dir
print "XML dir (ARG1): $xmldir
XML after reupload dir (ARG2): $xmlafterreuploaddir\n";
chdir($xmldir) if -d $xmldir or die "unable to enter $xmldir, exiting"; 
chdir($xmlafterreuploaddir) if -d $xmlafterreuploaddir or die "unable to enter $xmlafterreuploaddir, exiting";


my %guid_postid;  # {guid} = postid



chdir($xmlafterreuploaddir);
# get postid after reupload
while (defined(my $file = glob("*.xml"))) {
    print "### read $file ###\n";
    my $dom = XML::LibXML->load_xml(location=>$file);

    foreach my $e ($dom->findnodes('//item')) {	
	#	print $e->to_literal();

	# only care for attachement type post
	next unless $e->findvalue('./wp:post_type') eq 'attachment';
	
	# store new post_id with the guid as key 	
	$guid_postid{$e->findvalue('./guid')} = $e->findvalue('./wp:post_id');
    }   
}


my %old2new_postid; # {old} = new


chdir($xmldir);
# get postid in first export
while (defined(my $file = glob("*.xml"))) {
    print "### read $file ###\n";
    my $dom = XML::LibXML->load_xml(location=>$file);

    foreach my $e ($dom->findnodes('//item')) {
	# only care for attachement type post
	next unless $e->findvalue('./wp:post_type') eq 'attachment';

	# ignore if this guid was not found/replaced
	next unless $guid_postid{$e->findvalue('./guid')};

	# map postids
	$old2new_postid{$e->findvalue('./wp:post_id')} = $guid_postid{$e->findvalue('./guid')};	
	print $e->findvalue('./wp:post_id')." -> ".$guid_postid{$e->findvalue('./guid')}."\n";
    }   
}


# finally, with this mapping, edit xml gallery entries:
# build import xml 
chdir($xmldir);
mkdir("$xmldir/newpostid") unless -d "$xmldir/newpostid";
while (defined(my $file = glob("*.xml"))) {
    print "### rewrite $file ###\n";
    open(OUT, "> newpostid/$file");
    open(IN, "< $file");
    while (<IN>) {
	my $line = $_;
	# older galleries
	# 	
	while (/\*)".*]/g) {
	    print "$1\n";
	    my $original = $1;
	    my @new_ids;
	    foreach my $id (split(",", $original)) {
		if ($old2new_postid{$id}) {
		    push(@new_ids, $old2new_postid{$id});
		} else {
		    # if not found, push back the original one
		    push(@new_ids, $id);
		}
	    }
	    my $new = join(",", @new_ids);
	    print " => $new\n";

	    $line =~ s/\,"columns":2} -->	
	while (/\<\!\-\- wp\:gallery \{\"ids\"\:\[([\d|,]*)\].*\}/g) {
	    print "$1\n";
	    my $original = $1;
	    my @new_ids;
	    foreach my $id (split(",", $original)) {
		if ($old2new_postid{$id}) {
		    push(@new_ids, $old2new_postid{$id});
		} else {
		    # if not found, push back the original one
		    push(@new_ids, $id);
		}
	    }
	    my $new = join(",", @new_ids);
	    print " => $new\n";

	    $line =~ s/\<\!\-\- wp\:gallery \{\"ids\"\:\[$original\]/<!-- wp:gallery {"ids":[$new]/g;
	    print $line;
	}

	print OUT $line;
    }
    close(OUT);
    close(IN);	
}


# EOF