Is there a way to remove dangling thumbnails?

I find it convenient to remove image files from the command line with "rm". However, it seems that if you do this, the thumbnail of the file is not removed from ~/.cache/thumbnails/normal. Is there a command that will simultaneously remove both the image and its thumbnail? Alternatively, is there a way to remove just the dangling thumbnails from ~/.cache/thumbnails/normal? And is there a way to turn off thumbnails for specific folders?

1 Like

Really interesting, I never noticed that thumbnails are saved in the three folders inside ~/.cache/thumbnails/ and not removed once their original file is permanently deleted. This behavior wastes disk space for no reason, especially for users who have or had really many screenshots or pictures, like me. Some thumbnails don't even work; the picture viewer shows them as black. Of course I don't like this behavior, because it means that sometimes I'll have to do manual cleaning.

If there were a method that automatically removed previews of images that no longer exist, that would be better, because there is no point in keeping a thumbnail when its original file no longer exists.

I found a way, but it removes them globally: open Files > click the options menu on the title bar (to the left of the Minimize button) > Preferences > Search and Preview > change the thumbnail settings there.

1 Like

To completely clean the thumbnail cache:

rm -rf ~/.cache/thumbnails/*

Or only the normal-sized thumbnails:

rm -rf ~/.cache/thumbnails/normal/*

To remove an image together with its thumbnail:

rm path/to/image && rm ~/.cache/thumbnails/normal/(image-name)

Using a GUI:
You can use Ubuntu Cleaner or BleachBit to clean your whole system.
For BleachBit:

sudo apt install bleachbit

Or for Ubuntu Cleaner:

sudo add-apt-repository ppa:gerardpuig/ppa

sudo apt update && sudo apt install ubuntu-cleaner

The reason thumbnails are cached in the first place is to allow previews to load quickly.

1 Like

Well, yes, but I meant the thumbnails of deleted images; they have no reason to stay once the original image has been permanently deleted.

1 Like

I forgot to mention I am using Zorin OS 16.2 Lite. At least on this system the image-name is a hash of some kind. E.g.,

ls .cache/thumbnails/normal
76586bf7d5b280cb9f62c8aad5abb77a.png  bff57a56b0394d78d18acc4b40d27962.png  fb6dacd072597a6566dcb6b2b9ee6792.png

Does anyone know how this hash is computed? This is an interesting problem: you cannot just use the base name, as there may be multiple files with the same base name. You could use the full path, but that seems expensive. My guess is that it uses the inode. I'm kind of busy right now but will do more research later.

1 Like

In my case these were images from my web cam; I finally noticed when ~/.cache/thumbnails/normal had more than 64K files and used more than 1 GB of disk space. This may be a problem for beginners who may not realize it is happening. Of course, beginners probably use the file manager to delete their files, and it is smart enough to also delete the thumbnails. But on a slow machine like mine this is onerously time consuming for a large number of files, as it seems to take the file manager some time to find all the matching thumbnails.
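
For anyone who wants to check their own cache, something like this shows the total size and the number of cached thumbnails:

du -sh ~/.cache/thumbnails
ls ~/.cache/thumbnails/normal | wc -l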

As I mentioned, I am using Zorin OS 16.2 Lite, so the file manager may be different: Thunar is the file manager for Lite. Thunar has a way to turn off thumbnails globally, but what I would prefer is to turn off thumbnails only for my web cam folder.

1 Like

Can you please edit your forum profile to indicate the edition of Zorin OS you are using, in your case "Lite"? That will help us help you. Thanks, Zab.

2 Likes

Yes, there are some differences between Core and Lite, but nothing so different that we can't work it out somehow :slightly_smiling_face:. You can show that you are using Lite in the preview we see when clicking your avatar: open your Zorin Forum profile settings and, in the Profile section, add Lite under the title Zorin OS editions. If you click our avatars you can read the editions that we are using :wink:.

:astonished:, that's a lot of disk space, good thing that you discovered it :sweat_smile:.

It's probably worth writing a guide here on the Zorin Forum advising users to remove them :smirk:. It should explain how to remove only those that point to deleted files. Then again, removing the whole thumbnail cache probably isn't so bad either; it's the kind of data that gets regenerated when needed, so opening the related picture should restore its thumbnail :thinking:.

1 Like

I found this article that gives some good insight into how this works. According to it, the thumbnail name is derived from the canonical absolute URI of the file, hashed with the MD5 hash function.

After some experimentation I found this command to find the exact name generated:

echo -n file://$(readlink -fn ~/Pictures/wallpaper.png) | md5sum

Note that the thumbnail is always saved as a png, regardless of the original format of the image. This command will generate the hash only.

You can use this to figure out if a particular image file has a thumbnail generated:

ls -l ~/.cache/thumbnails/large | grep $(echo -n file://$(readlink -fn ~/Pictures/wallpaper.png) | md5sum)
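
If I'm reading the spec right, each thumbnail PNG also stores the URI of its original file in a Thumb::URI text chunk, which suggests a way to tackle the original question about dangling thumbnails: read that URI back out of each cached thumbnail and delete the thumbnail if the file it points to no longer exists. A rough sketch of that idea, assuming ImageMagick's identify is installed and exposes the text chunk as a property (treat the exact identify syntax as an assumption, and test on a copy of the cache first):

#!/usr/bin/env bash

# For every cached thumbnail, read the Thumb::URI chunk written by the
# thumbnailer and delete the thumbnail if the original file is gone.
for thumb in "$HOME"/.cache/thumbnails/normal/*.png "$HOME"/.cache/thumbnails/large/*.png
do
    [ -e "$thumb" ] || continue                  # glob matched nothing

    uri=$(identify -format '%[Thumb::URI]' "$thumb" 2>/dev/null)

    case "$uri" in
        *%*)      continue ;;                    # percent-encoded path: skip, this sketch does not decode
        file://*) path="${uri#file://}" ;;       # strip the scheme to get a plain path
        *)        continue ;;                    # not a local file URI
    esac

    [ -e "$path" ] || rm "$thumb"                # original no longer exists -> dangling thumbnail
done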

And for convenience, here's a small script that you can use to remove images along with their generated thumbnails. It would be nice to have it recognize whether a file is an image and apply this method, and if not fall back to the default rm binary. It shouldn't be too hard to get this up and running but I'll have to try later.

It goes without saying but this is provided with no guarantees and you use it at your own risk.

#!/usr/bin/env bash

# Thumbnail file names are the MD5 hash of the image's canonical URI, plus .png
URI_PREFIX='file://'

CACHE_LOCATIONS=(
    "$HOME/.cache/thumbnails/large"
    "$HOME/.cache/thumbnails/normal"
    "$HOME/.cache/thumbnails/fail/gnome-thumbnail-factory"
)

FILE_URI="${URI_PREFIX}$(readlink -fn "$1")"

THMB_NAME="$(echo -n "$FILE_URI" | md5sum | cut -d ' ' -f 1).png"

rm "$1"

for dir in "${CACHE_LOCATIONS[@]}"
do
    rm "${dir}/${THMB_NAME}"
done
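
To use it, save it somewhere, make it executable and pass it a single image; the script name and path here are just examples:

chmod +x rm-thumb.sh
./rm-thumb.sh ~/Pictures/webcam/photo-001.jpg
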
2 Likes

I also found this

Thanks for the script, I also wrote one but yours is much better since mine is written in PHP - don't know shell scripting.

I think switching to Zorin is the best thing I have done in a long time - the response in this forum is fantastic.

1 Like

That specification page will definitely come in handy. I was actually trying to look for something similar but, well, ran into this Medium article instead... not a great fan of Medium to be honest, but sometimes you do find great content.

Do you mind sharing that PHP script? I'm curious to see what that would look like. If I have time these days I'll try to update the script to be a bit more flexible, as this is indeed a great idea. I hadn't realized it, but it turns out I had almost ~500 MB of cached data!

<?php
$thumbnail_dir = getenv('HOME') . '/.cache/thumbnails';
for ($i = 1; $i < count($argv); $i++) {
    $url = 'file://' . realpath($argv[$i]);
    foreach(['normal/', 'large/'] as $type) {
        $thumbnail = "{$thumbnail_dir}/{$type}" . md5($url) . '.png';
        #echo "{$argv[$i]} -> {$url} -> {$thumbnail}\n";
        @unlink($thumbnail);    # suppress warnings for non existent thumbnails
    }
    unlink($argv[$i]);
}

Executed with this command:

php ~/temp/rm-thunar.php *.jpg

I think this may be faster than zenzen's script since everything is done in one process. Of course it is only useful to users who have PHP installed. (I am a WordPress hobbyist so of course I use PHP.) Incidentally, although most users only use PHP for web development, I think it is underrated as a general-purpose scripting language.

1 Like

I just thought about this:

For every file, five processes will be created - readlink, md5sum, cut, rm and echo. Typically I have several hundred or several thousand files to delete, so several thousand processes will be created and destroyed. It may be better to do it all in one process. (I was surprised to see that echo is an actual binary and not just a shell built-in, but maybe the shell intercepts the echo command and never calls the binary, in which case the count will be four.) The binary could be executed as smart-rm *.jpg so that one process removes them all.

I don't think you can get away without multiple processes for each file, since those are necessary steps to figure out the right name of the thumbnail.
But in any case I don't think it's going to cause any issues, as processes are recycled very efficiently. If you are concerned, you can cat /proc/sys/kernel/pid_max to see the maximum number of process IDs, which should be set to something like 4 million on 64-bit machines.

echo is definitely a shell built-in. Maybe there's a binary for using it in special cases or with more exotic shells?
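
In bash you can check which one you're getting with type -a, which lists every form of the command it knows about:

type -a echo

That typically prints "echo is a shell builtin" followed by the path of the standalone binary, so scripts get the built-in while programs that exec commands directly (without a shell) get the binary.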

A quick update to make it accept any number of files would look something like this.
It still does no checks as to whether the provided files are images or not, and you're likely to get a lot of "file not found" output, since not every thumbnail exists in all three of the cache locations.

At the cost of a slight time increase you can also use find to locate and delete the thumbnails, without any of the "file not found" noise.

#!/usr/bin/env bash

URI_PREFIX='file://'

CACHE_LOCATIONS=(
    "$HOME/.cache/thumbnails/large"
    "$HOME/.cache/thumbnails/normal"
    "$HOME/.cache/thumbnails/fail/gnome-thumbnail-factory"
)

for file in "$@"
do
    FILE_URI="${URI_PREFIX}$(readlink -fn "$file")"
    THMB_NAME="$(echo -n "$FILE_URI" | md5sum | cut -d ' ' -f 1).png"

    # Remove the image itself (the file, not the file:// URI)
    rm "$file"

    # Locate and delete the matching thumbnail(s) without "file not found" noise
    find "$HOME/.cache/thumbnails" -type f -name "$THMB_NAME" -delete

    # for dir in "${CACHE_LOCATIONS[@]}"
    # do
    #     rm "${dir}/${THMB_NAME}"
    # done
done
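
If you'd rather avoid find but still skip the "file not found" noise, another option is to keep the direct rm calls and add -f so that missing thumbnails are ignored silently; the commented-out loop above would then become something like:

    for dir in "${CACHE_LOCATIONS[@]}"
    do
        rm -f "${dir}/${THMB_NAME}"    # -f: no error if this cache dir has no such thumbnail
    done
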
1 Like

If you were to create a compiled binary, then the right name could be computed using only library and system calls. Process creation, although done efficiently, is still expensive since the kernel must allocate and populate all the process objects. And since pipes are used to communicate between the processes, context switches must be done. If you are using only library and system calls, everything is done in the context of one process.

I think it may be there so that a binary can call it. I think readlink may also be a shell built-in.

You may be right not to worry about the cost of process creation/destruction and context switches, but I am a very old man (73 next month), which means I worked on hardware that the current generation would find impossibly slow, and we worried about the cost of process creation and context switches. I think these things are still important.

1 Like

I spent some time thinking about how to convince you that the cost of process creation/destruction is something you should not ignore.

To observe the cost of process creation/destruction on your computer, run the following two commands:

for ((i=0;i<10000;i++)); do echo x > /dev/null; done

for ((i=0;i<10000;i++)); do cat < /dev/null > /dev/null; done

Since the first command uses echo, a shell built-in, no new processes are created. Since cat is not a built-in but a binary, the second command creates and destroys 10,000 processes. On my system, a roughly five-year-old low-end laptop, the first command returns almost instantaneously but the second takes many seconds (around 10).
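
If you want to see where those processes come from, strace can count the execve calls (program executions) each loop makes; a rough check, assuming strace is installed:

strace -f -c -e trace=execve bash -c 'for ((i=0;i<100;i++)); do echo x > /dev/null; done' 2>&1 | grep execve

strace -f -c -e trace=execve bash -c 'for ((i=0;i<100;i++)); do cat < /dev/null > /dev/null; done' 2>&1 | grep execve

The first should report essentially a single execve (bash itself), the second roughly one per iteration.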

To get some idea of the work the kernel must do to create/destroy a process, you can visit /proc/{some pid}. There are many files in this directory. Most of them are representations of the kernel objects that the kernel must create, maintain, and destroy for the process. Even destruction is not as simple as freeing memory: if an object represents a resource, that resource must be shut down, e.g., open file descriptors must be closed.

On my slow system rm *.jpg can delete 10K files almost instantaneously. (To be fair, I had ls'd the directory beforehand so the inodes were already in the memory cache.) The reason rm can do this is that the files are deleted using library and system calls running in a single context in a single process. Contrast this with a shell script, which does its work by executing commands, each incurring the cost of process creation/destruction. This of course is really only a problem if the commands are inside a loop that will be executed many times.

Incidentally, a language like PHP does not have this disadvantage because it does its work using library and system calls. It is not as fast as a compiled binary because the data must be marshalled between PHP and the native C libraries, but it's pretty close.

BTW, 10,000 is not an unreasonable number. I will use this script to clean my webcam directory, which accumulates several hundred images daily. If I am lazy, in a few days there will be more than 1K files, and in a month more than 10K.

2 Likes

I've only recently started learning some programming, and while I'm spoiled by modern hardware, I definitely sympathize with the constraints that existed not too long ago. I really do try to aim for efficiency whenever possible, but it's important to factor in development time as well, which is why one should use whatever one is most comfortable with.
For relatively simple scripts where performance is not the main concern I typically reach for shell scripts, and I move on to other languages like Python when things get a little more complex. Most of the time this is enough, but I guess I underestimated the work needed to delete that many files.

You made some very good points and I was curious, so I took the time to run all of these scripts separately and do some very basic measurements of how long they take to get the job done. In addition, I went down a little rabbit hole and wrote another version in C to compare the results against a compiled binary (this is the reason I was delayed in responding, as I'm not exactly proficient with this language).

I used a sample dataset of 1066 random images that I downloaded from unsumple.net. I imagine the results would be different with even more images but, quite honestly, it was a little tedious to download all these images and I stopped at this number, as it's probably big enough to show the relative differences between these scripts.
And of course, I didn't want to use my own images... just in case :smiley:

To run the test I simply put time -p in front of the executable to take the measurement. Again, very basic, but the results were actually surprising:

Script version      real time   user time   sys time
Bash (inner loop)   9.50        3.40        0.62
Bash (find)         9.46        3.09        0.55
PHP                 0.13        0.01        0.05
C                   0.11        0.00        0.03

So, it's clear that Bash is just slow compared to the other languages. You were right that PHP does its work through library and system calls in a single process, and that alone makes it so much faster, nearly as fast as C.
I honestly don't know what these metrics mean exactly; I will have to do some reading another time. But this is probably good enough to show the relative differences, and I guess with a larger dataset they would be a little more obvious. It seems that PHP is, after all, a better tool for writing quick scripts that rely heavily on system calls. I'm personally more inclined to use Python, as it's already installed by default on most (if not all) Linux distributions.

Well, this was a fun exercise and I definitely learned something from it. If you are curious, this is the C program. It depends on OpenSSL, so you'll need to install libssl-dev. Run gcc main.c -lcrypto to compile it. Again, I'm not very proficient with C so it probably has plenty of room for improvement:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <openssl/evp.h>
#include <openssl/err.h>

#define URI_PREFIX "file://"
#define MD5_LENGTH 16

void str_prepend(char* str, const char* text);
void initialize_openssl(EVP_MD_CTX** ctx, EVP_MD** md5);
void md5sum(EVP_MD_CTX* ctx, EVP_MD* md5, const char* msg, char* buffer);

char THUMBNAIL_LOCATIONS[][PATH_MAX] = {
    "/.cache/thumbnails/normal/",
    "/.cache/thumbnails/large/",
};

int main(int argc, char** argv)
{
    // Initialize OpenSSL digest algorithm
    EVP_MD_CTX *ctx = NULL;
    EVP_MD *md5 = NULL;

    initialize_openssl(&ctx, &md5);

    for (int i = 1; i < argc; i++)
    {
        char absolute_path[PATH_MAX];
        char canonical_uri[PATH_MAX];

        // Get the file's absolute path (realpath null-terminates the result)
        realpath(argv[i], absolute_path);

        // Get file's canonical uri
        strcpy(canonical_uri, absolute_path);
        str_prepend(canonical_uri, URI_PREFIX);

        // Calculate canonical uri's md5 digest. Turn into ascii text
        char checksum[MD5_LENGTH * 2 + 1];
        md5sum(ctx, md5, canonical_uri, checksum);

        // Construct thumbnail name
        char thumbnail_name[MD5_LENGTH * 2 + 5];    // 32 hex chars + ".png" + '\0'
        strcpy(thumbnail_name, checksum);
        strcat(thumbnail_name, ".png");

        // Delete file
        if (remove(absolute_path) != 0)
        {
            fprintf(stderr, "Errno: %d\n", errno);
            perror("Cannot find");
        }

        // Delete thumbnails
        for (int i = 0; i < 2; i++)
        {
            // Get thumbnail's absolute path
            char full_path[PATH_MAX];
            strcpy(full_path, getenv("HOME"));
            strcat(full_path, THUMBNAIL_LOCATIONS[i]);
            strcat(full_path, thumbnail_name);

            // Delete thumbnail; ignore errors due to file not found
            if (remove(full_path) != 0 && errno != ENOENT)
            {
                fprintf(stderr, "Errno: %d\n", errno);
                perror("Delete error");
            }
        }
    }

    EVP_MD_free(md5);
    EVP_MD_CTX_free(ctx);

    return 0;
}

void str_prepend(char* str, const char* text)
{
    size_t str_length = strlen(str);
    size_t text_length = strlen(text);
    memmove(str + text_length, str, str_length + 1);
    memcpy(str, text, text_length);
}

void initialize_openssl(EVP_MD_CTX** ctx, EVP_MD** md5)
{
    *ctx = EVP_MD_CTX_new();
    *md5 = EVP_MD_fetch(NULL, "md5", NULL);
}

void md5sum(EVP_MD_CTX* ctx, EVP_MD* md5, const char* msg, char* buffer)
{
    unsigned char* md_output = NULL;
    unsigned int md_length = 0;

    EVP_DigestInit_ex(ctx, md5, NULL);
    EVP_DigestUpdate(ctx, msg, strlen(msg));

    md_output = OPENSSL_malloc(EVP_MD_get_size(md5));

    EVP_DigestFinal_ex(ctx, md_output, &md_length);

    char* ptr = buffer;
    for (int i = 0; i < MD5_LENGTH; i++)
    {
        ptr += sprintf(ptr, "%02x", md_output[i]);
    }

    OPENSSL_free(md_output);
}
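
For reference, compiling and running it looks something like this (the output name and the test path are just examples):

gcc main.c -lcrypto -o rm-thumb
./rm-thumb ~/Pictures/webcam/*.jpg
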
2 Likes

I have been thinking more about this problem, and I think you can dramatically improve the performance of your script if you redesign it. Since md5sum, cut and rm can process multiple arguments, there is no need to call them once per file; instead you can call them on all files at once. So for 10,000 files, only readlink would be called 10,000 times; md5sum, cut and rm would each be called only once.

Script version      real time   user time   sys time
Bash (inner loop)   9.50        3.40        0.62
Bash (find)         9.46        3.09        0.55
PHP                 0.13        0.01        0.05
C                   0.11        0.00        0.03

I am really surprised at how close PHP is to C. PHP is not compiled ahead of time, but only the first pass through the code is slow: the first pass creates an optimized binary representation, or maybe even compiled code, so thereafter it runs close to binary speed.

Another advantage of PHP is that it is very good at string processing. In the early 90s I worked as a Solaris sysadmin and so had to write a lot of scripts. The hardcore guys were all doing shell scripts, and I tried to learn it but found it difficult; my boss said I should try Perl, which I really liked because it was very good at string processing, something I think the shell is very weak at. Sadly, Perl is a dying language, but Larry Wall, the inventor of Perl, who is only a few years younger than me, is developing another language - not bad for a 68-year-old.

I think it is much more valuable to learn Python than PHP. Python is the language of choice for building developer tools. I use PHP because WordPress is my hobby and in some ways it is a successor to Perl.

Yes, definitely. I use the shell mostly because it's convenient, but I really don't like it. It's very unintuitive in so many ways, and I especially agree it's difficult to work with strings.
Python has proven to be quite useful: it's flexible, it's fast, it's easy to work with... Nowadays we have plenty of languages to choose from, it almost feels like every day there are a few dozen popping up, so it's only natural that some of them fall out of fashion.

Once I read more about how to run a "proper" benchmark and how to interpret the results, I'll try to compare a few other variations and languages. As they say, practice makes perfect.

EDIT:

One last refactor of that shell script, taking into account what you said about reducing the number of processes created.
Unfortunately, while md5sum does take multiple file arguments, it's not the files we need to hash but their canonical URIs, and those it will only accept from stdin.
However, I managed to reduce the number of calls to readlink, and that alone had a surprising effect:

Script version      real time   user time   sys time
Bash (inner loop)   9.50        3.40        0.62
Bash (find)         9.46        3.09        0.55
Bash (refactor)     5.25        3.44        0.51
PHP                 0.13        0.01        0.05
C                   0.11        0.00        0.03

#!/usr/bin/env bash

# One readlink call resolves every argument to its absolute path (one per line)
ABSOLUTE_PATHS=$(readlink -f "$@")
CANONICAL_URIS=()
MD5_DIGESTS=()

for path in $ABSOLUTE_PATHS
do
    CANONICAL_URIS+=( "file://${path}" )
    rm "$path"
done

for uri in "${CANONICAL_URIS[@]}"
do
    HASHED_NAME=$(echo -n "$uri" | md5sum | cut -d ' ' -f 1)
    MD5_DIGESTS+=( "${HASHED_NAME}.png" )
done

for thumbnail in "${MD5_DIGESTS[@]}"
do
    # -f: ignore thumbnails that don't exist in one of the cache directories
    rm -f "$HOME/.cache/thumbnails/large/$thumbnail"
    rm -f "$HOME/.cache/thumbnails/normal/$thumbnail"
done
2 Likes