Is there a way to remove dangling thumbnails?

The trick is to put the canonical URIs into files and then run md5sum on the files. First, since we need to create lots of tiny temporary files, create a tmpfs so the files will live in a memory file system - no I/O to a real device (mount -t tmpfs .... /mnt/xxx). readlink's output should be redirected into files uri0, uri1, uri2, uri3, .... Then do md5sum uri*.
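A rough sketch of the idea (the paths and the sample file below are made up, and /dev/shm is assumed to be a tmpfs, as it is on most Linux systems): write each file's URI into its own tiny file on memory-backed storage, then hash them all with a single md5sum invocation.

```shell
# Sketch: one URI per file on a memory-backed filesystem, hashed in one
# pass. /dev/shm is assumed to be a tmpfs here; paths are examples.
workdir=$(mktemp -d "${XDG_RUNTIME_DIR:-/dev/shm}/urihash.XXXXXX")
: > "$workdir/sample.jpg"                     # stand-in for a real image
i=0
for f in "$workdir"/*.jpg; do
    printf 'file://%s' "$(readlink -f "$f")" > "$workdir/uri$i"
    i=$((i + 1))
done
md5sum "$workdir"/uri*    # each hash names a thumbnail: <hash>.png
rm -r "$workdir"
```

The md5 of each uriN file equals the md5 of the URI string itself, which is how thumbnail files are named under ~/.cache/thumbnails.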

Sorry about the late response. I live in Hawaii but I am a total nighthawk - I sleep during the day and work at night, 7:00 PM HST to 9:00 AM HST.

My two cents: I think it is better to be very good at a few tools than mediocre at many tools, and I think Python should definitely be one of those tools. I don't know Python at all except by reputation, and Python's reputation is excellent. Currently, I believe most developer tools are written in Python.

Of course, the choice of tools depends on the field you are working in. My field is WordPress, so my tool is PHP. I have become very proficient at PHP, so I think I can do most sysadmin tasks in PHP, although PHP is not usually used for this. However, unless you will be doing web development I would not recommend PHP. Because it was developed organically, it is a very quirky (i.e., ugly) language. (It hurts me deeply to say that because PHP is my language.) Contrast that with Python, which was developed under the strict guidance of G. van Rossum.

The other tool I think is very valuable (again, only by reputation) is Rust. Apparently, Rust is not just another programming language but introduces new paradigms for concurrency. So, if I were younger and my field were system programming rather than web development, Python and Rust would be the languages I would be interested in.

1 Like

That's an interesting idea; I didn't even consider it. After some reading, I found out that all systemd-based systems already have a tmpfs mount you can use at /run/user/$ID, where $ID is the numeric id of the user, so that's one less thing to do. This path is also conveniently stored in the XDG_RUNTIME_DIR environment variable.

I'm not entirely sure how to redirect multiple lines of output into separate files other than looping through the output directly, as before. I went with read this time and ran it a couple of times, but there doesn't seem to be much of a difference.

| Script version | real time | user time | sys time |
| --- | --- | --- | --- |
| Bash (inner loop) | 9.50 | 3.40 | 0.62 |
| Bash (find) | 9.46 | 3.09 | 0.55 |
| Bash (refactor) | 5.25 | 3.44 | 0.51 |
| Bash (tmpfs) | 3.16 | 1.66 | 0.48 |
| PHP | 0.13 | 0.01 | 0.05 |
| C | 0.11 | 0.00 | 0.03 |

It's very interesting to see how much this improves things. I'm really glad you mentioned it, as I've picked up a few tricks on how to write Bash more effectively.

I don't disagree with this, but I think fundamental knowledge is necessary in order to identify which tool to specialize in, at least nowadays, when so many new tools and languages keep popping up every single day. In that regard, I like to pick up a few tricks from here and there preemptively.

PHP may be as ugly as many people say, but I've found Laravel to be quite enjoyable, actually. It's very mature and has a rich ecosystem of tools and packages built around it. For backend development I would still prefer something like this over the new shiny things, e.g., Go. Clearly those are strong contenders, but they are still not as mature.

I believe one has to decide what the best tool for the job is. For simple scripts, Bash is my go-to, but I reach for Python immediately when things get too complicated or when I need cross-platform compatibility.
On the backend I would use Laravel for most websites that need the "expected" set of features (authentication, authorization, CRUD, etc.). But for an API or distributed system, Python and Node.js are a much better fit.

In the future, I would really like to learn about many things: Rust, Go, Elixir... but we'll see about that. Some interesting things to watch and read (ignore the title of the video :smiley: ):

In my opinion one of the most underrated technologies is WebAssembly, which despite the name is not tied to the Web, and it is going to play a big role in containerized applications. Some programs like Zellij are already using it as a plugin system, so that you can write plugins in any language that compiles to WebAssembly.

#!/usr/bin/env bash

temp_dir=$(mktemp -dt "$(basename "$0").XXXXXX" --tmpdir="$XDG_RUNTIME_DIR")

i=0
readlink -f "$@" | \
while read -r path
do
    echo -n "file://${path}" > "${temp_dir}/uri${i}"
    rm "$path"
    i=$(( i + 1 ))
done

for hash in $(md5sum "${temp_dir}"/uri* | cut -d ' ' -f 1)
do
    find "$HOME/.cache/thumbnails" -type f -name "${hash}.png" -delete
done

rm -r "$temp_dir"
2 Likes

If all the files are from one directory, then only the basename differs and the loop can be over an echo command:

echo {constant uri pathname}/{varying basename} > uri.{basename}

echo is a shell built-in, and looping over it costs almost nothing.
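A minimal sketch of that shape (directory and file names are hypothetical; only the basename varies inside the loop):

```shell
# Sketch: constant URI prefix, varying basename. echo is a bash builtin,
# so the loop forks no processes. Paths here are made-up examples.
out=$(mktemp -d)
prefix="file:///home/user/Pictures"
for name in a.jpg b.jpg c.jpg; do
    echo -n "${prefix}/${name}" > "${out}/uri.${name}"
done
cat "${out}/uri.a.jpg"    # -> file:///home/user/Pictures/a.jpg
rm -r "$out"
```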

You won't get an argument from me - I love PHP, but it is ugly. I misunderstood what you are working on. If you are doing web development, then certainly PHP is a tool you should have.

WebAssembly I am not so sure about. It seems that after an initial period of rapid development, things have slowed down. The thing is, computers and JavaScript have become so fast that the need for binary code running in a browser seems less important than when WebAssembly first started. I think it will be a niche market, used only where performance is really important or for some esoteric C code that is not available in JavaScript.

BTW, if I were a young guy I would very carefully follow the progress that AI is making in writing code. I think the future for programmers may be drastically different from now. Much of the time I spend developing code is spent iterating over tweaks that I make to the code. I can maybe do 6 iterations an hour; an AI program can try 10,000 tweaks in an hour. AI isn't smart in the human sense, but the speed at which it can do things means brute force is viable. My guess is brute force will write programs that humans would never have come up with.

2 Likes

Well, I think nowadays it is very common to over-hype things thanks to social media. WebAssembly was portrayed as the replacement for JavaScript at the beginning, and it took a while until Medium posts stopped writing nonsense like that.
It's no magic bullet, but I'm very optimistic about WebAssembly, as it has uses just about everywhere, not just the browser - from desktop apps to mobile and server.
Although it's pretty cool to have things like Audacity, FFmpeg or even WordPress running directly in the browser.

Of course I agree about AI. I've used ChatGPT for a few tasks and it's mind-blowing... too much so, actually. That reminds me of this documentary about AI in general, highly recommended:

1 Like

I worked really hard on this and finally figured out how to do it with echo as the only command in the loop. All other commands are executed only once.

realpath *.jpg > paths.txt
readarray paths < paths.txt
i=0
for path in "${paths[@]}"
do
    path="${path%"${path##*[![:space:]]}"}"
    echo -n "file://${path}" > "uri-${i}.txt"
    i=$(($i+1))
done
md5sum uri*.txt > md5sums.txt
awk 'OFS="." {print $1,"png"}' < md5sums.txt | awk 'OFS="/" {print "/home/magenta/.cache/thumbnails/normal",$1}' > thumbnails.txt
readarray thumbnails < thumbnails.txt
ls -l ${thumbnails[*]}

It isn't a script yet, as I executed it directly from the command line, but it seems to work. I only ls the files instead of rm-ing them. Since echo is the only command inside the loop, I think it will be fast. This was incredibly difficult to do, as [ is an actual binary and I couldn't use it inside a loop. The

path="${path%"${path##*[![:space:]]}"}"

to strip trailing whitespace I found on Stack Overflow and would never have figured out on my own. Now I know why I gave up learning shell scripting 20 years ago - shell scripting is an invention of the devil to drive mortals insane!
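For what it's worth, the expansion can be unpacked in two steps: the inner ${path##*[![:space:]]} strips the longest prefix ending in a non-space character, leaving only the trailing whitespace, and the outer ${path%...} then removes exactly that suffix. A tiny demonstration:

```shell
# ${path##*[![:space:]]} expands to the trailing whitespace run
# ("  \n" in this example); ${path%...} then chops that suffix off.
path=$'  image.jpg  \n'
path="${path%"${path##*[![:space:]]}"}"
echo "[${path}]"    # -> [  image.jpg]
```

Note that leading whitespace is deliberately left alone; readarray keeps it as part of the path.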

It's late and nearing my sleep time, and there are some things I need to do before that, so I am quitting for today.

1 Like

Yeah, yikes... I think we can both agree this is much uglier than PHP :smiley: One of the reasons I used read before to process the filenames is that it handles whitespace automatically.

On the other hand, the results are impressive!

| Script version | real time | user time | sys time |
| --- | --- | --- | --- |
| Bash (inner loop) | 9.50 | 3.40 | 0.62 |
| Bash (find) | 9.46 | 3.09 | 0.55 |
| Bash (refactor) | 5.25 | 3.44 | 0.51 |
| Bash (tmpfs) | 3.16 | 1.66 | 0.48 |
| Bash (no_reps) | 0.93 | 0.41 | 0.15 |
| PHP | 0.13 | 0.01 | 0.05 |
| C | 0.11 | 0.00 | 0.03 |

I copied the script as is and only made a few minor adjustments: using tmpfs, same as in the last script of my earlier post, and adding a rm $path line to remove the original file as well.

I'm very surprised by this result. It seems Bash, while not ergonomic at all, is not that slow when handled carefully. Very well done.

1 Like

I did some things and instead of sleeping I finished it.

#!/bin/bash
tmp="${XDG_RUNTIME_DIR}/for_smart_rm"
mkdir "$tmp"
realpath "$@" > "${tmp}/paths.txt"
readarray -t paths < "${tmp}/paths.txt"
i=0
for path in "${paths[@]}"
do
    echo -n "file://${path}" > "${tmp}/uri-${i}.txt"
    i=$((i+1))
done
md5sum "${tmp}"/uri*.txt > "${tmp}/md5sums.txt"
dir="${HOME}/.cache/thumbnails/normal"
awk_arg="OFS=\"/\" {print \"${dir}\",\$1}"
awk 'OFS="." {print $1,"png"}' < "${tmp}/md5sums.txt" | awk "${awk_arg}" > "${tmp}/thumbnails.txt"
readarray -t thumbnails < "${tmp}/thumbnails.txt"
rm -f "${thumbnails[@]}"
rm "$@"
rm -r "$tmp"

If you can time this, I would appreciate it. readarray with -t handles the whitespace.

1 Like

This one is even more impressive, dangerously close to the PHP and even the C code:

| Script version | real time | user time | sys time |
| --- | --- | --- | --- |
| Bash (inner loop) | 9.50 | 3.40 | 0.62 |
| Bash (find) | 9.46 | 3.09 | 0.55 |
| Bash (refactor) | 5.25 | 3.44 | 0.51 |
| Bash (tmpfs) | 3.16 | 1.66 | 0.48 |
| Bash (no_reps) | 0.93 | 0.41 | 0.15 |
| Bash (no_reps2) | 0.19 | 0.04 | 0.07 |
| PHP | 0.13 | 0.01 | 0.05 |
| C | 0.11 | 0.00 | 0.03 |

That's almost a 4-5 times performance increase just by shifting things around a bit (and about 50 times faster than the very first script).

2 Likes

Actually, I am only handling the normal sub-directory of thumbnails (the other directories are almost empty on my system), so the time may be larger if those other directories are populated on your system.

This will probably be the last shell script I write in my lifetime. The previous one was written in the early 90's, 30 years ago, so if history is any guide I will be dead before I write the next one, which suits me just fine.

I think shell scripting should be left to the programming gods who write system or service init code. Mere mortals should install Python, Perl or, in my case, PHP. Not only are such scripts more efficient, since they do their work through library and system calls instead of executing external commands, but the languages also have reasonable string-processing capabilities.
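The same point shows up even inside Bash itself: anything the shell does internally (builtins, arithmetic expansion) is cheap, while anything that forks an external command is not. A small illustration, assuming bash:

```shell
# Builtin arithmetic: evaluated inside the shell, no fork per iteration.
# The old-style equivalent $(expr "$n" + "$i") would spawn a process
# every time through the loop.
n=0
for i in 1 2 3 4 5; do
    n=$((n + i))
done
echo "$n"    # -> 15
```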

Please forgive the rant, but I truly hate shell scripting. It totally defeated me in the early 90's when I was working as a Solaris sysadmin and had to write a lot of scripts. Thank god for Perl - it saved my job.

I think this is retribution for my having called PHP ugly. After this PHP looks like Morgan Fairchild.

1 Like

On my test machine the thumbnail directory being used was "large" instead, so that's something I had to change, but I don't think the file size makes much of a difference since thumbnails are quite small anyway.

I hear you, and can't say you can be blamed in the slightest for this decision :joy: Still, I'm glad you did write it, as I learned a few tricks from this exchange myself.

A point in favor of Python is that it comes installed by default on most Linux distributions, so it's easier for everyone to distribute and use scripts written in it.

1 Like

So, out of curiosity, I just had to try a Python script as well and compare the results. Pretty surprising, to be honest, although I know there are ways of optimizing Python a bit. Anyway, just for fun:

| Script version | real time | user time | sys time |
| --- | --- | --- | --- |
| Bash (inner loop) | 9.50 | 3.40 | 0.62 |
| Bash (find) | 9.46 | 3.09 | 0.55 |
| Bash (refactor) | 5.25 | 3.44 | 0.51 |
| Bash (tmpfs) | 3.16 | 1.66 | 0.48 |
| Bash (no_reps) | 0.93 | 0.41 | 0.15 |
| Python | 0.20 | 0.03 | 0.06 |
| Bash (no_reps2) | 0.19 | 0.04 | 0.07 |
| PHP | 0.13 | 0.01 | 0.05 |
| C | 0.11 | 0.00 | 0.03 |
#!/usr/bin/env python3

import os
import sys
import hashlib

THMB_CACHE_LOCATIONS = (
    os.environ['HOME'] + '/.cache/thumbnails/normal',
    os.environ['HOME'] + '/.cache/thumbnails/large'
)

def main():

    for file in sys.argv[1:]:
        abs_path = os.path.abspath(file)
        uri = 'file://' + abs_path
        hash = hashlib.md5(uri.encode('utf-8')).hexdigest()
        thumbnail = hash + '.png'

        os.unlink(abs_path)

        for dir in THMB_CACHE_LOCATIONS:
            try:
                os.unlink(dir + f'/{thumbnail}')
            except FileNotFoundError:
                pass

if __name__ == '__main__':
    sys.exit(main())
1 Like

There is an essential cost that cannot be reduced - the cost of unlinking the actual files. You can estimate this cost by timing

rm *.jpg

on the same number of files. The cost of the string processing - computing the md5sum and prepending and appending the path and suffix - is probably insignificant compared to the cost of unlinking. In other words, if the time is close to the time of "rm *.jpg", then there is probably little room for more improvement.
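One possible way to run that estimate (file names and count here are made up; time is the bash keyword, so the figures land on stderr):

```shell
# Sketch: create N dummy files, then time a bare rm on them to measure
# the irreducible unlink cost that any script version must pay.
d=$(mktemp -d)
for i in $(seq 1 1000); do
    : > "$d/img$i.jpg"          # empty stand-ins for real images
done
time rm "$d"/img*.jpg
rmdir "$d"
```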

Although PHP is an interpreted language, it is actually parsed and then executed from an optimized binary representation, or maybe even JIT-compiled (Python probably does the same). So if you increased the sample size, I believe the advantage the C program has would decrease, as the overhead is amortized over a larger sample. Correspondingly, if you decreased the sample size, the C program should look much better. Since C is compiled, it doesn't have the overhead of parsing the source code.

Actually, just time it on a sample size of 1 and a sample size of 1000 and you can get the cost of the overhead (this includes the cost of process creation and of initializing the interpreter).

With a single image, pretty much all scripts ran at the same speed (I only tried your latest version of the shell script). I would really like to see how it behaves with more images... but I will have to do this another time, once I've downloaded a bunch more.

| Script version | real time | user time | sys time |
| --- | --- | --- | --- |
| Python | 0.04 | 0.01 | 0.00 |
| rm *.jpg | 0.03 | 0.00 | 0.01 |
| PHP | 0.02 | 0.01 | 0.00 |
| Bash | 0.01 | 0.00 | 0.00 |
| C | 0.01 | 0.00 | 0.00 |

Note: these are timings for deleting a single image, except for the rm *.jpg command, which I used to delete 1060 of them.

I am surprised by these numbers. The cost of doing one file is probably dominated by the startup/shutdown cost; in other words, if you were to do two files the times probably wouldn't change much. Both Python and PHP should be penalized because they have the additional cost of parsing the source code and generating an optimized binary representation. I would have thought these costs would be more significant, but I guess this is mitigated by the very small size of the source files. Bash also has to parse the source, but I don't think it generates an optimized binary representation, which may save it some time (I am guessing here, as I don't follow shell development). Anyway, your numbers show that the cost of parsing and generating the optimized binary representation is quite small, at least for small source files.

BTW, PHP 8 does JIT compilation if you enable it in php.ini. I am running PHP 7.4 so I don't have it.

I don't think this will show anything interesting, because I was wrong: the cost of parsing and generating the optimized binary representation is relatively small, so it is amortized very quickly. The effects of amortization will only be noticeable for much smaller sample sizes. (In a factory, producing the first widget is very expensive. The unit cost of producing 1,000 widgets is more reasonable. The unit cost of producing 10,000 widgets is probably only slightly cheaper than that of producing 1,000, because the overhead has become insignificant.)

1 Like

Oh, I thought this was already enabled by default in PHP 8. I actually didn't check which version of PHP was running, but it was on a Debian 12 virtual machine, so I guess it was version 8; I didn't tweak anything.

php -i | grep -E '^opcache\.(enable_cli|jit|jit_buffer_size) '
opcache.enable_cli => On => On
opcache.jit => tracing => tracing
opcache.jit_buffer_size => 50M => 50M

If JIT is enabled the opcache.jit* options should be displayed.

from PHP 8.0 is here! How to turn on JIT compiling

So, did you actually run these tests on Debian 12 (Bookworm), not Zorin 16.2? How is Debian 12? I have heard really good things about it. I have always decided against it because its applications were always so far behind the current releases. But Zorin has only PHP 7.4 in its repository, and PHP 7.4 has been obsolete (i.e., unsupported) since the fall of 2022.

Yes, I used Debian 12 with 4 GB of RAM, but not on physical hardware. Although I have another computer where I recently installed Debian 12 and it works pretty well. I can also tell it boots up pretty quickly compared to EndeavourOS, which I used previously.

Coming from openSUSE Tumbleweed, though, which has the latest packages, you can definitely tell the difference with Debian in that regard. But I don't really need the latest versions of most of the programs I use, so this isn't really an issue. The only exception is Neovim, so I compile that from source, since it's not that big of a program. And for many other things, like running different versions of Python, PHP or Node.js, I just use Docker.
I'm planning on installing Debian on my main machine at some point so I guess I will find out if I end up missing some niceties from Ubuntu repos.

And by the way, Debian 12 has PHP 8.2 in the repositories, so that's what I used for the tests. I guess it makes sense given it's the current version, but it's also interesting how, for things like Node.js, it carries severely outdated versions dating back ~5 years or more.

1 Like

Thank you very much for the info. Although I like Zorin, I am not too happy that it has only PHP 7.4 in its repository - 7.4 has been unsupported since November of last year. I know I can compile 8.2 from source, but if I do that, then maintaining it becomes my responsibility, not apt's, and I am lazy. I am a recent user of Zorin, coming from MX Linux. MX Linux will soon release MX 23, which should be based on Debian 12, and I think I will be returning to MX Linux so I can get PHP 8.2. I will be moving from Zorin in a day or two, so I may not talk with you again, and I want to thank you for your time and effort - I learned a lot, and I love to learn new things.

1 Like

Likewise. This has been very productive and quite fun!

The forum is still available should you ever want to pass by, even if you are not running ZorinOS.