@zenzen, thanks for the advice, I really appreciate it. When I extracted the zim file from https://youzim.it, I later realized it creates two copies of each image: one that is corrupt and one that opens fine. The ones that work get put in folder "A" and the corrupt mirrors go in folder "H". I have no clue why, since the file paths and names are identical. It took some time to figure out that all of the images do work, just not the corrupt copies.
![image](https://forum.zorin.com/uploads/default/original/3X/d/6/d6ae4aa3f1f56db903b0eee394e1cbc94608fda4.png)
With its default settings, https://youzim.it managed to webcrawl and rip 339 MB of images from the website. GNU wget in the terminal, also on default settings (which crawl 5 levels deep), took hours and ripped 1.4 GB of images from the same website.
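(For reference, wget's recursive mode stops at 5 levels unless you tell it otherwise, so the default run I did is roughly equivalent to spelling out `-l 5`. A rough sketch, with /save/location as a placeholder save folder:)

```bash
# wget's default recursion depth is 5, so these two commands behave the same:
wget -r -P /save/location http://www.somedomain.com
wget -r -l 5 -P /save/location http://www.somedomain.com
```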
I still haven't found the two photos from the news site I'm looking for, even after skimming through ALL 9,000 of the wget images. Neither method gave me an error log, though, so I'm not sure whether there were any errors.
If I knew more about wget or youzim.it, I'd set them to dig deeper. And with wget I can probably tell it to ignore images that have already been ripped.
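(From what I can tell, the wget flag for that is `-nc` / `--no-clobber`. A rough sketch, again with /save/location as a placeholder for wherever the earlier run saved its files:)

```bash
# -nc / --no-clobber: skip any file that already exists locally instead of downloading it again
wget -r -nd -nc -P /save/location -A "*.jpg,*.jpeg,*.png" http://www.somedomain.com
```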
ChatGPT 3.5 appears to be very good at helping with this. Next time, instead of scanning only for jpg and the default image types, I'm going to scan for all image file types, go 7 levels deep instead of 5, and tell it to ignore images that were already saved/ripped. This is the command I'm going to try, based on what ChatGPT gave me:
wget -nd -r -l 7 -nc -P /save/location -A "*.jpg,*.jpeg,*.jpe,*.jfif,*.png,*.gif,*.bmp,*.dib,*.ico,*.tif,*.tiff,*.tga,*.webp,*.svg,*.svgz,*.eps,*.raw,*.cr2,*.nef,*.orf,*.sr2,*.arw,*.rw2,*.raf" --reject-regex ".*\.(txt|pdf)" http://www.somedomain.com 2>&1 | tee error.log
In this command, the `--output-file` option is not used. Instead, `2>&1` redirects stderr into stdout so that both streams go through the pipe (`|`) to the `tee` command. `tee` both displays the output on the console and saves it to the `error.log` file.

With this command, wget will crawl the website and download the images while writing a log to `error.log`. Any error messages or explanations for failed downloads will be captured in that log for reference.
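Once it finishes, I figure I can grep that log for failures instead of reading the whole thing, something like:

```bash
# show only the lines from the tee'd log that mention an error or a failed download
grep -iE "error|failed|404" error.log
```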