Can't figure out how to extract all files from a compressed .zim file (ZIM, not ZIP)

.zim files are used to download entire websites or wikis and navigate them offline. I have a zim file, but I need to extract all of its files into a folder. I've spent a very long & frustrating time trying to install libzim and zimdump, even tried ChatGPT for help, & also tried writing many Python scripts.

Nothing works. I have managed to get libzim & all its dependencies installed though, which is the brunt of the work.

The last command I successfully ran was ninja -C build install

But how do I run libzim, or zimdump?

-Thanks

Here are some of my notes & helpful links:
https://sourceforge.net/p/kiwix/discussion/604122/thread/c4a75e17/

https://www.openzim.org/wiki/Zimdump

https://wiki.openzim.org/wiki/Libzim

https://www.reddit.com/r/Kiwix/comments/tfo761/is_there_a_way_to_extract_images_and_text_from/

https://zim-wiki.org/downloads/

Typically, anything named lib-whatever is a development library, meant to be used to develop software, not used directly.

I found this link that contains the pre-compiled binaries for the zim utilities, including zimdump, which you can use to extract files from a .zim file.

https://download.openzim.org/release/zim-tools/

You want to download zim-tools_linux-x86_64-3.2.0-2.tar.gz. Since it's a regular tarball, you can extract it normally, and inside you'll find a bunch of executables, I think in AppImage format, ready to run.
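
For example, downloading and unpacking it might look something like this (a rough sketch; I'm assuming the tarball unpacks into a directory named after itself, so adjust to whatever tar actually creates):

wget https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.2.0-2.tar.gz
tar -xzf zim-tools_linux-x86_64-3.2.0-2.tar.gz
ls zim-tools_linux-x86_64-3.2.0-2/   # assumed directory name, check what tar actually produced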

Mandatory disclaimer that running executable files can be dangerous, especially from untrusted sources. You do this at your own risk.

As described in the documentation page, you can run it as follows:

<path_to_executable> dump --dir <path_to_extract_files_to> -- <zim_file_to_extract>

For example:

./zimdump dump --dir . -- edutechwiki_en_all_maxi_2021-03.zim

In this example I'm extracting the files into the current directory, but I would suggest using a separate one, as there may be quite a lot of them. Just make sure you create that directory upfront.
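
For instance, assuming the zimdump binary is sitting next to the .zim file, something along these lines:

mkdir extracted   # create the target directory first
./zimdump dump --dir extracted -- edutechwiki_en_all_maxi_2021-03.zim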

2 Likes

Thank you, this actually worked. Finally worked. But when I look at any of the images that were extracted to their folders, none of them open; they seem corrupt and have no thumbnails. I'm also running wget right now; it's been running & crawling this same site for about an hour, and it has no problem finding all the images & I can open all of them easily. Wget is still running, isn't done.

Before I started wget, I had earlier used https://youzim.it to archive the website offline into the .zim file in question.


I've never used this software (in fact, I've never heard of the .zim extension at all) so I can't really say what the expected outcome is... Can you try with another website and see if that works? For reference, the file I'm using as an example is from this site (scroll down a bit, the bright green button):

https://edutechwiki.unige.ch/en/EduTechWiki_offline

With this example at least, it seems that most images, except small icons in .gif, .svg and similar formats, have been converted into .webp format, which is specifically designed for the web. Most if not all browsers these days can handle .webp files just fine. Can you try dragging and dropping one of them onto your browser and see if it opens?
If that's the case, you may need to convert them back into something a little friendlier or add webp support to your default image viewer.


Notice how the file extension is .png.webp. The file extension doesn't actually matter on Linux, but it suggests that these images were originally PNGs and have been converted. The file command confirms those are indeed webp images, mostly the ones inside the I directory.
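
If you want to check that yourself, something like this should do it (the I directory name comes from your dump, the rest is just a sketch):

file I/* | head
find . -type f -exec file -b --mime-type {} + | sort | uniq -c   # count all extracted files by their actual type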

If you want to convert them back to something else you can use ImageMagick, something like:

convert <img>.webp <new_img>.png

It may be a little tedious to do for all of them, but let's see first if this makes any difference. Try with other websites too, just to double check.
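
If it does help and you end up having to convert the whole lot, a rough sketch using find (this keeps the originals and writes a .png copy next to each one, so a.png.webp becomes a.png.png):

find . -type f -name '*.webp' -print0 | while IFS= read -r -d '' f; do
    convert "$f" "${f%.webp}.png"
done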

1 Like

@zenzen, thanks for the advice. I really appreciate it. When I extracted the zim file from https://youzim.it, I later realized it creates two copies of the images: one that is corrupt and one that works/opens. The ones that work, youzim/zim puts in folder "A", and the corrupt mirrors go in folder "H". I have no clue why, but the file paths & files are otherwise identical mirrors. It took some time to figure out that all of them do work, just not the corrupt copies.

https://youzim.it managed to webcrawl & rip 339 MB of images from the website with its default settings, whereas GNU wget in the terminal with its default settings, which crawls 5 layers deep, took hours & ripped 1.4 GB of images from the same website.

I still haven't found the two photos from the news site I'm looking for, even after skimming through ALL 9000 of the wget images. Neither method gave me an error log though, so I'm not sure if there were any errors.

If I knew more about wget or youzim, I'd set the settings to dig deeper.
And with wget, I can probably tell it to ignore already-ripped images.

ChatGPT 3.5 appears to be very good at helping me with this. Instead of just scanning for jpg & default image types, next time I'm going to scan for all image file types, go 7 layers deep instead of 5, and also tell it to ignore all previously saved/ripped images. This is the command ChatGPT gave me that I will try:

wget -nd -r -l 7 -nc -P /save/location -A "*.jpg,*.jpeg,*.jpe,*.jfif,*.png,*.gif,*.bmp,*.dib,*.ico,*.tif,*.tiff,*.tga,*.webp,*.svg,*.svgz,*.eps,*.raw,*.cr2,*.nef,*.orf,*.sr2,*.arw,*.rw2,*.raf" --reject-regex ".*\.(txt|pdf)" http://www.somedomain.com 2>&1 | tee error.log

In this command, the --output-file option is not used. Instead, we redirect both stdout and stderr to the console using 2>&1, and then pipe (|) the output to the tee command. tee both displays the output on the console and saves it to the error.log file.

With this command, wget will download the images and scan the website while creating an error log in the error.log file. Any error messages or explanations for failed downloads will be captured in the log file for your reference.
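
Once the run finishes, a quick way to skim the log for problems could be something like the following (the patterns are just a guess at what wget reports):

grep -inE 'error|failed|404' error.log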

Perhaps the issue is with Zimit itself, or with the website you are trying to crawl. According to OpenZim and related links, the .zim format is specifically designed for wikis, so it may have trouble with other sites that don't follow a particular structure.

There are other ways and tools listed on the openzim page for creating your own zim files:

Check out the wget-2-zim script for some ideas for your wget command, or use it directly instead.

1 Like

Thanks!

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.