BOORU CHARS volume 2023 completes an attempt to consolidate and arrange the available character-centric,
mostly SFW anime/CG/game art into a localized format suited to both batch processing and visual evaluation.
The whole evolved project consists of (in release order):
It is strongly recommended to read the READMEs there and, of course, to download and seed them.
Almost 4M carefully selected samples (~1.2TB) are ready, so let’s use them for something meaningful!
This release covers:
- ~98% new images from composite rips
- some old imageboard material forgotten in BC2015 (internal partition 2016)
- ~20% “the best of” the Dark Pixiv Collection project 202209 (included in the 2016 partition)
- pixiv.sfs treated as an “imageboard”; its long image ID includes the artist ID, post ID and post version
- filtered by minimum size and volume
- semi-automatic NSFW cleanup done
- deduplicated against all other BCs
As in the project as a whole:
- files are uniquely identified by the (booru + fid) key: imageboard name and post ID
verbose file naming is used: %booru% - %fid% - %up-to-3-copyrights% ~ %up-to-5-characters% (%up-to-2-artists%)
- clustered by aspect ratio, priorities from high to low: 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40%
- (as with the composite rips) images converted to JPG
- resampled to 1280px on the longest side (1024px for 1x1)
- re-mogrified to 94% from 98-100% JPEG quality
- imageboard tags arranged and partially placed inside the image EXIF info
- some general image statistics obtained with ImageMagick
- content analysis basically the same as for BC2022, but with more advanced software and models
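The aspect ratio priorities and the resampling limits listed above can be sketched as follows. This is only an illustration, not the tooling actually used for the release: it assumes that each tolerance is a relative deviation of width/height from the target ratio, that the first matching cluster in priority order wins, and that images are never enlarged. The names CLUSTERS, aspect_cluster and sampled_size are hypothetical.

```python
from typing import Optional, Tuple

# (name, target width/height ratio, relative tolerance), highest priority first.
# Values are taken from the list above; the matching rule itself is an assumption.
CLUSTERS = [
    ("7x10", 7 / 10, 0.04),
    ("3x4", 3 / 4, 0.10),
    ("1x1", 1.0, 0.20),
    ("3x2", 3 / 2, 0.40),
    ("2x3", 2 / 3, 0.40),
]


def aspect_cluster(width: int, height: int) -> Optional[str]:
    """Return the first cluster whose tolerance band contains width/height."""
    ratio = width / height
    for name, target, tolerance in CLUSTERS:
        if abs(ratio - target) <= tolerance * target:
            return name
    return None  # outside every band


def sampled_size(width: int, height: int) -> Tuple[int, int]:
    """Scale so the longest side is at most 1280px (1024px for 1x1)."""
    limit = 1024 if aspect_cluster(width, height) == "1x1" else 1280
    # Assumption: smaller images are kept as they are, never upscaled.
    scale = min(1.0, limit / max(width, height))
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    for size in [(1400, 2000), (2000, 2000), (1920, 1080), (3000, 1000)]:
        print(size, aspect_cluster(*size), sampled_size(*size))
```

Because 7x10 is checked before 3x4, an image with a ratio of 0.70 lands in 7x10 even though it also falls inside the wider 3x4 band.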
This release contains BC2023 itself:
- 1.153.513 sampled images clustered by aspect ratio and by the number of heads detected
(0 heads = letter A, 2 = B, 3+ = C, 1 = letter E in the folder name),
ordered and grouped into ~1000/2000-th zips/folders by the “attractiveness score function”
- tab-separated text files zipped into one archive:
- BC_2023.tsv file/image related metadata
- BC_2023_tags.tsv tag list with Danbooru enrichment, 25.250.897 rows
- BC_2023_yolo.tsv 3.877.682 detailed results for torso detection
- a dedicated bc_readme.txt with a detailed description and examples
The release also contains a huge crossBOORU catalog of URLs, tags and other metadata (partitioned by the first letter of the MD5 hash, zipped):
- BOORU_*.tsv 17.733.350 items (not only images) identified by MD5, with 35.033.097 (usually redundant) URLs
- BOORU_*_TG.tsv correlated artist / copyright / character tag list, 63.900.184 rows
- BOORU_TG.tsv tag registry, 1.014.481 tags, zipped
- a separate booru_readme.txt with a detailed description and examples
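Two lookups implied by the lists above are the folder letter for a given head count and the catalog partition for a given MD5. A minimal sketch follows. The letter mapping comes straight from the description. The partition file name is an assumption: it supposes that the “*” in BOORU_*.tsv is the first hex digit of the MD5 in lower case, which should be checked against booru_readme.txt. All function names are hypothetical.

```python
import hashlib


def head_letter(heads: int) -> str:
    """Folder-name letter for a detected head count: 0=A, 1=E, 2=B, 3+=C."""
    if heads < 0:
        raise ValueError("head count cannot be negative")
    if heads == 0:
        return "A"
    if heads == 1:
        return "E"
    if heads == 2:
        return "B"
    return "C"


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw file content."""
    return hashlib.md5(data).hexdigest()


def catalog_partition(md5: str) -> str:
    """Guess the catalog file that holds an item with this MD5.

    Assumption: partitions are named after the first hex digit, lower case.
    """
    return f"BOORU_{md5[0].lower()}.tsv"


if __name__ == "__main__":
    print(head_letter(1), head_letter(4))
    print(catalog_partition(md5_hex(b"")))
```

Since the catalog items are identified by MD5, hashing a local file is enough to find the partition to search, without scanning every catalog file.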
Similarly to BC2015 and BC2022:
- simple numerical ranks have been built across clusters of images for each numerical criterion,
so both outlier processing and ranking use only relative ranks
- “the worst of” outliers were deleted (rank by rank, ~2% in total)
- the “attractiveness score function” finally came down to the definition “colorful and textless”
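The rank-based outlier removal described above can be illustrated roughly as below. This is a sketch under assumptions, not the procedure actually used: it treats lower values as worse, scales ranks to the range 0..1 within one cluster, and drops an image if it falls into the worst fraction on any single criterion. The criterion names in the example are invented.

```python
from typing import Dict, List, Sequence


def relative_ranks(values: Sequence[float]) -> List[float]:
    """Rank values within one cluster, scaled to 0.0 (lowest) .. 1.0 (highest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    denominator = max(len(values) - 1, 1)
    for position, index in enumerate(order):
        ranks[index] = position / denominator
    return ranks


def drop_worst(criteria: Dict[str, Sequence[float]], fraction: float) -> List[int]:
    """Return indices of items that survive after removing, criterion by
    criterion, every item whose relative rank is below `fraction`."""
    count = len(next(iter(criteria.values())))
    keep = set(range(count))
    for values in criteria.values():
        ranks = relative_ranks(values)
        keep -= {i for i, rank in enumerate(ranks) if rank < fraction}
    return sorted(keep)


if __name__ == "__main__":
    # Hypothetical per-image measurements for a single cluster.
    cluster = {
        "colorfulness": [5, 1, 9, 7, 3],
        "sharpness": [2, 8, 6, 0, 4],
    }
    print(drop_worst(cluster, fraction=0.2))
```

Working on ranks rather than raw values means criteria with very different scales can be handled by the same threshold.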