fastpdf2png: PDF to PNG at 1,500 pages/s with SIMD and PDFium

Source: DEV Community

What My Project Does

I was working on a document extraction pipeline and got frustrated with how slow PDF to PNG conversion was. PyMuPDF, MuPDF, ImageMagick: none of them were fast enough when you're processing thousands of documents.

So I wrote fastpdf2png. It uses PDFium (the PDF engine from Chrome) under the hood, with a custom PNG encoder that uses SIMD instructions and a patched compression library. It also detects when a page is grayscale and outputs 8-bit PNGs automatically.

    pip install fastpdf2png

    import fastpdf2png
    images = fastpdf2png.to_images("doc.pdf", dpi=150, workers=4)

Target Audience

Anyone dealing with PDFs at scale: data pipelines, ML preprocessing, document management, that kind of thing.

Comparison

I benchmarked everything I could find at 150 DPI, single process. fastpdf2png does 323 pg/s, MuPDF does 37, PyMuPDF 30, and ImageMagick 2.9. With 8 workers it gets to about 1,500 pg/s. Output files end up smaller too because of the grayscale detection.

https://github.com
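
To make the grayscale detection idea concrete, here is a minimal pure-Python sketch. It is not fastpdf2png's actual code (the post says the real encoder is native and uses SIMD); it only illustrates the check. It assumes the rendered page is a packed 8-bit RGB buffer, and the names is_grayscale and to_gray8 are hypothetical helpers invented for this example.

```python
def is_grayscale(rgb: bytes) -> bool:
    """Return True if every pixel in a packed RGB buffer has R == G == B."""
    if len(rgb) % 3 != 0:
        raise ValueError("expected packed 8-bit RGB data")
    # Strided slices extract the R, G and B planes. Comparing whole planes
    # avoids a Python-level loop over individual pixels.
    r, g, b = rgb[0::3], rgb[1::3], rgb[2::3]
    return r == g and g == b


def to_gray8(rgb: bytes) -> bytes:
    """Collapse a grayscale RGB buffer to one byte per pixel."""
    if not is_grayscale(rgb):
        raise ValueError("page contains color pixels")
    # All three channels are identical, so any one of them is the gray plane.
    return rgb[0::3]


# Two tiny 2-pixel "pages": one pure gray, one with a red pixel.
gray_page = bytes([10, 10, 10, 200, 200, 200])
color_page = bytes([10, 10, 10, 200, 0, 0])

print(is_grayscale(gray_page))
print(is_grayscale(color_page))
print(len(to_gray8(gray_page)), "bytes instead of", len(gray_page))
```

A gray page stored this way needs one byte per pixel instead of three before compression, which is consistent with the smaller output files mentioned above.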
