Some PDFs render correctly, but extracted text has broken Unicode because of embedded subset TrueType fonts and missing or invalid ToUnicode / font encoding.
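As a rough way to triage affected files, one option is to score the extracted text for characters that usually indicate a failed mapping: U+FFFD, Private Use Area code points, unassigned code points, and control characters. The sketch below is a heuristic only; the function name `suspicious_ratio` and any cutoff applied to its result are assumptions for illustration, not part of any PDF library.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like mapping failures."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cn = unassigned, Cc = control; U+FFFD = replacement character
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(chars)
```

A page whose ratio is well above zero is a candidate for closer inspection of its fonts. A low ratio does not prove the text is correct, since a wrong ToUnicode entry can still map to ordinary letters.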
We also need a reliable way to detect micro-spaces or invisib… the embedded font program, glyph names, or font encoding? Is… example PDF character code → glyph id → Unicode? Should we attach…
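For the character code → Unicode step, when a font does carry a ToUnicode stream, its `bfchar` and `bfrange` sections can be read into a lookup table and checked for gaps. The following is a minimal sketch that assumes the stream has already been decompressed to text and that destinations are UTF-16BE hex strings; `parse_tounicode` is a hypothetical helper name, and the code does not cover glyph ids or fonts with no ToUnicode stream at all.

```python
import re

_HEX = r"<([0-9A-Fa-f]+)>"
_PAIR = re.compile(_HEX + r"\s*" + _HEX)
_RANGE_ARRAY = re.compile(_HEX + r"\s*" + _HEX + r"\s*\[(.*?)\]", re.S)
_RANGE_SIMPLE = re.compile(_HEX + r"\s*" + _HEX + r"\s*" + _HEX)


def _decode(hex_str: str) -> str:
    # PDF hex strings with an odd digit count are padded with a trailing 0.
    if len(hex_str) % 2:
        hex_str += "0"
    return bytes.fromhex(hex_str).decode("utf-16-be", errors="replace")


def parse_tounicode(cmap_text: str) -> dict[int, str]:
    """Map character codes to Unicode strings from a decompressed ToUnicode CMap."""
    mapping: dict[int, str] = {}
    # bfchar: one source code per destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in _PAIR.findall(block):
            mapping[int(src, 16)] = _decode(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        # Array form: <lo> <hi> [<dst0> <dst1> ...], one destination per code.
        for lo, hi, arr in _RANGE_ARRAY.findall(block):
            for i, dst in enumerate(re.findall(_HEX, arr)):
                if int(lo, 16) + i <= int(hi, 16):
                    mapping[int(lo, 16) + i] = _decode(dst)
        # Drop array entries so the simple form does not match inside them.
        rest = _RANGE_ARRAY.sub("", block)
        # Simple form: <lo> <hi> <dst>, destination incremented per code.
        for lo, hi, dst in _RANGE_SIMPLE.findall(rest):
            base = _decode(dst)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping
```

Codes that appear in the content stream but not in the resulting table, or that map to U+FFFD or Private Use Area values, are the ones that would need a fallback such as glyph names or the font's own tables.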