Mailing list archives

From "Guillaume Smet (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
Date Tue, 19 Aug 2008 06:21:44 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623572#action_12623572 ]

Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Andrzej,

This problem is also fixed in the non-Apache repository of PDFBox (on sf.net; the link I
posted is from the sf.net CVS tree). However, I don't know whether you can build and ship an
unreleased version of PDFBox under the ASF release policy.

Even if we can't solve it in the Nutch tree right now, the problem is now documented, and people
can apply the fix themselves quite easily.

Regards,

-- 
Guillaume



> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff
>
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.
> This behaviour is implemented with the following code:
>       if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         //Just try using the default password and move on
>         decryptor.decryptDocument("");
>       }
> It uses a deprecated API and, moreover, there seems to be a bug in PDFBox in this deprecated API: a ClassCastException is thrown inside PDFBox, as the following log shows:
> 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> Using the new security API, we don't have any error parsing this document and we can get its content:
>       if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>       }
> I attached the patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API.
> Regards,
> -- 
> Guillaume
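
For readers unfamiliar with this kind of failure, the ClassCastException in the quoted log is what the JVM raises when code downcasts an object to a subclass it is not actually an instance of. The following is a minimal, self-contained sketch of that mechanism only. The classes EncryptionDictionary and StandardEncryption are hypothetical stand-ins, not the PDFBox classes, and the subclass relationship between them is an assumption made for illustration; the real cause inside DocumentEncryption.decryptDocument may differ in detail.

```java
public class CastDemo {
    // Hypothetical stand-in for a generic encryption dictionary (NOT the PDFBox class).
    static class EncryptionDictionary {
    }

    // Hypothetical stand-in for a more specific subclass (NOT the PDFBox class).
    static class StandardEncryption extends EncryptionDictionary {
        String describe() {
            return "standard";
        }
    }

    // Unchecked downcast: compiles fine, but throws ClassCastException at
    // runtime when the argument is only the base type.
    static String uncheckedCast(EncryptionDictionary dict) {
        StandardEncryption std = (StandardEncryption) dict;
        return std.describe();
    }

    // Defensive variant: test the runtime type before casting.
    static String checkedCast(EncryptionDictionary dict) {
        if (dict instanceof StandardEncryption) {
            return ((StandardEncryption) dict).describe();
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        EncryptionDictionary plain = new EncryptionDictionary();
        System.out.println(checkedCast(plain));
        try {
            uncheckedCast(plain);
        } catch (ClassCastException e) {
            System.out.println("caught ClassCastException");
        }
    }
}
```

The caller cannot guard against such a cast when it happens inside a library, which is presumably why switching to the non-deprecated API, as in the patch, is the practical fix on the Nutch side.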

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

