Shashimee - 1 month ago
PHP Question

Regex for capturing smallest group

I am trying to capture the ID of a PDF Page object that looks like this:

4 0 obj
<<
/Type /Page /
...
>>
endobj


The ID is the number at the start of the 'ID 0 obj' line. The problem is that my file has multiple objects, so the following pattern captures everything from the first object declaration up to the first Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);


Here is a sample of my file if you want to try it out. You will see that there are multiple objects that include the word 'Page':

%PDF-1.3
%¦¦¦¦

1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf


What should I change so that the match is not greedy?

EDIT: Clarifications


  • I forgot to mention that I need to capture all of the Page object IDs.

  • Some people told me to use a more specific regex, so I have to say that the sample above is not a formal example of how objects are built; the one below is also possible. You can see that the spaces are not mandatory and that there can be multiple tags before the '/Type /Page' tag.



Example :

4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj



  • There are tags called Pages, PageLayout and SinglePage, and I don't want to capture them.


Answer Source

You may use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

See the regex demo

Details:

  • ^ - start of a line anchor (as m modifier makes ^ match start of a line and not of a whole string)
  • (\d+) 0 obj - 1 or more digits (captured into Group 1), then space, 0, space and an obj substring
  • (?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start a ^\d+ 0 obj$ pattern, as few times as possible
  • \/Type\s*\/Page\s - /Type, 0+ whitespaces (replace \s with \h to only match horizontal whitespace), /Page and then a whitespace
  • .*? - any 0+ chars, as few as possible (the s modifier lets . match line breaks too), up to the first occurrence of
  • endobj$ - endobj followed by the end of a line
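
To see the pattern at work, here is a minimal sketch that runs it with preg_match_all on a shortened version of the sample file. Object 6, object 7 and the helper name pageObjectIds are made up for illustration and are not part of the question. The second pattern, $compactPattern, is only an assumed variant and not part of the answer above: it swaps the \s after /Page for a negative lookahead, which seems to be what the compact '/Type/Page/' form from the question's edit would need.

```php
<?php
// Shortened sample. Objects 3 and 4 follow the question's file;
// objects 6 and 7 are hypothetical additions for this sketch.
$input = <<<'PDF'
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

6 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj

7 0 obj
<< /UselessTag/Type/Page/Parent 3 0 R
>>
endobj
PDF;

// Pattern from the answer: requires a whitespace character right after /Page.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';

// Assumed variant (not from the answer): /Page must not be followed by a
// letter, so /Pages is still rejected but /Type/Page/ is accepted.
$compactPattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page(?![A-Za-z]).*?endobj$~sm';

// Hypothetical helper: returns the Group 1 captures (the object IDs).
function pageObjectIds(string $pattern, string $input): array
{
    preg_match_all($pattern, $input, $matches);
    return $matches[1];
}

$ids = pageObjectIds($pattern, $input);               // ['4', '6']
$idsCompact = pageObjectIds($compactPattern, $input); // ['4', '6', '7']

print_r($ids);
print_r($idsCompact);
```

In both cases object 3 (/Type /Pages) is skipped, and the tempered greedy token stops a failed attempt at the next 'N 0 obj' line instead of letting the match run on into a later object.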