Match against all 8 PNG file signature bytes #190

Open: wants to merge 2 commits into base: master
19 changes: 14 additions & 5 deletions filetype/types/image.py
@@ -95,8 +95,14 @@ def __init__(self):

     def match(self, buf):
         if (len(buf) > 8 and
-            buf[:8] == bytearray([0x89, 0x50, 0x4e, 0x47,
-                                  0x0d, 0x0a, 0x1a, 0x0a])):
+            buf[0] == 0x89 and
Contributor

startswith() is the fastest ;)

Contributor

Hmm, indeed (Python 3.12.5, Windows 10; I don't know how these tests will perform on older versions of Python):

python -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf[:8] == bytearray([0x89, 0x50, 0x4e, 0x47,0x0d, 0x0a, 0x1a, 0x0a])"
1000000 loops, best of 5: 232 nsec per loop

python -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf[0] == 0x89 and buf[1] == 0x50 and buf[2] == 0x4E and buf[3] == 0x47 and buf[4] == 0x0D and buf[5] == 0x0A and buf[6] == 0x1A and buf[7] == 0x0A"
1000000 loops, best of 5: 208 nsec per loop

python -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf.startswith(b'\x89PNG\r\n\x1a\n')"
5000000 loops, best of 5: 93.2 nsec per loop

And some "what if" testing:

python -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf.startswith(bytearray([0x89, 0x50, 0x4e, 0x47,0x0d, 0x0a, 0x1a, 0x0a]))"
1000000 loops, best of 5: 251 nsec per loop

python -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf[:8] == b'\x89PNG\r\n\x1a\n'"
5000000 loops, best of 5: 44.5 nsec per loop

python -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf[0:1] == b'\x89' and buf[1:2] == b'P' and buf[2:3] == b'N' and buf[3:4] == b'G' and buf[4:5] == b'\r' and buf[5:6] == b'\n' and buf[6:7] == b'\x1a' and buf[7:8] == b'\n'"
1000000 loops, best of 5: 398 nsec per loop
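
The one-liners above can also be collected into a single script. This is a minimal sketch using the standard timeit module; the candidate labels, the run_benchmarks helper and the repetition count are choices made here for illustration and do not come from the PR. Absolute numbers will vary with the machine and the Python version.

```python
import timeit

# The 8-byte PNG file signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Expressions taken from the comments above; the labels are illustrative.
# Raw strings keep the escapes intact so timeit compiles real bytes literals.
CANDIDATES = {
    "slice == bytearray": r"buf[:8] == bytearray([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])",
    "slice == bytes": r"buf[:8] == b'\x89PNG\r\n\x1a\n'",
    "startswith(bytes)": r"buf.startswith(b'\x89PNG\r\n\x1a\n')",
    "per-byte ints": (
        "buf[0] == 0x89 and buf[1] == 0x50 and buf[2] == 0x4e and buf[3] == 0x47 and "
        "buf[4] == 0x0d and buf[5] == 0x0a and buf[6] == 0x1a and buf[7] == 0x0a"
    ),
}


def run_benchmarks(buf, number=200_000):
    """Return {label: nanoseconds per evaluation} for each candidate."""
    namespace = {"buf": buf}
    results = {}
    for label, expr in CANDIDATES.items():
        # Keep the best of several repeats, as the timeit CLI does.
        best = min(timeit.repeat(expr, globals=namespace, number=number, repeat=3))
        results[label] = best / number * 1e9
    return results


if __name__ == "__main__":
    for label, nsec in run_benchmarks(PNG_SIGNATURE).items():
        print(f"{label:20s} {nsec:8.1f} nsec per loop")
```

Passing a different buffer, for example b'FOOBARFOOBAR', to run_benchmarks gives the non-matching case.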

Contributor

Try b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a' instead of bytearray([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]).

Contributor

with what?

Contributor

$ python3 -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf.startswith(bytearray([0x89, 0x50, 0x4e, 0x47,0x0d, 0x0a, 0x1a, 0x0a]))"
2000000 loops, best of 5: 173 nsec per loop
$ python3 -m timeit -s "buf = b'\x89PNG\r\n\x1a\n'" "buf.startswith(b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a')"
5000000 loops, best of 5: 60 nsec per loop
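
The gap between these two timings is consistent with the bytearray being rebuilt from a list on every evaluation, whereas a bytes literal is a constant. A small sketch of keeping the signature as a module-level bytes constant follows; PNG_SIGNATURE and is_png are hypothetical names, not part of filetype, and the sketch assumes buf is a bytes or bytearray object.

```python
# Hypothetical module-level constant: created once, reused on every call.
PNG_SIGNATURE = b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a"


def is_png(buf):
    """Return True if buf begins with the full 8-byte PNG signature.

    Assumes buf is bytes or bytearray; both provide startswith().
    """
    # startswith() returns False for buffers shorter than the prefix,
    # so no separate length check is needed here.
    return buf.startswith(PNG_SIGNATURE)


print(is_png(b"\x89PNG\r\n\x1a\n" + b"rest of file"))  # True
print(is_png(bytearray(b"\x89PNG")))                   # False
```

The bytes constant compares equal to the bytearray used in the original code, so the two spellings describe the same signature.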

Contributor

Perhaps the speedup comes not from the comparison function but from the data transformation:

$ python3 -m timeit -s "s=0; buf = b'\x89PNG\r\n\x1a\n' + b'x'*100000" "for i in range(100000): s+=1 if buf[0] == b'\x89' and buf[1] == b'P' and buf[2] == b'N' and buf[3] == b'G' and buf[4] == b'\r' and buf[5] == b'\n' and buf[6] == b'\x1a' and buf[7] == b'\n' else 0"
100 loops, best of 5: 3.65 msec per loop

$ python3 -m timeit -s "s=0; buf = b'\x89PNG\r\n\x1a\n' + b'x'*100000" "for i in range(100000): s+=1 if buf[0:1] == b'\x89' and buf[1:2] == b'P' and buf[2:3] == b'N' and buf[3:4] == b'G' and buf[4:5] == b'\r' and buf[5:6] == b'\n' and buf[6:7] == b'\x1a' and buf[7:8] == b'\n' else 0"
10 loops, best of 5: 28.5 msec per loop

$ python3 -m timeit -s "s=0; buf = b'\x89PNG\r\n\x1a\n' + b'x'*100000" "for i in range(100000): s+=1 if buf[0] == 0x89 and buf[1] == 0x50 and buf[2] == 0x4e and buf[3] == 0x47 and buf[4] == 0x0d and buf[5] == 0x0a and buf[6] == 0x1a and buf[7] == 0x0a else 0"
20 loops, best of 5: 13.7 msec per loop

b'\x89' compares faster than 0x89 (integer). I cannot say how this behaves on old versions of Python.

Contributor

I was curious how "buf.startswith(bytearray([0x89, 0x50, 0x4e, 0x47,0x0d, 0x0a, 0x1a, 0x0a]))" would perform; that's why I tested it.

Here are some more tests. First half of the bytes match:

python -m timeit -s "buf = b'\x89PNGFOOBARFOOBAR'" "buf[:8] == bytearray([0x89, 0x50, 0x4e, 0x47,0x0d, 0x0a, 0x1a, 0x0a])"
1000000 loops, best of 5: 242 nsec per loop

python -m timeit -s "buf = b'\x89PNGFOOBARFOOBAR'" "buf[0] == 0x89 and buf[1] == 0x50 and buf[2] == 0x4E and buf[3] == 0x47 and buf[4] == 0x0D and buf[5] == 0x0A and buf[6] == 0x1A and buf[7] == 0x0A"
2000000 loops, best of 5: 138 nsec per loop

python -m timeit -s "buf = b'\x89PNGFOOBARFOOBAR'" "buf.startswith(b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a')"
5000000 loops, best of 5: 94.8 nsec per loop

python -m timeit -s "buf = b'\x89PNGFOOBARFOOBAR'" "buf[:8] == b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a'"
5000000 loops, best of 5: 73.9 nsec per loop

Zero bytes match:

python -m timeit -s "buf = b'FOOBARFOOBAR'" "buf[:8] == bytearray([0x89, 0x50, 0x4e, 0x47,0x0d, 0x0a, 0x1a, 0x0a])"
1000000 loops, best of 5: 244 nsec per loop

python -m timeit -s "buf = b'FOOBARFOOBAR'" "buf[0] == 0x89 and buf[1] == 0x50 and buf[2] == 0x4E and buf[3] == 0x47 and buf[4] == 0x0D and buf[5] == 0x0A and buf[6] == 0x1A and buf[7] == 0x0A"
10000000 loops, best of 5: 33.5 nsec per loop

python -m timeit -s "buf = b'FOOBARFOOBAR'" "buf.startswith(b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a')"
5000000 loops, best of 5: 95.3 nsec per loop

python -m timeit -s "buf = b'FOOBARFOOBAR'" "buf[:8] == b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a'"
5000000 loops, best of 5: 66.5 nsec per loop

buf[:8] == b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a' is faster than buf.startswith(b'\x89\x50\x4e\x47\x0d\x0a\x1a\x0a').
The byte-by-byte comparison is fastest only when the first byte does not match; otherwise its times are worse than those two matchers (at least on my PC).
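
The pattern in the per-byte timings (33.5 nsec with no matching bytes, 138 nsec with four, 208 nsec with all eight) is what short-circuit evaluation of an and-chain would predict: evaluation stops at the first mismatch. The sketch below counts how many bytes such a chain inspects; bytes_inspected is a hypothetical helper written for illustration only.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def bytes_inspected(buf):
    """Count the bytes an and-chain of per-byte checks would look at."""
    inspected = 0
    for expected, actual in zip(PNG_SIGNATURE, buf):
        inspected += 1
        if actual != expected:
            break  # an and-chain stops at the first False operand
    return inspected


print(bytes_inspected(b"FOOBARFOOBAR"))         # 1
print(bytes_inspected(b"\x89PNGFOOBARFOOBAR"))  # 5
print(bytes_inspected(b"\x89PNG\r\n\x1a\n"))    # 8
```

A slice comparison or startswith() call does a fixed amount of Python-level work regardless of where the mismatch is, which fits their flatter timings across the three inputs.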

Contributor

Oops, I didn't see your message when posting this one.
buf[0] == b'\x89' and buf[1] == b'P' and ... will not work because the types differ:

>>> buf = b'\x89PNG\r\n\x1a\n'
>>> buf[0] == b'\x89'
False
>>> type(buf[0])
<class 'int'>
>>> type(buf[0:1]) 
<class 'bytes'>
>>> type(0x89)
<class 'int'>
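
This would also explain why the buf[0] == b'\x89' variant timed so fast earlier in the thread: its first comparison is False for every buffer, so the rest of the chain is never evaluated. A short sketch of the three forms of comparison in Python 3:

```python
buf = b"\x89PNG\r\n\x1a\n"

# Indexing bytes yields an int in Python 3; slicing yields bytes.
assert isinstance(buf[0], int)
assert isinstance(buf[0:1], bytes)

# An int never equals a bytes object, so this is False for every buffer.
assert (buf[0] == b"\x89") is False

# Either of these compares like with like.
assert buf[0] == 0x89
assert buf[0:1] == b"\x89"

# Because the first operand is False, the and-chain stops immediately
# and never evaluates the remaining comparisons.
chain = buf[0] == b"\x89" and buf[1] == b"P"
assert chain is False
```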

+            buf[1] == 0x50 and
+            buf[2] == 0x4E and
+            buf[3] == 0x47 and
+            buf[4] == 0x0D and
+            buf[5] == 0x0A and
+            buf[6] == 0x1A and
+            buf[7] == 0x0A):
             # cursor in buf, skip already readed 8 bytes
             i = 8
             while len(buf) > i:
@@ -133,12 +139,15 @@ def __init__(self):
)

     def match(self, buf):
-        return (len(buf) > 3 and
+        return (len(buf) > 8 and
                 buf[0] == 0x89 and
                 buf[1] == 0x50 and
                 buf[2] == 0x4E and
-                buf[3] == 0x47)
-
+                buf[3] == 0x47 and
+                buf[4] == 0x0D and
+                buf[5] == 0x0A and
+                buf[6] == 0x1A and
+                buf[7] == 0x0A)

class Gif(Type):
"""
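
To illustrate what the second hunk changes, here is a sketch comparing the old 4-byte check with the new 8-byte check on a buffer that merely starts with \x89PNG. The names match_png_old and match_png_new are hypothetical stand-ins for the two versions of the match method, written as free functions so the example runs on its own.

```python
def match_png_old(buf):
    """Old check from the diff: only the first four signature bytes."""
    return (len(buf) > 3 and
            buf[0] == 0x89 and
            buf[1] == 0x50 and
            buf[2] == 0x4E and
            buf[3] == 0x47)


def match_png_new(buf):
    """New check from the diff: all eight signature bytes."""
    return (len(buf) > 8 and
            buf[0] == 0x89 and
            buf[1] == 0x50 and
            buf[2] == 0x4E and
            buf[3] == 0x47 and
            buf[4] == 0x0D and
            buf[5] == 0x0A and
            buf[6] == 0x1A and
            buf[7] == 0x0A)


almost_png = b"\x89PNGFOOBARFOOBAR"                 # right prefix, wrong tail
real_header = b"\x89PNG\r\n\x1a\n" + b"\x00\x00\x00\rIHDR"

print(match_png_old(almost_png), match_png_new(almost_png))    # True False
print(match_png_old(real_header), match_png_new(real_header))  # True True
```

Because the guard is len(buf) > 8, a buffer holding exactly the eight signature bytes and nothing else is not matched by the new check.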