Image filter methods, size of output #106

Open · twocs opened this issue Jun 8, 2015 · 19 comments
@twocs (Contributor) commented Jun 8, 2015

In OpenCV, the output image is the same size as the input image. The indices for x,y in the output image correspond to x,y in the input.

But in PAL, the 3x3 methods have output sizes that are two pixels smaller in width and height than the input size. This is more economical with memory, but the x,y locations no longer correspond. Is this the preferred method?

The size of the output images should be added to the comments and documentation for these methods.
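For reference, here is a minimal sketch of the convention being described, with an illustrative name and signature (not PAL's actual API):

```c
/* Hypothetical 3x3 convolution using the "valid" convention described
 * above: for a w x h input the output is (w-2) x (h-2), and output
 * pixel (x, y) corresponds to input pixel (x+1, y+1). */
void conv3x3_valid(const float *in, float *out,
                   const float k[9], int w, int h)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[(dy + 1) * 3 + (dx + 1)] *
                           in[(y + dy) * w + (x + dx)];
            out[(y - 1) * (w - 2) + (x - 1)] = acc; /* output stride is w-2, not w */
        }
    }
}
```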

@anael-seghezzi (Contributor)

A design decision is needed here; I also noticed an inconsistency while making this example: #92

It's not clear what convention should be followed: the original filters output a smaller image, but the Sobel filter doesn't. Users will expect an output of the same size, but that's not trivial (especially for a large radius):

  • Is the margin set to zero?
  • Or should we take the nearest value after (or before) filtering?

In high-quality filters, what is done is:
1. A temporary image is created at size = input size plus the filter radius on each side, with the margin filled with the nearest pixel value.
2. The output is computed at the input size, using the larger temporary image as input.

It makes sense on x86, but might be inefficient on simple processors with no cache.
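A minimal sketch of step 1, assuming "nearest value" means clamp-to-edge replication (pad_clamp and clampi are hypothetical helpers, not PAL functions):

```c
static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Copy a w x h image into a (w + 2r) x (h + 2r) buffer, filling the
 * r-pixel margin with the nearest edge pixel of the input. */
void pad_clamp(const float *in, float *out, int w, int h, int r)
{
    int pw = w + 2 * r, ph = h + 2 * r;
    for (int y = 0; y < ph; y++)
        for (int x = 0; x < pw; x++)
            out[y * pw + x] =
                in[clampi(y - r, 0, h - 1) * w + clampi(x - r, 0, w - 1)];
}
```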

@anael-seghezzi (Contributor)

I think that the low-level methods should only focus on the fastest/lightest code, and so output a smaller image.

Example:
A user wants to apply multiple filters to the same input image while keeping the original size. The optimal method in this case would be to oversize the input (2 pixels larger for 3x3 filters, with the nearest value for the margin) and then apply all the filters: one resize only, then multiple filters at raw speed, and the final output will be the original size.

PAL can propose additional higher-level functions for the filters that output the expected image size, by internally pre-resizing the input and calling the lower-level function.
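A sketch of what such a higher-level wrapper could look like, reusing the hypothetical pad_clamp() and conv3x3_valid() helpers from the earlier sketches (none of these names are PAL's):

```c
#include <stdlib.h>

/* Same-size 3x3 convolution built on the low-level "valid" kernel:
 * pad by the filter radius, then filter; the valid output is w x h. */
void conv3x3_same(const float *in, float *out,
                  const float k[9], int w, int h)
{
    int pw = w + 2, ph = h + 2;                /* 3x3 => radius 1 */
    float *tmp = malloc((size_t)pw * ph * sizeof *tmp);
    if (!tmp) return;                          /* error handling elided */
    pad_clamp(in, tmp, w, h, 1);               /* clamped 1-pixel margin */
    conv3x3_valid(tmp, out, k, pw, ph);        /* (pw-2) x (ph-2) = w x h */
    free(tmp);
}
```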

@aolofsson (Member)

Agree 100%

@tcptomato

@anael-seghezzi: I think you're wrong. It can't be faster to create another input buffer (whose size depends on the number of filters to be executed!) when most current systems have the memory bus as a bottleneck. A better choice here would be to compute some garbage values for the pixels on the border, and offer a post-processing function for those who also need the border to be correct (with a number of options for extending the image by 1: clone neighbor, fill with value, and such).
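For illustration, a sketch of the "clone neighbor" flavor of such a post-processing function; the name and interface are hypothetical:

```c
/* Overwrite the 1-pixel garbage ring of a same-size w x h output by
 * cloning the nearest valid interior pixel (assumes w, h >= 3). */
void fix_border_clone(float *img, int w, int h)
{
    for (int x = 0; x < w; x++) {
        img[x] = img[w + x];                          /* top row    */
        img[(h - 1) * w + x] = img[(h - 2) * w + x];  /* bottom row */
    }
    for (int y = 0; y < h; y++) {
        img[y * w] = img[y * w + 1];                  /* left column  */
        img[y * w + (w - 1)] = img[y * w + (w - 2)];  /* right column */
    }
}
```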

@mateunho (Contributor)

I agree with @anael-seghezzi. Filtering done on certain nodes should be as fast as possible. That implies no extras (border pixels, post-processing, etc.).
@tcptomato In my opinion, your way of thinking covers a higher level of abstraction: the feature level. Here, we're discussing the building blocks of such features, which are pretty basic in their inner workings.

@aolofsson (Member)

Thomas,
These kernels will always work on local copies. You never read directly from DRAM. The intent here is that each core will work on a sub-tile of a larger image.

@tcptomato

Andreas,
I have experience implementing image processing algorithms on this kind of multicore architecture (even doing it in assembler), so I know the idea behind it and what I'm talking about.

The currently implemented image processing algorithms treat the output as a vector: after computing one element, they advance by one element size in memory. This means that a simple 2-pixel enlargement won't suffice for running multiple filters, since the filters have no notion of line stride and will just output a smaller image, contiguous in memory. So either you resize after each filter, or you transfer the image with the padding logic encoded (the DMA controller will probably love this; I foresee alignment issues here, or byte-sized transfers). And additionally, before each DMA transfer you need to zero the buffer in order to have the correct values in the padding pixels.
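To illustrate what is meant, a sketch of that access pattern (filter_at() is a hypothetical stand-in for any 3x3 kernel evaluation):

```c
/* The filter advances the output pointer by one element per pixel,
 * so the (w-2) x (h-2) results are packed contiguously with no line
 * stride: the output cannot line up with a padded w x h buffer. */
static float filter_at(const float *in, int w, int x, int y)
{
    return in[y * w + x];   /* placeholder kernel for illustration */
}

void filter_vector_style(const float *in, float *out, int w, int h)
{
    float *o = out;
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            *o++ = filter_at(in, w, x, y);  /* no notion of output stride */
}
```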

@mateunho Nope. My way of thinking also takes restrictions on the hardware side into consideration; it doesn't only look at the algorithm to be implemented.

@aolofsson (Member)

@tcptomato Can you clarify what problem you are concerned about? Please keep in mind that these low-level functions are not meant to be "perfect functions". Perhaps some of your concerns could be addressed in a higher-level function?

Let's take the following use case:
-There is a 128 x 128 pixel image to process in DRAM
-We want to run a Sobel filter on this 2D array using 16 cores
-Each core would work on a separate 32 x 32 sub-tile of the image
-Each core needs to bring in a 32 x 32 sub-tile + a 1-pixel halo around the tile from DRAM and produce a complete 32 x 32 sub-result image.

Please describe your proposal for addressing the above example.
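Under the "valid" convention sketched earlier, each core's inner step could look like this (DMA of the tile in and out is elided; all names are illustrative):

```c
enum { TILE = 32, HALO = 1, TIN = TILE + 2 * HALO };

/* Per-core work: a 34 x 34 input (32 x 32 tile + 1-pixel halo)
 * filtered with the hypothetical conv3x3_valid() yields a complete
 * 32 x 32 sub-result with no garbage pixels. */
void process_tile(const float tile_in[TIN * TIN],
                  float tile_out[TILE * TILE], const float k[9])
{
    conv3x3_valid(tile_in, tile_out, k, TIN, TIN);  /* 34x34 -> 32x32 */
}
```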

@tcptomato

Assumption:
We ignore the border artifacts when cutting the image into tiles.

For your example:
We transfer the 32x32 tile using DMA and create the 1-pixel halo at the same time. The first line of the extended image is 34 pixels of border (for simplicity we assume it's 0). On the second line, the first pixel is again 0, after which the actual data begins, meaning that the 36th pixel needs to be correctly aligned in memory. Worse, after the 32 pixels of this line there will again be 2 halo pixels. So the difference between the start addresses of two lines is 16 (modulo 64), and any DMA transfer size bigger than a half-word will be misaligned (the output stride needs to be a multiple of the transfer size in all DMA controllers I've seen). Meaning that transfer speed is a quarter of what is possible.

For the next issue, allow me to replace Sobel with Gaussian blur (since it's more likely to be run multiple times on the same image, and it has the same halo issue). In order to apply the blur filter twice on the image, after the first run the halo needs to be recreated: either by transferring the image to DRAM and copying it back with the stride (or does the DMA controller support local-to-local transfers? even if it does, it's another memory transfer), or the local processor needs to add it.

To recap, the biggest bottleneck here is the DMA running at 1/4 speed and the need to prepare the buffer for each new run.

How I see it:
The image is sliced into 8 x 128 strips, each transferred as a contiguous block of memory by the DMA at full speed. Now we run the filter on the full block; being implemented as pointer arithmetic on a 1D vector, it will generate a value for all output pixels (the ones on the border will be garbage). It might need an empty line at the top and the bottom of the block to prevent memory access errors. Since the output size is the same as the input, we can use it as the input for a new run of the function without any changes.

Both options discussed here have issues in stitching back the tiles, but we'll discuss them when we get there.
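A sketch of that strip-processing idea under the stated assumptions (contiguous 8 x 128 strip, guard lines above and below, border output pixels left as garbage); names are illustrative:

```c
enum { STRIP_W = 128, STRIP_H = 8 };

/* in and out point at the first real line of the strip; the caller
 * provides one guard line (plus one pixel of slack for the corner
 * taps) before and after in the same buffers. The filter is plain
 * pointer arithmetic on a 1-D vector, so at line edges x-1 / x+1
 * read the neighboring line's end pixels (wrap-style garbage).
 * Output layout equals input layout, so the result can be fed
 * straight back in for another pass. */
void filter_strip_same(const float *in, float *out, const float k[9])
{
    for (int y = 0; y < STRIP_H; y++)
        for (int x = 0; x < STRIP_W; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[(dy + 1) * 3 + (dx + 1)] *
                           in[(y + dy) * STRIP_W + (x + dx)];
            out[y * STRIP_W + x] = acc;
        }
}
```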

@aolofsson (Member)

Please explain the assumption. How can we ignore border artifacts?


@tcptomato

The artifact appears for the border pixels, which need values from other slices to be computed correctly. Ignoring this artifact (for the current discussion), we use the value in the halo (in your case) or some other value (in mine).

@anael-seghezzi (Contributor)

Well, there is definitely some thought to be given to memory here; I honestly have no good insight into Parallella memory transfer cost vs. computation power.

What I know (I tested it on Intel) is that, in the case of a "correct" Gaussian filter (meaning it outputs clean filtered borders), it is slower to take care of the border inside the processing loop than to prepare an enlarged image and then execute a simpler processing loop.

Also, tile-based filtering will have to make a copy of the input data anyway, and the input tile will have to be larger than the output tile anyway (or the tile borders will not match each other).
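For contrast, a sketch of the slower alternative measured there: handling the border inside the processing loop by clamping every neighbor coordinate (reusing the hypothetical clampi() from the earlier sketch):

```c
/* Same-size 3x3 convolution that clamps each tap to the image bounds.
 * The per-tap clamps are the in-loop border handling that the
 * pre-enlarged-image approach avoids. */
void conv3x3_clamped(const float *in, float *out,
                     const float k[9], int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[(dy + 1) * 3 + (dx + 1)] *
                           in[clampi(y + dy, 0, h - 1) * w +
                              clampi(x + dx, 0, w - 1)];
            out[y * w + x] = acc;
        }
}
```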

@tcptomato

Do you by any chance have the code pushed somewhere, so I can extend it and we can talk about the same thing?

I didn't say to take care of the border in the processing loop (that would be counterproductive). If correct borders are needed, a helper function could correct them. If not, processing is done.

@anael-seghezzi (Contributor)

The "intel" code ? It's on my github account > https://github.com/anael-seghezzi/Maratis-Tiny-C-library

I think it's better to have correctness, even if the resulting image is smaller. In my view, if the output has to be the same size as the input, the corners have to be correct.

Also, I don't see how a helper function can simply correct a border in a tile-based system, because you will still need the extra input data to correct the border of the tile.

That said, I still agree that memory transfers on the platforms targeted by PAL need to be optimal; I don't have a Parallella, so I'm not sure about that.

@tcptomato

The helper function is for the border of the complete frame, where you have to assign values to the non-existent pixels when computing the values of the pixels on the border. OpenCV has 5 different options here (reflect, constant, wrap, ...), and I was saying that the default should be "don't care" (though, due to the pointer-arithmetic implementation, it will be a kind of wrap for most of the image).
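A sketch of those extension options as a read helper that maps out-of-range coordinates to pixel values (a subset of OpenCV's borderType modes; the interface is hypothetical):

```c
typedef enum { EXT_CONSTANT, EXT_REFLECT, EXT_WRAP } ext_mode;

/* Return the value of pixel (x, y), extending the w x h image when
 * the coordinate falls outside it. The reflect case is valid for
 * small radii (|overshoot| < w, h). */
static float sample_ext(const float *in, int w, int h,
                        int x, int y, ext_mode m, float cval)
{
    if (m == EXT_CONSTANT) {
        if (x < 0 || x >= w || y < 0 || y >= h)
            return cval;                 /* fill with a constant value */
    } else if (m == EXT_REFLECT) {
        if (x < 0) x = -x; else if (x >= w) x = 2 * w - 2 - x;
        if (y < 0) y = -y; else if (y >= h) y = 2 * h - 2 - y;
    } else {                             /* EXT_WRAP */
        x = (x % w + w) % w;
        y = (y % h + h) % h;
    }
    return in[y * w + x];
}
```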

And with the way I was proposing to split the image, you would have to bring in 2 more lines (256 pixels) at full speed. For the tile-based one you need 132 more pixels, but at 1/4 of the speed due to memory alignment.

@anael-seghezzi (Contributor)

If you are saying that transferring horizontal blocks (128x8) aligned in memory will remove a bottleneck on this platform versus a tile-based system (32x32), it should be considered and tested.

I'm not sure how this will behave with a larger radius, where you cannot read the radius' worth of source pixels contiguously. But large radii have other winning strategies: a large blur radius is usually processed in two passes, a horizontal blur then a vertical blur. How does that look regarding memory transfer on the Parallella?
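A sketch of that two-pass idea: a separable blur does a 1-D horizontal pass and then a 1-D vertical pass, turning one O(r²) 2-D kernel into two O(r) passes (illustrative box weights; borders left unprocessed for brevity):

```c
/* Separable blur of radius r: in -> tmp (horizontal pass), then
 * tmp -> out (vertical pass). Only the interior where the full
 * window fits is written; border handling is elided. */
void blur_separable(const float *in, float *tmp, float *out,
                    int w, int h, int r)
{
    float norm = 1.0f / (float)(2 * r + 1);
    for (int y = 0; y < h; y++)              /* horizontal pass */
        for (int x = r; x < w - r; x++) {
            float acc = 0.0f;
            for (int d = -r; d <= r; d++)
                acc += in[y * w + (x + d)];
            tmp[y * w + x] = acc * norm;
        }
    for (int y = r; y < h - r; y++)          /* vertical pass */
        for (int x = r; x < w - r; x++) {
            float acc = 0.0f;
            for (int d = -r; d <= r; d++)
                acc += tmp[(y + d) * w + x];
            out[y * w + x] = acc * norm;
        }
}
```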

@tcptomato

I'm saying a contiguous block transfer will remove a bottleneck that a halo-based approach would incur.

And I'm not sure I understand what you mean by not being able to read the source pixels contiguously.

@anael-seghezzi (Contributor)

I mean that, with the way you propose to split the image, if we have to deal with a larger blur radius, or if the image width is very large, you won't be able to fit the extra data in memory if you want contiguous pixels. Or maybe I misunderstood a step.

How do you deal with a large image width, for example? You will have to tile the image anyway, no?

@anael-seghezzi (Contributor)

An interesting link on this subject: https://www.youtube.com/watch?t=549&v=3uiEyEKji0M
