Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the O(n^3) performance issue in is_connected utility function #397

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

Wavetrace
Copy link

@Wavetrace Wavetrace commented Nov 12, 2024

The is_connected() function is implemented in terms of is_rechable(), effectively checking if each vertex is reachable from each other vertex in a nested loop. This quickly becomes terribly slow as the number of vertices and edges grow, becoming roughly unusable with a square connected graph of size 20x20.

This request fixes the issue by

  • using a single DFS walk for undirected graphs
  • employing the Tarjan's SCC algorithm for directed graphs

This reduces the algorithm complexity from cubic to linear and makes it possible to use it on large graphs. See also the commit message in 80ae041

This change is fully backward compatible with the previous implementation.

It also adds a new is_connected.cpp test. Feel free to run this test on the previous implementation. This test passes without the new algorithm, but picks the size of 20x20 square to be considerably slow. This test also accepts an integer as argv[1], the size of the square, so it is possible to observe the slow-down growing as the size increases (the default is 20).

Please see an excerpt of my measurement below (I can upload this program as well). The previous implementation quickly becomes impracticably slow.

I note, that is_connected is not documented and is not widely used. Still I consider it to be important to fully adhere to the interface specified in comment and preserve the compatibility, as it could have been used in non-public projects. In short one important decision here is to avoid any breaking changes.

I foresee a few concerns with this request, which I'll be happy to discuss and address if possible:

  • the graph_utility.hpp header file seems to be designed as low-dependency. Including < vector > and <strong_components.hpp> there seems to violate that choice. But I think the practical aspect of this is reasonably minimal; while fixing the issue fully preserving the backward compatibility is important in practice.
  • the colormap is not used for directed graphs. strong_components do not accept the colormap and pass it to DFS. I don't see this as a big concern. This can be addressed in 2 ways: updating SCC interface to pass the colormap to DFS, or (probably better) provide the version of is_connected with a single argument and no colormap.
  • the version for directed graphs seems to now require VertexIndexGraph concept. (I think) This can be fixed by using map instead of vector for the component map (comp_map) in case the graph is not VertexIndexGraph. I'm not sure, how important is this, but this seems to be a deviation from the full backward compatibility.

I consider a bit of follow-up work on this, which can also be done in a separate pull request:

  • provide single-argument is_connected(const Graph& g). This will require including the colormap header into the graph_utility
  • document is_connected (and other utility functions like is_reachable)
  • add examples of the above

fixes #398

Measurement:

=== Testing size 1
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     3.299e-06s
time: undirected, connected - new  implementation - elapsed     1.179e-06s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         8.62e-07s
time: undirected, not connected - new  implementation - elapsed         5.39e-07s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       8.22e-07s
time: directed, connected - new  implementation - elapsed       2.636e-06s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   1.64e-06s
time: directed, not connected - new  implementation - elapsed   1.555e-06s

...
=== Testing size 10
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     0.0409115s
time: undirected, connected - new  implementation - elapsed     5.703e-06s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         0.000392494s
time: undirected, not connected - new  implementation - elapsed         4.8e-06s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       0.0240339s
time: directed, connected - new  implementation - elapsed       7.925e-06s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   0.000261854s
time: directed, not connected - new  implementation - elapsed   6.209e-06s

...
=== Testing size 24
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     16.458s
time: undirected, connected - new  implementation - elapsed     6.1249e-05s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         0.0312753s
time: undirected, not connected - new  implementation - elapsed         6.9342e-05s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       7.45684s
time: directed, connected - new  implementation - elapsed       4.9894e-05s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   0.01372s
time: directed, not connected - new  implementation - elapsed   0.000215756s

Adhere to the specification in comment:
// Is the undirected graph connected?
// Is the directed graph strongly connected?

Instead of checking if each vertex is reachable from each vertex (which is O(n^3)),
consider different approaches for directed and undirected graphs:

For an undirected graph check if each vertex was reachable with a single DFS walk.
This runs in O(N) (modulo back edges).

For a directed graph run Tarjan SCC algorithm and check if all vertices end up in
a single component. This runs in O(N+E).

The speed-up is considerable, e.g. for a 25x25 square connected graph it is about:

time: undirected, connected - prev implementation - elapsed 	16.458s
time: undirected, connected - new  implementation - elapsed 	6.1249e-05s
time: directed, connected - prev implementation - elapsed 	7.45684s
time: directed, connected - new  implementation - elapsed 	4.9894e-05s

For similar 80x80 graph it's about 0.000563113s, while it just takes forever
with the previous O(n^3) approach.
Adhere to the specification in function comment:
// Is the undirected graph connected?
// Is the directed graph strongly connected?

Instead of checking if each vertex is reachable from each vertex (which is O(n^3)),
consider different approaches for directed and undirected graphs:

For an undirected graph check if each vertex was reachable with a single DFS walk.
This runs in O(N) of a single connected component (modulo its back edges).

For a directed graph run Tarjan SCC algorithm and check if all vertices end up in
a single component. This runs in O(N + E).

The speed-up is considerable, e.g. for a 25x25 square connected graph it is about:

time: undirected, connected - prev implementation - elapsed 	16.458s
time: undirected, connected - new  implementation - elapsed 	6.1249e-05s
time: directed, connected - prev implementation - elapsed 	7.45684s
time: directed, connected - new  implementation - elapsed 	4.9894e-05s

For similar 80x80 graph it's about 0.000563113s, while it just takes forever
with the previous O(n^3) approach.
…backward compatibility with the previous implementation
@Wavetrace
Copy link
Author

I've also opened #398 to track this.

This resolves a build error due to boost::detail::get preferred without the using declaration:

../../../boost/pending/detail/disjoint_sets.hpp:58:24: error: no matching function for call to 'get(long unsigned int*&, int&)'
   58 |         assert(i == get(p, i));
../../../boost/graph/reverse_graph.hpp:561:7: note: candidate: 'template<class E> E boost::detail::get(boost::detail::underlying_edge_desc_map_type<E>, const boost::detail::reverse_graph_edge_descriptor<EdgeDesc>&)'
  561 |     E get(underlying_edge_desc_map_type< E > m,
      |       ^~~
../../../boost/graph/reverse_graph.hpp:561:7: note:   template argument deduction/substitution failed:
../../../boost/pending/detail/disjoint_sets.hpp:58:24: note:   mismatched types 'boost::detail::underlying_edge_desc_map_type<E>' and 'long unsigned int*'
   58 |         assert(i == get(p, i));
      |                     ~~~^~~~~~
@jeremy-murphy jeremy-murphy self-assigned this Nov 14, 2024
@jeremy-murphy
Copy link
Contributor

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

@Wavetrace
Copy link
Author

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

A great question! Well, I read the book as soon at it was published ;) And Alexander Stepanov's recommendation meant a lot to me, I know him personally and appreciate his inventions a lot. I also ended up rolling up my own graph library before, because Boost was not welcome in my organization.

As of GitHub, I hope to publish a few things soon, including the small project, where I encountered this error.

One reason I actually hit it is that I go straight to the code first and resort to the documentation when the code doesn't work as I'd expect.

Thanks for looking into this!

@Wavetrace
Copy link
Author

For the record, this function was added in 86ef707 on Feb 6, 2001 and not modified since then except for style.

@jeremy-murphy
Copy link
Contributor

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

A great question! Well, I read the book as soon at it was published ;) And Alexander Stepanov's recommendation meant a lot to me, I know him personally and appreciate his inventions a lot. I also ended up rolling up my own graph library before, because Boost was not welcome in my organization.

Interesting. Did you also work with him? I'm also a great admirer of his work and managed to visit him once for a chat; unfortunately, Palo Alto is a long way from Australia.

Anyway, my question was fairly flippant, so let me address an actual concern: maybe this function shouldn't exist at all?
Although the meaning of "connected" is unambiguous for undirected graphs, it's not so for directed graphs, and although we would preserve the original meaning of strongly connected, it still leaves every new user to discover what is meant by "connected" in the directed case. Also, what happens when someone wants "connected" to mean weakly, or unilaterally, connected in the directed case? It sits uncomfortably with me. It perhaps has to be a parameter to is_connected so that the user can specify it at each call. What do you think of this situation? Let's resolve this design question before we move on to matters of code.

As you pointed out, the function is undocumented and not used anywhere in the code base that I can see, so we are free to reinvent it.

@Wavetrace
Copy link
Author

Interesting. Did you also work with him? I'm also a great admirer of his work and managed to visit him once for a chat; unfortunately, Palo Alto is a long way from Australia.

No, we also only met for a few days.

It perhaps has to be a parameter to is_connected so that the user can specify it at each call. What do you think of this situation? Let's resolve this design question before we move on to matters of code.

Yes, I like this idea. Let me draft an implementation do discuss in greater details. I'd still consider it important to keep the interface backward-compatible. What do you think about it?

Also is it possible to mark a function as deprecated in Boost? I only see the deprecated headers or comments for function deprecation.

@Wavetrace
Copy link
Author

And one more question: I see meta/libraries.json specifies "cxxstd": "14". Does this mean we are free to use all the C++14 features? I see most code is C++98, but maybe it was just not updated yet?

is_connected.hpp

Implement 3 connection modes for directed graphs: strong, unilateral and
weak
Remove the obsolete `run_test` invocation.
@Wavetrace
Copy link
Author

Hi

I updated this FR with your suggestions, please have a look. One can now call the algorithm like this:

bool b = is_connected(g, is_connected_kind::strong);
bool b = is_connected(g, is_connected_kind::unilateral);
bool b = is_connected(g, is_connected_kind::weak);

which all returns the same for an undirected graph (this makes sense conceptually).

Does this correspond to your direction of thought? If yes, I'll add some documentation.

I think there are 3 questions at hand, I made some reasonable trade offs here, but would like to discuss this more. Probably it is good enough, but sure there is some place for improvement.

  1. Do we want to keep the interface 100% backward compatible? This is technically possible, though I understood from your previous comment, that you don't see this as important. I opted into breaking the interface, but producing a compile-time error with a reasonable message if one tries to use the old one.
  2. Algorithms now require that graph has a vertex index (there's a separate discussion of why it's not simply VertexIndexGraphConcept). It is also possible to accept the named params to get the vertex index, though this might be a bit of an overkill for now. Another alternative is to create associative output property maps instead of iterated ones, but this might be slow. Is there an easy way to pick an appropriate property map type depending on if vertex_index is provided by the graph? I see something relevant in named_function_params.hpp, though not exactly this.
  3. It might make sense to drop the default of "strongly connected" for directed graphs and require the mode / kind to be always specified explicitly.

While working on this I spotted a few other things; I'll probably send separate PRs for them to keep this one focused on is_connected.

I look forward to your comments and suggestions on the subject. Many thanks for looking into this!

@jeremy-murphy
Copy link
Contributor

Apologies for the delay in replying, I'm in the midst of relocating to a different country.

It perhaps has to be a parameter to is_connected so that the user can specify it at each call. What do you think of this situation? Let's resolve this design question before we move on to matters of code.

Yes, I like this idea. Let me draft an implementation do discuss in greater details. I'd still consider it important to keep the interface backward-compatible. What do you think about it?

I know backward-compatibility is often sought after or required, but I think it's given too much priority without consideration of the value provided. In this case, as I mentioned:

  1. the function is undocumented
  2. the interface is ambiguous

so in my opinion, backward-compatibility is not a priority.

I think it's better to draft an interface first before moving on to code, but I'm too late with that advice.
How I imagine it working is, given is_connected<X>(Y), where X is a type and Y is a concept:

X              Y              Result
               undirected     OK
               directed       invalid
weakly         directed       OK
unilaterally   directed       OK
strongly       directed       OK

The relationship between the kinds of directed connectivity is that strongly ⇒ weakly and strongly ⇒ unilaterally, but nothing else (i.e. unilaterally does not ⇒ weakly nor vice versa). So I wouldn't use the struct/enum design you currently have, I would do

struct weakly {};
struct unilaterally {};
struct strongly : weakly, unilaterally {};

I believe that's how the graph concepts are done too, have a look around in graph_traits.hpp maybe.

Also is it possible to mark a function as deprecated in Boost? I only see the deprecated headers or comments for function deprecation.

https://en.cppreference.com/w/cpp/language/attributes/deprecated

@jeremy-murphy
Copy link
Contributor

Ah, I see we both wrote long comments at the same time; I didn't see yours until after I finished mine.

@jeremy-murphy
Copy link
Contributor

Hmmm, my idea of using the template parameter is not essential, btw. Using the second parameter for a tag like you have done is probably preferable and I think follows existing conventions. One benefit of the tag dispatch is that is_connected(g) can work for an undirected graph.

Keep it not required for undirected graphs.
Also rename is_connected_kind to connected_kind
@Wavetrace
Copy link
Author

I'm in the midst of relocating to a different country

I wish you very smooth relocation, definitely not an easy thing to do ;)

How I imagine it working is, given is_connected(Y)

Yes, I ended up implementing this behaviour even before reading your comment! I agree, that explicit is better than implicit even when it requires a bit more typing. Please see the recent version.

Thanks for commenting on the backward compatibility. I think a static assert with a reasonable message would be good enough for now. Also thanks for confirming the kind selection tags strategy, it's probably good enough for now.

Overall I think this code is now in a reasonably good shape to be useful already.

I will look into selecting the proper output property maps and picking associative ones when the vertex_index is not present in the graph. I think this would be more usable than providing an external index and still reasonably fast. This can also follow as a different PR.

@Wavetrace
Copy link
Author

As of the other issues I mentioned, please see #403 and #404 (this is the reason some code is commented out in the test).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

is_connected() is too slow
2 participants