Analysis of Collatz Conjecture

What is Collatz Conjecture?

The Collatz Conjecture is a mathematical hypothesis named after a German Mathematician, Lothar Collatz, who may have come up with it in 1930s. This hypothesis revolves around a sequence defined by the following rules:

Begin with a positive integer (n).
Generate each subsequent term based on the following conditions:
- If the previous term is even, the next term is half of the previous term.
  - $n/2$
- If the previous term is odd, the next term is obtained by multiplying the previous term by 3 and then adding 1.
  - $3n+1$

The conjecture assume that irrespective of the initial positive integer chosen, the sequence will always eventually reach the value 1.

library(tidyverse)

start <- 1:10000

seq <- list()
for (i in start) {
  collatz_seq <- gen_collatz(i)
  seq[[i]] <- collatz_seq
}

length <- double()
for (i in start) {
  length[i] <- length(gen_collatz(i))
}

parity <- ifelse(start %% 2 == 0,
                 "Even",
                 "Odd")

max_val <- double()
for (i in start) {
  max_val[i] <- max(gen_collatz(i))
}


collatz_df <- tibble(start,
                     seq,
                     length,
                     parity,
                     max_val)

Exploratory Data Analysis & Visualizations

Using {tidyverse} data wrangling techniques, we will analyze the data to provide essential insights into the behavior of the Collatz Conjecture sequences. Then, using {ggplot2}, we will create the appropriate graphs that visualize the data wrangling tasks.

Identify top 10 starting integers that produce the longest sequence.

top10longest <-  collatz_df %>%
  arrange(desc(length)) %>%
  slice_head(n = 10) %>%
  pull(start)

Below is the scatter plot of the sequence lengths, with the starting integer on the horizontal axis and the length of the sequence on the vertical axis. The top 10 starting integers are highlighted and labeled in blue.

Below is the code for the plot:

library(ggplot2)
library(ggrepel)

# Scatterplot of all the sequence lengths
plot1<- ggplot( data = collatz_df,
                mapping = aes(x = start,
                              y = length)
)+
  geom_point()+
  labs(
    title = "Scatterplot of Sequence Lengths",
    x = "Starting Integer",
    y = "Length of Sequence"
  )
  
# To find the top 10 longest starting integers
top10length <- collatz_df %>%
    arrange(desc(length)) %>%
    top_n(10,length)
    
# To identify the top 10 longest starting integers in the scatterplot
scatterplot1 <-
  plot1 + 
  geom_point(data = top10length,aes(colour = "Top 10"))+
  scale_colour_manual(values = c("Top 10" = "blue"))+
  geom_text_repel(data = top10length, aes(label = start))

Identify the starting integer that produces a sequence that reaches the highest maximum value.

max_val_int <- collatz_df %>%
  slice(which.max(max_val)) %>%
  pull(start)

max_val_int = 9663

Below is the scatter plot of the highest value reached by each starting integer, with the starting integers in the horizontal axis, and the maximum values in the vertical axis. The top 10 starting integers are highlighted and labeled in red.

We can see that starting integer 9663 from max_val_int , indeed has the highest maximum value.

Below is the code for the plot:

# Scatterplot of the highest values of starting integers
plot2<- ggplot( data = collatz_df,
                mapping = aes(x = start,
                              y = max_val)
)+
  geom_point()+
  labs(
    title = "Scatterplot Of Maximum Value Reached In Sequence",
    x = "Starting Integer",
    y = "Maximum Value"
  )

# To find the top 10 starting integers with the highest value
sortedvalue <- collatz_df %>% arrange(desc(max_val))
top_10_value <- sortedvalue[1:10,]

# To highlight the top 10 starting integers 
scatterplot2 <-
  plot2 + 
  geom_point(data = top_10_value,
             aes(colour = "Top 10"))+
  scale_colour_manual(values = c("Top 10" = "red"))+
  geom_text_repel( data = top_10_value, aes(label = start))

Calculate the average length and standard deviation of sequences for even starting integers and compare them with those for odd starting integers.

even_odd_avg_len <- collatz_df %>%
  group_by(parity) %>%
  summarise(mean(length)) %>%
  pull(2)

even_odd_sd_len <- collatz_df %>%
  group_by(parity) %>%
  summarise(sd(length)) %>%
  pull(2)

even_odd_avg_len = 79.5936 92.3396

even_odd_sd_len = 45.10308 47.18387

Shown below are box plots to compare the distribution of sequence lengths for even and odd starting integers, with the parity on the horizontal axis and the length of sequence on the vertical axis.

Lets denote the box plot of even starting integers are "Box plot A", and the box plot of odd starting integers as "Box plot B". As we can see above, there are a few notable differences between box plot A and box plot B, which are as follows:

Outliers are only present in Box plot B
The median in Box plot B is higher than in Box plot A, which indicates that odd starting integers result in longer sequences.

Below is the code for the plot:

# Boxplot of distribution of sequence length for even,odd starting integers

ggplot( data = collatz_df,
        mapping = aes( x = parity,
                       y = length))+
  geom_boxplot()+
  labs(
    title = "Boxplots Comparing Distribution of Even and Odd Starting Integers",
    x = "Parity",
    y = "Length of Sequence"
  )

Investigating backtracking in Sequences

When a series hits a number that is less than the initial integer but subsequently rises beyond it at least once more before reaching 1, backtracking has occurred. Hence, let's investigate them!

Filter collatz_df to retain starting integers that exhibit backtracking in their sequence.

has_backtracking <- function(seq) {
  if (length(seq) <= 2) {
    return(FALSE)
  }
  for (i in 2:(length(seq) - 1)) {
    if (seq[i] < seq[1] && seq[i + 1] > seq[1]) {
      return(TRUE)
    }
  }
  return(FALSE)  
}

backtracks_df <- collatz_df %>% 
  filter(sapply(seq, has_backtracking))

A function has_backtracking was created to identify and check if the backtracking has occurred with input seq.

Let's find the most frequently occurring number of times they go above their starting integer.

count_backtrack <- function(seq) {
  sum(seq > seq[1])
}

mode_backtrack <- backtracks_df %>%
  mutate( backtrack_counts = sapply(seq, count_backtrack)) %>%
  count(backtrack_counts) %>%
  arrange(desc(n)) %>%
  pull(backtrack_counts) %>%
  first()

Another function count_backtrack was introduced to count number of backtracks that has occurred with input seq.

What is the maximum value reached after the first backtrack for these sequences?

first_backtrack <- function(seq) {
  start <- seq[1]
  first <- start
  reached_backtrack <- FALSE
  
  for (i in seq) {
    if (reached_backtrack && i > first) {
      first <- i
    }
    if (i < start) {
      reached_backtrack <- TRUE
    }
  }
  return(first)
}

max_after_backtrack <- backtracks_df %>%
  mutate(max_after_backtrack = sapply(seq, first_backtrack)) %>%
  pull(max_after_backtrack)

Lastly, function first_backtrack was created to find the first backtrack value for the sequences

Are backtracking sequences more common among even or odd starting integers? Let's find out!

even_odd_backtrack <- backtracks_df %>%
  group_by(parity) %>%
  summarise(frequency = n()) %>%
  pull(2)

even_odd_backtrack = 3943 4286

Hence, backtracking sequences are more common in odd starting integers compared to even starting integers.

Open-ended Exploration

Investigating the correlation between the starting integers and the number of even and odd numbers in the sequence

First, we create odd_counts and even_counts, which are the frequency of of odd and even numbers in the sequence.

odd_numbers_in_seq <- function(x) {
  odd_count <- sum(x %% 2 != 0)
  return(odd_count)
}
odd_counts <- double()
for (i in start) {
  odd_counts[i] <- odd_numbers_in_seq(collatz_df$seq[[i]])
}

even_numbers_in_seq <- function(x) {
  even_count <- sum(x %% 2 == 0)
  return(even_count)
}
even_counts <- double()
for (i in start) {
  even_counts[i] <- even_numbers_in_seq(collatz_df$seq[[i]])
}

Then, use the function cor() to compute the correlation and store it as correlation_start.

correlation_start <- collatz_df %>%
  mutate(even_counts, odd_counts) %>%
  select(-seq, -parity) %>%
  cor()

> correlation_start
                 start    length    max_val even_counts odd_counts
start       1.00000000 0.2054051 0.08813284   0.2212319  0.1798567
length      0.20540510 1.0000000 0.17133951   0.9998222  0.9995421
max_val     0.08813284 0.1713395 1.00000000   0.1719367  0.1702539
even_counts 0.22123190 0.9998222 0.17193671   1.0000000  0.9987938
odd_counts  0.17985668 0.9995421 0.17025389   0.9987938  1.0000000

The correlation coefficient between the starting integers and the number of even numbers in the sequence is 0.22123, which is low and positive.
- This indicates that there is a weak positive relationship between the starting integers in the sequence and the number of even numbers in the sequence.
The correlation coefficient between the starting integers and the number of odd numbers in the sequence is 0.17986, which is also low and positive.
- This shows that there is a weak positive relationship between the starting integers in the sequence and the number of odd numbers in the sequence.
The correlation coefficient between the number of even and odd numbers in the sequence is 0.99879, which is high (close to 1) and positive.
- This indicates that there is a strong positive relationship between the number of even and odd numbers in the sequence.

The first two plots represents the relationship between the starting integer of the sequence and the number of even and odd numbers in the sequence respectively.
- It is clear here that there are many points that are scattered away from the line, which indicates a weak relationship. The slope of the line is positive, so the relationship will be positive.
The last plot represents the relationship between the number of odd and even numbers in the sequence.
- It is obvious here that there are many points close to the line, which indicates a very strong relationship and since the slope is positive, the relationship will also be positive.
The code below is used for creating a visualization to show whether there is a relationship between two variables.

start_even_counts <- collatz_df %>%
  mutate(even_counts) %>%
  select(-seq, -parity) %>%
  ggplot(., aes(x = start,
                y = even_counts)) +
  geom_point(color = "skyblue") +
  geom_smooth(method = "lm",
              se = FALSE,
              fullrange = TRUE,
              color = "red") +
  labs(x = "Starting integers",
       y = "Even numbers in the sequence") +
  theme_minimal()
  
start_odd_counts <- collatz_df %>%
  mutate(odd_counts) %>%
  select(-seq, -parity) %>%
  ggplot(., aes(x = start,
                y = odd_counts)) +
  geom_point(color = "skyblue") +
  geom_smooth(method = "lm",
              se = FALSE,
              fullrange = TRUE,
              color = "red") +
  labs(x = "Starting integers",
       y = "Odd numbers in the seq") +
  theme_minimal()
  
even_odd_counts <- collatz_df %>%
  mutate(odd_counts, even_counts) %>%
  select(-seq, -parity) %>%
  ggplot(., aes(x = even_counts,
                y = odd_counts)) +
  geom_point(color = "skyblue") +
  geom_smooth(method = "lm",
              se = FALSE,
              fullrange = TRUE,
              color = "red") +
  labs(x = "Even numbers",
       y = "Odd numbers") +
  theme_minimal()

library(ggpubr)
ggarrange(start_even_counts, start_odd_counts, even_odd_counts,
          labels = c("p = 0.22123", "p = 0.17986", "p = 0.99879"),
          hjust = -2,
          font.label = list(size = 9),
          ncol = 2, nrow = 2)

ggarrange() is used to fit multiple plots in one image. Before using this, it is important to install ggpubr first.

Creative Visualization Challenge

For this section we will look into 3 different visualizations for Collatz Conjecture;

1) Plotting the highest maximum value reached by each starting integer

Here is a plot of starting integers up to 10,000, with the largest maximum value reached by each starting integer plotted on the y-axis. The y-axis stopped at 100,000, but not all starting integers can be shown at this scale. For example, when n = 9663, the largest value reached climbs as high as 27 million.

Below is the code to this plot:

collatz_df %>%
  unnest(seq) %>%
  group_by(start) %>%
  filter(start %in% 1:10000) %>%
  slice_max(order_by = seq) %>%
  ggplot(.,
         aes(x = start,
             y = seq)) +
  geom_point(aes(col = start,
                 alpha = length),
             size = 1) +
  labs(
    title = "Collatz Conjecture",
    subtitle = "Max value reached by each starting integer ",
    x = "Starting Integer",
    y = "Value"
  ) +
  theme_minimal() + 
  xlim(0, 10000) +
  ylim(0, 100000)

2) Numerical Progression of each starting integer

Here is a plot of the numerical progression of each starting integer from 1 to 30. Interestingly, the starting integer n = 27, goes through 112 steps to finally reach 1.

Below is the code to this plot:

collatz_df %>%
  unnest(seq) %>%
  group_by(start) %>%
  filter(start %in% 1:30) %>%
  mutate(steps = row_number()) %>%
  ggplot(.,
         aes(x = steps,
             y = seq)) +
  geom_line(aes(col = length)) +
  facet_wrap(start ~ length, scales = "free") +
  labs(
    title = "Collatz Sequence Line Plot",
    subtitle = "Numerical Progression for each starting integer",
    x = "Steps",
    y = "Value"
  ) +
  theme_classic()

You may replace the starting integer filter "1:30" to any numbers you want to look at their numerical progression.

3) Collatz Sequence Hex

Here is a hexagonal plot of the length of each starting integers from 1 to 10,000. For every hexagon, you can check how many data points there are which leads to the count. As you can see, lengths around 50 and 130 are more common compared to the rest.

Perhaps, by having more starting integers (i.e 1 to 100,000), a better hexagonal representation can be made to see which lengths are more common.

Below is the code to this plot:

collatz_df %>%
  ggplot(.,
         aes(x = start,
             y = length)) +
  geom_hex(bins = 20) +
  scale_y_continuous(breaks = seq(0, 275, by = 25)) +
  scale_fill_viridis_c() +
  theme_minimal()

Contribution declaration

Tasks	Done by
Task 1	@HafizNjame
Task 2	@HafeezulRaziq
Task 3	@20B2097
Task 4	@dnshzm
Task 5	@HafizNjame
Task 6	@HafeezulRaziq
README	@HafizNjame @dnshzm @20B2097 @HafeezulRaziq

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
.github		.github
figures		figures
tests		tests
.gitignore		.gitignore
01-gen_collatz.R		01-gen_collatz.R
02-eda.R		02-eda.R
03-backtracking.R		03-backtracking.R
04-plots.R		04-plots.R
05-open.R		05-open.R
06-creative.R		06-creative.R
ASSIGNMENT.md		ASSIGNMENT.md
INSTRUCTIONS.md		INSTRUCTIONS.md
README.html		README.html
README.md		README.md
a1-grp-R.Rproj		a1-grp-R.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Analysis of Collatz Conjecture

What is Collatz Conjecture?

Table of Contents

Generating the Collatz Conjencture

Exploratory Data Analysis & Visualizations

Investigating backtracking in Sequences

Open-ended Exploration

Creative Visualization Challenge

1) Plotting the highest maximum value reached by each starting integer

2) Numerical Progression of each starting integer

3) Collatz Sequence Hex

Contribution declaration

About

Releases

Packages

Contributors 4

Languages

sm2302-aug23/grp-r-al_jabr

Folders and files

Latest commit

History

Repository files navigation

Analysis of Collatz Conjecture

What is Collatz Conjecture?

Table of Contents

Generating the Collatz Conjencture

Exploratory Data Analysis & Visualizations

Investigating backtracking in Sequences

Open-ended Exploration

Creative Visualization Challenge

1) Plotting the highest maximum value reached by each starting integer

2) Numerical Progression of each starting integer

3) Collatz Sequence Hex

Contribution declaration

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages