Skip to content

Data: Qualities of NSW Spatial LOT Layer

Angus edited this page Jan 18, 2025 · 5 revisions

Field Formats

lot_id_string

Note they way they format lot is like so

  • 1234/1234/DP12345
    • Note deposit plans are prefixed with DP, which we do not do in nsw_lrs.parcel.
      • Perhaps we should tho, the reason we haven't was largely based on it never really appearing in the land value data
      • Or Perhaps we shouldn't as our format is also used by the NSW LRS.
  • 1234//DP12345
    • note they don't drop the / when there is no section
  • //SP12345
    • note this how the specify a strata lot.

Then there are other shapes like this, these appear inconsistent with the other shapes.

image

Uniqueness

lot_id_string

There seems to be at most 2 of each lot_id_string, based on this query:

Query result
select lot_id_string, COUNT(*)
  from nsw_spatial_lppt_raw.lot_feature_layer
  group by lot_id_string
  order by 2 DESC
  limit 10;
image

How do they vary?

This is what I saw when I looked at these lots

1//DP1087606

field before after description
{modified,start,created}_date & last_updated 2019-03-26 02:21:46 2025-01-14 23:18:46 clearly one piece of data was newer than the other
shape_area 13259.99992434575 13260.000083243887 this looks like different rounding?
shape_length 612.9018429031083 612.9018432012755 this also looks like different rounding?

The others

  • 1//DP71431, the changes are identical to the ones I mentioned above.
  • 2//DP246207, same

Geography

lot before after
1//DP1087606 image image
1//DP71431 image image

After checking the 2 above rows, I stopped taking screenshots as it really seemed like each of these were just updates with different rounding.

A Theory of duplication

I think these are just same pieces of information re-entered and they just have a bit of floating point variation. I figured we can test this by seeing variation between the new and old values, however after testing it with the below query it turns out there is none trivial variance in some cases.

Query result
select lot_id_string, (MAX(shape_area) - MIN(shape_area)) AS area_variance
  from nsw_spatial_lppt_raw.lot_feature_layer
  group by lot_id_string
  order by 2 DESC
  LIMIT 10;
image

Perhaps these are subdivisions?

Number of duplicates with variance greater than 0

I ran this query and there appears to be 106 examples of lot ids with variance greater than 0

select COUNT(*)
  from (
    select lot_id_string, (MAX(shape_area) - MIN(shape_area)) AS area_variance
      from nsw_spatial_lppt_raw.lot_feature_layer
  	  group by lot_id_string
     order by 2 DESC) t 
 where area_variance > 0;

Checking GeoPandas

Here's how I choose to render everything

import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch


df = gpd.read_postgis("""
select l.*, ROUND(v.area_variance::numeric, 2) as area_variance
  from (
    select lot_id_string, (MAX(shape_area) - MIN(shape_area)) AS area_variance
      from nsw_spatial_lppt_raw.lot_feature_layer
      group by lot_id_string
      order by 2 DESC) v
  left join nsw_spatial_lppt_raw.lot_feature_layer l using (lot_id_string)
  where area_variance > 0
  order by area_variance DESC, l.last_update DESC
""", db.engine(), geom_col='geometry')

lot_id_strings = df['lot_id_string'].unique().tolist()
resolution = 8
plot_cols, plot_rows = 11, 11

for offset in range(0, len(lot_id_strings), plot_cols * plot_rows):
    figsize = (plot_cols * resolution, plot_rows * resolution)
    fig, ax = plt.subplots(plot_cols, plot_rows, figsize=figsize)
    for subplot in range(0, min(plot_cols * plot_rows, len(lot_id_strings) - offset)):
        ax_col = subplot % plot_cols
        ax_row = subplot // plot_rows
        sub_ax = ax[ax_row][ax_col]
        lot_id = lot_id_strings[offset + subplot]
        sub_df = df[df['lot_id_string'] == lot_id].reset_index(drop=True)

        sub_df.iloc[[0]].plot(ax=sub_ax, color='none', edgecolor='red', hatch='///')
        sub_df.iloc[[1]].plot(ax=sub_ax, color='none', edgecolor='blue', hatch='\\\\\\')
        sub_ax.set_title(lot_id)
        sub_ax.set_xlabel(f"Variance {sub_df.loc[0, 'area_variance']}", loc='left')
        sub_ax.legend(handles=[
            Patch(facecolor='white', edgecolor='red', label=sub_df.loc[0, 'start_date']),
            Patch(facecolor='white', edgecolor='blue', label=sub_df.loc[1, 'start_date']),
        ], loc='lower right')
    plt.show()

And this is the result.

differences

It's only really the top rows that have any significant difference. The rest seem to be very minor differences.

Conclusion?

I think I'll just opt for the latest row.

Storage

Do we want more than 1 table? Should we restructure how we're currenting storing lots to reflect the fact that land is either part of a deposit lot or a strata plan with any number of lots?