Data: Qualities of NSW Spatial LOT Layer

Field Formats

`lot_id_string`

Note they way they format lot is like so

1234/1234/DP12345
- Note deposit plans are prefixed with DP, which we do not do in nsw_lrs.parcel.
  - Perhaps we should tho, the reason we haven't was largely based on it never really appearing in the land value data
  - Or Perhaps we shouldn't as our format is also used by the NSW LRS.
1234//DP12345
- note they don't drop the / when there is no section
//SP12345
- note this how the specify a strata lot.

Then there are other shapes like this, these appear inconsistent with the other shapes.

Uniqueness

`lot_id_string`

There seems to be at most 2 of each lot_id_string, based on this query:

Query	result
`select lot_id_string, COUNT(*) from nsw_spatial_lppt_raw.lot_feature_layer group by lot_id_string order by 2 DESC limit 10;`

How do they vary?

This is what I saw when I looked at these lots

`1//DP1087606`

field	before	after	description
`{modified,start,created}_date` & `last_updated`	2019-03-26 02:21:46	2025-01-14 23:18:46	clearly one piece of data was newer than the other
shape_area	`13259.99992434575`	`13260.000083243887`	this looks like different rounding?
shape_length	`612.9018429031083`	`612.9018432012755`	this also looks like different rounding?

The others

1//DP71431, the changes are identical to the ones I mentioned above.
2//DP246207, same

Geography

lot	before	after
1//DP1087606
`1//DP71431`

After checking the 2 above rows, I stopped taking screenshots as it really seemed like each of these were just updates with different rounding.

A Theory of duplication

I think these are just same pieces of information re-entered and they just have a bit of floating point variation. I figured we can test this by seeing variation between the new and old values, however after testing it with the below query it turns out there is none trivial variance in some cases.

Query	result
`select lot_id_string, (MAX(shape_area) - MIN(shape_area)) AS area_variance from nsw_spatial_lppt_raw.lot_feature_layer group by lot_id_string order by 2 DESC LIMIT 10;`

Perhaps these are subdivisions?

Number of duplicates with variance greater than 0

I ran this query and there appears to be 106 examples of lot ids with variance greater than 0

select COUNT(*)
  from (
    select lot_id_string, (MAX(shape_area) - MIN(shape_area)) AS area_variance
      from nsw_spatial_lppt_raw.lot_feature_layer
  	  group by lot_id_string
     order by 2 DESC) t 
 where area_variance > 0;

Checking GeoPandas

Here's how I choose to render everything

import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch


df = gpd.read_postgis("""
select l.*, ROUND(v.area_variance::numeric, 2) as area_variance
  from (
    select lot_id_string, (MAX(shape_area) - MIN(shape_area)) AS area_variance
      from nsw_spatial_lppt_raw.lot_feature_layer
      group by lot_id_string
      order by 2 DESC) v
  left join nsw_spatial_lppt_raw.lot_feature_layer l using (lot_id_string)
  where area_variance > 0
  order by area_variance DESC, l.last_update DESC
""", db.engine(), geom_col='geometry')

lot_id_strings = df['lot_id_string'].unique().tolist()
resolution = 8
plot_cols, plot_rows = 11, 11

for offset in range(0, len(lot_id_strings), plot_cols * plot_rows):
    figsize = (plot_cols * resolution, plot_rows * resolution)
    fig, ax = plt.subplots(plot_cols, plot_rows, figsize=figsize)
    for subplot in range(0, min(plot_cols * plot_rows, len(lot_id_strings) - offset)):
        ax_col = subplot % plot_cols
        ax_row = subplot // plot_rows
        sub_ax = ax[ax_row][ax_col]
        lot_id = lot_id_strings[offset + subplot]
        sub_df = df[df['lot_id_string'] == lot_id].reset_index(drop=True)

        sub_df.iloc[[0]].plot(ax=sub_ax, color='none', edgecolor='red', hatch='///')
        sub_df.iloc[[1]].plot(ax=sub_ax, color='none', edgecolor='blue', hatch='\\\\\\')
        sub_ax.set_title(lot_id)
        sub_ax.set_xlabel(f"Variance {sub_df.loc[0, 'area_variance']}", loc='left')
        sub_ax.legend(handles=[
            Patch(facecolor='white', edgecolor='red', label=sub_df.loc[0, 'start_date']),
            Patch(facecolor='white', edgecolor='blue', label=sub_df.loc[1, 'start_date']),
        ], loc='lower right')
    plt.show()

And this is the result.

differences

It's only really the top rows that have any significant difference. The rest seem to be very minor differences.

Conclusion?

I think I'll just opt for the latest row.

Storage

Do we want more than 1 table? Should we restructure how we're currenting storing lots to reflect the fact that land is either part of a deposit lot or a strata plan with any number of lots?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data: Qualities of NSW Spatial LOT Layer

Field Formats

`lot_id_string`

Uniqueness

`lot_id_string`

How do they vary?

`1//DP1087606`

The others

Geography

A Theory of duplication

Number of duplicates with variance greater than 0

Checking GeoPandas

Conclusion?

Storage

Clone this wiki locally