This document intended to get you up to speed with using
rsgeo
and comfortable branching out from using sf more
generally.
Understanding sf
Most R users that do geospatial analysis are familiar with sf and
understand spatial workflows in that context only. Many users of
sf
view it as this magical data frame that lets you do
spatial analysis—and that is exactly how it feels! This means that sf
has done a great job in making spatial analysis feel a lot easier for
the vast majority of R users.
But sf
needs to be understood in a more fundamental way.
sf
is named after the Simple Feature Access Standard.
Simple features are an agreed upon way to represent “geometric
primitives”—things like points, linestrings, polygons, and their multi-
types. The sf
package builds a hierarchy off of these. We
have sfg
, sfc
, and sf
objects.
sfg
objects
At the core is the sfg
class which are representations
of simple feature geometries. sfg
class objects are
representations of a simple feature. They are a single geometry at a
time. We can think of them as a scalar value (even though R does not
have the concept of a scalar).
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
# create a point
pnt <- st_point(c(0, 10))
# create a line
ln <- st_linestring(matrix(c(0, 1, 0, 0), ncol = 2))
class(pnt)
#> [1] "XY" "POINT" "sfg"
class(ln)
#> [1] "XY" "LINESTRING" "sfg"
Each scalar value can be used independently which is useful in itself.
st_length(ln)
#> [1] 1
sfg
objects are very simple objects constructed of
numeric vectors, matrices, and lists of matrices. They also have no
sense of a coordinate reference system (CRS).
But more often we want to have a vector of geometries. A vector of
geometries is stored in an sfc
object.
sfc
objects
sfc
is short for simple feature column. Under the hood,
this is just a list of sfg
objects. You might think that
you could create a vector of geometries by combining them using
c()
but that would be wrong.
c(pnt, pnt)
#> MULTIPOINT ((0 10), (0 10))
c(pnt, ln)
#> GEOMETRYCOLLECTION (POINT (0 10), LINESTRING (0 0, 1 0))
sfg
objects behave like scalars so combining them
creates either a multi- type of the sfg
or a geometry
collection (another type of simple feature).
To create a vector of geometries you must use st_sfc()
.
st_sfc()
is the construct for creating a “simple feature
geometry list column.” It lets you also set a number of attributes that
are associated with the vector.
st_sfc(pnt, pnt)
#> Geometry set for 2 features
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 0 ymin: 10 xmax: 0 ymax: 10
#> CRS: NA
#> POINT (0 10)
#> POINT (0 10)
sfc
objects have attributes such as a CRS, bounding box,
and precision. sfc
objects behave more like the vectors
that you are familiar with .
c(
st_sfc(pnt, pnt),
st_sfc(pnt),
st_sfc(ln)
)
#> Geometry set for 4 features
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: 0 ymin: 0 xmax: 1 ymax: 10
#> CRS: NA
#> POINT (0 10)
#> POINT (0 10)
#> POINT (0 10)
#> LINESTRING (0 0, 1 0)
Since sfc
objects are vectors, they can be included as a
column in a data frame.
df <- data.frame(
geo = st_sfc(pnt, pnt)
)
df
#> geometry
#> 1 POINT (0 10)
#> 2 POINT (0 10)
class(df)
#> [1] "data.frame"
Having geometry included in a data frame is a huge win for the R community because this means they can included attributes along with the geometries.
For example we can create a vector of points from the x
and y
columns from the diamonds
dataset in
ggplot2
.
data(diamonds, package = "ggplot2")
pnts <- purrr::map2(
diamonds$x,
diamonds$y,
function(.x, .y) st_point(c(.x, .y))
) |>
st_sfc()
We can included this in a data frame
library(dplyr, warn.conflicts = FALSE)
dmnd <- diamonds |>
select(price, clarity) |>
bind_cols(geometry = pnts)
head(dmnd)
#> # A tibble: 6 × 3
#> price clarity geometry
#> <int> <ord> <POINT>
#> 1 326 SI2 (3.95 3.98)
#> 2 326 SI1 (3.89 3.84)
#> 3 327 VS1 (4.05 4.07)
#> 4 334 VS2 (4.2 4.23)
#> 5 335 SI2 (4.34 4.35)
#> 6 336 VVS2 (3.94 3.96)
Now we have a tibble with price, clarity, and geometry! Huge win! What if we wanted to calculate the average price by clarity and keep the geometries?
dmnd |>
group_by(clarity) |>
summarise(avg_price = mean(price))
#> # A tibble: 8 × 2
#> clarity avg_price
#> <ord> <dbl>
#> 1 I1 3924.
#> 2 SI2 5063.
#> 3 SI1 3996.
#> 4 VS2 3925.
#> 5 VS1 3839.
#> 6 VVS2 3284.
#> 7 VVS1 2523.
#> 8 IF 2865.
Well, we lose the geometry just like we would for any other column
that wasn’t included in the summarise()
call. To keep it,
we would need to perform an operation on the geometry itself.
dmnd |>
group_by(clarity) |>
summarise(
avg_price = mean(price),
geometry = st_union(geometry)
)
#> # A tibble: 8 × 3
#> clarity avg_price geometry
#> <ord> <dbl> <MULTIPOINT>
#> 1 I1 3924. ((4.33 4.29), (4.33 4.36), (4.36 4.33), (4.38 4.42), (4.39 …
#> 2 SI2 5063. ((0 0), (0 6.62), (3.79 3.75), (3.84 3.82), (3.87 3.85), (3…
#> 3 SI1 3996. ((3.88 3.84), (3.89 3.84), (3.9 3.85), (3.93 3.96), (3.95 3…
#> 4 VS2 3925. ((0 0), (3.73 3.68), (3.73 3.71), (3.74 3.71), (3.76 3.73),…
#> 5 VS1 3839. ((0 0), (3.83 3.85), (3.84 3.87), (3.86 3.89), (3.88 3.9), …
#> 6 VVS2 3284. ((3.83 3.86), (3.85 3.89), (3.85 3.9), (3.85 3.91), (3.86 3…
#> 7 VVS1 2523. ((0 0), (3.83 3.85), (3.87 3.9), (3.88 3.95), (3.88 3.99), …
#> 8 IF 2865. ((3.86 3.88), (3.89 3.9), (3.91 3.95), (3.92 3.94), (3.93 3…
Well, wouldn’t it be nice if the geometry knew to do that? Well, that
is exactly what sf
objects are.
sf
objects
sf
objects are simply just data frames with a geometry
column that is sticky and smart. We can create an sf
object
if an sfc
column is present in a data frame by using
st_as_sf()
.
dmnd_sf <- st_as_sf(dmnd)
Doing this creates an object of class sf
. The two things
that make an sf
object so special are the class
sf
and the the attribute sf_column
.
attr(dmnd_sf, "sf_column")
#> [1] "geometry"
This attribute tells us which column is the geometry. Because we have
this sf
can implement its own methods for common functions
like select()
, mutate()
,
aggregate()
, group_by()
, etc which always keep
the attr(x, "sf_column")
attached to the data frame.
Having the class and attribute allow methods like
summarise()
to be written for the class itself and handle
the things like unioning geometry for us.
dmnd_sf |>
group_by(clarity) |>
summarise(avg_price = mean(price))
#> Simple feature collection with 8 features and 2 fields
#> Geometry type: MULTIPOINT
#> Dimension: XY
#> Bounding box: xmin: 0 ymin: 0 xmax: 10.74 ymax: 58.9
#> CRS: NA
#> # A tibble: 8 × 3
#> clarity avg_price geometry
#> <ord> <dbl> <MULTIPOINT>
#> 1 I1 3924. ((4.33 4.29), (4.33 4.36), (4.36 4.33), (4.38 4.42), (4.39 …
#> 2 SI2 5063. ((0 0), (0 6.62), (3.79 3.75), (3.84 3.82), (3.87 3.85), (3…
#> 3 SI1 3996. ((3.88 3.84), (3.89 3.84), (3.9 3.85), (3.93 3.96), (3.95 3…
#> 4 VS2 3925. ((0 0), (3.73 3.68), (3.73 3.71), (3.74 3.71), (3.76 3.73),…
#> 5 VS1 3839. ((0 0), (3.83 3.85), (3.84 3.87), (3.86 3.89), (3.88 3.9), …
#> 6 VVS2 3284. ((3.83 3.86), (3.85 3.89), (3.85 3.9), (3.85 3.91), (3.86 3…
#> 7 VVS1 2523. ((0 0), (3.83 3.85), (3.87 3.9), (3.88 3.95), (3.88 3.99), …
#> 8 IF 2865. ((3.86 3.88), (3.89 3.9), (3.91 3.95), (3.92 3.94), (3.93 3…
Note that this is just like what we wrote earlier except that we
didn’t have to handle the geometry column manually. If we compare these
two approaches and ignore attribute difference like
sf_column
and class
we can see that they are
identical.
Freeing yourself
Knowing how sf
works can be helpful for understanding
how we can ease our reliance on it for all geospatial operations.
Instead of thinking of the entire sf data frame as the thing that
handles all geometry operations we now know that it is the
sfc
geometry column.
In R there are different ways of representing geometry besides
sfc
vectors. The packages s2
,
geos
, wk
, and rsgeo
all provide
different vectors of geometries that can be used.
These libraries are very handy for doing geometric operations. Each
of these packages tend to be better at one thing than another and each
have their place. You will often find big speed improvements if you use
these libraries instead of sf
for certain things.
Take for example calculating the length of linestrings on a geodesic.
We can use the roxel
dataset from
sfnetworks.
data(roxel, package = "sfnetworks")
To illustrate we can extract the geometry column and cast it as an
rsgeo
class object.
We can then use the functions st_length()
and
length_haversine()
respectively to compute the length of
linestrings.
bench::mark(
st_length(geo),
length_haversine(rs),
check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 st_length(geo) 2.28ms 2.37ms 409. 2.89MB 17.1
#> 2 length_haversine(rs) 322.55µs 756.37µs 1301. 8.8KB 0
This is markedly faster when using rsgeo. We also do not have to extract the vector and work with it as its own object. Any vector can be included in a data frame, remember?
roxel |>
as_tibble() |>
mutate(
geometry = as_rsgeo(geometry),
length = length_haversine(geometry)
)
#> # A tibble: 851 × 4
#> name type
#> <chr> <fct>
#> 1 Havixbecker Strasse residential
#> 2 Pienersallee secondary
#> 3 Schulte-Bernd-Strasse residential
#> 4 NA path
#> 5 Welsingheide residential
#> 6 NA footway
#> 7 NA footway
#> 8 NA path
#> 9 NA track
#> 10 NA track
#> # ℹ 841 more rows
#> # ℹ 2 more variables: geometry <r_LINEST>, length <dbl>