# QuickstartΒΆ

If you prefer to learn by diving in and getting your feet wet, then here are some cut-and-pasteable examples to play with.

First, let’s import stuff and get some data to work with:

```
In [1]: import numpy as np
In [2]: from patsy import dmatrices, dmatrix, demo_data
In [3]: data = demo_data("a", "b", "x1", "x2", "y")
```

`demo_data()` gives us a mix of categorical and numerical
variables:

```
In [1]: data
Out[1]:
{'a': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a2', 'a2'],
'b': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
'x1': array([ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799,
-0.97727788, 0.95008842, -0.15135721]),
'x2': array([-0.10321885, 0.4105985 , 0.14404357, 1.45427351, 0.76103773,
0.12167502, 0.44386323, 0.33367433]),
'y': array([ 1.49407907, -0.20515826, 0.3130677 , -0.85409574, -2.55298982,
0.6536186 , 0.8644362 , -0.74216502])}
```

Of course Patsy doesn’t much care what sort of object you store
your data in, so long as it can be indexed like a Python dictionary,
`data[varname]`. You may prefer to store your data in a pandas DataFrame, or a numpy
record array... whatever makes you happy.

Now, let’s generate design matrices suitable for regressing `y` onto
`x1` and `x2`.

```
In [1]: dmatrices("y ~ x1 + x2", data)
Out[1]:
(DesignMatrix with shape (8, 1)
y
1.49408
-0.20516
0.31307
-0.85410
-2.55299
0.65362
0.86444
-0.74217
Terms:
'y' (column 0),
DesignMatrix with shape (8, 3)
Intercept x1 x2
1 1.76405 -0.10322
1 0.40016 0.41060
1 0.97874 0.14404
1 2.24089 1.45427
1 1.86756 0.76104
1 -0.97728 0.12168
1 0.95009 0.44386
1 -0.15136 0.33367
Terms:
'Intercept' (column 0)
'x1' (column 1)
'x2' (column 2))
```

The return value is a Python tuple containing two DesignMatrix
objects, the first representing the left-hand side of our formula, and
the second representing the right-hand side. Notice that an intercept
term was automatically added to the right-hand side. These are just
ordinary numpy arrays with some extra metadata and a fancy __repr__
method attached, so we can pass them directly to a regression function
like `np.linalg.lstsq()`:

```
In [1]: outcome, predictors = dmatrices("y ~ x1 + x2", data)
In [2]: betas = np.linalg.lstsq(predictors, outcome)[0].ravel()
In [3]: for name, beta in zip(predictors.design_info.column_names, betas):
...: print("%s: %s" % (name, beta))
...:
Intercept: 0.579662344123
x1: 0.0885991903554
x2: -1.76479205551
```

Of course the resulting numbers aren’t very interesting, since this is just random data.

If you just want the design matrix alone, without the `y` values,
use `dmatrix()` and leave off the `y ~` part at the beginning:

```
In [1]: dmatrix("x1 + x2", data)
Out[1]:
DesignMatrix with shape (8, 3)
Intercept x1 x2
1 1.76405 -0.10322
1 0.40016 0.41060
1 0.97874 0.14404
1 2.24089 1.45427
1 1.86756 0.76104
1 -0.97728 0.12168
1 0.95009 0.44386
1 -0.15136 0.33367
Terms:
'Intercept' (column 0)
'x1' (column 1)
'x2' (column 2)
```

We’ll use dmatrix for the rest of the examples, since seeing the
outcome matrix over and over would get boring. This matrix’s metadata
is stored in an extra attribute called `.design_info`, which is a
`DesignInfo` object you can explore at your leisure:

```
In [1]: d = dmatrix("x1 + x2", data)
In [2]: d.design_info.<TAB>
d.design_info.builder d.design_info.slice
d.design_info.column_name_indexes d.design_info.term_name_slices
d.design_info.column_names d.design_info.term_names
d.design_info.describe d.design_info.term_slices
d.design_info.linear_constraint d.design_info.terms
```

Usually the intercept is useful, but if we don’t want it we can get rid of it:

```
In [1]: dmatrix("x1 + x2 - 1", data)
Out[1]:
DesignMatrix with shape (8, 2)
x1 x2
1.76405 -0.10322
0.40016 0.41060
0.97874 0.14404
2.24089 1.45427
1.86756 0.76104
-0.97728 0.12168
0.95009 0.44386
-0.15136 0.33367
Terms:
'x1' (column 0)
'x2' (column 1)
```

We can transform variables using arbitrary Python code:

```
In [1]: dmatrix("x1 + np.log(x2 + 10)", data)
Out[1]:
DesignMatrix with shape (8, 3)
Intercept x1 np.log(x2 + 10)
1 1.76405 2.29221
1 0.40016 2.34282
1 0.97874 2.31689
1 2.24089 2.43836
1 1.86756 2.37593
1 -0.97728 2.31468
1 0.95009 2.34601
1 -0.15136 2.33541
Terms:
'Intercept' (column 0)
'x1' (column 1)
'np.log(x2 + 10)' (column 2)
```

Notice that np.log is being pulled out of the environment where
`dmatrix()` was called – np.log is accessible because we did
`import numpy as np` up above. Any functions or variables that you
could reference when calling `dmatrix()` can also be used inside
the formula passed to `dmatrix()`. For example:

```
In [1]: new_x2 = data["x2"] * 100
In [2]: dmatrix("new_x2")
Out[2]:
DesignMatrix with shape (8, 2)
Intercept new_x2
1 -10.32189
1 41.05985
1 14.40436
1 145.42735
1 76.10377
1 12.16750
1 44.38632
1 33.36743
Terms:
'Intercept' (column 0)
'new_x2' (column 1)
```

Patsy has some transformation functions “built in”, that are automatically accessible to your code:

```
In [1]: dmatrix("center(x1) + standardize(x2)", data)
Out[1]:
DesignMatrix with shape (8, 3)
Intercept center(x1) standardize(x2)
1 0.87995 -1.21701
1 -0.48395 -0.07791
1 0.09463 -0.66885
1 1.35679 2.23584
1 0.98345 0.69899
1 -1.86138 -0.71844
1 0.06598 -0.00417
1 -1.03546 -0.24845
Terms:
'Intercept' (column 0)
'center(x1)' (column 1)
'standardize(x2)' (column 2)
```

See `patsy.builtins` for a complete list of functions made
available to formulas. You can also define your own transformation
functions in the ordinary Python way:

```
In [1]: def double(x):
...: return 2 * x
...:
```

Arithmetic transformations are also possible, but you’ll need to
“protect” them by wrapping them in `I()`, so that Patsy knows
that you really do want `+` to mean addition:

```
In [1]: dmatrix("I(x1 + x2)", data) # compare to "x1 + x2"
Out[1]:
DesignMatrix with shape (8, 2)
Intercept I(x1 + x2)
1 1.66083
1 0.81076
1 1.12278
1 3.69517
1 2.62860
1 -0.85560
1 1.39395
1 0.18232
Terms:
'Intercept' (column 0)
'I(x1 + x2)' (column 1)
```

Note that while Patsy goes to considerable efforts to take in data
represented using different Python data types and convert them into a
standard representation, all this work happens *after* any
transformations you perform as part of your formula. So, for example,
if your data is in the form of numpy arrays, “+” will perform
element-wise addition, but if it is in standard Python lists, it will
perform concatentation:

```
In [1]: dmatrix("I(x1 + x2)", {"x1": np.array([1, 2, 3]), "x2": np.array([4, 5, 6])})
Out[1]:
DesignMatrix with shape (3, 2)
Intercept I(x1 + x2)
1 5
1 7
1 9
Terms:
'Intercept' (column 0)
'I(x1 + x2)' (column 1)
In [2]: dmatrix("I(x1 + x2)", {"x1": [1, 2, 3], "x2": [4, 5, 6]})
Out[2]:
DesignMatrix with shape (6, 2)
Intercept I(x1 + x2)
1 1
1 2
1 3
1 4
1 5
1 6
Terms:
'Intercept' (column 0)
'I(x1 + x2)' (column 1)
```

Patsy becomes particularly useful when you have categorical data. If you use a predictor that has a categorical type (e.g. strings or bools), it will be automatically coded. Patsy automatically chooses an appropriate way to code categorical data to avoid producing a redundant, overdetermined model.

If there is just one categorical variable alone, the default is to dummy code it:

```
In [1]: dmatrix("0 + a", data)
Out[1]:
DesignMatrix with shape (8, 2)
a[a1] a[a2]
1 0
1 0
0 1
0 1
1 0
1 0
0 1
0 1
Terms:
'a' (columns 0:2)
```

But if you did that and put the intercept back in, you’d get a redundant model. So if the intercept is present, Patsy uses a reduced-rank contrast code (treatment coding by default):

```
In [1]: dmatrix("a", data)
Out[1]:
DesignMatrix with shape (8, 2)
Intercept a[T.a2]
1 0
1 0
1 1
1 1
1 0
1 0
1 1
1 1
Terms:
'Intercept' (column 0)
'a' (column 1)
```

The `T.` notation is there to remind you that these columns are
treatment coded.

Interactions are also easy – they represent the cartesian product of
all the factors involved. Here’s a dummy coding of each *combination*
of values taken by `a` and `b`:

```
In [1]: dmatrix("0 + a:b", data)
Out[1]:
DesignMatrix with shape (8, 4)
a[a1]:b[b1] a[a2]:b[b1] a[a1]:b[b2] a[a2]:b[b2]
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 1
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 1
Terms:
'a:b' (columns 0:4)
```

But interactions also know how to use contrast coding to avoid
redundancy. If you have both main effects and interactions in a model,
then Patsy goes from lower-order effects to higher-order effects,
adding in just enough columns to produce a well-defined model. The
result is that each set of columns measures the *additional*
contribution of this effect – just what you want for a traditional
ANOVA:

```
In [1]: dmatrix("a + b + a:b", data)
Out[1]:
DesignMatrix with shape (8, 4)
Intercept a[T.a2] b[T.b2] a[T.a2]:b[T.b2]
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
Terms:
'Intercept' (column 0)
'a' (column 1)
'b' (column 2)
'a:b' (column 3)
```

Since this is so common, there’s a convenient short-hand:

```
In [1]: dmatrix("a*b", data)
Out[1]:
DesignMatrix with shape (8, 4)
Intercept a[T.a2] b[T.b2] a[T.a2]:b[T.b2]
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
Terms:
'Intercept' (column 0)
'a' (column 1)
'b' (column 2)
'a:b' (column 3)
```

Of course you can use *other coding schemes* too (or even *define your own*). Here’s `orthogonal polynomial coding`:

```
In [1]: dmatrix("C(c, Poly)", {"c": ["c1", "c1", "c2", "c2", "c3", "c3"]})
Out[1]:
DesignMatrix with shape (6, 3)
Intercept C(c, Poly).Linear C(c, Poly).Quadratic
1 -0.70711 0.40825
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 -0.00000 -0.81650
1 0.70711 0.40825
1 0.70711 0.40825
Terms:
'Intercept' (column 0)
'C(c, Poly)' (columns 1:3)
```

You can even write interactions between categorical and numerical
variables. Here we fit two different slope coefficients for `x1`;
one for the `a1` group, and one for the `a2` group:

```
In [1]: dmatrix("a:x1", data)
Out[1]:
DesignMatrix with shape (8, 3)
Intercept a[a1]:x1 a[a2]:x1
1 1.76405 0.00000
1 0.40016 0.00000
1 0.00000 0.97874
1 0.00000 2.24089
1 1.86756 0.00000
1 -0.97728 -0.00000
1 0.00000 0.95009
1 -0.00000 -0.15136
Terms:
'Intercept' (column 0)
'a:x1' (columns 1:3)
```

The same redundancy avoidance code works here, so if you’d rather have
treatment-coded slopes (one slope for the `a1` group, and a second
for the difference between the `a1` and `a2` group slopes), then
you can request it like this:

```
# compare to the difference between "0 + a" and "1 + a"
In [1]: dmatrix("x1 + a:x1", data)
Out[1]:
DesignMatrix with shape (8, 3)
Intercept x1 a[T.a2]:x1
1 1.76405 0.00000
1 0.40016 0.00000
1 0.97874 0.97874
1 2.24089 2.24089
1 1.86756 0.00000
1 -0.97728 -0.00000
1 0.95009 0.95009
1 -0.15136 -0.15136
Terms:
'Intercept' (column 0)
'x1' (column 1)
'a:x1' (column 2)
```

And more complex expressions work too:

```
In [1]: dmatrix("C(a, Poly):center(x1)", data)
Out[1]:
DesignMatrix with shape (8, 3)
Intercept C(a, Poly).Constant:center(x1) C(a, Poly).Linear:center(x1)
1 0.87995 -0.62222
1 -0.48395 0.34220
1 0.09463 0.06691
1 1.35679 0.95939
1 0.98345 -0.69541
1 -1.86138 1.31620
1 0.06598 0.04666
1 -1.03546 -0.73218
Terms:
'Intercept' (column 0)
'C(a, Poly):center(x1)' (columns 1:3)
```