Garbage Can Compiling to Categories with Inspectable Lambdas

There are a couple kinds of functions that we can turn into totally inspectable data.

Linear functions can be reconstituted into a matrix if you give a basis of vectors.

Functions from enumerable types can be turned into a lookup table

Sufficiently polymorphic functions are another example though. forall a. a-> a is commonly known to only be id. The same goes for fst = forall a b. (a,b)->a and snd and swap and all the nesting of . These functions have exactly one inhabiting value (excluding internal churning and the possibility of going into an infinite loop).

So the type directly tells us the implementation

forall a. (a,a)->a is similar. It can only be fst or snd. Types that reuse a type parameter in the input can only be permutations.

I’ve been trying to find a way to take a written lambda and convert it to data automatically and have been having trouble.

An opaque type that we have hidden the contructors to is the same (T,T)->T can only be fst or snd specialized to T since we can’t possibly destruct on T.

We can figure out which one by giving a labeled example to that function and then inspecting a single output.  This gives the permutation and duplication that was done.

Similarly for T -> Either T T

Once we have this, we can (Hopefully) reinterpret this lambda in terms of a monoidal category.



What about TH? Also the new quantified constraints extensions might be helpful?



Ok. A Different approach. This works much better to what I had in mind. you can write aribatrary (\(x,y,) -> (y,x)) tuple like lambdas and it will convert them to a category. I really had to hack around to get the thing to compile. Like that Pick typeclass, what the heck? Why can I get defaults values in type families but not in typeclasses?

It is all decidedly not typesafe. You can get totally nonsensical things to compile to something. However if you stick to lambdas, you’ll be ok. Maybe.




Pytorch Trajectory Optimization

Trajectory optimization is cool. The idea is to take a dynamical problem as a big ole optimization problem, finding the best actions to take to achieve your goals or maximize a reward.

There are a couple of flavors of trajectory optimization (shooting methods, collocation methods)

PyTorch gives a pretty low overhead extension to Numpy that also gives autodifferentiation. It is mainly intended as a neural network library, for which it has a number of facilities.

Gradient Descent is not the preferred method for these problems (According to Boyd’s Convex optimization course). Gradient Descent has shit convergence compared to newton iteration, but is very flexible and easy to implement.

In addition, using a normal ODE solver from Scipy would be much more stable, but it would require cleverness to have the same code work for both scipy and the torch parts. So screw it.

One nicety of this approach is that we don’t even have to have our derivatives solved for. They could be all tied up in a

I thought that maybe I could just weight the dynamics cost enough to have it require the dynamics be satisfied, but that did not seem to work. Maybe with more fiddling? On further review my code had massive bugs in it. I’m not sure that the dynamics cost version wouldn’t work, but the Lagrange multiplier method seems to work well and makes sense too.

In this formulation, we can also train some kind of parametrized controller function f_w(x) by sampling some random initial starting conditions (or even dynamical parameters like mass and length etc, or noise forces). This is quite nice.

Additional bits that may be nice: Backtracking line search, logarithmic potential for inequalities, I wonder if a symplectic style interleaving of position and momentum might be nice even for this global case. Should definitely just tie up all the vars into a single x. Can we use a lagrangian or hamiltonian and then have pytorch differentiate that? It may in fact be nice to use some combinator to be able to hand the same function to ODEInt for a couple reasons (getting good initilizations  of the path for example).

For a simple system, I’m using \dot{x}=v , \dot{v}=f , where you get to control f at every time point and x is starting at 0 and wants to get to 1. I’m using a simple scheme of finite difference in time for the time derivative. x and v are defined at t and f, lx, lv are defined at the half time steps t + \frac{1}{2}. You need at least two time steps to get a derivative. I’m adding a square cost to the force, otherwise it would just get a huge force. lx and lv are Lagrange multipliers enforcing the equations of motion at each time step

Here was an initial pass (just here for for historical reasons, look at the updated one below. This one does not work as is)


goofed up a couple things (inlcuding my xres making no sense. You need to explicility zero gradients. Pretty annoying). Lagrange multiplier method makes total sense.

Could we use a Hamiltonian and use autograd to derive equations of motion? Seems plausible and convenient.

Can I make a custom pytorch layer for sparse Hessians? The data oriented viewpoint would have you pump the gradient and hessian backward. Or could you automatically build an H-matrix structure for the hessian of convnets?

Put a neural controller in there. Each batch could have randomized parameters, noise, and initial conditions.

Is rebuilding total_cost every time bad?



CartPole Maths

Our approach to the Cartpole is not black box. A Cartpole is a pretty simple system all things considered.

The first thing to do is derive the equations of motion. Originally I was using the Lagrangian for the system and deriving the equations of motion that way, which includes the back reaction of the pole back on the acceleration of the cart for example.

But the motor complicates things. I have a tough time in general modeling motors. What physical thing does a command to our motor driver correspond to? Velocity? Power? Torque? I have guesses. The easiest thing to do is just build and measure. It turns out for us that the commands are basically velocity control.

So in our case the back reaction is basically irrelevant. We have direct control over the cart velocity. In that case, one can use some easy enough hand waving to get the equations of motion.

Let’s set the angle \theta = 0 at pole down with positive angle going counter clockwise. We can assume that gravity acts at the center of mass of the pole, which is at the midpoint. This gives a torque \tau = -mg \frac{L}{2} \sin(\theta). One way to see the negative sign is that a slight positive angle should give a negative torque returning it back to the down position. The moment of inertia of the pole is mL^2/3. You can look this up as a pole around one of it’s ends or derive it from I =\int dm r^2. The \frac{1}{3} comes from the integration of the r^2. Putting these together we get mL^2/3 \ddot{\theta} = -mg  \frac{L}{2} \sin(\theta) .

Now we need to actually put in the cart stuff. The cart puts the pole in an accelerating frame, where basically you have a new component of gravity that points horizontally. This adds a torque -ma  \frac{L}{2} \cos(\theta) . As far as all of the signs go, honestly we just fiddled with them until it worked.

Now that we have all that in hand, we can talk about the Linear Quadratic Regulator (LQR) control.

The model that LQR uses is that the equations of motion are linear and the cost function that you want to minimize are quadratic in the controls u and state x. This is plausibly tractable because Quadratic and linear stuff is usually ok. I’m serious.

These letter choices for the various bits are pretty standard.

cost = \int x^TQx + u^TRu dt


If you just look at the wikipedia page, you can already just plug and chug to the solution.

u = -Kx

K= R^{-1} B^T P

A^TP + PA - PBR^{-1}B^TP + Q = 0 

Jesus that equation looks like shit.

which is QUITE CONVENIENTLY solved by the following scipy function



Now given this shit, we need to approximate our nonlinear equations of motion with linear equations in order to use the LQR framework.

mL^2/3 \ddot{\theta} = -m \frac{L}{2}( g\sin(\theta) + a \cos(\theta))

Near the top where we are linearizing

\delta\theta = \theta - \pi




mL^2/3 \ddot{\delta\theta} \approx m \frac{L}{2}(g\delta\theta+a)

We can move some of those constants to the other side to get \ddot{\delta\theta} by itself.

\ddot{\delta\theta} \approx \frac{3}{2L}( g\delta\theta + a)

Another thing you have to do is massage these second order equations into first order form. You do this by introducing a new state variable \omega

\dot{\delta\theta} = \omega

\dot{\omega} \approx \frac{3}{2L}( g\delta\theta + a)

In matrix form this is

\begin{bmatrix}\dot{\delta\theta} \\ \dot{\omega}  \end{bmatrix} =    \begin{bmatrix} 0 & 1 \\    \frac{3}{2L} g & 0    \end{bmatrix}    \begin{bmatrix}\delta\theta \\ \omega \end{bmatrix}    +    \begin{bmatrix} 0 \\ \frac{3}{2L} \end{bmatrix} \begin{bmatrix} a \end{bmatrix}  

In addition to this, it is nice to add the cart dynamics, even though they are pretty trivial. This is because we can then add some weighting terms to discourage the cart from trying to go off the track or go faster than the motors support. There are ways to make it so that the cartpole never tries to go out of bounds, but they are a bit more complicated. I’ve got some blog posts about them.

\begin{bmatrix}\dot{\delta\theta} \\ \dot{\omega} \\ \dot{x} \\ \dot{v}  \end{bmatrix} =    \begin{bmatrix} 0 & 1 & 0 & 0 \\    \frac{3}{2L} g & 0 & 0 & 0 \\    0 & 0 & 0 & 1\\    0 & 0 & 0 & 0 \\    \end{bmatrix}    \begin{bmatrix}\delta\theta \\ \omega \\ x \\ v \end{bmatrix}    +    \begin{bmatrix} 0 \\ \frac{3}{2L} \\ 0 \\ 1 \end{bmatrix} \begin{bmatrix} a \end{bmatrix}  

So we can read off our needed matrices A and B from here.

A =    \begin{bmatrix} 0 & 1 & 0 & 0 \\    \frac{3}{2L} g & 0 & 0 & 0 \\    0 & 0 & 0 & 1\\    0 & 0 & 0 & 0 \\    \end{bmatrix}

B = \begin{bmatrix} 0 \\ \frac{3}{2L} \\ 0 \\ 1 \end{bmatrix}

Now in regards to the weighting matrices Q and R, this is a bit tougher to say what we want. We sort of want all the state variables to be small but the relative important isn’t a priori clear to me. So we picked diagonal matrices and tried out some different values. One thing to note though is that the states variables could possibly have very different scales, since their units are different. The characteristic time of the system is T=\sqrt{\frac{L}{g}}. The characteristic length is the size of our track 1000mm and the rough angle scale is \pi/8 -ish.

Now that we have our matrices we can plug it all into scipy and use it!

One thing to be careful about is that the pole can have swung around multiple times leading to the angle being some multiple of 2\pi. Our hack around this is to just take the \sin of the angle.





Analytic Center in Python using Scipy and Numpy

The analytic center for a set of inequalities \phi(x)<0 is the minimizing position of \sum -\ln (-\phi(x)) . In particular it is often used with linear inequalities. It gives a reasonable and easily computable for convex constraint function center of the region. The hessian at that point can give you a reasonable ellipse that approximates the region too (both interior and exterior approximation).

I wrote a program for linear inequalities. It is not particularly robust. First I get a feasible point using the LP solver in scipy. Then I give the appropriate gradients and Hessians to a newton conjugate gradient solver in scipy. It does return a reasonable center, but I had to fiddle with some epsilons to avoid logarithms exploding and to avoid the hessian being so big it overwhelms the gradient. Possibly a couple burn in steps of gradient descent might help, or getting a feasible point that isn’t optimal since the optimal points being on the boundary is a huge problem. If the newton solver comes back with only 1 or 2 iterations, it probably failed.


Noise and The Fluctuation Dissipation Theorem

I was looking at some slides the other day and they quoted noise power in units of \frac{W}{\sqrt{Hz}}. Being the ignoramus I am, I was wondering why it was scaled

First off, when a Watt is quoted in an electrical measurement, usually you’re measuring Voltage with an instrument with a known input impedance Z. That’s how you convert your fluctutating voltage measurement to Watts.

Second, the sqrt frequency thing? Nowadays, your measurement apparatus is probably a digital sampler and it performs an FFT giving you a spectrum. The width of your FFT is the sampling frequency roughly. Does that make sense that when you increase the width of your taken spectrum the height of the noise signal changes too? It does, but only because implicitly, most sampling circuits take an average of the signal over the same period as the sampling time. These two times are not necessarily intrinsically linked. One could have a system that takes a very fast snapshot and but can only save data or send it over a link at a much slower speed. The noise power is this snapshot time, not the data saving time. The data saving time would be the bandwidth in the FFT.

These two are engineered to be the same to avoid distortion of the frequency signal via aliasing.

But there is an even simpler way to see this. Suppose you have two measurements V1 and V2 that are the averages of time T with variance \sigma. Then the average of these two, V3, is over a time 2T. However, by the standard kind of manipulations (for Gaussian variables the squared variance of a sum = the sum of the squared variances, \sigma^2_{\sum x_i}=\sum \sigma_{x_i} ), the variance of the new signal is \sigma/\sqrt{2} which means it scales with the time window. Hence multiplying you actual measured variances by the square root of your time window gives you a time window invariant quantity.


While I was thinking about that in the car I realized that the fluctuation dissipation theorem is a mean field theory kind of thing. The fluctuation dissipation theorem feels weird and spooky, but I guess it is ultimately simple (or not).

Mean field theory tries to summarize all the complicated interactions with neighbors with a simple summary. For interacting spins, it tries to summarize as an effective B field from the surrounding spins. Then you have a 1-particle model which you can solve and try to find a self-consistent value of B. Here is a sketch in equations.

H= \sum S\cdot S - B_{ext}\cdot S \rightarrow \sum - B_{eff}\cdot  S

Z=\sum_s e^{-\beta H}

M = <S> = \partial_{\beta B} \ln(Z)

B = \alpha M

You can do something similar to find an effective permeability due to your surrounding neighbors. \partial_B M = \chi

The fluctuating force due to your neighbors is like B, a constant forcing term.

The damping is like the permeability. One may want to consider a system that starts with an intrinsic damping, that is one difference between the magnetic case and the fluctuation case, in that free space has a natural permeability but not a natural damping (I suppose there is always some damping, due to radiation and what not, but we have a tendency to totally neglect such things). One could imagine ball bearings being shaken in a cup of molasses or something. You might want to fluctuation due to being hit by other ball bearings, but consider the damping from the molasses to be the dominating damping term (the the thermal fluctuations from the molasses to be ignorable).

Another difference is that I think you really are going to need to work explicitly with time. Just the thermal average isn’t going to cut it I think (at least not conceptually. There might be some dirty tricks you can play, but a typical Hamiltonian can’t have damping terms. As I write this I am doubting it’s truth).

\ddot{x} = -\nu \dot{x}+ f

calculate some averages … Then use the self-consistency

B = \alpha M \rightarrow f = f(\hat{x})

The dissipation will be related to your correlation with your neighbors. When you are moving faster, they have to tend to move in such a way to make you slow down on average.

To Be Continued

Reverse Mode Auto Differentiation is Kind of Like a Lens

Warning: I’m using sketchy uncompiled Haskell pseudocode.

Auto-differentiation is writing a function that also computes the derivative alongside calculating its value. Function composition is done alongside applying the chain rule to the derivative part.

One way to do this is to use a “dual number”. Functions now take a tuple of values and derivatives.

The Jacobean of a function from R^n \rightarrow R^m is a m by n matrix. The chain rule basically says that you need to compose the matrices via multiplication when you compose the value functions.  This is the composition of the linear maps.

Conceptually, you initialize the process with a NxN identity matrix corresponding to the fact that $latex \partial x_i/\partial x_j=\delta_{ij}

Vectorized versions of scalar functions (maps) will often use diag

A couple points:

  1.  Since the Jacobean j is always going to be multiplied in composition, it makes sense to factor this out into a Monad structure (Applicative maybe? Not sure we need full Monad power).
  2. There is an alternative to using explicit Matrix data types for linear maps. We could instead represent the jacobeans using (Vector Double) -> Vector Double. The downside of this is that you can’t inspect elements. You need explicit matrices as far as I know to do Gaussian elimination and QR decomposition. You can sample the function to reconstitute the matrix if need be, but this is somewhat roundabout. On the other hand, if your only objective is to multiply matrices, one can use very efficient versions. Instead of an explicit dense NxN identity matrix, you can use the function id :: a -> a, which only does some minimal pointer manipulation or is optimized away. I think that since we are largely multiplying Jacobeans, this is fine.


What we’ve shown so far is Forward Mode.

When you multiply matrices you are free to associate them in any direction you like. (D(C(BA))) is the association we’re using right now. But you are free to left associate them. ((DC)B)A). You can write this is right associated form using the transpose ((DC)B)A)^T = (A^T(B^T(C^TD^T)))

This form is reverse mode auto differentiation. Its advantage is the number of computations you have to do and the intermediate values you have to hold. If one is going from many variables to a small result, this is preferable.

It is actually exactly the same in implementation except you reverse the order of composition of the derivatives. We forward compose value functions and reverse compose derivative functions (matrices).


We have CPSed our derivative matrices.

Really a better typed version would not unify all the objects into a. While we’ve chosen to use Vector Double as our type, if we could tell the difference between R^n and R^m at the type level the following would make more sense.

However, this will no longer be a monad. Instead you’ll have to specify a Category instance. The way I got down to this stuff is via reading Conal Elliott’s new Automatic Differentiation paper which heavily uses the category interface.  I was trying to remove the need to use constrained categories (it is possible, but I was bogged down in type errors) and make it mesh nice with hmatrix. Let me also mention that using the Arrow style operators *** and dup and &&& and fst, and clever currying that he mentions also seems quite nice here. The Tuple structure is nice for expressing direct sum spaces in matrices. (Vector a, Vector b) is the direct sum of those vector spaces.

Anyway, the arrows for RD are

This is a form I’ve seen before though. It is a lens. Lens’ have a getter (a -> b) that extracts b from a and a setter (a -> b -> a) that given an a and a new b returns the replaced a.

Is an automatic derivative function in some sense extracting an implicit calculable value from the original vector and returning in a sense how to change the original function? It is unclear whether one should take the lens analogy far or not.

The type of Lens’  (forall f. Functor f => (b -> f b) -> a -> f a) means that it is isomorphic to a type like DFun’. The type itself does imply the lens laws of setters and getters, so these functions are definitely not proper lawful lenses. It is just curious that conceptually they are not that far off.

The lens trick of replacing this function with a quantified rank 1 type (forall f. ) or quantified rank-2 (forall p.) profunctor trick seems applicable here. We can then compose reverse mode functions using the ordinary (.) operator and abuse convenience functions from the lens library.

Neat if true.




We have been fighting a problem for weeks. The Serial port was just not reliable, it had sporadic. The problem ended up being a surprising thing, we were using threading to receive the messages nd checking for limit switches. It is not entirely clear why but this was totally screwing up the serial port update in an unpredictable manner. Yikes. What a disaster.

After that though smoooooooth sailing.

With a slight adaptation of the previous Openai gym LQR cartpole code and a little fiddling with parameters we have a VERY stable balancer. We removed the back reaction of the pole dynamics on the cart itself for simplicity. This should be accurate when the pole vastly.

We did find that the motor is exactly velocity control in steady state with a linear response. There is a zero point offset (you need to ask for 100 out of 2046 before you get any movement at all).

We’ll see where we can get with the Lyapunov control next time.



Some random links on uses of Convex relaxations, in particular Semidefinite programming


Problems involving outer products of vector variables can be relaxed into semidefinite programs. That’s a general trick. Then the low rank bit from SVD is an approixmate solution for the vector


convex relaxation for distributed optimal control



graph matching in relation to Image correspondence

Permutation matrices have sum of rows and columns must be 1 constraint, is one relaxation.

quickMatch. Actually, not convex programming but was the root of the chain of references I ‘m digging through



Finding MaxCut approximation of a graph is a classic one



Quantum Semidefinite programming course

Density matrices have a semidefinite constrina (non negative probabilities)



Sum of Squares is a semidefinite program that can guarantee that lyapunov functions actually work



Cart Pole Trajectory optimization using Cvxpy

I’ve been watching Russ Tedrake’s underactuated robotics and been trying some stuff out. The Drake package probably does this stuff better. Also IpOpt and Snopt are the libraries that get mentioned when people want to do this stuff for real.

It’s  too slow to be practical. That is mostly the fault of the cvxpy overhead. The ECOS solver itself says that it solves each iteration in about 0.02 seconds on my macbook.

The idea is to use linearized dynamics as constraints. Then iteratively ask cvxpy to solve for us. Until hopefully it converges. This is Sequential Convex Programming

I used the discretization of the equations of motion using the mehotd described here

It is possible I did it right.

The Hermite Collocation for the trajectory and trapezoidal for the control

If we used just an absolute value cost, this all would be a linear program. Something to consider. Also CVXOPT would probably be faster.

It hurts me that the constraint and cost matrices are banded and could be solved so quickly. But the next level up in programming complexity takes a lot more work it seems to me.


This is the result

Green is cart acceleration. You can see it slamming into the maximum acceleration constraint

Blue is pole angle and orange is angular velocity. So it does a pretty good job. For some settings it is just awful though.


Making a Van Der Graaf Generator

Will has been juicin’ to make a Van der Graaf generator. Similar to the Stirling engines, we followed youtube instructions to make it out of trash (soda bottles and soda cans and tape) and failed miserably.

We were mostly following this video

Yesterday, we took the gloves off and went to home depot. We used a trash fan motor which is awesome. 3d Printed some pieces for mounting the motor and for mounting the roller on the shaft with a press fit. 

Big ole 4″ pvc is the stalk. A flange to mount to the base which is a nice circle wood from the nice wood section of . Used two ikea bowls as the globe. I think it is important to have the upper roller somewhat deeply inside the globe?

We’re using pantyhose as a belt. Sort of the waist area stretched out to be a loop. It is just rubbing on the fixed top pvc roller. Maybe wrapping that in teflon tape would make it even better? The pantyhose is pure nylon. Nylon and pvc are towards opposite ends of the triboelectric series, which is important for the thing to work, but teflon is even farther down.

We used some brushes from home depot as electrical brushes. Might be overkill.

We wrapped the lower roller in more pantyhose to make a bulge. Counterintuitively, this bulge is supposed to keep the belt centered?

We were getting pretty wimpy sparks until we installed to lower brush and properly grounded it. I guess that is really important. Now they are pretty beefy maybe a couple centimeters. You can definitely see them and they sting a bit.

All in all a success!

img_0231 img_5799 img_4759 img_6389