
I would like to see how the curvature matrix $\mathbf{K}$ on page 7 of the paper "Regularisation Theory Applied to Neurofuzzy Modelling" (Bossley) is computed.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.3295

Any help would be greatly appreciated.

Thanks, Bruce

  • Hi Willie. Thanks for responding. What I don't understand is how the author arrived at the values that he did for the curvature matrix. For instance, how did the author arrive at a value of 6 for the (0,0) position in the matrix K on page 7? I don't understand how to compute the curvature at the centre of each basis function. (2011-05-04)
  • I think the paper doesn't specify precisely how this matrix is obtained. It says that the second derivative isn't defined for the linear splines being used and that a finite-difference approximation is used, but that approximation doesn't seem to be specified. It seems inappropriate to call this an "approximation", anyway, since the second derivative is simply not defined; using a finite difference formula like $(f(x+\Delta x)-2f(x)+f(x-\Delta x))/\Delta x^2$ won't yield an approximation, just some value that goes to infinity as $\Delta x$ goes to zero. (2011-05-04)
  • @Joriki: in the discrete setting the second difference is not too bad. However, I don't know if that is what the author did, since if *I* were to compute the second difference I would've ended up with something different. (Also, I am not sure what the notation for the square of the second derivative used in the paper actually means.) (2011-05-05)
  • @Bruce: having looked at the paper in slightly more detail, I think you'd be best off sending a note to the authors to enquire. (2011-05-05)
  • @Willie: The second difference may well not be bad for the purposes of the paper, but that doesn't make it an approximation of the second derivative :-) (2011-05-05)
  • @Willie Wong. I tried to send a note to the author, but this paper is something like 15 years old, and he has moved on. I haven't been able to track him down. (2011-05-05)
  • @joriki I've been confused about how the second derivative was computed because it doesn't exist at the knots of these linear basis functions. I came across what I believe is the methodology used in the book "Adaptive Modelling, Estimation, and Fusion from Data: A Neurofuzzy Approach". I was going to reproduce the formula given, but don't know how to embed LaTeX into these comments. I suppose if I could embed an image, I could just scan it in as well... (2011-05-05)
  • @Bruce: You can use $\TeX$ here in the comments just like in the main post; just enclose it in dollar signs. (2011-05-05)
  • @joriki Here is the formula: ${d^2 \hat{y}(x,w) \over dx^2} = \sum_{i=1}^{n} \sum_{b=1}^{p} \Big[ \prod_{j=1, j\ne i} \mu_{A_{k_{j}}^{b_{j}}}(x_{j}) \Big] w_{b} {\delta^2 \mu_{A_{k_{i}}^{b_{i}}}(x_{i}) \over \delta x_{i}^2}$ (2011-05-05)
  • @joriki The paragraph that the formula is in goes on to say: "due to the lattice structure of neurofuzzy models, the cross product terms should be sufficiently regularized by only constraining the curvature parallel to the inputs. Therefore the output's curvature can be approximated by [...]. This approximation is now a linear function of the input." (2011-05-05)
  • @Joriki Here, $p$ = the number of multivariate basis functions in the $j$th submodel, $w$ = the weight of the basis function, and $\mu$ = the membership of $x$ within that basis function. (2011-05-05)
  • @Bruce: I don't think I understand that formula; but quite apart from that, there must be something more fundamentally missing in his description of the calculation of $K$, since $K$ by the definition in (4) on p. 6 should be of the form $\mathbf k_i\mathbf k_i^T$, so its diagonal entries should be squares of curvatures, so the diagonal entry $6$ would have to arise from squaring $\sqrt{6}$, and there's nothing in any of this that indicates how one might arrive at a curvature of $\sqrt{6}$. I can sort of make sense of the values in the middle ($16$, $-4$), but something else must be going on. (2011-05-05)
  • @joriki: the 6 probably arises from 4 + 1 + 1: $\mathbf{k}_i$ is a vector, so $\mathbf{k}_i\mathbf{k}_i^T$ is a square matrix whose diagonal entries are squares, for every $i$. Summing over several such $\mathbf{k}_i\mathbf{k}_i^T$ can give 6 as 4 + 1 + 1. (2011-05-05)
  • @Willie: You're right, of course; I hadn't properly understood the formula. (2011-05-05)

1 Answer


Okay, a bit of a guess. Writing $\mathbf{K} = \sum_{i = 0}^8\mathbf{k}_i\mathbf{k}_i^T$ we have, solving a system of linear equations, the following set of $\mathbf{k}_i$'s (defined up to $\pm$ sign):

$$\begin{cases} \mathbf{k}_0^T &= [ 2, 0, 0, 0, 0, 0, 0, 0, 0 ] \\ \mathbf{k}_1^T &= [ 1, -1, 1, 0, 0, 0, 0, 0, 0 ] \\ \mathbf{k}_2^T &= [ 0, 0, 2, 0, 0, 0, 0, 0, 0 ] \\ \mathbf{k}_3^T &= [ 1, 0, 0, -1, 0, 0, 1, 0, 0 ] \\ \mathbf{k}_4^T &= [ 0, 1, 0, 1, -4, 1, 0, 1, 0 ] \\ \mathbf{k}_5^T &= [ 0, 0, 1, 0, 0, -1, 0, 0, 1 ] \\ \mathbf{k}_6^T &= [ 0, 0, 0, 0, 0, 0, 2, 0, 0 ] \\ \mathbf{k}_7^T &= [ 0, 0, 0, 0, 0, 0, 1, -1, 1 ] \\ \mathbf{k}_8^T &= [ 0, 0, 0, 0, 0, 0, 0, 0, 2 ] \end{cases}$$

(Note that in the paper $\mathbf{K}$ should be a symmetric matrix, but there is a typo in the matrix on page 7: the element in the 3rd row, 6th column should be $-1$, like the element in the 6th row, 3rd column; in the paper that entry is $0$ instead.)
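
For anyone who wants to check this numerically, here is a minimal NumPy sketch (my own verification, not anything from the paper) that stacks the $\mathbf{k}_i$ above as rows and forms $\mathbf{K} = \sum_i \mathbf{k}_i\mathbf{k}_i^T$; its $(0,0)$ entry comes out as $6$ and its centre entry as $16$, the values discussed in the comments.

```python
import numpy as np

# The nine k_i vectors listed above, one per lattice node, stacked as rows.
k = np.array([
    [ 2,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 1, -1,  1,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  2,  0,  0,  0,  0,  0,  0],
    [ 1,  0,  0, -1,  0,  0,  1,  0,  0],
    [ 0,  1,  0,  1, -4,  1,  0,  1,  0],
    [ 0,  0,  1,  0,  0, -1,  0,  0,  1],
    [ 0,  0,  0,  0,  0,  0,  2,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  1, -1,  1],
    [ 0,  0,  0,  0,  0,  0,  0,  0,  2],
])

# K = sum_i k_i k_i^T, i.e. K = k^T k with the k_i stacked as rows.
K = k.T @ k

print(K)
print(K[0, 0], K[4, 4])          # 6 and 16, as discussed in the comments
assert K[2, 5] == K[5, 2] == -1  # the (3rd row, 6th column) entry noted above
```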

This set is essentially unique if you consider $\mathbf{k}_i$ to be something like the second difference, in that only adjacent nodes on the lattice can influence each other.

Now, the expression for $\mathbf{k}_4^T$ quite clearly arose from the second difference formula on a planar square lattice. And one clearly gets $\mathbf{k}_{\{2,6,8\}}$ from $\mathbf{k}_{0}$ via symmetry transform of the lattice (by rotation), and similarly $\mathbf{k}_{\{3,5,7\}}$ from $\mathbf{k}_1$.

However, for the second difference formula on a finite grid, I would've written

$$ \mathbf{k}_0^T = [-2,1,0,1,0,0,0,0,0] $$

and

$$ \mathbf{k}_1^T = [1, -3, 1, 0, 1, 0, 0, 0, 0] $$

which would change the diagonal elements to $[ 6, 12, 6, 12, 20, 12, 6, 12, 6]$ instead of what was given.

So I am guessing that some sort of unspecified boundary condition is imposed. Interestingly, this boundary condition seems to be compatible with the one that generates the $\mathbf{K}$ matrix on page 33 of the same document, this time for the one-dimensional problem. (Normally, for the second difference on the lattice of integers, you write $$\Delta^2 f(x) = f(x+1) + f(x-1) - 2 f(x)$$ Taking the Cartesian product gives you the corresponding formula on the planar lattice $$\Delta^2 f(x,y) = f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1) - 4 f(x,y)$$ But if you impose the "boundary condition" that, on the nonnegative integers, the second difference at any interior point $x \geq 1$ is defined as above, but at the boundary $x=0$ as $$\Delta^2 f(0) = f(0)$$ then the Cartesian product will give you that on an "edge" of the planar lattice with boundary you have $$\Delta^2 f(0,y) = f(0,y+1) + f(0,y-1) - f(0,y)$$ and on a "corner" $$\Delta^2 f(0,0) = 2f(0,0)$$ which agrees with the $\mathbf{k}_i$ as written above. Why this boundary condition is chosen I cannot tell you. The expression "I would've written" corresponds to the Neumann condition, where on the boundary $\Delta^2 f(0) = f(1) - f(0)$.)
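
To make that parenthetical concrete, here is a small NumPy sketch of the construction (again my own reconstruction, not the authors' code): build the one-dimensional second-difference matrix with the boundary rows $\Delta^2 f(0) = f(0)$ and $\Delta^2 f(2) = f(2)$, and take the Cartesian product via Kronecker products. The rows of the resulting operator reproduce the $\mathbf{k}_i^T$ listed above, and swapping in the Neumann-style boundary rows gives the alternative diagonal $[6, 12, 6, 12, 20, 12, 6, 12, 6]$.

```python
import numpy as np

I = np.eye(3, dtype=int)

# 1D second difference on {0, 1, 2} with the boundary rows
# D^2 f(0) = f(0) and D^2 f(2) = f(2); the middle row is the usual
# [1, -2, 1] stencil.
D = np.array([[1,  0, 0],
              [1, -2, 1],
              [0,  0, 1]])

# "Cartesian product" on the 3x3 lattice: act along x and along y.
# Nodes are ordered (0,0), (1,0), (2,0), (0,1), ..., (2,2).
L = np.kron(I, D) + np.kron(D, I)
print(L)                    # each row is one k_i^T from the list above
print(np.diag(L.T @ L))     # diagonal of K for this boundary condition

# Neumann-style variant ("I would've written" above):
# D^2 f(0) = f(1) - f(0) and D^2 f(2) = f(1) - f(2).
DN = np.array([[-1,  1,  0],
               [ 1, -2,  1],
               [ 0,  1, -1]])
LN = np.kron(I, DN) + np.kron(DN, I)
print(np.diag(LN.T @ LN))   # [6, 12, 6, 12, 20, 12, 6, 12, 6]
```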

Unfortunately this is as much as I can help. To say more will require me somehow reading the mind of the author.


Edit with more details as requested by OP

Let me re-write the mathematical notation, which I hope will clear up the issues a bit. The goal is to compute the "total squared curvature", which is used in some fashion to train the modeling scheme that approximates the data; we don't need to know exactly how. The point is that in your scheme you are given a list of basis functions $\chi_0, \chi_1, \ldots, \chi_N$ which are centered at a bunch of (vector) points $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_N$. The output function $f(\mathbf{x}; w_0, w_1, \ldots, w_N)$ depends on a bunch of weights, and in fact for our purposes we consider the linear case

$$ f(\mathbf{x};w_0, w_1, \ldots, w_N) = \sum_{j = 0}^N w_j \chi_j(\mathbf{x}) $$

We define the total curvature of the function $f$ associated to the weights $w_0, \ldots, w_N$, to be the sum

$$ \text{Total curvature}(w_0, w_1, \ldots, w_N) = \sum_{k = 0}^N \sum_{i,j} \left| \frac{\partial^2 f}{\partial x_i \partial x_j} (\mathbf{x}_k; w_0, w_1, \ldots, w_N ) \right|^2 $$

The paper asserts that it is reasonable to ignore the off-diagonal terms in the Hessian matrix, and so the quantity you want to compute is actually

$$ \text{Total curvature redux}(w_0, w_1, \ldots, w_N) = \sum_{k = 0}^N \left| \triangle f(\mathbf{x}_k; w_0, w_1, \ldots, w_N ) \right|^2 $$

where $\triangle$ is the Laplacian. The important thing is that the total curvature is computed from the output function $f$, yet is a function of the weights.

Now, going back to the representation of the function $f$ as a linear combination of the functions $\chi_j$ with coefficients $w_j$, it becomes clear that the derivatives of $f$ should also depend on $w_j$ linearly: at a fixed point $\mathbf{x}$, you have

$$ \triangle f(\mathbf{x}; w_0, w_1, \ldots, w_N) = \sum_{j = 0}^N w_j [\triangle \chi_j(\mathbf{x})] $$

(since $\mathbf{x}$ is fixed, $\triangle \chi_j(\mathbf{x})$ is just some number). Or, in matrix notation, we can write

$$ \triangle f(\mathbf{x}; w_0, w_1, \ldots, w_N) = \begin{bmatrix} \triangle \chi_0 (\mathbf{x}) & \cdots & \triangle \chi_N(\mathbf{x}) \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_N \end{bmatrix} $$

And so

$$ |\triangle f(\mathbf{x}; \ldots)|^2 = \left(\begin{bmatrix} w_0 & \cdots & w_N \end{bmatrix}\begin{bmatrix} \triangle \chi_0 (\mathbf{x}) \\ \vdots \\ \triangle \chi_N(\mathbf{x}) \end{bmatrix}\right)\cdot \left( \begin{bmatrix} \triangle \chi_0 (\mathbf{x}) & \cdots & \triangle \chi_N(\mathbf{x}) \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_N \end{bmatrix}\right) $$

Using the associativity of matrix multiplication, we can multiply the middle two factors together first. To clean up the notation, we write

$$ \mathbf{k}(\mathbf{x}) = \begin{bmatrix} \triangle \chi_0 (\mathbf{x}) \\ \vdots \\ \triangle \chi_N(\mathbf{x}) \end{bmatrix} \qquad \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ \vdots \\ w_N \end{bmatrix} $$

we have that

$$ |\triangle f(\mathbf{x}; \ldots)|^2 = \mathbf{w}^T (\mathbf{k}(\mathbf{x}) \mathbf{k}(\mathbf{x})^T) \mathbf{w} $$

Now plug this into the formula for the "total curvature redux": you see that what is written as $\mathbf{k}_i$ is nothing more than the vector $\mathbf{k}(\mathbf{x})$ evaluated at $\mathbf{x} = \mathbf{x}_i$.

In words, the $j$th entry of the vector $\mathbf{k}_i$ is "the 'curvature' (second derivative) of the basis function $\chi_j$ evaluated at the point $\mathbf{x}_i$", so that when multiplied by the weight $w_j$, it indicates "the contribution, made by the basis function $\chi_j$ at weight $w_j$, to the curvature of the output function $f$ at the point $\mathbf{x}_i$."

In the case where you are working over a discrete lattice and the set $\mathbf{x}_0, \ldots, \mathbf{x}_N$ exhausts all points in the domain, one can show that $\mathbf{k}_i$ actually is a "matrix representation" of the second difference operator at the point $\mathbf{x}_i$. In the more general case where $\mathbf{x}_0, \ldots, \mathbf{x}_N$ do not exhaust all the points in the domain (as would be the case when working over a continuum), you would have to compute the $\mathbf{k}_i$ vectors using the method outlined above.
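
As a sanity check on this derivation (again my own sketch, reusing the lattice operator from the first part of the answer), one can verify numerically that $\mathbf{w}^T \mathbf{K} \mathbf{w}$ equals $\sum_i |\triangle f(\mathbf{x}_i)|^2$ for arbitrary weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Second-difference operator on the 3x3 lattice, built as above;
# row i of L is k_i^T, so L @ w gives the discrete Laplacian of f
# at every node (on this lattice f(x_i) = w_i).
I = np.eye(3, dtype=int)
D = np.array([[1, 0, 0], [1, -2, 1], [0, 0, 1]])
L = np.kron(I, D) + np.kron(D, I)
K = L.T @ L                     # K = sum_i k_i k_i^T

w = rng.standard_normal(9)      # arbitrary weights
lap_f = L @ w                   # Laplacian of f at the nine nodes

total_curvature = np.sum(lap_f ** 2)
quadratic_form = w @ K @ w
assert np.isclose(total_curvature, quadratic_form)
print(total_curvature, quadratic_form)
```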

  • @Willie I haven't been able to reproduce the values you arrived at for the $k_0^T$ and $k_1^T$ equations. I appear to be getting some signs wrong. For example, I'm getting [2, -1,...] as the first two values of $k_0^T$ (with the 'boundary condition' you laid out). Would it be possible for you to show an example of how you arrived at those values? I hope this isn't too much of an imposition. (2011-05-09)
  • For $k_0^T$, it is on a corner, so on both "$x$" and "$y$" directions you apply the boundary condition. So $\Delta^2 f(0,0) = f(0,0) + f(0,0)$. For $k_1^T$, it is on an edge, so you only apply the boundary condition to the "$y$" direction, and so $\Delta^2 f(1,0) = f(1,0) + ( f(0,0) + f(2,0) - 2 f(1,0))$. (2011-05-09)
  • Applying "my" boundary condition, again you apply it for both $x$ and $y$ directions for $k_0^T$, and so $\Delta^2f(0,0) = (f(1,0) - f(0,0)) + (f(0,1) - f(0,0))$ and so on. (Note that here the $k_i^T$ are the transformations that send the values $[f(0,0), f(1,0), f(2,0), f(0,1), f(1,1), f(2,1), f(0,2), f(1,2), f(2,2)]$ to the second differences, so they are filled with the coefficients in the second difference formulae.) (2011-05-09)
  • @Willie Wong. Aren't the values of the f() just the basis functions evaluated at the knots? That is, they take on a value of 1 or 0 at each node location? If so, then what I've been doing is $\Delta^2 f(0,0) = f(0,0) + f(0,0) = 1 + 1 = 2$ and $\Delta^2 f(1,0) = f(1,0) + (f(0,0) + f(2,0) - 2f(1,0)) = 1 + 0 + 0 - 2(1) = 1 - 2 = -1$. Obviously, I have a fundamental thinking error. (2011-05-09)
  • @Bruce $f$ is composed of many basis functions. So if the weight of the basis function at $(0,0)$ is $w_0$, and at $(1,0)$ is $w_1$, and so on, you have $f(0,0) = w_0$ and $f(1,0) = w_1$ and so on (using the fact that your basis functions take a value of 1 at those points and zero elsewhere), so $\Delta^2 f(1,0) = w_1 + ( w_0 + w_2 - 2 w_1 )$. If you take the second variation _of the basis function_ you will get what you just wrote. But that is not what you want, I don't think. What you want is the function $f$, which is represented as a sum of basis functions with some weights. Does this help? (2011-05-10)
  • @Willie Wong. What I understand is that the output is a sum of weighted basis functions, that the basis function at each node is the product of the individual basis functions in each dimension, and that I need the weights (derived through a training process) in order to compute the curvature (penalty term). What I am not grasping is how the individual k vectors are computed in order to compute the K matrix, in order to use the weights to compute the penalty term. You don't have the weights that I'm using, but you were still able to compute the k vectors. How EXACTLY did you do this? Thanks (2011-05-10)
  • @Willie Wong. I was hoping you could expand on what you said a few comments ago. Pretend I'm even MORE clueless than I appear ;-) (2011-05-10)
  • @Bruce. The weights are only needed for derivation purposes; they are not present in the final vector/matrix. The point is that by construction you know that the curvature will be a linear function in the weights (the total, or squared, curvature penalty would then be quadratic in the weights). What I wrote above is demonstrating precisely that. And the idea is that any linear function can be written as "dot product against a vector". See the edit above for more details. (2011-05-11)
  • @Willie Wong. Thanks for the additional details. I now get it. I really appreciate your help. (2011-05-11)