3

I want to generate a Gaussian dataset. The dataset includes a total of 800 samples drawn randomly from four two-dimensional Gaussian classes with the following distribution:

How can I do that in MATLAB? I'm not an expert in MATLAB.

  • 0
    Rather than "800 samples", one should speak of "800 observations", which might make up four samples, each containing 200 observations.2012-04-09
  • 0
    I disagree. Speaking of samples is perfectly natural and extremely common in the literature related to drawing values from a random variable. See most introductory texts on Metropolis-Hastings, Acceptance-Rejection, Gibbs sampling, etc., and the language is usually phrased in terms of "drawing samples" where 'sample' does not refer to the entire population of observations. If these data were coming from an experiment, rather than a simulation, I might be more inclined to agree, though even still it would be a mostly pedantic distinction that you won't see in the literature.2012-04-09

2 Answers

2

This question is more appropriate for Stack Overflow, but that's OK. It still is a mathematics question, after all.

The right way to do this in Matlab is to use the mvnrnd() function. It accepts a vector of the coordinate means and a covariance matrix, and returns as many draws as you request, one per row. In your case, you'd do the following:

    mi = -3;                         % Or the other values you want to use.
    mu = [mi 0];                     % The mean vector.
    cov_mat = [0.5 0.05; 0.05 0.5];  % The covariance matrix.
    num_samples = 800;               % The number of samples you want.

    % Generate the draws.
    generated_data = mvnrnd(mu, cov_mat, num_samples);

Now, the array generated_data will be an 800-by-2 matrix, where each row is a random draw from the distribution. See this link for more details.

Note that mvnrnd is part of the Matlab Statistics Toolbox. This should be a standard part of most Matlab licenses at an academic or professional institution. However, if you're using a personal copy of Matlab (such as the student version), you may not have access. In that case, there are several basic implementations of the same function available at the Matlab Central File Exchange, which you can download and use.

If that still doesn't work, let me know in a comment and I can go over the algorithm for actually drawing from a multivariate Gaussian based only on the inverse-CDF method and uniform draws.
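For reference, here is a minimal sketch of that toolbox-free approach, using only base Matlab functions (rand, erfinv, chol, repmat). It assumes the same mean and covariance as the example above; the variable names are illustrative. Standard normal values are obtained from uniform draws through the inverse normal CDF, $\Phi^{-1}(u) = \sqrt{2}\,\operatorname{erfinv}(2u-1)$, and are then correlated with a Cholesky factor of the covariance. Treat it as a sketch under those assumptions rather than a drop-in replacement for mvnrnd().

```matlab
% Sketch: draw from a 2-D Gaussian without the Statistics Toolbox.
mu    = [-3 0];                   % mean vector (value taken from the example above)
Sigma = [0.5 0.05; 0.05 0.5];     % covariance matrix
n     = 800;                      % number of draws

U = rand(n, 2);                   % uniform draws on the open interval (0,1)
Z = sqrt(2) * erfinv(2*U - 1);    % inverse-CDF transform: independent N(0,1) values
R = chol(Sigma);                  % upper triangular factor with R'*R = Sigma
X = Z * R + repmat(mu, n, 1);     % each row is one draw with mean mu, covariance Sigma
```

Each row of X plays the same role as a row of the mvnrnd() output, since a row z*R has covariance R'*R = Sigma.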

Alternatively, much of the same functionality is provided in SciPy/NumPy for Python. It's free and is a good alternative to learn given that not much practical mathematical software is ever developed in Matlab.

Added:

As you mentioned not having the Statistics Toolbox, and then downloading the linked mvg() function from the file exchange, here is the code that would work with that function:

    mi = -3;                         % Or the other values you want to use.
    mu = [mi; 0];                    % The mean as a COLUMN vector: mvg adds repmat(mu,1,N) to a 2-by-N array.
    cov_mat = [0.5 0.05; 0.05 0.5];  % The covariance matrix.
    num_samples = 800;               % The number of samples you want.

    % Generate the draws.
    generated_data = mvg(mu, cov_mat, num_samples);

This time, generated_data will be a 2-by-800 array, so each column will be a random sample, instead of the rows as listed above.

You may want to try this for a smaller number of samples, like 10 or 15, and then just print out the result by typing:

generated_data 

without a semicolon, to see what the output array looks like.
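Since the question asks for 800 samples spread over four classes, one way to put the pieces together is to draw each class separately and stack the results. The sketch below assumes an equal split of 200 draws per class and uses mvnrnd(), so it needs the Statistics Toolbox. The class means in class_means are placeholders, not values from the question (only the first one, [-3 0], appears in the example above), so replace them with the real ones.

```matlab
% Sketch: four 2-D Gaussian classes, 200 draws each, stacked into one dataset.
class_means = [-3 0; -1 0; 1 0; 3 0];   % PLACEHOLDER means, one row per class
cov_mat     = [0.5 0.05; 0.05 0.5];     % shared covariance matrix from the example
n_per_class = 200;                      % assumed equal split of the 800 samples
num_classes = size(class_means, 1);

data   = zeros(num_classes * n_per_class, 2);   % one draw per row
labels = zeros(num_classes * n_per_class, 1);   % class index of each row

for k = 1:num_classes
    rows = (k - 1) * n_per_class + (1:n_per_class);
    data(rows, :) = mvnrnd(class_means(k, :), cov_mat, n_per_class);
    labels(rows)  = k;
end
```

Keeping a labels vector alongside data makes it easy to plot or split the classes later, for example with data(labels == 2, :).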

  • 0
    Thanks a lot. I got the following error: "??? Undefined function or method 'mvnrand' for input arguments of type 'double'."2012-04-09
  • 0
    This is probably because you don't have the Statistics Toolbox. As an alternative, you can download the function [from this link](http://www.mathworks.com/matlabcentral/fileexchange/21279-mvg-multivariate-gaussian-random-number-generator) and put it into your working directory. Then you can use that function in a basically similar way as above.2012-04-09
  • 0
    I downloaded "mvg.m" from the mentioned link and added it to my working directory. But the file defines the mvg function, and I still get: "??? Undefined function or variable 'mvnrand'." 2012-04-09
  • 0
    You've got to try a little harder. Obviously if you're downloading a new function named `mvg`, then using `mvnrand` isn't sensible. It should be clear that downloading a file won't cause `mvnrand()` to work. You'll need to modify the code above to make `mvg` work. I will add some code to help.2012-04-09
  • 0
    "Number of samples" is an incorrect usage. "Number of observations in the sample" or "size of the sample" or "sample size" is correct.2012-04-09
  • 0
    No, I disagree. It is extremely common in the stochastic sampling literature to use the term "random sample" to refer to a draw from a random variable.2012-04-09
  • 0
    Thanks again. Did you get a result with the above code? I got the following error: "??? Error using ==> plus Matrix dimensions must agree. Error in ==> mvg at 50 y = R'*randn(m,N) + repmat(mu,1,N);" 2012-04-10
  • 0
    @Reza, with what arguments did you call the function? That error would suggest that you are not using a properly sized mean vector or covariance matrix. Note that your covariance is 2x2, so you have to draw your samples for each different value of $m_i$ separately. You can't feed a 4-vector of the means yet have only a 2x2 covariance. Perhaps edit the original post to show the arguments you're calling it with so I can check for the error.2012-04-10
  • 0
    @EMS In your first answer, you misspelled mvnrnd as "mvnrand", and I just copied and pasted your code! I wish you had tested your code before posting it as an answer. Thanks. 2012-04-12
  • 0
    Thanks @Reza for pointing out the typo. Though I know Matlab pretty well from my first job, I don't use it anymore, so sometimes I am a little rusty. I mostly boycott Matlab in favor of the open source options, R and Python with NumPy/SciPy and third-party libraries. However, you should never merely copy and paste an answer, and it should be easy with Matlab's help options to correct for typos.2012-04-12
  • 0
    Also I should mention that MATLAB error messages are really misleading.2012-04-12
  • 0
    Yeah, that is true. It's unfortunate given how much they charge that they don't provide more descriptive error handling ability for the built-ins. Feeling out these things comes with time though.2012-04-12
1

The finite-dimensional version of the spectral theorem implies that a symmetric nonnegative-definite matrix $\Sigma$ with real entries has a symmetric nonnegative-definite square root $\Sigma^{1/2}$. Say the variance (or "covariance matrix", if you like) is a $2\times2$ matrix $$ \Sigma = \begin{bmatrix} \sigma^2 & \rho\sigma\tau \\ \rho\sigma\tau & \tau^2 \end{bmatrix} = \begin{bmatrix} \sigma & 0 \\ 0 & \tau \end{bmatrix}\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\begin{bmatrix} \sigma & 0 \\ 0 & \tau \end{bmatrix}.\tag{1} $$ If $U$, $V$ are independent standard $1$-dimensional Gaussian random variables (thus the variance of the random vector $\begin{bmatrix} U \\ V \end{bmatrix}$ is the $2\times2$ identity matrix), then $$ \Sigma^{1/2} \begin{bmatrix} U \\ V \end{bmatrix} $$ has variance $\Sigma$. (It also has expected value $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$, and you can just add whatever vector you want as the expected value to the random vector that has the right variance and expected value $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$.)
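As a one-line check of the variance claim, using only the standard identity $\operatorname{var}(AZ) = A\operatorname{var}(Z)A^{T}$ for a fixed matrix $A$ and the symmetry of $\Sigma^{1/2}$: $$ \operatorname{var}\left(\Sigma^{1/2}\begin{bmatrix} U \\ V \end{bmatrix}\right) = \Sigma^{1/2}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\left(\Sigma^{1/2}\right)^{T} = \Sigma^{1/2}\,\Sigma^{1/2} = \Sigma. $$ Adding a constant vector shifts the expected value and leaves this variance unchanged.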

A sample of $200$ from this $2$-dimensional Gaussian distribution will on average have a sample mean and a sample variance equal to this given population mean and the given population variance respectively, but with probability $1$, the sample mean and sample variance will differ from those. But with similar methods, one can get the sample mean and sample variance exactly equal to prescribed values. The part of doing that that I haven't yet described here is how to get the sample correlation to be exactly $0$. The way to do that is to regress the vector of $y$-values in the sample on the vector of $x$-values, then replace the $y$-values with the observed residuals.

Finding the square root of the middle matrix in $(1)$ can be done as follows: write it as $$ \alpha\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + \beta \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} $$ then use the fact that these latter two matrices are symmetric and, once each is divided by $2$, idempotent, with product zero.
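Carrying that hint through as a worked check (the names $P$, $Q$, $a$, $b$ are introduced here only for illustration): put $P = \tfrac12\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$ and $Q = \tfrac12\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$, so that $P^2 = P$, $Q^2 = Q$ and $PQ = QP = 0$. Then $$ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} = (1+\rho)P + (1-\rho)Q, \qquad \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{1/2} = \sqrt{1+\rho}\,P + \sqrt{1-\rho}\,Q = \frac12\begin{bmatrix} a+b & a-b \\ a-b & a+b \end{bmatrix}, $$ where $a = \sqrt{1+\rho}$ and $b = \sqrt{1-\rho}$. Squaring the last matrix gives diagonal entries $\tfrac14\left[(a+b)^2 + (a-b)^2\right] = \tfrac12(a^2+b^2) = 1$ and off-diagonal entries $\tfrac12(a+b)(a-b) = \tfrac12(a^2-b^2) = \rho$, as required. Multiplying this square root on the left by $\begin{bmatrix} \sigma & 0 \\ 0 & \tau \end{bmatrix}$ gives a matrix $A$ with $AA^{T} = \Sigma$. In general $A$ is not the symmetric square root $\Sigma^{1/2}$, but $AA^{T} = \Sigma$ is all that is needed for $A\begin{bmatrix} U \\ V \end{bmatrix}$ to have variance $\Sigma$.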

I don't know the matlab commands, though.
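For readers who want to try it, the exact-moment recipe described above (center the draws, regress the $y$-values on the $x$-values and keep the residuals, rescale, then apply $\Sigma^{1/2}$) might be sketched in base MATLAB as follows. This is an illustration under assumptions, not part of the original answer: the mean and covariance are borrowed from the other answer's example, the variable names are made up, and sample variances use the usual $n-1$ normalization (the default of std and cov).

```matlab
% Sketch: a sample of 200 whose sample mean and sample covariance
% equal mu and Sigma exactly (up to floating-point rounding).
n     = 200;
mu    = [-3 0];                         % assumed target mean
Sigma = [0.5 0.05; 0.05 0.5];           % assumed target covariance

Z = randn(n, 2);                        % raw standard normal draws
Z = Z - repmat(mean(Z, 1), n, 1);       % center: sample mean is now zero

% Regress column 2 on column 1 and keep the residuals,
% which makes the sample correlation exactly zero.
slope  = (Z(:,1)' * Z(:,2)) / (Z(:,1)' * Z(:,1));
Z(:,2) = Z(:,2) - slope * Z(:,1);

Z = Z ./ repmat(std(Z, 0, 1), n, 1);    % rescale: sample covariance is now the identity

S = sqrtm(Sigma);                       % symmetric square root of Sigma
X = Z * S + repmat(mu, n, 1);           % rows now have the prescribed sample moments
```

After this, mean(X, 1) and cov(X) should reproduce mu and Sigma to within rounding error, whereas plain random draws would only match them on average.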

  • 0
    Note that this is precisely what the `mvg()` and `mvnrnd()` functions do, and can be stated more simply by appealing to the Cholesky decomposition of the covariance. I don't think the question has any relevance to population means; after all, the poster says the need is to draw actual random samples. 2012-04-09
  • 0
    Population means are relevant to the question, because the question explicitly says there are to be four samples, each with a prescribed population mean.2012-04-10
  • 0
    That's misleading. I'm not familiar with anyone in the applied statistics literature who would refer to the means of generated basic random variables as population means. Unless those parameters were estimated from data, I think this would be extremely uncommon. 'Population' is almost always used to refer to the superset of individuals from which your observations have come, and it is not used to refer to random values that you have yourself generated. Of course there will be occasional places where this convention doesn't apply, but using 'population mean' here is not useful.2012-04-10
  • 0
    I was NOT using it to refer to the mean of the set of observations, but rather to a superset, i.e. to the expected values. I suggest you are confused.2012-04-10
  • 0
    I disagree. The superset that I'm referring to is the idealized population that something is drawn from. In this case, the specified normal distributions would *not* qualify as a population description. A population description is something like saying that my observations of wage data come from an infinite population of American citizens, and *then* I will place an assumed distribution over the observations, *after* specifying the larger population.2012-04-10
  • 0
    In a case like this, referring to the normal variable that you're drawing from as a 'population' is merely pedantic and totally not necessary. And referring to the draws from such a distribution as samples is totally standard. See, e.g. Jun Liu's standard book [Monte Carlo Strategies in Scientific Computing](http://www.amazon.com/Monte-Carlo-Strategies-Scientific-Computing/dp/0387952306), where draws from any generic random variable are consistently referred to as 'random samples', and 'population' is reserved for identifying assumptions *beyond* functional forms.2012-04-10
  • 0
    @EMS : I was distinguishing between the expected value and the sample average. That is a distinction that had to be drawn. I'm sorry it upset you, but I'm not sorry I did it.2012-04-11
  • 0
    @EMS : "I'm not familiar with anyone in the applied statistics literature who would refer to the means of generated basic random variables as population means." I did not do so. I used the term "population mean" precisely for the purpose of making it clear that I DID NOT mean the "means of generated basic random variables".2012-04-11
  • 0
    I think we're talking past each other. When I said "the means of generated basic random variables," I meant for that to refer to the terms $m_{1}$ through $m_{4}$ above. I did not ever think you were referring to generating a bunch of draws and then taking the mean of those draws (which I would call the sample mean). I am disputing that the terms $m_{i}$ should be called 'population means'... they should not because they are not given by constructing a population of anything; they are just parameters of some random variables with functional forms.2012-04-11
  • 0
    In order for me to agree that the $m_{i}$ should be called by the term 'population mean', they would need to signify the difference between the set of things you wish to simulate draws from and some *even bigger* set of things of which they form a sub-population. This problem has absolutely no context for treating the random draws as if they were a sub population of some bigger world of 2-dimensional vectors, hence using the term 'population' is totally pedantic and not at all needed here. Calling a single draw from a normal by the name 'a single sample' is just fine and clear from context.2012-04-11
  • 0
    If I say that $X_1,\ldots,X_n \sim N(\mu,\sigma^2)$, then $(X_1+\cdots+X_n)/n$ is the "sample mean" and $\mu$ is the "population mean". And it is indeed the mean of a bigger set $\{X_1(\omega) : \omega\in\Omega\}$, where $\Omega$ is the probability space that is the domain of $X_1,\ldots,X_n$.2012-04-11
  • 0
    Right, and that is precisely what I'm disputing. Saying that $\mu$ should be called the 'population' mean in that setup is purely pedantic and unproductive. I would refer to a realized value of $X_{1}$ as a single sample, and also refer to the collection of realized values $\{X_{1},\cdots , X_{n}\}$ as *the* sample. And this would jibe just fine with almost all modern books that cover sampling. If there were reason to think the normal draws related to some higher level semantic description of a problem, *then* making a careful distinction between 'population' and 'sample' *might* be warranted. 2012-04-11
  • 0
    And also, no serious statistics book refers to the $\sigma$-algebra / filtration / probability space set-up when expressing the difference between a situation where you need to identify a population apart from its super-population and one where you do not. That is, claiming that the whole ambient probability space counts as a super-population is totally disingenuous. No one would do that, it would literally convey zero information to the reader about why some certain set of random variables is being indicated as the population of interest.2012-04-11
  • 0
    I would also call $\{X_1\}$ a sample. But it's not the same thing as $\mu=\mathbb{E}(X_1)$, which is the population mean.2012-04-11
  • 0
    As long as it's clear that $\mu$ is only the population mean when there's a real context in which to call the superset a population, and that there is no such context in this particular elementary question that deals only with generating values from a basic, contextless random variable, thus making it totally and solely a pedantic issue to question whether the vocabulary in my answer was appropriate. Also, your earlier comment disagrees with you now, because you earlier said that "number of samples" is inappropriate usage, but now agree that calling a draw from $X_{1}$ 'a sample' is fine. 2012-04-11
  • 0
    "Number of samples" is incorrect if you mean $n$ is the number of samples when $X_1,\ldots,X_n$ are observed. Together they make up _one_ sample. But one can also have a sample of size $1$.2012-04-11
  • 0
    No, "number of samples" is just fine for referring to $n$ in your notation, and most standard texts use "number of samples" exactly that way, and interchangeably with the size of the set of sets of observations whenever the context implies that the language of observations makes sense, which is not the case in this question, hence the uselessness of your continued pedantry.2012-04-11
  • 0
    WHICH "standard texts" say that?2012-04-11
  • 0
    As already linked, Jun Liu's book on Monte Carlo methods uses notation that way, and just yesterday I was reading the text "Bayesian Data Analysis" by Gelman, Carlin, Rubin, and Stern, and chapter 7 definitely mixes and matches the use of 'population' and 'number of samples' as I have suggested. I agree with what you're saying when it is in the context of a specific set-up of a model, but not when authors are just discussing generic random variables that have no model context.2012-04-11
  • 0
    I don't have that book but on google books I find some occurrences of the phrases "small sample" and "large sample" within that book. What would those terms mean if they're not following the usage I'm describing here?2012-04-12
  • 0
    When the context of the problem obviously includes modeling decisions, and isn't just about generic random variables, then using the phrase like that is appropriate, as I've already said. I'm not surprised that it's used that way in some cases. My point is that it's by no means necessary to use that vocabulary in all cases when draws from a random variable are involved, and it would just be nitpicky and not constructive to insist on always using that word convention when authors often do not adhere to it.2012-04-12