Generating data with random Gaussian noise

I recently needed to generate data for y as a function of x, with added Gaussian noise. Synthetic data like this comes in handy when you want a known underlying regularity for an algorithm to discover, for example when testing different machine learning algorithms.

What I wanted was a mechanism that lets me specify a range for x and then generate data using

y = f(x) + \epsilon

with the ability to control the function f(x) and the parameters of the Gaussian noise \epsilon.
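Written out, the model looks as follows. This is a sketch that assumes the noise is drawn independently for each x and that \sigma denotes the standard deviation (not the variance):

```latex
\epsilon \sim \mathcal{N}(\mu, \sigma^2), \qquad
\mathbb{E}[y \mid x] = f(x) + \mu, \qquad
\operatorname{Var}[y \mid x] = \sigma^2
```

With \mu = 0 the points scatter symmetrically around f(x); a non-zero \mu shifts the whole cloud up or down by that amount.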

I came up with this simple function, which allows me to specify f(x), the x interval and step, and the parameters of the Gaussian distribution (mean \mu and standard deviation \sigma).

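import numpy as np

def corr_vars(start=-10, stop=10, step=0.5, mu=0, sigma=3, func=lambda x: x):
    # Generate x over the half-open interval [start, stop)
    x = np.arange(start, stop, step)
    # Generate one Gaussian noise value per x (mean mu, standard deviation sigma)
    e = np.random.normal(mu, sigma, x.size)
    # Generate y values as y = func(x) + e, one element at a time,
    # so func only needs to accept scalars
    y = np.zeros(x.size)
    for ind in range(x.size):
        y[ind] = func(x[ind]) + e[ind]
    return (x, y)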
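If f(x) can operate on whole NumPy arrays, the loop can be dropped, and a seeded generator makes the noise reproducible between runs. The following is only a sketch of that idea: the name corr_vars_vectorized and the seed parameter are hypothetical additions, not part of the original function.

```python
import numpy as np

def corr_vars_vectorized(start=-10, stop=10, step=0.5, mu=0, sigma=3,
                         func=lambda x: x, seed=None):
    """Hypothetical variant of corr_vars; assumes func accepts NumPy arrays."""
    # A local generator keeps the noise reproducible when a seed is given
    rng = np.random.default_rng(seed)
    x = np.arange(start, stop, step)
    e = rng.normal(mu, sigma, x.size)
    # Apply func to the whole array at once instead of looping
    y = func(x) + e
    return x, y

# Same seed, same noise: the two calls return identical y values
xa, ya = corr_vars_vectorized(func=np.sin, seed=42)
xb, yb = corr_vars_vectorized(func=np.sin, seed=42)
print(np.array_equal(ya, yb))
```

The trade-off is that a func written for scalars only (for example one built on math.sin) will not work here without wrapping it, which is what the element-wise loop in the original avoids.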

Here are two examples of using the function to generate data sets: one using y = x + \epsilon, the other using y = 2 * \pi * sin(x) + \epsilon.


import numpy as np
import matplotlib.pyplot as plt

# Generate the two data sets
(x0, y0) = corr_vars(sigma=3)
(x1, y1) = corr_vars(sigma=3, func=lambda x: 2*np.pi*np.sin(x))

f, axarr = plt.subplots(2, sharex=True, figsize=(7, 7))

# Noisy data as a scatter plot, noiseless function as a red line
axarr[0].scatter(x0, y0)
axarr[0].plot(x0, x0, color='r')
axarr[0].set_title('y = x + e')

axarr[1].scatter(x1, y1)
axarr[1].plot(x1, 2*np.pi*np.sin(x1), color='r')
axarr[1].set_title('y = 2*π*sin(x) + e')

plt.show()

The snippet above plots the resulting data sets, together with the noiseless function (in red) for comparison.

Figure: Two plots of correlated variables with Gaussian noise.
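Beyond eyeballing the plots, one way to sanity-check generated data is to subtract the noiseless function and confirm that the residuals have roughly the requested mean and standard deviation. The sketch below is self-contained rather than reusing the function above; the seed and the much finer step are assumptions chosen only to get enough samples for stable estimates.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
mu, sigma = 0.0, 3.0

# Finer step than in the examples above, to get many samples
x = np.arange(-10, 10, 0.001)
y = 2 * np.pi * np.sin(x) + rng.normal(mu, sigma, x.size)

# Residuals are the noise term: y minus the noiseless function
residuals = y - 2 * np.pi * np.sin(x)

# Sample estimates should land close to mu and sigma
print(residuals.mean(), residuals.std(ddof=1))
```

With only 40 points, as in the plotted examples, the same estimates would be far noisier, so a visible gap between the sample statistics and the requested parameters is expected at that size.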

The full source code is available on GitHub.