1. Working with Dense Data¶
import seaborn as sns
import matplotlib.ticker as ticker
from neighbors import (
    NNMF_sgd,
    estimate_performance,
    load_toymat,
)

def plot_mat(mat):
    """Quick helper function to nicely plot a user x item matrix"""
    ax = sns.heatmap(mat, cmap="Blues", vmin=1, vmax=100)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.yaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.yaxis.set_major_formatter(ticker.ScalarFormatter())
All toolbox algorithms operate on 2D pandas DataFrames with rows as unique users and columns as unique items. Models distinguish between two kinds of datasets (a small sketch of each follows the list below):
- Dense data, in which all users rated all items. Such datasets are useful for estimating the performance of an algorithm by testing how well some percentage of ratings can be masked out and then recovered via prediction. This is useful for benchmarking model performance and simulating situations with datasets of varying sparsity. Conceptually this is equivalent to supervised learning, where we make predictions with knowledge of the "correct answers" that can be used to compute model performance.
- Sparse data, in which some user-item ratings were never observed. This is the primary intended use case of the toolbox. A model can be trained on the observed ratings using various collaborative filtering algorithms to generate predictions for these missing ("unobserved") ratings.
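To make the distinction concrete, here's a minimal sketch in plain numpy/pandas (not toolbox code; the labels and values are made up for illustration) of a tiny dense matrix and a sparse version of the same matrix:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Dense data: every user rated every item (3 users x 4 items, ratings 1-100)
dense = pd.DataFrame(
    rng.integers(1, 101, size=(3, 4)),
    index=pd.Index(range(3), name="User"),
    columns=pd.Index(range(4), name="Item"),
)

# Sparse data: some user-item ratings were never observed (NaN)
sparse = dense.astype(float)
sparse.iloc[0, 2] = np.nan
sparse.iloc[2, 0] = np.nan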
In this tutorial we'll demonstrate basic toolbox features on dense data. The load_toymat function can be used to generate some sample data for our purposes. Let's generate a dataset in which each of 50 users rated 50 items on a scale from 1-100.
Note: the numbers chosen are just for illustrative purposes, and the number of users and items doesn't have to be equal.
toy_data = load_toymat(users=50, items=50, random_state=0)
plot_mat(toy_data)
Fitting a model¶
Fitting a model works similarly to libraries like sklearn. You just need to initialize a model object and call its .fit method. When working with dense data, i.e. every user rated every item, we need to initialize the model with a mask or a value between 0 and 1 that indicates what proportion of the observed data should be treated as "missing." This allows us to simulate a situation in which we hadn't observed these ratings at all.
Using n_mask_items we can mask out 25% of the ratings and retain 75%. Notice how some user-item combinations are now set to NaN.
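To build intuition about what this masking does, here's a minimal sketch in plain numpy/pandas (not how the toolbox implements it internally) of hiding a random 25% of a dense matrix:
import numpy as np

# Boolean array: True = keep the rating, False = hide it from the model
rng = np.random.default_rng(0)
keep = rng.random(toy_data.shape) > 0.25
masked_by_hand = toy_data.where(keep)  # hidden cells become NaN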
model = NNMF_sgd(toy_data, n_mask_items=.25, random_state=0)
# Take a look at the first 10 users x items after masking
model.masked_data.iloc[:10,:10]
Item | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
User | ||||||||||
0 | 27.440675 | 35.759468 | 30.138169 | NaN | 21.182740 | 37.294706 | NaN | 44.588650 | 48.183138 | 19.172076 |
1 | 28.509839 | 21.930076 | 49.418692 | 5.102241 | 10.443838 | 13.065476 | 32.655416 | NaN | 23.315539 | 12.221280 |
2 | 33.890827 | 13.500399 | 36.759701 | NaN | 12.437657 | 33.807867 | NaN | 28.612595 | 11.154082 | 47.637451 |
3 | 7.472415 | 43.406303 | NaN | 30.777978 | 6.190999 | NaN | 40.365948 | 28.455037 | 20.359165 | 3.458350 |
4 | 15.589794 | 34.817174 | NaN | 8.980184 | NaN | 8.362482 | 33.969639 | 22.684842 | 26.828961 | NaN |
5 | 17.780637 | NaN | 38.266263 | 37.433181 | NaN | 9.171122 | 27.609623 | NaN | 48.096819 | NaN |
6 | 45.327775 | 38.702367 | 16.657258 | 4.055069 | NaN | 16.611707 | NaN | 2.671359 | 36.279718 | 0.571373 |
7 | 32.278512 | NaN | 21.520122 | NaN | 26.808875 | NaN | 13.879805 | 6.443028 | NaN | NaN |
8 | NaN | 46.464571 | NaN | 47.265077 | NaN | 27.708120 | NaN | NaN | NaN | NaN |
9 | 14.827813 | 49.600562 | 12.471002 | NaN | 47.547631 | 16.671013 | 34.488413 | 2.917818 | 36.535455 | 44.086011 |
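As a quick sanity check, plain pandas on the masked_data attribute shown above confirms that roughly 25% of the cells are now missing:
# Fraction of masked (NaN) cells across the whole matrix; should be close to 0.25
model.masked_data.isna().mean().mean()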
Now we can try to predict these missing ratings by fitting the model and plotting its predictions.
The left matrix is the input data after masking. The middle is the model's predictions for the missing ratings plus the ratings we did observe. The right is a scatter plot of model predictions for the missing ratings vs the true values of these ratings.
For convenience the plot title contains the RMSE and correlation of the missing ratings (averaged across users to account for user-level clustering). RMSE is interpretable as the average misprediction on the same scale as the original ratings, in this case 1-100.
model.fit()
model.plot_predictions();
To retrieve the matrix containing the model predictions we can use the .transform method. By default this will return a matrix containing the original ratings for values that were observed and predictions for values that were missing (i.e. masked out). To return model predictions for the observed values as well, i.e. not passing those values forward, set return_only_predictions=True (a short example follows the table below).
Now the masked out ratings have been replaced with model predictions:
predictions = model.transform()
# Take a look at the first 10 users x items after filling in predictions
predictions.iloc[:10,:10]
Item | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
User | ||||||||||
0 | 27.440675 | 35.759468 | 30.138169 | 36.660519 | 21.182740 | 37.294706 | 56.100570 | 44.588650 | 48.183138 | 19.172076 |
1 | 28.509839 | 21.930076 | 49.418692 | 5.102241 | 10.443838 | 13.065476 | 32.655416 | 21.191779 | 23.315539 | 12.221280 |
2 | 33.890827 | 13.500399 | 36.759701 | 23.337345 | 12.437657 | 33.807867 | 30.578051 | 28.612595 | 11.154082 | 47.637451 |
3 | 7.472415 | 43.406303 | 15.179831 | 30.777978 | 6.190999 | 38.134221 | 40.365948 | 28.455037 | 20.359165 | 3.458350 |
4 | 15.589794 | 34.817174 | 5.257692 | 8.980184 | 55.577116 | 8.362482 | 33.969639 | 22.684842 | 26.828961 | 25.627142 |
5 | 17.780637 | 41.777774 | 38.266263 | 37.433181 | 61.015818 | 9.171122 | 27.609623 | 40.282928 | 48.096819 | 36.694523 |
6 | 45.327775 | 38.702367 | 16.657258 | 4.055069 | 26.051788 | 16.611707 | 23.999081 | 2.671359 | 36.279718 | 0.571373 |
7 | 32.278512 | 13.093412 | 21.520122 | 31.101120 | 26.808875 | 23.499822 | 13.879805 | 6.443028 | 17.163445 | 36.159055 |
8 | 18.240560 | 46.464571 | 14.868483 | 47.265077 | 25.477137 | 27.708120 | 29.443827 | 28.951392 | 41.470059 | 32.561340 |
9 | 14.827813 | 49.600562 | 12.471002 | 12.572948 | 47.547631 | 16.671013 | 34.488413 | 2.917818 | 36.535455 | 44.086011 |
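As mentioned above, you can instead ask for model predictions everywhere, including the cells whose true ratings were observed; a minimal sketch using the return_only_predictions flag described earlier:
# Predictions for every user x item cell, ignoring the observed ratings
all_predictions = model.transform(return_only_predictions=True)
all_predictions.iloc[:10, :10]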
For NNMF models it's easy to inspect and debug model training using the .plot_learning method. It's also possible to get more detail while fitting by passing verbose=True to .fit.
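For example, refitting the same model with verbose output prints the training progress (the exact messages depend on the library version):
# Refit the model, printing progress while training
model.fit(verbose=True)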
The plot title below also displays the final RMSE on the observed ratings during training and indicates whether the model fit converged within the number of iterations.
model.plot_learning();
Scoring a model's predictions¶
Working with dense data affords us a ground truth that can be used to assess the model's performance. We support a number of different metrics to do this (RMSE and correlation in the plots above are just two). To return a model's performance you can use the .score method. However, the .summary method may be more convenient as it returns all supported metrics along with separate scoring for both the observed ratings (model training performance) and missing ratings (model testing performance).
Additionally, metrics are scored in two different ways. user metrics below score performance separately for each user first and then average these scores. This approach is more common in the social sciences, where observations are treated as "clustered" by user. all simply scores all ratings together, ignoring the fact that multiple ratings come from each user. This method is more common in machine learning, where overall model performance is of primary interest. (A small sketch of the two approaches follows the table below.)
model.summary(verbose=True)
User performance results (not returned) are accessible using .user_results
Overall performance results (returned) are accessible using .overall_results
algorithm | dataset | group | metric | score | |
---|---|---|---|---|---|
0 | NNMF_sgd | missing | all | correlation | 0.360521 |
1 | NNMF_sgd | missing | all | mae | 16.760530 |
2 | NNMF_sgd | missing | all | mse | 441.263698 |
3 | NNMF_sgd | missing | all | rmse | 21.006278 |
4 | NNMF_sgd | missing | user | correlation | 0.297074 |
5 | NNMF_sgd | missing | user | mae | 16.760530 |
6 | NNMF_sgd | missing | user | mse | 441.263698 |
7 | NNMF_sgd | missing | user | rmse | 20.445180 |
8 | NNMF_sgd | observed | all | correlation | 0.995488 |
9 | NNMF_sgd | observed | all | mae | 1.519559 |
10 | NNMF_sgd | observed | all | mse | 4.655738 |
11 | NNMF_sgd | observed | all | rmse | 2.157716 |
12 | NNMF_sgd | observed | user | correlation | 0.994861 |
13 | NNMF_sgd | observed | user | mae | 1.519559 |
14 | NNMF_sgd | observed | user | mse | 4.655738 |
15 | NNMF_sgd | observed | user | rmse | 2.081369 |
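To make the user vs all distinction concrete, here's a minimal pandas sketch of the two ways of pooling a correlation score, using hypothetical truth and pred DataFrames with the same user x item layout as above (illustrative code, not the toolbox's internal implementation):
import numpy as np
import pandas as pd

# Hypothetical ground-truth ratings and noisy predictions (50 users x 50 items)
rng = np.random.default_rng(0)
truth = pd.DataFrame(rng.integers(1, 101, size=(50, 50)).astype(float))
pred = truth + rng.normal(0, 10, size=truth.shape)

# "user": correlate each user's predictions with their true ratings, then average
user_score = truth.corrwith(pred, axis=1).mean()

# "all": pool every rating into one long vector and compute a single correlation
all_score = np.corrcoef(truth.values.ravel(), pred.values.ravel())[0, 1]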
Benchmarking a model's performance¶
The performance above is specific to the exact ratings we masked out. But how does the model perform in general when 25% of the data is missing?
While we could repeat the procedure above for different random masks of the same size, doing so by hand is a bit tedious. Fortunately, the estimate_performance function is designed exactly for this purpose. Just pass it a model class (not a model object), some data, and the amount of masking, and it will repeatedly refit the model with new random masks and return the average performance across all iterations. This is functionally equivalent to randomized cross-validation, where the size of the training and testing splits is controlled via the n_mask_items argument. In the example below, masking 25% of the data is analogous to 4-fold cross-validation, where training is done using 3 folds and testing is performed on the held-out fold.
overall_results, user_results = estimate_performance(
NNMF_sgd, toy_data, n_iter=10, n_mask_items=.25
)
overall_results
Data sparsity is 0.0%. Using random masking...
algorithm | dataset | group | metric | mean | std | |
---|---|---|---|---|---|---|
0 | NNMF_sgd | missing | all | correlation | 0.361321 | 0.039224 |
1 | NNMF_sgd | missing | all | mae | 17.722848 | 0.532381 |
2 | NNMF_sgd | missing | all | mse | 493.397942 | 34.549618 |
3 | NNMF_sgd | missing | all | rmse | 22.200547 | 0.770019 |
4 | NNMF_sgd | missing | user | correlation | 0.322631 | 0.038444 |
5 | NNMF_sgd | missing | user | mae | 17.722848 | 0.532381 |
6 | NNMF_sgd | missing | user | mse | 493.397942 | 34.549618 |
7 | NNMF_sgd | missing | user | rmse | 21.695961 | 0.672878 |
We can also check whether predictive performance varied by user, to identify users that were particularly difficult to generate predictions for.
user_results.head()
rmse_missing | mse_missing | mae_missing | correlation_missing | |
---|---|---|---|---|
user | ||||
0 | 19.988851 | 408.599276 | 16.442676 | 0.306068 |
1 | 25.840342 | 748.642060 | 20.213965 | 0.192995 |
2 | 18.997861 | 375.325530 | 15.596407 | 0.482213 |
3 | 21.942705 | 499.427873 | 17.603425 | 0.271063 |
4 | 22.217112 | 506.409075 | 18.474751 | 0.358182 |
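For example, to surface the hardest users you can sort this DataFrame by one of its error columns (plain pandas, using the rmse_missing column shown above):
# Users with the largest average prediction error on the masked ratings
user_results.sort_values("rmse_missing", ascending=False).head()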
By default estimate_performance only returns performance on the missing data. To see performance on all subsets use return_full_performance=True. You can also use return_agg=False if you want to see performance for each iteration separately rather than the mean and std across all iterations.
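For instance, a sketch combining the two options just described (argument names as given above; the shape of the returned results may differ from the aggregated case):
# Score observed and missing ratings, and keep each iteration's results separately
results = estimate_performance(
    NNMF_sgd,
    toy_data,
    n_iter=10,
    n_mask_items=.25,
    return_full_performance=True,  # also score the observed (training) ratings
    return_agg=False,              # per-iteration rows instead of mean/std
)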
Summary¶
Those are the basics of working with models and dense data. The next tutorial illustrates the primary use case for the toolbox: working with sparse data.