Collaborative filtering

Collaborative filtering (CF) is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data, including sensing and monitoring data (such as in mineral exploration or environmental sensing over large areas or with multiple sensors), financial data (such as in financial service institutions that integrate many financial sources), and user data in electronic commerce and Web 2.0 applications. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.

For user data, collaborative filtering is the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future. For example, a collaborative filtering or recommendation system for television tastes could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes). Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes.

Methodology
Collaborative filtering systems take many forms, but many common systems can be reduced to two steps:
 * 1) Look for users who share the same rating patterns with the active user (the user whom the prediction is for).
 * 2) Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user.
This falls under the category of user-based collaborative filtering. A specific application of this is the user-based Nearest Neighbor algorithm.
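
The two user-based steps above can be sketched as follows. This is only a minimal illustration, not a reference implementation: the toy ratings, the function names, the choice of cosine similarity, and the neighbourhood size k are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"show_a": 5, "show_b": 3, "show_c": 4},
    "bob":   {"show_a": 5, "show_b": 3, "show_c": 4, "show_d": 5},
    "carol": {"show_a": 1, "show_b": 5, "show_d": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(active, item, ratings, k=2):
    """Predict the active user's rating for `item` from the k nearest neighbours."""
    # Step 1: rank the other users who rated `item` by similarity to the active user.
    neighbours = sorted(
        ((cosine_similarity(ratings[active], r), u)
         for u, r in ratings.items() if u != active and item in r),
        reverse=True,
    )[:k]
    # Step 2: similarity-weighted average of the neighbours' ratings for `item`.
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbours, so no prediction
    return sum(sim * ratings[u][item] for sim, u in neighbours) / total

print(predict("alice", "show_d", ratings))
```

In this toy data the prediction for alice is pulled towards bob's rating, because bob's ratings on the shared shows match hers exactly while carol's do not.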

Alternatively, item-based collaborative filtering, popularized by Amazon.com (users who bought x also bought y) and first proposed in the context of rating-based collaborative filtering by Vucetic and Obradovic in 2000, proceeds in an item-centric manner:
 * 1) Build an item-item matrix determining relationships between pairs of items.
 * 2) Using the matrix and the data on the current user, infer the user's taste.
See, for example, the Slope One item-based collaborative filtering family.
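
As one possible illustration of the two item-based steps, the sketch below follows the weighted Slope One scheme: the item-item matrix holds the average rating difference between each pair of items, together with the number of users who rated both. The data, the function names, and the dictionary-based matrix layout are assumptions chosen for brevity.

```python
from collections import defaultdict

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}

def build_item_item(ratings):
    """Step 1: for each ordered item pair (j, i), store the mean of
    (rating of j - rating of i) and the number of users who rated both."""
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for user_ratings in ratings.values():
        for j, r_j in user_ratings.items():
            for i, r_i in user_ratings.items():
                if i != j:
                    diff_sum[(j, i)] += r_j - r_i
                    count[(j, i)] += 1
    dev = {pair: diff_sum[pair] / n for pair, n in count.items()}
    return dev, dict(count)

def predict(user_ratings, target, dev, count):
    """Step 2: combine the user's own ratings with the item-item deviations,
    weighting each item pair by how many users co-rated it."""
    num, den = 0.0, 0
    for i, r_i in user_ratings.items():
        n = count.get((target, i), 0)
        if i != target and n:
            num += (dev[(target, i)] + r_i) * n
            den += n
    return num / den if den else None

dev, count = build_item_item(ratings)
print(predict(ratings["lucy"], "A", dev, count))
```

The matrix is built once from all users and can then be reused for every prediction, which is the main practical appeal of the item-centric formulation.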

Another form of collaborative filtering can be based on implicit observations of normal user behavior (as opposed to the artificial behavior imposed by a rating task). These systems observe what a user has done together with what all users have done (what music they have listened to, what items they have bought) and use that data to predict the user's future behavior, or to predict how the user might like to behave given the chance. These predictions then have to be filtered through business logic to determine how they should affect what a business system does. It is, for instance, not useful to offer to sell somebody music that they have already demonstrated they own, or to suggest more travel guides for Paris to someone who has already bought a travel guide for that city.
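
A rough sketch of this idea, under several assumptions: purchase histories are plain sets, users are compared by how many purchases they share, and the only business rule applied is "do not recommend what the user already owns". The data and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical implicit data: user -> set of items bought (no explicit ratings).
purchases = {
    "u1": {"paris_guide", "phrasebook", "travel_adapter"},
    "u2": {"paris_guide", "phrasebook"},
    "u3": {"paris_guide", "rome_guide"},
    "u4": {"phrasebook", "travel_adapter"},
}

def recommend(user, purchases, top_n=3):
    """Score items by the behaviour of users whose purchases overlap with `user`."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(owned & items)  # how much behaviour the two users share
        for item in items:
            scores[item] += overlap
    # Business rule: never suggest something the user has already bought.
    for item in owned:
        scores.pop(item, None)
    return [item for item, score in scores.most_common(top_n) if score > 0]

print(recommend("u3", purchases))
```

A real system would typically apply further rules at this stage, for example suppressing whole categories the user has already covered rather than only the exact items owned.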

In the age of information explosion such techniques can prove very useful, as the number of items in a single category (such as music, movies, books, news, web pages) has become so large that a single person cannot possibly view them all in order to select relevant ones. Relying on a scoring or rating system which is averaged across all users ignores the specific demands of a user, and is particularly poor in tasks where there is large variation in interest, for example in the recommendation of music. However, there are other methods to combat information explosion, for example web search and data clustering.

History
Collaborative filtering stems from the earlier system of information filtering, where relevant information is brought to the attention of the user by observing patterns in previous behaviour and building a user profile. This system was essentially unable to help with exploration of the web and suffered from the cold-start problem: new users had to build up tendencies before the filtering was effective.

The first system to use collaborative filtering was the Information Tapestry project at Xerox PARC. This system allowed users to find documents based on previous comments by other users. There were many problems with this system: it only worked for small groups of people and had to be accessed through word-specific queries, which largely defeated the purpose of collaborative filtering.

The first system with proven results was the Bellcore Video Recommender.

USENET Net news extended collaborative filtering to a mass audience while offering a simpler method for accessing articles. The system allowed users to rate material based on popularity, which then allowed other users to search for articles based on these ratings.

One of the largest early collaborative filtering services for music recommendations widely available on the World Wide Web was Firefly, which evolved from early MIT Media Lab research projects. Firefly was bought by Microsoft in 1998. The service itself was closed down in 1999 with much of its technology and staff helping to create Microsoft Passport.

Memory-Based
This mechanism uses user rating data to compute the similarity between users or items, which is then used for making recommendations. It was the earlier mechanism and is used in many commercial systems. It is easy to implement and is effective. Typical examples of this mechanism are neighborhood-based CF and item-based/user-based top-N recommendations.

The neighborhood-based algorithm calculates the similarity between two users or items and produces a prediction for the user by taking the weighted average of all the ratings. Similarity computation between items or users is an important part of this approach. Multiple measures, such as Pearson correlation and vector cosine similarity, are used for this.
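
The two similarity measures named above can be written down in a few lines. The sketch below assumes each user is represented as a dictionary from item to rating and that similarity is computed only over co-rated items; both choices, and the function names, are assumptions for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over co-rated items: compares deviations from each
    user's mean, so a consistently generous rater and a consistently strict
    rater with the same preferences still correlate perfectly."""
    common = set(a) & set(b)
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / n
    mean_b = sum(b[i] for i in common) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant rater carries no correlation signal
    return cov / sqrt(var_a * var_b)

def cosine(a, b):
    """Cosine of the angle between the two users' co-rated rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

strict = {"x": 1, "y": 2, "z": 3}
generous = {"x": 3, "y": 4, "z": 5}   # same ordering, shifted up by two
opposite = {"x": 3, "y": 2, "z": 1}   # reversed ordering
print(pearson(strict, generous), cosine(strict, generous))
print(pearson(strict, opposite), cosine(strict, opposite))
```

Because all ratings are positive, the cosine measure stays positive even for users with opposite preferences, whereas Pearson correlation becomes negative; this is one reason mean-centred measures are often preferred for rating data.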

The user-based top-N recommendation algorithm identifies the k most similar users to an active user using a similarity-based vector model. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. A popular method for finding similar users is locality-sensitive hashing, which implements the nearest-neighbor mechanism in linear time.
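
To illustrate how locality-sensitive hashing can narrow down candidate neighbours, the sketch below uses the random-hyperplane variant, which targets cosine similarity. This is one LSH family among several and is an assumption of the example, as are the function names, the dense rating vectors, and the single hash table; practical systems usually combine several tables.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (one normal vector each)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector falls on.
    Vectors separated by a small angle agree on most bits."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vector)) >= 0 else 0
        for plane in planes
    )

def build_buckets(user_vectors, planes):
    """Hash every user once; users with equal signatures share a bucket."""
    buckets = defaultdict(list)
    for user, vec in user_vectors.items():
        buckets[signature(vec, planes)].append(user)
    return buckets

def candidates(user, user_vectors, planes, buckets):
    """Candidate neighbours are the other users in the same bucket."""
    sig = signature(user_vectors[user], planes)
    return [u for u in buckets[sig] if u != user]

# Hypothetical mean-centred rating vectors (one dimension per item).
user_vectors = {
    "a": [2.0, 1.0, -1.0, -2.0],
    "b": [4.0, 2.0, -2.0, -4.0],   # same direction as "a"
    "c": [-2.0, -1.0, 1.0, 2.0],   # opposite direction to "a"
}
planes = make_hyperplanes(num_planes=8, dim=4)
buckets = build_buckets(user_vectors, planes)
print(candidates("a", user_vectors, planes, buckets))
```

Only the users that land in the active user's bucket need an exact similarity computation afterwards, so each lookup avoids comparing against every other user.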

The advantages of this approach include the explainability of the results, which is an important aspect of recommendation systems. It is easy to create and use. New data can be added easily and incrementally. It need not consider the content of the items being recommended. The mechanism scales well with co-rated items.

There are several disadvantages with this approach. First, it depends on human ratings. Second, its performance decreases when data gets sparse, which is frequent with web-related items. This limits the scalability of the approach and causes problems with large datasets. Third, it cannot handle new users or new items.

Model-Based
Models are developed using data mining and machine learning algorithms to find patterns based on training data. These are used to make predictions for real data. There are many model-based CF algorithms. These include Bayesian networks, clustering models, latent semantic models such as singular value decomposition, probabilistic latent semantic analysis, Multiple Multiplicative Factor, latent Dirichlet allocation, and Markov decision process based models.

This approach has a more holistic goal: to uncover latent factors that explain observed ratings. Most of the models are based on creating a classification or clustering technique to identify the user based on the test set. The number of parameters can be reduced using techniques such as principal component analysis.
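
One common way to make "latent factors" concrete is low-rank matrix factorization, in which each user and each item receives a short vector and a rating is approximated by their dot product. The sketch below trains such a model with plain stochastic gradient descent; this is a stand-in technique chosen for brevity, not one of the specific models listed above, and the data, hyperparameters, and function names are assumptions.

```python
import random
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the observed ratings."""
    err = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    return sqrt(err / len(ratings))

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Learn k latent factors per user (P) and per item (Q) by SGD on
    squared error with L2 regularization."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for f in range(k):
                p_uf, q_if = P[u][f], Q[i][f]
                # Move both factors along the negative gradient of the error.
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
    return P, Q

# Hypothetical observed ratings as (user index, item index, rating) triples.
ratings = [
    (0, 0, 5), (0, 1, 3), (0, 3, 1),
    (1, 0, 4), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 4),
]
P, Q = factorize(ratings, n_users=4, n_items=4)
print("training RMSE:", rmse(ratings, P, Q))
print("predicted rating for user 1, item 1:", dot(P[1], Q[1]))
```

Once trained, the model predicts a rating for any user-item pair, including pairs never observed, which is how this family of methods copes with sparse rating matrices.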

There are several advantages with this paradigm. It handles sparsity better than memory-based approaches, which helps with scalability on large data sets. It improves the prediction performance. It gives an intuitive rationale for the recommendations.

The main disadvantage of this approach is that model building is expensive. One needs to make a tradeoff between prediction performance and scalability. One can lose useful information due to reduction models. A number of models have difficulty explaining the predictions.

Hybrid
A number of applications combine the memory-based and the model-based CF algorithms. These hybrids overcome the limitations of native CF approaches and improve the prediction performance. Importantly, they overcome CF problems such as sparsity and loss of information. However, they have increased complexity and are expensive to implement.

In commercial systems
Commercial sites that implement collaborative filtering systems include:
 * Amazon
 * Amie Street
 * Barilliance
 * Barnes and Noble
 * Baynote
 * ChoiceStream
 * Collarity
 * Digg.com
 * Directed Edge
 * eBay
 * Google News
 * Gravity R&D
 * half.ebay.com
 * Heeii
 * Hollywood Video
 * Hulu
 * iLike - music
 * Internet Movie Database - movies
 * iTunes - music
 * Last.fm - music
 * LibraryThing - books
 * Loomia - software-as-a-service provider of recommendation technologies
 * Musicmatch
 * Netflix - In order to improve its algorithm Netflix has launched a competition, the Netflix Prize.
 * Simania - Book recommendation site
 * Scarab Research
 * Strands Labs - Utilizes and commercializes its own social recommendation engine for social networks and eCommerce.
 * StumbleUpon - websites
 * Swelen - Swelen leverages Collaborative Filtering methods based on users' satisfaction to improve its advertising platform.
 * Threadless - T-shirt
 * TiVo
 * Yelp
 * Ramkol - Sophisticated recommendation for local search in Israel

In non-commercial systems
Non-commercial sites that implement collaborative filtering systems include:

Software libraries
The following software libraries allow developers to add collaborative filtering to applications or web sites:
 * Auguri Corporation - Optimized Collaborative Filtering
 * Mahout (now incorporates project formerly called "Taste") - open-source, Java
 * Cofi - open-source, Java (last updated in 2005)
 * Strands Recommender API - Public REST API. Includes item-item, user-item, user-user.
 * CoFE - open-source, Java (last updated in 2004)
 * ColFi - open-source, Java
 * Jumper 2.0 - open-source enterprise social bookmarking engine, Javascript & PHP
 * Duine Recommender Framework - open-source, Java
 * RACOFI - open-source, Java (last updated in 2003)
 * SUGGEST - Free, written in C. (A library, not open source.)
 * RecommenderAPI for Drupal - open-source, PHP (to be used with Drupal)
 * Rating-Based Item-to-Item - public domain, PHP (last updated in 2005)
 * Vogoo PHP Lib - open-source (Pro version requires licensing), PHP
 * C/Matlab Toolkit for Collaborative Filtering - open-source, Matlab, C
 * Fast Maximum Margin Matrix Factorization - Matlab/Octave
 * Filmaster's film recommendation algorithm - open source (AGPLv3), written in C++

Innovations

 * New algorithms have been developed for CF as a result of the Netflix Prize.
 * Cross-System Collaborative Filtering, where user profiles across multiple recommender systems are combined in a privacy-preserving manner.
 * Robust Collaborative Filtering, where recommendations remain stable against attempts at manipulation. This research area is still active and not completely solved.