Recommendation systems are essential for providing personalized suggestions to users. Whether it's suggesting movies, products, or music, these systems enhance the user experience by tailoring content to individual preferences. The real challenge comes when implementing them in a real-world scenario: with a typical Python ML framework, a single machine cannot handle the ever-growing volume of data, since the model must be continually fed more and more recent data.
Why Distributed Computing is Essential for Big Data in Recommendation Systems
This is where we need a distributed computing approach to run algorithms in parallel. With terabytes or even petabytes of data, it is impossible to load a dataset of that size onto a single machine. Not every algorithm works on a cluster of machines, and even when one does, it is of little use if it cannot exploit the distributed setup to parallelize training. So we need a machine learning model (or framework) that can train on a dataset spread across a cluster of machines.
Method 1: ALS with Spark ML
Alternating Least Squares (ALS) is a matrix factorization algorithm. Its main advantage is that it can run in parallel, which satisfies our requirement of using distributed computing to handle big data. ALS is implemented in Apache Spark ML and built for large-scale collaborative filtering problems.
The working mechanism of ALS
As mentioned earlier, ALS is a matrix factorization algorithm: it factorizes the ratings matrix by alternating between two least-squares subproblems. Let's take an example of 2 users and 3 products.
The goal of Alternating Least Squares is to find two matrices, U and P, whose product is approximately equal to the original matrix of users and products. Once such matrices have been found, we can predict what user i will think of product j by taking the dot product of row i of U and row j of P.
ALS minimizes the loss alternately. First, it holds the user matrix fixed and solves a regularized least-squares problem for the product matrix; then it holds the product matrix fixed and solves for the user matrix.
So, how does this run in parallel?
ALS runs these least-squares updates in parallel, with the underlying training data partitioned across a cluster of machines.
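To make the alternation concrete, here is a minimal NumPy sketch on the 2-user, 3-product example above. For simplicity it treats every rating as observed, whereas real ALS implementations solve the updates only over the observed entries:

import numpy as np

# Toy ratings matrix: 2 users x 3 products
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 1.0, 2.0]])

rank, reg = 2, 0.1
rng = np.random.default_rng(42)
U = rng.normal(size=(2, rank))  # user factors
P = rng.normal(size=(3, rank))  # product factors
I = np.eye(rank)

for _ in range(20):
    # Hold P fixed, solve the regularized least-squares problem for U
    U = R @ P @ np.linalg.inv(P.T @ P + reg * I)
    # Hold U fixed, solve for P
    P = R.T @ U @ np.linalg.inv(U.T @ U + reg * I)

print(np.round(U @ P.T, 2))  # approximates R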
Implementation
You should have Spark set up on your system to use this. Start with a set of imports:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.recommendation import ALS
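For instance, assuming the ratings live in a hypothetical ratings.csv file with userId, productId, and rating columns, reading and splitting the data might look like this:

spark = SparkSession.builder.appName("ALSRecommender").getOrCreate()

# Hypothetical ratings file with userId, productId, and rating columns
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)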
Once the data is read and split into training and test sets, the ALS model must be set up:
als = ALS(maxIter=20,
          userCol="userId",
          itemCol="productId",
          rank=5,
          ratingCol="rating",
          regParam=1,
          coldStartStrategy="drop",
          seed=42)
The most important hyper-parameters are:
- maxIter: the maximum number of iterations to run (defaults to 10)
- rank: the number of latent factors in the model (defaults to 10)
- regParam: the regularization parameter in ALS (defaults to 0.1)
A pipeline can be built on top of this, followed by a grid search over the hyper-parameters to find the best model.
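As a sketch, assuming the train DataFrame from the earlier split, the pipeline and a cross-validated grid search could look like this (the grid values are illustrative):

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

pipeline = Pipeline(stages=[als])
grid = (ParamGridBuilder()
        .addGrid(als.rank, [5, 10, 20])
        .addGrid(als.regParam, [0.01, 0.1, 1.0])
        .build())
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
best_model = cv.fit(train).bestModel  # PipelineModel with the fitted ALS stage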
Once the model is finalized, recommendations can be made on the test data for validation, and the model can then be deployed in a workflow. The steps involved in the workflow can be as follows:
- A new user inputs ratings for different products, and the system creates new user-product interaction samples for the model
- The ALS model is retrained on the data, including the new inputs
- Product data for inference is created
- Rating predictions are made for all products for that user
- The top N product recommendations, ranked by predicted rating, are shown to the user (a sketch of these steps follows)
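Continuing the sketch above, validation and the inference steps for a single (hypothetical) user id might look like this:

als_model = best_model.stages[0]  # the fitted ALS stage

# Validate on the held-out test split
predictions = als_model.transform(test)
print("RMSE:", evaluator.evaluate(predictions))

# Top-5 product recommendations for a hypothetical user id 123
user_df = spark.createDataFrame([(123,)], ["userId"])
als_model.recommendForUserSubset(user_df, 5).show(truncate=False)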
Method 2: Vector DBs
Vector DBs offer a very simple implementation for a recommendation system. They work on similarity search, and this is useful in our earlier example of product recommendations, where the goal is to find products similar to those a user has already viewed, bought, or rated.
The working mechanism of Vector DBs
By representing products as vectors in a high-dimensional space, we can employ distance metrics (such as cosine similarity or Euclidean distance) to identify products that are ‘close’ to each other, indicating similarity.
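As an illustration, here is a minimal sketch of scoring candidate products by cosine similarity against a product a user has viewed; the embeddings and product names are hypothetical:

import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical product embeddings
viewed = np.array([0.9, 0.1, 0.4])
candidates = {"product_a": np.array([0.8, 0.2, 0.5]),
              "product_b": np.array([0.1, 0.9, 0.2])}

# Rank candidates by similarity to the viewed product
scores = {name: cosine_similarity(viewed, vec) for name, vec in candidates.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))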
As the number of movies and users grows, so does the size of the database. Vector databases are designed to handle large-scale data while maintaining high query performance. This scalability is essential for movie recommendation systems, especially those used by large streaming platforms with extensive movie libraries and user bases.
The architecture is divided into 2 parts:
- Candidate Generation: the product ratings and associated textual information are embedded as vectors, after some initial filtering by product category, type, etc., and a set of similar candidate products is retrieved
- Re-Ranking: re-ranking is carried out in the recommender system to order products according to the sentiment expressed in the textual information. With the assistance of large language models, we can obtain an opinion score for the textual data, and the products are re-ranked for recommendation based on that score (a sketch of both stages follows this list)
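Here is a minimal NumPy sketch of the two stages; the embeddings and opinion scores are randomly generated stand-ins for precomputed values (in practice the scores would come from an LLM):

import numpy as np

# Stand-ins for precomputed product embeddings and LLM-derived opinion scores
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
opinion_scores = rng.uniform(0, 1, size=1000)

def recommend(query_vec, k_candidates=50, k_final=10):
    # Candidate generation: nearest products by cosine similarity
    sims = embeddings @ query_vec
    sims /= np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    candidates = np.argsort(sims)[-k_candidates:]
    # Re-ranking: order the candidates by their opinion score
    return candidates[np.argsort(opinion_scores[candidates])[::-1]][:k_final]

print(recommend(embeddings[0]))  # recommend against a viewed product's embedding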
A Variation of Vector DB: ScaNN
ScaNN (Scalable Nearest Neighbors) is a library for scalable nearest-neighbor search. At the retrieval stage of a modern recommendation system, we need to find the dataset embeddings nearest to a given query embedding. That set of embeddings is often too large for exhaustive search, so a tool like ScaNN is used for approximate nearest-neighbor search.
Benefits:
ScaNN provides highly efficient vector similarity search: faster matching and retrieval of similar items from moderate- to massive-size databases.
It includes state-of-the-art implementations of modern ANN (approximate nearest neighbor) techniques.
- Challenge of Exhaustive Search: In a recommendation system, you want to find items that match a user's preferences or past interactions. With a vast dataset, exhaustively comparing the user's profile to every item becomes computationally expensive and impractical.
- Approximate Nearest Neighbors (ANN): ScaNN excels at performing ANN searches. It efficiently identifies items that are close enough (similar) to the user's preference, even if not the absolute exact match. This "good enough" approach significantly reduces processing time while maintaining high recommendation quality.
- Vector Embeddings: ScaNN relies on vector embeddings to represent both users and items. These embeddings are dense, low-dimensional vectors that capture the essential characteristics of users and items.
- Similarity Search: The core function of ScaNN is to find similar vectors in the embedding space. It employs a combination of techniques like quantization (reducing embedding size), vector decomposition (breaking down complex vectors), and graph-based search (leveraging relationships between items) to achieve fast and accurate similarity searches.
- Benefits for Recommendations: By efficiently finding the items (nearest neighbors) closest to a user's preferences, ScaNN enables recommendation systems to (see the usage sketch after this list):
  - Scale Effectively: Handle massive datasets of users and items without compromising responsiveness.
  - Real-time Recommendations: Generate personalized recommendations quickly, enhancing user experience.
  - Accuracy with Efficiency: Maintain a high degree of accuracy in recommendations while optimizing processing speed.
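As a usage sketch, the ScaNN Python package exposes a builder API along these lines; the dataset here is random and all tuning parameters are illustrative, so consult the ScaNN documentation for real settings:

import numpy as np
import scann  # pip install scann

# Hypothetical item embeddings, normalized for dot-product (cosine) search
db = np.random.default_rng(0).normal(size=(100000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

searcher = (scann.scann_ops_pybind.builder(db, 10, "dot_product")
            .tree(num_leaves=1000, num_leaves_to_search=100,
                  training_sample_size=50000)
            .score_ah(2, anisotropic_quantization_threshold=0.2)
            .reorder(100)
            .build())

query = db[0]  # stand-in for a user's preference embedding
neighbors, distances = searcher.search(query)
print(neighbors[:5])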
Published benchmarks of nearest-neighbor search libraries show a tradeoff between speed and accuracy: higher accuracy requires a more expensive search, which means slower queries. On this tradeoff, ScaNN outperforms other alternatives.
Method 3: Multilevel Classification
Instead of doing candidate generation and re-ranking at the most granular level, we can organize items into multiple levels and perform a classification at each level.
Multilevel Classification: While traditional recommendation systems focus on single-level item categories (e.g., movies, books), multilevel classification takes a more granular approach for large datasets. The steps for multilevel classification are as follows:
- Building a Hierarchy: First, a hierarchical structure is created for items. This could involve categories, subcategories, and even sub-subcategories. Imagine a hierarchy for movies: Action > Sci-Fi > Space Opera.
- Multilevel Classification Models: Machine learning models are then trained to classify items at different levels of the hierarchy. These models analyze various attributes of the items (e.g., plot, director, actors, reviews) to predict their category membership (a code sketch follows this list).
- Enhanced Recommendation Process: During recommendation, the system considers a user's preferences at various levels. If a user enjoys action movies, the system might recommend space operas (sub-subcategory) but could also suggest broader categories like sci-fi or action films.
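As a minimal sketch of the idea with scikit-learn, assuming each item has a feature vector and labels at two levels of the hierarchy (all data below is randomly generated for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical item features and two-level labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                      # item feature vectors
y_top = rng.choice(["Action", "Drama"], size=200)  # level 1: category
y_sub = np.array([t + ("/Sub-A" if rng.random() < 0.5 else "/Sub-B")
                  for t in y_top])                 # level 2: subcategory

# Level 1: a single classifier for the top-level category
top_clf = LogisticRegression(max_iter=1000).fit(X, y_top)

# Level 2: one subcategory classifier per top-level category
sub_clfs = {cat: LogisticRegression(max_iter=1000).fit(X[y_top == cat],
                                                       y_sub[y_top == cat])
            for cat in np.unique(y_top)}

def classify(item):
    top = top_clf.predict([item])[0]
    return top, sub_clfs[top].predict([item])[0]

print(classify(X[0]))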
Benefits of Multilevel Classification:
- Improved Accuracy: By considering multiple levels of detail, the system can make more precise recommendations that cater to user preferences across various granularities.
- Handling Large Datasets: The hierarchical structure helps manage vast item catalogs by organizing them into manageable categories. This improves processing efficiency for large datasets.
- Flexibility: The system can recommend items at different levels based on the user's context or browsing behavior. For example, a user casually browsing might appreciate broader category suggestions, while someone actively searching for a specific movie might benefit from sub-subcategory recommendations.
- Cold Start Problem Mitigation: For new items with limited interaction history, multilevel classification can leverage category information to make relevant recommendations.
Example: Movie Recommendation System
Imagine a user who loves action movies with strong female leads. Here's how multilevel classification can help:
- Level 1: The system identifies the user's preference for "Action" movies.
- Level 2: Based on previous interactions with action movies, the system might recommend subcategories like "Sci-Fi Action" or "Martial Arts Action."
- Level 3 (Optional): If the user's history includes movies with strong female leads, the system could delve deeper and suggest specific sub-subcategories like "Action with Female Protagonist."
Conclusion
In this post we saw a few ways of implementing a recommendation system while dealing with big data. We could use ALS, which is widely used; we could go with vector databases; or we could add a twist by converting the recommendation problem into a multilevel classification problem. While there could be more ways of scaling a recommendation system, finding them requires more experimentation with different varieties of data and use cases.